Harmonized Data Sync
Business Context
Having timely patient data tagged with key entities (medications/conditions/etc.) app unblocks custom criteria building and patient analytics, which are critical for our users to successfully perform feasibility on their protocols and sites.
To achieve this, we need to provide a standardized view of the population with entities that can be used for custom criteria building and patient analytics.
We currently have a Harmonize Agent that tags patient data entities with their corresponding CUI codes. This agent is run manually based on demand and provides joining keys to the FHIR entities.
Current pipeline:
graph TD
subgraph Harmonization
A[Source Data] --> B[Harmonization Agent]
B --> C[Harmonization Mapping]
end
D[Source Data] --> E[Harmonization DBT Models]
C --> E
E --> F[Tenants' Harmonized Entities]
Challenges
- The data is not being integrated into the production database.
- There are different Harmonization DBT Models for each tenant.
- The harmonization process lacks automation.
Proposed Solution
Providing Harmonized Data in the Production Database
The first step is to model the harmonized entities tables for the production database. The entities we are currently working with include:
- Medications
- Conditions
- Allergy
- Lab Results (will be addressed in a future design)
- Procedures
Each entity table will have its own attributes, CUI and patient_id.
erDiagram
PATIENT ||--}| MEDICATION : _
PATIENT {
string patient_id
int site_id
string etc
}
MEDICATION {
string id
string dosage
string form
string drug_name
string cui
string patient_id
}
Patient and Medication tables will be tenant-specific. While the Patient table already exists in production, the entity tables will be created in dbt and later migrated to the application database.
Entity models will contain both patient-specific information and harmonized entity data. We've chosen to include harmonized data here to avoid future joins between harmonized entities and patient data. Although this means duplicating some data (e.g., Medication Name), the benefits of improved performance and easier maintenance outweigh the costs, as we'd need to maintain and provide the same harmonized data across all tenant databases.
Moving Data to the Production Database
WRITE TRUNCATE on relational databases is generally not advisable due to the full lock required on the table. Instead, we can maintain a checkpoint table in BigQuery to track the last timestamp of data moved to production. Using dbt snapshots, we can identify changes and only move data newer than the last timestamp.
Data will be writting through the data acess layer (API). We have to study the possibility of bulk PATCH and POST to speed things up if needed.
Standardizing Harmonization DBT Models
Since tenants use different EHRs with unique data models, we need to create a unified representation for the application database. The pipeline will have three stages:
- Modeling: Transform EHR data into a normalized, single model (current _modeled models) and group it
- Harmonization: Convert normalized data into a unified model (current Harmonization DBT Models)
- Reverse-ETL: Transform normalized data to production database schema and validate before migration
graph TD
subgraph Modeling
subgraph EHR 1
ehr1_source_data[Source Data] --> ehr1_normalized_entities[Modeled Entities]
end
subgraph EHR 2
ehr2_source_data[Source Data] --> ehr2_normalized_entities[Modeled Entities]
end
ehr1_normalized_entities --> grouped_normalized_entities[Grouped Modeled Entities]
ehr2_normalized_entities --> grouped_normalized_entities[Grouped Normalized Entities]
end
subgraph Harmonization
grouped_normalized_entities --> Harmonization_Agent[Enrich with Harmonized Entities]
end
Harmonization_Agent --> production_schema[Transform to Production Schema and Test]
subgraph Reverse-ETL
production_schema --> production_database[Production Database]
end
Automating the Harmonization Process
Upcoming project
Before running the harmonization agent, we can query existing data to identify entities with valid mappings. This allows us to run harmonization only on previously unmapped entities or those with a low similarity score. Once this is achieved, we can run the harmonization agent whenever new patient data is available on the tenant's DAG.
To facilitate this, we might consider moving the harmonization agent to the LLM-Agents API to replicate the behavior observed in the Matching pipeline.
Data Models
The following models represent the core entities in our harmonized data structure. We've intentionally simplified the schema by:
- Removing original_ids, encounter_id, code and code_system fields to focus on harmonized data
- Adding strategic indexes on CUI and date fields to optimize query performance
- Standardizing common fields across entities
erDiagram
PROCEDURE {
string id "PK"
string patient_id "IDX"
string name
date date_started "IDX"
string cui "IDX"
}
MEDICATION {
string id "PK"
string patient_id "IDX"
string is_active
date date_started "IDX"
date last_updated
date date_stopped "IDX"
string dosage_instruction
string name
string dosage
string form
string cui "IDX"
}
CONDITION {
string id "PK"
string patient_id "IDX"
string category
string name
string cui "IDX"
date date_started "IDX"
date date_stopped "IDX"
}
ALLERGY {
string id "PK"
string patient_id "IDX"
string name
string severity
string clinical_status
string cui "IDX"
date date_started "IDX"
date last_observed
date date_stopped "IDX"
}
Key considerations:
- Indexes are placed on CUI fields to optimize front-end queries using harmonized codes
- Date fields are indexed to support efficient filtering and condition building
- The harmonized_ prefix will be removed from column names in the production database
- Original source system identifiers are excluded to maintain a clean, harmonized view