Harmonized Data Sync

Business Context

Having timely patient data tagged with key entities (medications/conditions/etc.) app unblocks custom criteria building and patient analytics, which are critical for our users to successfully perform feasibility on their protocols and sites.

To achieve this, we need to provide a standardized view of the population with entities that can be used for custom criteria building and patient analytics.

We currently have a Harmonize Agent that tags patient data entities with their corresponding CUI codes. This agent is run manually based on demand and provides joining keys to the FHIR entities.

Current pipeline:

graph TD
    subgraph Harmonization
        A[Source Data] --> B[Harmonization Agent]
        B --> C[Harmonization Mapping]
    end
    D[Source Data] --> E[Harmonization DBT Models]
    C --> E
    E --> F[Tenants' Harmonized Entities]

Challenges

The data is not being integrated into the production database.
There are different Harmonization DBT Models for each tenant.
The harmonization process lacks automation.

Proposed Solution

Providing Harmonized Data in the Production Database

The first step is to model the harmonized entities tables for the production database. The entities we are currently working with include:

Medications
Conditions
Allergy
Lab Results (will be addressed in a future design)
Procedures

Each entity table will have its own attributes, CUI and patient_id.

erDiagram
    PATIENT ||--}| MEDICATION : _
    PATIENT {
        string patient_id
        int site_id
        string etc
    }
    MEDICATION {
        string id
        string dosage
        string form
        string drug_name
        string cui
        string patient_id
    }

Patient and Medication tables will be tenant-specific. While the Patient table already exists in production, the entity tables will be created in dbt and later migrated to the application database.

Entity models will contain both patient-specific information and harmonized entity data. We've chosen to include harmonized data here to avoid future joins between harmonized entities and patient data. Although this means duplicating some data (e.g., Medication Name), the benefits of improved performance and easier maintenance outweigh the costs, as we'd need to maintain and provide the same harmonized data across all tenant databases.

Moving Data to the Production Database

WRITE TRUNCATE on relational databases is generally not advisable due to the full lock required on the table. Instead, we can maintain a checkpoint table in BigQuery to track the last timestamp of data moved to production. Using dbt snapshots, we can identify changes and only move data newer than the last timestamp.

Data will be writting through the data acess layer (API). We have to study the possibility of bulk PATCH and POST to speed things up if needed.

Standardizing Harmonization DBT Models

Since tenants use different EHRs with unique data models, we need to create a unified representation for the application database. The pipeline will have three stages:

Modeling: Transform EHR data into a normalized, single model (current _modeled models) and group it
Harmonization: Convert normalized data into a unified model (current Harmonization DBT Models)
Reverse-ETL: Transform normalized data to production database schema and validate before migration

graph TD

    subgraph Modeling
        subgraph EHR 1
            ehr1_source_data[Source Data] --> ehr1_normalized_entities[Modeled Entities]
        end

        subgraph EHR 2
            ehr2_source_data[Source Data] --> ehr2_normalized_entities[Modeled Entities]
        end
        ehr1_normalized_entities --> grouped_normalized_entities[Grouped Modeled Entities]
        ehr2_normalized_entities --> grouped_normalized_entities[Grouped Normalized Entities]
    end

    subgraph Harmonization
    grouped_normalized_entities --> Harmonization_Agent[Enrich with Harmonized Entities]
    end
    Harmonization_Agent --> production_schema[Transform to Production Schema and Test]

    subgraph Reverse-ETL
        production_schema --> production_database[Production Database]
    end

Automating the Harmonization Process

Upcoming project

Before running the harmonization agent, we can query existing data to identify entities with valid mappings. This allows us to run harmonization only on previously unmapped entities or those with a low similarity score. Once this is achieved, we can run the harmonization agent whenever new patient data is available on the tenant's DAG.

To facilitate this, we might consider moving the harmonization agent to the LLM-Agents API to replicate the behavior observed in the Matching pipeline.

Data Models

The following models represent the core entities in our harmonized data structure. We've intentionally simplified the schema by:

Removing original_ids, encounter_id, code and code_system fields to focus on harmonized data
Adding strategic indexes on CUI and date fields to optimize query performance
Standardizing common fields across entities

erDiagram
    PROCEDURE {
        string id "PK"
        string patient_id "IDX"
        string name
        date date_started "IDX"
        string cui "IDX"
    }
    MEDICATION {
        string id "PK"
        string patient_id "IDX"
        string is_active
        date date_started "IDX"
        date last_updated
        date date_stopped "IDX"
        string dosage_instruction
        string name
        string dosage
        string form
        string cui "IDX"
    }
    CONDITION {
        string id "PK"
        string patient_id "IDX"
        string category
        string name
        string cui "IDX"
        date date_started "IDX"
        date date_stopped "IDX"
    }
    ALLERGY {
        string id "PK"
        string patient_id "IDX"
        string name
        string severity
        string clinical_status
        string cui "IDX"
        date date_started "IDX"
        date last_observed
        date date_stopped "IDX"
    }

Key considerations:

Indexes are placed on CUI fields to optimize front-end queries using harmonized codes
Date fields are indexed to support efficient filtering and condition building
The harmonized_ prefix will be removed from column names in the production database
Original source system identifiers are excluded to maintain a clean, harmonized view