
Architecture Design Document: Multi-tenant Data Sync Pipeline

Executive Summary

This document outlines the architecture for an automated data synchronization pipeline that replicates data from multiple tenant PostgreSQL databases to BigQuery. The system utilizes Apache Airflow for orchestration, dlt (data load tool) for ETL operations, and Cloud Run for scalable execution. The pipeline includes comprehensive validation to ensure data integrity and consistency between source and target systems.

Motivation and Goals

Why This Matters

  • Enables automated matching capabilities across our platform
  • Reduces manual data processing overhead
  • Establishes foundation for tenant-specific analytics
  • Improves data accessibility while maintaining security boundaries
  • Provides reliable testing environment for safe development

Goals

  • Automated daily full data sync for each tenant
  • Robust validation of data integrity
  • Minimal operational overhead
  • Cost-effective execution using serverless infrastructure
  • Comprehensive test coverage for multi-tenant scenarios

Non-Goals

  • Real-time data replication
  • Partial/incremental updates
  • Historical data versioning

Proposal

High-Level Architecture

```mermaid
graph LR
    subgraph Production
        A[Tenant-Specific DAGs]
        A --> D[DLT Tasks]
        D --> F[Validation Task]
    end
```

Key Components

  1. DAG Generator

     • Queries tenant registry for active accounts
     • Generates tenant-specific DAGs dynamically (see Implementation Details below)
     • Environment-aware configuration for test/prod databases
     • Testing: Validates DAG generation with mock tenants

  2. ETL Task

     • Runs on Cloud Run for scalability
     • Uses dlt for data extraction and loading (a pipeline sketch follows this list)
     • Implements dlt's 'staging-optimized' replace strategy
     • Testing: Uses dedicated test database with sample data

  3. Validation Task

     • Validates record counts (a validation sketch follows this list)
     • Optional MD5 checksum validation
     • Optional schema validation
     • Testing: Verifies data integrity across test scenarios
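
A minimal sketch of what the Cloud Run ETL entrypoint could look like, assuming dlt's built-in `sql_database` source and that the replace strategy is resolved from dlt's `destination.replace_strategy` setting; the exact import path and configuration key may differ by dlt version:

```python
import os

import dlt
from dlt.sources.sql_database import sql_database  # dlt's built-in SQL source


def run_tenant_sync() -> None:
    """Full daily refresh of one tenant's Postgres database into BigQuery."""
    tenant_id = os.environ["TENANT_ID"]
    db_connection = os.environ["DB_CONNECTION"]

    # dlt resolves destination.replace_strategy from config or the environment.
    # 'staging-optimized' loads into staging tables and swaps them in, so the
    # target tables are never observed half-loaded.
    os.environ.setdefault(
        "DESTINATION__REPLACE_STRATEGY",
        os.environ.get("REPLACE_STRATEGY", "staging-optimized"),
    )

    pipeline = dlt.pipeline(
        pipeline_name=f"sync_tenant_{tenant_id}",
        destination="bigquery",
        dataset_name=f"tenant_{tenant_id}",
    )

    # Reflect and extract all tables reachable through the connection string;
    # write_disposition='replace' implements the full refresh.
    load_info = pipeline.run(
        sql_database(credentials=db_connection),
        write_disposition="replace",
    )
    print(load_info)


if __name__ == "__main__":
    run_tenant_sync()
```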
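
And a sketch of the count-based validation, assuming `psycopg2` and `google-cloud-bigquery` clients and a hypothetical per-tenant list of table names:

```python
import psycopg2
from google.cloud import bigquery


def validate_row_counts(db_connection: str, bq_dataset: str, tables: list[str]) -> dict:
    """Compare per-table row counts between source Postgres and target BigQuery.

    Returns a mapping of table -> (source_count, target_count) for mismatches.
    Table names come from our own registry, so inlining them is safe here.
    """
    bq = bigquery.Client()
    mismatches: dict[str, tuple[int, int]] = {}

    with psycopg2.connect(db_connection) as conn, conn.cursor() as cur:
        for table in tables:
            cur.execute(f'SELECT COUNT(*) FROM "{table}"')
            source_count = cur.fetchone()[0]

            job = bq.query(f"SELECT COUNT(*) AS n FROM `{bq_dataset}.{table}`")
            target_count = next(iter(job.result())).n

            if source_count != target_count:
                mismatches[table] = (source_count, target_count)

    # Optional MD5 step (not shown): hash a canonical serialization of each
    # row on both sides and compare the aggregated digests per table.
    return mismatches
```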

Development and Testing Infrastructure

  • Dedicated Postgres container hosting a copy of the application database for testing
  • Factory patterns for generating test data (see the sketch after this list)
  • Environment-aware configuration switching
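
A minimal sketch of the test-data factories, assuming the `factory_boy` library and the `tenant_registry` columns used by `get_tenant_connections` below:

```python
import factory


class TenantFactory(factory.DictFactory):
    """Builds mock tenant-registry rows for DAG-generation tests."""

    tenant_id = factory.Sequence(lambda n: f"tenant_{n:03d}")
    db_host = factory.LazyAttribute(lambda o: f"{o.tenant_id}-db.internal")
    db_name = factory.LazyAttribute(lambda o: f"{o.tenant_id}_db")
    is_active = True


# e.g. three active tenants plus one inactive, to check registry filtering
tenants = TenantFactory.create_batch(3) + [TenantFactory(is_active=False)]
```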

Implementation Details

```python:dags/tenant_sync_dag.py
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.providers.postgres.hooks.postgres import PostgresHook

# CloudRunOperator is assumed to be provided by the project (e.g. a wrapper
# around the Google provider's Cloud Run operators).


def get_tenant_connections() -> dict:
    """Returns a mapping of tenant_ids to their database connections."""
    env = Variable.get('ENVIRONMENT', default_var='development')

    if env == 'development':
        return {
            'test_tenant': 'postgresql://test_user:test_pass@test-postgres:5432/test_tenant_db'
        }

    # In production, query the tenant registry for active accounts
    tenant_registry = PostgresHook(postgres_conn_id='tenant_registry')
    tenant_connections = {}

    with tenant_registry.get_conn() as conn:
        with conn.cursor() as cur:
            cur.execute(
                "SELECT tenant_id, db_host, db_name "
                "FROM tenant_registry WHERE is_active = true"
            )
            for tenant_id, db_host, db_name in cur.fetchall():
                tenant_connections[tenant_id] = (
                    f'postgresql://prod_user:prod_pass@{db_host}/{db_name}'
                )

    return tenant_connections


def create_tenant_dag(tenant_id: str, db_connection: str) -> DAG:
    """Creates an environment-aware DAG instance."""
    dag = DAG(
        f'sync_tenant_{tenant_id}',
        schedule_interval='@daily',
        start_date=datetime(2024, 1, 1),
        catchup=False,
        tags=['tenant_sync'],
    )

    with dag:
        sync_task = CloudRunOperator(
            task_id='sync_data',
            project_id='your-project',
            location='us-central1',
            service_name='etl-service',
            env_vars={
                'TENANT_ID': tenant_id,
                'DB_CONNECTION': db_connection,
                'REPLACE_STRATEGY': 'staging-optimized',
            },
        )

    return dag  # must be returned so the module-level loop can register it


# Create DAGs for each tenant
tenant_connections = get_tenant_connections()
for tenant_id, db_connection in tenant_connections.items():
    globals()[f'dag_{tenant_id}'] = create_tenant_dag(tenant_id, db_connection)
```
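
A note on the registration loop: Airflow only discovers DAG objects that end up in the DAG file's global namespace, which is why each generated DAG is assigned into `globals()`. Onboarding a new tenant then requires no code change, only an active row in `tenant_registry`.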

Pros

  • Serverless execution minimizes infrastructure costs
  • Full refresh ensures data consistency
  • Comprehensive validation ensures data quality
  • Automated tenant onboarding
  • Safe testing environment for development

Cons

  • dlt limitation with empty source tables (a full-refresh run may leave a stale table in place when the source yields no rows)
  • Full refresh may be resource intensive
  • Potential for longer execution times with large datasets
  • Additional infrastructure needed for test environment