MATIH Platform is in active MVP development. Documentation reflects current implementation status.
11. Pipelines & Data Engineering
Database Replication

Database replication pipelines copy entire tables or datasets from source databases to the MATIH data lake. Unlike change data capture (CDC), which streams individual row-level changes as they occur, replication pipelines support full snapshots, schema-aware incremental syncs, and partition-based refresh strategies.


Replication Strategies

| Strategy | Description | Use Case |
|---|---|---|
| Full snapshot | Complete table copy on each run | Small tables, reference data |
| Incremental (watermark) | Rows newer than last watermark | Large tables with update timestamps |
| Partition refresh | Replace specific partitions | Date-partitioned fact tables |
| Hash-diff | Compare row hashes to detect changes | Tables without timestamps (SCD Type 2) |
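The incremental (watermark) strategy can be sketched in a few lines. This is an illustrative example, not MATIH code: the function name `incremental_sync` and the in-memory SQLite table standing in for the JDBC source are assumptions. A real run would persist the returned watermark in pipeline state storage between executions.

```python
import sqlite3

def incremental_sync(conn, table, watermark_col, last_watermark):
    """Fetch only rows newer than the stored watermark and return
    (rows, new_watermark) so the caller can persist the watermark."""
    cur = conn.execute(
        f"SELECT *, {watermark_col} AS _wm FROM {table} "
        f"WHERE {watermark_col} > ? ORDER BY {watermark_col}",
        (last_watermark,),
    )
    rows = cur.fetchall()
    # Advance the watermark to the newest value seen in this batch;
    # if the batch is empty, keep the previous watermark.
    new_watermark = rows[-1][-1] if rows else last_watermark
    return rows, new_watermark

# Demo: an in-memory table stands in for the JDBC source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, last_modified TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-05"), (3, "2024-01-09")],
)
rows, wm = incremental_sync(conn, "customers", "last_modified", "2024-01-03")
# Only customers 2 and 3 are newer than the watermark "2024-01-03".
```

Note that the watermark column must be monotonically non-decreasing for updated rows; otherwise late-arriving updates are silently skipped, which is exactly the case the hash-diff strategy covers.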

Pipeline Definition

metadata:
  name: crm-replication
  version: "1.0.0"
  owner: data-engineering
 
sources:
  crm_customers:
    type: jdbc
    connection: ${CRM_DB_CONNECTION}
    table: customers
    mode: incremental
    watermark_column: last_modified
    fetch_size: 10000
 
  crm_orders:
    type: jdbc
    connection: ${CRM_DB_CONNECTION}
    table: orders
    mode: partition
    partition_column: order_date
    partition_range: "last_7_days"
 
sinks:
  data_lake:
    type: iceberg
    catalog: polaris
    database: replica
    mode: merge_on_key
    merge_keys: [customer_id]
 
orchestration:
  engine: airflow
  schedule: "0 */4 * * *"
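Connection values in the pipeline definition above use `${VAR}` placeholders such as `${CRM_DB_CONNECTION}`. A minimal sketch of how such placeholders could be resolved from the environment is shown below; the function name `resolve_env` is illustrative and not part of the MATIH API.

```python
import os
import re

def resolve_env(value):
    """Expand ${VAR} placeholders from the environment.
    Undefined variables raise immediately, so misconfiguration
    fails at load time rather than mid-run."""
    def sub(match):
        var = match.group(1)
        if var not in os.environ:
            raise KeyError(f"undefined connection variable: {var}")
        return os.environ[var]
    return re.sub(r"\$\{(\w+)\}", sub, value)

# Demo: resolve the CRM connection string used by both sources.
os.environ["CRM_DB_CONNECTION"] = "jdbc:postgresql://crm-db:5432/crm"
resolved = resolve_env("${CRM_DB_CONNECTION}")
```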

Merge-on-Key Strategy

The merge_on_key sink mode performs an upsert operation on the target Iceberg table:

MERGE INTO replica.customers AS target
USING staging.customers_batch AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

Schema Evolution

When a source schema changes, the replication pipeline detects new, modified, or removed columns and responds as follows:

| Change Type | Behavior |
|---|---|
| New column added | Automatically add column to target (if auto_evolve: true) |
| Column type widened | Apply compatible type promotion |
| Column removed | Mark as nullable in target, preserve historical data |
| Column renamed | Requires manual mapping configuration |
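The per-column decision logic can be sketched as below. The promotion lattice and the function `evolve_column` are assumptions for illustration; the actual set of compatible promotions depends on the target table format (Iceberg permits int→long, float→double, and similar widenings).

```python
# Narrow type -> set of wider types it may be promoted to (assumed lattice).
WIDENING = {
    "int": {"long", "double"},
    "long": {"double"},
    "float": {"double"},
    "date": {"timestamp"},
}

def evolve_column(name, source_type, target_type, auto_evolve=True):
    """Return the action taken for one column when source and target
    schemas are compared. target_type is None if the column does not
    yet exist in the target table."""
    if target_type is None:
        return "add_column" if auto_evolve else "needs_manual_review"
    if source_type == target_type:
        return "no_change"
    if source_type in WIDENING.get(target_type, set()):
        return "widen_target"  # compatible type promotion
    return "needs_manual_review"  # rename or incompatible change

# Demo: source widened customer_id from int to long; a new column appeared.
a1 = evolve_column("customer_id", "long", "int")      # widen_target
a2 = evolve_column("loyalty_tier", "string", None)    # add_column
a3 = evolve_column("email", "string", "string")       # no_change
```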

Parallel Replication

For large tables, the replication operator splits extraction across multiple workers, each reading a disjoint range of the partition column:

| Parameter | Default | Description |
|---|---|---|
| num_partitions | 4 | Number of parallel extraction threads |
| partition_column | auto-detected | Column to split ranges on |
| fetch_size | 5000 | JDBC fetch size per partition |
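The range-splitting step can be sketched as follows. The function name `split_ranges` is illustrative; each worker would then issue a query of the form `SELECT ... WHERE part_col >= lo AND part_col < hi` against its half-open range.

```python
def split_ranges(min_val, max_val, num_partitions):
    """Split the inclusive interval [min_val, max_val] into
    num_partitions contiguous half-open ranges [lo, hi), one per
    extraction worker. Any remainder is spread over the first ranges
    so worker loads differ by at most one row of key space."""
    total = max_val - min_val + 1
    base, extra = divmod(total, num_partitions)
    ranges, lo = [], min_val
    for i in range(num_partitions):
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges

# Demo: 1,000,000 ids across the default 4 workers.
ranges = split_ranges(1, 1_000_000, 4)
# -> [(1, 250001), (250001, 500001), (500001, 750001), (750001, 1000001)]
```

Splitting by range rather than by modulo keeps each worker's scan contiguous, which lets the source database use index range scans instead of full-table scans.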

Data Validation

Post-replication validation checks ensure data integrity:

| Check | Description |
|---|---|
| Row count | Source count matches target count |
| Checksum | Hash of key columns matches between source and target |
| Schema drift | Target schema matches source schema |
| Null rates | No unexpected NULL increases in key columns |
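The row-count and checksum checks can be sketched as below; all names are illustrative. The checksum XORs per-row digests of the key columns, which makes it order-insensitive, so source and target can be hashed independently without sorting either side.

```python
import hashlib

def checksum(rows, key_cols):
    """Order-insensitive checksum: XOR of per-row SHA-256 digests
    computed over the key columns only."""
    acc = 0
    for row in rows:
        payload = "|".join(str(row[c]) for c in key_cols).encode()
        acc ^= int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")
    return acc

def validate(source_rows, target_rows, key_cols):
    """Run the row-count and checksum checks; returns a dict of
    check name -> pass/fail."""
    return {
        "row_count": len(source_rows) == len(target_rows),
        "checksum": checksum(source_rows, key_cols)
                    == checksum(target_rows, key_cols),
    }

# Demo: the target holds the same rows in a different physical order.
src = [{"customer_id": 1, "email": "a@x"}, {"customer_id": 2, "email": "b@x"}]
tgt = list(reversed(src))
results = validate(src, tgt, ["customer_id"])
```

Truncating each digest to 8 bytes keeps the accumulator a small integer; for tables where collisions matter, the full 32-byte digest can be XORed instead.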
