MATIH Platform is in active MVP development. Documentation reflects current implementation status.
11. Pipelines & Data Engineering
Database Replication

Database replication pipelines copy entire tables or datasets from source databases to the MATIH data lake. Unlike change data capture (CDC), which streams individual row-level changes as they occur, replication pipelines support full snapshots, schema-aware incremental syncs, and partition-based refresh strategies.


Replication Strategies

| Strategy | Description | Use Case |
|---|---|---|
| Full snapshot | Complete table copy on each run | Small tables, reference data |
| Incremental (watermark) | Rows newer than last watermark | Large tables with update timestamps |
| Partition refresh | Replace specific partitions | Date-partitioned fact tables |
| Hash-diff | Compare row hashes to detect changes | Tables without timestamps (SCD Type 2) |
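The incremental (watermark) strategy can be sketched in a few lines. This is an illustrative example, not MATIH code: the function name `incremental_sync` and the in-memory SQLite table standing in for the JDBC source are assumptions. A real run would persist the returned watermark in pipeline state storage between executions.

```python
import sqlite3

def incremental_sync(conn, table, watermark_col, last_watermark):
    """Fetch only rows newer than the stored watermark and return
    (rows, new_watermark) so the caller can persist the watermark."""
    cur = conn.execute(
        f"SELECT *, {watermark_col} AS _wm FROM {table} "
        f"WHERE {watermark_col} > ? ORDER BY {watermark_col}",
        (last_watermark,),
    )
    rows = cur.fetchall()
    # Advance the watermark to the newest value seen in this batch;
    # if the batch is empty, keep the previous watermark.
    new_watermark = rows[-1][-1] if rows else last_watermark
    return rows, new_watermark

# Demo: an in-memory table stands in for the JDBC source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, last_modified TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-05"), (3, "2024-01-09")],
)
rows, wm = incremental_sync(conn, "customers", "last_modified", "2024-01-03")
# Only customers 2 and 3 are newer than the watermark "2024-01-03".
```

Note that the watermark column must be monotonically non-decreasing for updated rows; otherwise late-arriving updates are silently skipped, which is exactly the case the hash-diff strategy covers.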

Pipeline Definition

metadata:
  name: crm-replication
  version: "1.0.0"
  owner: data-engineering
 
sources:
  crm_customers:
    type: jdbc
    connection: ${CRM_DB_CONNECTION}
    table: customers
    mode: incremental
    watermark_column: last_modified
    fetch_size: 10000
 
  crm_orders:
    type: jdbc
    connection: ${CRM_DB_CONNECTION}
    table: orders
    mode: partition
    partition_column: order_date
    partition_range: "last_7_days"
 
sinks:
  data_lake:
    type: iceberg
    catalog: polaris
    database: replica
    mode: merge_on_key
    merge_keys: [customer_id]
 
orchestration:
  engine: airflow
  schedule: "0 */4 * * *"
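Connection values in the pipeline definition above use `${VAR}` placeholders such as `${CRM_DB_CONNECTION}`. A minimal sketch of how such placeholders could be resolved from the environment is shown below; the function name `resolve_env` is illustrative and not part of the MATIH API.

```python
import os
import re

def resolve_env(value):
    """Expand ${VAR} placeholders from the environment.
    Undefined variables raise immediately, so misconfiguration
    fails at load time rather than mid-run."""
    def sub(match):
        var = match.group(1)
        if var not in os.environ:
            raise KeyError(f"undefined connection variable: {var}")
        return os.environ[var]
    return re.sub(r"\$\{(\w+)\}", sub, value)

# Demo: resolve the CRM connection string used by both sources.
os.environ["CRM_DB_CONNECTION"] = "jdbc:postgresql://crm-db:5432/crm"
resolved = resolve_env("${CRM_DB_CONNECTION}")
```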

Merge-on-Key Strategy

The merge_on_key sink mode performs an upsert operation on the target Iceberg table:

MERGE INTO replica.customers AS target
USING staging.customers_batch AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

Schema Evolution

When a source schema changes, the replication pipeline detects new, modified, or removed columns and responds as follows:

| Change Type | Behavior |
|---|---|
| New column added | Automatically add column to target (if auto_evolve: true) |
| Column type widened | Apply compatible type promotion |
| Column removed | Mark as nullable in target, preserve historical data |
| Column renamed | Requires manual mapping configuration |
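The per-column decision logic can be sketched as below. The promotion lattice and the function `evolve_column` are assumptions for illustration; the actual set of compatible promotions depends on the target table format (Iceberg permits int→long, float→double, and similar widenings).

```python
# Narrow type -> set of wider types it may be promoted to (assumed lattice).
WIDENING = {
    "int": {"long", "double"},
    "long": {"double"},
    "float": {"double"},
    "date": {"timestamp"},
}

def evolve_column(name, source_type, target_type, auto_evolve=True):
    """Return the action taken for one column when source and target
    schemas are compared. target_type is None if the column does not
    yet exist in the target table."""
    if target_type is None:
        return "add_column" if auto_evolve else "needs_manual_review"
    if source_type == target_type:
        return "no_change"
    if source_type in WIDENING.get(target_type, set()):
        return "widen_target"  # compatible type promotion
    return "needs_manual_review"  # rename or incompatible change

# Demo: source widened customer_id from int to long; a new column appeared.
a1 = evolve_column("customer_id", "long", "int")      # widen_target
a2 = evolve_column("loyalty_tier", "string", None)    # add_column
a3 = evolve_column("email", "string", "string")       # no_change
```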

Parallel Replication

For large tables, the replication operator splits extraction across multiple workers, each reading a disjoint range of the partition column:

| Parameter | Default | Description |
|---|---|---|
| num_partitions | 4 | Number of parallel extraction threads |
| partition_column | auto-detected | Column to split ranges on |
| fetch_size | 5000 | JDBC fetch size per partition |
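The range-splitting step can be sketched as follows. The function name `split_ranges` is illustrative; each worker would then issue a query of the form `SELECT ... WHERE part_col >= lo AND part_col < hi` against its half-open range.

```python
def split_ranges(min_val, max_val, num_partitions):
    """Split the inclusive interval [min_val, max_val] into
    num_partitions contiguous half-open ranges [lo, hi), one per
    extraction worker. Any remainder is spread over the first ranges
    so worker loads differ by at most one row of key space."""
    total = max_val - min_val + 1
    base, extra = divmod(total, num_partitions)
    ranges, lo = [], min_val
    for i in range(num_partitions):
        hi = lo + base + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges

# Demo: 1,000,000 ids across the default 4 workers.
ranges = split_ranges(1, 1_000_000, 4)
# -> [(1, 250001), (250001, 500001), (500001, 750001), (750001, 1000001)]
```

Splitting by range rather than by modulo keeps each worker's scan contiguous, which lets the source database use index range scans instead of full-table scans.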

Data Validation

Post-replication validation checks ensure data integrity:

| Check | Description |
|---|---|
| Row count | Source count matches target count |
| Checksum | Hash of key columns matches between source and target |
| Schema drift | Target schema matches source schema |
| Null rates | No unexpected NULL increases in key columns |
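The row-count and checksum checks can be sketched as below; all names are illustrative. The checksum XORs per-row digests of the key columns, which makes it order-insensitive, so source and target can be hashed independently without sorting either side.

```python
import hashlib

def checksum(rows, key_cols):
    """Order-insensitive checksum: XOR of per-row SHA-256 digests
    computed over the key columns only."""
    acc = 0
    for row in rows:
        payload = "|".join(str(row[c]) for c in key_cols).encode()
        acc ^= int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")
    return acc

def validate(source_rows, target_rows, key_cols):
    """Run the row-count and checksum checks; returns a dict of
    check name -> pass/fail."""
    return {
        "row_count": len(source_rows) == len(target_rows),
        "checksum": checksum(source_rows, key_cols)
                    == checksum(target_rows, key_cols),
    }

# Demo: the target holds the same rows in a different physical order.
src = [{"customer_id": 1, "email": "a@x"}, {"customer_id": 2, "email": "b@x"}]
tgt = list(reversed(src))
results = validate(src, tgt, ["customer_id"])
```

Truncating each digest to 8 bytes keeps the accumulator a small integer; for tables where collisions matter, the full 32-byte digest can be XORed instead.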
