# Database Replication
Database replication pipelines copy entire tables or datasets from source databases to the MATIH data lake. Unlike CDC, which captures changes incrementally, replication pipelines support full snapshots, schema-aware incremental syncs, and partition-based refresh strategies.
## Replication Strategies
| Strategy | Description | Use Case |
|---|---|---|
| Full snapshot | Complete table copy on each run | Small tables, reference data |
| Incremental (watermark) | Rows newer than last watermark | Large tables with update timestamps |
| Partition refresh | Replace specific partitions | Date-partitioned fact tables |
| Hash-diff | Compare row hashes to detect changes | Tables without timestamps (SCD Type 2) |
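The hash-diff strategy can be sketched in a few lines. This is a minimal illustration, not the operator's implementation: the `row_hash` and `diff_rows` helpers, the column names, and the sample rows are all hypothetical.

```python
import hashlib

def row_hash(row, columns):
    """Hash the selected columns of a row; any column change alters the hash."""
    payload = "|".join(str(row.get(c)) for c in columns)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def diff_rows(source_rows, target_hashes, key="id", columns=("id", "name", "tier")):
    """Return rows whose hash differs from (or is absent in) the previous snapshot."""
    changed = []
    for row in source_rows:
        if target_hashes.get(row[key]) != row_hash(row, columns):
            changed.append(row)
    return changed

source = [
    {"id": 1, "name": "Acme", "tier": "gold"},
    {"id": 2, "name": "Globex", "tier": "silver"},
]
# Hashes captured on the previous run; id 2 was still "bronze" then.
previous = {
    1: row_hash({"id": 1, "name": "Acme", "tier": "gold"}, ("id", "name", "tier")),
    2: row_hash({"id": 2, "name": "Globex", "tier": "bronze"}, ("id", "name", "tier")),
}
changed = diff_rows(source, previous)
print([r["id"] for r in changed])  # [2]
```

Because only the hashes of the previous run need to be stored, this works for tables with no update timestamp at all, at the cost of reading the full source table each run.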
## Pipeline Definition

```yaml
metadata:
  name: crm-replication
  version: "1.0.0"
  owner: data-engineering
sources:
  crm_customers:
    type: jdbc
    connection: ${CRM_DB_CONNECTION}
    table: customers
    mode: incremental
    watermark_column: last_modified
    fetch_size: 10000
  crm_orders:
    type: jdbc
    connection: ${CRM_DB_CONNECTION}
    table: orders
    mode: partition
    partition_column: order_date
    partition_range: "last_7_days"
sinks:
  data_lake:
    type: iceberg
    catalog: polaris
    database: replica
    mode: merge_on_key
    merge_keys: [customer_id]
orchestration:
  engine: airflow
  schedule: "0 */4 * * *"
```
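Each `sources` entry ultimately becomes an extraction query. A minimal sketch of that translation, assuming the config has been parsed into a dict — the `build_extract_query` helper and the exact SQL it emits are illustrative, not part of the operator:

```python
def build_extract_query(source_cfg, last_watermark=None):
    """Translate a source entry into an extraction statement (illustrative only)."""
    table = source_cfg["table"]
    mode = source_cfg["mode"]
    if mode == "incremental" and last_watermark:
        col = source_cfg["watermark_column"]
        return f"SELECT * FROM {table} WHERE {col} > '{last_watermark}' ORDER BY {col}"
    if mode == "partition":
        col = source_cfg["partition_column"]
        # "last_7_days" rendered as a relative date predicate.
        return f"SELECT * FROM {table} WHERE {col} >= CURRENT_DATE - INTERVAL '7' DAY"
    return f"SELECT * FROM {table}"  # full snapshot fallback

cfg = {"table": "customers", "mode": "incremental", "watermark_column": "last_modified"}
print(build_extract_query(cfg, "2024-06-01T00:00:00"))
```

On the first run no watermark exists yet, so an incremental source falls through to the full-snapshot query; subsequent runs persist the highest `last_modified` value seen and pass it back in.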
## Merge-on-Key Strategy

The `merge_on_key` sink mode performs an upsert operation on the target Iceberg table:
```sql
MERGE INTO replica.customers AS target
USING staging.customers_batch AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```
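The upsert semantics of that MERGE can be mimicked in memory: matched keys are overwritten, unmatched keys are inserted. A hypothetical sketch (the `merge_on_key` function and sample rows are not from the operator):

```python
def merge_on_key(target, batch, key="customer_id"):
    """Upsert: WHEN MATCHED -> overwrite, WHEN NOT MATCHED -> insert."""
    for row in batch:
        target[row[key]] = row  # UPDATE SET * and INSERT * both reduce to assignment here
    return target

target = {101: {"customer_id": 101, "name": "Acme", "tier": "gold"}}
batch = [
    {"customer_id": 101, "name": "Acme", "tier": "platinum"},   # matched -> update
    {"customer_id": 202, "name": "Initech", "tier": "silver"},  # not matched -> insert
]
merge_on_key(target, batch)
print(sorted(target), target[101]["tier"])  # [101, 202] platinum
```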
## Schema Evolution

When source schemas change, the replication pipeline detects new or modified columns:
| Change Type | Behavior |
|---|---|
| New column added | Automatically add column to target (if `auto_evolve: true`) |
| Column type widened | Apply compatible type promotion |
| Column removed | Mark as nullable in target, preserve historical data |
| Column renamed | Requires manual mapping configuration |
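The decision table above can be expressed as a small planning function. This is a sketch under stated assumptions: the `plan_schema_changes` helper, the promotion whitelist, and the type names are hypothetical, not the pipeline's actual rules.

```python
def plan_schema_changes(source, target, auto_evolve=True):
    """Map each detected column difference to one of the behaviors above."""
    # Widening promotions this sketch treats as compatible.
    promotions = {("int", "bigint"), ("float", "double"), ("varchar(50)", "varchar(255)")}
    actions = []
    for col, src_type in source.items():
        if col not in target:
            actions.append(f"add {col}" if auto_evolve else f"skip {col} (auto_evolve off)")
        elif src_type != target[col]:
            if (target[col], src_type) in promotions:
                actions.append(f"promote {col}: {target[col]} -> {src_type}")
            else:
                actions.append(f"manual mapping required for {col}")
    for col in target:
        if col not in source:
            actions.append(f"mark {col} nullable (removed upstream)")
    return actions

source_schema = {"id": "bigint", "email": "varchar(255)"}
target_schema = {"id": "int", "email": "varchar(255)", "fax": "varchar(20)"}
print(plan_schema_changes(source_schema, target_schema))
```

Note that a rename is indistinguishable from a drop plus an add at this level, which is why the table marks renames as requiring manual mapping.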
## Parallel Replication
For large tables, the operator splits extraction across multiple workers:
| Parameter | Default | Description |
|---|---|---|
| `num_partitions` | 4 | Number of parallel extraction threads |
| `partition_column` | auto-detected | Column to split ranges on |
| `fetch_size` | 5000 | JDBC fetch size per partition |
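The range-splitting step behind `num_partitions` can be sketched as follows. The `split_ranges` and `extract_range` helpers and the `order_id` bounds are illustrative assumptions, not the operator's code:

```python
from concurrent.futures import ThreadPoolExecutor

def split_ranges(min_val, max_val, num_partitions):
    """Split [min_val, max_val] into contiguous half-open ranges, one per worker."""
    step = (max_val - min_val + num_partitions) // num_partitions  # ceiling division
    bounds, lo = [], min_val
    while lo <= max_val:
        hi = min(lo + step, max_val + 1)
        bounds.append((lo, hi))
        lo = hi
    return bounds

def extract_range(bounds):
    lo, hi = bounds
    return f"SELECT * FROM orders WHERE order_id >= {lo} AND order_id < {hi}"

ranges = split_ranges(1, 1_000_000, 4)
with ThreadPoolExecutor(max_workers=4) as pool:
    queries = list(pool.map(extract_range, ranges))
print(len(ranges), ranges[0])  # 4 (1, 250001)
```

Half-open ranges guarantee every row lands in exactly one partition, so the per-worker result sets can be unioned without deduplication.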
## Data Validation
Post-replication validation checks ensure data integrity:
| Check | Description |
|---|---|
| Row count | Source count matches target count |
| Checksum | Hash of key columns matches between source and target |
| Schema drift | Target schema matches source schema |
| Null rates | No unexpected NULL increases in key columns |
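The row-count and checksum checks can be sketched together. A minimal, order-insensitive version — the `validate_replication` helper and the sample rows are hypothetical:

```python
import hashlib

def validate_replication(source_rows, target_rows, key_columns=("customer_id",)):
    """Run the row-count and key-column checksum checks from the table above."""
    results = {"row_count": len(source_rows) == len(target_rows)}

    def checksum(rows):
        # Sort key tuples so the check is independent of extraction order.
        digest = hashlib.sha256()
        for key in sorted(tuple(str(r[c]) for c in key_columns) for r in rows):
            digest.update("|".join(key).encode("utf-8"))
        return digest.hexdigest()

    results["checksum"] = checksum(source_rows) == checksum(target_rows)
    return results

src = [{"customer_id": 1}, {"customer_id": 2}]
tgt = [{"customer_id": 2}, {"customer_id": 1}]  # same rows, different order
print(validate_replication(src, tgt))  # {'row_count': True, 'checksum': True}
```

Hashing only the key columns keeps the check cheap on wide tables while still catching dropped or duplicated rows; a full-row checksum would additionally catch silent value corruption.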
## Related Pages
- Batch Ingestion -- Single-table extraction
- Change Data Capture -- Real-time change capture
- Schema Registry -- Schema management