MATIH Platform is in active MVP development. Documentation reflects current implementation status.
10a. Data Ingestion
Sync Monitoring

Sync Monitoring provides visibility into the health, performance, and reliability of data ingestion pipelines. This section covers the sync status dashboard, common error patterns with resolutions, Grafana monitoring dashboards, and alerting configuration.


Sync Status Dashboard

The Data Workbench includes a built-in sync monitoring dashboard accessible from Data Workbench > Ingestion > Sync History. The dashboard displays:

Connection Overview

Column     | Description
Connection | Connection name and source type
Status     | Current connection status: ACTIVE, PAUSED, ERROR, PENDING
Last Sync  | Timestamp and result of the most recent sync
Records    | Number of records synced in the last run
Duration   | Time elapsed for the last sync
Schedule   | Cron schedule or "Manual"
Next Run   | Estimated time of the next scheduled sync

Sync History Table

Each sync execution is recorded in the SyncHistory entity with the following fields.

Field          | Type      | Description
id             | UUID      | Unique sync job identifier
connectionId   | UUID      | Parent connection
status         | enum      | RUNNING, SUCCEEDED, FAILED, CANCELLED
recordsSynced  | long      | Total records extracted and loaded
bytesSynced    | long      | Total bytes transferred
startedAt      | timestamp | Job start time
completedAt    | timestamp | Job completion time
durationMs     | long      | Total duration in milliseconds
errorMessage   | string    | Error details (null if succeeded)
tablesAffected | JSON      | List of tables modified during the sync
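For clients that consume sync history programmatically, the entity above maps naturally onto a small data class. The following is an illustrative sketch only; the field names are Python-style translations of the documented entity, not the platform's actual class definition:

```python
from dataclasses import dataclass, field
from datetime import datetime
from enum import Enum
from typing import Optional
from uuid import UUID, uuid4

class SyncStatus(Enum):
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"
    CANCELLED = "CANCELLED"

@dataclass
class SyncHistory:
    """Mirrors the SyncHistory entity fields documented above."""
    id: UUID
    connection_id: UUID
    status: SyncStatus
    records_synced: int = 0
    bytes_synced: int = 0
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    duration_ms: int = 0
    error_message: Optional[str] = None          # null when the sync succeeded
    tables_affected: list[str] = field(default_factory=list)

# Example: a completed sync record
record = SyncHistory(
    id=uuid4(),
    connection_id=uuid4(),
    status=SyncStatus.SUCCEEDED,
    records_synced=15_000,
    duration_ms=42_000,
    tables_affected=["customers", "orders"],
)
```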

Filtering and Search

  • By connection: Select a specific connection to view only its sync history
  • By status: Filter by SUCCEEDED, FAILED, RUNNING, or CANCELLED
  • By date range: Set start and end dates to narrow the history window
  • Pagination: Default page size is 20 entries, sorted by startedAt descending

Common Sync Errors and Resolutions

Source Connection Errors

Error | Cause | Resolution
Connection refused | Source database is unreachable | Verify hostname and port. Check firewall rules and VPN connectivity. Confirm the database service is running.
Authentication failed | Invalid credentials | Update the source with the correct username/password. For databases, verify the user has LOGIN privilege. For SaaS connectors, regenerate the API key or refresh token.
SSL handshake failed | SSL configuration mismatch | Match the ssl_mode in the source config to the server's SSL requirements. Try require or verify-ca.
Connection timed out | Network latency or blocked port | Increase the connection timeout. Verify the network path between the Matih cluster and the source.
Unknown database | Database name incorrect | Verify the database name exists on the source server. Database names are case-sensitive on some platforms.

Schema and Data Errors

Error | Cause | Resolution
Schema has changed | Source table columns were added, removed, or type-changed since the last sync | Re-run schema discovery (POST /sources/{id}/discover) and update the connection's stream selection. Airbyte handles most schema evolution automatically for append syncs.
Primary key violation | Duplicate primary key values in deduplication mode | Verify the source table's primary key is unique. Switch to Full Refresh mode if the source has non-unique keys.
Unsupported data type | Source contains a column type not mapped by the connector | Check the connector's type mapping documentation. The unsupported column may need to be excluded from the stream selection.
Column count mismatch | Source schema changed between discovery and sync | Re-discover the schema and recreate the connection.
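Re-running schema discovery is a plain POST to the discovery endpoint named above. A standard-library sketch that builds (but does not send) the call; the base URL and bearer-token auth are assumptions:

```python
import json
import urllib.request

def discover_schema_request(base: str, source_id: str,
                            token: str) -> urllib.request.Request:
    """Build the schema re-discovery call: POST /sources/{id}/discover."""
    return urllib.request.Request(
        url=f"{base}/sources/{source_id}/discover",
        method="POST",
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
        data=json.dumps({}).encode(),   # no body parameters assumed
    )

req = discover_schema_request("https://matih.example.com/api/v1",
                              "src-42", "TOKEN")
# Sending it with urllib.request.urlopen(req) would return the
# refreshed stream catalog for updating the connection's selection.
```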

Destination Errors

Error | Cause | Resolution
Iceberg commit failed | Concurrent writes to the same table or catalog unavailable | Retry the sync. If persistent, check Polaris catalog health and Iceberg metadata storage.
Namespace not found | Tenant namespace missing in Polaris | Verify tenant provisioning completed. Contact a platform admin to check Polaris namespace creation.
Quota exceeded | Tenant has exceeded storage quota | Check tenant storage usage. Contact an admin to increase the quota or archive old data.

Airbyte Infrastructure Errors

Error | Cause | Resolution
Worker pod OOMKilled | Sync job exceeded memory limits | Reduce the number of streams synced in parallel. For very wide tables, exclude unused columns.
Job exceeded timeout | Sync took longer than the maximum allowed duration | Increase the job timeout for the connection. Consider switching to incremental mode for large tables.
Airbyte server unreachable | Tenant's Airbyte deployment is down | Check the Airbyte pod status. The platform automatically restarts failed Airbyte pods.
Rate limited by source | Source API enforced rate limits (common for SaaS connectors) | Reduce sync frequency. Airbyte handles rate limiting with automatic backoff, but aggressive schedules can trigger persistent throttling.

Monitoring Metrics

The Ingestion Service exposes Prometheus metrics for monitoring sync operations.

Key Metrics

Metric | Type | Labels | Description
matih_ingestion_syncs_total | Counter | tenant_id, connection_id, status | Total sync jobs by status
matih_ingestion_sync_duration_seconds | Histogram | tenant_id, connection_id | Sync duration distribution
matih_ingestion_records_synced_total | Counter | tenant_id, connection_id | Total records synced
matih_ingestion_bytes_synced_total | Counter | tenant_id, connection_id | Total bytes synced
matih_ingestion_sync_errors_total | Counter | tenant_id, connection_id, error_type | Sync errors by type
matih_ingestion_active_syncs | Gauge | tenant_id | Currently running sync jobs
matih_ingestion_file_imports_total | Counter | tenant_id, status, format | File import jobs by status and format
matih_ingestion_file_import_records_total | Counter | tenant_id | Total records imported from files
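As a sanity check on how these counters are read, the success-rate ratio used throughout this section divides succeeded syncs by all syncs. A minimal sketch of that computation over raw counter samples keyed by the status label (labels otherwise simplified):

```python
def sync_success_rate(samples: dict[str, float]) -> float:
    """Compute succeeded/total from matih_ingestion_syncs_total samples,
    keyed by the `status` label value."""
    total = sum(samples.values())
    if total == 0:
        return 1.0  # no syncs observed yet: treat as healthy
    return samples.get("SUCCEEDED", 0.0) / total

# Counter values for one connection, keyed by status:
rate = sync_success_rate({"SUCCEEDED": 95, "FAILED": 4, "CANCELLED": 1})
```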

Grafana Dashboards

The platform includes pre-built Grafana dashboards for ingestion monitoring.

Dashboard: Ingestion Overview

Provides a high-level view of all ingestion activity across tenants.

Panel | Visualization | Query
Sync Success Rate | Stat (percentage) | sum(rate(matih_ingestion_syncs_total{status="SUCCEEDED"}[24h])) / sum(rate(matih_ingestion_syncs_total[24h]))
Active Syncs | Stat (gauge) | sum(matih_ingestion_active_syncs)
Syncs per Hour | Time series | sum(increase(matih_ingestion_syncs_total[1h]))
Records Synced (24h) | Stat (total) | sum(increase(matih_ingestion_records_synced_total[24h]))
Sync Duration (p95) | Time series | histogram_quantile(0.95, sum by (le) (rate(matih_ingestion_sync_duration_seconds_bucket[1h])))
Error Rate by Type | Bar chart | sum by (error_type) (rate(matih_ingestion_sync_errors_total[24h]))

Dashboard: Connection Detail

Provides drill-down into a specific connection's sync history.

Panel | Visualization | Query
Sync Timeline | Time series (success/failure) | sum by (status) (increase(matih_ingestion_syncs_total{connection_id="..."}[1h]))
Records per Sync | Bar chart | increase(matih_ingestion_records_synced_total{connection_id="..."}[1h])
Duration Trend | Time series | rate(matih_ingestion_sync_duration_seconds_sum{connection_id="..."}[1h]) / rate(matih_ingestion_sync_duration_seconds_count{connection_id="..."}[1h])
Bytes Transferred | Time series | increase(matih_ingestion_bytes_synced_total{connection_id="..."}[1h])
Error Log | Table | Recent sync errors for this connection

Dashboard: File Import

Monitors file upload and import operations.

Panel | Visualization | Query
Imports by Format | Pie chart | sum by (format) (matih_ingestion_file_imports_total)
Import Success Rate | Stat | sum(rate(matih_ingestion_file_imports_total{status="COMPLETED"}[24h])) / sum(rate(matih_ingestion_file_imports_total[24h]))
Records Imported (24h) | Stat | increase(matih_ingestion_file_import_records_total[24h])

Alerting Configuration

Configure alerts for ingestion failures to ensure data freshness and pipeline reliability.

Recommended Alerts

Alert | Condition | Severity | Description
Sync Failure | matih_ingestion_syncs_total{status="FAILED"} increases | Warning | A sync job failed. Investigate the error message in sync history.
Consecutive Failures | 3+ consecutive FAILED syncs for the same connection | Critical | Persistent failure indicates a configuration or infrastructure issue requiring immediate attention.
Sync Duration Spike | Sync duration exceeds 3x the trailing 7-day average | Warning | May indicate source performance degradation, increased data volume, or network issues.
No Syncs Running | matih_ingestion_active_syncs == 0 for more than 2x the shortest schedule interval | Warning | Scheduled syncs may not be triggering. Check Airbyte scheduler health.
Data Freshness SLA | No successful sync for a connection in the last N hours (configurable per connection) | Critical | Data is stale beyond the acceptable threshold.
High Error Rate | sum(rate(matih_ingestion_sync_errors_total[1h])) / sum(rate(matih_ingestion_syncs_total[1h])) > 0.5 | Warning | More than half of sync attempts are failing.

Alert Configuration Example (Prometheus AlertManager)

groups:
  - name: ingestion-alerts
    rules:
      - alert: IngestionSyncFailed
        expr: increase(matih_ingestion_syncs_total{status="FAILED"}[1h]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ingestion sync failed"
          description: "Connection {{ $labels.connection_id }} in tenant {{ $labels.tenant_id }} had a failed sync in the last hour."
 
      - alert: IngestionHighErrorRate
        # Aggregate both sides to a common label set before dividing;
        # the raw series carry different extra labels (error_type vs. status),
        # so an unaggregated ratio would match no samples.
        expr: |
          sum by (tenant_id) (rate(matih_ingestion_sync_errors_total[1h]))
          / sum by (tenant_id) (rate(matih_ingestion_syncs_total[1h])) > 0.5
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "High ingestion error rate"
          description: "More than 50% of syncs are failing for tenant {{ $labels.tenant_id }}."
 
      - alert: IngestionSyncDurationSpike
        # matih_ingestion_sync_duration_seconds is a histogram, so the average
        # duration must be derived from its _sum and _count series.
        expr: |
          (rate(matih_ingestion_sync_duration_seconds_sum[1h])
            / rate(matih_ingestion_sync_duration_seconds_count[1h]))
          > 3 * (rate(matih_ingestion_sync_duration_seconds_sum[7d])
            / rate(matih_ingestion_sync_duration_seconds_count[7d]))
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Sync duration spike detected"
          description: "Connection {{ $labels.connection_id }} sync duration is 3x above the 7-day average."

Troubleshooting Workflow

When a sync failure alert fires, follow this workflow.

1. Check sync history
   GET /api/v1/syncs?connectionId={id}&sort=startedAt,desc&size=5

2. Read the error message from the most recent FAILED sync

3. Classify the error:
   - Source connection error    -> Check source accessibility and credentials
   - Schema/data error          -> Re-discover schema, check source changes
   - Destination error          -> Check Polaris catalog and Iceberg health
   - Infrastructure error       -> Check Airbyte pod status and resources

4. After fixing, trigger a manual sync:
   POST /api/v1/syncs/connections/{connectionId}/trigger

5. Verify the sync completes successfully:
   GET /api/v1/syncs/{syncId}

6. If the issue recurs, check Grafana dashboards for patterns
   (time-of-day correlation, data volume correlation, etc.)
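Step 3 of the workflow can be partially automated by keyword-matching the errorMessage field against the four categories above. The keyword lists here are heuristics drawn from the error tables in this section, not an official mapping:

```python
# Map errorMessage text to the four troubleshooting categories
# (heuristic keywords taken from the error tables above).
CATEGORIES = {
    "source": ["connection refused", "authentication failed", "ssl handshake",
               "timed out", "unknown database"],
    "schema": ["schema has changed", "primary key violation",
               "unsupported data type", "column count mismatch"],
    "destination": ["iceberg commit", "namespace not found", "quota exceeded"],
    "infrastructure": ["oomkilled", "exceeded timeout", "unreachable",
                       "rate limit"],
}

def classify_error(message: str) -> str:
    """Return the first category whose keywords match, or 'unknown'."""
    text = (message or "").lower()
    for category, keywords in CATEGORIES.items():
        if any(k in text for k in keywords):
            return category
    return "unknown"
```

An "unknown" result means the message needs manual triage against the sync history and Grafana dashboards.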