Sync Monitoring
Sync Monitoring provides visibility into the health, performance, and reliability of data ingestion pipelines. This section covers the sync status dashboard, common error patterns with resolutions, Grafana monitoring dashboards, and alerting configuration.
Sync Status Dashboard
The Data Workbench includes a built-in sync monitoring dashboard accessible from Data Workbench > Ingestion > Sync History. The dashboard displays:
Connection Overview
| Column | Description |
|---|---|
| Connection | Connection name and source type |
| Status | Current connection status: ACTIVE, PAUSED, ERROR, PENDING |
| Last Sync | Timestamp and result of the most recent sync |
| Records | Number of records synced in the last run |
| Duration | Time elapsed for the last sync |
| Schedule | Cron schedule or "Manual" |
| Next Run | Estimated time of the next scheduled sync |
Sync History Table
Each sync execution is recorded in the SyncHistory entity with the following fields.
| Field | Type | Description |
|---|---|---|
| id | UUID | Unique sync job identifier |
| connectionId | UUID | Parent connection |
| status | enum | RUNNING, SUCCEEDED, FAILED, CANCELLED |
| recordsSynced | long | Total records extracted and loaded |
| bytesSynced | long | Total bytes transferred |
| startedAt | timestamp | Job start time |
| completedAt | timestamp | Job completion time |
| durationMs | long | Total duration in milliseconds |
| errorMessage | string | Error details (null if succeeded) |
| tablesAffected | JSON | List of tables modified during the sync |
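The fields above can be mirrored in a small Python model; the sketch below is illustrative only (the class, not the platform's actual entity) and shows how durationMs is derived from the start and completion timestamps.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import uuid

@dataclass
class SyncHistory:
    """Illustrative mirror of the SyncHistory entity fields listed above."""
    id: str
    connectionId: str
    status: str                      # RUNNING, SUCCEEDED, FAILED, CANCELLED
    recordsSynced: int
    bytesSynced: int
    startedAt: datetime
    completedAt: Optional[datetime]
    errorMessage: Optional[str] = None
    tablesAffected: list = field(default_factory=list)

    @property
    def durationMs(self) -> Optional[int]:
        # durationMs is the elapsed time between start and completion;
        # None while the job is still RUNNING.
        if self.completedAt is None:
            return None
        return int((self.completedAt - self.startedAt).total_seconds() * 1000)

job = SyncHistory(
    id=str(uuid.uuid4()),
    connectionId=str(uuid.uuid4()),
    status="SUCCEEDED",
    recordsSynced=1200,
    bytesSynced=4_500_000,
    startedAt=datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    completedAt=datetime(2024, 1, 1, 12, 0, 42, tzinfo=timezone.utc),
)
print(job.durationMs)  # 42000
```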
Filtering and Search
- By connection: Select a specific connection to view only its sync history
- By status: Filter by `SUCCEEDED`, `FAILED`, `RUNNING`, or `CANCELLED`
- By date range: Set start and end dates to narrow the history window
- Pagination: Default page size is 20 entries, sorted by `startedAt` descending
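These filters map naturally onto query parameters of the sync history endpoint (`GET /api/v1/syncs`, used later in this section for troubleshooting). A minimal URL-building sketch follows; the date-range parameter names (`startedAfter`, `startedBefore`) are assumptions, not a verified API contract.

```python
from urllib.parse import urlencode

def sync_history_url(connection_id=None, status=None,
                     start=None, end=None, page=0, size=20):
    """Build a query URL for the sync history endpoint.

    Parameter names mirror GET /api/v1/syncs as used elsewhere in this
    section; treat the date-range names as assumptions.
    """
    params = {"sort": "startedAt,desc", "page": page, "size": size}
    if connection_id:
        params["connectionId"] = connection_id
    if status:
        params["status"] = status       # SUCCEEDED, FAILED, RUNNING, CANCELLED
    if start:
        params["startedAfter"] = start  # assumed parameter name
    if end:
        params["startedBefore"] = end   # assumed parameter name
    return "/api/v1/syncs?" + urlencode(params)

print(sync_history_url(status="FAILED", connection_id="abc"))
```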
Common Sync Errors and Resolutions
Source Connection Errors
| Error | Cause | Resolution |
|---|---|---|
| Connection refused | Source database is unreachable | Verify hostname and port. Check firewall rules and VPN connectivity. Confirm the database service is running. |
| Authentication failed | Invalid credentials | Update the source with correct username/password. For databases, verify the user has LOGIN privilege. For SaaS connectors, regenerate the API key or refresh token. |
| SSL handshake failed | SSL configuration mismatch | Match the `ssl_mode` in the source config to the server's SSL requirements. Try `require` or `verify-ca`. |
| Connection timed out | Network latency or blocked port | Increase the connection timeout. Verify the network path between the Matih cluster and the source. |
| Unknown database | Database name incorrect | Verify the database name exists on the source server. Database names are case-sensitive on some platforms. |
Schema and Data Errors
| Error | Cause | Resolution |
|---|---|---|
| Schema has changed | Source table columns were added, removed, or type-changed since the last sync | Re-run schema discovery (`POST /sources/{id}/discover`) and update the connection's stream selection. Airbyte handles most schema evolution automatically for append syncs. |
| Primary key violation | Duplicate primary key values in deduplication mode | Verify the source table's primary key is unique. Switch to Full Refresh mode if the source has non-unique keys. |
| Unsupported data type | Source contains a column type not mapped by the connector | Check the connector's type mapping documentation. The unsupported column may need to be excluded from the stream selection. |
| Column count mismatch | Source schema changed between discovery and sync | Re-discover the schema and recreate the connection. |
Destination Errors
| Error | Cause | Resolution |
|---|---|---|
| Iceberg commit failed | Concurrent writes to the same table, or catalog unavailable | Retry the sync. If persistent, check Polaris catalog health and Iceberg metadata storage. |
| Namespace not found | Tenant namespace missing in Polaris | Verify tenant provisioning completed. Contact a platform admin to check Polaris namespace creation. |
| Quota exceeded | Tenant has exceeded its storage quota | Check tenant storage usage. Contact an admin to increase the quota or archive old data. |
Airbyte Infrastructure Errors
| Error | Cause | Resolution |
|---|---|---|
| Worker pod OOMKilled | Sync job exceeded memory limits | Reduce the number of streams synced in parallel. For very wide tables, exclude unused columns. |
| Job exceeded timeout | Sync took longer than the maximum allowed duration | Increase the job timeout for the connection. Consider switching to incremental mode for large tables. |
| Airbyte server unreachable | Tenant's Airbyte deployment is down | Check the Airbyte pod status. The platform automatically restarts failed Airbyte pods. |
| Rate limited by source | Source API enforced rate limits (common for SaaS connectors) | Reduce sync frequency. Airbyte handles rate limiting with automatic backoff, but aggressive schedules can trigger persistent throttling. |
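Airbyte's built-in backoff aside, a client that triggers manual syncs can apply the same exponential-backoff idea for transient failures such as rate limiting. A minimal sketch, where `trigger_sync` is a hypothetical callable (e.g. wrapping an HTTP call that returns False on a 429):

```python
import time

def trigger_with_backoff(trigger_sync, max_attempts=5, base_delay=1.0,
                         sleep=time.sleep):
    """Retry a sync trigger with exponential backoff.

    trigger_sync is any callable returning True on success and False on a
    transient failure (e.g. HTTP 429 from a rate-limited source).
    """
    for attempt in range(max_attempts):
        if trigger_sync():
            return True
        # Double the wait each time: 1s, 2s, 4s, 8s, ...
        sleep(base_delay * (2 ** attempt))
    return False

# Simulated source that succeeds on the third attempt
calls = []
def flaky():
    calls.append(1)
    return len(calls) >= 3

delays = []
ok = trigger_with_backoff(flaky, sleep=delays.append)
print(ok, len(calls), delays)  # True 3 [1.0, 2.0]
```

The injectable `sleep` parameter keeps the sketch testable; a real client would use the default `time.sleep`.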
Monitoring Metrics
The Ingestion Service exposes Prometheus metrics for monitoring sync operations.
Key Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| matih_ingestion_syncs_total | Counter | tenant_id, connection_id, status | Total sync jobs by status |
| matih_ingestion_sync_duration_seconds | Histogram | tenant_id, connection_id | Sync duration distribution |
| matih_ingestion_records_synced_total | Counter | tenant_id, connection_id | Total records synced |
| matih_ingestion_bytes_synced_total | Counter | tenant_id, connection_id | Total bytes synced |
| matih_ingestion_sync_errors_total | Counter | tenant_id, connection_id, error_type | Sync errors by type |
| matih_ingestion_active_syncs | Gauge | tenant_id | Currently running sync jobs |
| matih_ingestion_file_imports_total | Counter | tenant_id, status, format | File import jobs by status and format |
| matih_ingestion_file_import_records_total | Counter | tenant_id | Total records imported from files |
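When Prometheus scrapes the service's metrics endpoint, each sample appears as one line in the text exposition format, with the labels from the table above. A small helper illustrating the wire format (the sample values are made up):

```python
def prometheus_line(name, labels, value):
    """Render one sample in the Prometheus text exposition format.

    Labels are sorted for a deterministic output; Prometheus itself
    accepts any label order.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = prometheus_line(
    "matih_ingestion_syncs_total",
    {"tenant_id": "t-1", "connection_id": "c-9", "status": "SUCCEEDED"},
    42,
)
print(line)
# matih_ingestion_syncs_total{connection_id="c-9",status="SUCCEEDED",tenant_id="t-1"} 42
```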
Grafana Dashboards
The platform includes pre-built Grafana dashboards for ingestion monitoring.
Dashboard: Ingestion Overview
Provides a high-level view of all ingestion activity across tenants.
| Panel | Visualization | Query |
|---|---|---|
| Sync Success Rate | Stat (percentage) | sum(rate(matih_ingestion_syncs_total{status="SUCCEEDED"}[24h])) / sum(rate(matih_ingestion_syncs_total[24h])) |
| Active Syncs | Stat (gauge) | sum(matih_ingestion_active_syncs) |
| Syncs per Hour | Time series | sum(increase(matih_ingestion_syncs_total[1h])) |
| Records Synced (24h) | Stat (total) | sum(increase(matih_ingestion_records_synced_total[24h])) |
| Sync Duration (p95) | Time series | histogram_quantile(0.95, sum by (le) (rate(matih_ingestion_sync_duration_seconds_bucket[1h]))) |
| Error Rate by Type | Bar chart | sum by (error_type) (rate(matih_ingestion_sync_errors_total[24h])) |
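Note that histogram_quantile interpolates within histogram buckets, so the p95 panel shows an estimate rather than an exact sample. For intuition, here is what a p95 means when computed exactly over raw duration samples (nearest-rank method, illustrative only):

```python
import math

def percentile(values, q):
    """Nearest-rank percentile (0 < q <= 1) of a list of samples.

    The Sync Duration (p95) panel approximates this from histogram
    buckets; here it is computed exactly from raw durations.
    """
    s = sorted(values)
    # nearest-rank: the ceil(q * n)-th smallest value (1-indexed)
    rank = max(1, math.ceil(q * len(s)))
    return s[rank - 1]

durations = [12, 15, 14, 13, 90, 16, 14, 15, 13, 12]  # sync durations, seconds
print(percentile(durations, 0.95))  # 90
```

A single slow outlier (90 s) dominates the p95 even though the median sync is around 14 s, which is exactly why the dashboard tracks p95 rather than the mean.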
Dashboard: Connection Detail
Provides drill-down into a specific connection's sync history.
| Panel | Visualization | Query |
|---|---|---|
| Sync Timeline | Time series (success/failure) | matih_ingestion_syncs_total{connection_id="..."} |
| Records per Sync | Bar chart | matih_ingestion_records_synced_total{connection_id="..."} |
| Duration Trend | Time series | rate(matih_ingestion_sync_duration_seconds_sum{connection_id="..."}[1h]) / rate(matih_ingestion_sync_duration_seconds_count{connection_id="..."}[1h]) |
| Bytes Transferred | Time series | matih_ingestion_bytes_synced_total{connection_id="..."} |
| Error Log | Table | Recent sync errors for this connection |
Dashboard: File Import
Monitors file upload and import operations.
| Panel | Visualization | Query |
|---|---|---|
| Imports by Format | Pie chart | sum by (format) (matih_ingestion_file_imports_total) |
| Import Success Rate | Stat | sum(rate(matih_ingestion_file_imports_total{status="COMPLETED"}[24h])) / sum(rate(matih_ingestion_file_imports_total[24h])) |
| Records Imported (24h) | Stat | increase(matih_ingestion_file_import_records_total[24h]) |
Alerting Configuration
Configure alerts for ingestion failures to ensure data freshness and pipeline reliability.
Recommended Alerts
| Alert | Condition | Severity | Description |
|---|---|---|---|
| Sync Failure | matih_ingestion_syncs_total{status="FAILED"} increases | Warning | A sync job failed. Investigate the error message in sync history. |
| Consecutive Failures | 3+ consecutive FAILED syncs for the same connection | Critical | Persistent failure indicates a configuration or infrastructure issue requiring immediate attention. |
| Sync Duration Spike | Sync duration exceeds 3x the trailing 7-day average | Warning | May indicate source performance degradation, increased data volume, or network issues. |
| No Syncs Running | matih_ingestion_active_syncs == 0 for more than 2x the shortest schedule interval | Warning | Scheduled syncs may not be triggering. Check Airbyte scheduler health. |
| Data Freshness SLA | No successful sync for a connection in the last N hours (configurable per connection) | Critical | Data is stale beyond the acceptable threshold. |
| High Error Rate | sum by (tenant_id) (rate(matih_ingestion_sync_errors_total[1h])) / sum by (tenant_id) (rate(matih_ingestion_syncs_total[1h])) > 0.5 | Warning | More than half of sync attempts are failing. |
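The Data Freshness SLA alert can also be evaluated client-side from sync history when a Prometheus rule is not available. A minimal sketch, where the per-connection `sla_hours` knob is an assumption (the table above only says the threshold is configurable):

```python
from datetime import datetime, timedelta, timezone

def freshness_breached(last_success, sla_hours, now=None):
    """Return True when the last successful sync is older than the SLA.

    last_success: timestamp of the most recent SUCCEEDED sync, or None
    if the connection has never synced successfully.
    """
    now = now or datetime.now(timezone.utc)
    if last_success is None:
        return True  # never succeeded: always considered stale
    return now - last_success > timedelta(hours=sla_hours)

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)   # 3 hours ago
stale = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)   # 27 hours ago
print(freshness_breached(fresh, 6, now), freshness_breached(stale, 6, now))
# False True
```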
Alert Configuration Example (Prometheus alerting rules)

```yaml
groups:
  - name: ingestion-alerts
    rules:
      - alert: IngestionSyncFailed
        expr: increase(matih_ingestion_syncs_total{status="FAILED"}[1h]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ingestion sync failed"
          description: "Connection {{ $labels.connection_id }} in tenant {{ $labels.tenant_id }} had a failed sync in the last hour."

      - alert: IngestionHighErrorRate
        expr: |
          sum by (tenant_id) (rate(matih_ingestion_sync_errors_total[1h]))
            / sum by (tenant_id) (rate(matih_ingestion_syncs_total[1h])) > 0.5
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "High ingestion error rate"
          description: "More than 50% of syncs are failing for tenant {{ $labels.tenant_id }}."

      - alert: IngestionSyncDurationSpike
        expr: |
          rate(matih_ingestion_sync_duration_seconds_sum[1h])
            / rate(matih_ingestion_sync_duration_seconds_count[1h])
            > 3 * (rate(matih_ingestion_sync_duration_seconds_sum[7d])
              / rate(matih_ingestion_sync_duration_seconds_count[7d]))
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Sync duration spike detected"
          description: "Connection {{ $labels.connection_id }} sync duration is 3x above the 7-day average."
```

Troubleshooting Workflow
When a sync failure alert fires, follow this workflow.
1. Check sync history: `GET /api/v1/syncs?connectionId={id}&sort=startedAt,desc&size=5`
2. Read the error message from the most recent `FAILED` sync.
3. Classify the error:
   - Source connection error -> check source accessibility and credentials
   - Schema/data error -> re-discover the schema and check for source changes
   - Destination error -> check Polaris catalog and Iceberg health
   - Infrastructure error -> check Airbyte pod status and resources
4. After fixing, trigger a manual sync: `POST /api/v1/syncs/connections/{connectionId}/trigger`
5. Verify the sync completes successfully: `GET /api/v1/syncs/{syncId}`
6. If the issue recurs, check the Grafana dashboards for patterns (time-of-day correlation, data volume correlation, etc.)
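The classification step of this workflow can be partly automated by keyword-matching the error message against the categories in the error tables earlier in this section. A hedged sketch (real connector messages vary, so treat the patterns as examples to extend, not an exhaustive rule set):

```python
def classify_error(error_message):
    """Route a sync error message to a remediation category.

    Categories correspond to the error tables in this section; keyword
    patterns are illustrative and should be tuned to real messages.
    """
    msg = (error_message or "").lower()
    rules = [
        ("source", ["connection refused", "authentication failed",
                    "ssl handshake", "timed out", "unknown database"]),
        ("schema", ["schema has changed", "primary key violation",
                    "unsupported data type", "column count mismatch"]),
        ("destination", ["iceberg commit", "namespace not found",
                         "quota exceeded"]),
        ("infrastructure", ["oomkilled", "exceeded timeout",
                            "unreachable", "rate limit"]),
    ]
    for category, patterns in rules:
        if any(p in msg for p in patterns):
            return category
    return "unknown"

print(classify_error("SSL handshake failed: protocol mismatch"))  # source
print(classify_error("Iceberg commit failed after 3 retries"))    # destination
print(classify_error("Worker pod OOMKilled"))                     # infrastructure
```

An "unknown" result simply means the message needs manual triage, which is the fallback of the workflow above anyway.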