Sync Monitoring
Sync Monitoring provides visibility into the health, performance, and reliability of data ingestion pipelines. This section covers the sync status dashboard, common error patterns with resolutions, Grafana monitoring dashboards, and alerting configuration.
Sync Status Dashboard
The Data Workbench includes a built-in sync monitoring dashboard accessible from Data Workbench > Ingestion > Sync History. The dashboard displays:
Connection Overview
| Column | Description |
|---|---|
| Connection | Connection name and source type |
| Status | Current connection status: ACTIVE, PAUSED, ERROR, PENDING |
| Last Sync | Timestamp and result of the most recent sync |
| Records | Number of records synced in the last run |
| Duration | Time elapsed for the last sync |
| Schedule | Cron schedule or "Manual" |
| Next Run | Estimated time of the next scheduled sync |
Sync History Table
Each sync execution is recorded in the SyncHistory entity with the following fields.
| Field | Type | Description |
|---|---|---|
| id | UUID | Unique sync job identifier |
| connectionId | UUID | Parent connection |
| status | enum | RUNNING, SUCCEEDED, FAILED, CANCELLED |
| recordsSynced | long | Total records extracted and loaded |
| bytesSynced | long | Total bytes transferred |
| startedAt | timestamp | Job start time |
| completedAt | timestamp | Job completion time |
| durationMs | long | Total duration in milliseconds |
| errorMessage | string | Error details (null if succeeded) |
| tablesAffected | JSON | List of tables modified during the sync |
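The fields above can be mirrored in a small Python model; the sketch below is illustrative only (the class, not the platform's actual entity) and shows how durationMs is derived from the start and completion timestamps.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional
import uuid

@dataclass
class SyncHistory:
    """Illustrative mirror of the SyncHistory entity fields listed above."""
    id: str
    connectionId: str
    status: str                      # RUNNING, SUCCEEDED, FAILED, CANCELLED
    recordsSynced: int
    bytesSynced: int
    startedAt: datetime
    completedAt: Optional[datetime]
    errorMessage: Optional[str] = None
    tablesAffected: list = field(default_factory=list)

    @property
    def durationMs(self) -> Optional[int]:
        # durationMs is the elapsed time between start and completion;
        # None while the job is still RUNNING.
        if self.completedAt is None:
            return None
        return int((self.completedAt - self.startedAt).total_seconds() * 1000)

job = SyncHistory(
    id=str(uuid.uuid4()),
    connectionId=str(uuid.uuid4()),
    status="SUCCEEDED",
    recordsSynced=1200,
    bytesSynced=4_500_000,
    startedAt=datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    completedAt=datetime(2024, 1, 1, 12, 0, 42, tzinfo=timezone.utc),
)
print(job.durationMs)  # 42000
```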
Filtering and Search
- By connection: Select a specific connection to view only its sync history
- By status: Filter by `SUCCEEDED`, `FAILED`, `RUNNING`, or `CANCELLED`
- By date range: Set start and end dates to narrow the history window
- Pagination: Default page size is 20 entries, sorted by `startedAt` descending
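These filters map naturally onto query parameters of the sync history endpoint (`GET /api/v1/syncs`, used later in this section for troubleshooting). A minimal URL-building sketch follows; the date-range parameter names (`startedAfter`, `startedBefore`) are assumptions, not a verified API contract.

```python
from urllib.parse import urlencode

def sync_history_url(connection_id=None, status=None,
                     start=None, end=None, page=0, size=20):
    """Build a query URL for the sync history endpoint.

    Parameter names mirror GET /api/v1/syncs as used elsewhere in this
    section; treat the date-range names as assumptions.
    """
    params = {"sort": "startedAt,desc", "page": page, "size": size}
    if connection_id:
        params["connectionId"] = connection_id
    if status:
        params["status"] = status       # SUCCEEDED, FAILED, RUNNING, CANCELLED
    if start:
        params["startedAfter"] = start  # assumed parameter name
    if end:
        params["startedBefore"] = end   # assumed parameter name
    return "/api/v1/syncs?" + urlencode(params)

print(sync_history_url(status="FAILED", connection_id="abc"))
```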
Common Sync Errors and Resolutions
Source Connection Errors
| Error | Cause | Resolution |
|---|---|---|
| Connection refused | Source database is unreachable | Verify hostname and port. Check firewall rules and VPN connectivity. Confirm the database service is running. |
| Authentication failed | Invalid credentials | Update the source with correct username/password. For databases, verify the user has LOGIN privilege. For SaaS connectors, regenerate the API key or refresh token. |
| SSL handshake failed | SSL configuration mismatch | Match the `ssl_mode` in the source config to the server's SSL requirements. Try `require` or `verify-ca`. |
| Connection timed out | Network latency or blocked port | Increase the connection timeout. Verify the network path between the Matih cluster and the source. |
| Unknown database | Database name incorrect | Verify the database name exists on the source server. Database names are case-sensitive on some platforms. |
Schema and Data Errors
| Error | Cause | Resolution |
|---|---|---|
| Schema has changed | Source table columns were added, removed, or type-changed since the last sync | Re-run schema discovery (`POST /sources/{id}/discover`) and update the connection's stream selection. Airbyte handles most schema evolution automatically for append syncs. |
| Primary key violation | Duplicate primary key values in deduplication mode | Verify the source table's primary key is unique. Switch to Full Refresh mode if the source has non-unique keys. |
| Unsupported data type | Source contains a column type not mapped by the connector | Check the connector's type mapping documentation. The unsupported column may need to be excluded from the stream selection. |
| Column count mismatch | Source schema changed between discovery and sync | Re-discover the schema and recreate the connection. |
Destination Errors
| Error | Cause | Resolution |
|---|---|---|
| Iceberg commit failed | Concurrent writes to the same table, or catalog unavailable | Retry the sync. If persistent, check Polaris catalog health and Iceberg metadata storage. |
| Namespace not found | Tenant namespace missing in Polaris | Verify tenant provisioning completed. Contact a platform admin to check Polaris namespace creation. |
| Quota exceeded | Tenant has exceeded its storage quota | Check tenant storage usage. Contact an admin to increase the quota or archive old data. |
Airbyte Infrastructure Errors
| Error | Cause | Resolution |
|---|---|---|
| Worker pod OOMKilled | Sync job exceeded memory limits | Reduce the number of streams synced in parallel. For very wide tables, exclude unused columns. |
| Job exceeded timeout | Sync took longer than the maximum allowed duration | Increase the job timeout for the connection. Consider switching to incremental mode for large tables. |
| Airbyte server unreachable | Tenant's Airbyte deployment is down | Check the Airbyte pod status. The platform automatically restarts failed Airbyte pods. |
| Rate limited by source | Source API enforced rate limits (common for SaaS connectors) | Reduce sync frequency. Airbyte handles rate limiting with automatic backoff, but aggressive schedules can trigger persistent throttling. |
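Airbyte's built-in backoff aside, a client that triggers manual syncs can apply the same exponential-backoff idea for transient failures such as rate limiting. A minimal sketch, where `trigger_sync` is a hypothetical callable (e.g. wrapping an HTTP call that returns False on a 429):

```python
import time

def trigger_with_backoff(trigger_sync, max_attempts=5, base_delay=1.0,
                         sleep=time.sleep):
    """Retry a sync trigger with exponential backoff.

    trigger_sync is any callable returning True on success and False on a
    transient failure (e.g. HTTP 429 from a rate-limited source).
    """
    for attempt in range(max_attempts):
        if trigger_sync():
            return True
        # Double the wait each time: 1s, 2s, 4s, 8s, ...
        sleep(base_delay * (2 ** attempt))
    return False

# Simulated source that succeeds on the third attempt
calls = []
def flaky():
    calls.append(1)
    return len(calls) >= 3

delays = []
ok = trigger_with_backoff(flaky, sleep=delays.append)
print(ok, len(calls), delays)  # True 3 [1.0, 2.0]
```

The injectable `sleep` parameter keeps the sketch testable; a real client would use the default `time.sleep`.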
Monitoring Metrics
The Ingestion Service exposes Prometheus metrics for monitoring sync operations.
Key Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| matih_ingestion_syncs_total | Counter | tenant_id, connection_id, status | Total sync jobs by status |
| matih_ingestion_sync_duration_seconds | Histogram | tenant_id, connection_id | Sync duration distribution |
| matih_ingestion_records_synced_total | Counter | tenant_id, connection_id | Total records synced |
| matih_ingestion_bytes_synced_total | Counter | tenant_id, connection_id | Total bytes synced |
| matih_ingestion_sync_errors_total | Counter | tenant_id, connection_id, error_type | Sync errors by type |
| matih_ingestion_active_syncs | Gauge | tenant_id | Currently running sync jobs |
| matih_ingestion_file_imports_total | Counter | tenant_id, status, format | File import jobs by status and format |
| matih_ingestion_file_import_records_total | Counter | tenant_id | Total records imported from files |
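When Prometheus scrapes the service's metrics endpoint, each sample appears as one line in the text exposition format, with the labels from the table above. A small helper illustrating the wire format (the sample values are made up):

```python
def prometheus_line(name, labels, value):
    """Render one sample in the Prometheus text exposition format.

    Labels are sorted for a deterministic output; Prometheus itself
    accepts any label order.
    """
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = prometheus_line(
    "matih_ingestion_syncs_total",
    {"tenant_id": "t-1", "connection_id": "c-9", "status": "SUCCEEDED"},
    42,
)
print(line)
# matih_ingestion_syncs_total{connection_id="c-9",status="SUCCEEDED",tenant_id="t-1"} 42
```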
Grafana Dashboards
The platform includes pre-built Grafana dashboards for ingestion monitoring.
Dashboard: Ingestion Overview
Provides a high-level view of all ingestion activity across tenants.
| Panel | Visualization | Query |
|---|---|---|
| Sync Success Rate | Stat (percentage) | sum(rate(matih_ingestion_syncs_total{status="SUCCEEDED"}[24h])) / sum(rate(matih_ingestion_syncs_total[24h])) |
| Active Syncs | Stat (gauge) | sum(matih_ingestion_active_syncs) |
| Syncs per Hour | Time series | sum(increase(matih_ingestion_syncs_total[1h])) |
| Records Synced (24h) | Stat (total) | sum(increase(matih_ingestion_records_synced_total[24h])) |
| Sync Duration (p95) | Time series | histogram_quantile(0.95, sum by (le) (rate(matih_ingestion_sync_duration_seconds_bucket[1h]))) |
| Error Rate by Type | Bar chart | sum by (error_type) (rate(matih_ingestion_sync_errors_total[24h])) |
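Note that histogram_quantile interpolates within histogram buckets, so the p95 panel shows an estimate rather than an exact sample. For intuition, here is what a p95 means when computed exactly over raw duration samples (nearest-rank method, illustrative only):

```python
import math

def percentile(values, q):
    """Nearest-rank percentile (0 < q <= 1) of a list of samples.

    The Sync Duration (p95) panel approximates this from histogram
    buckets; here it is computed exactly from raw durations.
    """
    s = sorted(values)
    # nearest-rank: the ceil(q * n)-th smallest value (1-indexed)
    rank = max(1, math.ceil(q * len(s)))
    return s[rank - 1]

durations = [12, 15, 14, 13, 90, 16, 14, 15, 13, 12]  # sync durations, seconds
print(percentile(durations, 0.95))  # 90
```

A single slow outlier (90 s) dominates the p95 even though the median sync is around 14 s, which is exactly why the dashboard tracks p95 rather than the mean.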
Dashboard: Connection Detail
Provides drill-down into a specific connection's sync history.
| Panel | Visualization | Query |
|---|---|---|
| Sync Timeline | Time series (success/failure) | matih_ingestion_syncs_total{connection_id="..."} |
| Records per Sync | Bar chart | matih_ingestion_records_synced_total{connection_id="..."} |
| Duration Trend | Time series | rate(matih_ingestion_sync_duration_seconds_sum{connection_id="..."}[1h]) / rate(matih_ingestion_sync_duration_seconds_count{connection_id="..."}[1h]) |
| Bytes Transferred | Time series | matih_ingestion_bytes_synced_total{connection_id="..."} |
| Error Log | Table | Recent sync errors for this connection |
Dashboard: File Import
Monitors file upload and import operations.
| Panel | Visualization | Query |
|---|---|---|
| Imports by Format | Pie chart | sum by (format) (matih_ingestion_file_imports_total) |
| Import Success Rate | Stat | sum(rate(matih_ingestion_file_imports_total{status="COMPLETED"}[24h])) / sum(rate(matih_ingestion_file_imports_total[24h])) |
| Records Imported (24h) | Stat | increase(matih_ingestion_file_import_records_total[24h]) |
Alerting Configuration
Configure alerts for ingestion failures to ensure data freshness and pipeline reliability.
Recommended Alerts
| Alert | Condition | Severity | Description |
|---|---|---|---|
| Sync Failure | matih_ingestion_syncs_total{status="FAILED"} increases | Warning | A sync job failed. Investigate the error message in sync history. |
| Consecutive Failures | 3+ consecutive FAILED syncs for the same connection | Critical | Persistent failure indicates a configuration or infrastructure issue requiring immediate attention. |
| Sync Duration Spike | Sync duration exceeds 3x the trailing 7-day average | Warning | May indicate source performance degradation, increased data volume, or network issues. |
| No Syncs Running | matih_ingestion_active_syncs == 0 for more than 2x the shortest schedule interval | Warning | Scheduled syncs may not be triggering. Check Airbyte scheduler health. |
| Data Freshness SLA | No successful sync for a connection in the last N hours (configurable per connection) | Critical | Data is stale beyond the acceptable threshold. |
| High Error Rate | sum by (tenant_id) (rate(matih_ingestion_sync_errors_total[1h])) / sum by (tenant_id) (rate(matih_ingestion_syncs_total[1h])) > 0.5 | Warning | More than half of sync attempts are failing. |
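The Data Freshness SLA alert can also be evaluated client-side from sync history when a Prometheus rule is not available. A minimal sketch, where the per-connection `sla_hours` knob is an assumption (the table above only says the threshold is configurable):

```python
from datetime import datetime, timedelta, timezone

def freshness_breached(last_success, sla_hours, now=None):
    """Return True when the last successful sync is older than the SLA.

    last_success: timestamp of the most recent SUCCEEDED sync, or None
    if the connection has never synced successfully.
    """
    now = now or datetime.now(timezone.utc)
    if last_success is None:
        return True  # never succeeded: always considered stale
    return now - last_success > timedelta(hours=sla_hours)

now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2024, 1, 2, 9, 0, tzinfo=timezone.utc)   # 3 hours ago
stale = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)   # 27 hours ago
print(freshness_breached(fresh, 6, now), freshness_breached(stale, 6, now))
# False True
```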
Alert Configuration Example (Prometheus alerting rules)

```yaml
groups:
  - name: ingestion-alerts
    rules:
      - alert: IngestionSyncFailed
        expr: increase(matih_ingestion_syncs_total{status="FAILED"}[1h]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ingestion sync failed"
          description: "Connection {{ $labels.connection_id }} in tenant {{ $labels.tenant_id }} had a failed sync in the last hour."

      - alert: IngestionHighErrorRate
        expr: |
          sum by (tenant_id) (rate(matih_ingestion_sync_errors_total[1h]))
            / sum by (tenant_id) (rate(matih_ingestion_syncs_total[1h])) > 0.5
        for: 15m
        labels:
          severity: critical
        annotations:
          summary: "High ingestion error rate"
          description: "More than 50% of syncs are failing for tenant {{ $labels.tenant_id }}."

      - alert: IngestionSyncDurationSpike
        expr: |
          rate(matih_ingestion_sync_duration_seconds_sum[1h])
            / rate(matih_ingestion_sync_duration_seconds_count[1h])
            > 3 * (rate(matih_ingestion_sync_duration_seconds_sum[7d])
              / rate(matih_ingestion_sync_duration_seconds_count[7d]))
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Sync duration spike detected"
          description: "Connection {{ $labels.connection_id }} sync duration is 3x above the 7-day average."
```

Troubleshooting Workflow
When a sync failure alert fires, follow this workflow.
1. Check sync history: `GET /api/v1/syncs?connectionId={id}&sort=startedAt,desc&size=5`
2. Read the error message from the most recent `FAILED` sync.
3. Classify the error:
   - Source connection error -> check source accessibility and credentials
   - Schema/data error -> re-discover the schema and check for source changes
   - Destination error -> check Polaris catalog and Iceberg health
   - Infrastructure error -> check Airbyte pod status and resources
4. After fixing, trigger a manual sync: `POST /api/v1/syncs/connections/{connectionId}/trigger`
5. Verify the sync completes successfully: `GET /api/v1/syncs/{syncId}`
6. If the issue recurs, check the Grafana dashboards for patterns (time-of-day correlation, data volume correlation, etc.)
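The classification step of this workflow can be partly automated by keyword-matching the error message against the categories in the error tables earlier in this section. A hedged sketch (real connector messages vary, so treat the patterns as examples to extend, not an exhaustive rule set):

```python
def classify_error(error_message):
    """Route a sync error message to a remediation category.

    Categories correspond to the error tables in this section; keyword
    patterns are illustrative and should be tuned to real messages.
    """
    msg = (error_message or "").lower()
    rules = [
        ("source", ["connection refused", "authentication failed",
                    "ssl handshake", "timed out", "unknown database"]),
        ("schema", ["schema has changed", "primary key violation",
                    "unsupported data type", "column count mismatch"]),
        ("destination", ["iceberg commit", "namespace not found",
                         "quota exceeded"]),
        ("infrastructure", ["oomkilled", "exceeded timeout",
                            "unreachable", "rate limit"]),
    ]
    for category, patterns in rules:
        if any(p in msg for p in patterns):
            return category
    return "unknown"

print(classify_error("SSL handshake failed: protocol mismatch"))  # source
print(classify_error("Iceberg commit failed after 3 retries"))    # destination
print(classify_error("Worker pod OOMKilled"))                     # infrastructure
```

An "unknown" result simply means the message needs manual triage, which is the fallback of the workflow above anyway.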