Data Observability
The data observability layer provides end-to-end visibility into data quality through metrics, lineage, tracing, and alerting. It integrates with Prometheus for metrics, OpenLineage for lineage tracking, and the MATIH notification service for alerts.
Source: data-plane/data-quality-service/src/observability/
Observability Components
| Component | Module | Purpose |
|---|---|---|
| Metrics exporter | metrics.py | Prometheus metrics for quality scores and validation results |
| Lineage emitter | lineage.py | OpenLineage events for quality check execution |
| Alert manager | alerts.py | Rule-based alerting on quality breaches |
| Tracing | tracing.py | OpenTelemetry spans for profiling and validation runs |
Prometheus Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| dq_validation_total | Counter | dataset, rule_type, status | Total validation executions |
| dq_validation_duration_seconds | Histogram | dataset | Validation run duration |
| dq_score_current | Gauge | dataset, dimension | Current quality score per dimension |
| dq_anomaly_total | Counter | dataset, type, severity | Anomalies detected |
| dq_profile_duration_seconds | Histogram | dataset, engine | Profiling run duration |
| dq_freshness_age_seconds | Gauge | dataset | Current data age in seconds |
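When scraped, the metrics in the table surface in the Prometheus text exposition format. A minimal sketch of how names, labels, and values map onto that format — the `render_sample` helper and the sample values are illustrative, not the actual metrics.py API:

```python
def render_sample(name, labels, value):
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in labels.items())
    return f"{name}{{{label_str}}} {value}"

# Samples mirroring the metrics table above (values are illustrative).
samples = [
    render_sample("dq_validation_total",
                  {"dataset": "analytics.sales.transactions",
                   "rule_type": "not_null", "status": "passed"}, 42),
    render_sample("dq_score_current",
                  {"dataset": "analytics.sales.transactions",
                   "dimension": "completeness"}, 0.94),
    render_sample("dq_freshness_age_seconds",
                  {"dataset": "analytics.sales.transactions"}, 1800),
]
```

This yields lines such as `dq_score_current{dataset="analytics.sales.transactions",dimension="completeness"} 0.94`, which the queries and dashboards below consume.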
OpenLineage Integration
Every validation and profiling run emits OpenLineage events for lineage tracking:
{
  "eventType": "COMPLETE",
  "job": {
    "namespace": "matih-data-quality",
    "name": "validate-analytics.sales.transactions"
  },
  "inputs": [
    {
      "namespace": "iceberg",
      "name": "analytics.sales.transactions",
      "facets": {
        "dataQualityMetrics": {
          "overallScore": 0.94,
          "rowCount": 1250000,
          "rulesEvaluated": 12,
          "rulesPassed": 10
        }
      }
    }
  ],
  "run": {
    "runId": "run-abc-123"
  }
}
Alert Rules
Alerts are configured per dataset and dimension:
POST /v1/quality/alerts
Request:
{
  "name": "sales-completeness-alert",
  "dataset": "analytics.sales.transactions",
  "condition": {
    "dimension": "completeness",
    "operator": "less_than",
    "threshold": 0.95
  },
  "severity": "critical",
  "channels": ["slack", "email"],
  "recipients": ["data-engineering-team"],
  "cooldownMinutes": 60
}
Alert Channels
| Channel | Integration | Configuration |
|---|---|---|
| Slack | Webhook | Channel URL per team |
| Email | SMTP via notification-service | Distribution list |
| PagerDuty | Events API | Service key for on-call |
| Kafka | Producer | matih.quality.alerts topic |
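To make the rule semantics concrete, here is a minimal sketch of evaluating an alert condition with a cooldown window, as in the request example above. The class, operator table, and method names are hypothetical; the actual alerts.py API is not shown here:

```python
import time

# Illustrative operator set; "less_than" matches the request example above.
OPERATORS = {
    "less_than": lambda score, threshold: score < threshold,
    "greater_than": lambda score, threshold: score > threshold,
}

class AlertRule:
    """Minimal sketch of rule-based alerting with a cooldown window."""

    def __init__(self, name, dimension, operator, threshold, cooldown_minutes):
        self.name = name
        self.dimension = dimension
        self.operator = operator
        self.threshold = threshold
        self.cooldown_seconds = cooldown_minutes * 60
        self._last_fired = None

    def should_fire(self, scores, now=None):
        """Fire when the condition breaches and the rule is not cooling down."""
        now = time.time() if now is None else now
        breached = OPERATORS[self.operator](scores[self.dimension], self.threshold)
        in_cooldown = (self._last_fired is not None
                       and now - self._last_fired < self.cooldown_seconds)
        if breached and not in_cooldown:
            self._last_fired = now
            return True
        return False

rule = AlertRule("sales-completeness-alert", "completeness",
                 "less_than", 0.95, cooldown_minutes=60)
rule.should_fire({"completeness": 0.91}, now=0)    # fires: 0.91 < 0.95
rule.should_fire({"completeness": 0.90}, now=600)  # suppressed: within cooldown
```

The cooldown prevents a flapping score from re-notifying every channel on each evaluation cycle, which is why `cooldownMinutes` is part of the rule rather than the channel configuration.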
Dashboard Integration
Quality metrics are visualized in Grafana dashboards:
| Dashboard | Content |
|---|---|
| Data Quality Overview | Overall scores across all datasets |
| Dataset Detail | Per-dimension scores, trends, anomalies |
| Validation History | Rule pass/fail rates over time |
| Freshness Monitor | Data arrival times and SLA compliance |
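Each dashboard panel ultimately issues PromQL queries against the metrics listed earlier, via the Prometheus HTTP API's /api/v1/query endpoint. A small sketch of building such a query URL — the base URL, helper name, and the specific queries are illustrative, not taken from the dashboard definitions:

```python
from urllib.parse import urlencode

def instant_query_url(base_url, promql):
    """Build a Prometheus instant-query URL (the /api/v1/query endpoint)."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"

# Panels along these lines could back the dashboards above:
overview = instant_query_url(
    "http://prometheus:9090",
    "min by (dataset) (dq_score_current)",   # worst dimension score per dataset
)
freshness = instant_query_url(
    "http://prometheus:9090",
    "dq_freshness_age_seconds > 3600",       # datasets more than an hour stale
)
```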
Incident Workflow
When a critical quality breach is detected:
1. Anomaly/validation failure detected
2. Alert sent to configured channels
3. Pipeline execution paused (if quality gate)
4. On-call engineer investigates root cause
5. Fix applied to source data or pipeline
6. Quality score re-evaluated
7. Pipeline resumes if score meets SLA
Related Pages
- Quality Scoring -- Score computation
- Anomaly Detection -- Anomaly algorithms
- Validation Rules -- Rule configuration