Data Quality Architecture
The Data Quality Service is a Python/FastAPI application that provides validation, profiling, anomaly detection, and quality scoring for data flowing through MATIH pipelines. It integrates with Great Expectations for rule-based validation and uses statistical methods for anomaly detection.
Service Overview
| Property | Value |
|---|---|
| Language | Python 3.11 |
| Framework | FastAPI |
| Port | 8000 |
| Namespace | matih-data-plane |
| Storage | PostgreSQL (rules, scores, profiles) |
| Cache | Redis (profile cache, score cache) |
| Source code | data-plane/data-quality-service/src/ |
Sub-Pages
| Page | Description |
|---|---|
| Validation Rules | Rule types, severity levels, and rule engine |
| Data Profiling | Statistical profiling and schema inference |
| Anomaly Detection | Distribution and time-series anomaly detection |
| Quality Scoring | Multi-dimensional scoring and SLA compliance |
| Data Observability | Lineage, metrics, and alerting |
| API Reference | Complete REST API documentation |
Component Layout
data-quality-service/src/
  api/routes/            -- REST API endpoints
    rules.py             -- Rule CRUD operations
    validations.py       -- Validation execution
    profiles.py          -- Profile endpoints
    scores.py            -- Quality score queries
    anomalies.py         -- Anomaly detection endpoints
  validation/            -- Rule engine and validators
    rule_engine.py       -- Core rule evaluation
    expectations.py      -- Great Expectations integration
    custom_rules.py      -- Custom rule definitions
  profiling/             -- Data profiling
    engine.py            -- Profiling orchestration
    statistics.py        -- Statistical calculations
    spark_profiler.py    -- Spark-based profiling for large datasets
  scoring/               -- Quality scoring
    calculator.py        -- Score computation
    dimensions.py        -- Dimension calculators
    trends.py            -- Score trend analysis
  anomaly/               -- Anomaly detection
    detector.py          -- Base anomaly detector
    distribution.py      -- Distribution-based detection
    time_series.py       -- Time-series anomaly detection
    freshness.py         -- Data freshness monitoring
  observability/         -- Observability integration
    lineage.py           -- OpenLineage emission
    metrics.py           -- Prometheus metrics
    alerts.py            -- Alert management
Quality Dimensions
The service evaluates data quality across six dimensions:
| Dimension | Weight | Description |
|---|---|---|
| Completeness | 1.0 | Percentage of non-null values in required fields |
| Accuracy | 1.0 | Values conform to expected ranges and patterns |
| Consistency | 0.8 | Referential integrity and cross-field rules |
| Timeliness | 1.0 | Data freshness relative to SLA thresholds |
| Uniqueness | 0.8 | No unexpected duplicate records |
| Validity | 0.9 | Values match expected formats and types |
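
The weights in the table suggest that per-dimension scores are combined into a single quality score. This page does not give the formula, so the sketch below assumes a weighted mean over scores in the range 0 to 1. The weights are copied from the table. The names `DIMENSION_WEIGHTS`, `completeness`, and `overall_score` are hypothetical and are not taken from `scoring/calculator.py` or `scoring/dimensions.py`.

```python
# Weights copied from the Quality Dimensions table; everything else is illustrative.
DIMENSION_WEIGHTS = {
    "completeness": 1.0,
    "accuracy": 1.0,
    "consistency": 0.8,
    "timeliness": 1.0,
    "uniqueness": 0.8,
    "validity": 0.9,
}


def completeness(records, required_fields):
    """Fraction of required-field values that are non-null, in [0, 1]."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 1.0  # assumption: nothing to check counts as fully complete
    present = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) is not None
    )
    return present / total


def overall_score(dimension_scores, weights=DIMENSION_WEIGHTS):
    """Weighted mean over the dimensions that were actually measured.

    Assumption: a dimension with no score is left out and the remaining
    weights are renormalized, rather than being treated as a score of 0.
    """
    measured = {d: s for d, s in dimension_scores.items() if d in weights}
    if not measured:
        raise ValueError("no known dimensions in dimension_scores")
    weight_sum = sum(weights[d] for d in measured)
    return sum(weights[d] * s for d, s in measured.items()) / weight_sum


if __name__ == "__main__":
    rows = [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": None},
    ]
    c = completeness(rows, ["id", "email"])  # 3 of 4 required values present
    print(c)  # 0.75
    print(overall_score({"completeness": c, "accuracy": 1.0}))  # 0.875
```

Under this assumption, a lower weight such as 0.8 for Consistency means that a poor score in that dimension lowers the overall score less than an equally poor Completeness score would.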