Data Quality Architecture

The Data Quality Service is a Python/FastAPI application that provides validation, profiling, anomaly detection, and quality scoring for data flowing through MATIH pipelines. It integrates with Great Expectations for rule-based validation and applies statistical methods (distribution-based and time-series) for anomaly detection.
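
As an illustration of the kind of statistical check involved in anomaly detection, the following minimal sketch flags a new observation whose z-score against a historical baseline exceeds a threshold. The function name `is_anomalous`, the threshold of 3.0, and the row-count example are assumptions made for illustration; they are not taken from the service's source code, which may use different methods.

```python
from statistics import mean, stdev


def is_anomalous(history, new_value, threshold=3.0):
    """Return True if new_value deviates from the baseline by more than
    `threshold` sample standard deviations (hypothetical helper)."""
    if len(history) < 2:
        # Not enough data to estimate spread; treat nothing as anomalous.
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # A constant baseline: any change at all is a deviation.
        return new_value != mu
    return abs(new_value - mu) / sigma > threshold


# Illustrative daily row counts for a table (made-up numbers).
daily_row_counts = [1000, 1020, 990, 1010, 1005, 995, 1015]
print(is_anomalous(daily_row_counts, 100))
print(is_anomalous(daily_row_counts, 1012))
```

A z-score is only one option; it assumes a roughly stable, unimodal baseline and is sensitive to outliers already present in the history.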

Service Overview

Property	Value
Language	Python 3.11
Framework	FastAPI
Port	8000
Namespace	`matih-data-plane`
Storage	PostgreSQL (rules, scores, profiles)
Cache	Redis (profile cache, score cache)
Source code	`data-plane/data-quality-service/src/`

Sub-Pages

Page	Description
Validation Rules	Rule types, severity levels, and rule engine
Data Profiling	Statistical profiling and schema inference
Anomaly Detection	Distribution and time-series anomaly detection
Quality Scoring	Multi-dimensional scoring and SLA compliance
Data Observability	Lineage, metrics, and alerting
API Reference	Complete REST API documentation

Component Layout

data-quality-service/src/
  api/routes/          -- REST API endpoints
    rules.py           -- Rule CRUD operations
    validations.py     -- Validation execution
    profiles.py        -- Profile endpoints
    scores.py          -- Quality score queries
    anomalies.py       -- Anomaly detection endpoints
  validation/          -- Rule engine and validators
    rule_engine.py     -- Core rule evaluation
    expectations.py    -- Great Expectations integration
    custom_rules.py    -- Custom rule definitions
  profiling/           -- Data profiling
    engine.py          -- Profiling orchestration
    statistics.py      -- Statistical calculations
    spark_profiler.py  -- Spark-based profiling for large datasets
  scoring/             -- Quality scoring
    calculator.py      -- Score computation
    dimensions.py      -- Dimension calculators
    trends.py          -- Score trend analysis
  anomaly/             -- Anomaly detection
    detector.py        -- Base anomaly detector
    distribution.py    -- Distribution-based detection
    time_series.py     -- Time-series anomaly detection
    freshness.py       -- Data freshness monitoring
  observability/       -- Observability integration
    lineage.py         -- OpenLineage emission
    metrics.py         -- Prometheus metrics
    alerts.py          -- Alert management

Quality Dimensions

The service evaluates data quality across six dimensions, each assigned a weight:

Dimension	Weight	Description
Completeness	1.0	Percentage of non-null values in required fields
Accuracy	1.0	Values conform to expected ranges and patterns
Consistency	0.8	Referential integrity and cross-field rules
Timeliness	1.0	Data freshness relative to SLA thresholds
Uniqueness	0.8	No unexpected duplicate records
Validity	0.9	Values match expected formats and types
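
The table does not state how the dimension weights are combined. One plausible reading is a weighted mean of per-dimension scores in the range 0 to 1. The sketch below shows that reading, along with a completeness calculation that follows the table's description. The names `DIMENSION_WEIGHTS`, `completeness`, and `overall_score` are hypothetical and do not come from `scoring/calculator.py` or `scoring/dimensions.py`.

```python
# Weights copied from the Quality Dimensions table.
DIMENSION_WEIGHTS = {
    "completeness": 1.0,
    "accuracy": 1.0,
    "consistency": 0.8,
    "timeliness": 1.0,
    "uniqueness": 0.8,
    "validity": 0.9,
}


def completeness(records, required_fields):
    """Fraction of required field values that are present and non-null."""
    total = len(records) * len(required_fields)
    if total == 0:
        return 1.0  # Nothing required, so nothing is missing.
    present = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) is not None
    )
    return present / total


def overall_score(dimension_scores, weights=DIMENSION_WEIGHTS):
    """Weighted mean of per-dimension scores (assumed aggregation)."""
    missing = set(weights) - set(dimension_scores)
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * dimension_scores[name] for name in weights)
    return weighted_sum / total_weight


# Illustrative input: two records, one with a null required field.
rows = [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": None}]
scores = {
    "completeness": completeness(rows, ["id", "email"]),
    "accuracy": 0.95,
    "consistency": 0.90,
    "timeliness": 1.0,
    "uniqueness": 0.99,
    "validity": 0.97,
}
print(round(overall_score(scores), 4))
```

Under this reading, a dimension with weight 0.8 (Consistency, Uniqueness) pulls the overall score down less than an equally low score in a dimension with weight 1.0.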

Event Sourcing Validation Rules