MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Data Quality Architecture

The Data Quality Service is a Python/FastAPI application that provides validation, profiling, anomaly detection, and quality scoring for data flowing through MATIH pipelines. It integrates with Great Expectations for rule-based validation and uses statistical methods for anomaly detection.
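As a rough illustration of the rule-based validation flow, the sketch below evaluates a completeness rule against a batch of records. All names here (`RuleResult`, `check_completeness`, the severity strings) are hypothetical and chosen for illustration; the actual rule engine lives in `validation/rule_engine.py` and delegates standard checks to Great Expectations.

```python
from dataclasses import dataclass

@dataclass
class RuleResult:
    """Outcome of a single rule evaluation (illustrative shape)."""
    rule: str
    severity: str
    passed: bool
    observed: float

def check_completeness(rows, field, threshold=0.95, severity="error"):
    """Completeness rule: fraction of non-null values in `field`
    must meet or exceed `threshold`."""
    non_null = sum(1 for r in rows if r.get(field) is not None)
    ratio = non_null / len(rows) if rows else 0.0
    return RuleResult(rule=f"completeness:{field}", severity=severity,
                      passed=ratio >= threshold, observed=ratio)

rows = [{"id": 1, "email": "a@x.io"}, {"id": 2, "email": None},
        {"id": 3, "email": "c@x.io"}, {"id": 4, "email": "d@x.io"}]
result = check_completeness(rows, "email", threshold=0.9)
# 3 of 4 emails present -> observed 0.75, rule fails at the 0.9 threshold
```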


Service Overview

Property     Value
Language     Python 3.11
Framework    FastAPI
Port         8000
Namespace    matih-data-plane
Storage      PostgreSQL (rules, scores, profiles)
Cache        Redis (profile cache, score cache)
Source code  data-plane/data-quality-service/src/

Sub-Pages

Page                Description
Validation Rules    Rule types, severity levels, and rule engine
Data Profiling      Statistical profiling and schema inference
Anomaly Detection   Distribution and time-series anomaly detection
Quality Scoring     Multi-dimensional scoring and SLA compliance
Data Observability  Lineage, metrics, and alerting
API Reference       Complete REST API documentation

Component Layout

data-quality-service/src/
  api/routes/          -- REST API endpoints
    rules.py           -- Rule CRUD operations
    validations.py     -- Validation execution
    profiles.py        -- Profile endpoints
    scores.py          -- Quality score queries
    anomalies.py       -- Anomaly detection endpoints
  validation/          -- Rule engine and validators
    rule_engine.py     -- Core rule evaluation
    expectations.py    -- Great Expectations integration
    custom_rules.py    -- Custom rule definitions
  profiling/           -- Data profiling
    engine.py          -- Profiling orchestration
    statistics.py      -- Statistical calculations
    spark_profiler.py  -- Spark-based profiling for large datasets
  scoring/             -- Quality scoring
    calculator.py      -- Score computation
    dimensions.py      -- Dimension calculators
    trends.py          -- Score trend analysis
  anomaly/             -- Anomaly detection
    detector.py        -- Base anomaly detector
    distribution.py    -- Distribution-based detection
    time_series.py     -- Time-series anomaly detection
    freshness.py       -- Data freshness monitoring
  observability/       -- Observability integration
    lineage.py         -- OpenLineage emission
    metrics.py         -- Prometheus metrics
    alerts.py          -- Alert management
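The distribution-based detection in `anomaly/distribution.py` can be approximated with a z-score check: flag values that sit too many standard deviations from a historical baseline. This is a minimal sketch using only the standard library; the function name and threshold default are assumptions, not the service's actual API.

```python
import statistics

def zscore_anomalies(values, history, threshold=3.0):
    """Return values more than `threshold` standard deviations
    from the mean of `history` (simple z-score detector)."""
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Baseline of recent row counts; mean 100, sample stdev 2
history = [100, 102, 98, 101, 99, 100, 103, 97]
anomalies = zscore_anomalies([101, 250, 99], history)
# 250 is 75 standard deviations out and is flagged; 101 and 99 are not
```

In practice a production detector would also handle seasonality and small baselines, which is why the tree above separates `distribution.py`, `time_series.py`, and `freshness.py`.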

Quality Dimensions

The service evaluates data quality across six dimensions:

Dimension     Weight  Description
Completeness  1.0     Percentage of non-null values in required fields
Accuracy      1.0     Values conform to expected ranges and patterns
Consistency   0.8     Referential integrity and cross-field rules
Timeliness    1.0     Data freshness relative to SLA thresholds
Uniqueness    0.8     No unexpected duplicate records
Validity      0.9     Values match expected formats and types
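Using the weights from the table, an overall score can be combined as a weighted average of the per-dimension scores. The combination formula below is an assumption for illustration; the service's actual computation lives in `scoring/calculator.py`.

```python
# Dimension weights from the table above
WEIGHTS = {"completeness": 1.0, "accuracy": 1.0, "consistency": 0.8,
           "timeliness": 1.0, "uniqueness": 0.8, "validity": 0.9}

def overall_score(dimension_scores, weights=WEIGHTS):
    """Weighted average of per-dimension scores (0-100).
    Only dimensions present in `dimension_scores` contribute."""
    total_weight = sum(weights[d] for d in dimension_scores)
    weighted = sum(s * weights[d] for d, s in dimension_scores.items())
    return weighted / total_weight

scores = {"completeness": 98.0, "accuracy": 95.0, "consistency": 90.0,
          "timeliness": 100.0, "uniqueness": 100.0, "validity": 97.0}
total = overall_score(scores)  # roughly 96.78
```

Weighting means a drop in Consistency or Uniqueness (weight 0.8) moves the overall score less than an equal drop in Completeness or Timeliness (weight 1.0).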