MATIH Platform is in active MVP development. Documentation reflects current implementation status.
Anomaly Detection

The anomaly detection subsystem identifies unexpected changes in data distributions, volumes, freshness, and statistical properties. It uses multiple detection algorithms to catch data quality issues before they propagate to downstream consumers.

Source: data-plane/data-quality-service/src/anomaly/


Detection Methods

| Method | Module | Description |
|---|---|---|
| Distribution shift | distribution.py | Detects changes in value distributions using KL divergence and KS tests |
| Time-series | time_series.py | Identifies anomalous metric values using STL decomposition and Prophet |
| Freshness | freshness.py | Monitors data arrival times against expected SLA windows |
| Advanced | advanced_detector.py | Multivariate anomaly detection using isolation forests |

Distribution Anomaly Detection

The distribution detector compares each column's value distribution between profiling runs and flags statistical shifts:

| Algorithm | Use Case | Sensitivity |
|---|---|---|
| Kolmogorov-Smirnov test | Continuous numeric columns | Medium |
| Chi-squared test | Categorical columns | Medium |
| KL divergence | Probability distribution shifts | High |
| Jensen-Shannon divergence | Symmetric distribution comparison | Medium |
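
To make two of the measures above concrete, here is a minimal standard-library sketch of the two-sample Kolmogorov-Smirnov statistic and the Jensen-Shannon divergence. The function names are hypothetical and are not taken from distribution.py; the service's own implementation may differ, for example by relying on a statistics library.

```python
import math
from bisect import bisect_right
from collections import Counter


def ks_statistic(baseline, current):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(baseline), sorted(current)
    gap = 0.0
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def js_divergence(baseline, current):
    """Jensen-Shannon divergence (log base 2, range 0 to 1) of category frequencies."""
    p_counts, q_counts = Counter(baseline), Counter(current)
    total = 0.0
    for key in set(p_counts) | set(q_counts):
        p = p_counts[key] / len(baseline)
        q = q_counts[key] / len(current)
        m = (p + q) / 2
        # Terms with zero probability contribute nothing to the sum.
        if p > 0:
            total += 0.5 * p * math.log2(p / m)
        if q > 0:
            total += 0.5 * q * math.log2(q / m)
    return total
```

Both functions return 0 for identical samples and 1 for samples with no overlap. The sample response later on this page compares a KS statistic against a threshold, so a caller would presumably flag a column when `ks_statistic(...)` exceeds the configured value.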

Configuration

POST /v1/quality/anomalies/configure

Request:
{
  "dataset": "analytics.sales.transactions",
  "detectors": [
    {
      "type": "distribution",
      "columns": ["amount", "currency"],
      "sensitivity": 0.05,
      "baselineWindow": "7d"
    }
  ]
}
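
A client could assemble this request as sketched below. `BASE_URL` and the helper name `build_configure_request` are assumptions for illustration; the endpoint path and field names come from the request above. The sketch only builds the request and does not send it.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # hypothetical host; replace with your deployment


def build_configure_request(dataset, columns, sensitivity=0.05, baseline_window="7d"):
    """Build (but do not send) the POST request for a distribution detector."""
    payload = {
        "dataset": dataset,
        "detectors": [
            {
                "type": "distribution",
                "columns": list(columns),
                "sensitivity": sensitivity,
                "baselineWindow": baseline_window,
            }
        ],
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/quality/anomalies/configure",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_configure_request("analytics.sales.transactions", ["amount", "currency"])
# urllib.request.urlopen(request) would send it; authentication is not covered here.
```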

Time-Series Anomaly Detection

The time-series detector tracks profile metrics across runs to detect unexpected spikes, drops, or pattern changes:

| Metric | Description |
|---|---|
| Row count per run | Detect volume anomalies |
| Null rate per column | Detect completeness degradation |
| Distinct value count | Detect cardinality changes |
| Mean / standard deviation | Detect value drift |

Detection Pipeline

Historical Profiles ──> STL Decomposition ──> Residual Analysis ──> Anomaly Score
                                                                        |
                                                                   Threshold ──> Alert
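
The residual-analysis and threshold stages can be illustrated with a simplified sketch. It does not perform STL decomposition: as a stand-in, it uses the median of a trailing window as the expected value and scores each point by its deviation in units of scaled median absolute deviation (MAD). The function names, the window of 7 runs, and the threshold of 3.5 are assumptions, not values from time_series.py.

```python
import statistics


def residual_scores(values, window=7):
    """Score each point by its distance from the median of the preceding window."""
    scores = []
    for i, value in enumerate(values):
        history = values[max(0, i - window):i]
        if len(history) < 3:
            scores.append(0.0)  # not enough history to judge
            continue
        baseline = statistics.median(history)
        # Median absolute deviation, scaled to approximate one standard deviation.
        mad = statistics.median(abs(h - baseline) for h in history) * 1.4826
        residual = abs(value - baseline)
        if mad > 0:
            scores.append(residual / mad)
        else:
            scores.append(0.0 if residual == 0 else float("inf"))
    return scores


def flag_anomalies(values, threshold=3.5, window=7):
    """Return the indices whose score exceeds the threshold."""
    return [i for i, s in enumerate(residual_scores(values, window)) if s > threshold]


row_counts = [1000, 1010, 990, 1005, 995, 1002, 400, 1001]
flagged = flag_anomalies(row_counts)  # flags index 6, the drop to 400 rows
```

A median-based baseline keeps a single outlier from distorting the expected value for the runs that follow it, which is why the run after the drop is not flagged.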

Freshness Monitoring

The freshness detector monitors data arrival times and alerts when data is stale:

POST /v1/quality/anomalies/freshness

Request:
{
  "dataset": "analytics.sales.transactions",
  "expectedFrequency": "1h",
  "maxDelay": "2h",
  "timestampColumn": "updated_at"
}
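
The staleness check itself reduces to comparing the newest `timestampColumn` value against `maxDelay`. The sketch below assumes durations use a number followed by `m`, `h`, or `d`, as in the examples on this page; the helper names are hypothetical, and how `expectedFrequency` enters the real check is not specified here, so it is left out.

```python
import re
from datetime import datetime, timedelta, timezone

_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text):
    """Parse strings such as '2h' or '7d' into a timedelta (assumed format)."""
    match = re.fullmatch(r"(\d+)([mhd])", text)
    if not match:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def is_stale(latest_timestamp, max_delay, now=None):
    """True when the newest timestamp is older than the allowed delay."""
    now = now or datetime.now(timezone.utc)
    return now - latest_timestamp > parse_duration(max_delay)
```

For example, with `maxDelay` set to `"2h"`, a dataset whose latest `updated_at` is 03:00 UTC is stale at 06:15 UTC, while one updated at 05:00 UTC is not.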

Anomaly Response

GET /v1/quality/anomalies?dataset=analytics.sales.transactions

Response:
{
  "anomalies": [
    {
      "id": "anom-123",
      "dataset": "analytics.sales.transactions",
      "type": "distribution_shift",
      "column": "amount",
      "severity": "warning",
      "score": 0.87,
      "description": "Distribution shift detected: KS statistic 0.15 exceeds threshold 0.05",
      "detectedAt": "2026-02-12T06:15:00Z",
      "baselineProfile": "prof-001",
      "currentProfile": "prof-002"
    }
  ]
}
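
A consumer might filter this response by severity as in the following sketch. The helper name is hypothetical and the sample body is abbreviated from the response above; only fields shown on this page are used.

```python
import json

SAMPLE_RESPONSE = """
{
  "anomalies": [
    {
      "id": "anom-123",
      "dataset": "analytics.sales.transactions",
      "type": "distribution_shift",
      "column": "amount",
      "severity": "warning",
      "score": 0.87
    }
  ]
}
"""


def anomalies_by_severity(response_text, severity):
    """Return the anomalies in a response body that match the given severity."""
    body = json.loads(response_text)
    return [a for a in body.get("anomalies", []) if a.get("severity") == severity]


warnings = anomalies_by_severity(SAMPLE_RESPONSE, "warning")
```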

Alert Integration

Anomalies trigger alerts through the notification pipeline:

| Channel | Configuration |
|---|---|
| Slack | Webhook URL per dataset owner |
| Email | Distribution list from dataset governance |
| PagerDuty | Critical anomalies only |
| Kafka | matih.quality.anomalies topic |
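
The routing rule implied by the table can be sketched as follows. The only rule stated above is that PagerDuty receives critical anomalies; that the other three channels receive every anomaly, and that the severity value is spelled `"critical"`, are assumptions made for illustration.

```python
def channels_for(anomaly):
    """Pick notification channels for an anomaly (illustrative routing only)."""
    # Assumption: Slack, email, and Kafka receive every anomaly.
    channels = ["slack", "email", "kafka"]
    # From the table: PagerDuty is reserved for critical anomalies.
    if anomaly.get("severity") == "critical":
        channels.append("pagerduty")
    return channels
```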
