Anomaly Detection
The anomaly detection subsystem identifies unexpected changes in data distributions, volumes, freshness, and statistical properties. It uses multiple detection algorithms to catch data quality issues before they propagate to downstream consumers.
Source: data-plane/data-quality-service/src/anomaly/
Detection Methods
| Method | Module | Description |
|---|---|---|
| Distribution shift | distribution.py | Detects changes in value distributions using KL divergence and KS tests |
| Time-series | time_series.py | Identifies anomalous metric values using STL decomposition and Prophet |
| Freshness | freshness.py | Monitors data arrival times against expected SLA windows |
| Advanced | advanced_detector.py | Multi-variate anomaly detection using isolation forests |
Distribution Anomaly Detection
Detects statistical shifts in column value distributions between profiling runs:
| Algorithm | Use Case | Sensitivity |
|---|---|---|
| Kolmogorov-Smirnov test | Continuous numeric columns | Medium |
| Chi-squared test | Categorical columns | Medium |
| KL divergence | Probability distribution shifts | High |
| Jensen-Shannon divergence | Symmetric distribution comparison | Medium |
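As a rough illustration of two of the comparisons in the table, the sketch below computes a two-sample Kolmogorov-Smirnov statistic and a Jensen-Shannon divergence using only the standard library. The function names `ks_statistic` and `js_divergence` are hypothetical and are not taken from `distribution.py`; the service's actual implementation and thresholds may differ.

```python
import math
from bisect import bisect_right
from collections import Counter


def ks_statistic(baseline, current):
    """Two-sample KS statistic: largest gap between the two empirical CDFs.

    Assumes both samples are non-empty sequences of numbers.
    """
    a, b = sorted(baseline), sorted(current)
    points = sorted(set(a) | set(b))
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in points
    )


def js_divergence(baseline, current):
    """Jensen-Shannon divergence (log base 2, so the result lies in [0, 1]).

    Inputs are non-empty sequences of categorical values; frequencies are
    estimated by simple counting.
    """
    p, q = Counter(baseline), Counter(current)
    n_p, n_q = sum(p.values()), sum(q.values())
    total = 0.0
    for key in set(p) | set(q):
        pi, qi = p[key] / n_p, q[key] / n_q
        mi = (pi + qi) / 2  # mixture distribution
        if pi > 0:
            total += 0.5 * pi * math.log2(pi / mi)
        if qi > 0:
            total += 0.5 * qi * math.log2(qi / mi)
    return total


# Hypothetical usage: compare a numeric and a categorical column across runs.
amount_shift = ks_statistic([10.0, 12.5, 11.0, 9.5], [30.0, 28.5, 31.0, 29.0])
currency_shift = js_divergence(["USD", "USD", "EUR"], ["USD", "EUR", "EUR"])
```

The KS statistic suits continuous columns because it compares cumulative distributions without binning, while the Jensen-Shannon divergence is symmetric and bounded, which makes scores comparable across columns.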
Configuration
POST /v1/quality/anomalies/configure
Request:
{
  "dataset": "analytics.sales.transactions",
  "detectors": [
    {
      "type": "distribution",
      "columns": ["amount", "currency"],
      "sensitivity": 0.05,
      "baselineWindow": "7d"
    }
  ]
}
Time-Series Anomaly Detection
Monitors metric trends over time to detect unexpected spikes, drops, or pattern changes:
| Metric | Description |
|---|---|
| Row count per run | Detect volume anomalies |
| Null rate per column | Detect completeness degradation |
| Distinct value count | Detect cardinality changes |
| Mean / standard deviation | Detect value drift |
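To make the idea of scoring a metric series concrete, here is a minimal sketch that flags outliers in a per-run metric such as row count. It is a simplified stand-in, not the service's method: instead of STL decomposition it subtracts a trailing median as the baseline, then scores the residuals with a robust z-score based on the median absolute deviation. The names `residual_scores` and `flag_anomalies`, the window of 7 runs, and the threshold of 3.5 are all assumptions.

```python
from statistics import median


def residual_scores(values, window=7):
    """Score each point by its distance from a trailing-median baseline."""
    residuals = []
    for i, v in enumerate(values):
        history = values[max(0, i - window):i]
        # With no history yet, treat the point as its own baseline.
        baseline = median(history) if history else v
        residuals.append(v - baseline)

    med = median(residuals)
    mad = median(abs(r - med) for r in residuals)
    # 1.4826 rescales MAD to match a standard deviation for normal data;
    # fall back to 1.0 when the residuals have no spread at all.
    scale = 1.4826 * mad or 1.0
    return [abs(r - med) / scale for r in residuals]


def flag_anomalies(values, threshold=3.5, window=7):
    """Return the indices of runs whose score exceeds the threshold."""
    scores = residual_scores(values, window)
    return [i for i, s in enumerate(scores) if s > threshold]


# Hypothetical row counts per profiling run; run 7 drops sharply.
row_counts = [1000, 1010, 990, 1005, 995, 1002, 998, 200, 1001]
flagged = flag_anomalies(row_counts)
```

A median-based baseline and scale keep a single extreme run from inflating the spread estimate, which would otherwise hide the very anomaly being scored.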
Detection Pipeline
Historical Profiles ──> STL Decomposition ──> Residual Analysis ──> Anomaly Score
                                                                          |
                                                                      Threshold ──> Alert
Freshness Monitoring
The freshness detector monitors data arrival times and alerts when data is stale:
POST /v1/quality/anomalies/freshness
Request:
{
  "dataset": "analytics.sales.transactions",
  "expectedFrequency": "1h",
  "maxDelay": "2h",
  "timestampColumn": "updated_at"
}
Anomaly Response
GET /v1/quality/anomalies?dataset=analytics.sales.transactions
Response:
{
  "anomalies": [
    {
      "id": "anom-123",
      "dataset": "analytics.sales.transactions",
      "type": "distribution_shift",
      "column": "amount",
      "severity": "warning",
      "score": 0.87,
      "description": "Distribution shift detected: KS statistic 0.15 exceeds threshold 0.05",
      "detectedAt": "2026-02-12T06:15:00Z",
      "baselineProfile": "prof-001",
      "currentProfile": "prof-002"
    }
  ]
}
Alert Integration
Anomalies trigger alerts through the notification pipeline:
| Channel | Configuration |
|---|---|
| Slack | Webhook URL per dataset owner |
| Email | Distribution list from dataset governance |
| PagerDuty | Critical anomalies only |
| Kafka | matih.quality.anomalies topic |
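The routing implied by the table could be sketched as follows. This is an illustration under stated assumptions, not the notification pipeline's actual logic: the function name `channels_for` is hypothetical, the distribution list is assumed to be an email channel, and Slack, email, and Kafka are assumed to receive every anomaly. Only the PagerDuty restriction to critical anomalies and the Kafka topic name come from the table.

```python
import json

KAFKA_TOPIC = "matih.quality.anomalies"  # topic name from the table above


def channels_for(anomaly):
    """Return the channel names one anomaly record would be sent to.

    Assumption: all channels except PagerDuty receive every anomaly.
    """
    channels = ["slack", "email", "kafka"]
    if anomaly.get("severity") == "critical":
        channels.append("pagerduty")
    return channels


def kafka_message(anomaly):
    """Serialize an anomaly record as a (topic, payload) pair for publishing."""
    return KAFKA_TOPIC, json.dumps(anomaly, sort_keys=True).encode("utf-8")


# Hypothetical usage with a record shaped like the anomaly response.
record = {"id": "anom-123", "severity": "warning", "column": "amount"}
targets = channels_for(record)
topic, payload = kafka_message(record)
```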
Related Pages
- Data Profiling -- Profile computation for baseline
- Quality Scoring -- Anomaly impact on scores
- Data Observability -- Unified observability