MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Data Profiling

Data profiling computes statistical summaries, distribution analysis, and schema metadata for datasets. The profiling engine supports in-process profiling for small to medium datasets, Spark-based distributed profiling for large tables, and a sampling mode for quick estimates on very large tables.

Source: data-plane/data-quality-service/src/profiling/engine.py


Profiling Engines

| Engine | Max Rows | Use Case |
| --- | --- | --- |
| In-process (pandas) | 1M rows | Small to medium datasets |
| Spark | Unlimited | Large datasets, Iceberg tables |
| Sampling | Configurable | Quick estimates for very large tables |
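
The page does not describe how the "auto" engine setting picks among these engines. The sketch below shows one plausible rule based only on the row limit in the table. The function name `choose_engine`, the returned labels, and the treatment of 1M as an inclusive limit are all assumptions made for illustration, not the service's actual logic.

```python
from typing import Optional

# In-process limit taken from the engine table (assumed inclusive).
PANDAS_MAX_ROWS = 1_000_000


def choose_engine(row_count: int, sample_size: Optional[int] = None) -> str:
    """Hypothetical 'auto' rule: prefer the cheapest engine that fits."""
    if row_count <= PANDAS_MAX_ROWS:
        # Small enough to profile in process.
        return "pandas"
    if sample_size is not None and sample_size < row_count:
        # Caller asked for a sample smaller than the table: estimate from it.
        return "sampling"
    # Large table with no usable sample size: distribute the work.
    return "spark"
```

Under this rule, a 1,250,000-row table with `sample_size=100000` would be profiled by sampling, and the same table without a sample size would go to Spark.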

Profile Output

A table profile contains the following sections:

Table-Level Statistics

| Metric | Description |
| --- | --- |
| row_count | Total number of rows |
| column_count | Number of columns |
| size_bytes | Estimated data size |
| freshness | Time since last update |
| duplicate_rows | Number of exact duplicate rows |
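
To make these metrics concrete, here is a minimal sketch that computes most of them for rows held in memory. It uses only the Python standard library rather than pandas, and it is not taken from `engine.py`. The function name `table_stats`, the `freshness_seconds` key, and the choice to count a duplicate as any row that repeats an earlier one are assumptions. `size_bytes` is left out because it depends on the storage format.

```python
from collections import Counter
from datetime import datetime, timezone


def table_stats(rows, last_updated, now=None):
    """rows: list of dicts with the same keys; last_updated: timezone-aware datetime."""
    now = now or datetime.now(timezone.utc)
    # Turn each row into a hashable, order-independent key and count repeats.
    counts = Counter(tuple(sorted(row.items())) for row in rows)
    # Every occurrence after the first counts as one duplicate row.
    duplicate_rows = sum(c - 1 for c in counts.values())
    columns = rows[0].keys() if rows else []
    return {
        "row_count": len(rows),
        "column_count": len(columns),
        "freshness_seconds": (now - last_updated).total_seconds(),
        "duplicate_rows": duplicate_rows,
    }
```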

Column-Level Statistics

| Metric | Applies To | Description |
| --- | --- | --- |
| null_count | All | Number of NULL values |
| null_rate | All | Percentage of NULL values |
| distinct_count | All | Number of unique values |
| min / max | Numeric, Date | Minimum and maximum values |
| mean / stddev | Numeric | Mean and standard deviation |
| median | Numeric | Median value (p50) |
| p25 / p75 / p95 / p99 | Numeric | Percentile values |
| min_length / max_length | String | Character length range |
| pattern_frequencies | String | Top regex patterns detected |
| value_frequencies | Categorical | Top value distribution |
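
As an illustration of the numeric metrics, the sketch below computes them for a single column with Python's `statistics` module. It is a stand-in for whatever the pandas and Spark engines do, not their implementation. Assumptions: `None` represents NULL, `distinct_count` ignores NULLs, `null_rate` is returned as a fraction between 0 and 1, `stddev` is the sample standard deviation, and percentiles use the inclusive interpolation method.

```python
import statistics


def numeric_column_stats(values):
    """values: list of numbers, with None standing in for NULL."""
    present = [v for v in values if v is not None]
    null_count = len(values) - len(present)
    stats = {
        "null_count": null_count,
        # Fraction of NULLs; multiply by 100 for a percentage.
        "null_rate": null_count / len(values) if values else 0.0,
        "distinct_count": len(set(present)),
    }
    if present:
        stats.update(
            min=min(present),
            max=max(present),
            mean=statistics.fmean(present),
            median=statistics.median(present),
        )
    if len(present) >= 2:
        stats["stddev"] = statistics.stdev(present)
        # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
        cuts = statistics.quantiles(present, n=100, method="inclusive")
        stats.update(p25=cuts[24], p75=cuts[74], p95=cuts[94], p99=cuts[98])
    return stats
```

Exact percentiles like these require the full column in memory. Distributed engines commonly use approximate quantile algorithms instead, so values from a Spark run may differ slightly.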

Profiling API

POST /v1/quality/profiles

Request:
{
  "dataset": "analytics.sales.transactions",
  "engine": "auto",
  "sampleSize": 100000,
  "columns": null,
  "includeDistributions": true
}

Response:
{
  "profileId": "prof-abc-123",
  "dataset": "analytics.sales.transactions",
  "rowCount": 1250000,
  "columnCount": 15,
  "columns": {
    "amount": {
      "dataType": "DOUBLE",
      "nullCount": 0,
      "nullRate": 0.0,
      "distinctCount": 48923,
      "min": 0.01,
      "max": 99999.99,
      "mean": 245.67,
      "stddev": 1023.45,
      "p50": 89.99,
      "p95": 1250.00,
      "p99": 5000.00
    }
  },
  "computedAt": "2026-02-12T10:30:00Z",
  "durationMs": 4523
}
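
A client for this endpoint might look like the following sketch, which builds the request with the standard library and extracts a few fields from a response body. The base URL, the function names, and the absence of authentication headers are assumptions. Only the path and the JSON field names come from the example above.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # hypothetical host; replace with your deployment


def build_profile_request(dataset, engine="auto", sample_size=None,
                          columns=None, include_distributions=True):
    """Build (but do not send) a POST request for /v1/quality/profiles."""
    payload = {
        "dataset": dataset,
        "engine": engine,
        "sampleSize": sample_size,
        "columns": columns,  # None serializes to null, as in the example request
        "includeDistributions": include_distributions,
    }
    return urllib.request.Request(
        BASE_URL + "/v1/quality/profiles",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_profile(response_body):
    """Map each column name to its (nullRate, p95) pair from a profile response."""
    profile = json.loads(response_body)
    return {
        name: (col.get("nullRate"), col.get("p95"))
        for name, col in profile["columns"].items()
    }
```

Sending the request is then a matter of passing it to `urllib.request.urlopen` and handing the decoded body to `summarize_profile`.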

Schema Inference

The profiling engine infers schemas from untyped data sources:

Source: data-plane/data-quality-service/src/profiling/schema_inference.py

| Source Type | Inference Method |
| --- | --- |
| CSV files | Sample rows, detect types by value patterns |
| JSON files | Parse structure, infer types from values |
| API responses | Analyze response payloads across multiple calls |
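
The CSV row of the table can be illustrated with a small sketch that samples rows and classifies each value by trying progressively looser parsers. This is not the code in `schema_inference.py`. The type names other than `DOUBLE`, the widening rules, and the default sample size are assumptions chosen for the example.

```python
import csv
import io
from datetime import datetime


def infer_value_type(text):
    """Classify one CSV cell by trying parsers from strictest to loosest."""
    if text == "":
        return "NULL"
    try:
        int(text)
        return "BIGINT"
    except ValueError:
        pass
    try:
        float(text)
        return "DOUBLE"
    except ValueError:
        pass
    if text.lower() in ("true", "false"):
        return "BOOLEAN"
    try:
        datetime.fromisoformat(text)
        return "TIMESTAMP"
    except ValueError:
        return "STRING"


def merge_types(types):
    """Pick one column type from the set of cell types seen in the sample."""
    types = types - {"NULL"}
    if not types:
        return "STRING"  # nothing but empty cells: fall back to the widest type
    if len(types) == 1:
        return next(iter(types))
    if types == {"BIGINT", "DOUBLE"}:
        return "DOUBLE"  # integers widen to floating point
    return "STRING"  # any other mix is only safe as text


def infer_csv_schema(csv_text, sample_rows=100):
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = {name: set() for name in (reader.fieldnames or [])}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for name in seen:
            seen[name].add(infer_value_type(row.get(name) or ""))
    return {name: merge_types(types) for name, types in seen.items()}
```

Because only a sample is inspected, a value outside the sample can contradict the inferred type, so a larger `sample_rows` trades speed for confidence.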

Scheduled Profiling

Profiles can be scheduled to run periodically for trend tracking:

POST /v1/quality/profiles/schedule

Request:
{
  "dataset": "analytics.sales.transactions",
  "schedule": "0 2 * * *",
  "retentionDays": 90
}
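
In standard five-field cron syntax, the expression `0 2 * * *` means minute 0 of hour 2 on every day, so this example runs daily at 02:00. The sketch below assembles the request body and performs a light sanity check first. It assumes the service accepts five-field cron expressions, as the example suggests. The function names and validation rules are illustrative, and the page does not state which time zone the schedule uses.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expr):
    """Split a five-field cron expression into named fields."""
    parts = expr.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 cron fields, got {len(parts)}: {expr!r}")
    return dict(zip(CRON_FIELDS, parts))


def build_schedule_payload(dataset, schedule, retention_days):
    """Return the JSON-serializable body for POST /v1/quality/profiles/schedule."""
    describe_cron(schedule)  # raises ValueError on a malformed expression
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return {
        "dataset": dataset,
        "schedule": schedule,
        "retentionDays": retention_days,
    }
```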

Profile Comparison

Compare two profiles to detect drift:

GET /v1/quality/profiles/compare?baseline=prof-001&current=prof-002

Returns column-level drift indicators for null rates, distributions, and schema changes.
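
The response format of the compare endpoint is not shown here, so the sketch below only illustrates the kind of checks involved, applied to two profile documents shaped like the profiling response above. The function name, the output keys, and the 0.05 null-rate threshold are assumptions. Distribution drift is omitted because it needs the full distribution data.

```python
def compare_profiles(baseline, current, null_rate_threshold=0.05):
    """Report schema changes and null-rate shifts between two profiles."""
    base_cols = baseline["columns"]
    cur_cols = current["columns"]
    drift = {
        "added_columns": sorted(set(cur_cols) - set(base_cols)),
        "removed_columns": sorted(set(base_cols) - set(cur_cols)),
        "type_changes": {},
        "null_rate_drift": {},
    }
    # Only columns present in both profiles can be compared value by value.
    for name in sorted(set(base_cols) & set(cur_cols)):
        b, c = base_cols[name], cur_cols[name]
        if b.get("dataType") != c.get("dataType"):
            drift["type_changes"][name] = (b.get("dataType"), c.get("dataType"))
        delta = c.get("nullRate", 0.0) - b.get("nullRate", 0.0)
        if abs(delta) > null_rate_threshold:
            drift["null_rate_drift"][name] = round(delta, 6)
    return drift
```

A fixed threshold is the simplest possible rule. With scheduled profiles retained over time, a threshold derived from the historical variation of each column would produce fewer false alarms.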

