Data Profiling

Data profiling computes statistical summaries, distribution analysis, and schema metadata for datasets. The profiling engine runs in-process (pandas) for small to medium datasets, uses Spark-based distributed profiling for large tables, and offers a sampling mode for quick estimates on very large tables.

Source: data-plane/data-quality-service/src/profiling/engine.py

Profiling Engines

Engine	Max Rows	Use Case
In-process (pandas)	1M rows	Small to medium datasets
Spark	Unlimited	Large datasets, Iceberg tables
Sampling	Configurable	Quick estimates for very large tables
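
The request example later on this page passes `"engine": "auto"`, but the selection logic is not documented here. The sketch below shows one way such a choice could follow from the table above. The function name, the returned engine labels, and the rule itself are assumptions for illustration, not the behavior of `engine.py`.

```python
from typing import Optional

# Row limit for the in-process engine, taken from the table above.
IN_PROCESS_MAX_ROWS = 1_000_000


def choose_engine(row_count: int, sample_size: Optional[int] = None) -> str:
    """Hypothetical resolution of engine="auto"; labels are illustrative."""
    if row_count <= IN_PROCESS_MAX_ROWS:
        # Small enough to load into pandas.
        return "in_process"
    if sample_size is not None and sample_size < row_count:
        # Caller asked for fewer rows than the table holds: estimate from a sample.
        return "sampling"
    # Large table and no usable sample size: profile the full table with Spark.
    return "spark"
```

Under these assumptions, a 1,250,000-row table requested with a sample size of 100,000 would resolve to the sampling engine.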

Profile Output

A table profile contains the following sections:

Table-Level Statistics

Metric	Description
`row_count`	Total number of rows
`column_count`	Number of columns
`size_bytes`	Estimated data size
`freshness`	Time since last update
`duplicate_rows`	Number of exact duplicate rows

Column-Level Statistics

Metric	Applies To	Description
`null_count`	All	Number of NULL values
`null_rate`	All	Percentage of NULL values
`distinct_count`	All	Number of unique values
`min` / `max`	Numeric, Date	Minimum and maximum values
`mean` / `stddev`	Numeric	Mean and standard deviation
`median`	Numeric	Median value (p50)
`p25` / `p75` / `p95` / `p99`	Numeric	Percentile values
`min_length` / `max_length`	String	Character length range
`pattern_frequencies`	String	Top regex patterns detected
`value_frequencies`	Categorical	Top value distribution
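
To make the numeric metrics concrete, here is a minimal standard-library sketch that computes a subset of them for a single column. It is not the engine's implementation: the function name is hypothetical, `None` stands in for NULL, `null_rate` is expressed as a fraction between 0 and 1 (the engine may report a percentage), and the sample standard deviation is used because the table does not say which form applies. Percentiles are left out.

```python
import statistics
from typing import Any, Dict, List, Optional


def profile_numeric_column(values: List[Optional[float]]) -> Dict[str, Any]:
    """Compute a subset of the column-level metrics for one numeric column."""
    total = len(values)
    present = [v for v in values if v is not None]  # None models NULL
    null_count = total - len(present)

    stats: Dict[str, Any] = {
        "null_count": null_count,
        # Assumption: a fraction in [0, 1] rather than a percentage.
        "null_rate": null_count / total if total else 0.0,
        "distinct_count": len(set(present)),
    }
    if present:
        stats["min"] = min(present)
        stats["max"] = max(present)
        stats["mean"] = statistics.fmean(present)
        stats["median"] = statistics.median(present)
    if len(present) >= 2:
        # Sample standard deviation; needs at least two non-null values.
        stats["stddev"] = statistics.stdev(present)
    return stats
```

For example, `profile_numeric_column([1.0, 2.0, None, 2.0, 5.0])` reports one NULL, three distinct values, and a median of 2.0.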

Profiling API

POST /v1/quality/profiles

Request:
{
  "dataset": "analytics.sales.transactions",
  "engine": "auto",
  "sampleSize": 100000,
  "columns": null,
  "includeDistributions": true
}

Response:
{
  "profileId": "prof-abc-123",
  "dataset": "analytics.sales.transactions",
  "rowCount": 1250000,
  "columnCount": 15,
  "columns": {
    "amount": {
      "dataType": "DOUBLE",
      "nullCount": 0,
      "nullRate": 0.0,
      "distinctCount": 48923,
      "min": 0.01,
      "max": 99999.99,
      "mean": 245.67,
      "stddev": 1023.45,
      "p50": 89.99,
      "p95": 1250.00,
      "p99": 5000.00
    }
  },
  "computedAt": "2026-02-12T10:30:00Z",
  "durationMs": 4523
}
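
A client call for this endpoint could be assembled as in the sketch below, which uses only the Python standard library. The host name is a placeholder, the helper name is hypothetical, and authentication is omitted because this page does not describe it.

```python
import json
import urllib.request
from typing import List, Optional

BASE_URL = "https://quality.example.com"  # placeholder host, not a real endpoint


def build_profile_request(
    dataset: str,
    engine: str = "auto",
    sample_size: Optional[int] = None,
    columns: Optional[List[str]] = None,
    include_distributions: bool = True,
) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/quality/profiles request."""
    payload = {
        "dataset": dataset,
        "engine": engine,
        "sampleSize": sample_size,
        "columns": columns,  # serialized as null when None, as in the example request
        "includeDistributions": include_distributions,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/quality/profiles",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The response body can then be parsed with `json.loads` and read by the field names shown above, such as `profileId` and `columns`.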

Schema Inference

The profiling engine infers schemas from untyped data sources:

Source: data-plane/data-quality-service/src/profiling/schema_inference.py

Source Type	Inference Method
CSV files	Sample rows, detect types by value patterns
JSON files	Parse structure, infer types from values
API responses	Analyze response payloads across multiple calls
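
The CSV row in the table above ("detect types by value patterns") can be illustrated with a small sketch. It is an assumption about the general approach, not the contents of `schema_inference.py`. The function names, the type labels other than `DOUBLE`, and the widening rules are all hypothetical.

```python
from datetime import datetime
from typing import Iterable


def infer_value_type(value: str) -> str:
    """Classify one raw CSV cell by trying progressively looser parsers."""
    text = value.strip()
    if text == "":
        return "NULL"
    if text.lower() in ("true", "false"):
        return "BOOLEAN"
    try:
        int(text)
        return "BIGINT"
    except ValueError:
        pass
    try:
        float(text)  # note: also accepts "nan" and "inf"
        return "DOUBLE"
    except ValueError:
        pass
    try:
        datetime.fromisoformat(text)
        return "TIMESTAMP"
    except ValueError:
        return "STRING"


def infer_column_type(samples: Iterable[str]) -> str:
    """Combine per-value guesses into one column type, ignoring empty cells."""
    seen = {infer_value_type(s) for s in samples} - {"NULL"}
    if not seen:
        return "STRING"  # no evidence; fall back to the widest type
    if len(seen) == 1:
        return next(iter(seen))
    if seen == {"BIGINT", "DOUBLE"}:
        return "DOUBLE"  # integers widen to floating point
    return "STRING"  # any other mix is treated as text
```

Sampling more rows makes the guess more reliable, since a single non-conforming value widens the inferred type.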

Scheduled Profiling

Profiles can be scheduled to run periodically for trend tracking:

POST /v1/quality/profiles/schedule

Request:
{
  "dataset": "analytics.sales.transactions",
  "schedule": "0 2 * * *",
  "retentionDays": 90
}
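
The `schedule` value in the request above reads as a standard five-field cron expression, in which `0 2 * * *` means daily at 02:00. That reading is an assumption, since this page does not name the format. Under that assumption, a client could check the payload before sending it, as in this sketch. The helper name and the validation rules are hypothetical.

```python
import json


def build_schedule_payload(dataset: str, schedule: str, retention_days: int) -> str:
    """Validate the inputs and return the JSON body for the schedule request."""
    # Assumption: five cron fields (minute, hour, day of month, month, day of week).
    if len(schedule.split()) != 5:
        raise ValueError(f"expected a five-field cron expression, got {schedule!r}")
    if retention_days <= 0:
        raise ValueError("retention_days must be positive")
    return json.dumps(
        {
            "dataset": dataset,
            "schedule": schedule,
            "retentionDays": retention_days,
        }
    )
```

The check only counts fields. It does not verify that each field holds a valid cron value.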

Profile Comparison

Compare two profiles to detect drift:

GET /v1/quality/profiles/compare?baseline=prof-001&current=prof-002

Returns column-level drift indicators for null rates, distributions, and schema changes.
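
The shape of the comparison response is not shown on this page. As a rough illustration of what column-level drift detection involves, the sketch below compares two profile documents in the response format shown under Profiling API. The flag names, the function name, and the null-rate threshold are hypothetical, and distribution drift is not covered.

```python
from typing import Any, Dict, List

NULL_RATE_THRESHOLD = 0.05  # hypothetical tolerance for null-rate change


def compare_profiles(
    baseline: Dict[str, Any],
    current: Dict[str, Any],
    null_rate_threshold: float = NULL_RATE_THRESHOLD,
) -> Dict[str, List[str]]:
    """Return drift flags per column; columns without drift are omitted."""
    base_cols = baseline["columns"]
    curr_cols = current["columns"]
    drift: Dict[str, List[str]] = {}

    for name in sorted(set(base_cols) | set(curr_cols)):
        # Schema changes: a column present in only one of the two profiles.
        if name not in curr_cols:
            drift[name] = ["column_removed"]
            continue
        if name not in base_cols:
            drift[name] = ["column_added"]
            continue

        flags: List[str] = []
        base, curr = base_cols[name], curr_cols[name]
        if base.get("dataType") != curr.get("dataType"):
            flags.append("type_changed")
        delta = abs(curr.get("nullRate", 0.0) - base.get("nullRate", 0.0))
        if delta > null_rate_threshold:
            flags.append("null_rate_drift")
        if flags:
            drift[name] = flags
    return drift
```

A distribution check could be added by comparing fields such as `mean`, `stddev`, or the percentiles in the same loop.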

See Also

Anomaly Detection -- Detecting anomalies from profile trends
Validation Rules -- Auto-generating rules from profiles
Quality Scoring -- Profile-based scoring
