Data Profiling
Data profiling computes statistical summaries, distribution analysis, and schema metadata for datasets. The profiling engine supports both in-process profiling for small datasets and Spark-based distributed profiling for large tables.
Source: data-plane/data-quality-service/src/profiling/engine.py
Profiling Engines
| Engine | Max Rows | Use Case |
|---|---|---|
| In-process (pandas) | 1M rows | Small to medium datasets |
| Spark | Unlimited | Large datasets, Iceberg tables |
| Sampling | Configurable | Quick estimates for very large tables |
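The `"auto"` engine choice can be sketched as a size-based dispatch. This is an illustrative sketch only: the threshold constant and function name are assumptions, not the service's actual configuration.

```python
# Hypothetical engine selection for the "auto" mode. The 1M-row cutoff
# mirrors the in-process limit in the table above; real selection may
# also consider table format (e.g. Iceberg) and available memory.
IN_PROCESS_MAX_ROWS = 1_000_000

def select_engine(row_count: int, requested: str = "auto") -> str:
    """Pick a profiling engine based on dataset size."""
    if requested != "auto":
        return requested  # caller explicitly chose an engine
    if row_count <= IN_PROCESS_MAX_ROWS:
        return "pandas"
    return "spark"
```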
Profile Output
A table profile contains the following sections:
Table-Level Statistics
| Metric | Description |
|---|---|
| row_count | Total number of rows |
| column_count | Number of columns |
| size_bytes | Estimated data size |
| freshness | Time since last update |
| duplicate_rows | Number of exact duplicate rows |
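For the in-process engine, the table-level metrics above map directly onto pandas operations. A minimal sketch (the function name is hypothetical; `freshness` is omitted because it requires a last-modified timestamp from the catalog rather than the data itself):

```python
import pandas as pd

def table_stats(df: pd.DataFrame) -> dict:
    """Compute table-level profile metrics for an in-memory dataset."""
    return {
        "row_count": len(df),
        "column_count": df.shape[1],
        # deep=True includes the actual string payloads, not just pointers
        "size_bytes": int(df.memory_usage(deep=True).sum()),
        # duplicated() marks every repeat after the first occurrence
        "duplicate_rows": int(df.duplicated().sum()),
    }
```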
Column-Level Statistics
| Metric | Applies To | Description |
|---|---|---|
| null_count | All | Number of NULL values |
| null_rate | All | Percentage of NULL values |
| distinct_count | All | Number of unique values |
| min / max | Numeric, Date | Minimum and maximum values |
| mean / stddev | Numeric | Mean and standard deviation |
| median | Numeric | Median value (p50) |
| p25 / p75 / p95 / p99 | Numeric | Percentile values |
| min_length / max_length | String | Character length range |
| pattern_frequencies | String | Top regex patterns detected |
| value_frequencies | Categorical | Top value distribution |
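For a numeric column, the statistics above reduce to a handful of pandas aggregations. A sketch covering the numeric subset (function name assumed; the percentile keys match the API response fields shown below):

```python
import pandas as pd

def numeric_column_stats(s: pd.Series) -> dict:
    """Column-level metrics for a numeric column (illustrative subset)."""
    return {
        "null_count": int(s.isna().sum()),
        "null_rate": float(s.isna().mean()),
        "distinct_count": int(s.nunique()),   # excludes NULLs
        "min": float(s.min()),
        "max": float(s.max()),
        "mean": float(s.mean()),
        "stddev": float(s.std()),
        # quantile() ignores NULLs, so p50 is the median of present values
        "p50": float(s.quantile(0.50)),
        "p95": float(s.quantile(0.95)),
        "p99": float(s.quantile(0.99)),
    }
```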
Profiling API
POST /v1/quality/profiles
Request:
```json
{
  "dataset": "analytics.sales.transactions",
  "engine": "auto",
  "sampleSize": 100000,
  "columns": null,
  "includeDistributions": true
}
```
Response:
```json
{
  "profileId": "prof-abc-123",
  "dataset": "analytics.sales.transactions",
  "rowCount": 1250000,
  "columnCount": 15,
  "columns": {
    "amount": {
      "dataType": "DOUBLE",
      "nullCount": 0,
      "nullRate": 0.0,
      "distinctCount": 48923,
      "min": 0.01,
      "max": 99999.99,
      "mean": 245.67,
      "stddev": 1023.45,
      "p50": 89.99,
      "p95": 1250.00,
      "p99": 5000.00
    }
  },
  "computedAt": "2026-02-12T10:30:00Z",
  "durationMs": 4523
}
```

Schema Inference
The profiling engine infers schemas from untyped data sources:
Source: data-plane/data-quality-service/src/profiling/schema_inference.py
| Source Type | Inference Method |
|---|---|
| CSV files | Sample rows, detect types by value patterns |
| JSON files | Parse structure, infer types from values |
| API responses | Analyze response payloads across multiple calls |
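The CSV path ("detect types by value patterns") can be sketched as sampling rows and widening each column's type whenever a value no longer fits. This is a simplified illustration, not the service's actual inference code: real inference would also handle dates, booleans, and null markers.

```python
import csv
import io

def _value_type(value: str) -> str:
    """Classify a single cell value as INT, DOUBLE, or STRING."""
    try:
        int(value)
        return "INT"
    except ValueError:
        pass
    try:
        float(value)
        return "DOUBLE"
    except ValueError:
        return "STRING"

def infer_csv_types(text: str, sample_rows: int = 100) -> dict:
    """Infer column types from a sample of CSV rows."""
    order = ["INT", "DOUBLE", "STRING"]  # widening direction
    reader = csv.DictReader(io.StringIO(text))
    types: dict = {}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for col, value in row.items():
            inferred = _value_type(value)
            current = types.get(col, "INT")
            # Widen toward STRING when a value breaks the current guess;
            # never narrow (one "abc" makes the whole column STRING).
            if order.index(inferred) > order.index(current):
                current = inferred
            types[col] = current
    return types
```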
Scheduled Profiling
Profiles can be scheduled to run periodically for trend tracking:
POST /v1/quality/profiles/schedule
Request:
```json
{
  "dataset": "analytics.sales.transactions",
  "schedule": "0 2 * * *",
  "retentionDays": 90
}
```

Profile Comparison
Compare two profiles to detect drift:
GET /v1/quality/profiles/compare?baseline=prof-001&current=prof-002
Returns column-level drift indicators for null rates, distributions, and schema changes.
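One of the drift indicators (null rate) can be sketched as a simple threshold check between two profiles. The function name and the 5% threshold are illustrative assumptions; the input dicts follow the per-column shape of the profile response above.

```python
def null_rate_drift(baseline: dict, current: dict,
                    threshold: float = 0.05) -> dict:
    """Flag columns whose null rate moved more than `threshold` between
    two profiles. Returns column -> (baseline, current) null rates."""
    drifted = {}
    for col, base_stats in baseline.items():
        cur_stats = current.get(col)
        if cur_stats is None:
            continue  # column removed: a schema change, reported separately
        delta = abs(cur_stats["nullRate"] - base_stats["nullRate"])
        if delta > threshold:
            drifted[col] = (base_stats["nullRate"], cur_stats["nullRate"])
    return drifted
```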
Related Pages
- Anomaly Detection -- Detecting anomalies from profile trends
- Validation Rules -- Auto-generating rules from profiles
- Quality Scoring -- Profile-based scoring