Validation Rules
The validation rule engine evaluates data against configurable rules organized by quality dimension. Rules are defined in YAML or via the API, stored in PostgreSQL, and executed by the rule engine during pipeline quality gates.
Source: data-plane/data-quality-service/src/validation/rule_engine.py
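The flow described above (look up the active rules, run each one, collect a per-rule result) can be illustrated with a small dispatch table. This is a minimal sketch, not the contents of `rule_engine.py`: the `Rule` dataclass, the `CHECKERS` registry, the `register` decorator, and `run_rules` are hypothetical names, and only the result field names are borrowed from the validation response shown later on this page.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Row = dict[str, Any]
# A checker receives the rows and a rule's config and returns the failing rows.
Checker = Callable[[list[Row], dict[str, Any]], list[Row]]

# Hypothetical registry mapping a ruleType string to its checker function.
CHECKERS: dict[str, Checker] = {}


def register(rule_type: str) -> Callable[[Checker], Checker]:
    """Decorator that records a checker under the given rule type."""
    def wrap(fn: Checker) -> Checker:
        CHECKERS[rule_type] = fn
        return fn
    return wrap


@dataclass
class Rule:
    """Hypothetical in-memory form of a stored rule."""
    name: str
    rule_type: str
    severity: str
    config: dict[str, Any] = field(default_factory=dict)
    status: str = "active"


@register("null_check")
def null_check(rows: list[Row], config: dict[str, Any]) -> list[Row]:
    column = config["column"]
    return [r for r in rows if r.get(column) is None]


def run_rules(rules: list[Rule], rows: list[Row]) -> list[dict[str, Any]]:
    """Evaluate every active rule and summarize failures per rule."""
    results = []
    for rule in rules:
        if rule.status != "active":
            continue  # inactive rules are skipped, not reported
        failed = CHECKERS[rule.rule_type](rows, rule.config)
        results.append({
            "ruleName": rule.name,
            "status": "FAILED" if failed else "PASSED",
            "severity": rule.severity,
            "failedRows": len(failed),
            "totalRows": len(rows),
        })
    return results
```

A registry keyed by rule type keeps each check independent, so a new rule type in this sketch only needs one decorated function.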
Rule Types
Completeness Rules
| Rule Type | Description | Example |
|---|---|---|
| null_check | Column must not contain NULL values | amount IS NOT NULL |
| not_empty | String column must not be empty | name != '' |
| required_columns | All specified columns must exist in the dataset | [id, name, email] |
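To make the three completeness checks concrete, here is a minimal sketch over rows held as plain dictionaries. The function signatures are illustrative assumptions, not the service's API. Each function returns the offending row indexes or column names rather than a boolean.

```python
from typing import Any

Row = dict[str, Any]


def null_check(rows: list[Row], column: str) -> list[int]:
    """Return indexes of rows where `column` is missing or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def not_empty(rows: list[Row], column: str) -> list[int]:
    """Return indexes of rows where `column` holds the empty string."""
    return [i for i, row in enumerate(rows) if row.get(column) == ""]


def required_columns(present: set[str], required: list[str]) -> list[str]:
    """Return the required column names absent from the dataset schema."""
    return [name for name in required if name not in present]
```

For example, `required_columns({"id", "name"}, ["id", "name", "email"])` returns `["email"]`.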
Accuracy Rules
| Rule Type | Description | Example |
|---|---|---|
| range_check | Numeric value within min/max bounds | amount BETWEEN 0 AND 1000000 |
| regex_pattern | String matches a regular expression | email ~ '^[^@]+@[^@]+\.[^@]+$' |
| enum_values | Value is one of an allowed set | status IN ('active', 'inactive') |
| data_type | Column matches expected data type | created_at IS TIMESTAMP |
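The value-level accuracy checks can be sketched as simple predicates. This is an illustration under stated assumptions: bounds are inclusive (matching SQL `BETWEEN`), a `None` bound is treated as open (as with `"max": null` in the rule definition example on this page), and a `None` value fails the check. The function signatures are hypothetical.

```python
import re
from typing import Any, Collection, Optional


def range_check(
    value: Optional[float],
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
) -> bool:
    """True when value lies within the inclusive bounds; a None bound is open."""
    if value is None:
        return False  # assumption: NULL values fail a range check
    if min_value is not None and value < min_value:
        return False
    if max_value is not None and value > max_value:
        return False
    return True


def regex_pattern(value: Any, pattern: str) -> bool:
    """True when value is a string that matches the whole pattern."""
    return isinstance(value, str) and re.fullmatch(pattern, value) is not None


def enum_values(value: Any, allowed: Collection[Any]) -> bool:
    """True when value is a member of the allowed set."""
    return value in allowed
```

Using `re.fullmatch` rather than `re.search` means the pattern must cover the entire string, which is usually what a format rule intends.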
Consistency Rules
| Rule Type | Description | Example |
|---|---|---|
| referential_integrity | Foreign key exists in reference table | customer_id IN (SELECT id FROM customers) |
| cross_field | Relationship between columns holds | end_date >= start_date |
| aggregate_check | Aggregate value meets threshold | SUM(amount) > 0 |
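The consistency checks differ from the single-column checks in that they compare a row against other data: a reference table, a sibling column, or the whole dataset. A minimal sketch follows, with hypothetical function names and with rows as dictionaries. It assumes the compared fields are present and comparable, and it treats a missing foreign key as a violation.

```python
from typing import Any, Iterable

Row = dict[str, Any]


def referential_integrity(
    rows: list[Row], column: str, reference_keys: Iterable[Any]
) -> list[Row]:
    """Return rows whose foreign key is absent from the reference keys."""
    known = set(reference_keys)
    return [r for r in rows if r.get(column) not in known]


def cross_field(
    rows: list[Row], earlier: str = "start_date", later: str = "end_date"
) -> list[Row]:
    """Return rows that violate later >= earlier."""
    return [r for r in rows if r[later] < r[earlier]]


def aggregate_check(rows: list[Row], column: str, threshold: float = 0.0) -> bool:
    """True when SUM(column) is strictly greater than threshold."""
    return sum(r[column] for r in rows) > threshold
```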
Uniqueness Rules
| Rule Type | Description | Example |
|---|---|---|
| uniqueness | Column values are unique | COUNT(DISTINCT email) = COUNT(*) |
| primary_key | Composite key is unique | UNIQUE(tenant_id, entity_id) |
| duplicate_check | No duplicate rows by key | Fuzzy dedup by similarity |
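Single-column uniqueness and composite-key uniqueness reduce to the same operation: count occurrences of each key tuple and report any that appear more than once. The sketch below shows exact-match counting only. It does not attempt the similarity-based deduplication mentioned for `duplicate_check`, and the function names are hypothetical.

```python
from collections import Counter
from typing import Any

Row = dict[str, Any]


def duplicate_keys(rows: list[Row], key_columns: list[str]) -> dict[tuple, int]:
    """Return each key tuple that occurs more than once, with its count."""
    counts = Counter(tuple(r.get(c) for c in key_columns) for r in rows)
    return {key: n for key, n in counts.items() if n > 1}


def is_unique(rows: list[Row], key_columns: list[str]) -> bool:
    """True when no key tuple repeats across the rows."""
    return not duplicate_keys(rows, key_columns)
```

Passing one column name covers the `uniqueness` case, and passing several (for example `["tenant_id", "entity_id"]`) covers the composite-key case.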
Timeliness Rules
| Rule Type | Description | Example |
|---|---|---|
| freshness | Data is recent relative to SLA | MAX(updated_at) > NOW() - INTERVAL '24 hours' |
| timestamp_valid | Timestamps are within valid range | created_at <= NOW() |
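Both timeliness checks compare a timestamp against the current time, so a sketch mostly needs to be careful about time zones. The version below assumes timezone-aware `datetime` values (comparing naive and aware values raises `TypeError` in Python) and accepts an optional `now` argument so the check can be tested deterministically. The function signatures are illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness(
    latest_update: datetime, max_age: timedelta, now: Optional[datetime] = None
) -> bool:
    """True when the newest record is younger than max_age."""
    if now is None:
        now = datetime.now(timezone.utc)
    return latest_update > now - max_age


def timestamp_valid(ts: datetime, now: Optional[datetime] = None) -> bool:
    """True when the timestamp is not in the future."""
    if now is None:
        now = datetime.now(timezone.utc)
    return ts <= now
```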
Custom Rules
| Rule Type | Description |
|---|---|
| sql_expression | Arbitrary SQL expression evaluated against the dataset |
| python_function | Custom Python validation function |
| great_expectations | Great Expectations expectation suite |
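This page does not specify the interface a `python_function` rule must implement, so the following is only a guess at its shape: a per-row predicate that returns `True` for valid rows, plus a helper that collects the rows that fail. Both `amount_has_cent_precision` and `failing_rows` are hypothetical names.

```python
from decimal import Decimal, InvalidOperation
from typing import Any, Callable

Row = dict[str, Any]


def amount_has_cent_precision(row: Row) -> bool:
    """Hypothetical custom rule: amount is a finite decimal with at most
    two fractional digits."""
    try:
        amount = Decimal(str(row["amount"]))
    except (KeyError, InvalidOperation):
        return False  # missing or unparseable amounts fail the rule
    return amount.is_finite() and amount.as_tuple().exponent >= -2


def failing_rows(rows: list[Row], check: Callable[[Row], bool]) -> list[Row]:
    """Return the rows for which the custom check returns False."""
    return [r for r in rows if not check(r)]
```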
Severity Levels
| Severity | Behavior |
|---|---|
| critical | Blocks the pipeline, requires immediate action |
| warning | Logged and alerted, does not block execution |
| info | Logged only, informational |
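The severity table implies a simple decision at a quality gate: any failed `critical` rule blocks the pipeline, failed `warning` rules raise alerts, and `info` failures are only logged. A minimal sketch of that decision follows. The `gate_decision` function and its return shape are assumptions. The input uses the `status`, `severity`, and `ruleName` fields from the validation response documented on this page.

```python
from typing import Any


def gate_decision(results: list[dict[str, Any]]) -> dict[str, Any]:
    """Classify failed rule results by severity and decide whether to block."""
    failed = [r for r in results if r["status"] == "FAILED"]
    blocking = [r["ruleName"] for r in failed if r["severity"] == "critical"]
    alerts = [r["ruleName"] for r in failed if r["severity"] == "warning"]
    logged = [r["ruleName"] for r in failed if r["severity"] == "info"]
    return {
        "blocked": bool(blocking),  # any critical failure blocks the pipeline
        "blocking": blocking,
        "alerts": alerts,
        "logged": logged,
    }
```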
Rule Definition
POST /v1/quality/rules
Request:
{
"name": "amount_positive",
"description": "Transaction amount must be positive",
"ruleType": "range_check",
"severity": "critical",
"status": "active",
"config": {
"column": "amount",
"min": 0.01,
"max": null
},
"datasets": ["analytics.sales.transactions"],
"tags": ["financial", "critical"]
}
Rule Execution
The rule engine evaluates all active rules for a dataset and returns a validation report:
POST /v1/quality/validate
Request:
{
"dataset": "analytics.sales.transactions",
"ruleIds": null,
"sampleSize": 10000
}
Response:
{
"dataset": "analytics.sales.transactions",
"totalRules": 12,
"passed": 10,
"failed": 2,
"results": [
{
"ruleId": "rule-123",
"ruleName": "amount_positive",
"status": "FAILED",
"severity": "critical",
"failedRows": 42,
"totalRows": 125000,
"failRate": 0.000336
}
]
}
Related Pages
- Quality Scoring -- Score computation from rule results
- Data Profiling -- Auto-generate rules from profiles
- Pipeline Service -- Pipeline quality gates