Tutorial: Data Quality Exploration
In this tutorial, you will use the MATIH Data Workbench to profile your data, detect anomalies, define quality rules, and monitor data health over time. Data quality is the foundation of reliable analytics and machine learning: poor data leads to poor insights.
What You Will Learn
- How to profile a dataset and understand its statistical properties
- How to detect anomalies such as null values, outliers, and distribution shifts
- How to define data quality rules (expectations) for ongoing monitoring
- How to view quality scores and trend reports
- How to set up alerts for data quality violations
Prerequisites
| Requirement | How to Verify |
|---|---|
| MATIH platform running | `./scripts/tools/platform-status.sh` returns healthy |
| Data quality service operational | Health check on data-quality-service passes |
| Sample data loaded | The retail analytics tables are available |
Step 1: Open the Data Workbench
Navigate to the Data Workbench:
- Local development: http://localhost:3002
- Cloud deployment: https://data.{your-tenant}.matih.ai
Log in with your tenant credentials. The Data Workbench shows:
| Section | Description |
|---|---|
| Data Sources | Connected databases and their schemas |
| Profiles | Data profiling results |
| Quality Rules | Defined quality expectations |
| Monitors | Quality monitoring dashboards |
| Lineage | Data lineage and dependency graphs |
Step 2: Profile a Dataset
Data profiling automatically computes statistics for every column in a table.
- Click Data Sources in the sidebar.
- Expand your data source and select the `orders` table.
- Click Profile Table.
The profiler analyzes the table and produces a comprehensive report.
Profile Report: orders Table
Table-Level Statistics:
| Metric | Value |
|---|---|
| Row count | 50,000 |
| Column count | 8 |
| Estimated size | 12.4 MB |
| Last updated | 2026-02-11 23:45:00 |
Column-Level Statistics:
| Column | Type | Non-null | Null % | Unique | Min | Max | Mean | Std Dev |
|---|---|---|---|---|---|---|---|---|
| id | INT | 50,000 | 0.0% | 50,000 | 1 | 50,000 | -- | -- |
| customer_id | INT | 50,000 | 0.0% | 4,823 | 1 | 5,000 | -- | -- |
| product_id | INT | 49,987 | 0.03% | 498 | 1 | 500 | -- | -- |
| order_date | DATE | 50,000 | 0.0% | 365 | 2025-02-12 | 2026-02-11 | -- | -- |
| total_amount | DECIMAL | 49,950 | 0.1% | 8,432 | 1.99 | 2,499.99 | 295.42 | 187.63 |
| status | VARCHAR | 50,000 | 0.0% | 4 | -- | -- | -- | -- |
| shipping_city | VARCHAR | 49,800 | 0.4% | 342 | -- | -- | -- | -- |
| created_at | TIMESTAMP | 50,000 | 0.0% | 49,876 | -- | -- | -- | -- |
Value Distribution for status:
| Value | Count | Percentage |
|---|---|---|
| COMPLETED | 42,000 | 84.0% |
| SHIPPED | 4,500 | 9.0% |
| PENDING | 2,500 | 5.0% |
| CANCELLED | 1,000 | 2.0% |
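Under the hood, the column-level statistics above reduce to simple aggregate queries. The following minimal sketch shows that computation using Python's built-in sqlite3 module against a toy three-row table; the table contents and the `profile_column` helper are illustrative, not the platform's actual implementation:

```python
import sqlite3

# Toy stand-in for the orders table (illustrative data, not the sample dataset).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total_amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 19.99, "COMPLETED"), (2, None, "PENDING"), (3, 49.50, "SHIPPED")],
)

def profile_column(conn, table, column):
    """Per-column stats like those in the profile report: non-null count,
    null percentage, and distinct (non-null) value count."""
    row_count, non_null, unique = conn.execute(
        f"SELECT COUNT(*), COUNT({column}), COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()
    return {
        "non_null": non_null,
        "null_pct": 100.0 * (row_count - non_null) / row_count,
        "unique": unique,
    }

stats = profile_column(conn, "orders", "total_amount")
# One of the three rows has a NULL amount, so null_pct is about 33.3%.
```

Note that SQL's `COUNT(column)` skips NULLs while `COUNT(*)` does not, which is exactly the distinction the Non-null and Null % columns rely on.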
Step 3: Detect Anomalies
The profiler automatically flags potential data quality issues.
Anomaly Detection Results
| Column | Anomaly Type | Description | Severity |
|---|---|---|---|
| product_id | Null values | 13 null values (0.03%) | Low |
| total_amount | Null values | 50 null values (0.1%) | Medium |
| total_amount | Outliers | 23 values above 3 standard deviations | Low |
| shipping_city | Null values | 200 null values (0.4%) | Medium |
| shipping_city | Cardinality | High cardinality (342 unique values) | Info |
Outlier Analysis for total_amount
The profiler shows a histogram of the total_amount distribution with outliers highlighted:
| Statistic | Value |
|---|---|
| Mean | $295.42 |
| Median | $245.00 |
| Std Dev | $187.63 |
| IQR | $210.00 |
| Lower fence (Q1 - 1.5 x IQR) | -$125.00 |
| Upper fence (Q3 + 1.5 x IQR) | $715.00 |
| Values above upper fence | 1,247 (2.5%) |
| Values above 3 std dev ($857.31) | 23 (0.05%) |
Click on any anomaly to drill into the affected rows.
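The fence calculation above is the standard Tukey IQR rule. Here is a small Python sketch of it over a toy list of order amounts; the data and the `outlier_fences` helper are illustrative, not the profiler's internals:

```python
def outlier_fences(values):
    """Tukey fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers."""
    xs = sorted(values)

    def quantile(p):
        # Linear interpolation between the closest ranks.
        k = p * (len(xs) - 1)
        lo = int(k)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (k - lo) * (xs[hi] - xs[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

amounts = [120, 180, 245, 310, 400, 2400]  # one extreme order total
lower, upper = outlier_fences(amounts)
outliers = [a for a in amounts if a < lower or a > upper]  # [2400]
```

As in the report, the lower fence can go negative for right-skewed amounts, in which case no low-side outliers are possible.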
Step 4: Define Quality Rules
Quality rules (also called expectations) define the constraints your data must satisfy. When data violates a rule, it is flagged for investigation.
Create Rules for the orders Table
- Click Quality Rules in the sidebar.
- Click New Rule Set.
- Name: `orders_quality_rules`.
- Select table: `orders`.
Add the following rules:
Rule 1: No null order amounts
| Setting | Value |
|---|---|
| Column | total_amount |
| Rule type | Not Null |
| Severity | Critical |
| Description | Every order must have a total amount |
Rule 2: Positive order amounts
| Setting | Value |
|---|---|
| Column | total_amount |
| Rule type | Greater Than |
| Threshold | 0 |
| Severity | Critical |
| Description | Order amounts must be positive |
Rule 3: Valid status values
| Setting | Value |
|---|---|
| Column | status |
| Rule type | In Set |
| Values | COMPLETED, SHIPPED, PENDING, CANCELLED |
| Severity | High |
| Description | Order status must be a known value |
Rule 4: Order date within range
| Setting | Value |
|---|---|
| Column | order_date |
| Rule type | Between |
| Min | 2020-01-01 |
| Max | CURRENT_DATE |
| Severity | High |
| Description | Order dates must not be in the future or before 2020 |
Rule 5: Referential integrity
| Setting | Value |
|---|---|
| Column | customer_id |
| Rule type | Exists In |
| Reference table | customers |
| Reference column | id |
| Severity | Critical |
| Description | Every order must reference a valid customer |
Rule 6: Row count threshold
| Setting | Value |
|---|---|
| Rule type | Table Row Count |
| Min rows | 45,000 |
| Severity | High |
| Description | The orders table should have at least 45,000 rows |
- Click Save Rule Set.
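Conceptually, each of these rules compiles to a SQL predicate, and a check counts the rows that violate it. The sketch below illustrates that evaluation loop against a toy sqlite3 table; the `run_checks` helper and sample rows are assumptions for illustration, not the data quality service's actual code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total_amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 19.99, "COMPLETED"), (2, None, "PENDING"), (3, -5.0, "UNKNOWN")],
)

# Each rule is a predicate; rows where it does not hold are violations.
rules = {
    "No null order amounts": "total_amount IS NOT NULL",
    "Positive order amounts": "total_amount > 0",
    "Valid status values":
        "status IN ('COMPLETED', 'SHIPPED', 'PENDING', 'CANCELLED')",
}

def run_checks(conn, table, rules):
    results = {}
    for name, predicate in rules.items():
        # SQL three-valued logic: a NULL amount makes `total_amount > 0`
        # unknown rather than false, so it fails only the not-null rule.
        total, failed = conn.execute(
            f"SELECT COUNT(*), "
            f"SUM(CASE WHEN NOT ({predicate}) THEN 1 ELSE 0 END) FROM {table}"
        ).fetchone()
        results[name] = {
            "failed": failed,
            "pass_rate": 100.0 * (total - failed) / total,
        }
    return results

report = run_checks(conn, "orders", rules)
```

Each toy rule catches exactly one of the three sample rows, mirroring the per-rule pass rates you will see in the next step.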
Step 5: Run Quality Checks
- In the rule set view, click Run Checks.
- The data quality service evaluates each rule against the current data.
Quality Check Results
| Rule | Status | Pass Rate | Failed Rows |
|---|---|---|---|
| No null order amounts | WARN | 99.9% | 50 |
| Positive order amounts | PASS | 100.0% | 0 |
| Valid status values | PASS | 100.0% | 0 |
| Order date within range | PASS | 100.0% | 0 |
| Referential integrity | PASS | 100.0% | 0 |
| Row count threshold | PASS | -- | -- |
Overall Quality Score: 94.2%
The quality score is computed as a weighted average of rule pass rates, with critical rules weighted higher than informational ones.
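A severity-weighted average can be sketched as below. The platform does not publish its actual weights, so the numbers here are illustrative assumptions and will not reproduce the exact 94.2% shown above:

```python
# Assumed severity weights -- the platform's actual weighting is not documented.
WEIGHTS = {"Critical": 3.0, "High": 2.0, "Medium": 1.0, "Info": 0.5}

def quality_score(checks):
    """Weighted average of per-rule pass rates (each a 0-100 percentage)."""
    weighted = sum(rate * WEIGHTS[severity] for rate, severity in checks)
    total = sum(WEIGHTS[severity] for _, severity in checks)
    return weighted / total

checks = [
    (99.9, "Critical"),   # No null order amounts
    (100.0, "Critical"),  # Positive order amounts
    (100.0, "High"),      # Valid status values
    (100.0, "High"),      # Order date within range
    (100.0, "Critical"),  # Referential integrity
    (100.0, "High"),      # Row count threshold
]
score = quality_score(checks)  # 99.98 with these assumed weights
```

The key property is that a failing Critical rule drags the score down more than an equally failing Info rule.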
Step 6: Investigate Failing Rules
Click on the "No null order amounts" rule to see the failing rows:
```sql
-- Auto-generated investigation query
SELECT *
FROM orders
WHERE total_amount IS NULL
ORDER BY order_date DESC
LIMIT 100;
```
| id | customer_id | product_id | order_date | total_amount | status |
|---|---|---|---|---|---|
| 49,823 | 3,421 | 42 | 2026-02-11 | NULL | PENDING |
| 49,756 | 1,892 | 105 | 2026-02-10 | NULL | PENDING |
| ... | ... | ... | ... | NULL | ... |
Finding: All 50 null total_amount values are in PENDING orders that have not been finalized yet. This is expected behavior, and you might update the rule to exclude pending orders:
```sql
total_amount IS NOT NULL OR status = 'PENDING'
```
Step 7: Set Up Quality Monitoring
Schedule recurring quality checks to detect issues as data changes.
- Click Monitors in the sidebar.
- Click New Monitor.
- Configure:
  - Name: Orders Quality Monitor
  - Rule set: `orders_quality_rules`
  - Schedule: Daily at 7:00 AM (before business hours)
  - Alert on: Quality score drops below 90%
  - Notification channel: Email to data-engineering@acme.com
- Click Create Monitor.
Quality Trend Dashboard
The monitoring dashboard shows quality metrics over time:
| Date | Quality Score | Rules Passed | Rules Failed | Total Rows |
|---|---|---|---|---|
| Feb 10 | 95.1% | 6/6 | 0/6 | 49,800 |
| Feb 11 | 94.8% | 5/6 | 1/6 | 49,950 |
| Feb 12 | 94.2% | 5/6 | 1/6 | 50,000 |
The trend chart shows quality score over time, with annotations for significant changes.
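The monitor's alert condition ("quality score drops below 90%") can be sketched as a threshold check over the score history. The sudden-drop heuristic and the `should_alert` name are illustrative additions, not documented platform behavior:

```python
ALERT_THRESHOLD = 90.0  # from the monitor configuration above

history = [("Feb 10", 95.1), ("Feb 11", 94.8), ("Feb 12", 94.2)]

def should_alert(history, threshold=ALERT_THRESHOLD, max_drop=2.0):
    """Alert when the latest score is below the threshold, or when it fell
    sharply since the previous run (max_drop is an assumed heuristic)."""
    (_, previous), (_, latest) = history[-2], history[-1]
    return latest < threshold or (previous - latest) > max_drop

should_alert(history)  # False: 94.2 is above 90 and the day-over-day drop is small
```

Watching the delta as well as the absolute level helps catch a dataset that is degrading quickly but has not yet crossed the hard threshold.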
Step 8: Profile the Customers Table
Repeat the profiling process for the customers table to establish a complete quality baseline:
- Navigate to the `customers` table.
- Click Profile Table.
Key Findings
| Column | Finding | Action |
|---|---|---|
| email | 12 duplicate values | Investigate potential duplicate accounts |
| age | 3 values below 0 | Data entry errors; add validation rule |
| segment | 5 distinct values (expected 4) | Unknown segment "OTHER" needs classification |
| phone | 8.2% null values | Expected for optional field |
Step 9: Data Quality Rules Best Practices
| Category | Rules to Define | Example |
|---|---|---|
| Completeness | Non-null checks on required fields | email IS NOT NULL |
| Validity | Value range and format checks | age BETWEEN 0 AND 150 |
| Uniqueness | Primary key and business key uniqueness | email is unique per tenant |
| Consistency | Cross-column and cross-table checks | order_date <= shipped_date |
| Timeliness | Data freshness checks | Latest order_date within last 24 hours |
| Referential integrity | Foreign key relationships | product_id EXISTS IN products.id |
| Volume | Row count thresholds | orders.count >= 45000 |
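Two of the categories above, consistency and timeliness, go beyond single-column checks. The sketch below shows how they might be expressed over a toy sqlite3 table; the schema, sample dates, and the fixed clock are illustrative assumptions:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, order_date TEXT, shipped_date TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "2026-02-11", "2026-02-12"),  # consistent: shipped after ordered
    (2, "2026-02-10", "2026-02-09"),  # inconsistent: shipped before ordered
])
now = datetime(2026, 2, 12)  # fixed clock so the example is reproducible

# Consistency: cross-column check (order_date <= shipped_date).
inconsistent = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE shipped_date < order_date"
).fetchone()[0]

# Timeliness: the latest order_date should fall within the last 24 hours.
latest = conn.execute("SELECT MAX(order_date) FROM orders").fetchone()[0]
is_fresh = datetime.strptime(latest, "%Y-%m-%d") >= now - timedelta(hours=24)
```

ISO-8601 date strings compare correctly as text, which is why both the cross-column comparison and `MAX(order_date)` work without casting here.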
Step 10: Export Quality Reports
Generate reports for stakeholders:
| Format | Description | Use Case |
|---|---|---|
| PDF Report | Formatted quality summary | Weekly team review |
| CSV Detail | Failed rows with details | Data engineering investigation |
| JSON API | Quality metrics via API | Integration with external tools |
```shell
# Export quality report via API
curl -X GET "http://localhost:8000/api/v1/quality/reports/orders_quality_rules" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H "Accept: application/json"
```
Troubleshooting
| Issue | Cause | Resolution |
|---|---|---|
| "Table not found" | Schema not synced | Refresh the data source metadata |
| Profiling times out | Very large table | Profile a sample instead of the full table |
| Quality checks slow | Complex join rules | Index foreign key columns |
| False positive anomalies | Expected patterns flagged | Adjust anomaly thresholds or exclude known patterns |
| Monitor not running | Schedule configuration error | Check cron expression and timezone settings |
Next Steps
With data quality controls in place, proceed to Platform Administration to learn how to manage tenants, users, and platform settings.