Tutorial: Data Quality Exploration
In this tutorial, you will use the MATIH Data Workbench to profile your data, detect anomalies, define quality rules, and monitor data health over time. Data quality is the foundation of reliable analytics and machine learning: poor data leads to poor insights.
What You Will Learn
- How to profile a dataset and understand its statistical properties
- How to detect anomalies such as null values, outliers, and distribution shifts
- How to define data quality rules (expectations) for ongoing monitoring
- How to view quality scores and trend reports
- How to set up alerts for data quality violations
Prerequisites
| Requirement | How to Verify |
|---|---|
| MATIH platform running | `./scripts/tools/platform-status.sh` returns healthy |
| Data quality service operational | Health check on data-quality-service passes |
| Sample data loaded | The retail analytics tables are available |
Step 1: Open the Data Workbench
Navigate to the Data Workbench:
- Local development: http://localhost:3002
- Cloud deployment: https://data.{your-tenant}.matih.ai
Log in with your tenant credentials. The Data Workbench shows:
| Section | Description |
|---|---|
| Data Sources | Connected databases and their schemas |
| Profiles | Data profiling results |
| Quality Rules | Defined quality expectations |
| Monitors | Quality monitoring dashboards |
| Lineage | Data lineage and dependency graphs |
Step 2: Profile a Dataset
Data profiling automatically computes statistics for every column in a table.
- Click Data Sources in the sidebar.
- Expand your data source and select the `orders` table.
- Click Profile Table.
The profiler analyzes the table and produces a comprehensive report.
Profile Report: orders Table
Table-Level Statistics:
| Metric | Value |
|---|---|
| Row count | 50,000 |
| Column count | 8 |
| Estimated size | 12.4 MB |
| Last updated | 2026-02-11 23:45:00 |
Column-Level Statistics:
| Column | Type | Non-null | Null % | Unique | Min | Max | Mean | Std Dev |
|---|---|---|---|---|---|---|---|---|
| id | INT | 50,000 | 0.0% | 50,000 | 1 | 50,000 | -- | -- |
| customer_id | INT | 50,000 | 0.0% | 4,823 | 1 | 5,000 | -- | -- |
| product_id | INT | 49,987 | 0.03% | 498 | 1 | 500 | -- | -- |
| order_date | DATE | 50,000 | 0.0% | 365 | 2025-02-12 | 2026-02-11 | -- | -- |
| total_amount | DECIMAL | 49,950 | 0.1% | 8,432 | 1.99 | 2,499.99 | 295.42 | 187.63 |
| status | VARCHAR | 50,000 | 0.0% | 4 | -- | -- | -- | -- |
| shipping_city | VARCHAR | 49,800 | 0.4% | 342 | -- | -- | -- | -- |
| created_at | TIMESTAMP | 50,000 | 0.0% | 49,876 | -- | -- | -- | -- |
Value Distribution for status:
| Value | Count | Percentage |
|---|---|---|
| COMPLETED | 42,000 | 84.0% |
| SHIPPED | 4,500 | 9.0% |
| PENDING | 2,500 | 5.0% |
| CANCELLED | 1,000 | 2.0% |
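Under the hood, the column-level statistics above reduce to simple aggregate queries. The following minimal sketch shows that computation using Python's built-in sqlite3 module against a toy three-row table; the table contents and the `profile_column` helper are illustrative, not the platform's actual implementation:

```python
import sqlite3

# Toy stand-in for the orders table (illustrative data, not the sample dataset).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total_amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 19.99, "COMPLETED"), (2, None, "PENDING"), (3, 49.50, "SHIPPED")],
)

def profile_column(conn, table, column):
    """Per-column stats like those in the profile report: non-null count,
    null percentage, and distinct (non-null) value count."""
    row_count, non_null, unique = conn.execute(
        f"SELECT COUNT(*), COUNT({column}), COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()
    return {
        "non_null": non_null,
        "null_pct": 100.0 * (row_count - non_null) / row_count,
        "unique": unique,
    }

stats = profile_column(conn, "orders", "total_amount")
# One of the three rows has a NULL amount, so null_pct is about 33.3%.
```

Note that SQL's `COUNT(column)` skips NULLs while `COUNT(*)` does not, which is exactly the distinction the Non-null and Null % columns rely on.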
Step 3: Detect Anomalies
The profiler automatically flags potential data quality issues.
Anomaly Detection Results
| Column | Anomaly Type | Description | Severity |
|---|---|---|---|
| product_id | Null values | 13 null values (0.03%) | Low |
| total_amount | Null values | 50 null values (0.1%) | Medium |
| total_amount | Outliers | 23 values above 3 standard deviations | Low |
| shipping_city | Null values | 200 null values (0.4%) | Medium |
| shipping_city | Cardinality | High cardinality (342 unique values) | Info |
Outlier Analysis for total_amount
The profiler shows a histogram of the total_amount distribution with outliers highlighted:
| Statistic | Value |
|---|---|
| Mean | $295.42 |
| Median | $245.00 |
| Std Dev | $187.63 |
| IQR | $210.00 |
| Lower fence (Q1 - 1.5 x IQR) | -$125.00 |
| Upper fence (Q3 + 1.5 x IQR) | $715.00 |
| Values above upper fence | 1,247 (2.5%) |
| Values above 3 std dev ($857.31) | 23 (0.05%) |
Click on any anomaly to drill into the affected rows.
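The fence calculation above is the standard Tukey IQR rule. Here is a small Python sketch of it over a toy list of order amounts; the data and the `outlier_fences` helper are illustrative, not the profiler's internals:

```python
def outlier_fences(values):
    """Tukey fences: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are outliers."""
    xs = sorted(values)

    def quantile(p):
        # Linear interpolation between the closest ranks.
        k = p * (len(xs) - 1)
        lo = int(k)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (k - lo) * (xs[hi] - xs[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

amounts = [120, 180, 245, 310, 400, 2400]  # one extreme order total
lower, upper = outlier_fences(amounts)
outliers = [a for a in amounts if a < lower or a > upper]  # [2400]
```

As in the report, the lower fence can go negative for right-skewed amounts, in which case no low-side outliers are possible.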
Step 4: Define Quality Rules
Quality rules (also called expectations) define the constraints your data must satisfy. When data violates a rule, it is flagged for investigation.
Create Rules for the orders Table
- Click Quality Rules in the sidebar.
- Click New Rule Set.
- Name: `orders_quality_rules`.
- Select table: `orders`.
Add the following rules:
Rule 1: No null order amounts
| Setting | Value |
|---|---|
| Column | total_amount |
| Rule type | Not Null |
| Severity | Critical |
| Description | Every order must have a total amount |
Rule 2: Positive order amounts
| Setting | Value |
|---|---|
| Column | total_amount |
| Rule type | Greater Than |
| Threshold | 0 |
| Severity | Critical |
| Description | Order amounts must be positive |
Rule 3: Valid status values
| Setting | Value |
|---|---|
| Column | status |
| Rule type | In Set |
| Values | COMPLETED, SHIPPED, PENDING, CANCELLED |
| Severity | High |
| Description | Order status must be a known value |
Rule 4: Order date within range
| Setting | Value |
|---|---|
| Column | order_date |
| Rule type | Between |
| Min | 2020-01-01 |
| Max | CURRENT_DATE |
| Severity | High |
| Description | Order dates must not be in the future or before 2020 |
Rule 5: Referential integrity
| Setting | Value |
|---|---|
| Column | customer_id |
| Rule type | Exists In |
| Reference table | customers |
| Reference column | id |
| Severity | Critical |
| Description | Every order must reference a valid customer |
Rule 6: Row count threshold
| Setting | Value |
|---|---|
| Rule type | Table Row Count |
| Min rows | 45,000 |
| Severity | High |
| Description | The orders table should have at least 45,000 rows |
- Click Save Rule Set.
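Conceptually, each of these rules compiles to a SQL predicate, and a check counts the rows that violate it. The sketch below illustrates that evaluation loop against a toy sqlite3 table; the `run_checks` helper and sample rows are assumptions for illustration, not the data quality service's actual code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total_amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 19.99, "COMPLETED"), (2, None, "PENDING"), (3, -5.0, "UNKNOWN")],
)

# Each rule is a predicate; rows where it does not hold are violations.
rules = {
    "No null order amounts": "total_amount IS NOT NULL",
    "Positive order amounts": "total_amount > 0",
    "Valid status values":
        "status IN ('COMPLETED', 'SHIPPED', 'PENDING', 'CANCELLED')",
}

def run_checks(conn, table, rules):
    results = {}
    for name, predicate in rules.items():
        # SQL three-valued logic: a NULL amount makes `total_amount > 0`
        # unknown rather than false, so it fails only the not-null rule.
        total, failed = conn.execute(
            f"SELECT COUNT(*), "
            f"SUM(CASE WHEN NOT ({predicate}) THEN 1 ELSE 0 END) FROM {table}"
        ).fetchone()
        results[name] = {
            "failed": failed,
            "pass_rate": 100.0 * (total - failed) / total,
        }
    return results

report = run_checks(conn, "orders", rules)
```

Each toy rule catches exactly one of the three sample rows, mirroring the per-rule pass rates you will see in the next step.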
Step 5: Run Quality Checks
- In the rule set view, click Run Checks.
- The data quality service evaluates each rule against the current data.
Quality Check Results
| Rule | Status | Pass Rate | Failed Rows |
|---|---|---|---|
| No null order amounts | WARN | 99.9% | 50 |
| Positive order amounts | PASS | 100.0% | 0 |
| Valid status values | PASS | 100.0% | 0 |
| Order date within range | PASS | 100.0% | 0 |
| Referential integrity | PASS | 100.0% | 0 |
| Row count threshold | PASS | -- | -- |
Overall Quality Score: 94.2%
The quality score is computed as a weighted average of rule pass rates, with critical rules weighted higher than informational ones.
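A severity-weighted average can be sketched as below. The platform does not publish its actual weights, so the numbers here are illustrative assumptions and will not reproduce the exact 94.2% shown above:

```python
# Assumed severity weights -- the platform's actual weighting is not documented.
WEIGHTS = {"Critical": 3.0, "High": 2.0, "Medium": 1.0, "Info": 0.5}

def quality_score(checks):
    """Weighted average of per-rule pass rates (each a 0-100 percentage)."""
    weighted = sum(rate * WEIGHTS[severity] for rate, severity in checks)
    total = sum(WEIGHTS[severity] for _, severity in checks)
    return weighted / total

checks = [
    (99.9, "Critical"),   # No null order amounts
    (100.0, "Critical"),  # Positive order amounts
    (100.0, "High"),      # Valid status values
    (100.0, "High"),      # Order date within range
    (100.0, "Critical"),  # Referential integrity
    (100.0, "High"),      # Row count threshold
]
score = quality_score(checks)  # 99.98 with these assumed weights
```

The key property is that a failing Critical rule drags the score down more than an equally failing Info rule.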
Step 6: Investigate Failing Rules
Click on the "No null order amounts" rule to see the failing rows:
```sql
-- Auto-generated investigation query
SELECT *
FROM orders
WHERE total_amount IS NULL
ORDER BY order_date DESC
LIMIT 100;
```
| id | customer_id | product_id | order_date | total_amount | status |
|---|---|---|---|---|---|
| 49,823 | 3,421 | 42 | 2026-02-11 | NULL | PENDING |
| 49,756 | 1,892 | 105 | 2026-02-10 | NULL | PENDING |
| ... | ... | ... | ... | NULL | ... |
Finding: All 50 null total_amount values are in PENDING orders that have not been finalized yet. This is expected behavior, and you might update the rule to exclude pending orders:
```sql
total_amount IS NOT NULL OR status = 'PENDING'
```
Step 7: Set Up Quality Monitoring
Schedule recurring quality checks to detect issues as data changes.
- Click Monitors in the sidebar.
- Click New Monitor.
- Configure:
  - Name: Orders Quality Monitor
  - Rule set: `orders_quality_rules`
  - Schedule: Daily at 7:00 AM (before business hours)
  - Alert on: Quality score drops below 90%
  - Notification channel: Email to data-engineering@acme.com
- Click Create Monitor.
Quality Trend Dashboard
The monitoring dashboard shows quality metrics over time:
| Date | Quality Score | Rules Passed | Rules Failed | Total Rows |
|---|---|---|---|---|
| Feb 10 | 95.1% | 6/6 | 0/6 | 49,800 |
| Feb 11 | 94.8% | 5/6 | 1/6 | 49,950 |
| Feb 12 | 94.2% | 5/6 | 1/6 | 50,000 |
The trend chart shows quality score over time, with annotations for significant changes.
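The monitor's alert condition ("quality score drops below 90%") can be sketched as a threshold check over the score history. The sudden-drop heuristic and the `should_alert` name are illustrative additions, not documented platform behavior:

```python
ALERT_THRESHOLD = 90.0  # from the monitor configuration above

history = [("Feb 10", 95.1), ("Feb 11", 94.8), ("Feb 12", 94.2)]

def should_alert(history, threshold=ALERT_THRESHOLD, max_drop=2.0):
    """Alert when the latest score is below the threshold, or when it fell
    sharply since the previous run (max_drop is an assumed heuristic)."""
    (_, previous), (_, latest) = history[-2], history[-1]
    return latest < threshold or (previous - latest) > max_drop

should_alert(history)  # False: 94.2 is above 90 and the day-over-day drop is small
```

Watching the delta as well as the absolute level helps catch a dataset that is degrading quickly but has not yet crossed the hard threshold.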
Step 8: Profile the Customers Table
Repeat the profiling process for the customers table to establish a complete quality baseline:
- Navigate to the `customers` table.
- Click Profile Table.
Key Findings
| Column | Finding | Action |
|---|---|---|
| email | 12 duplicate values | Investigate potential duplicate accounts |
| age | 3 values below 0 | Data entry errors; add validation rule |
| segment | 5 distinct values (expected 4) | Unknown segment "OTHER" needs classification |
| phone | 8.2% null values | Expected for optional field |
Step 9: Data Quality Rules Best Practices
| Category | Rules to Define | Example |
|---|---|---|
| Completeness | Non-null checks on required fields | email IS NOT NULL |
| Validity | Value range and format checks | age BETWEEN 0 AND 150 |
| Uniqueness | Primary key and business key uniqueness | email is unique per tenant |
| Consistency | Cross-column and cross-table checks | order_date <= shipped_date |
| Timeliness | Data freshness checks | Latest order_date within last 24 hours |
| Referential integrity | Foreign key relationships | product_id EXISTS IN products.id |
| Volume | Row count thresholds | orders.count >= 45000 |
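Two of the categories above, consistency and timeliness, go beyond single-column checks. The sketch below shows how they might be expressed over a toy sqlite3 table; the schema, sample dates, and the fixed clock are illustrative assumptions:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER, order_date TEXT, shipped_date TEXT)"
)
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, "2026-02-11", "2026-02-12"),  # consistent: shipped after ordered
    (2, "2026-02-10", "2026-02-09"),  # inconsistent: shipped before ordered
])
now = datetime(2026, 2, 12)  # fixed clock so the example is reproducible

# Consistency: cross-column check (order_date <= shipped_date).
inconsistent = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE shipped_date < order_date"
).fetchone()[0]

# Timeliness: the latest order_date should fall within the last 24 hours.
latest = conn.execute("SELECT MAX(order_date) FROM orders").fetchone()[0]
is_fresh = datetime.strptime(latest, "%Y-%m-%d") >= now - timedelta(hours=24)
```

ISO-8601 date strings compare correctly as text, which is why both the cross-column comparison and `MAX(order_date)` work without casting here.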
Step 10: Export Quality Reports
Generate reports for stakeholders:
| Format | Description | Use Case |
|---|---|---|
| PDF Report | Formatted quality summary | Weekly team review |
| CSV Detail | Failed rows with details | Data engineering investigation |
| JSON API | Quality metrics via API | Integration with external tools |
```shell
# Export quality report via API
curl -X GET "http://localhost:8000/api/v1/quality/reports/orders_quality_rules" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H "Accept: application/json"
```
Troubleshooting
| Issue | Cause | Resolution |
|---|---|---|
| "Table not found" | Schema not synced | Refresh the data source metadata |
| Profiling times out | Very large table | Profile a sample instead of the full table |
| Quality checks slow | Complex join rules | Index foreign key columns |
| False positive anomalies | Expected patterns flagged | Adjust anomaly thresholds or exclude known patterns |
| Monitor not running | Schedule configuration error | Check cron expression and timezone settings |
Next Steps
With data quality controls in place, proceed to Platform Administration to learn how to manage tenants, users, and platform settings.