MATIH Platform is in active MVP development. Documentation reflects current implementation status.
5. Quickstart Tutorials
Tutorial: Data Quality Exploration

In this tutorial, you will use the MATIH Data Workbench to profile your data, detect anomalies, define quality rules, and monitor data health over time. Data quality is the foundation of reliable analytics and machine learning -- poor data leads to poor insights.


What You Will Learn

  • How to profile a dataset and understand its statistical properties
  • How to detect anomalies such as null values, outliers, and distribution shifts
  • How to define data quality rules (expectations) for ongoing monitoring
  • How to view quality scores and trend reports
  • How to set up alerts for data quality violations

Prerequisites

| Requirement | How to Verify |
| --- | --- |
| MATIH platform running | `./scripts/tools/platform-status.sh` returns healthy |
| Data quality service operational | Health check on `data-quality-service` passes |
| Sample data loaded | The retail analytics tables are available |

Step 1: Open the Data Workbench

Navigate to the Data Workbench:

  • Local development: http://localhost:3002
  • Cloud deployment: https://data.{your-tenant}.matih.ai

Log in with your tenant credentials. The Data Workbench shows:

| Section | Description |
| --- | --- |
| Data Sources | Connected databases and their schemas |
| Profiles | Data profiling results |
| Quality Rules | Defined quality expectations |
| Monitors | Quality monitoring dashboards |
| Lineage | Data lineage and dependency graphs |

Step 2: Profile a Dataset

Data profiling automatically computes statistics for every column in a table.

  1. Click Data Sources in the sidebar.
  2. Expand your data source and select the orders table.
  3. Click Profile Table.

The profiler analyzes the table and produces a comprehensive report.
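To make the column-level statistics below concrete, here is a minimal sketch of what a profiler computes for one column. This is illustrative plain Python, not the MATIH profiler's actual implementation:

```python
# Illustrative sketch only -- not the MATIH profiler itself. It mimics the
# per-column metrics shown in the profile report using plain Python.
import statistics

def profile_column(values):
    """Return the report's column-level statistics for one list of values."""
    non_null = [v for v in values if v is not None]
    stats = {
        "non_null": len(non_null),
        "null_pct": round(100 * (len(values) - len(non_null)) / len(values), 2),
        "unique": len(set(non_null)),
    }
    # Numeric columns additionally get min/max/mean/std dev.
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        stats["min"], stats["max"] = min(non_null), max(non_null)
        stats["mean"] = round(statistics.mean(non_null), 2)
        stats["std"] = round(statistics.stdev(non_null), 2) if len(non_null) > 1 else 0.0
    return stats

print(profile_column([19.99, None, 250.00, 99.50]))
```

On a real table the profiler runs this kind of computation for every column at once, typically pushing the aggregation down into the database rather than pulling rows out.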

Profile Report: orders Table

Table-Level Statistics:

| Metric | Value |
| --- | --- |
| Row count | 50,000 |
| Column count | 8 |
| Estimated size | 12.4 MB |
| Last updated | 2026-02-11 23:45:00 |

Column-Level Statistics:

| Column | Type | Non-null | Null % | Unique | Min | Max | Mean | Std Dev |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| id | INT | 50,000 | 0.0% | 50,000 | 1 | 50,000 | -- | -- |
| customer_id | INT | 50,000 | 0.0% | 4,823 | 1 | 5,000 | -- | -- |
| product_id | INT | 49,987 | 0.03% | 498 | 1 | 500 | -- | -- |
| order_date | DATE | 50,000 | 0.0% | 365 | 2025-02-12 | 2026-02-11 | -- | -- |
| total_amount | DECIMAL | 49,950 | 0.1% | 8,432 | 1.99 | 2,499.99 | 295.42 | 187.63 |
| status | VARCHAR | 50,000 | 0.0% | 4 | -- | -- | -- | -- |
| shipping_city | VARCHAR | 49,800 | 0.4% | 342 | -- | -- | -- | -- |
| created_at | TIMESTAMP | 50,000 | 0.0% | 49,876 | -- | -- | -- | -- |

Value Distribution for status:

| Value | Count | Percentage |
| --- | --- | --- |
| COMPLETED | 42,000 | 84.0% |
| SHIPPED | 4,500 | 9.0% |
| PENDING | 2,500 | 5.0% |
| CANCELLED | 1,000 | 2.0% |
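The value distribution above is a simple frequency count. As a quick illustration (not the platform's code):

```python
# Sketch of the value-distribution computation shown in the report.
from collections import Counter

def value_distribution(values):
    """Return (value, count, percentage) tuples, most common first."""
    counts = Counter(values)
    total = len(values)
    return [(v, c, round(100 * c / total, 1)) for v, c in counts.most_common()]

statuses = ["COMPLETED"] * 84 + ["SHIPPED"] * 9 + ["PENDING"] * 5 + ["CANCELLED"] * 2
print(value_distribution(statuses))
# -> [('COMPLETED', 84, 84.0), ('SHIPPED', 9, 9.0), ('PENDING', 5, 5.0), ('CANCELLED', 2, 2.0)]
```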

Step 3: Detect Anomalies

The profiler automatically flags potential data quality issues.

Anomaly Detection Results

| Column | Anomaly Type | Description | Severity |
| --- | --- | --- | --- |
| product_id | Null values | 13 null values (0.03%) | Low |
| total_amount | Null values | 50 null values (0.1%) | Medium |
| total_amount | Outliers | 23 values above 3 standard deviations | Low |
| shipping_city | Null values | 200 null values (0.4%) | Medium |
| shipping_city | Cardinality | High cardinality (342 unique values) | Info |

Outlier Analysis for total_amount

The profiler shows a histogram of the total_amount distribution with outliers highlighted:

| Statistic | Value |
| --- | --- |
| Mean | $295.42 |
| Median | $245.00 |
| Std Dev | $187.63 |
| IQR | $210.00 |
| Lower fence (Q1 - 1.5 x IQR) | -$115.00 (effectively $0) |
| Upper fence (Q3 + 1.5 x IQR) | $715.00 |
| Values above upper fence | 1,247 (2.5%) |
| Values above 3 std dev ($857.31) | 23 (0.05%) |

Click on any anomaly to drill into the affected rows.
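The fences in the table follow the standard Tukey rule. A small sketch of the computation (illustrative; `statistics.quantiles` uses the exclusive quartile method by default, so exact quartiles may differ slightly from the profiler's). As in the report, a negative lower fence on a non-negative column is effectively 0:

```python
# Tukey-fence outlier detection: flag values outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR.
import statistics

def iqr_fences(values):
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

amounts = [20, 150, 210, 245, 300, 420, 715, 2400]
lower, upper, outliers = iqr_fences(amounts)
print(round(lower, 2), round(upper, 2), outliers)
```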


Step 4: Define Quality Rules

Quality rules (also called expectations) define the constraints your data must satisfy. When data violates a rule, it is flagged for investigation.

Create Rules for the orders Table

  1. Click Quality Rules in the sidebar.
  2. Click New Rule Set.
  3. Name: orders_quality_rules.
  4. Select table: orders.

Add the following rules:

Rule 1: No null order amounts

| Setting | Value |
| --- | --- |
| Column | total_amount |
| Rule type | Not Null |
| Severity | Critical |
| Description | Every order must have a total amount |

Rule 2: Positive order amounts

| Setting | Value |
| --- | --- |
| Column | total_amount |
| Rule type | Greater Than |
| Threshold | 0 |
| Severity | Critical |
| Description | Order amounts must be positive |

Rule 3: Valid status values

| Setting | Value |
| --- | --- |
| Column | status |
| Rule type | In Set |
| Values | COMPLETED, SHIPPED, PENDING, CANCELLED |
| Severity | High |
| Description | Order status must be a known value |

Rule 4: Order date within range

| Setting | Value |
| --- | --- |
| Column | order_date |
| Rule type | Between |
| Min | 2020-01-01 |
| Max | CURRENT_DATE |
| Severity | High |
| Description | Order dates must not be in the future or before 2020 |

Rule 5: Referential integrity

| Setting | Value |
| --- | --- |
| Column | customer_id |
| Rule type | Exists In |
| Reference table | customers |
| Reference column | id |
| Severity | Critical |
| Description | Every order must reference a valid customer |

Rule 6: Row count threshold

| Setting | Value |
| --- | --- |
| Rule type | Table Row Count |
| Min rows | 45,000 |
| Severity | High |
| Description | The orders table should have at least 45,000 rows |

  5. Click Save Rule Set.
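Conceptually, the rule set you just built is a list of named predicates evaluated against each row. A hypothetical sketch in plain Python (the names and structure here are assumptions for illustration; MATIH's rule engine is configured through the UI above):

```python
# Hypothetical sketch of a rule set as row-level predicates -- not the
# MATIH rule engine's actual representation.
VALID_STATUSES = {"COMPLETED", "SHIPPED", "PENDING", "CANCELLED"}

RULES = [
    ("No null order amounts", lambda r: r["total_amount"] is not None),
    ("Positive order amounts", lambda r: r["total_amount"] is None or r["total_amount"] > 0),
    ("Valid status values", lambda r: r["status"] in VALID_STATUSES),
]

def run_checks(rows):
    """Return each rule's pass rate as a percentage."""
    return {name: round(100 * sum(check(r) for r in rows) / len(rows), 1)
            for name, check in RULES}

orders = [
    {"total_amount": 19.99, "status": "COMPLETED"},
    {"total_amount": None, "status": "PENDING"},
]
print(run_checks(orders))
# -> {'No null order amounts': 50.0, 'Positive order amounts': 100.0, 'Valid status values': 100.0}
```

Note that the "Positive order amounts" predicate deliberately skips null rows, so each rule reports its own violation independently.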

Step 5: Run Quality Checks

  1. In the rule set view, click Run Checks.
  2. The data quality service evaluates each rule against the current data.

Quality Check Results

| Rule | Status | Pass Rate | Failed Rows |
| --- | --- | --- | --- |
| No null order amounts | WARN | 99.9% | 50 |
| Positive order amounts | PASS | 100.0% | 0 |
| Valid status values | PASS | 100.0% | 0 |
| Order date within range | PASS | 100.0% | 0 |
| Referential integrity | PASS | 100.0% | 0 |
| Row count threshold | PASS | -- | -- |

Overall Quality Score: 94.2%

The quality score is computed as a weighted average of rule pass rates, with critical rules weighted higher than informational ones.
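The platform's exact weights and formula are internal; the sketch below shows one plausible severity-weighted scheme, with the weight values being assumptions for illustration only:

```python
# One plausible severity-weighted average -- the weights here are assumed,
# not MATIH's actual scoring constants.
WEIGHTS = {"critical": 3.0, "high": 2.0, "info": 1.0}

def quality_score(results):
    """results: list of (pass_rate_pct, severity) pairs."""
    total = sum(WEIGHTS[sev] for _, sev in results)
    return round(sum(rate * WEIGHTS[sev] for rate, sev in results) / total, 1)

print(quality_score([(100.0, "critical"), (50.0, "info")]))  # -> 87.5
```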


Step 6: Investigate Failing Rules

Click on the "No null order amounts" rule to see the failing rows:

```sql
-- Auto-generated investigation query
SELECT *
FROM orders
WHERE total_amount IS NULL
ORDER BY order_date DESC
LIMIT 100
```

| id | customer_id | product_id | order_date | total_amount | status |
| --- | --- | --- | --- | --- | --- |
| 49,823 | 3,421 | 42 | 2026-02-11 | NULL | PENDING |
| 49,756 | 1,892 | 105 | 2026-02-10 | NULL | PENDING |
| ... | ... | ... | ... | NULL | ... |

Finding: All 50 null total_amount values are in PENDING orders that have not been finalized yet. This is expected behavior, and you might update the rule to exclude pending orders:

`total_amount IS NOT NULL OR status = 'PENDING'`
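The revised condition, expressed as a row-level predicate (illustrative only): a row passes if it has an amount, or if it is still PENDING and the amount is not yet known.

```python
# Python equivalent of: total_amount IS NOT NULL OR status = 'PENDING'
def amount_present_or_pending(row):
    return row["total_amount"] is not None or row["status"] == "PENDING"

rows = [
    {"total_amount": None, "status": "PENDING"},     # passes: not finalized yet
    {"total_amount": None, "status": "COMPLETED"},   # fails: finalized without an amount
    {"total_amount": 49.99, "status": "COMPLETED"},  # passes
]
print([amount_present_or_pending(r) for r in rows])  # -> [True, False, True]
```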

Step 7: Set Up Quality Monitoring

Schedule recurring quality checks to detect issues as data changes.

  1. Click Monitors in the sidebar.

  2. Click New Monitor.

  3. Configure:

    • Name: Orders Quality Monitor
    • Rule set: orders_quality_rules
    • Schedule: Daily at 7:00 AM (before business hours)
    • Alert on: Quality score drops below 90%
    • Notification channel: Email to data-engineering@acme.com
  4. Click Create Monitor.
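The same monitor could in principle be created programmatically. The payload below mirrors the UI fields above, but the field names and the cron schedule format are assumptions, not a documented MATIH API schema:

```python
import json

# Hypothetical request payload for creating the monitor via the platform API.
# Field names and structure are assumptions based on the UI options above.
monitor = {
    "name": "Orders Quality Monitor",
    "rule_set": "orders_quality_rules",
    "schedule": "0 7 * * *",  # daily at 7:00 AM, cron syntax
    "alert": {"metric": "quality_score", "operator": "<", "threshold": 90.0},
    "notifications": [{"channel": "email", "target": "data-engineering@acme.com"}],
}
payload = json.dumps(monitor, indent=2)
print(payload)
```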

Quality Trend Dashboard

The monitoring dashboard shows quality metrics over time:

| Date | Quality Score | Rules Passed | Rules Failed | Total Rows |
| --- | --- | --- | --- | --- |
| Feb 10 | 95.1% | 6/6 | 0/6 | 49,800 |
| Feb 11 | 94.8% | 5/6 | 1/6 | 49,950 |
| Feb 12 | 94.2% | 5/6 | 1/6 | 50,000 |

The trend chart shows quality score over time, with annotations for significant changes.


Step 8: Profile the Customers Table

Repeat the profiling process for the customers table to establish a complete quality baseline:

  1. Navigate to the customers table.
  2. Click Profile Table.

Key Findings

| Column | Finding | Action |
| --- | --- | --- |
| email | 12 duplicate values | Investigate potential duplicate accounts |
| age | 3 values below 0 | Data entry errors; add validation rule |
| segment | 5 distinct values (expected 4) | Unknown segment "OTHER" needs classification |
| phone | 8.2% null values | Expected for optional field |

Step 9: Data Quality Rules Best Practices

| Category | Rules to Define | Example |
| --- | --- | --- |
| Completeness | Non-null checks on required fields | email IS NOT NULL |
| Validity | Value range and format checks | age BETWEEN 0 AND 150 |
| Uniqueness | Primary key and business key uniqueness | email is unique per tenant |
| Consistency | Cross-column and cross-table checks | order_date <= shipped_date |
| Timeliness | Data freshness checks | Latest order_date within last 24 hours |
| Referential integrity | Foreign key relationships | product_id EXISTS IN products.id |
| Volume | Row count thresholds | orders.count >= 45000 |
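Two of the categories above, timeliness and consistency, expressed as standalone predicates (an illustrative sketch, not platform code):

```python
from datetime import datetime, timedelta

def is_fresh(latest_order_date, now, max_age_hours=24):
    """Timeliness: the latest order must be within the last 24 hours."""
    return now - latest_order_date <= timedelta(hours=max_age_hours)

def dates_consistent(order_date, shipped_date):
    """Consistency: an order cannot ship before it was placed."""
    return shipped_date is None or order_date <= shipped_date

now = datetime(2026, 2, 12, 8, 0)
print(is_fresh(datetime(2026, 2, 11, 23, 45), now))                   # -> True
print(dates_consistent(datetime(2026, 2, 10), datetime(2026, 2, 9)))  # -> False
```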

Step 10: Export Quality Reports

Generate reports for stakeholders:

| Format | Description | Use Case |
| --- | --- | --- |
| PDF Report | Formatted quality summary | Weekly team review |
| CSV Detail | Failed rows with details | Data engineering investigation |
| JSON API | Quality metrics via API | Integration with external tools |

```shell
# Export quality report via API
curl -X GET "http://localhost:8000/api/v1/quality/reports/orders_quality_rules" \
  -H "Authorization: Bearer ${ACCESS_TOKEN}" \
  -H "Accept: application/json"
```

Troubleshooting

| Issue | Cause | Resolution |
| --- | --- | --- |
| "Table not found" | Schema not synced | Refresh the data source metadata |
| Profiling times out | Very large table | Profile a sample instead of the full table |
| Quality checks slow | Complex join rules | Index foreign key columns |
| False positive anomalies | Expected patterns flagged | Adjust anomaly thresholds or exclude known patterns |
| Monitor not running | Schedule configuration error | Check cron expression and timezone settings |

Next Steps

With data quality controls in place, proceed to Platform Administration to learn how to manage tenants, users, and platform settings.