MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Governance & Classification

After data is ingested, the platform automatically classifies it, detects PII, and generates governance recommendations.

Post-Ingestion Intelligence Pipeline

Airbyte sync completes → catalog-service registers tables

TABLE_DISCOVERED event → [catalog-events Kafka topic]

┌─────────────────────────┬──────────────────────┬────────────────────┬──────────────────┐
│ Auto-Classification     │ Auto-Ontology        │ Auto-Semantic      │ Auto-Quality     │
│ (PII detection + level) │ (entity extraction)  │ (DRAFT models)     │ (profiling)      │
└─────────────────────────┴──────────────────────┴────────────────────┴──────────────────┘

DATA_CLASSIFIED event → governance-service

RLS suggestions + masking rules

Auto-Classification

When a new table is discovered via ingestion:

  1. PII Detection — scans column names and sample values for patterns (SSN, email, phone, credit card, addresses)
  2. Risk Level — assigns NONE / LOW / MEDIUM / HIGH / CRITICAL based on PII types found
  3. Classification Level — PUBLIC / INTERNAL / CONFIDENTIAL / RESTRICTED based on data sensitivity
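
The three steps above can be sketched as follows. This is a minimal illustration, not the platform's implementation: the regular expressions, the per-type risk mapping (`RISK_BY_TYPE`), the risk-to-level mapping (`LEVEL_BY_RISK`), and the function names are all assumptions made for the example.

```python
import re

# Hypothetical column-name hints; the platform's real rules are not documented here.
NAME_HINTS = {
    "SSN": re.compile(r"ssn|social_security", re.I),
    "EMAIL": re.compile(r"e_?mail", re.I),
    "PHONE": re.compile(r"phone|mobile", re.I),
    "CREDIT_CARD": re.compile(r"credit_?card|card_?(?:number|num|no)", re.I),
    "ADDRESS": re.compile(r"address|street|zip|postal", re.I),
}

# Hypothetical sample-value patterns (US-style formats only).
VALUE_PATTERNS = {
    "SSN": re.compile(r"\d{3}-\d{2}-\d{4}"),
    "EMAIL": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "PHONE": re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[-.]\d{4}"),
    "CREDIT_CARD": re.compile(r"(?:\d{4}[- ]?){3}\d{4}"),
}

RISK_ORDER = ["NONE", "LOW", "MEDIUM", "HIGH", "CRITICAL"]

# Assumed mappings, chosen only to make the example concrete.
RISK_BY_TYPE = {
    "ADDRESS": "LOW",
    "EMAIL": "MEDIUM",
    "PHONE": "MEDIUM",
    "CREDIT_CARD": "HIGH",
    "SSN": "CRITICAL",
}
LEVEL_BY_RISK = {
    "NONE": "PUBLIC",
    "LOW": "INTERNAL",
    "MEDIUM": "CONFIDENTIAL",
    "HIGH": "RESTRICTED",
    "CRITICAL": "RESTRICTED",
}


def detect_pii(column_name, samples):
    """Step 1: return the PII types suggested by the column name or sample values."""
    found = set()
    for pii_type, pattern in NAME_HINTS.items():
        if pattern.search(column_name):
            found.add(pii_type)
    for pii_type, pattern in VALUE_PATTERNS.items():
        if any(pattern.fullmatch(str(value).strip()) for value in samples):
            found.add(pii_type)
    return found


def classify_table(columns):
    """Steps 2 and 3: columns maps column name -> list of sample values."""
    pii = {name: detect_pii(name, samples) for name, samples in columns.items()}
    types = set().union(*pii.values())
    # The table's risk is the highest risk among all PII types found.
    risk = max((RISK_BY_TYPE[t] for t in types), key=RISK_ORDER.index, default="NONE")
    return {"pii": pii, "risk": risk, "level": LEVEL_BY_RISK[risk]}
```

Taking the highest risk among the detected types is one plausible design choice; a real classifier might also weigh how many columns or rows contain PII.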

RLS Auto-Suggestions

For tables classified as HIGH or CRITICAL:

PII Type           Suggested Policy
─────────────────  ─────────────────────────────────────────────────────────
tenant_id column   Tenant-scoped RLS: WHERE tenant_id = current_tenant()
SSN                Column restriction: only PII_VIEWER role sees full value
Email              Masking: ***@domain.com for non-privileged users
Phone              Masking: (XXX) XXX-1234 for non-privileged users
Credit Card        PCI-DSS masking: XXXX-XXXX-XXXX-1234

Suggestions are created as DRAFT policies that require human approval before activation.
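
As a rough sketch of how such suggestions might be produced, the snippet below emits one DRAFT entry per matching column for HIGH or CRITICAL tables. The `PolicySuggestion` type, the `kind` labels, and `suggest_policies` are hypothetical names introduced for illustration; only the policy texts and the DRAFT status come from this page.

```python
from dataclasses import dataclass


@dataclass
class PolicySuggestion:
    table: str
    column: str
    kind: str
    detail: str
    status: str = "DRAFT"  # every suggestion starts as a DRAFT awaiting approval


# Policy text per PII type, taken from the suggested-policy list on this page.
SUGGESTION_BY_TYPE = {
    "SSN": ("COLUMN_RESTRICTION", "only PII_VIEWER role sees full value"),
    "EMAIL": ("MASKING", "***@domain.com for non-privileged users"),
    "PHONE": ("MASKING", "(XXX) XXX-1234 for non-privileged users"),
    "CREDIT_CARD": ("MASKING", "XXXX-XXXX-XXXX-1234 (PCI-DSS)"),
}


def suggest_policies(table, risk, column_names, pii_columns):
    """Return DRAFT suggestions; pii_columns maps column name -> PII type."""
    if risk not in {"HIGH", "CRITICAL"}:
        return []
    suggestions = []
    if "tenant_id" in column_names:
        suggestions.append(
            PolicySuggestion(table, "tenant_id", "ROW_FILTER",
                             "WHERE tenant_id = current_tenant()")
        )
    for column, pii_type in pii_columns.items():
        if pii_type in SUGGESTION_BY_TYPE:
            kind, detail = SUGGESTION_BY_TYPE[pii_type]
            suggestions.append(PolicySuggestion(table, column, kind, detail))
    return suggestions
```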

Dynamic Masking

Trino masking expressions are generated automatically for each detected PII type. In the table below, column stands for the name of the masked column:

PII Type      Masking Expression
────────────  ─────────────────────────────────────────
SSN           'XXX-XX-' || SUBSTR(column, -4)
Email         '***@' || SPLIT_PART(column, '@', 2)
Phone         '(XXX) XXX-' || SUBSTR(column, -4)
Credit Card   'XXXX-XXXX-XXXX-' || SUBSTR(column, -4)
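
One way to render these expressions for a concrete column is a simple template lookup. The sketch below is an assumption about how generation could work, not the platform's generator; `MASK_TEMPLATES`, `quote_identifier`, and `masking_expression` are hypothetical names, while the expression strings are the ones listed above.

```python
# Templates copied from the masking table; {col} is replaced by the quoted column name.
MASK_TEMPLATES = {
    "SSN": "'XXX-XX-' || SUBSTR({col}, -4)",
    "EMAIL": "'***@' || SPLIT_PART({col}, '@', 2)",
    "PHONE": "'(XXX) XXX-' || SUBSTR({col}, -4)",
    "CREDIT_CARD": "'XXXX-XXXX-XXXX-' || SUBSTR({col}, -4)",
}


def quote_identifier(name):
    """Quote a column name as a SQL delimited identifier, doubling embedded quotes."""
    return '"' + name.replace('"', '""') + '"'


def masking_expression(pii_type, column):
    """Render the Trino masking expression for one column."""
    return MASK_TEMPLATES[pii_type].format(col=quote_identifier(column))


if __name__ == "__main__":
    print(masking_expression("EMAIL", "contact_email"))
```

Quoting the identifier keeps column names with spaces or reserved words from breaking the generated SQL.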

Accuracy Metrics

Three accuracy services run after each sync:

Freshness SLA

  • Tracks last_sync_at vs configured SLA (default: 24 hours)
  • Alerts on breach via notification-service
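
The freshness check amounts to comparing the age of the last sync against the SLA. A minimal sketch, assuming timezone-aware timestamps and a hypothetical `freshness_breached` helper (the call to notification-service is omitted):

```python
from datetime import datetime, timedelta, timezone

DEFAULT_SLA = timedelta(hours=24)  # default SLA stated above


def freshness_breached(last_sync_at, now=None, sla=DEFAULT_SLA):
    """Return True when the last sync is older than the configured SLA."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_sync_at > sla
```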

Schema Drift Detection

  • Compares column schemas between consecutive syncs
  • Detects: added columns, removed columns, type changes
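
Drift detection of this kind can be expressed as set operations on two schema snapshots. The sketch below assumes each snapshot is a mapping from column name to type name; `detect_drift` is a hypothetical helper, not a documented API.

```python
def detect_drift(previous, current):
    """Compare two {column_name: type_name} snapshots from consecutive syncs."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    type_changes = {
        column: (previous[column], current[column])
        for column in sorted(previous.keys() & current.keys())
        if previous[column] != current[column]
    }
    return {"added": added, "removed": removed, "type_changes": type_changes}
```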

Row Count Validation

  • Flags >20% row count drops as anomalies (possible data loss)
  • Flags >500% spikes as unusual (possible data explosion)
  • Critical alert on drop to zero
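
These thresholds could be applied as in the sketch below. It reads ">500% spike" as growth of more than 500% over the previous count, which is an interpretation rather than something this page states; the return labels and the `validate_row_count` name are likewise hypothetical.

```python
def validate_row_count(previous, current):
    """Classify the row-count change between two consecutive syncs."""
    if previous <= 0:
        return "OK"  # no baseline to compare against (assumption)
    if current == 0:
        return "CRITICAL"  # drop to zero
    # Integer arithmetic avoids floating-point edge cases at the thresholds.
    if (previous - current) * 100 > 20 * previous:
        return "ANOMALY_DROP"  # more than a 20% drop
    if (current - previous) * 100 > 500 * previous:
        return "UNUSUAL_SPIKE"  # more than 500% growth
    return "OK"
```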

RBAC

Operation             Permission         Roles
────────────────────  ─────────────────  ─────────────────────────────────────────────
View classification   catalog:read       DATA_ENGINEER, DATA_ANALYST, DATA_SCIENTIST
View RLS suggestions  governance:read    DATA_ENGINEER, PLATFORM_ADMIN
Apply RLS policies    governance:write   DATA_ENGINEER, PLATFORM_ADMIN
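
A permission check against this matrix might look like the following sketch. The permission-to-role mapping mirrors the RBAC table; the `PERMISSION_ROLES` structure and `is_allowed` function are assumptions for illustration, not the platform's authorization API.

```python
# Roles granted each permission, transcribed from the RBAC table.
PERMISSION_ROLES = {
    "catalog:read": {"DATA_ENGINEER", "DATA_ANALYST", "DATA_SCIENTIST"},
    "governance:read": {"DATA_ENGINEER", "PLATFORM_ADMIN"},
    "governance:write": {"DATA_ENGINEER", "PLATFORM_ADMIN"},
}


def is_allowed(role, permission):
    """Return True if the role holds the permission; unknown permissions are denied."""
    return role in PERMISSION_ROLES.get(permission, set())
```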