MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Data Governance

Status: Production. Services: catalog-service, governance-service, data-quality-service. Scope: catalog, lineage, classification, quality monitoring.

The Data Governance pillar ensures that data across the MATIH Platform is discoverable, trustworthy, secure, and compliant. Unlike standalone governance tools that operate separately from the analytics workflow, MATIH's governance capabilities are integrated directly into the query pipeline, dashboard rendering, the ML training workflow, and the conversational analytics interface.


1.1 Data Catalog

The catalog-service (Java, port 8086) provides a centralized metadata repository:

| Capability | Description | Implementation |
|---|---|---|
| Automated discovery | Crawl connected Trino catalogs to discover tables, columns, and schemas | Scheduled crawlers with incremental sync |
| Business glossary | Define business terms and map them to technical schema elements | Term management with approval workflows |
| Data lineage | Track data flow from source through transformations to consumption | Automatic lineage from query execution; manual lineage for external processes |
| Search | Full-text search across all cataloged assets with faceted filtering | Elasticsearch-backed with tenant-scoped indices |
| Classification | Tag sensitive data (PII, PHI, financial, confidential) | Automatic classification via pattern matching + manual overrides |
| Ownership | Assign data stewards and owners to tables and schemas | RBAC-integrated ownership model |
| Documentation | Rich descriptions for tables, columns, and schemas | Markdown support with version history |

Schema Discovery Workflow

Discovery Crawler (scheduled or manual)
  |
  +-- Connect to Trino catalog for tenant
  |     |
  |     +-- List all schemas
  |     +-- List all tables per schema
  |     +-- Describe columns (name, type, nullable, comment)
  |
  +-- Compare with existing catalog metadata
  |     |
  |     +-- New tables -> Create catalog entries
  |     +-- Changed columns -> Flag schema drift
  |     +-- Dropped tables -> Mark as deprecated
  |
  +-- Publish events
        |
        +-- schema.discovered -> notification-service (new data available)
        +-- schema.drifted -> data-quality-service (quality re-evaluation)
        +-- schema.deprecated -> Context Graph (impact analysis)
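
The comparison step in the workflow above can be sketched as a plain dictionary diff. This is a minimal illustration, not the catalog-service implementation: the `Event` type, the `diff_catalog` function, and the "schema.table" keying are assumptions made for the example. Only the three event names come from the workflow.

```python
from dataclasses import dataclass

# Hypothetical shape: each table is keyed by "schema.table" and maps
# column name -> column type.
Columns = dict[str, str]


@dataclass(frozen=True)
class Event:
    topic: str  # schema.discovered | schema.drifted | schema.deprecated
    table: str


def diff_catalog(crawled: dict[str, Columns],
                 cataloged: dict[str, Columns]) -> list[Event]:
    """Compare a fresh crawl against stored catalog metadata."""
    events = []
    for table, columns in sorted(crawled.items()):
        if table not in cataloged:
            # New table -> create a catalog entry
            events.append(Event("schema.discovered", table))
        elif columns != cataloged[table]:
            # Columns added, removed, or retyped -> flag schema drift
            events.append(Event("schema.drifted", table))
    # Tables present in the catalog but missing from the crawl
    for table in sorted(cataloged.keys() - crawled.keys()):
        events.append(Event("schema.deprecated", table))
    return events
```

A real crawler would also compare nullability and comments, and would publish the events to a message bus instead of returning them.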

1.2 Data Lineage

Data lineage in MATIH is tracked at multiple granularities:

| Lineage Level | Granularity | Source |
|---|---|---|
| Table-level | Source table -> Target table | Pipeline execution records in pipeline-service |
| Column-level | Source column -> Target column | SQL parsing of transformation queries |
| Query-level | Query -> Tables read | Trino query execution logs |
| Dashboard-level | Dashboard -> Queries -> Tables | bi-service query references |
| Model-level | Model -> Training tables -> Features | ml-service experiment records |

Lineage in the Context Graph

All lineage information is stored in the Context Graph (Neo4j) as relationships between nodes:

Source DB (Postgres)
  |
  [EXTRACTED_BY] Pipeline: daily_orders_load
  |
  v
Staging Table: raw.orders
  |
  [TRANSFORMED_BY] Pipeline: orders_transform
  |
  v
Analytics Table: analytics.orders_daily
  |
  +--[QUERIED_BY] Dashboard: "Sales Overview"
  |
  +--[QUERIED_BY] Conversation: "Show me revenue trends"
  |
  +--[TRAINED_ON] Model: "demand_forecast_v2"

Impact Analysis

Because lineage is stored as graph relationships, impact analysis reduces to graph traversal. Typical questions include:

| Question | Graph Query | Result |
|---|---|---|
| "What dashboards break if I drop this column?" | Traverse READS_FROM -> DISPLAYS edges from Column | List of affected dashboards with widget details |
| "What is the source of this metric?" | Traverse backward from SemanticMetric through transformations | Complete source-to-metric lineage chain |
| "Who uses this table?" | Find all edges from Table node | Users, queries, dashboards, models, and pipelines |
| "What models are affected by this data quality issue?" | Traverse from QualityAlert through Table to Model | Models with degraded training data |
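
To make the traversal idea concrete, the sketch below runs a breadth-first search over an in-memory edge list that mirrors the lineage example earlier in this section. It does not use Neo4j or the platform's actual graph schema. The node identifiers, the `EDGES` list, and the `downstream` function are hypothetical, and pipelines are simplified to edge labels.

```python
from collections import deque

# Hypothetical edge list: (source node, relationship, target node).
EDGES = [
    ("source_db.orders", "EXTRACTED_BY", "raw.orders"),
    ("raw.orders", "TRANSFORMED_BY", "analytics.orders_daily"),
    ("analytics.orders_daily", "QUERIED_BY", "dashboard:Sales Overview"),
    ("analytics.orders_daily", "QUERIED_BY", "conversation:Show me revenue trends"),
    ("analytics.orders_daily", "TRAINED_ON", "model:demand_forecast_v2"),
]


def downstream(start, edges=EDGES):
    """Return every node reachable from `start`, in breadth-first order."""
    adjacency = {}
    for src, _relationship, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    seen = {start}
    queue = deque([start])
    reached = []
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                reached.append(nxt)
                queue.append(nxt)
    return reached
```

Calling `downstream("raw.orders")` lists the analytics table plus the dashboard, conversation, and model that depend on it, which is the in-memory equivalent of asking "Who uses this table?".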

1.3 Data Classification and Masking

The governance-service (Python, port 8080) manages data classification and protection:

Automatic Classification

The governance-service runs classification rules against catalog metadata to automatically identify sensitive data:

| Classification | Pattern | Examples |
|---|---|---|
| PII (Personally Identifiable Information) | Name patterns, email regex, phone regex, SSN patterns | customer_name, email_address, phone_number |
| PHI (Protected Health Information) | Medical record patterns, diagnosis codes, treatment fields | patient_id, diagnosis_code, medication |
| Financial | Account number patterns, credit card regex, transaction fields | account_number, credit_card, salary |
| Confidential | Manual classification by data stewards | internal_cost, margin, strategy_notes |
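
A minimal sketch of name-based classification with manual overrides follows. The regular expressions are deliberately loose illustrations chosen to match the example column names in the table; they are not the governance-service rule set, and `RULES` and `classify_column` are hypothetical names.

```python
import re

# Illustrative rules only. Order matters: the first matching rule wins.
RULES = [
    ("PHI", re.compile(r"patient|diagnosis|medication", re.IGNORECASE)),
    ("Financial", re.compile(r"account_number|credit_card|salary", re.IGNORECASE)),
    ("PII", re.compile(r"name|email|phone|ssn", re.IGNORECASE)),
]


def classify_column(column_name, overrides=None):
    """Return a classification label for a column, or None if nothing matches.

    `overrides` maps column names to labels set manually by a data steward
    and takes precedence over the automatic rules.
    """
    if overrides and column_name in overrides:
        return overrides[column_name]
    for label, pattern in RULES:
        if pattern.search(column_name):
            return label
    return None
```

Matching on column names alone produces false positives (for example, `filename` matches the `name` pattern), which is one reason manual overrides are needed. Value-based checks, such as sampling rows against an email or SSN regex, would complement this approach.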

Column Masking Policies

Once data is classified, masking policies can be applied:

| Masking Type | Description | Example |
|---|---|---|
| Full mask | Replace entire value with a constant | John Smith -> **** |
| Partial mask | Show first/last characters only | john@company.com -> j***@c***y.com |
| Hash | Replace with deterministic hash (preserves joins) | SSN: 123-45-6789 -> SSN: a7f3b... |
| Null | Replace with NULL | salary: 150000 -> salary: NULL |
| Range | Replace precise value with range | age: 34 -> age: 30-40 |
| Custom function | Apply tenant-defined masking logic | Configurable via governance-service API |

Masking is enforced at the semantic layer, ensuring that masked data appears consistently regardless of access path (dashboard, conversational query, API, or direct SQL).
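
The masking types in the table can be illustrated with a few small functions. This is a sketch under stated assumptions: the function names, the SHA-256 choice, the salt parameter, and the 12-character truncation are all hypothetical, and `partial_mask_email` assumes a well-formed address with a dot in the domain.

```python
import hashlib


def full_mask(value):
    """Replace the entire value with a constant."""
    return "****"


def partial_mask_email(email):
    """Keep the first character of the local part and the first and last
    characters of the domain name."""
    local, _, domain = email.partition("@")
    name, dot, tld = domain.rpartition(".")
    return f"{local[0]}***@{name[0]}***{name[-1]}{dot}{tld}"


def hash_mask(value, salt=""):
    """Deterministic hash: equal inputs yield equal outputs, so joins on the
    masked column still line up."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def null_mask(value):
    """Replace the value with NULL (None in Python)."""
    return None


def range_mask(number, width=10):
    """Replace a precise number with the bucket that contains it."""
    low = (number // width) * width
    return f"{low}-{low + width}"
```

For the table's examples, `partial_mask_email("john@company.com")` returns `j***@c***y.com` and `range_mask(34)` returns `30-40`. An unsalted hash of a low-entropy value such as an SSN can be reversed by brute force, so a per-tenant secret salt would be a sensible addition in practice.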


1.4 Access Control and Compliance

Row-Level Security

Row-level security (RLS) filters query results based on user attributes:

Policy: Regional Access
  Rule: Users with role "Regional Manager" see only data
        WHERE region = user.assigned_region
  Applied to: All tables with a "region" column
  Enforcement: Semantic layer appends filter to every query

RLS policies are defined in the governance-service and enforced by the semantic layer, ensuring they apply to:

  • Dashboard queries
  • Conversational analytics queries
  • Direct API queries
  • ML training data access
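
One way to picture "appends filter to every query" is to wrap the incoming SQL in a subquery and add the policy predicate outside it. The sketch below takes that approach. The `RlsPolicy` type, the `apply_rls` function, the user dictionary shape, and the `?` placeholder style are assumptions for illustration, not the semantic layer's actual mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RlsPolicy:
    role: str            # role the policy applies to
    column: str          # column to filter on
    user_attribute: str  # user attribute supplying the allowed value


def apply_rls(sql, user, policies):
    """Wrap `sql` so that only rows permitted by the user's policies remain.

    Returns the rewritten SQL and its bound parameters. Values are passed as
    parameters instead of being interpolated, to avoid SQL injection.
    """
    predicates = []
    params = []
    for policy in policies:
        if policy.role in user["roles"]:
            predicates.append(f"{policy.column} = ?")
            params.append(user[policy.user_attribute])
    if not predicates:
        return sql, params
    where = " AND ".join(predicates)
    return f"SELECT * FROM ({sql}) AS rls_scope WHERE {where}", params
```

The wrapping approach only works when the inner query projects the filtered column. A production implementation would more likely rewrite the query plan or inject the predicate at the table reference.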

Audit Trail

The audit-service (Java, port 8086) captures every significant action across all services:

| Audit Event | Captured Data | Retention |
|---|---|---|
| Data access | User, tenant, table, columns, row count, timestamp | 90 days (configurable) |
| Query execution | Full SQL, execution plan, duration, result size | 90 days |
| Configuration change | Old value, new value, user, timestamp | 365 days |
| Policy change | RLS rule, masking rule, classification change | 365 days |
| User action | Login, logout, role change, permission grant | 365 days |
| Model deployment | Model version, deployer, approval chain | 365 days |
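
The retention periods in the table can be expressed as a simple lookup used by a purge job. The sketch below is illustrative only: the event-type keys, `RETENTION_DAYS`, and `is_expired` are hypothetical names, and the audit-service may store or enforce retention differently.

```python
from datetime import datetime, timedelta, timezone

# Retention periods from the table above, keyed by hypothetical event types.
RETENTION_DAYS = {
    "data_access": 90,
    "query_execution": 90,
    "configuration_change": 365,
    "policy_change": 365,
    "user_action": 365,
    "model_deployment": 365,
}


def is_expired(event_type, recorded_at, now=None):
    """Return True when an audit record is older than its retention period."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - recorded_at > timedelta(days=RETENTION_DAYS[event_type])
```

With these values, a data access record from 91 days ago is eligible for purging, while a policy change of the same age is kept.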

Compliance Reporting

MATIH provides pre-built compliance report templates:

| Compliance Framework | Report Contents |
|---|---|
| GDPR | Data subject access requests, processing records, consent tracking, right to erasure audit |
| HIPAA | PHI access logs, minimum necessary checks, breach notification timeline |
| SOC 2 | Access control effectiveness, change management records, incident response timeline |
| CCPA | Consumer data inventory, sale/sharing records, opt-out compliance |

1.5 Governance in the Conversational Flow

Governance is not a separate workflow in MATIH. It is embedded in the conversational analytics experience:

| Scenario | Governance Action | User Experience |
|---|---|---|
| User queries a table with PII | Column masking applied automatically | Masked values appear in results with a governance indicator |
| User queries data with low quality score | Quality warning included in response | "Note: The warehouse_id column has 15% null values. Results may be incomplete." |
| User queries data outside their access scope | RLS filter applied silently | User sees only their authorized rows; no error, no indication of filtered data |
| User attempts to export classified data | Export blocked with explanation | "This data contains PII-classified columns that cannot be exported without Data Steward approval." |
| User asks about data lineage | Context Graph queried for lineage | "This table is sourced from the orders database via a daily CDC pipeline, last updated 2 hours ago." |
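
As an illustration of the quality-warning scenario, the sketch below turns per-column null ratios into the kind of note shown in the table. The `quality_warnings` function and the 10% threshold are assumptions. The document does not state how the quality score is computed or when a warning is triggered.

```python
def quality_warnings(null_ratios, threshold=0.10):
    """Build user-facing warnings for columns whose null ratio meets or
    exceeds `threshold` (a hypothetical default)."""
    return [
        f"Note: The {column} column has {ratio:.0%} null values. "
        "Results may be incomplete."
        for column, ratio in null_ratios.items()
        if ratio >= threshold
    ]
```

For example, `quality_warnings({"warehouse_id": 0.15, "order_id": 0.0})` yields a single note about `warehouse_id`, which the conversational layer could append to its answer.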

Deep Dive References