Data Governance
The Data Governance pillar ensures that data across the MATIH Platform is discoverable, trustworthy, secure, and compliant. Unlike standalone governance tools that operate separately from the analytics workflow, MATIH's governance capabilities are integrated directly into the query pipeline, the dashboard rendering, the ML training workflow, and the conversational analytics interface.
1.1 Data Catalog
The catalog-service (Java, port 8086) provides a centralized metadata repository:
| Capability | Description | Implementation |
|---|---|---|
| Automated discovery | Crawl connected Trino catalogs to discover tables, columns, and schemas | Scheduled crawlers with incremental sync |
| Business glossary | Define business terms and map them to technical schema elements | Term management with approval workflows |
| Data lineage | Track data flow from source through transformations to consumption | Automatic lineage from query execution; manual lineage for external processes |
| Search | Full-text search across all cataloged assets with faceted filtering | Elasticsearch-backed with tenant-scoped indices |
| Classification | Tag sensitive data (PII, PHI, financial, confidential) | Automatic classification via pattern matching + manual overrides |
| Ownership | Assign data stewards and owners to tables and schemas | RBAC-integrated ownership model |
| Documentation | Rich descriptions for tables, columns, and schemas | Markdown support with version history |
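The capabilities above imply a small core data model: tables with columns, ownership, and classification tags that faceted search can filter on. The sketch below is illustrative only (the class and field names are assumptions, not the catalog-service's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class CatalogColumn:
    name: str
    data_type: str
    classifications: set[str] = field(default_factory=set)  # e.g. {"PII"}

@dataclass
class CatalogTable:
    schema: str
    name: str
    owner: str            # assigned data steward/owner
    description: str      # Markdown-capable documentation
    columns: list[CatalogColumn]

def find_tables_with_classification(tables: list[CatalogTable], tag: str) -> list[CatalogTable]:
    """Faceted-style filter: tables containing at least one column tagged `tag`."""
    return [t for t in tables if any(tag in c.classifications for c in t.columns)]

orders = CatalogTable(
    schema="analytics", name="orders_daily", owner="data-steward@example.com",
    description="Daily order rollup.",
    columns=[CatalogColumn("customer_email", "varchar", {"PII"}),
             CatalogColumn("order_total", "decimal")],
)
print([t.name for t in find_tables_with_classification([orders], "PII")])  # ['orders_daily']
```

In the real service this filtering is served by the Elasticsearch-backed, tenant-scoped search index rather than an in-memory scan.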
Schema Discovery Workflow
Discovery Crawler (scheduled or manual)
|
+-- Connect to Trino catalog for tenant
| |
| +-- List all schemas
| +-- List all tables per schema
| +-- Describe columns (name, type, nullable, comment)
|
+-- Compare with existing catalog metadata
| |
| +-- New tables -> Create catalog entries
| +-- Changed columns -> Flag schema drift
| +-- Dropped tables -> Mark as deprecated
|
+-- Publish events
|
+-- schema.discovered -> notification-service (new data available)
+-- schema.drifted -> data-quality-service (quality re-evaluation)
+-- schema.deprecated -> Context Graph (impact analysis)

1.2 Data Lineage
Data lineage in MATIH is tracked at multiple granularities:
| Lineage Level | Granularity | Source |
|---|---|---|
| Table-level | Source table -> Target table | Pipeline execution records in pipeline-service |
| Column-level | Source column -> Target column | SQL parsing of transformation queries |
| Query-level | Query -> Tables read | Trino query execution logs |
| Dashboard-level | Dashboard -> Queries -> Tables | bi-service query references |
| Model-level | Model -> Training tables -> Features | ml-service experiment records |
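Query-level lineage (query -> tables read) can be recovered from the SQL text itself. The regex sketch below is a deliberately naive illustration; a production implementation, like the column-level parsing the table mentions, would use a full SQL parser rather than pattern matching:

```python
import re

def tables_read(sql: str) -> set[str]:
    """Naive extraction of tables referenced in FROM/JOIN clauses.
    Handles simple SELECT statements only; subqueries, CTEs, and
    quoted identifiers need a real SQL parser."""
    pattern = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)
    return set(pattern.findall(sql))

sql = """
SELECT o.region, SUM(o.total)
FROM analytics.orders_daily o
JOIN dim.regions r ON o.region_id = r.id
GROUP BY o.region
"""
print(sorted(tables_read(sql)))  # ['analytics.orders_daily', 'dim.regions']
```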
Lineage in the Context Graph
All lineage information is stored in the Context Graph (Neo4j) as relationships between nodes:
Source DB (Postgres)
|
[EXTRACTED_BY] Pipeline: daily_orders_load
|
v
Staging Table: raw.orders
|
[TRANSFORMED_BY] Pipeline: orders_transform
|
v
Analytics Table: analytics.orders_daily
|
+--[QUERIED_BY] Dashboard: "Sales Overview"
|
+--[QUERIED_BY] Conversation: "Show me revenue trends"
|
+--[TRAINED_ON] Model: "demand_forecast_v2"

Impact Analysis
The lineage graph enables powerful impact analysis queries:
| Question | Graph Query | Result |
|---|---|---|
| "What dashboards break if I drop this column?" | Traverse READS_FROM -> DISPLAYS edges from Column | List of affected dashboards with widget details |
| "What is the source of this metric?" | Traverse backward from SemanticMetric through transformations | Complete source-to-metric lineage chain |
| "Who uses this table?" | Find all edges from Table node | Users, queries, dashboards, models, and pipelines |
| "What models are affected by this data quality issue?" | Traverse from QualityAlert through Table to Model | Models with degraded training data |
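Each of these questions reduces to a reachability traversal over the lineage graph. The sketch below models that traversal over a tiny in-memory edge list; the node names and relationships are illustrative analogues of the Context Graph, where the real traversal runs as a Cypher query against Neo4j:

```python
from collections import deque

# Directed (source, relationship) -> targets, mirroring the lineage
# relationships above. Illustrative data, not the actual graph schema.
EDGES = {
    ("Column:orders.region", "READS_FROM"): ["Query:q_sales"],
    ("Query:q_sales", "DISPLAYS"): ["Dashboard:Sales Overview"],
    ("Table:orders", "TRAINED_ON"): ["Model:demand_forecast_v2"],
}

def downstream(node: str) -> set[str]:
    """BFS over outgoing edges: everything affected if `node` changes."""
    seen: set[str] = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for (src, _rel), targets in EDGES.items():
            if src == current:
                for target in targets:
                    if target not in seen:
                        seen.add(target)
                        queue.append(target)
    return seen

# "What breaks if I drop this column?"
print(sorted(downstream("Column:orders.region")))
# ['Dashboard:Sales Overview', 'Query:q_sales']
```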
1.3 Data Classification and Masking
The governance-service (Python, port 8080) manages data classification and protection:
Automatic Classification
The governance-service runs classification rules against catalog metadata to automatically identify sensitive data:
| Classification | Pattern | Examples |
|---|---|---|
| PII (Personally Identifiable Information) | Name patterns, email regex, phone regex, SSN patterns | customer_name, email_address, phone_number |
| PHI (Protected Health Information) | Medical record patterns, diagnosis codes, treatment fields | patient_id, diagnosis_code, medication |
| Financial | Account number patterns, credit card regex, transaction fields | account_number, credit_card, salary |
| Confidential | Manual classification by data stewards | internal_cost, margin, strategy_notes |
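The pattern-matching classifiers in the first three rows amount to regex rules applied to catalog metadata such as column names. A minimal sketch, assuming hypothetical rule definitions (the governance-service's actual rule set is richer and also inspects sample values):

```python
import re

# Illustrative rules keyed by classification label; patterns here match
# column names only. These are assumptions, not the shipped rule set.
CLASSIFICATION_RULES = {
    "PII": re.compile(r"(email|phone|ssn|customer_name)", re.IGNORECASE),
    "PHI": re.compile(r"(patient|diagnosis|medication)", re.IGNORECASE),
    "Financial": re.compile(r"(account_number|credit_card|salary)", re.IGNORECASE),
}

def classify_column(column_name: str) -> list[str]:
    """Return every classification whose pattern matches the column name."""
    return [label for label, rx in CLASSIFICATION_RULES.items()
            if rx.search(column_name)]

print(classify_column("email_address"))   # ['PII']
print(classify_column("diagnosis_code"))  # ['PHI']
```

Note that the "Confidential" classification has no pattern by design: it exists only through manual tagging by data stewards, with manual overrides also taking precedence over any automatic match.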
Column Masking Policies
Once data is classified, masking policies can be applied:
| Masking Type | Description | Example |
|---|---|---|
| Full mask | Replace entire value with a constant | John Smith -> **** |
| Partial mask | Show first/last characters only | john@company.com -> j***@c***y.com |
| Hash | Replace with deterministic hash (preserves joins) | SSN: 123-45-6789 -> SSN: a7f3b... |
| Null | Replace with NULL | salary: 150000 -> salary: NULL |
| Range | Replace precise value with range | age: 34 -> age: 30-40 |
| Custom function | Apply tenant-defined masking logic | Configurable via governance-service API |
Masking is enforced at the semantic layer, ensuring that masked data appears consistently regardless of access path (dashboard, conversational query, API, or direct SQL).
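The first five masking types can be sketched as pure functions over a value. This is an illustration of the semantics, not the semantic layer's implementation (which applies masking during query rewriting):

```python
import hashlib

def mask_full(value: str) -> str:
    """Full mask: replace the entire value with a constant."""
    return "****"

def mask_partial(value: str, keep: int = 1) -> str:
    """Partial mask: keep only the first and last `keep` characters."""
    if len(value) <= 2 * keep:
        return "*" * len(value)
    return value[:keep] + "*" * (len(value) - 2 * keep) + value[-keep:]

def mask_hash(value: str) -> str:
    """Deterministic hash: equal inputs yield equal outputs, so joins
    across masked columns still work."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def mask_null(value: str):
    """Null mask: drop the value entirely."""
    return None

def mask_range(value: int, width: int = 10) -> str:
    """Range mask: bucket a precise number, e.g. age 34 -> '30-40'."""
    low = (value // width) * width
    return f"{low}-{low + width}"

print(mask_partial("John Smith"))  # 'J********h'
print(mask_range(34))              # '30-40'
```

The deterministic property of `mask_hash` is the key design choice: two rows masked independently still join on the masked column, which a random salt per value would break.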
1.4 Access Control and Compliance
Row-Level Security
Row-level security (RLS) filters query results based on user attributes:
Policy: Regional Access
Rule: Users with role "Regional Manager" see only data
WHERE region = user.assigned_region
Applied to: All tables with a "region" column
Enforcement: Semantic layer appends filter to every query

RLS policies are defined in the governance-service and enforced by the semantic layer, ensuring they apply to:
- Dashboard queries
- Conversational analytics queries
- Direct API queries
- ML training data access
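The "appends filter to every query" enforcement step can be sketched as a query rewrite. The real semantic layer operates on a parsed query representation rather than strings; this is a simplified illustration of the Regional Access policy above:

```python
def apply_rls(sql: str, user: dict) -> str:
    """Wrap a query so that only the user's assigned region is visible.
    A production semantic layer would rewrite the query AST; string
    wrapping here is only a sketch of the policy semantics."""
    if user.get("role") != "Regional Manager":
        return sql  # policy applies only to this role
    region = user["assigned_region"].replace("'", "''")  # naive escaping
    return f"SELECT * FROM ({sql}) rls WHERE region = '{region}'"

user = {"role": "Regional Manager", "assigned_region": "EMEA"}
print(apply_rls("SELECT region, revenue FROM analytics.orders_daily", user))
# SELECT * FROM (SELECT region, revenue FROM analytics.orders_daily) rls WHERE region = 'EMEA'
```

Because the rewrite happens in one place, every access path listed above inherits the same filter without per-client logic.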
Audit Trail
The audit-service (Java, port 8086) captures every significant action across all services:
| Audit Event | Captured Data | Retention |
|---|---|---|
| Data access | User, tenant, table, columns, row count, timestamp | 90 days (configurable) |
| Query execution | Full SQL, execution plan, duration, result size | 90 days |
| Configuration change | Old value, new value, user, timestamp | 365 days |
| Policy change | RLS rule, masking rule, classification change | 365 days |
| User action | Login, logout, role change, permission grant | 365 days |
| Model deployment | Model version, deployer, approval chain | 365 days |
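The per-event retention policy in the table reduces to a simple age check against a per-type limit. A minimal sketch (field names and the retention map are assumptions mirroring the table, not the audit-service's actual schema):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Retention in days per event type, mirroring the table above (configurable).
RETENTION_DAYS = {
    "data_access": 90,
    "query_execution": 90,
    "config_change": 365,
    "policy_change": 365,
}

@dataclass
class AuditEvent:
    event_type: str
    user: str
    tenant: str
    timestamp: datetime

def is_expired(event: AuditEvent, now: datetime) -> bool:
    """True once the event has outlived its type's retention window."""
    limit = timedelta(days=RETENTION_DAYS[event.event_type])
    return now - event.timestamp > limit

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
old_access = AuditEvent("data_access", "alice", "t1", now - timedelta(days=120))
print(is_expired(old_access, now))  # True: past the 90-day window
```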
Compliance Reporting
MATIH provides pre-built compliance report templates:
| Compliance Framework | Report Contents |
|---|---|
| GDPR | Data subject access requests, processing records, consent tracking, right to erasure audit |
| HIPAA | PHI access logs, minimum necessary checks, breach notification timeline |
| SOC 2 | Access control effectiveness, change management records, incident response timeline |
| CCPA | Consumer data inventory, sale/sharing records, opt-out compliance |
1.5 Governance in the Conversational Flow
Governance is not a separate workflow in MATIH. It is embedded in the conversational analytics experience:
| Scenario | Governance Action | User Experience |
|---|---|---|
| User queries a table with PII | Column masking applied automatically | Masked values appear in results with a governance indicator |
| User queries data with low quality score | Quality warning included in response | "Note: The warehouse_id column has 15% null values. Results may be incomplete." |
| User queries data outside their access scope | RLS filter applied silently | User sees only their authorized rows; no error, no indication of filtered data |
| User attempts to export classified data | Export blocked with explanation | "This data contains PII-classified columns that cannot be exported without Data Steward approval." |
| User asks about data lineage | Context Graph queried for lineage | "This table is sourced from the orders database via a daily CDC pipeline, last updated 2 hours ago." |
Deep Dive References
- Data Catalog -- Complete catalog service architecture and API reference
- Lineage Tracking -- Lineage collection, storage, and visualization
- Governance Service -- Classification, masking, and compliance policies
- Semantic Layer -- Metric definitions and security enforcement
- Context Graph -- Neo4j graph model and query patterns