Data Governance
The Data Governance pillar ensures that data across the MATIH Platform is discoverable, trustworthy, secure, and compliant. Unlike standalone governance tools that operate separately from the analytics workflow, MATIH's governance capabilities are integrated directly into the query pipeline, the dashboard rendering, the ML training workflow, and the conversational analytics interface.
1.1 Data Catalog
The catalog-service (Java, port 8086) provides a centralized metadata repository:
| Capability | Description | Implementation |
|---|---|---|
| Automated discovery | Crawl connected Trino catalogs to discover tables, columns, and schemas | Scheduled crawlers with incremental sync |
| Business glossary | Define business terms and map them to technical schema elements | Term management with approval workflows |
| Data lineage | Track data flow from source through transformations to consumption | Automatic lineage from query execution; manual lineage for external processes |
| Search | Full-text search across all cataloged assets with faceted filtering | Elasticsearch-backed with tenant-scoped indices |
| Classification | Tag sensitive data (PII, PHI, financial, confidential) | Automatic classification via pattern matching + manual overrides |
| Ownership | Assign data stewards and owners to tables and schemas | RBAC-integrated ownership model |
| Documentation | Rich descriptions for tables, columns, and schemas | Markdown support with version history |
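The capabilities above imply a small core data model: tables with columns, ownership, and classification tags that faceted search can filter on. The sketch below is illustrative only (the class and field names are assumptions, not the catalog-service's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class CatalogColumn:
    name: str
    data_type: str
    classifications: set[str] = field(default_factory=set)  # e.g. {"PII"}

@dataclass
class CatalogTable:
    schema: str
    name: str
    owner: str            # assigned data steward/owner
    description: str      # Markdown-capable documentation
    columns: list[CatalogColumn]

def find_tables_with_classification(tables: list[CatalogTable], tag: str) -> list[CatalogTable]:
    """Faceted-style filter: tables containing at least one column tagged `tag`."""
    return [t for t in tables if any(tag in c.classifications for c in t.columns)]

orders = CatalogTable(
    schema="analytics", name="orders_daily", owner="data-steward@example.com",
    description="Daily order rollup.",
    columns=[CatalogColumn("customer_email", "varchar", {"PII"}),
             CatalogColumn("order_total", "decimal")],
)
print([t.name for t in find_tables_with_classification([orders], "PII")])  # ['orders_daily']
```

In the real service this filtering is served by the Elasticsearch-backed, tenant-scoped search index rather than an in-memory scan.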
Schema Discovery Workflow
Discovery Crawler (scheduled or manual)
|
+-- Connect to Trino catalog for tenant
| |
| +-- List all schemas
| +-- List all tables per schema
| +-- Describe columns (name, type, nullable, comment)
|
+-- Compare with existing catalog metadata
| |
| +-- New tables -> Create catalog entries
| +-- Changed columns -> Flag schema drift
| +-- Dropped tables -> Mark as deprecated
|
+-- Publish events
|
+-- schema.discovered -> notification-service (new data available)
+-- schema.drifted -> data-quality-service (quality re-evaluation)
+-- schema.deprecated -> Context Graph (impact analysis)

1.2 Data Lineage
Data lineage in MATIH is tracked at multiple granularities:
| Lineage Level | Granularity | Source |
|---|---|---|
| Table-level | Source table -> Target table | Pipeline execution records in pipeline-service |
| Column-level | Source column -> Target column | SQL parsing of transformation queries |
| Query-level | Query -> Tables read | Trino query execution logs |
| Dashboard-level | Dashboard -> Queries -> Tables | bi-service query references |
| Model-level | Model -> Training tables -> Features | ml-service experiment records |
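Query-level lineage (query -> tables read) can be recovered from the SQL text itself. The regex sketch below is a deliberately naive illustration; a production implementation, like the column-level parsing the table mentions, would use a full SQL parser rather than pattern matching:

```python
import re

def tables_read(sql: str) -> set[str]:
    """Naive extraction of tables referenced in FROM/JOIN clauses.
    Handles simple SELECT statements only; subqueries, CTEs, and
    quoted identifiers need a real SQL parser."""
    pattern = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)
    return set(pattern.findall(sql))

sql = """
SELECT o.region, SUM(o.total)
FROM analytics.orders_daily o
JOIN dim.regions r ON o.region_id = r.id
GROUP BY o.region
"""
print(sorted(tables_read(sql)))  # ['analytics.orders_daily', 'dim.regions']
```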
Lineage in the Context Graph
All lineage information is stored in the Context Graph (Neo4j) as relationships between nodes:
Source DB (Postgres)
|
[EXTRACTED_BY] Pipeline: daily_orders_load
|
v
Staging Table: raw.orders
|
[TRANSFORMED_BY] Pipeline: orders_transform
|
v
Analytics Table: analytics.orders_daily
|
+--[QUERIED_BY] Dashboard: "Sales Overview"
|
+--[QUERIED_BY] Conversation: "Show me revenue trends"
|
+--[TRAINED_ON] Model: "demand_forecast_v2"

Impact Analysis
The lineage graph enables powerful impact analysis queries:
| Question | Graph Query | Result |
|---|---|---|
| "What dashboards break if I drop this column?" | Traverse READS_FROM -> DISPLAYS edges from Column | List of affected dashboards with widget details |
| "What is the source of this metric?" | Traverse backward from SemanticMetric through transformations | Complete source-to-metric lineage chain |
| "Who uses this table?" | Find all edges from Table node | Users, queries, dashboards, models, and pipelines |
| "What models are affected by this data quality issue?" | Traverse from QualityAlert through Table to Model | Models with degraded training data |
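Each of these questions reduces to a reachability traversal over the lineage graph. The sketch below models that traversal over a tiny in-memory edge list; the node names and relationships are illustrative analogues of the Context Graph, where the real traversal runs as a Cypher query against Neo4j:

```python
from collections import deque

# Directed (source, relationship) -> targets, mirroring the lineage
# relationships above. Illustrative data, not the actual graph schema.
EDGES = {
    ("Column:orders.region", "READS_FROM"): ["Query:q_sales"],
    ("Query:q_sales", "DISPLAYS"): ["Dashboard:Sales Overview"],
    ("Table:orders", "TRAINED_ON"): ["Model:demand_forecast_v2"],
}

def downstream(node: str) -> set[str]:
    """BFS over outgoing edges: everything affected if `node` changes."""
    seen: set[str] = set()
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for (src, _rel), targets in EDGES.items():
            if src == current:
                for target in targets:
                    if target not in seen:
                        seen.add(target)
                        queue.append(target)
    return seen

# "What breaks if I drop this column?"
print(sorted(downstream("Column:orders.region")))
# ['Dashboard:Sales Overview', 'Query:q_sales']
```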
1.3 Data Classification and Masking
The governance-service (Python, port 8080) manages data classification and protection:
Automatic Classification
The governance-service runs classification rules against catalog metadata to automatically identify sensitive data:
| Classification | Pattern | Examples |
|---|---|---|
| PII (Personally Identifiable Information) | Name patterns, email regex, phone regex, SSN patterns | customer_name, email_address, phone_number |
| PHI (Protected Health Information) | Medical record patterns, diagnosis codes, treatment fields | patient_id, diagnosis_code, medication |
| Financial | Account number patterns, credit card regex, transaction fields | account_number, credit_card, salary |
| Confidential | Manual classification by data stewards | internal_cost, margin, strategy_notes |
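The pattern-matching classifiers in the first three rows amount to regex rules applied to catalog metadata such as column names. A minimal sketch, assuming hypothetical rule definitions (the governance-service's actual rule set is richer and also inspects sample values):

```python
import re

# Illustrative rules keyed by classification label; patterns here match
# column names only. These are assumptions, not the shipped rule set.
CLASSIFICATION_RULES = {
    "PII": re.compile(r"(email|phone|ssn|customer_name)", re.IGNORECASE),
    "PHI": re.compile(r"(patient|diagnosis|medication)", re.IGNORECASE),
    "Financial": re.compile(r"(account_number|credit_card|salary)", re.IGNORECASE),
}

def classify_column(column_name: str) -> list[str]:
    """Return every classification whose pattern matches the column name."""
    return [label for label, rx in CLASSIFICATION_RULES.items()
            if rx.search(column_name)]

print(classify_column("email_address"))   # ['PII']
print(classify_column("diagnosis_code"))  # ['PHI']
```

Note that the "Confidential" classification has no pattern by design: it exists only through manual tagging by data stewards, with manual overrides also taking precedence over any automatic match.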
Column Masking Policies
Once data is classified, masking policies can be applied:
| Masking Type | Description | Example |
|---|---|---|
| Full mask | Replace entire value with a constant | John Smith -> **** |
| Partial mask | Show first/last characters only | john@company.com -> j***@c***y.com |
| Hash | Replace with deterministic hash (preserves joins) | SSN: 123-45-6789 -> SSN: a7f3b... |
| Null | Replace with NULL | salary: 150000 -> salary: NULL |
| Range | Replace precise value with range | age: 34 -> age: 30-40 |
| Custom function | Apply tenant-defined masking logic | Configurable via governance-service API |
Masking is enforced at the semantic layer, ensuring that masked data appears consistently regardless of access path (dashboard, conversational query, API, or direct SQL).
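The first five masking types can be sketched as pure functions over a value. This is an illustration of the semantics, not the semantic layer's implementation (which applies masking during query rewriting):

```python
import hashlib

def mask_full(value: str) -> str:
    """Full mask: replace the entire value with a constant."""
    return "****"

def mask_partial(value: str, keep: int = 1) -> str:
    """Partial mask: keep only the first and last `keep` characters."""
    if len(value) <= 2 * keep:
        return "*" * len(value)
    return value[:keep] + "*" * (len(value) - 2 * keep) + value[-keep:]

def mask_hash(value: str) -> str:
    """Deterministic hash: equal inputs yield equal outputs, so joins
    across masked columns still work."""
    return hashlib.sha256(value.encode()).hexdigest()[:12]

def mask_null(value: str):
    """Null mask: drop the value entirely."""
    return None

def mask_range(value: int, width: int = 10) -> str:
    """Range mask: bucket a precise number, e.g. age 34 -> '30-40'."""
    low = (value // width) * width
    return f"{low}-{low + width}"

print(mask_partial("John Smith"))  # 'J********h'
print(mask_range(34))              # '30-40'
```

The deterministic property of `mask_hash` is the key design choice: two rows masked independently still join on the masked column, which a random salt per value would break.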
1.4 Access Control and Compliance
Row-Level Security
Row-level security (RLS) filters query results based on user attributes:
Policy: Regional Access
Rule: Users with role "Regional Manager" see only data
WHERE region = user.assigned_region
Applied to: All tables with a "region" column
Enforcement: Semantic layer appends filter to every query

RLS policies are defined in the governance-service and enforced by the semantic layer, ensuring they apply to:
- Dashboard queries
- Conversational analytics queries
- Direct API queries
- ML training data access
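The "appends filter to every query" enforcement step can be sketched as a query rewrite. The real semantic layer operates on a parsed query representation rather than strings; this is a simplified illustration of the Regional Access policy above:

```python
def apply_rls(sql: str, user: dict) -> str:
    """Wrap a query so that only the user's assigned region is visible.
    A production semantic layer would rewrite the query AST; string
    wrapping here is only a sketch of the policy semantics."""
    if user.get("role") != "Regional Manager":
        return sql  # policy applies only to this role
    region = user["assigned_region"].replace("'", "''")  # naive escaping
    return f"SELECT * FROM ({sql}) rls WHERE region = '{region}'"

user = {"role": "Regional Manager", "assigned_region": "EMEA"}
print(apply_rls("SELECT region, revenue FROM analytics.orders_daily", user))
# SELECT * FROM (SELECT region, revenue FROM analytics.orders_daily) rls WHERE region = 'EMEA'
```

Because the rewrite happens in one place, every access path listed above inherits the same filter without per-client logic.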
Audit Trail
The audit-service (Java, port 8086) captures every significant action across all services:
| Audit Event | Captured Data | Retention |
|---|---|---|
| Data access | User, tenant, table, columns, row count, timestamp | 90 days (configurable) |
| Query execution | Full SQL, execution plan, duration, result size | 90 days |
| Configuration change | Old value, new value, user, timestamp | 365 days |
| Policy change | RLS rule, masking rule, classification change | 365 days |
| User action | Login, logout, role change, permission grant | 365 days |
| Model deployment | Model version, deployer, approval chain | 365 days |
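The per-event retention policy in the table reduces to a simple age check against a per-type limit. A minimal sketch (field names and the retention map are assumptions mirroring the table, not the audit-service's actual schema):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Retention in days per event type, mirroring the table above (configurable).
RETENTION_DAYS = {
    "data_access": 90,
    "query_execution": 90,
    "config_change": 365,
    "policy_change": 365,
}

@dataclass
class AuditEvent:
    event_type: str
    user: str
    tenant: str
    timestamp: datetime

def is_expired(event: AuditEvent, now: datetime) -> bool:
    """True once the event has outlived its type's retention window."""
    limit = timedelta(days=RETENTION_DAYS[event.event_type])
    return now - event.timestamp > limit

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
old_access = AuditEvent("data_access", "alice", "t1", now - timedelta(days=120))
print(is_expired(old_access, now))  # True: past the 90-day window
```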
Compliance Reporting
MATIH provides pre-built compliance report templates:
| Compliance Framework | Report Contents |
|---|---|
| GDPR | Data subject access requests, processing records, consent tracking, right to erasure audit |
| HIPAA | PHI access logs, minimum necessary checks, breach notification timeline |
| SOC 2 | Access control effectiveness, change management records, incident response timeline |
| CCPA | Consumer data inventory, sale/sharing records, opt-out compliance |
1.5 Governance in the Conversational Flow
Governance is not a separate workflow in MATIH. It is embedded in the conversational analytics experience:
| Scenario | Governance Action | User Experience |
|---|---|---|
| User queries a table with PII | Column masking applied automatically | Masked values appear in results with a governance indicator |
| User queries data with low quality score | Quality warning included in response | "Note: The warehouse_id column has 15% null values. Results may be incomplete." |
| User queries data outside their access scope | RLS filter applied silently | User sees only their authorized rows; no error, no indication of filtered data |
| User attempts to export classified data | Export blocked with explanation | "This data contains PII-classified columns that cannot be exported without Data Steward approval." |
| User asks about data lineage | Context Graph queried for lineage | "This table is sourced from the orders database via a daily CDC pipeline, last updated 2 hours ago." |
Deep Dive References
- Data Catalog -- Complete catalog service architecture and API reference
- Lineage Tracking -- Lineage collection, storage, and visualization
- Governance Service -- Classification, masking, and compliance policies
- Semantic Layer -- Metric definitions and security enforcement
- Context Graph -- Neo4j graph model and query patterns