Chapter 10: Data Catalog & Governance
Comprehensive metadata management, data lineage tracking, governance policy enforcement, semantic modeling, and data classification across the MATIH platform.
Learning Objectives
- Understand the Catalog Service architecture, metadata ingestion, and full-text search capabilities
- Learn the Asset Service unified registry, versioning lifecycle, permission model, and governance workflows
- Master data lineage tracking including column-level lineage, impact analysis, and OpenLineage integration
- Learn governance policy management, ABAC, RLS, data masking, and compliance reporting
- Build semantic models with dimensions, metrics, relationships, and natural language query translation
Details
- Chapter 4: Installation & Configuration
- Chapter 9: Query Engine
- Ch. 9: Query Engine
- Ch. 11: Pipelines & Data Engineering
- Ch. 12: AI Service
The Data Catalog & Governance layer provides the metadata backbone of the MATIH platform. It comprises five services that work together to catalog data assets, track data lineage, enforce governance policies, monitor data quality, and define a semantic layer over raw data.
Service Architecture
Services Overview
| Service | Technology | Port | Responsibilities |
|---|---|---|---|
| Catalog Service | Java 21, Spring Boot 3.2 | 8086 | Metadata discovery, search, tagging, lineage, classification, data sources |
| Asset Service | Java 21, Spring Boot 3.2 | 8093 | Unified asset registry, versioning, lifecycle governance, permissions, cloning |
| Governance Service | Java 21, Spring Boot 3.2 | 8080 | Policies, ABAC, RLS, masking, query audit, compliance, sensitive data |
| Semantic Layer | Java 21, Spring Boot 3.2 | 8086 | Semantic models, metrics, dimensions, NL-to-query, query optimization |
| Data Quality Service | Python, FastAPI | 8000 | Validation rules, profiling, anomaly detection, quality scoring |
Chapter Structure
Catalog Service
| Section | Description |
|---|---|
| Architecture | Service internals, OpenMetadata integration, Elasticsearch search |
| Search | Full-text search, autocomplete suggestions, search tracking |
| Databases | Database listing, FQN lookup, data source filtering |
| Tables | Table listing, schema introspection, column details, tag filtering |
| Tags | Tagging system, categories, tag-based asset discovery |
| Data Sources | Data source registration, CRUD, configuration management |
| Metadata Ingestion | Async and sync ingestion, OpenMetadata synchronization |
| Statistics | Catalog coverage metrics, asset counts, health indicators |
| API Reference | Complete REST API for all catalog endpoints |
Asset Service
| Section | Description |
|---|---|
| Architecture | Service internals, asset types, component layout, deployment |
| Version Lifecycle | State machine, approval workflow, governance policies |
| Permission Model | Hierarchical RBAC, effective permissions, ownership transfer |
| Cloning | Cross-asset cloning with provenance tracking |
| API Endpoints | Complete REST API for assets, versions, lifecycle, permissions, clones |
| Prometheus Metrics | Custom business counters for asset operations |
Data Lineage
| Section | Description |
|---|---|
| Overview | Lineage architecture, edge model, OpenLineage protocol |
| Upstream Lineage | Source dependency tracking, traversal depth control |
| Downstream Lineage | Impact analysis, consumer identification |
| Full Lineage | Complete graph construction, bidirectional traversal |
| Column-Level Lineage | SQL parsing, column mapping extraction, batch processing |
| Visualization & Export | Graph rendering, path finding, JSON/CSV/GraphML/DOT export |
| Creating Lineage | Manual and automated lineage creation, OpenLineage ingestion |
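Upstream and downstream lineage are both traversals of the same edge graph, walked in opposite directions and optionally cut off at a maximum depth. The sketch below illustrates that idea in Python. It is not the Catalog Service implementation: the edge list, the asset names, and the `traverse` function are hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical edge list: (source, target) means data flows source -> target.
EDGES = [
    ("raw.orders", "staging.orders"),
    ("raw.customers", "staging.customers"),
    ("staging.orders", "marts.revenue"),
    ("staging.customers", "marts.revenue"),
    ("marts.revenue", "dashboards.sales"),
]


def traverse(edges, start, direction="downstream", max_depth=None):
    """Return {asset: depth} for every asset reachable from `start`.

    direction="upstream" follows edges backwards (source dependencies);
    direction="downstream" follows them forwards (impact analysis).
    """
    if direction not in ("upstream", "downstream"):
        raise ValueError(f"unknown direction: {direction}")

    # Build an adjacency map oriented in the requested direction.
    adjacency = {}
    for source, target in edges:
        if direction == "downstream":
            adjacency.setdefault(source, []).append(target)
        else:
            adjacency.setdefault(target, []).append(source)

    # Breadth-first search, so each asset is recorded at its shortest distance.
    depths = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # depth limit reached: do not expand this node further
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                depths[neighbor] = depth + 1
                queue.append((neighbor, depth + 1))
    return depths


upstream = traverse(EDGES, "marts.revenue", direction="upstream")
impact = traverse(EDGES, "raw.orders", direction="downstream")
```

Here `upstream` lists the staging and raw tables that feed `marts.revenue`, while `impact` lists every asset affected by a change to `raw.orders`. A full lineage graph can be assembled by merging the results of both directions for the same starting asset.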
Governance
| Section | Description |
|---|---|
| Overview | Governance architecture, policy engine, OPA integration |
| Policy Management | Policy CRUD, lifecycle (draft/active/suspended), rule evaluation |
| Data Classification | Manual and auto-classification, sensitivity levels, verification |
| Data Masking | Masking types, auto-mask by category, batch masking, detokenization |
| ABAC | Attribute-based access control, OPA Rego generation |
| Row-Level Security | RLS policy definition, WHERE clause injection, audit logging |
| Query Audit | Execution audit trail, slow/failed/anomalous query detection |
| Sensitive Data | Sensitive data access monitoring, PII/PHI/PCI discovery |
| Compliance | GDPR, HIPAA, PCI-DSS reporting, control mapping |
| API Reference | Complete REST API for all governance endpoints |
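To make "WHERE clause injection" concrete, the following Python sketch wraps a user query in a subquery and filters it with a policy predicate. It is an illustration under stated assumptions, not the Governance Service implementation: `apply_rls`, the `orders` table, and the `region` attribute are hypothetical, and SQLite stands in for the real query engine.

```python
import sqlite3


def apply_rls(sql: str, predicate: str) -> str:
    """Wrap the caller's query so the RLS predicate filters its result rows.

    `predicate` comes from a trusted policy definition; user attributes are
    passed as bound parameters, never concatenated into the SQL text.
    """
    return f"SELECT * FROM ({sql}) AS rls_scope WHERE {predicate}"


# Hypothetical sample data; SQLite is used only to keep the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 120.0), (2, "US", 80.0), (3, "EU", 45.5)],
)

user_query = "SELECT id, region, amount FROM orders"
secured_query = apply_rls(user_query, "region = :region")

# The user's region attribute is bound at execution time.
rows = conn.execute(secured_query + " ORDER BY id", {"region": "EU"}).fetchall()
```

With the `EU` attribute bound, `rows` contains only orders 1 and 3. Wrapping the query works only when the filtered column appears in the projection; an engine that rewrites the parsed query plan avoids that limitation.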
Semantic Layer
| Section | Description |
|---|---|
| Architecture | Semantic layer design, WrenAI integration, MDL compiler |
| Semantic Models | Model creation, dimensions, metrics, status lifecycle |
| Metric Queries | Metric query execution, preview, compiled SQL |
| Natural Language | NL-to-semantic query translation, ask, explain, validate |
| Query Optimization | Rewriting, caching, cost estimation, table statistics |
| Advanced Metrics | Cumulative, period comparison, moving average, percentile, CAGR |
| Metric Versioning | Version history, comparison, rollback |
| Relationships | Model relationships, join paths, relationship types |
| API Reference | Complete REST API for all semantic layer endpoints |
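Several of the advanced metric types named above have standard definitions. As a rough Python sketch (the function names are hypothetical, and the Semantic Layer may compute these differently, for example in compiled SQL), the compound annual growth rate over $n$ periods is $(\text{end} / \text{start})^{1/n} - 1$, a moving average is the mean of a trailing window, and a cumulative metric is a running total:

```python
from itertools import accumulate


def cagr(start_value: float, end_value: float, periods: float) -> float:
    """Compound annual growth rate: (end / start) ** (1 / periods) - 1."""
    if start_value <= 0 or periods <= 0:
        raise ValueError("start_value and periods must be positive")
    return (end_value / start_value) ** (1 / periods) - 1


def moving_average(values: list[float], window: int) -> list[float]:
    """Mean of each trailing window; the first window - 1 points are dropped."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def cumulative(values: list[float]) -> list[float]:
    """Running total of the series."""
    return list(accumulate(values))


monthly_revenue = [10, 20, 30, 40]
smoothed = moving_average(monthly_revenue, window=2)
running_total = cumulative(monthly_revenue)
growth = cagr(start_value=100, end_value=121, periods=2)
```

For the sample series, `smoothed` is `[15.0, 25.0, 35.0]`, `running_total` is `[10, 30, 60, 100]`, and `growth` is approximately 0.10, that is, 10% per period.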
Key Source Files
| Component | Location |
|---|---|
| Catalog Controller | data-plane/catalog-service/src/main/java/com/matih/catalog/controller/CatalogController.java |
| Data Source Controller | data-plane/catalog-service/src/main/java/com/matih/catalog/controller/DataSourceController.java |
| Discovery Controller | data-plane/catalog-service/src/main/java/com/matih/catalog/controller/CatalogDiscoveryController.java |
| Lineage Controller | data-plane/catalog-service/src/main/java/com/matih/catalog/lineage/LineageController.java |
| Column Lineage Controller | data-plane/catalog-service/src/main/java/com/matih/catalog/lineage/ColumnLineageController.java |
| Lineage Visualization | data-plane/catalog-service/src/main/java/com/matih/catalog/controller/LineageVisualizationController.java |
| Classification Controller | data-plane/catalog-service/src/main/java/com/matih/catalog/classification/ClassificationController.java |
| Asset Controller | data-plane/asset-service/src/main/java/com/matih/asset/controller/AssetController.java |
| Version Lifecycle Controller | data-plane/asset-service/src/main/java/com/matih/asset/controller/VersionLifecycleController.java |
| Asset Permission Controller | data-plane/asset-service/src/main/java/com/matih/asset/controller/AssetPermissionController.java |
| Governance Controller | data-plane/governance-service/src/main/java/com/matih/governance/controller/GovernanceController.java |
| ABAC Controller | data-plane/governance-service/src/main/java/com/matih/governance/abac/controller/AbacController.java |
| RLS Controller | data-plane/governance-service/src/main/java/com/matih/governance/rls/controller/RlsController.java |
| Query Audit Controller | data-plane/governance-service/src/main/java/com/matih/governance/controller/QueryAuditController.java |
| Semantic Model Controller | data-plane/semantic-layer/src/main/java/com/matih/semantic/controller/SemanticModelController.java |
| Advanced Metric Controller | data-plane/semantic-layer/src/main/java/com/matih/semantic/controller/AdvancedMetricController.java |
| NL Controller | data-plane/semantic-layer/src/main/java/com/matih/semantic/controller/NaturalLanguageController.java |
| Query Optimization Controller | data-plane/semantic-layer/src/main/java/com/matih/semantic/controller/QueryOptimizationController.java |
Design Principles
- Single source of truth for metadata. OpenMetadata serves as the canonical metadata store. All metadata changes flow through the Catalog Service's synchronization layer.
- Lineage as infrastructure. Data lineage is not an afterthought but a core capability that informs impact analysis, debugging, and governance decisions.
- Policy as code. Governance policies are defined programmatically and support lifecycle management (draft, review, active, suspended).
- Quality is continuous. Data quality is monitored in real time through validation rules, anomaly detection, and quality scoring.
- Classification drives security. Sensitivity classifications assigned in the catalog directly control data masking and access policies throughout the platform.
- Semantic abstraction. The Semantic Layer translates business language into SQL, enabling non-technical users to query data through natural language.
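The "classification drives security" principle can be pictured as a lookup from a column's classification to a masking strategy. The Python sketch below is a minimal illustration under assumptions: the classification labels, the strategy names, the mapping between them, and `mask_value` are all hypothetical and are not taken from the platform's actual configuration.

```python
import hashlib

# Hypothetical mapping from catalog classification to masking strategy.
MASKING_BY_CLASSIFICATION = {
    "PUBLIC": "none",
    "INTERNAL": "none",
    "PII": "partial",
    "PCI": "hash",
    "PHI": "redact",
}


def mask_value(value: str, classification: str) -> str:
    """Mask `value` according to its classification.

    Unknown classifications fall back to full redaction (fail closed).
    """
    strategy = MASKING_BY_CLASSIFICATION.get(classification, "redact")
    if strategy == "none":
        return value
    if strategy == "partial":
        # Keep the last four characters; fully mask very short values.
        if len(value) <= 4:
            return "*" * len(value)
        return "*" * (len(value) - 4) + value[-4:]
    if strategy == "hash":
        # Deterministic digest, so equal inputs still compare equal after masking.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]
    return "[REDACTED]"


masked_phone = mask_value("555-867-5309", "PII")
masked_diagnosis = mask_value("hypertension", "PHI")
```

Because the strategy is looked up from the classification rather than hard-coded per column, reclassifying a column in the catalog changes how its values are masked without touching query code.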
How This Chapter Connects
- The Query Engine (Chapter 9) uses catalog metadata for query optimization, RLS policy evaluation, and data masking rules
- The Asset Service stores versioned assets (queries, dashboards, pipelines, models) with lifecycle governance and cross-service permission control
- The AI Service (Chapter 12) uses catalog metadata for text-to-SQL schema context and data understanding
- The Pipeline Service (Chapter 11) publishes lineage events and consumes quality validation rules
- The Semantic Layer provides the metric definitions that the BI Service uses to build dashboards
Begin with the Catalog Service Architecture to understand the metadata backbone, then explore the Asset Service for the unified asset registry.