Catalog and Semantic Layer Architecture
The Catalog Service, Semantic Layer, and Governance Service form the metadata management triad of the Data Plane. Together they provide schema discovery, business-friendly metric abstraction, data lineage, classification, and access policy enforcement.
2.4.D.1 Catalog Service
The Catalog Service manages the platform's metadata catalog -- databases, tables, columns, schemas, lineage, and data classifications.
Core responsibilities:
- Data source registration and automated schema discovery
- Table and column metadata management with type information
- Data lineage tracking (upstream/downstream dependencies)
- Schema change detection and versioning
- Integration with OpenMetadata for metadata interchange
- Full-text search and discovery via tags and classifications
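
Lineage tracking, one of the responsibilities listed above, amounts to walking a directed dependency graph in both directions. The following is a minimal in-memory sketch under stated assumptions, not the Catalog Service's actual storage model: the `LineageGraph` class and the asset names (`raw.orders`, `staging.orders`, and so on) are hypothetical and exist only for illustration.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Directed graph where an edge (a -> b) means asset b is derived from asset a.

    Hypothetical illustration; the real catalog may store lineage differently.
    """

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_edge(self, source, target):
        # Record the dependency in both directions so either walk is cheap.
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    def _walk(self, start, edges):
        # Breadth-first traversal; the `seen` set also guards against cycles.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in edges.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        return seen

    def upstream(self, asset):
        """All assets that `asset` depends on, directly or transitively."""
        return self._walk(asset, self._upstream)

    def downstream(self, asset):
        """All assets affected by a change to `asset`."""
        return self._walk(asset, self._downstream)


graph = LineageGraph()
graph.add_edge("raw.orders", "staging.orders")
graph.add_edge("staging.orders", "marts.revenue_by_region")
graph.add_edge("raw.geography", "marts.revenue_by_region")
```

Calling `graph.downstream("raw.orders")` gives the set of assets to re-check when the schema of `raw.orders` changes, which is the kind of question a lineage endpoint typically answers.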
Schema discovery flow:
1. Data source registered (PostgreSQL, S3, Iceberg, etc.)
2. Catalog service connects via appropriate connector
3. Introspects schema: databases, tables, columns, types
4. Stores metadata in catalog database
5. Publishes CATALOG_UPDATED event to Kafka
6. AI service receives event, updates RAG context

Key APIs:
| Endpoint | Method | Description |
|---|---|---|
| /api/v1/catalog/sources | GET/POST | Data source management |
| /api/v1/catalog/tables | GET | List tables with metadata |
| /api/v1/catalog/tables/{id}/columns | GET | Column metadata with types |
| /api/v1/catalog/lineage/{id} | GET | Data lineage graph |
| /api/v1/catalog/search | GET | Full-text metadata search |
| /api/v1/catalog/classifications | GET/POST | Data classification tags |
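
Steps 3 to 5 of the schema discovery flow (introspect, store, publish) can be sketched as follows. This is an illustration under stated assumptions: SQLite stands in for a real connector, a plain dictionary stands in for the catalog database, and a Python list stands in for the Kafka topic. The function names, the `demo-sqlite` source ID, and the event payload shape are hypothetical; only the `CATALOG_UPDATED` event name comes from the flow itself.

```python
import json
import sqlite3


def discover_schema(conn):
    """Introspect tables and columns (step 3), using SQLite as a stand-in source."""
    tables = {}
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table_name,) in rows:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table_name}")').fetchall()
        tables[table_name] = [
            {"name": col[1], "type": col[2], "nullable": not col[3]}
            for col in columns
        ]
    return tables


def register_source(source_id, conn, catalog, publish):
    """Run discovery, store the result (step 4), and emit an event (step 5)."""
    metadata = discover_schema(conn)
    catalog[source_id] = metadata
    # Assumed payload shape; the real event schema is not specified here.
    event = {
        "type": "CATALOG_UPDATED",
        "source_id": source_id,
        "tables": sorted(metadata),
    }
    publish(json.dumps(event))
    return metadata


# Stand-ins: an in-memory database as the source, a dict as the catalog
# database, and a list's append method as the Kafka producer.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL NOT NULL, geo_id INTEGER)"
)
catalog, events = {}, []
register_source("demo-sqlite", source, catalog, events.append)
```

Passing `publish` as a callable keeps the discovery logic independent of the message broker, so the same function could be exercised in tests without Kafka.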
2.4.D.2 Semantic Layer
The Semantic Layer provides a business-friendly abstraction over raw database schemas. It translates business concepts (metrics, dimensions, relationships) into SQL.
Semantic model structure:

    Semantic Model
    +-- Entities (e.g., "Customer", "Order", "Product")
    |   +-- Dimensions (e.g., "region", "category", "date")
    |   +-- Metrics (e.g., "total_revenue", "avg_order_value")
    |   +-- Relationships (e.g., Customer --has_many--> Order)
    |
    +-- Calculations
        +-- Simple: SUM(amount), COUNT(DISTINCT customer_id)
        +-- Derived: revenue_per_customer = revenue / customer_count
        +-- Time Intelligence: YoY growth, MTD, rolling 7-day average

Query translation example:
Semantic query:

    "total_revenue by region for Q4 2025"

Generated SQL:

    SELECT
        d.region,
        SUM(o.amount) AS total_revenue
    FROM orders o
    JOIN dim_geography d ON o.geo_id = d.id
    WHERE o.order_date BETWEEN '2025-10-01' AND '2025-12-31'
    GROUP BY d.region
    ORDER BY total_revenue DESC

The Semantic Layer caches computed metrics in Redis with tenant-scoped keys. Cache TTL depends on the metric's refresh frequency configuration.
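
To make the translation and caching steps concrete, here is a minimal sketch that compiles a metric and dimension lookup into SQL and derives a tenant-scoped cache key. Everything in it is an assumption for illustration: the `MODEL` dictionary layout, the `compile_query` and `cache_key` functions, and the key format are hypothetical and do not describe the Semantic Layer's actual implementation. Natural-language parsing of the semantic query is out of scope; the sketch starts from already-resolved names and dates.

```python
import hashlib
import json

# Hypothetical semantic model: metric and dimension names map to SQL fragments.
MODEL = {
    "fact_table": "orders o",
    "time_column": "o.order_date",
    "metrics": {"total_revenue": "SUM(o.amount)"},
    "dimensions": {
        "region": {
            "column": "d.region",
            "join": "JOIN dim_geography d ON o.geo_id = d.id",
        }
    },
}


def compile_query(metric, dimension, start, end, model=MODEL):
    """Translate a (metric, dimension, date range) request into SQL plus parameters.

    Names are resolved through the model, so unknown names raise KeyError instead
    of reaching the SQL text. Dates are passed as bind parameters.
    """
    expression = model["metrics"][metric]
    dim = model["dimensions"][dimension]
    sql = "\n".join([
        f"SELECT {dim['column']}, {expression} AS {metric}",
        f"FROM {model['fact_table']}",
        dim["join"],
        f"WHERE {model['time_column']} BETWEEN ? AND ?",
        f"GROUP BY {dim['column']}",
        f"ORDER BY {metric} DESC",
    ])
    return sql, (start, end)


def cache_key(tenant_id, metric, dimension, start, end):
    """Build a tenant-scoped key so cached results are never shared across tenants."""
    payload = json.dumps([metric, dimension, start, end])
    digest = hashlib.sha256(payload.encode()).hexdigest()[:16]
    return f"tenant:{tenant_id}:metric:{metric}:{digest}"


sql, params = compile_query("total_revenue", "region", "2025-10-01", "2025-12-31")
key = cache_key("acme", "total_revenue", "region", *params)
```

In a deployment, the computed result would presumably be written under `key` with an expiry taken from the metric's refresh frequency, for example through the `EX` option of the Redis `SET` command.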
2.4.D.3 Governance Service
The Governance Service enforces data governance policies:
| Policy Type | Enforcement Point | Example |
|---|---|---|
| Row-level security | Query Engine (WHERE clause injection) | Analysts can only see their region's data |
| Column masking | Query Engine (SELECT transformation) | PII columns are hashed/masked |
| Data classification | Catalog metadata tags | Mark columns as PII, PHI, financial |
| Retention policies | Pipeline Service (scheduled cleanup) | Delete data older than 7 years |
| Access policies | Query authorization check | Only data_engineer role can access raw tables |
Integration with Apache Polaris provides fine-grained access control at the Iceberg catalog level.
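
The row-level security and column masking rows in the policy table can be illustrated with a small sketch. It is deliberately simplified and rests on assumptions: the policy dictionaries, role names other than `data_engineer`, the `:user_region` placeholder, and all function names are hypothetical. It also swaps in two shortcuts that a real query engine would likely avoid: the row filter wraps the query as a subquery instead of rewriting the parsed plan, and masking is applied to result rows instead of transforming the SELECT list.

```python
import hashlib

# Hypothetical policy definitions; a real deployment would load these
# from the Governance Service.
ROW_FILTERS = {"analyst": "region = :user_region"}
MASKED_COLUMNS = {"email", "ssn"}
UNMASKED_ROLES = {"data_engineer"}


def apply_row_filter(sql, role):
    """Wrap a query so the role's row-level predicate is always applied."""
    predicate = ROW_FILTERS.get(role)
    if predicate is None:
        return sql
    # The predicate comes from trusted policy config, never from user input;
    # the user's region is supplied later as a bind parameter.
    return f"SELECT * FROM ({sql}) AS governed WHERE {predicate}"


def mask_value(value):
    """Replace a sensitive value with a short, stable hash."""
    return hashlib.sha256(str(value).encode()).hexdigest()[:12]


def mask_row(row, role):
    """Hash classified columns unless the role is allowed to see raw values."""
    if role in UNMASKED_ROLES:
        return dict(row)
    return {
        column: mask_value(value) if column in MASKED_COLUMNS else value
        for column, value in row.items()
    }


filtered_sql = apply_row_filter("SELECT * FROM customers", "analyst")
masked = mask_row({"name": "Ada", "email": "ada@example.com"}, "analyst")
```

Because the hash is deterministic, masked values can still be used for joins and distinct counts without exposing the underlying PII, although an unsalted hash of low-entropy data is not a strong anonymization guarantee.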
Related Sections
- Data Catalog -- Full catalog documentation
- Query Architecture -- How queries use catalog metadata
- AI Architecture -- How AI uses semantic models