MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Catalog and Semantic Layer Architecture

Production - Catalog (Java:8086), Semantic Layer (Java:8086), Governance (Python:8080)

The Catalog Service, Semantic Layer, and Governance Service form the metadata management triad of the Data Plane. Together they provide schema discovery, business-friendly metric abstraction, data lineage, classification, and access policy enforcement.


2.4.D.1 Catalog Service

The Catalog Service manages the platform's metadata catalog: databases, tables, columns, schemas, lineage, and data classifications.

Core responsibilities:

  • Data source registration and automated schema discovery
  • Table and column metadata management with type information
  • Data lineage tracking (upstream/downstream dependencies)
  • Schema change detection and versioning
  • Integration with OpenMetadata for metadata interchange
  • Full-text search and discovery via tags and classifications

Schema discovery flow:

1. A data source is registered (PostgreSQL, S3, Iceberg, etc.).
2. The Catalog Service connects through the appropriate connector.
3. It introspects the schema: databases, tables, columns, and types.
4. It stores the metadata in the catalog database.
5. It publishes a CATALOG_UPDATED event to Kafka.
6. The AI service receives the event and updates its RAG context.
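
The flow above can be sketched end to end with in-memory stand-ins. This is a minimal illustration under stated assumptions, not the Catalog Service's implementation: SQLite stands in for the registered data source, a dict for the catalog database, and a list for the Kafka topic. The names `introspect`, `discover_schema`, `catalog_store`, and `event_log`, and the event payload shape, are hypothetical.

```python
import json
import sqlite3


def introspect(conn):
    """Step 3: read table names, column names, and column types from the source."""
    tables = {}
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table_name,) in rows:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        cols = conn.execute(f'PRAGMA table_info("{table_name}")').fetchall()
        tables[table_name] = [{"name": c[1], "type": c[2]} for c in cols]
    return tables


def discover_schema(source_id, conn, catalog_store, event_log):
    """Steps 3 to 5: introspect, store the metadata, then publish an event."""
    metadata = introspect(conn)
    catalog_store[source_id] = metadata  # step 4: dict stands in for the catalog DB
    event = {  # step 5: payload shape is an assumption
        "type": "CATALOG_UPDATED",
        "source_id": source_id,
        "tables": sorted(metadata),
    }
    event_log.append(json.dumps(event))  # list stands in for a Kafka topic
    return metadata


# Steps 1 and 2: register a source and connect to it (in-memory SQLite here).
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, order_date TEXT)"
)
catalog_store, event_log = {}, []
discovered = discover_schema("demo-sqlite", source, catalog_store, event_log)
```

A consumer such as the AI service (step 6) would read the serialized event from the topic and refresh whatever context it derives from the listed tables.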

Key APIs:

Endpoint                              Method     Description
/api/v1/catalog/sources               GET/POST   Data source management
/api/v1/catalog/tables                GET        List tables with metadata
/api/v1/catalog/tables/{id}/columns   GET        Column metadata with types
/api/v1/catalog/lineage/{id}          GET        Data lineage graph
/api/v1/catalog/search                GET        Full-text metadata search
/api/v1/catalog/classifications       GET/POST   Data classification tags
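
As a rough illustration of how a client might address these endpoints, the sketch below builds (but does not send) requests for the path-parameter routes using only the Python standard library. The host name in `BASE_URL`, the helper `catalog_request`, the `Accept` header, and the use of `orders` as an `{id}` value are all assumptions; the source does not specify the identifier format, authentication, or response schema.

```python
from urllib.parse import quote, urljoin
from urllib.request import Request

# Hypothetical host; the port matches the Catalog entry in the service header.
BASE_URL = "http://catalog.example.internal:8086"


def catalog_request(path_template, method="GET", **path_params):
    """Fill path parameters (URL-encoded) and return an unsent Request."""
    encoded = {k: quote(str(v), safe="") for k, v in path_params.items()}
    path = path_template.format(**encoded)
    return Request(
        urljoin(BASE_URL, path),
        method=method,
        headers={"Accept": "application/json"},
    )


columns_req = catalog_request("/api/v1/catalog/tables/{id}/columns", id="orders")
lineage_req = catalog_request("/api/v1/catalog/lineage/{id}", id="orders")
```

Passing either object to `urllib.request.urlopen` would issue the call against a reachable deployment.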

2.4.D.2 Semantic Layer

The Semantic Layer provides a business-friendly abstraction over raw database schemas. It translates business concepts (metrics, dimensions, relationships) into SQL.

Semantic model structure:

Semantic Model
  +-- Entities (e.g., "Customer", "Order", "Product")
  |    +-- Dimensions (e.g., "region", "category", "date")
  |    +-- Metrics (e.g., "total_revenue", "avg_order_value")
  |    +-- Relationships (e.g., Customer --has_many--> Order)
  |
  +-- Calculations
       +-- Simple: SUM(amount), COUNT(DISTINCT customer_id)
       +-- Derived: revenue_per_customer = revenue / customer_count
       +-- Time Intelligence: YoY growth, MTD, rolling 7-day average
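
One way to picture this structure in code is a handful of plain data classes. The sketch below is illustrative only: the class names, their fields, and the `revenue_per_customer` helper are assumptions modeled on the tree above, not the Semantic Layer's actual types.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Dimension:
    name: str  # e.g. "region"


@dataclass
class Metric:
    name: str        # e.g. "total_revenue"
    expression: str  # simple calculation, e.g. "SUM(amount)"


@dataclass
class Relationship:
    kind: str    # e.g. "has_many"
    target: str  # name of the related entity


@dataclass
class Entity:
    name: str
    dimensions: List[Dimension] = field(default_factory=list)
    metrics: List[Metric] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)


def revenue_per_customer(revenue: float, customer_count: int) -> Optional[float]:
    """Derived calculation: revenue / customer_count, undefined for zero customers."""
    if customer_count == 0:
        return None
    return revenue / customer_count


customer = Entity(
    "Customer",
    dimensions=[Dimension("region")],
    relationships=[Relationship("has_many", "Order")],
)
order = Entity(
    "Order",
    dimensions=[Dimension("date")],
    metrics=[Metric("total_revenue", "SUM(amount)")],
)
```

Keeping derived calculations as functions over already-aggregated values mirrors the split between simple and derived calculations in the tree.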

Query translation example:

Semantic query:
  "total_revenue by region for Q4 2025"

Generated SQL:
  SELECT
    d.region,
    SUM(o.amount) AS total_revenue
  FROM orders o
  JOIN dim_geography d ON o.geo_id = d.id
  WHERE o.order_date BETWEEN '2025-10-01' AND '2025-12-31'
  GROUP BY d.region
  ORDER BY total_revenue DESC
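
The translation step can be approximated with a small template-based generator. This is a sketch under assumptions, not the Semantic Layer's translator: it takes an already-parsed query (metric, dimension, year, quarter) rather than the free-text form, and `MetricDef`, `DimensionDef`, `quarter_bounds`, and `build_sql` are hypothetical names. Identifiers come from trusted model definitions here; a real translator would validate them and bind literals as parameters.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class MetricDef:
    name: str         # "total_revenue"
    expression: str   # "SUM(o.amount)"
    table: str        # "orders"
    alias: str        # "o"
    date_column: str  # "order_date"


@dataclass(frozen=True)
class DimensionDef:
    column: str   # "region"
    table: str    # "dim_geography"
    alias: str    # "d"
    join_on: str  # "o.geo_id = d.id"


def quarter_bounds(year, quarter):
    """Return the first and last calendar day of the given quarter (1 to 4)."""
    start_month = 3 * (quarter - 1) + 1
    end_month = start_month + 2
    # Quarters end in March and December (31 days) or June and September (30 days).
    last_day = 31 if end_month in (3, 12) else 30
    return date(year, start_month, 1), date(year, end_month, last_day)


def build_sql(metric, dimension, year, quarter):
    """Render a grouped aggregate query for one metric by one dimension."""
    start, end = quarter_bounds(year, quarter)
    dim = f"{dimension.alias}.{dimension.column}"
    return "\n".join([
        "SELECT",
        f"  {dim},",
        f"  {metric.expression} AS {metric.name}",
        f"FROM {metric.table} {metric.alias}",
        f"JOIN {dimension.table} {dimension.alias} ON {dimension.join_on}",
        f"WHERE {metric.alias}.{metric.date_column} BETWEEN '{start}' AND '{end}'",
        f"GROUP BY {dim}",
        f"ORDER BY {metric.name} DESC",
    ])


total_revenue = MetricDef("total_revenue", "SUM(o.amount)", "orders", "o", "order_date")
region = DimensionDef("region", "dim_geography", "d", "o.geo_id = d.id")
sql = build_sql(total_revenue, region, 2025, 4)
```

With these definitions, `sql` has the same shape as the generated SQL shown above.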

The Semantic Layer caches computed metrics in Redis with tenant-scoped keys. Cache TTL depends on the metric's refresh frequency configuration.
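
A minimal sketch of what tenant-scoped keys and frequency-driven TTLs could look like follows. The key layout, the `TTL_BY_REFRESH` values, the 300-second fallback, and the function names are all assumptions chosen for illustration; the source does not document the real key format or TTL values.

```python
import hashlib
import json

# Hypothetical mapping from a metric's refresh frequency to a TTL in seconds.
TTL_BY_REFRESH = {"realtime": 60, "hourly": 3600, "daily": 86400}
DEFAULT_TTL_SECONDS = 300  # assumed fallback for unknown frequencies


def cache_key(tenant_id, metric, params):
    """Build a key that can never collide across tenants.

    Query parameters are serialized with sorted keys so that the same
    logical query always hashes to the same digest.
    """
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:16]
    return f"tenant:{tenant_id}:metric:{metric}:{digest}"


def ttl_seconds(refresh_frequency):
    """Pick the cache TTL from the metric's refresh frequency."""
    return TTL_BY_REFRESH.get(refresh_frequency, DEFAULT_TTL_SECONDS)


key = cache_key("acme", "total_revenue", {"dimension": "region", "quarter": "2025-Q4"})
ttl = ttl_seconds("hourly")
```

With a Redis client such as redis-py, the pair would typically be used as `client.set(key, value, ex=ttl)`, so each entry expires on its own schedule.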


2.4.D.3 Governance Service

The Governance Service enforces data governance policies:

Policy Type           Enforcement Point                       Example
Row-level security    Query Engine (WHERE clause injection)   Analysts can only see their region's data
Column masking        Query Engine (SELECT transformation)    PII columns are hashed/masked
Data classification   Catalog metadata tags                   Mark columns as PII, PHI, financial
Retention policies    Pipeline Service (scheduled cleanup)    Delete data older than 7 years
Access policies       Query authorization check               Only data_engineer role can access raw tables
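
To make the first two enforcement points and the access check concrete, the sketch below rewrites a simple single-table query. It is an illustration under assumptions, not the Governance Service's logic: the function names, the `'***'` masking literal (the table mentions hashing or masking; a literal is used here because SQL hash functions differ by engine), and the `:user_region` bind placeholder are all hypothetical.

```python
def masked_select_list(columns, masked_columns):
    """Column masking: replace masked columns with a constant in the SELECT list."""
    return ", ".join(
        f"'***' AS {col}" if col in masked_columns else col for col in columns
    )


def governed_query(table, columns, masked_columns, row_predicate):
    """Row-level security: append the policy predicate as a WHERE clause."""
    select_list = masked_select_list(columns, masked_columns)
    return f"SELECT {select_list} FROM {table} WHERE {row_predicate}"


def can_access_raw_tables(roles):
    """Access policy: only the data_engineer role may read raw tables."""
    return "data_engineer" in roles


query = governed_query(
    table="orders",
    columns=["order_id", "email", "region"],
    masked_columns={"email"},
    # Named placeholder to be bound to the analyst's region at execution time.
    row_predicate="region = :user_region",
)
```

Binding the region as a parameter rather than concatenating it keeps the injected predicate from becoming an injection vector itself.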

Integration with Apache Polaris provides fine-grained access control at the Iceberg catalog level.

