MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Metadata Management

Metadata management in MATIH encompasses the discovery, cataloging, profiling, and organization of data assets across all connected data sources. This section covers table discovery, schema browsing, data profiling, data product definitions, the business glossary, catalog search, and schema evolution tracking.


Table Discovery

The CatalogDiscoveryController provides endpoints for browsing the metadata hierarchy:

Data Source -> Database -> Schema -> Table -> Column

Hierarchy Browsing

GET /v1/catalog/browse?level=schemas&dataSource=pg-analytics

Response:
{
  "dataSource": "pg-analytics",
  "type": "POSTGRESQL",
  "databases": [
    {
      "name": "analytics",
      "schemas": [
        {
          "name": "public",
          "tableCount": 45,
          "lastSyncedAt": "2026-02-12T02:15:00Z"
        },
        {
          "name": "staging",
          "tableCount": 12,
          "lastSyncedAt": "2026-02-12T02:15:00Z"
        }
      ]
    }
  ]
}

Table Detail

GET /v1/catalog/tables/{tableFqn}

Response:
{
  "fqn": "pg-analytics.analytics.public.orders",
  "name": "orders",
  "description": "Customer order transactions with line items",
  "database": "analytics",
  "schema": "public",
  "columns": [
    {
      "name": "order_id",
      "type": "BIGINT",
      "nullable": false,
      "primaryKey": true,
      "description": "Unique order identifier"
    },
    {
      "name": "customer_id",
      "type": "BIGINT",
      "nullable": false,
      "foreignKey": {"table": "customers", "column": "id"},
      "description": "Reference to customer"
    },
    {
      "name": "amount",
      "type": "DECIMAL(12,2)",
      "nullable": false,
      "description": "Order total amount"
    },
    {
      "name": "email",
      "type": "VARCHAR(255)",
      "nullable": true,
      "classification": "RESTRICTED",
      "tags": ["PII", "email"],
      "description": "Customer email address"
    }
  ],
  "rowCount": 15234567,
  "sizeBytes": 3221225472,
  "partitioning": {
    "type": "RANGE",
    "columns": ["order_date"]
  },
  "owner": "data-engineering-team",
  "tags": ["transactional", "finance"],
  "lastUpdatedAt": "2026-02-12T08:00:00Z",
  "qualityScore": 0.94,
  "popularity": 87
}
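
The fqn in the response above appears to follow the Data Source -> Database -> Schema -> Table hierarchy, with the parts joined by dots. The sketch below shows one way a client might split such a value into its parts. The TableFqn class is hypothetical and not part of the MATIH API. It assumes exactly four segments and no dots inside identifiers, so the three-part FQNs used elsewhere in this section (without a data source prefix) would be rejected.

```java
import java.util.Objects;

/** Hypothetical value type for a four-part table FQN; not part of the MATIH API. */
final class TableFqn {
    final String dataSource;
    final String database;
    final String schema;
    final String table;

    private TableFqn(String dataSource, String database, String schema, String table) {
        this.dataSource = dataSource;
        this.database = database;
        this.schema = schema;
        this.table = table;
    }

    /** Parses "dataSource.database.schema.table"; any other shape is rejected. */
    static TableFqn parse(String fqn) {
        Objects.requireNonNull(fqn, "fqn");
        // Limit -1 keeps empty trailing segments so "a.b.c." is detected as invalid.
        String[] parts = fqn.split("\\.", -1);
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected 4 dot-separated parts: " + fqn);
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("Empty segment in FQN: " + fqn);
            }
        }
        return new TableFqn(parts[0], parts[1], parts[2], parts[3]);
    }

    /** Rebuilds the dotted form, e.g. for use in a request path. */
    @Override
    public String toString() {
        return String.join(".", dataSource, database, schema, table);
    }
}
```

For example, TableFqn.parse("pg-analytics.analytics.public.orders") yields a value whose table field is "orders" and whose schema field is "public".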

Data Profiling

Data profiling generates statistical summaries for tables and columns. The profiling engine runs during metadata ingestion and can be triggered on demand.

Table-Level Profile

| Metric | Description |
|---|---|
| Row count | Total number of rows |
| Size (bytes) | Storage size |
| Column count | Number of columns |
| Last modified | Timestamp of last data change |
| Growth rate | Rows added per day (30-day average) |
| Schema stability | Number of schema changes in last 90 days |

Column-Level Profile

| Metric | Applicable Types | Description |
|---|---|---|
| Null count / ratio | All | Number and percentage of NULL values |
| Distinct count / ratio | All | Number and percentage of distinct values |
| Min / Max | Numeric, Date | Minimum and maximum values |
| Mean / Median / Stddev | Numeric | Statistical measures |
| Histogram | Numeric | Value distribution in 20 buckets |
| Top values | String, Categorical | Most frequent values with counts |
| Pattern analysis | String | Common patterns (e.g., email, phone, SSN) |
| Min / Max length | String | Shortest and longest string lengths |
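
To make the numeric metrics above concrete, the following sketch computes them for an in-memory list of values. It is an illustration only, not the profiling engine's implementation: the NumericColumnProfile class is hypothetical, the histogram uses equal-width buckets between the observed minimum and maximum, and the standard deviation is the population form. All three choices are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Illustrative numeric column profiler; class and member names are hypothetical. */
final class NumericColumnProfile {
    final long nullCount;
    final double nullRatio;
    final long distinctCount;
    final double min, max, mean, median, stddev;
    final long[] histogram; // counts per equal-width bucket between min and max

    private NumericColumnProfile(long nullCount, double nullRatio, long distinctCount,
                                 double min, double max, double mean, double median,
                                 double stddev, long[] histogram) {
        this.nullCount = nullCount;
        this.nullRatio = nullRatio;
        this.distinctCount = distinctCount;
        this.min = min;
        this.max = max;
        this.mean = mean;
        this.median = median;
        this.stddev = stddev;
        this.histogram = histogram;
    }

    /** Profiles a column given as a list in which null entries stand for SQL NULL. */
    static NumericColumnProfile of(List<Double> values, int bucketCount) {
        if (bucketCount < 1) {
            throw new IllegalArgumentException("bucketCount must be >= 1");
        }
        List<Double> present = new ArrayList<>();
        for (Double v : values) {
            if (v != null) present.add(v);
        }
        if (present.isEmpty()) {
            throw new IllegalArgumentException("no non-null values to profile");
        }
        long nullCount = values.size() - present.size();
        double nullRatio = (double) nullCount / values.size();

        double[] sorted = present.stream().mapToDouble(Double::doubleValue).sorted().toArray();
        int n = sorted.length;
        double min = sorted[0];
        double max = sorted[n - 1];
        double mean = Arrays.stream(sorted).sum() / n;
        double median = (n % 2 == 1) ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;

        double sumSquares = 0;
        for (double v : sorted) sumSquares += (v - mean) * (v - mean);
        double stddev = Math.sqrt(sumSquares / n); // population standard deviation

        long[] histogram = new long[bucketCount];
        double width = (max - min) / bucketCount;
        for (double v : sorted) {
            int idx = (width == 0) ? 0 : (int) ((v - min) / width);
            histogram[Math.min(idx, bucketCount - 1)]++; // the maximum lands in the last bucket
        }
        long distinctCount = new HashSet<>(present).size();
        return new NumericColumnProfile(nullCount, nullRatio, distinctCount,
                min, max, mean, median, stddev, histogram);
    }
}
```

Calling NumericColumnProfile.of(values, 20) would match the 20-bucket histogram mentioned in the table. A production profiler would more likely push these aggregations down to the data source rather than load values into memory.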

Profile Response

{
  "tableFqn": "pg-analytics.analytics.public.orders",
  "profiledAt": "2026-02-12T02:30:00Z",
  "rowCount": 15234567,
  "columns": [
    {
      "name": "amount",
      "type": "DECIMAL(12,2)",
      "nullCount": 0,
      "nullRatio": 0.0,
      "distinctCount": 12456,
      "min": 0.01,
      "max": 99999.99,
      "mean": 245.67,
      "median": 125.00,
      "stddev": 312.45,
      "histogram": {
        "buckets": [
          {"min": 0, "max": 100, "count": 5234567},
          {"min": 100, "max": 500, "count": 6789012},
          {"min": 500, "max": 1000, "count": 2345678}
        ]
      }
    },
    {
      "name": "email",
      "type": "VARCHAR(255)",
      "nullCount": 23456,
      "nullRatio": 0.0015,
      "distinctCount": 14890000,
      "minLength": 6,
      "maxLength": 254,
      "topValues": [
        {"value": "user@example.com", "count": 12},
        {"value": "admin@test.com", "count": 8}
      ],
      "patterns": [
        {"pattern": "*@*.com", "percentage": 0.85},
        {"pattern": "*@*.org", "percentage": 0.08},
        {"pattern": "*@*.io", "percentage": 0.05}
      ]
    }
  ]
}

Data Products

Data products are curated, documented collections of related data assets designed for specific consumer use cases. The Catalog Service supports data product definitions that bundle tables, views, and documentation into discoverable units.

Data Product Structure

{
  "id": "dp-sales-analytics",
  "name": "Sales Analytics Data Product",
  "description": "Curated sales data for analytics and reporting",
  "domain": "Sales",
  "owner": "data-engineering-team",
  "sla": {
    "freshness": "< 1 hour",
    "availability": "99.9%",
    "qualityScore": "> 0.95"
  },
  "assets": [
    {"type": "TABLE", "fqn": "analytics.public.orders", "role": "fact"},
    {"type": "TABLE", "fqn": "analytics.public.customers", "role": "dimension"},
    {"type": "TABLE", "fqn": "analytics.public.products", "role": "dimension"},
    {"type": "VIEW", "fqn": "analytics.reports.daily_sales_summary", "role": "aggregate"}
  ],
  "consumers": ["bi-service", "ai-service"],
  "qualityChecks": [
    {"rule": "orders.amount > 0", "severity": "CRITICAL"},
    {"rule": "orders.customer_id IS NOT NULL", "severity": "CRITICAL"}
  ],
  "documentation": "https://docs.internal/data-products/sales-analytics",
  "status": "ACTIVE",
  "version": "2.3.0"
}
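
The sla block above expresses some targets as comparison strings such as "> 0.95". How the Catalog Service evaluates them is not described here, so the sketch below is only one possible reading: a hypothetical SlaThreshold helper that parses an operator followed by a plain number and checks a measured value against it. It deliberately does not handle unit-bearing values such as "< 1 hour" or "99.9%".

```java
/** Hypothetical helper that evaluates numeric SLA thresholds written like "> 0.95". */
final class SlaThreshold {
    final String operator; // one of ">", ">=", "<", "<="
    final double bound;

    private SlaThreshold(String operator, double bound) {
        this.operator = operator;
        this.bound = bound;
    }

    /** Parses an operator followed by a number; anything else is rejected. */
    static SlaThreshold parse(String text) {
        String s = text.trim();
        String op;
        if (s.startsWith(">=") || s.startsWith("<=")) {
            op = s.substring(0, 2);
        } else if (s.startsWith(">") || s.startsWith("<")) {
            op = s.substring(0, 1);
        } else {
            throw new IllegalArgumentException("Unsupported threshold: " + text);
        }
        // Double.parseDouble throws NumberFormatException for values with units.
        double bound = Double.parseDouble(s.substring(op.length()).trim());
        return new SlaThreshold(op, bound);
    }

    /** Returns true when the measured value satisfies the threshold. */
    boolean isMetBy(double measured) {
        switch (operator) {
            case ">":  return measured > bound;
            case ">=": return measured >= bound;
            case "<":  return measured < bound;
            default:   return measured <= bound; // "<="
        }
    }
}
```

With this helper, SlaThreshold.parse("> 0.95").isMetBy(0.97) is true, while a measured score of 0.94 does not meet the target.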

Business Glossary

The DataGlossaryController manages business terminology definitions:

POST /v1/catalog/glossary/terms

Request:
{
  "term": "Revenue",
  "definition": "Total income generated from sales of goods and services, calculated as sum of order amounts excluding returns and discounts",
  "domain": "Finance",
  "synonyms": ["Sales Revenue", "Gross Revenue", "Top Line"],
  "relatedTerms": ["Net Revenue", "ARPU", "MRR"],
  "formula": "SUM(orders.amount) - SUM(returns.amount) - SUM(discounts.amount)",
  "owner": "finance-team",
  "tables": [
    {"fqn": "analytics.public.orders", "column": "amount", "relationship": "SOURCE"}
  ]
}

Glossary Features

| Feature | Description |
|---|---|
| Term definitions | Business-friendly descriptions of data concepts |
| Synonyms | Alternative names for the same concept |
| Related terms | Links between related concepts |
| Formulas | Calculation definitions tied to actual columns |
| Table mappings | Links between terms and physical data assets |
| Ownership | Team or person responsible for the definition |
| Approval workflow | Draft, review, approved, deprecated lifecycle |
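
The approval workflow names four lifecycle states but does not say which transitions are allowed. The sketch below models the lifecycle as a small state machine under assumed rules: a draft goes to review, a review is either approved or sent back to draft, an approved term can be deprecated, and deprecated is terminal. The TermStatus enum and its transition rules are hypothetical.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical lifecycle states for a glossary term; the transitions are assumed. */
enum TermStatus {
    DRAFT, REVIEW, APPROVED, DEPRECATED;

    /** States reachable from this one in a single step (assumed, not documented). */
    Set<TermStatus> nextStates() {
        switch (this) {
            case DRAFT:    return EnumSet.of(REVIEW);
            case REVIEW:   return EnumSet.of(APPROVED, DRAFT); // approve or send back
            case APPROVED: return EnumSet.of(DEPRECATED);
            default:       return EnumSet.noneOf(TermStatus.class); // DEPRECATED is terminal
        }
    }

    /** Returns the target state if the move is allowed, otherwise throws. */
    TermStatus transitionTo(TermStatus target) {
        if (!nextStates().contains(target)) {
            throw new IllegalStateException("Cannot move from " + this + " to " + target);
        }
        return target;
    }
}
```

Keeping the allowed transitions in one place makes it straightforward to reject invalid updates, for example an attempt to move a deprecated term back to draft.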

Catalog Search

The CatalogSearchService provides full-text search and faceted browsing across all metadata:

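import org.springframework.stereotype.Service;

@Service
public class CatalogSearchService {

    // SearchRequest and SearchResult are the service's own types (not shown here).
    public SearchResult search(SearchRequest request) {
        // 1. Full-text search across table names, descriptions, column names, and tags
        // 2. Faceted filtering by data source, schema, classification, owner, and tag
        // 3. Relevance scoring with a popularity boost
        throw new UnsupportedOperationException("Body omitted in this overview");
    }
}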

Search API

GET /v1/catalog/search?q=customer+revenue&type=TABLE&tags=finance&limit=20

Response:
{
  "query": "customer revenue",
  "totalResults": 12,
  "results": [
    {
      "type": "TABLE",
      "fqn": "analytics.public.customer_revenue",
      "name": "customer_revenue",
      "description": "Monthly revenue per customer with lifetime value",
      "score": 0.95,
      "highlights": {
        "description": ["Monthly <em>revenue</em> per <em>customer</em> with lifetime value"]
      },
      "tags": ["finance", "revenue"],
      "qualityScore": 0.97,
      "popularity": 92,
      "owner": "data-engineering-team"
    }
  ],
  "facets": {
    "dataSource": [{"value": "pg-analytics", "count": 8}, {"value": "iceberg-lake", "count": 4}],
    "schema": [{"value": "public", "count": 7}, {"value": "reports", "count": 5}],
    "tags": [{"value": "finance", "count": 12}, {"value": "customer", "count": 9}],
    "classification": [{"value": "INTERNAL", "count": 10}, {"value": "CONFIDENTIAL", "count": 2}]
  }
}
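
A client calling the search endpoint needs to URL-encode the query parameters. The sketch below builds the request URI shown above using only the JDK. The CatalogSearchUri class and the base URL passed to it are hypothetical; the path and parameter names (q, type, tags, limit) are taken from the example request.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical client-side helper that builds a catalog search URI. */
final class CatalogSearchUri {

    static URI build(String baseUrl, String query, String type, String tag, int limit) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", query);
        params.put("type", type);
        params.put("tags", tag);
        params.put("limit", Integer.toString(limit));

        StringJoiner queryString = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            // URLEncoder uses form encoding, so a space becomes "+".
            queryString.add(e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return URI.create(baseUrl + "/v1/catalog/search?" + queryString);
    }
}
```

The resulting URI can be passed to any HTTP client, for instance java.net.http.HttpClient. Authentication headers, if the service requires them, are outside the scope of this sketch.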

Search Ranking Factors

| Factor | Weight | Description |
|---|---|---|
| Text relevance | 0.40 | BM25 score for query terms in name, description, columns |
| Popularity | 0.25 | Query frequency and user access patterns |
| Quality score | 0.15 | Data quality rating from the Quality Service |
| Freshness | 0.10 | How recently the data was updated |
| Completeness | 0.10 | Percentage of metadata fields populated |

Schema Evolution Tracking

The Catalog Service tracks schema changes over time, maintaining a history of all modifications:

GET /v1/catalog/tables/{tableFqn}/schema-history

Response:
{
  "tableFqn": "analytics.public.orders",
  "changes": [
    {
      "version": 15,
      "timestamp": "2026-02-10T14:30:00Z",
      "changeType": "COLUMN_ADDED",
      "column": "discount_code",
      "columnType": "VARCHAR(50)",
      "changedBy": "migration-script-v2.3"
    },
    {
      "version": 14,
      "timestamp": "2026-01-28T09:15:00Z",
      "changeType": "COLUMN_TYPE_CHANGED",
      "column": "amount",
      "previousType": "DECIMAL(10,2)",
      "newType": "DECIMAL(12,2)",
      "changedBy": "dba-team"
    }
  ]
}
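
The change entries above could be derived by comparing two snapshots of a table's columns. The sketch below illustrates that idea with a hypothetical SchemaDiff helper that takes column-name-to-type maps for the old and new schema. COLUMN_ADDED and COLUMN_TYPE_CHANGED come from the response above; COLUMN_REMOVED is an assumed change type added for completeness. Whether the Catalog Service detects changes this way is not stated.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that lists differences between two schema snapshots. */
final class SchemaDiff {

    /** Compares column-name -> type maps and describes each difference as a string. */
    static List<String> diff(Map<String, String> before, Map<String, String> after) {
        List<String> changes = new ArrayList<>();
        for (Map.Entry<String, String> e : after.entrySet()) {
            String oldType = before.get(e.getKey());
            if (oldType == null) {
                changes.add("COLUMN_ADDED " + e.getKey() + " " + e.getValue());
            } else if (!oldType.equals(e.getValue())) {
                changes.add("COLUMN_TYPE_CHANGED " + e.getKey() + " "
                        + oldType + " -> " + e.getValue());
            }
        }
        for (String name : before.keySet()) {
            if (!after.containsKey(name)) {
                changes.add("COLUMN_REMOVED " + name); // assumed change type
            }
        }
        return changes;
    }
}
```

A renamed column would show up here as one removal plus one addition, since a name-based comparison cannot tell a rename from a drop and re-add.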

Related Sections