Metadata Management
Metadata management in MATIH encompasses the discovery, cataloging, profiling, and organization of data assets across all connected data sources. This section covers table discovery and schema browsing, data profiling, data product definitions, the business glossary, catalog search, and schema evolution tracking.
Table Discovery
The CatalogDiscoveryController provides endpoints for browsing the metadata hierarchy:
Data Source -> Database -> Schema -> Table -> Column
Hierarchy Browsing
GET /v1/catalog/browse?level=schemas&dataSource=pg-analytics
Response:
{
"dataSource": "pg-analytics",
"type": "POSTGRESQL",
"databases": [
{
"name": "analytics",
"schemas": [
{
"name": "public",
"tableCount": 45,
"lastSyncedAt": "2026-02-12T02:15:00Z"
},
{
"name": "staging",
"tableCount": 12,
"lastSyncedAt": "2026-02-12T02:15:00Z"
}
]
}
]
}
Table Detail
GET /v1/catalog/tables/{tableFqn}
Response:
{
"fqn": "pg-analytics.analytics.public.orders",
"name": "orders",
"description": "Customer order transactions with line items",
"database": "analytics",
"schema": "public",
"columns": [
{
"name": "order_id",
"type": "BIGINT",
"nullable": false,
"primaryKey": true,
"description": "Unique order identifier"
},
{
"name": "customer_id",
"type": "BIGINT",
"nullable": false,
"foreignKey": {"table": "customers", "column": "id"},
"description": "Reference to customer"
},
{
"name": "amount",
"type": "DECIMAL(12,2)",
"nullable": false,
"description": "Order total amount"
},
{
"name": "email",
"type": "VARCHAR(255)",
"nullable": true,
"classification": "RESTRICTED",
"tags": ["PII", "email"],
"description": "Customer email address"
}
],
"rowCount": 15234567,
"sizeBytes": 3221225472,
"partitioning": {
"type": "RANGE",
"columns": ["order_date"]
},
"owner": "data-engineering-team",
"tags": ["transactional", "finance"],
"lastUpdatedAt": "2026-02-12T08:00:00Z",
"qualityScore": 0.94,
"popularity": 87
}
Data Profiling
Data profiling generates statistical summaries for tables and columns. The profiling engine runs during metadata ingestion and can be triggered on demand.
Table-Level Profile
| Metric | Description |
|---|---|
| Row count | Total number of rows |
| Size (bytes) | Storage size |
| Column count | Number of columns |
| Last modified | Timestamp of last data change |
| Growth rate | Rows added per day (30-day average) |
| Schema stability | Number of schema changes in last 90 days |
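The two derived metrics in this table, growth rate and schema stability, can be sketched as follows. This is a minimal illustration rather than the profiling engine's actual code: the class `TableProfileMetrics`, its method names, and the inputs (one row-count snapshot per day, a list of schema change timestamps) are all assumptions made for the example.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical helper; names and formulas are illustrative assumptions.
final class TableProfileMetrics {

    /** Average rows added per day across daily row-count snapshots, oldest first. */
    static double rowsPerDay(List<Long> dailyRowCounts) {
        if (dailyRowCounts.size() < 2) {
            return 0.0; // not enough history to measure growth
        }
        long first = dailyRowCounts.get(0);
        long last = dailyRowCounts.get(dailyRowCounts.size() - 1);
        int elapsedDays = dailyRowCounts.size() - 1;
        return (double) (last - first) / elapsedDays;
    }

    /** Number of schema changes inside the trailing window that ends at {@code now}. */
    static long schemaChanges(List<Instant> changeTimes, Instant now, Duration window) {
        Instant cutoff = now.minus(window);
        return changeTimes.stream()
                .filter(t -> !t.isBefore(cutoff) && !t.isAfter(now))
                .count();
    }

    public static void main(String[] args) {
        // Four daily snapshots: (1600 - 1000) / 3 days
        System.out.println(rowsPerDay(List.of(1_000L, 1_150L, 1_300L, 1_600L))); // 200.0

        Instant now = Instant.parse("2026-02-12T00:00:00Z");
        List<Instant> changes = List.of(
                Instant.parse("2026-02-10T14:30:00Z"),
                Instant.parse("2026-01-28T09:15:00Z"),
                Instant.parse("2025-10-01T00:00:00Z")); // older than 90 days
        System.out.println(schemaChanges(changes, now, Duration.ofDays(90))); // 2
    }
}
```

For the growth rate, passing 31 snapshots yields the 30-day average described in the table.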
Column-Level Profile
| Metric | Applicable Types | Description |
|---|---|---|
| Null count / ratio | All | Number and percentage of NULL values |
| Distinct count / ratio | All | Number and percentage of distinct values |
| Min / Max | Numeric, Date | Minimum and maximum values |
| Mean / Median / Stddev | Numeric | Statistical measures |
| Histogram | Numeric | Value distribution in 20 buckets |
| Top values | String, Categorical | Most frequent values with counts |
| Pattern analysis | String | Common patterns (e.g., email, phone, SSN) |
| Min / Max length | String | Shortest and longest string lengths |
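As a rough illustration of the numeric metrics above, the sketch below profiles an in-memory column of nullable values. The record `NumericProfile` is hypothetical, and two choices are assumptions not stated in this section: the standard deviation is the population form (divide by n), and the median of an even-sized column is the mean of the two middle values. A real profiler would more likely push these aggregations down to the data source.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;

// Hypothetical profile of one numeric column; not the Catalog Service's own type.
record NumericProfile(long nullCount, double nullRatio, long distinctCount,
                      double min, double max, double mean, double median, double stddev) {

    static NumericProfile of(List<Double> column) {
        if (column.isEmpty()) {
            throw new IllegalArgumentException("column has no rows");
        }
        // Drop NULLs and sort once so min, max, and median are simple lookups.
        List<Double> present = column.stream().filter(Objects::nonNull).sorted().toList();
        long nullCount = column.size() - present.size();
        double nullRatio = (double) nullCount / column.size();
        if (present.isEmpty()) {
            return new NumericProfile(nullCount, nullRatio, 0,
                    Double.NaN, Double.NaN, Double.NaN, Double.NaN, Double.NaN);
        }
        int n = present.size();
        double mean = present.stream().mapToDouble(Double::doubleValue).sum() / n;
        // Population variance: average squared distance from the mean.
        double variance = present.stream().mapToDouble(v -> (v - mean) * (v - mean)).sum() / n;
        double median = (n % 2 == 1)
                ? present.get(n / 2)
                : (present.get(n / 2 - 1) + present.get(n / 2)) / 2.0;
        return new NumericProfile(nullCount, nullRatio, new HashSet<>(present).size(),
                present.get(0), present.get(n - 1), mean, median, Math.sqrt(variance));
    }

    public static void main(String[] args) {
        // One NULL out of five rows; non-null values are 10, 20, 20, 50.
        System.out.println(NumericProfile.of(Arrays.asList(10.0, null, 20.0, 20.0, 50.0)));
    }
}
```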
Profile Response
{
"tableFqn": "pg-analytics.analytics.public.orders",
"profiledAt": "2026-02-12T02:30:00Z",
"rowCount": 15234567,
"columns": [
{
"name": "amount",
"type": "DECIMAL(12,2)",
"nullCount": 0,
"nullRatio": 0.0,
"distinctCount": 12456,
"min": 0.01,
"max": 99999.99,
"mean": 245.67,
"median": 125.00,
"stddev": 312.45,
"histogram": {
"buckets": [
{"min": 0, "max": 100, "count": 5234567},
{"min": 100, "max": 500, "count": 6789012},
{"min": 500, "max": 1000, "count": 2345678}
]
}
},
{
"name": "email",
"type": "VARCHAR(255)",
"nullCount": 23456,
"nullRatio": 0.0015,
"distinctCount": 14890000,
"minLength": 6,
"maxLength": 254,
"topValues": [
{"value": "user@example.com", "count": 12},
{"value": "admin@test.com", "count": 8}
],
"patterns": [
{"pattern": "*@*.com", "percentage": 0.85},
{"pattern": "*@*.org", "percentage": 0.08},
{"pattern": "*@*.io", "percentage": 0.05}
]
}
]
}
Data Products
Data products are curated, documented collections of related data assets designed for specific consumer use cases. The Catalog Service supports data product definitions that bundle tables, views, and documentation into discoverable units.
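A data product definition can carry SLA targets written as comparison strings, such as the `qualityScore` target of `> 0.95` in the structure shown in the next subsection. How the Catalog Service evaluates these strings is not described here. The sketch below is one possible reading, using a hypothetical `SlaThreshold` class that handles only unit-less numeric comparisons; targets with units such as `< 1 hour` or `99.9%` are deliberately rejected rather than guessed at.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for unit-less SLA targets such as "> 0.95"; an assumption, not the service's API.
final class SlaThreshold {
    private static final Pattern FORMAT = Pattern.compile("(<=|>=|<|>)\\s*(\\d+(?:\\.\\d+)?)");

    private final String operator;
    private final double bound;

    private SlaThreshold(String operator, double bound) {
        this.operator = operator;
        this.bound = bound;
    }

    static SlaThreshold parse(String expression) {
        Matcher m = FORMAT.matcher(expression.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported threshold: " + expression);
        }
        return new SlaThreshold(m.group(1), Double.parseDouble(m.group(2)));
    }

    /** Returns true when the measured value satisfies the target. */
    boolean isMetBy(double measured) {
        switch (operator) {
            case ">":  return measured > bound;
            case ">=": return measured >= bound;
            case "<":  return measured < bound;
            default:   return measured <= bound; // only "<=" remains
        }
    }

    public static void main(String[] args) {
        SlaThreshold target = SlaThreshold.parse("> 0.95");
        System.out.println(target.isMetBy(0.94)); // false
        System.out.println(target.isMetBy(0.97)); // true
    }
}
```

Under this reading, a table with a quality score of 0.94 would fall short of a `> 0.95` target.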
Data Product Structure
{
"id": "dp-sales-analytics",
"name": "Sales Analytics Data Product",
"description": "Curated sales data for analytics and reporting",
"domain": "Sales",
"owner": "data-engineering-team",
"sla": {
"freshness": "< 1 hour",
"availability": "99.9%",
"qualityScore": "> 0.95"
},
"assets": [
{"type": "TABLE", "fqn": "analytics.public.orders", "role": "fact"},
{"type": "TABLE", "fqn": "analytics.public.customers", "role": "dimension"},
{"type": "TABLE", "fqn": "analytics.public.products", "role": "dimension"},
{"type": "VIEW", "fqn": "analytics.reports.daily_sales_summary", "role": "aggregate"}
],
"consumers": ["bi-service", "ai-service"],
"qualityChecks": [
{"rule": "orders.amount > 0", "severity": "CRITICAL"},
{"rule": "orders.customer_id IS NOT NULL", "severity": "CRITICAL"}
],
"documentation": "https://docs.internal/data-products/sales-analytics",
"status": "ACTIVE",
"version": "2.3.0"
}
Business Glossary
The DataGlossaryController manages business terminology definitions:
POST /v1/catalog/glossary/terms
Request:
{
"term": "Revenue",
"definition": "Total income generated from sales of goods and services, calculated as sum of order amounts excluding returns and discounts",
"domain": "Finance",
"synonyms": ["Sales Revenue", "Gross Revenue", "Top Line"],
"relatedTerms": ["Net Revenue", "ARPU", "MRR"],
"formula": "SUM(orders.amount) - SUM(returns.amount) - SUM(discounts.amount)",
"owner": "finance-team",
"tables": [
{"fqn": "analytics.public.orders", "column": "amount", "relationship": "SOURCE"}
]
}
Glossary Features
| Feature | Description |
|---|---|
| Term definitions | Business-friendly descriptions of data concepts |
| Synonyms | Alternative names for the same concept |
| Related terms | Links between related concepts |
| Formulas | Calculation definitions tied to actual columns |
| Table mappings | Links between terms and physical data assets |
| Ownership | Team or person responsible for the definition |
| Approval workflow | Draft, review, approved, deprecated lifecycle |
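One practical use of synonyms is resolving whatever name a user types to the canonical glossary term. The sketch below shows that idea with a hypothetical in-memory `GlossaryIndex`; the class, its methods, and the case-insensitive matching rule are assumptions for illustration and do not describe how `DataGlossaryController` stores terms.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical in-memory index mapping term names and synonyms to the canonical term.
final class GlossaryIndex {
    private final Map<String, String> canonicalByName = new HashMap<>();

    /** Registers a term so that it can be found by its own name or any synonym. */
    void addTerm(String term, List<String> synonyms) {
        canonicalByName.put(normalize(term), term);
        for (String synonym : synonyms) {
            canonicalByName.put(normalize(synonym), term);
        }
    }

    /** Returns the canonical term for a name or synonym, ignoring case and outer whitespace. */
    Optional<String> resolve(String name) {
        return Optional.ofNullable(canonicalByName.get(normalize(name)));
    }

    private static String normalize(String value) {
        return value.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        GlossaryIndex index = new GlossaryIndex();
        index.addTerm("Revenue", List.of("Sales Revenue", "Gross Revenue", "Top Line"));
        System.out.println(index.resolve("top line"));    // Optional[Revenue]
        System.out.println(index.resolve("Net Revenue")); // Optional.empty
    }
}
```

Related terms such as "Net Revenue" are intentionally not resolved to "Revenue", since they name different concepts.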
Catalog Search
The CatalogSearchService provides full-text search and faceted browsing across all metadata:
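import org.springframework.stereotype.Service;

@Service
public class CatalogSearchService {

    /** Runs a catalog search; the body is elided in this overview. */
    public SearchResult search(SearchRequest request) {
        // Full-text search across table names, descriptions, column names, tags
        // Faceted filtering by data source, schema, classification, owner, tag
        // Relevance scoring with popularity boost
        throw new UnsupportedOperationException("Implementation elided in this overview");
    }
}
Search API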
GET /v1/catalog/search?q=customer+revenue&type=TABLE&tags=finance&limit=20
Response:
{
"query": "customer revenue",
"totalResults": 12,
"results": [
{
"type": "TABLE",
"fqn": "analytics.public.customer_revenue",
"name": "customer_revenue",
"description": "Monthly revenue per customer with lifetime value",
"score": 0.95,
"highlights": {
"description": ["Monthly <em>revenue</em> per <em>customer</em> with lifetime value"]
},
"tags": ["finance", "revenue"],
"qualityScore": 0.97,
"popularity": 92,
"owner": "data-engineering-team"
}
],
"facets": {
"dataSource": [{"value": "pg-analytics", "count": 8}, {"value": "iceberg-lake", "count": 4}],
"schema": [{"value": "public", "count": 7}, {"value": "reports", "count": 5}],
"tags": [{"value": "finance", "count": 12}, {"value": "customer", "count": 9}],
"classification": [{"value": "INTERNAL", "count": 10}, {"value": "CONFIDENTIAL", "count": 2}]
}
}
Search Ranking Factors
| Factor | Weight | Description |
|---|---|---|
| Text relevance | 0.40 | BM25 score for query terms in name, description, columns |
| Popularity | 0.25 | Query frequency and user access patterns |
| Quality score | 0.15 | Data quality rating from the Quality Service |
| Freshness | 0.10 | How recently the data was updated |
| Completeness | 0.10 | Percentage of metadata fields populated |
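The weights above sum to 1.0, which suggests a weighted linear combination of the five factors. The sketch below assumes exactly that, and further assumes each factor has already been normalized to the range 0 to 1 (for example, a popularity of 92 on a 0 to 100 scale becomes 0.92, and the BM25 score is rescaled). Neither assumption is confirmed by this section, and `SearchRanking` is a hypothetical name.

```java
// Hypothetical scoring helper; the linear combination and the normalization are assumptions.
final class SearchRanking {

    /** Combines five factors, each expected in [0, 1], using the weights from the table. */
    static double score(double textRelevance, double popularity, double qualityScore,
                        double freshness, double completeness) {
        return 0.40 * clamp(textRelevance)
             + 0.25 * clamp(popularity)
             + 0.15 * clamp(qualityScore)
             + 0.10 * clamp(freshness)
             + 0.10 * clamp(completeness);
    }

    /** Keeps out-of-range inputs from pushing the final score outside [0, 1]. */
    private static double clamp(double value) {
        return Math.max(0.0, Math.min(1.0, value));
    }

    public static void main(String[] args) {
        // 0.36 + 0.23 + 0.1455 + 0.10 + 0.08, roughly 0.9155
        System.out.println(score(0.90, 0.92, 0.97, 1.0, 0.80));
    }
}
```

Because text relevance carries the largest weight, a popular but poorly matching table cannot outrank a strong textual match on popularity alone.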
Schema Evolution Tracking
The Catalog Service records each schema change as a numbered version and exposes the full history per table:
GET /v1/catalog/tables/{tableFqn}/schema-history
Response:
{
"tableFqn": "analytics.public.orders",
"changes": [
{
"version": 15,
"timestamp": "2026-02-10T14:30:00Z",
"changeType": "COLUMN_ADDED",
"column": "discount_code",
"columnType": "VARCHAR(50)",
"changedBy": "migration-script-v2.3"
},
{
"version": 14,
"timestamp": "2026-01-28T09:15:00Z",
"changeType": "COLUMN_TYPE_CHANGED",
"column": "amount",
"previousType": "DECIMAL(10,2)",
"newType": "DECIMAL(12,2)",
"changedBy": "dba-team"
}
]
}
Related Sections
- Catalog Service -- Service architecture and OpenMetadata integration
- Data Lineage -- Lineage tracking for metadata changes
- Classification -- Auto-classification during profiling
- Data Quality -- Quality scoring based on profile metrics