Data Virtualization

Data virtualization provides a federated query layer that allows pipelines to read from multiple heterogeneous data sources without physically moving data. MATIH uses Trino as the federation engine, enabling SQL queries that span databases, data lakes, and APIs through a unified interface.

Architecture

Pipeline Service ──> Trino Federation Engine
                         |
          +--------------+--------------+
          |              |              |
     PostgreSQL     Iceberg/S3     ClickHouse
     (via JDBC)     (via Polaris)  (via JDBC)

Supported Trino Connectors

Connector	Data Source	Use Case
`postgresql`	PostgreSQL databases	Operational data access
`iceberg`	Iceberg tables via Polaris	Data lake queries
`clickhouse`	ClickHouse OLAP	Analytics queries
`hive`	Hive-compatible stores	Legacy data warehouses
`memory`	In-memory tables	Temporary staging

Federated Pipeline Definition

metadata:
  name: cross-source-enrichment
  version: "1.0.0"
  owner: analytics-team
 
sources:
  federated_query:
    type: trino
    connection: ${TRINO_CONNECTION}
    query: |
      SELECT
        t.transaction_id,
        t.amount,
        c.customer_name,
        c.segment
      FROM postgresql.crm.transactions t
      JOIN iceberg.analytics.customers c
        ON t.customer_id = c.customer_id
      WHERE t.created_at > DATE '2026-02-01'
 
sinks:
  enriched_output:
    type: iceberg
    table: analytics.enriched.transactions
    mode: overwrite_partition
    partition_by: [date]
 
orchestration:
  engine: airflow
  schedule: "0 8 * * *"

Query Pushdown

Trino optimizes federated queries by pushing predicates and projections to source connectors:

Optimization	Description
Predicate pushdown	WHERE clauses pushed to source databases
Projection pushdown	Only selected columns retrieved from source
Limit pushdown	LIMIT clauses applied at source level
Aggregation pushdown	Simple aggregations computed at source

Catalog Configuration

Each data source is registered as a Trino catalog in the Pipeline Service:

POST /v1/virtualizations/catalogs

Request:
{
  "name": "crm_postgres",
  "connector": "postgresql",
  "properties": {
    "connection-url": "jdbc:postgresql://crm-db:5432/crm",
    "connection-user": "${CRM_DB_USER}",
    "connection-password": "${CRM_DB_PASSWORD}"
  }
}

Caching

For frequently accessed virtual tables, the Pipeline Service supports materialization caching:

Strategy	TTL	Description
None	--	Query source on every access
Time-based	Configurable	Cache results for a fixed duration
Event-driven	--	Invalidate on source change event

Performance Considerations

Factor	Recommendation
Cross-source joins	Ensure smaller table is on the build side
Large scans	Add partition predicates to reduce scan volume
Network latency	Co-locate Trino workers with data sources
Memory	Configure Trino memory limits per query

Batch Ingestion -- Physical data movement
Schema Registry -- Schema metadata
Pipeline Service -- Architecture overview

Schema Registry Event Sourcing