MATIH Platform is in active MVP development. Documentation reflects current implementation status.
11. Pipelines & Data Engineering
Data Virtualization

Data Virtualization

Data virtualization provides a federated query layer that allows pipelines to read from multiple heterogeneous data sources without physically moving data. MATIH uses Trino as the federation engine, enabling SQL queries that span databases, data lakes, and APIs through a unified interface.


Architecture

Pipeline Service ──> Trino Federation Engine
                         |
          +--------------+--------------+
          |              |              |
     PostgreSQL     Iceberg/S3     ClickHouse
     (via JDBC)     (via Polaris)  (via JDBC)

Supported Trino Connectors

ConnectorData SourceUse Case
postgresqlPostgreSQL databasesOperational data access
icebergIceberg tables via PolarisData lake queries
clickhouseClickHouse OLAPAnalytics queries
hiveHive-compatible storesLegacy data warehouses
memoryIn-memory tablesTemporary staging

Federated Pipeline Definition

metadata:
  name: cross-source-enrichment
  version: "1.0.0"
  owner: analytics-team
 
sources:
  federated_query:
    type: trino
    connection: ${TRINO_CONNECTION}
    query: |
      SELECT
        t.transaction_id,
        t.amount,
        c.customer_name,
        c.segment
      FROM postgresql.crm.transactions t
      JOIN iceberg.analytics.customers c
        ON t.customer_id = c.customer_id
      WHERE t.created_at > DATE '2026-02-01'
 
sinks:
  enriched_output:
    type: iceberg
    table: analytics.enriched.transactions
    mode: overwrite_partition
    partition_by: [date]
 
orchestration:
  engine: airflow
  schedule: "0 8 * * *"

Query Pushdown

Trino optimizes federated queries by pushing predicates and projections to source connectors:

OptimizationDescription
Predicate pushdownWHERE clauses pushed to source databases
Projection pushdownOnly selected columns retrieved from source
Limit pushdownLIMIT clauses applied at source level
Aggregation pushdownSimple aggregations computed at source

Catalog Configuration

Each data source is registered as a Trino catalog in the Pipeline Service:

POST /v1/virtualizations/catalogs

Request:
{
  "name": "crm_postgres",
  "connector": "postgresql",
  "properties": {
    "connection-url": "jdbc:postgresql://crm-db:5432/crm",
    "connection-user": "${CRM_DB_USER}",
    "connection-password": "${CRM_DB_PASSWORD}"
  }
}

Caching

For frequently accessed virtual tables, the Pipeline Service supports materialization caching:

StrategyTTLDescription
None--Query source on every access
Time-basedConfigurableCache results for a fixed duration
Event-driven--Invalidate on source change event

Performance Considerations

FactorRecommendation
Cross-source joinsEnsure smaller table is on the build side
Large scansAdd partition predicates to reduce scan volume
Network latencyCo-locate Trino workers with data sources
MemoryConfigure Trino memory limits per query

Related Pages