Data Virtualization
Data virtualization provides a federated query layer that allows pipelines to read from multiple heterogeneous data sources without physically moving data. MATIH uses Trino as the federation engine, enabling SQL queries that span databases, data lakes, and APIs through a unified interface.
Architecture
Pipeline Service ──> Trino Federation Engine
|
+--------------+--------------+
| | |
PostgreSQL Iceberg/S3 ClickHouse
(via JDBC) (via Polaris) (via JDBC)Supported Trino Connectors
| Connector | Data Source | Use Case |
|---|---|---|
postgresql | PostgreSQL databases | Operational data access |
iceberg | Iceberg tables via Polaris | Data lake queries |
clickhouse | ClickHouse OLAP | Analytics queries |
hive | Hive-compatible stores | Legacy data warehouses |
memory | In-memory tables | Temporary staging |
Federated Pipeline Definition
metadata:
name: cross-source-enrichment
version: "1.0.0"
owner: analytics-team
sources:
federated_query:
type: trino
connection: ${TRINO_CONNECTION}
query: |
SELECT
t.transaction_id,
t.amount,
c.customer_name,
c.segment
FROM postgresql.crm.transactions t
JOIN iceberg.analytics.customers c
ON t.customer_id = c.customer_id
WHERE t.created_at > DATE '2026-02-01'
sinks:
enriched_output:
type: iceberg
table: analytics.enriched.transactions
mode: overwrite_partition
partition_by: [date]
orchestration:
engine: airflow
schedule: "0 8 * * *"Query Pushdown
Trino optimizes federated queries by pushing predicates and projections to source connectors:
| Optimization | Description |
|---|---|
| Predicate pushdown | WHERE clauses pushed to source databases |
| Projection pushdown | Only selected columns retrieved from source |
| Limit pushdown | LIMIT clauses applied at source level |
| Aggregation pushdown | Simple aggregations computed at source |
Catalog Configuration
Each data source is registered as a Trino catalog in the Pipeline Service:
POST /v1/virtualizations/catalogs
Request:
{
"name": "crm_postgres",
"connector": "postgresql",
"properties": {
"connection-url": "jdbc:postgresql://crm-db:5432/crm",
"connection-user": "${CRM_DB_USER}",
"connection-password": "${CRM_DB_PASSWORD}"
}
}Caching
For frequently accessed virtual tables, the Pipeline Service supports materialization caching:
| Strategy | TTL | Description |
|---|---|---|
| None | -- | Query source on every access |
| Time-based | Configurable | Cache results for a fixed duration |
| Event-driven | -- | Invalidate on source change event |
Performance Considerations
| Factor | Recommendation |
|---|---|
| Cross-source joins | Ensure smaller table is on the build side |
| Large scans | Add partition predicates to reduce scan volume |
| Network latency | Co-locate Trino workers with data sources |
| Memory | Configure Trino memory limits per query |
Related Pages
- Batch Ingestion -- Physical data movement
- Schema Registry -- Schema metadata
- Pipeline Service -- Architecture overview