MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Data Federation

Trino's federated query capability allows the MATIH platform to execute queries that join data across multiple data sources in a single SQL statement. This enables unified analytics without requiring data movement or ETL pipelines.


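How Federation Works

The query below joins tables from three different catalogs (Delta Lake, PostgreSQL, and Iceberg) in a single statement: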

SELECT
  orders.order_id,
  customers.name,
  inventory.stock_level
FROM delta.public.orders AS orders
JOIN postgresql.crm.customers AS customers
  ON orders.customer_id = customers.id
JOIN iceberg.warehouse.inventory AS inventory
  ON orders.product_id = inventory.product_id
WHERE orders.date > DATE '2026-01-01'

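Trino resolves each table reference to its catalog and connector, reads only the data it needs from each source (pushing filters and column selection down where the connector supports it), and joins the intermediate results in memory on the Trino cluster.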
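
One way to see how a federated query is split across sources is to ask Trino for its distributed plan. The sketch below wraps the example query in EXPLAIN; the catalog, schema, and table names are taken from that example, and the exact plan text depends on the connectors and Trino version in a given deployment.

```sql
-- Show the distributed plan for the federated join.
-- Table scans in the plan typically name the catalog they read from,
-- which makes it possible to see what work is sent to each source.
EXPLAIN (TYPE DISTRIBUTED)
SELECT
  orders.order_id,
  customers.name,
  inventory.stock_level
FROM delta.public.orders AS orders
JOIN postgresql.crm.customers AS customers
  ON orders.customer_id = customers.id
JOIN iceberg.warehouse.inventory AS inventory
  ON orders.product_id = inventory.product_id
WHERE orders.date > DATE '2026-01-01';
```

If the date filter appears as part of the orders table scan rather than as a separate filter step, the predicate was pushed down to the connector.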


Federation Topology

                    Trino Coordinator
                    /       |       \
                   /        |        \
           Delta Lake   PostgreSQL   Iceberg
           (S3/MinIO)   (External)   (S3/MinIO)
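
To check which sources are visible to the coordinator, the standard Trino metadata statements can be used. This is a minimal sketch that assumes the catalogs are registered under the names used in the query example (delta, postgresql, iceberg); the actual catalog names in a MATIH deployment may differ.

```sql
-- List every catalog registered with the Trino coordinator.
SHOW CATALOGS;

-- Inspect one catalog: its schemas, then the tables in a schema.
SHOW SCHEMAS FROM postgresql;
SHOW TABLES FROM postgresql.crm;
```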

Supported Source Combinations

Source A     Source B     Join Support   Performance Notes
----------   ----------   ------------   ---------------------------------------------
Delta Lake   Delta Lake   Full           Best performance; pushdown optimized
Delta Lake   Iceberg      Full           Good performance; similar storage layer
Delta Lake   PostgreSQL   Full           Network transfer for PostgreSQL data
Any          Memory       Full           In-memory tables are fastest for lookup joins

Query Optimization for Federation

Trino applies several optimizations for federated queries:

Optimization          Description
-------------------   --------------------------------------------------
Predicate pushdown    WHERE clauses pushed to source connectors
Projection pushdown   Only required columns fetched from sources
Dynamic filtering     Runtime filters from one source applied to another
Join reordering       Smaller tables processed first for broadcast joins
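
Several of these optimizations are controlled by Trino session properties. The sketch below shows the relevant settings; the values are the usual defaults in recent Trino releases, so setting them explicitly should only matter if a deployment has overridden them. Whether MATIH exposes session properties to end users is an assumption.

```sql
-- Allow runtime filters built from one side of a join to prune the other side.
SET SESSION enable_dynamic_filtering = true;

-- Let the cost-based optimizer choose the join order and distribution
-- (broadcast versus partitioned) from table statistics.
SET SESSION join_reordering_strategy = 'AUTOMATIC';
SET SESSION join_distribution_type = 'AUTOMATIC';

-- Cost-based decisions depend on statistics, so collect them
-- for the lake tables taking part in the join.
ANALYZE delta.public.orders;
ANALYZE iceberg.warehouse.inventory;
```

Without statistics the optimizer cannot estimate table sizes, and join reordering falls back to less informed choices.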

Limitations

Limitation          Description
-----------------   ----------------------------------------------
Write federation    Cross-catalog INSERT/CREATE is not supported
Transaction scope   Transactions do not span multiple catalogs
Data freshness      External sources may have stale data
Network overhead    Remote sources add network latency

Best Practices

  • Place the largest table in the most performant catalog (Delta Lake or Iceberg)
  • Use predicate filters to minimize data transfer from external sources
  • Consider materialized views for frequently federated queries
  • Monitor cross-catalog query performance through the analytics dashboard
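
Two of these practices can be sketched in SQL. The first statement filters the external PostgreSQL table before the join; the `region` column is hypothetical. The second stores a frequently used federated join as a materialized view in the Iceberg catalog; the view name `orders_enriched` is hypothetical, and whether a MATIH deployment permits a materialized view whose defining query spans catalogs is an assumption to verify against the write limitations listed earlier.

```sql
-- Filter the external source early so fewer rows cross the network.
SELECT o.order_id, c.name
FROM delta.public.orders AS o
JOIN (
  SELECT id, name
  FROM postgresql.crm.customers
  WHERE region = 'EMEA'          -- hypothetical column
) AS c
  ON o.customer_id = c.id
WHERE o.date > DATE '2026-01-01';

-- Store the result of a frequently run federated join.
CREATE MATERIALIZED VIEW iceberg.warehouse.orders_enriched AS
SELECT o.order_id, c.name, o.date AS order_date
FROM delta.public.orders AS o
JOIN postgresql.crm.customers AS c
  ON o.customer_id = c.id;

-- Re-run the defining query to bring the stored data up to date.
REFRESH MATERIALIZED VIEW iceberg.warehouse.orders_enriched;
```

Queries against the materialized view read stored data, so results are only as fresh as the last refresh.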