MATIH Platform is in active MVP development. Documentation reflects current implementation status.
# Compute Engines

The MATIH Platform uses four compute engines for different data processing workloads: Trino for federated SQL queries, ClickHouse/StarRocks for OLAP analytics, Apache Spark for batch processing, and Apache Flink for stream processing. Each engine is selected for its performance characteristics and workload fit.


## Engine Overview

| Engine | Category | Workload Type | Deployment |
|--------|----------|---------------|------------|
| Trino | Query federation | Interactive SQL across heterogeneous sources | Kubernetes StatefulSet |
| ClickHouse | OLAP | Fast analytical queries on columnar data | Kubernetes Operator |
| StarRocks | OLAP | Real-time analytics with materialized views | Kubernetes StatefulSet |
| Apache Spark | Batch processing | Large-scale data transformations and ETL | Spark on Kubernetes |
| Apache Flink | Stream processing | Real-time data transformations and CDC | Kubernetes Operator |

## Trino

Trino is the primary query execution engine for the platform. It provides federated SQL across multiple data sources.

| Aspect | Details |
|--------|---------|
| Role | Primary query engine for all SQL execution |
| Data sources | Iceberg (lakehouse), ClickHouse (OLAP), PostgreSQL (metadata), Hive (legacy) |
| Access pattern | JDBC from the Query Engine service |
| Multi-tenancy | Per-tenant catalog configuration |
| Performance target | < 500 ms (p95) for simple queries; < 30 s for complex analytics |

### Trino Connectors

| Connector | Data Source | Use Case |
|-----------|-------------|----------|
| Iceberg | S3/MinIO lakehouse tables | Primary data lake access |
| ClickHouse | ClickHouse OLAP engine | Pre-aggregated analytical queries |
| PostgreSQL | PostgreSQL databases | Metadata and small-table queries |
| Hive | Hive Metastore | Legacy data warehouse access |
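Because every catalog above is addressable from a single Trino session, cross-source joins reduce to ordinary SQL. The sketch below builds such a federated query as a string; the catalog, schema, and table names (`iceberg.<tenant>.events`, `postgresql.public.tenants`) are illustrative, not the platform's actual schema, and the commented-out client call assumes the open-source Trino Python client:

```python
# Sketch: a federated join across two Trino catalogs (Iceberg lakehouse
# + PostgreSQL metadata). All catalog/schema/table names are hypothetical.
from textwrap import dedent

def federated_join_sql(tenant: str) -> str:
    """Build a query joining lakehouse events with tenant metadata."""
    return dedent(f"""\
        SELECT e.event_type, count(*) AS events
        FROM iceberg.{tenant}.events AS e
        JOIN postgresql.public.tenants AS t ON e.tenant_id = t.id
        GROUP BY e.event_type
        """).strip()

# Executing it would use the Trino Python client (assumed installed):
#   import trino
#   conn = trino.dbapi.connect(host="trino.example.svc", port=8080, user="query-engine")
#   rows = conn.cursor().execute(federated_join_sql("acme")).fetchall()
```

Per-tenant catalog configuration means the tenant-specific portion is confined to the catalog/schema qualifier, as shown.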

## ClickHouse

ClickHouse provides fast OLAP analytics on large datasets:

| Aspect | Details |
|--------|---------|
| Role | Pre-aggregated analytics, time-series data, event analytics |
| Storage format | Columnar, MergeTree engine family |
| Access pattern | Trino ClickHouse connector; direct JDBC for analytics |
| Performance | Millions of rows per second on aggregation queries |
| Compression | LZ4 or ZSTD columnar compression |
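A MergeTree-family table ties the storage format, ordering key, and compression choices above together. The DDL sketch below is illustrative (database, table, and column names are hypothetical); it parameterizes the LZ4/ZSTD codec choice:

```python
# Sketch: ClickHouse MergeTree DDL for an event-analytics table.
# Names are illustrative; codec mirrors the LZ4/ZSTD options noted above.
def events_ddl(codec: str = "ZSTD") -> str:
    """Build CREATE TABLE DDL for a partitioned, ordered MergeTree table."""
    return (
        "CREATE TABLE IF NOT EXISTS analytics.agent_events (\n"
        "    tenant_id   String,\n"
        "    event_time  DateTime,\n"
        f"    payload     String CODEC({codec})\n"
        ") ENGINE = MergeTree\n"
        "PARTITION BY toYYYYMM(event_time)\n"       # monthly partitions
        "ORDER BY (tenant_id, event_time)"          # sort key drives index/scans
    )
```

The `ORDER BY` key determines the sparse primary index, which is what makes the multi-million-rows-per-second aggregation scans feasible.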

## StarRocks

StarRocks provides real-time analytics with materialized views:

| Aspect | Details |
|--------|---------|
| Role | Real-time dashboard queries, materialized aggregations |
| Storage | Columnar with intelligent indexing |
| Access pattern | JDBC / MySQL wire protocol |
| Materialized views | Automatic refresh for pre-computed aggregations |
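The automatic-refresh behavior comes from StarRocks async materialized views. A sketch of the DDL a dashboard aggregation might use (view, table, and column names are hypothetical; the `REFRESH ASYNC EVERY` clause is StarRocks syntax for scheduled refresh):

```python
# Sketch: StarRocks async materialized view for a dashboard aggregation.
# All object names are illustrative.
def dashboard_mv_sql(interval_minutes: int = 5) -> str:
    """Build DDL for a periodically refreshed materialized aggregation."""
    return (
        "CREATE MATERIALIZED VIEW dashboard_agent_stats\n"
        f"REFRESH ASYNC EVERY (INTERVAL {interval_minutes} MINUTE)\n"
        "AS SELECT agent_id,\n"
        "          date_trunc('minute', event_time) AS minute,\n"
        "          count(*) AS events\n"
        "   FROM agent_events\n"
        "   GROUP BY agent_id, date_trunc('minute', event_time)"
    )
```

Dashboard queries then hit the pre-computed view over the MySQL wire protocol instead of re-aggregating the base table, which is what enables sub-second responses.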

## Apache Spark

Spark handles large-scale batch data processing:

| Aspect | Details |
|--------|---------|
| Version | 3.5 |
| Role | ETL pipelines, large data transformations, feature engineering |
| Deployment | Spark on Kubernetes (`spark-submit` to K8s) |
| Data formats | Parquet, Iceberg, Delta Lake, CSV, JSON |
| Integration | Pipeline Service orchestrates Spark jobs via Temporal |
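The Spark-on-Kubernetes deployment mode boils down to a `spark-submit` invocation pointed at the K8s API server. The helper below assembles such a command; the image name, namespace, and application path are illustrative placeholders, while the `spark.kubernetes.*` configuration keys are standard Spark settings:

```python
# Sketch: build a spark-submit command for cluster-mode execution on
# Kubernetes. Image/namespace/app path are illustrative.
def spark_submit_cmd(image: str, app_path: str, namespace: str = "pipelines") -> list[str]:
    """Assemble argv for spark-submit targeting the in-cluster K8s API."""
    return [
        "spark-submit",
        "--master", "k8s://https://kubernetes.default.svc",
        "--deploy-mode", "cluster",                      # driver runs as a pod
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.kubernetes.namespace={namespace}",
        app_path,                                        # e.g. local:///opt/jobs/etl.py
    ]
```

In this architecture the Pipeline Service would issue such a submission from a Temporal activity, letting Temporal own retries and workflow state while Kubernetes owns the executor pods.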

## Apache Flink

Flink handles real-time stream processing:

| Aspect | Details |
|--------|---------|
| Version | 1.18+ |
| Role | Real-time transformations, CDC processing, event aggregation |
| Deployment | Flink Kubernetes Operator |
| Source connectors | Kafka, CDC (Debezium), file systems |
| Sink connectors | ClickHouse, Iceberg, Kafka, Elasticsearch |

### Flink Jobs

| Job | Source | Sink | Purpose |
|-----|--------|------|---------|
| Agent performance aggregation | Kafka agent events | ClickHouse | Real-time agent metric aggregation |
| LLM operations aggregation | Kafka LLM events | ClickHouse | Token usage and latency tracking |
| Session analytics | Kafka session events | ClickHouse | User session behavior analysis |
| State transition CDC | PostgreSQL CDC | Kafka | Change data capture for state changes |
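With the Flink Kubernetes Operator, each job above is declared as a `FlinkDeployment` custom resource. The sketch below builds a minimal manifest as a Python dict; the deployment name, jar URI, and resource sizes are illustrative, while the `apiVersion`, `kind`, and spec fields follow the operator's CRD:

```python
# Sketch: minimal FlinkDeployment manifest (Flink Kubernetes Operator CRD)
# as a Python dict. Name, jarURI, and sizing are illustrative.
def flink_deployment(name: str, jar_uri: str) -> dict:
    """Build a FlinkDeployment resource for a single streaming job."""
    return {
        "apiVersion": "flink.apache.org/v1beta1",
        "kind": "FlinkDeployment",
        "metadata": {"name": name},
        "spec": {
            "image": "flink:1.18",
            "flinkVersion": "v1_18",
            "jobManager": {"resource": {"memory": "2048m", "cpu": 1}},
            "taskManager": {"resource": {"memory": "2048m", "cpu": 1}},
            "job": {
                "jarURI": jar_uri,          # job artifact baked into the image
                "parallelism": 2,
                "upgradeMode": "stateless",
            },
        },
    }
```

Serialized to YAML and applied to the cluster, this lets the operator manage the job lifecycle (submission, restarts, upgrades) declaratively, one resource per job in the table above.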

## Engine Selection Guide

| Workload | Recommended Engine | Rationale |
|----------|--------------------|-----------|
| Interactive SQL across sources | Trino | Federated queries across heterogeneous stores |
| Fast aggregations on large data | ClickHouse | Columnar storage optimized for aggregations |
| Real-time dashboard queries | StarRocks | Materialized views for sub-second response |
| Large-scale ETL | Spark | Distributed processing with fault tolerance |
| Real-time event processing | Flink | Low-latency stream processing with exactly-once semantics |
| ML feature engineering | Spark or Ray | Distributed compute with ML library access |

## Related Pages