# Compute Engines
The MATIH Platform uses five compute engines across four workload categories: Trino for federated SQL queries, ClickHouse and StarRocks for OLAP analytics, Apache Spark for batch processing, and Apache Flink for stream processing. Each engine is selected for its performance characteristics and fit for the workload.
## Engine Overview
| Engine | Category | Workload Type | Deployment |
|---|---|---|---|
| Trino | Query federation | Interactive SQL across heterogeneous sources | Kubernetes StatefulSet |
| ClickHouse | OLAP | Fast analytical queries on columnar data | Kubernetes Operator |
| StarRocks | OLAP | Real-time analytics with materialized views | Kubernetes StatefulSet |
| Apache Spark | Batch processing | Large-scale data transformations and ETL | Spark on Kubernetes |
| Apache Flink | Stream processing | Real-time data transformations and CDC | Kubernetes Operator |
## Trino

Trino is the platform's primary query execution engine, providing federated SQL over heterogeneous data sources.
| Aspect | Details |
|---|---|
| Role | Primary query engine for all SQL execution |
| Data sources | Iceberg (lakehouse), ClickHouse (OLAP), PostgreSQL (metadata), Hive (legacy) |
| Access pattern | JDBC from Query Engine service |
| Multi-tenancy | Per-tenant catalog configuration |
| Performance target | < 500 ms (p95) for simple queries; < 30 s for complex analytics |
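Because Trino addresses every source as `catalog.schema.table`, a single statement can join lakehouse and metadata tables. A minimal sketch using the trino-python-client; the host, user, and all catalog/schema/table names are illustrative assumptions, not the platform's actual configuration:

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Build a fully-qualified Trino table name (catalog.schema.table)."""
    return f"{catalog}.{schema}.{table}"

# A federated join across the Iceberg lakehouse and PostgreSQL metadata
# catalogs. Names are placeholders for the sketch.
FEDERATED_SQL = f"""
SELECT e.tenant_id, t.tenant_name, count(*) AS events
FROM {qualified('iceberg', 'events', 'agent_events')} e
JOIN {qualified('postgresql', 'public', 'tenants')} t
  ON e.tenant_id = t.tenant_id
GROUP BY e.tenant_id, t.tenant_name
"""

def run_federated_query(sql: str, host: str = "trino.matih.svc", port: int = 8080):
    """Execute a query via the trino-python-client (pip install trino)."""
    import trino  # lazy import so the sketch loads without the client installed
    conn = trino.dbapi.connect(host=host, port=port, user="query-engine")
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()
```

The per-tenant catalog configuration noted above would typically be applied by substituting a tenant-specific catalog name into `qualified()`.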
### Trino Connectors
| Connector | Data Source | Use Case |
|---|---|---|
| Iceberg | S3/MinIO lakehouse tables | Primary data lake access |
| ClickHouse | ClickHouse OLAP engine | Pre-aggregated analytical queries |
| PostgreSQL | PostgreSQL databases | Metadata and small-table queries |
| Hive | Hive Metastore | Legacy data warehouse access |
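Each connector is exposed to Trino as a catalog defined by a properties file. A minimal sketch of an Iceberg catalog backed by a REST catalog and MinIO; the endpoint URIs and values are placeholders, not the platform's actual settings:

```properties
# etc/catalog/iceberg.properties (illustrative values)
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=http://iceberg-rest:8181
fs.native-s3.enabled=true
s3.endpoint=http://minio:9000
s3.path-style-access=true
```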
## ClickHouse
ClickHouse provides fast OLAP analytics on large datasets:
| Aspect | Details |
|---|---|
| Role | Pre-aggregated analytics, time-series data, event analytics |
| Storage format | Columnar with MergeTree engine family |
| Access pattern | Trino ClickHouse connector, direct JDBC for analytics |
| Performance | Millions of rows per second on aggregation queries |
| Compression | LZ4 or ZSTD for columnar compression |
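To make the MergeTree and compression rows concrete, here is an illustrative event-analytics table DDL and a typical aggregation query, held as strings in Python (the document's examples use no specific client). The table and column names are assumptions for the sketch, not the platform's actual schema:

```python
# MergeTree table partitioned by month with per-column compression codecs.
AGENT_EVENTS_DDL = """
CREATE TABLE IF NOT EXISTS agent_events
(
    tenant_id  LowCardinality(String),
    agent_id   String,
    event_time DateTime64(3) CODEC(Delta, ZSTD),
    latency_ms Float64       CODEC(ZSTD)
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_time)
ORDER BY (tenant_id, agent_id, event_time)
"""

# A typical aggregation: columnar storage means only the referenced
# columns are read, which is what makes wide scans fast.
HOURLY_LATENCY_SQL = """
SELECT tenant_id,
       toStartOfHour(event_time) AS hour,
       avg(latency_ms)           AS avg_latency
FROM agent_events
GROUP BY tenant_id, hour
"""
```

The `ORDER BY` key doubles as the primary index, so queries filtering on `tenant_id` skip most data parts entirely.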
## StarRocks
StarRocks provides real-time analytics with materialized views:
| Aspect | Details |
|---|---|
| Role | Real-time dashboard queries, materialized aggregations |
| Storage | Columnar with intelligent indexing |
| Access pattern | JDBC/MySQL protocol |
| Materialized views | Automatic refresh for pre-computed aggregations |
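Since StarRocks speaks the MySQL wire protocol, any MySQL client can connect; materialized views are created with ordinary DDL. A sketch with an asynchronously refreshed view; the DSN, refresh interval, and schema are illustrative assumptions:

```python
# StarRocks FE exposes the MySQL protocol on port 9030 by default.
STARROCKS_DSN = {"host": "starrocks-fe.matih.svc", "port": 9030, "user": "analytics"}

# An async materialized view that pre-computes the aggregation a dashboard
# would otherwise run on every load. Table/column names are placeholders.
DASHBOARD_MV_SQL = """
CREATE MATERIALIZED VIEW IF NOT EXISTS session_counts_mv
REFRESH ASYNC EVERY(INTERVAL 1 MINUTE)
AS
SELECT tenant_id,
       date_trunc('hour', started_at) AS hour,
       count(*)                       AS sessions
FROM sessions
GROUP BY tenant_id, date_trunc('hour', started_at)
"""

def connect_starrocks(dsn: dict = STARROCKS_DSN):
    """Open a connection with pymysql (pip install pymysql)."""
    import pymysql  # lazy import so the sketch loads without the driver
    return pymysql.connect(**dsn)
```

Queries against `sessions` that match the view's shape can be transparently rewritten by StarRocks to hit the pre-computed result.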
## Apache Spark
Spark handles large-scale batch data processing:
| Aspect | Details |
|---|---|
| Version | 3.5 |
| Role | ETL pipelines, large data transformations, feature engineering |
| Deployment | Spark on Kubernetes (spark-submit to K8s) |
| Data formats | Parquet, Iceberg, Delta Lake, CSV, JSON |
| Integration | Pipeline Service orchestrates Spark jobs via Temporal |
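The "spark-submit to K8s" deployment mode means the Pipeline Service ultimately builds a `spark-submit` invocation against the cluster API server. A sketch of such a command builder; the image name, namespace, and defaults are placeholders, not the platform's actual values:

```python
def spark_submit_args(app_path: str,
                      app_name: str,
                      namespace: str = "pipelines",
                      image: str = "matih/spark:3.5",
                      executors: int = 4) -> list:
    """Assemble spark-submit arguments for cluster-mode Spark on Kubernetes."""
    return [
        "spark-submit",
        "--master", "k8s://https://kubernetes.default.svc",
        "--deploy-mode", "cluster",
        "--name", app_name,
        "--conf", f"spark.kubernetes.namespace={namespace}",
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
        app_path,  # e.g. an s3a:// or local:// path to the job artifact
    ]
```

Usage: `spark_submit_args("s3a://jobs/nightly_etl.py", "nightly-etl")` yields the argument list a Temporal activity could hand to a subprocess.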
## Apache Flink
Flink handles real-time stream processing:
| Aspect | Details |
|---|---|
| Version | 1.18+ |
| Role | Real-time transformations, CDC processing, event aggregation |
| Deployment | Flink Kubernetes Operator |
| Source connectors | Kafka, CDC (Debezium), file systems |
| Sink connectors | ClickHouse, Iceberg, Kafka, Elasticsearch |
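With the Flink Kubernetes Operator, each job is declared as a `FlinkDeployment` custom resource. A minimal sketch; the image, jar URI, and resource sizes are placeholders, not the platform's actual manifests:

```yaml
# Illustrative FlinkDeployment for the Flink Kubernetes Operator.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: agent-metrics-aggregation
spec:
  image: matih/flink-jobs:latest
  flinkVersion: v1_18
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "4096m"
      cpu: 2
  job:
    jarURI: local:///opt/flink/jobs/agent-metrics.jar
    parallelism: 4
    upgradeMode: savepoint
```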
### Flink Jobs
| Job | Source | Sink | Purpose |
|---|---|---|---|
| Agent performance aggregation | Kafka agent events | ClickHouse | Real-time agent metric aggregation |
| LLM operations aggregation | Kafka LLM events | ClickHouse | Token usage and latency tracking |
| Session analytics | Kafka session events | ClickHouse | User session behavior analysis |
| State transition CDC | PostgreSQL CDC | Kafka | Change data capture for state changes |
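The aggregation jobs above reduce to keyed windowed counts. A simplified, in-memory illustration of what the agent performance aggregation computes, per-agent event counts in tumbling one-minute windows; the real job runs in Flink with exactly-once semantics, and the event shape here is an assumption:

```python
from collections import defaultdict

def tumbling_counts(events, window_ms: int = 60_000) -> dict:
    """events: [{'agent_id': str, 'ts': epoch-millis}, ...]
    Returns {(agent_id, window_start_ms): count} for tumbling windows."""
    counts = defaultdict(int)
    for e in events:
        # Align the timestamp down to the start of its window.
        window_start = (e["ts"] // window_ms) * window_ms
        counts[(e["agent_id"], window_start)] += 1
    return dict(counts)
```

For example, events at 1 s, 59 s, and 61 s for the same agent land in two windows: two in the first minute, one in the second.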
## Engine Selection Guide
| Workload | Recommended Engine | Rationale |
|---|---|---|
| Interactive SQL across sources | Trino | Federated query across heterogeneous stores |
| Fast aggregations on large data | ClickHouse | Columnar storage optimized for aggregations |
| Real-time dashboard queries | StarRocks | Materialized views for sub-second response |
| Large-scale ETL | Spark | Distributed processing with fault tolerance |
| Real-time event processing | Flink | Low-latency stream processing with exactly-once |
| ML feature engineering | Spark or Ray | Distributed compute with ML library access |
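The guide above can be encoded as a simple lookup for routing code; a real router would also inspect query shape and data location, and the workload keys here are made up for the sketch:

```python
# Mirrors the selection-guide table; keys are illustrative labels.
ENGINE_BY_WORKLOAD = {
    "interactive_sql": "Trino",
    "fast_aggregation": "ClickHouse",
    "realtime_dashboard": "StarRocks",
    "batch_etl": "Spark",
    "stream_processing": "Flink",
}

def pick_engine(workload: str) -> str:
    """Return the recommended engine for a workload label."""
    try:
        return ENGINE_BY_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload}") from None
```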
## Related Pages
- Data Infrastructure -- Data store technologies
- Data Stores: Trino -- Trino architecture details
- Data Stores: OLAP -- ClickHouse and StarRocks
- Data Flow: Query Flow -- Query execution lifecycle