MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Spark

Apache Spark provides distributed computing for batch ETL, interactive analytics via Spark Connect, and ML feature engineering.


Architecture

MATIH deploys Spark 4.1.1 with Spark Connect for interactive client access:

+------------------+     +------------------+
| Spark Connect    |     | Spark Executors  |
| Server           |---->| (On-demand)      |
| Port: 15002      |     | Dynamic alloc    |
+------------------+     +------------------+
       |
       v
+------------------+
| Polaris / S3     |
| (Iceberg tables) |
+------------------+
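
The diagram labels the executors as on-demand with dynamic allocation. As a rough sketch of what that could look like in terms of standard Spark properties, the helper below assembles a dynamic allocation configuration. The function name `dynamic_allocation_conf` and the executor bounds are hypothetical and not taken from the MATIH charts; enabling shuffle tracking is an assumption based on how dynamic allocation is commonly run on Kubernetes without an external shuffle service.

```python
def dynamic_allocation_conf(min_executors: int = 0, max_executors: int = 10) -> dict:
    """Return Spark properties that enable dynamic executor allocation.

    The default bounds are illustrative values, not MATIH defaults.
    """
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("require 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Assumption: no external shuffle service is available, so Spark
        # tracks shuffle files to decide when an executor can be released.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
    }


# Print the properties in the form accepted by spark-submit.
for key, value in dynamic_allocation_conf(0, 8).items():
    print(f"--conf {key}={value}")
```

With a minimum of zero, executors are requested only while work is pending, which matches the "On-demand" label in the diagram.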

Spark Connect

# Global configuration
global:
  sparkConnect:
    enabled: true
    host: "spark-connect.matih-data-plane.svc.cluster.local"
    port: 15002
    connectTimeoutMs: 10000
    requestTimeoutMs: 300000

Services connect to Spark via gRPC on port 15002 for interactive DataFrame operations.
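
To illustrate how a client service might use the host and port above, the following minimal sketch builds a Spark Connect URL and opens a remote session with PySpark. The helper names `connect_url` and `open_session` are hypothetical, and the sketch assumes a `pyspark` installation with Spark Connect support; it does not show how `connectTimeoutMs` or `requestTimeoutMs` are applied.

```python
def connect_url(host: str, port: int = 15002) -> str:
    """Build a Spark Connect URL of the form sc://host:port."""
    return f"sc://{host}:{port}"


def open_session(url: str):
    """Open a remote Spark session over Spark Connect (gRPC)."""
    # Imported lazily so that building the URL does not require pyspark.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(url).getOrCreate()


# Host taken from the global configuration above.
url = connect_url("spark-connect.matih-data-plane.svc.cluster.local")
print(url)
```

A service would then call `open_session(url)` and run DataFrame operations such as `spark.range(5).show()` against the remote cluster.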


History Server

The Spark History Server provides a web UI for completed applications, reconstructed from the event logs stored in the configured log directory:

sparkHistoryServer:
  enabled: true
  logDirectory: "s3a://spark-history/"
  s3:
    endpoint: "http://minio.matih-data-plane.svc.cluster.local:9000"
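
The History Server can only display applications that wrote event logs to the directory it reads. The sketch below shows, under stated assumptions, the standard Spark properties an application would set to log to the same location. The helper name `event_log_conf` is hypothetical, path-style access is an assumption that is typical for MinIO, and credentials are deliberately omitted.

```python
def event_log_conf(log_dir: str, s3_endpoint: str) -> dict:
    """Return Spark properties that write application event logs to log_dir."""
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_dir,
        # S3A settings for an S3-compatible object store.
        "spark.hadoop.fs.s3a.endpoint": s3_endpoint,
        # Assumption: the store is addressed by path rather than by
        # virtual-hosted bucket name, as is common with MinIO.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


# Values taken from the sparkHistoryServer configuration above.
conf = event_log_conf(
    "s3a://spark-history/",
    "http://minio.matih-data-plane.svc.cluster.local:9000",
)
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

On the server side, the matching property is `spark.history.fs.logDirectory`, which should point at the same `s3a://spark-history/` location.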