Spark

Apache Spark provides distributed computing for batch ETL, interactive analytics via Spark Connect, and ML feature engineering.

Architecture

MATIH deploys Spark 4.1.1 with Spark Connect for interactive client access:

+------------------+     +-------------------+
| Spark Connect    |     | Spark Executors   |
| Server           |---->| (On-demand)       |
| Port: 15002      |     | Dynamic alloc     |
+------------------+     +-------------------+
       |
       v
+------------------+
| Polaris / S3     |
| (Iceberg tables) |
+------------------+
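
The on-demand executors in the diagram rely on Spark's dynamic allocation. The following is a minimal sketch of the Spark properties this usually involves, assuming a Kubernetes deployment without an external shuffle service. The function name, executor bounds, and idle timeout are illustrative assumptions, not MATIH's actual settings.

```python
def dynamic_allocation_conf(min_executors: int = 0, max_executors: int = 10) -> dict[str, str]:
    """Return Spark properties that enable dynamic executor allocation.

    The default bounds are placeholders chosen for illustration only.
    """
    if min_executors < 0 or max_executors < min_executors:
        raise ValueError("require 0 <= min_executors <= max_executors")
    return {
        "spark.dynamicAllocation.enabled": "true",
        # Without an external shuffle service, shuffle tracking lets Spark
        # decide when an idle executor no longer holds needed shuffle data.
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Assumed value: release executors that stay idle for 60 seconds.
        "spark.dynamicAllocation.executorIdleTimeout": "60s",
    }
```

Setting the lower bound to zero is what would let the cluster scale down to the Spark Connect server alone when no queries are running.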

Spark Connect

# Global configuration
global:
  sparkConnect:
    enabled: true
    host: "spark-connect.matih-data-plane.svc.cluster.local"
    port: 15002
    connectTimeoutMs: 10000
    requestTimeoutMs: 300000

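Services connect to the Spark Connect server over gRPC on port 15002 for interactive DataFrame operations. With the values above, a connection attempt times out after 10 seconds and a request after 5 minutes.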
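
As a rough illustration of a client, the sketch below builds the `sc://` URL from the host and port above and opens a session with PySpark's `SparkSession.builder.remote(...)`. The `SparkConnectSettings` class and `connect` helper are hypothetical names introduced here. The two timeout settings are left out because this page does not show how MATIH services apply them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SparkConnectSettings:
    """Hypothetical mirror of the host and port under global.sparkConnect."""

    host: str = "spark-connect.matih-data-plane.svc.cluster.local"
    port: int = 15002

    def remote_url(self) -> str:
        # Spark Connect clients address the server with an sc:// URL.
        return f"sc://{self.host}:{self.port}"


def connect(settings: SparkConnectSettings):
    """Open a Spark Connect session; requires pyspark with Connect support."""
    # Imported lazily so that remote_url() can be used without pyspark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(settings.remote_url()).getOrCreate()
```

From a pod that can reach the service, `connect(SparkConnectSettings()).range(5).show()` would run a small DataFrame operation on the remote cluster rather than in the client process.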

History Server

The Spark History Server serves a web UI for completed applications, reconstructed from the event logs stored in the configured log directory:

sparkHistoryServer:
  enabled: true
  logDirectory: "s3a://spark-history/"
  s3:
    endpoint: "http://minio.matih-data-plane.svc.cluster.local:9000"
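
The History Server can only show applications that write their event logs to the directory it reads. Below is a sketch of the application-side Spark properties that would match the values above. The `event_log_conf` helper is a hypothetical name, and path-style S3 access is an assumption that commonly applies to MinIO endpoints. Credentials are omitted.

```python
def event_log_conf(log_directory: str, s3_endpoint: str) -> dict[str, str]:
    """Spark properties for an application that writes event logs to S3-compatible storage."""
    if not log_directory.startswith("s3a://"):
        raise ValueError("log_directory is expected to be an s3a:// URI")
    return {
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_directory,
        "spark.hadoop.fs.s3a.endpoint": s3_endpoint,
        # Assumption: the MinIO endpoint is addressed path-style, not by virtual host.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


conf = event_log_conf(
    "s3a://spark-history/",
    "http://minio.matih-data-plane.svc.cluster.local:9000",
)
# Print the properties in the form accepted by spark-submit.
for key, value in sorted(conf.items()):
    print(f"--conf {key}={value}")
```

On the server side, the corresponding setting is `spark.history.fs.logDirectory`, which should point at the same location as `spark.eventLog.dir`.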
