Spark
Apache Spark provides distributed computing for batch ETL, interactive analytics via Spark Connect, and ML feature engineering.
Architecture
MATIH deploys Spark 4.1.1 with Spark Connect for interactive client access:
+------------------+     +------------------+
| Spark Connect    |     | Spark Executors  |
| Server           |---->| (On-demand)      |
| Port: 15002      |     | Dynamic alloc    |
+------------------+     +------------------+
         |
         v
+------------------+
| Polaris / S3     |
| (Iceberg tables) |
+------------------+
Spark Connect
# Global configuration
global:
  sparkConnect:
    enabled: true
    host: "spark-connect.matih-data-plane.svc.cluster.local"
    port: 15002
    connectTimeoutMs: 10000
    requestTimeoutMs: 300000
Services connect to Spark via gRPC on port 15002 for interactive DataFrame operations.
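As a rough illustration of what a client-side connection might look like, the sketch below builds the `sc://` connection string from the host and port in the configuration above and opens a remote session with PySpark. The helper names (`spark_connect_url`, `open_session`, `main`) are hypothetical and not part of MATIH. The sketch assumes the PySpark Spark Connect client is installed and that the code runs somewhere the cluster-internal hostname resolves. How MATIH services apply `connectTimeoutMs` and `requestTimeoutMs` is not covered here.

```python
# Values copied from the global.sparkConnect configuration.
SPARK_CONNECT_HOST = "spark-connect.matih-data-plane.svc.cluster.local"
SPARK_CONNECT_PORT = 15002


def spark_connect_url(host: str, port: int) -> str:
    """Build the sc:// connection string used by Spark Connect clients."""
    return f"sc://{host}:{port}"


def open_session(url: str):
    """Open a remote Spark session (requires the PySpark Connect client)."""
    # Imported lazily so the URL helper can be used without PySpark installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(url).getOrCreate()


def main() -> None:
    """Hypothetical smoke test: run a trivial DataFrame query remotely."""
    spark = open_session(spark_connect_url(SPARK_CONNECT_HOST, SPARK_CONNECT_PORT))
    # The query plan travels over gRPC and executes on the cluster.
    spark.range(5).show()
    spark.stop()
```

Calling `main()` from a pod inside the cluster would be one way to check that the endpoint is reachable. Nothing runs on import, so the URL helper can be reused on its own.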
History Server
The Spark History Server provides UI access to completed application logs:
sparkHistoryServer:
  enabled: true
  logDirectory: "s3a://spark-history/"
  s3:
    endpoint: "http://minio.matih-data-plane.svc.cluster.local:9000"
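The History Server can only show applications that wrote their event logs to the directory it reads. The sketch below shows one way the application-side Spark properties could be derived from the values above. The function name `event_log_conf` is hypothetical, and whether MATIH sets these properties automatically is not stated here. Enabling path-style access is an assumption that is commonly needed for MinIO endpoints.

```python
from typing import Dict


def event_log_conf(log_directory: str, s3_endpoint: str) -> Dict[str, str]:
    """Return Spark properties that make an application write event logs
    to the directory the History Server is configured to read."""
    return {
        # Turn on event logging and point it at the shared log directory.
        "spark.eventLog.enabled": "true",
        "spark.eventLog.dir": log_directory,
        # Route s3a:// paths to the configured object store endpoint.
        "spark.hadoop.fs.s3a.endpoint": s3_endpoint,
        # Assumption: MinIO is addressed path-style rather than by bucket subdomain.
        "spark.hadoop.fs.s3a.path.style.access": "true",
    }


conf = event_log_conf(
    "s3a://spark-history/",
    "http://minio.matih-data-plane.svc.cluster.local:9000",
)
for key, value in sorted(conf.items()):
    # Print in the form accepted by spark-submit's --conf flag.
    print(f"--conf {key}={value}")
```

On the server side, the matching property is `spark.history.fs.logDirectory`, which should point at the same `s3a://spark-history/` location.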