Flink Streaming Jobs Overview
MATIH runs four production Flink SQL jobs that feed Iceberg tables for analytics. Three consume events from Kafka topics and aggregate them over tumbling windows; the fourth streams PostgreSQL CDC changes continuously. All four are deployed to the Flink cluster managed by the Flink Operator in the matih-data-plane namespace.
Production Jobs
| Job | Source Topic | Sink Table | Window |
|---|---|---|---|
| Session Analytics | matih.ai.state-changes | matih_analytics.session_analytics | 15 min |
| Agent Performance | matih.ai.agent-traces | matih_analytics.agent_performance_metrics | 5 min |
| LLM Operations | matih.ai.llm-ops | matih_analytics.llm_operations_metrics | 15 min |
| State Transition CDC | PostgreSQL CDC | matih_analytics.state_transition_log | Continuous |
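As a rough illustration of the 5-minute tumbling window listed for the Agent Performance job, the query below uses Flink's TUMBLE table-valued function. The source table `agent_traces` and the columns `event_time`, `agent_id`, and `latency_ms` are hypothetical; the real schema is defined in infrastructure/flink/jobs/agent-performance-agg.sql.

```sql
-- Sketch only: table and column names are assumptions, not the real schema.
-- Assumes agent_traces is already declared with event_time as its
-- watermarked event-time attribute.
SELECT
  window_start,
  window_end,
  agent_id,
  COUNT(*)        AS trace_count,
  AVG(latency_ms) AS avg_latency_ms
FROM TABLE(
  TUMBLE(TABLE agent_traces, DESCRIPTOR(event_time), INTERVAL '5' MINUTES)
)
GROUP BY window_start, window_end, agent_id;
```

Each event falls into exactly one window, and a window's result is emitted once the watermark passes `window_end`. Changing the interval to `INTERVAL '15' MINUTES` would give the 15-minute shape used by the other two Kafka-sourced jobs.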
Source Files
| Job | File |
|---|---|
| Session Analytics | infrastructure/flink/jobs/session-analytics.sql |
| Agent Performance | infrastructure/flink/jobs/agent-performance-agg.sql |
| LLM Operations | infrastructure/flink/jobs/llm-operations-agg.sql |
| State Transition CDC | infrastructure/flink/jobs/state-transition-cdc.sql |
Common Configuration
The Kafka-sourced jobs share the following source configuration (the State Transition CDC job reads from PostgreSQL instead):
| Property | Value |
|---|---|
| Kafka bootstrap servers | strimzi-kafka-kafka-bootstrap.matih-data-plane.svc.cluster.local:9093 |
| Kafka port (dev) | 9092 (plaintext); override KAFKA_PORT before submitting |
| Kafka port (prod) | 9093 (TLS) |
| Scan startup mode | latest-offset |
| Message format | JSON with ISO-8601 timestamps |
| Watermark delay | 30 seconds |
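The properties above might map onto a Flink SQL source table roughly as follows. This is a minimal sketch: the table name, the columns, the consumer group id, and the use of a `${KAFKA_PORT}` placeholder inside the bootstrap address are assumptions for illustration, and the TLS properties needed for the production listener are not shown.

```sql
-- Sketch only: table name, columns, and group id are hypothetical.
CREATE TABLE agent_traces (
  agent_id   STRING,
  latency_ms BIGINT,
  event_time TIMESTAMP(3),
  -- Watermark delay: tolerate events up to 30 seconds out of order
  WATERMARK FOR event_time AS event_time - INTERVAL '30' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'matih.ai.agent-traces',
  -- Assumed placeholder, substituted before submission (9092 dev, 9093 prod)
  'properties.bootstrap.servers' = 'strimzi-kafka-kafka-bootstrap.matih-data-plane.svc.cluster.local:${KAFKA_PORT}',
  'properties.group.id' = 'agent-performance-agg',
  -- Start from the newest offsets; earlier messages are not replayed
  'scan.startup.mode' = 'latest-offset',
  'format' = 'json',
  'json.timestamp-format.standard' = 'ISO-8601'
);
```

Because the startup mode is `latest-offset`, a job started for the first time without restored state only sees events produced after it begins reading.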
Sink Configuration
All jobs write to Iceberg tables via the Polaris catalog:
INSERT INTO polaris.matih_analytics.<table_name>
SELECT ...
The Iceberg catalog is configured in the Flink SQL Gateway session properties.
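For readers who want to see what the `polaris` catalog name resolves to, an explicit declaration could look like the sketch below. The jobs themselves rely on session properties rather than this statement; the REST catalog type, URI, and warehouse name here are placeholders and assumptions, and authentication properties are omitted.

```sql
-- Sketch only: URI and warehouse are placeholders, not the real endpoint.
CREATE CATALOG polaris WITH (
  'type' = 'iceberg',
  'catalog-type' = 'rest',
  'uri' = 'http://polaris.example.internal:8181/api/catalog',
  'warehouse' = 'matih'
);

-- With the catalog registered, the sink database can be inspected.
USE CATALOG polaris;
USE matih_analytics;
SHOW TABLES;
```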
Deployment
Jobs are submitted to the Flink SQL Gateway:
# Dev: override Kafka port to plaintext
export KAFKA_PORT=9092
# Substitute variables, wrap the SQL in a JSON body (assumes jq is installed), and POST it
envsubst < infrastructure/flink/jobs/session-analytics.sql | \
jq -Rs '{statement: .}' | \
curl -X POST http://flink-sql-gateway:8083/v1/statements \
-H "Content-Type: application/json" \
-d @-
Monitoring
| Metric | Source | Description |
|---|---|---|
| Checkpoint duration | Flink metrics | Time to complete state checkpoints |
| Records in/out | Flink metrics | Throughput per operator |
| Watermark lag | Flink metrics | Delay between event time and processing time |
| Kafka consumer lag | Kafka metrics | Offset lag per consumer group |