Kafka / Strimzi
MATIH uses Strimzi to deploy and manage Apache Kafka 4.1.1 in KRaft mode (no ZooKeeper). Kafka provides the event streaming backbone for domain events, audit logs, and inter-service communication.
Cluster Configuration
# From infrastructure/k8s/kafka/kafka-cluster.yaml
apiVersion: kafka.strimzi.io/v1
kind: Kafka
metadata:
  name: strimzi-kafka
  namespace: matih-data-plane
  annotations:
    strimzi.io/kraft: "enabled"
    strimzi.io/node-pools: "enabled"
spec:
  kafka:
    version: 4.1.1
    listeners:
      - name: plain
        port: 9092
        type: internal
        tls: false
      - name: tls
        port: 9093
        type: internal
        tls: true
    config:
      # Replication factor 1 matches the single-node pool
      offsets.topic.replication.factor: 1
      transaction.state.log.replication.factor: 1
      log.retention.hours: 168        # 7 days
      log.segment.bytes: 1073741824   # 1 GiB
KafkaNodePool
apiVersion: kafka.strimzi.io/v1
kind: KafkaNodePool
metadata:
  name: combined
  namespace: matih-data-plane
  labels:
    # Assumed: not present in the original excerpt. Strimzi uses this
    # label to attach the pool to the Kafka cluster of the same name.
    strimzi.io/cluster: strimzi-kafka
spec:
  replicas: 1
  roles:
    - controller
    - broker
  storage:
    type: jbod
    volumes:
      - id: 0
        type: persistent-claim
        size: 10Gi
        deleteClaim: false
        kraftMetadata: shared
  resources:
    requests:
      memory: 512Mi
      cpu: 250m
    limits:
      memory: 1Gi
      cpu: 500m
TLS Enforcement
All MATIH services connect to Kafka over the TLS listener on port 9093. Network policies block the plaintext listener on port 9092.
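The policies themselves are not shown in this section. One mechanism Strimzi offers for restricting a listener is the `networkPolicyPeers` field, which limits the generated NetworkPolicy for that listener to the listed peers. The fragment below is a sketch under that assumption, not the MATIH configuration: the label `matih.io/kafka-plaintext` is hypothetical, and enforcement depends on the cluster's CNI plugin supporting NetworkPolicy.

```yaml
# Sketch: fragment of spec.kafka in the Kafka resource.
listeners:
  - name: plain
    port: 9092
    type: internal
    tls: false
    # Only pods in the same namespace carrying this (hypothetical) label
    # may reach port 9092. If no pod carries it, the port is unreachable.
    networkPolicyPeers:
      - podSelector:
          matchLabels:
            matih.io/kafka-plaintext: "allowed"
  - name: tls
    port: 9093
    type: internal
    tls: true
    # No networkPolicyPeers here, so Strimzi allows all peers on 9093.
```

Without `networkPolicyPeers`, Strimzi's generated policy admits connections to a listener from any peer. Because Kubernetes network policies are additive, a separate allow-only-9093 policy would not on its own close port 9092.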
# Service configuration
kafka:
  bootstrapServers: "strimzi-kafka-kafka-bootstrap.matih-data-plane.svc.cluster.local:9093"
  securityProtocol: SSL
  ssl:
    truststoreSecret: "kafka-cluster-ca-cert"
    truststoreKey: "ca.p12"
    truststorePasswordKey: "ca.password"
AI Service Topics
topics:
  - name: "matih.ai.state-changes"   # Agent state transitions
    partitions: 12
    retentionMs: 2592000000          # 30 days
  - name: "matih.ai.agent-traces"    # Agent execution traces
    partitions: 12
  - name: "matih.ai.evaluations"     # Quality evaluations
    partitions: 6
    retentionMs: 7776000000          # 90 days
  - name: "matih.ai.llm-ops"         # LLM operation metrics
    partitions: 12
    retentionMs: 604800000           # 7 days
  - name: "matih.ai.feedback"        # User feedback
    partitions: 6
Entity Operator
Strimzi deploys topic and user operators for automated topic/user management:
# Fragment of spec in the Kafka resource
entityOperator:
  topicOperator:
    resources:
      requests: { memory: 256Mi, cpu: 100m }
      limits: { memory: 512Mi, cpu: 250m }
  userOperator:
    resources:
      requests: { memory: 256Mi, cpu: 100m }
      limits: { memory: 512Mi, cpu: 250m }
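
With the Topic Operator running, topics can be declared as `KafkaTopic` resources and reconciled automatically. The sketch below declares two of the AI service topics this way. Whether MATIH manages these topics through `KafkaTopic` resources is an assumption, as are the resource names and `replicas: 1`, which is chosen to match the single-node pool.

```yaml
# Sketch: KafkaTopic resources for two of the AI service topics.
# apiVersion follows the Kafka resource above; older Strimzi releases
# serve these resources under kafka.strimzi.io/v1beta2 instead.
apiVersion: kafka.strimzi.io/v1
kind: KafkaTopic
metadata:
  name: matih-ai-state-changes          # hypothetical resource name
  namespace: matih-data-plane
  labels:
    strimzi.io/cluster: strimzi-kafka   # binds the topic to the cluster
spec:
  topicName: matih.ai.state-changes
  partitions: 12
  replicas: 1                           # assumption: single broker
  config:
    retention.ms: 2592000000            # 30 days
---
apiVersion: kafka.strimzi.io/v1
kind: KafkaTopic
metadata:
  name: matih-ai-agent-traces           # hypothetical resource name
  namespace: matih-data-plane
  labels:
    strimzi.io/cluster: strimzi-kafka
spec:
  topicName: matih.ai.agent-traces
  partitions: 12
  replicas: 1
  # No retention.ms set: the topic inherits the broker default,
  # which is log.retention.hours: 168 (7 days) in the cluster config.
```

The same fallback would apply to `matih.ai.feedback`, which also has no explicit retention in the topic list.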