MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Node Pool Strategy

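MATIH uses dedicated node pools to isolate workloads by resource requirements, security boundaries, and billing attribution. Every pool except system is tainted so that general-purpose pods cannot schedule on specialized nodes. A pod reaches a tainted pool only if it tolerates that pool's taint, and a node selector pins it there.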


Node Pool Summary

| Pool | Purpose | Taint Key | Label | Typical Instance |
|------|---------|-----------|-------|------------------|
| system | Kubernetes system components | None | agentpool=system | 4 vCPU / 16 GiB |
| ctrlplane | Control plane services | matih.ai/control-plane | agentpool=ctrlplane | 4 vCPU / 16 GiB |
| dataplane | Data plane services | matih.ai/data-plane | agentpool=dataplane | 8 vCPU / 32 GiB |
| compute | Query engines (Trino, Spark) | matih.ai/compute | agentpool=compute | 16 vCPU / 128 GiB |
| aicompute | AI/ML workloads | matih.ai/ai-compute | agentpool=aicompute | 8 vCPU / 32 GiB |
| gpu | GPU inference (vLLM, Triton) | nvidia.com/gpu | nvidia.com/gpu.present=true | GPU instance |
| playground | Free-tier sandboxed workloads | matih.io/playground | matih.io/node-purpose=playground | 2 vCPU / 8 GiB |
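
To make the table concrete, the sketch below shows roughly what a node in the dataplane pool might look like to the scheduler. The label and taint key come from the table. The node name is hypothetical, and the taint value and effect are assumptions inferred from the tolerations used in the service examples on this page, not confirmed cluster configuration.

```yaml
# Sketch of a node in the dataplane pool (node name is hypothetical).
apiVersion: v1
kind: Node
metadata:
  name: dataplane-node-0
  labels:
    agentpool: dataplane          # label from the table
spec:
  taints:
    - key: matih.ai/data-plane    # taint key from the table
      value: "true"               # assumed; mirrors the tolerations' value
      effect: NoSchedule          # assumed; mirrors the tolerations' effect
```

A pod with no toleration for matih.ai/data-plane is never scheduled onto such a node. A toleration with operator Equal must match the taint's key, value, and effect, while operator Exists matches the key regardless of value.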

Scheduling Configuration

The matih-base library chart provides scheduling helpers that services inherit:

# From infrastructure/helm/base/templates/_helpers.tpl
 
# matih.nodeSelector - adds nodeSelector based on nodepool value
# matih.tolerations - adds tolerations for the node pool taint
# matih.affinity - adds architecture and anti-affinity rules
# matih.scheduling - combines all three
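
As an illustration of how a service chart might consume these helpers, the template sketch below includes matih.scheduling in a Deployment's pod spec. This is a guess at the calling convention, not the actual chart code: it assumes the helper accepts the root context and emits nodeSelector, tolerations, and affinity keys at pod-spec level. The Deployment name, labels, and image are placeholders.

```yaml
# templates/deployment.yaml in a hypothetical service chart
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-example
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: example
  template:
    metadata:
      labels:
        app.kubernetes.io/name: example
    spec:
      # Assumption: matih.scheduling takes the root context and renders
      # nodeSelector, tolerations, and affinity for the configured pool.
      {{- include "matih.scheduling" . | nindent 6 }}
      containers:
        - name: example
          image: "registry.example.com/example:latest"  # placeholder image
```

Running helm template against the chart is a quick way to check what the helper actually renders for a given set of values.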

Data Plane Service Example

# From infrastructure/helm/matih-data-plane/values.yaml
nodeSelector:
  agentpool: dataplane
 
tolerations:
  - key: "matih.ai/data-plane"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

AI Service Example

# From infrastructure/helm/ai-service/values.yaml
nodeSelector:
  agentpool: aicompute
 
tolerations:
  - key: "matih.ai/ai-compute"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: ai-service
          topologyKey: kubernetes.io/hostname

Trino Workers on Compute Nodes

# From infrastructure/helm/trino/values.yaml
worker:
  nodeSelector:
    agentpool: compute
  tolerations:
    - key: "matih.ai/compute"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                    - worker
            topologyKey: kubernetes.io/hostname

GPU Node Configuration

GPU nodes require the NVIDIA device plugin, and pods targeting them need a toleration for the nvidia.com/gpu taint:

# GPU scheduling via base chart helpers
multiArch:
  gpu:
    enabled: true
    type: "nvidia"
 
# Rendered nodeSelector
nodeSelector:
  nvidia.com/gpu.present: "true"
 
# Rendered tolerations
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"

GPU resource requests in the pod spec:

resources:
  limits:
    nvidia.com/gpu: 1
    cpu: 2000m
    memory: 4Gi
  requests:
    cpu: 500m
    memory: 1Gi
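
Putting the rendered scheduling fields and the resource block together, a complete GPU pod might look like the sketch below. The pod name, container name, and image are placeholders. The scheduling and resource values are copied from the snippets above.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-inference-example     # hypothetical name
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"          # tolerate the taint regardless of value
      effect: "NoSchedule"
  containers:
    - name: inference
      image: "registry.example.com/inference:latest"  # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1       # GPUs are set under limits
          cpu: 2000m
          memory: 4Gi
        requests:
          cpu: 500m
          memory: 1Gi
```

The GPU count appears only under limits. For extended resources such as nvidia.com/gpu, Kubernetes uses the limit as the request when no request is given.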

Autoscaling Policies

Each node pool has autoscaling bounds defined in Terraform:

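| Pool | Min Nodes | Max Nodes | Scale-Down Delay |
|------|-----------|-----------|------------------|
| system | 3 | 3 | N/A (fixed) |
| ctrlplane | 2 | 5 | 10 minutes |
| dataplane | 2 | 10 | 10 minutes |
| compute | 2 | 10 | 15 minutes |
| aicompute | 1 | 8 | 15 minutes |
| gpu | 0 | 4 | 30 minutes |
| playground | 1 | 3 | 5 minutes |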

GPU nodes use the longest scale-down delay (30 minutes) to avoid expensive cold starts.


Playground Nodes

The playground node pool serves the free-tier sandbox environment with restricted resources:

# From infrastructure/helm/base/templates/_helpers.tpl
# matih.playgroundNodeSelector
nodeSelector:
  matih.io/node-purpose: "playground"
 
# matih.playgroundTolerations
tolerations:
  - key: "matih.io/playground"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  - key: "matih.io/purpose"
    operator: "Equal"
    value: "playground"
    effect: "NoSchedule"
 
# matih.playgroundResources
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "2"
    memory: "4Gi"