Node Pool Strategy
MATIH uses dedicated node pools to isolate workloads by resource requirements, security boundaries, and billing attribution. Every pool except system carries a taint that prevents general-purpose pods from scheduling onto its specialized nodes; workloads opt in with a matching toleration and nodeSelector.
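A taint repels every pod that does not explicitly tolerate it. As an illustrative sketch (not taken from the actual infrastructure code), the dataplane pool's taint would appear in each node's spec like this:

```yaml
# Node spec excerpt (illustrative): pods lacking a matching
# toleration will not be scheduled onto this node.
spec:
  taints:
    - key: matih.ai/data-plane
      value: "true"
      effect: NoSchedule
```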
Node Pool Summary
| Pool | Purpose | Taint Key | Label | Typical Instance |
|---|---|---|---|---|
| system | Kubernetes system components | None | agentpool=system | 4 vCPU / 16 GiB |
| ctrlplane | Control plane services | matih.ai/control-plane | agentpool=ctrlplane | 4 vCPU / 16 GiB |
| dataplane | Data plane services | matih.ai/data-plane | agentpool=dataplane | 8 vCPU / 32 GiB |
| compute | Query engines (Trino, Spark) | matih.ai/compute | agentpool=compute | 16 vCPU / 128 GiB |
| aicompute | AI/ML workloads | matih.ai/ai-compute | agentpool=aicompute | 8 vCPU / 32 GiB |
| gpu | GPU inference (vLLM, Triton) | nvidia.com/gpu | nvidia.com/gpu.present=true | GPU instance |
| playground | Free-tier sandboxed workloads | matih.io/playground | matih.io/node-purpose=playground | 2 vCPU / 8 GiB |
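To show how a pool's taint and label travel together at provisioning time, here is a hedged sketch of what the compute pool's definition might look like in Terraform, assuming the azurerm provider (consistent with the AKS-style agentpool label); the resource names and VM SKU are illustrative, not the actual module:

```hcl
resource "azurerm_kubernetes_cluster_node_pool" "compute" {
  name                  = "compute"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.matih.id # hypothetical cluster resource
  vm_size               = "Standard_E16s_v5"                  # 16 vCPU / 128 GiB class (illustrative SKU)

  # The label that matih.nodeSelector targets and the taint
  # that matih.tolerations matches.
  node_labels = { "agentpool" = "compute" }
  node_taints = ["matih.ai/compute=true:NoSchedule"]
}
```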
Scheduling Configuration
The matih-base library chart provides scheduling helpers that services inherit:
```yaml
# From infrastructure/helm/base/templates/_helpers.tpl
# matih.nodeSelector - adds nodeSelector based on nodepool value
# matih.tolerations - adds tolerations for the node pool taint
# matih.affinity - adds architecture and anti-affinity rules
# matih.scheduling - combines all three
```
Data Plane Service Example
```yaml
# From infrastructure/helm/matih-data-plane/values.yaml
nodeSelector:
  agentpool: dataplane
tolerations:
  - key: "matih.ai/data-plane"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```
AI Service Example
```yaml
# From infrastructure/helm/ai-service/values.yaml
nodeSelector:
  agentpool: aicompute
tolerations:
  - key: "matih.ai/ai-compute"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: ai-service
          topologyKey: kubernetes.io/hostname
```
Trino Workers on Compute Nodes
```yaml
# From infrastructure/helm/trino/values.yaml
worker:
  nodeSelector:
    agentpool: compute
  tolerations:
    - key: "matih.ai/compute"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/component
                  operator: In
                  values:
                    - worker
            topologyKey: kubernetes.io/hostname
```
GPU Node Configuration
GPU nodes require the NVIDIA device plugin and specific tolerations:
```yaml
# GPU scheduling via base chart helpers
multiArch:
  gpu:
    enabled: true
    type: "nvidia"

# Rendered nodeSelector
nodeSelector:
  nvidia.com/gpu.present: "true"

# Rendered tolerations
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```
GPU resource requests in the pod spec:
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
    cpu: 2000m
    memory: 4Gi
  requests:
    cpu: 500m
    memory: 1Gi
```
Autoscaling Policies
Each node pool has autoscaling bounds defined in Terraform:
| Pool | Min Nodes | Max Nodes | Scale-Down Delay |
|---|---|---|---|
| system | 3 | 3 | N/A (fixed) |
| ctrlplane | 2 | 5 | 10 minutes |
| dataplane | 2 | 10 | 10 minutes |
| compute | 2 | 10 | 15 minutes |
| aicompute | 1 | 8 | 15 minutes |
| gpu | 0 | 4 | 30 minutes |
| playground | 1 | 3 | 5 minutes |
GPU nodes use a longer scale-down delay to avoid expensive cold starts.
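A hedged sketch of how the GPU pool's bounds might be declared, again assuming azurerm-style attributes (resource names and the GPU SKU are illustrative; the scale-down delay itself typically lives in the cluster autoscaler profile rather than on the individual pool):

```hcl
resource "azurerm_kubernetes_cluster_node_pool" "gpu" {
  name                  = "gpu"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.matih.id # hypothetical
  vm_size               = "Standard_NC24ads_A100_v4"          # illustrative GPU SKU

  enable_auto_scaling = true
  min_count           = 0 # scale to zero: GPU nodes exist only while GPU pods are pending
  max_count           = 4

  node_taints = ["nvidia.com/gpu=true:NoSchedule"]
}
```

Because the toleration on GPU workloads uses `operator: "Exists"`, any value on the `nvidia.com/gpu` taint key matches.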
Playground Nodes
The playground node pool serves the free-tier sandbox environment with restricted resources:
```yaml
# From infrastructure/helm/base/templates/_helpers.tpl

# matih.playgroundNodeSelector
nodeSelector:
  matih.io/node-purpose: "playground"

# matih.playgroundTolerations
tolerations:
  - key: "matih.io/playground"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  - key: "matih.io/purpose"
    operator: "Equal"
    value: "playground"
    effect: "NoSchedule"

# matih.playgroundResources
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "2"
    memory: "4Gi"
```
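A service targeting the playground pool would pull these helpers into its pod template. The following Deployment excerpt is a sketch only: the `include` names come from the helper comments above, but the indentation values and container details are illustrative assumptions, not the actual chart:

```yaml
# deployment.yaml excerpt (sketch; helper names from the matih-base comments above)
spec:
  template:
    spec:
      {{- include "matih.playgroundNodeSelector" . | nindent 6 }}
      {{- include "matih.playgroundTolerations" . | nindent 6 }}
      containers:
        - name: sandbox
          image: matih/sandbox:latest # illustrative image
          {{- include "matih.playgroundResources" . | nindent 10 }}
```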