MATIH Platform is in active MVP development. Documentation reflects current implementation status.
17. Kubernetes & Helm
Cluster Setup

Cluster Architecture Overview

The MATIH platform runs on managed Kubernetes across three cloud providers: Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS), and Google Kubernetes Engine (GKE). This section covers the cluster-level architecture, including node pool design, networking, identity federation, and cloud-specific configuration.


Multi-Cloud Strategy

MATIH adopts a cloud-agnostic approach in which identical Helm charts deploy to any supported provider. Cloud-specific concerns are isolated to the following layers:

Layer                        Abstraction
---------------------------  ------------------------------------------------------------------------
Infrastructure provisioning  Terraform modules per provider
Identity and auth            Workload Identity (AKS), IRSA (EKS), Workload Identity Federation (GKE)
Secret management            External Secrets Operator with provider-specific ClusterSecretStores
Storage classes              Provider-native CSI drivers mapped to common storage class names
Load balancers               Provider-native LB with NGINX Ingress Controller
DNS and TLS                  cert-manager with DNS01 challenge per provider
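As an illustration of the secret-management layer, a ClusterSecretStore for the External Secrets Operator on AKS might look like the following sketch. The vault URL and service account names are hypothetical placeholders, not the platform's actual values; the real configuration lives in the per-provider Terraform modules.

```yaml
# Illustrative sketch: ClusterSecretStore backed by Azure Key Vault on AKS.
# The vaultUrl and serviceAccountRef names below are placeholders.
apiVersion: external-secrets.io/v1beta1
kind: ClusterSecretStore
metadata:
  name: azure-keyvault
spec:
  provider:
    azurekv:
      authType: WorkloadIdentity        # authenticates via AKS Workload Identity federation
      vaultUrl: "https://example-vault.vault.azure.net"
      serviceAccountRef:
        name: external-secrets
        namespace: external-secrets
```

On EKS the same pattern would use the `aws` provider authenticated via IRSA, and on GKE the `gcpsm` provider via Workload Identity Federation, keeping application-facing ExternalSecret resources identical across clouds.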

Node Pool Architecture

The cluster uses purpose-specific node pools to isolate workloads by resource profile and criticality:

+--------------------------------------------------------------------+
|                         Kubernetes Cluster                         |
|                                                                    |
|  +-----------------+  +------------------+  +-------------------+  |
|  | ctrlplane       |  | dataplane        |  | compute           |  |
|  | (D4s_v3/m5)     |  | (D8s_v3/m5.2xl)  |  | (E16s_v3/r5.4xl)  |  |
|  | Control Plane   |  | Data Plane       |  | Query Engines     |  |
|  | Services        |  | Services         |  | Trino, Spark      |  |
|  +-----------------+  +------------------+  +-------------------+  |
|                                                                    |
|  +-----------------+  +------------------+  +-------------------+  |
|  | aicompute       |  | gpu              |  | playground        |  |
|  | (D8s_v3/m5.2xl) |  | (NC6s_v3/p3.2xl) |  | (D2s_v3/t3)       |  |
|  | AI Service      |  | vLLM, Triton     |  | Free Tier         |  |
|  | ML Workloads    |  | GPU Inference    |  | Sandboxed         |  |
|  +-----------------+  +------------------+  +-------------------+  |
+--------------------------------------------------------------------+

Each node pool uses taints and labels for workload scheduling:

# Example: Data Plane node pool
nodeSelector:
  agentpool: dataplane
tolerations:
  - key: "matih.ai/data-plane"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
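In context, a workload targets a pool by combining the matching nodeSelector and toleration in its pod template. A minimal sketch of a Deployment scheduled onto the dataplane pool (the Deployment name and image are illustrative placeholders, not part of the platform):

```yaml
# Illustrative sketch: pinning a Deployment to the dataplane node pool.
# The name and image are placeholders; the selector and toleration match
# the pool's label and taint shown above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-dataplane-service
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-dataplane-service
  template:
    metadata:
      labels:
        app: example-dataplane-service
    spec:
      nodeSelector:
        agentpool: dataplane            # label selects nodes in the dataplane pool
      tolerations:
        - key: "matih.ai/data-plane"    # tolerates the pool's NoSchedule taint
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      containers:
        - name: app
          image: example/app:latest     # placeholder image
```

The taint keeps untolerating workloads off the pool, while the nodeSelector keeps this workload from drifting onto other pools; both are needed for strict isolation.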

Cluster Requirements

Requirement         Minimum (Dev)  Production
------------------  -------------  ----------------------
Kubernetes version  1.28+          1.29+
Total nodes         3              12-50+
Total vCPUs         16             128+
Total memory        64 GiB         512+ GiB
GPU nodes           0              2+ (NVIDIA A100/H100)
Storage             200 GiB SSD    2+ TiB SSD
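Workloads destined for the GPU nodes also request GPU capacity through the NVIDIA device plugin's extended resource. A hedged container-spec fragment, assuming the device plugin is installed and exposes `nvidia.com/gpu` (the taint key is a placeholder following the platform's naming pattern):

```yaml
# Illustrative fragment: requesting one GPU on the gpu node pool.
# Assumes the NVIDIA device plugin exposes the nvidia.com/gpu resource;
# the toleration key is a hypothetical example.
resources:
  limits:
    nvidia.com/gpu: 1      # GPUs are requested via limits; no fractional GPUs
nodeSelector:
  agentpool: gpu
tolerations:
  - key: "matih.ai/gpu"    # placeholder taint key for the gpu pool
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```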

Section Contents

Page        Description
----------  ------------------------------------------------------------------------
Azure AKS   AKS cluster configuration, node pools, Azure CNI, and Workload Identity
AWS EKS     EKS cluster configuration, managed node groups, VPC CNI, and IRSA
Google GKE  GKE cluster configuration, node pools, and Workload Identity Federation
Node Pools  Node pool strategy, taints, labels, and autoscaling policies
Networking  CNI configuration, service mesh, DNS, and load balancing