Cluster Architecture Overview
The MATIH platform runs on managed Kubernetes across three cloud providers: Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS), and Google Kubernetes Engine (GKE). This section covers the cluster-level architecture, including node pool design, networking, identity federation, and cloud-specific configuration.
Multi-Cloud Strategy
MATIH takes a cloud-agnostic approach: the same Helm charts deploy to any supported provider, and cloud-specific concerns are isolated to the following layers:
| Layer | Abstraction |
|---|---|
| Infrastructure provisioning | Terraform modules per provider |
| Identity and auth | Workload Identity (AKS), IRSA (EKS), Workload Identity Federation (GKE) |
| Secret management | External Secrets Operator with provider-specific ClusterSecretStores |
| Storage classes | Provider-native CSI drivers mapped to common storage class names |
| Load balancers | Provider-native load balancer with NGINX Ingress Controller |
| DNS and TLS | cert-manager with DNS01 challenge per provider |
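
As an illustration of the storage class row above, the sketch below maps one common class name onto each provider's CSI driver. The class name `matih-ssd` and the chosen disk tiers are assumptions made for this example, not values taken from the MATIH charts. The three manifests are alternatives: only the one matching the cluster's provider would be applied.

```yaml
# Hypothetical common storage class name: matih-ssd.
# Apply exactly one of these manifests per cluster, chosen by provider.
# Workloads reference the same class name everywhere; only the
# provisioner and its parameters differ.

# Azure AKS: Azure Disk CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: matih-ssd
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS          # assumed tier
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# AWS EKS: Amazon EBS CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: matih-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3                     # assumed tier
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# Google GKE: Compute Engine Persistent Disk CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: matih-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd                  # assumed tier
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

With a mapping like this, a PersistentVolumeClaim in a shared Helm chart can set `storageClassName: matih-ssd` without knowing which cloud it runs on.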
Node Pool Architecture
The cluster uses purpose-specific node pools to isolate workloads by resource profile and criticality:

    +------------------------------------------------------------------+
    |                        Kubernetes Cluster                        |
    |                                                                  |
    |  +----------------+  +------------------+  +------------------+  |
    |  | ctrlplane      |  | dataplane        |  | compute          |  |
    |  | (D4s_v3/m5)    |  | (D8s_v3/m5.2xl)  |  | (E16s_v3/r5.4xl) |  |
    |  | Control Plane  |  | Data Plane       |  | Query Engines    |  |
    |  | Services       |  | Services         |  | Trino, Spark     |  |
    |  +----------------+  +------------------+  +------------------+  |
    |                                                                  |
    |  +----------------+  +------------------+  +------------------+  |
    |  | aicompute      |  | gpu              |  | playground       |  |
    |  | (D8s_v3/m5.2xl)|  | (NC6s_v3/p3.2xl) |  | (D2s_v3/t3)      |  |
    |  | AI Service     |  | vLLM, Triton     |  | Free Tier        |  |
    |  | ML Workloads   |  | GPU Inference    |  | Sandboxed        |  |
    |  +----------------+  +------------------+  +------------------+  |
    +------------------------------------------------------------------+

Each node pool uses taints and labels for workload scheduling:

    # Example: Data Plane node pool
    nodeSelector:
      agentpool: dataplane
    tolerations:
      - key: "matih.ai/data-plane"
        operator: "Equal"
        value: "true"
        effect: "NoSchedule"

Cluster Requirements
| Requirement | Minimum (Dev) | Production |
|---|---|---|
| Kubernetes version | 1.28+ | 1.29+ |
| Total nodes | 3 | 12-50+ |
| Total vCPUs | 16 | 128+ |
| Total memory | 64 GiB | 512+ GiB |
| GPU nodes | 0 | 2+ (NVIDIA A100/H100) |
| Storage | 200 GiB SSD | 2+ TiB SSD |
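
To show how the GPU row relates to the `gpu` node pool, here is a minimal sketch of a Pod that requests one GPU and targets that pool. It assumes the NVIDIA device plugin is installed, which is what exposes the `nvidia.com/gpu` resource. The taint key `matih.ai/gpu` and the pod name are hypothetical, modeled on the `matih.ai/data-plane` pattern. The `agentpool: gpu` selector is likewise assumed to follow the data plane example, and the image is only illustrative.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test              # hypothetical name
spec:
  restartPolicy: Never
  nodeSelector:
    agentpool: gpu                  # assumed label, following agentpool: dataplane
  tolerations:
    - key: "matih.ai/gpu"           # hypothetical taint key
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative CUDA base image
      command: ["nvidia-smi"]       # prints the GPUs visible to the container
      resources:
        limits:
          nvidia.com/gpu: 1         # requires the NVIDIA device plugin
```

If such a Pod stays in `Pending`, the usual causes are a cluster with no GPU nodes (the Dev minimum is 0) or a taint key that does not match the pool's actual taint.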
Section Contents
| Page | Description |
|---|---|
| Azure AKS | AKS cluster configuration, node pools, Azure CNI, and Workload Identity |
| AWS EKS | EKS cluster configuration, managed node groups, VPC CNI, and IRSA |
| Google GKE | GKE cluster configuration, node pools, and Workload Identity Federation |
| Node Pools | Node pool strategy, taints, labels, and autoscaling policies |
| Networking | CNI configuration, service mesh, DNS, and load balancing |