Cluster Architecture Overview

The MATIH platform runs on managed Kubernetes across three cloud providers: Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS), and Google Kubernetes Engine (GKE). This section covers the cluster-level architecture, including node pool design, networking, identity federation, and cloud-specific configuration.

Multi-Cloud Strategy

MATIH takes a cloud-agnostic approach: identical Helm charts deploy to any supported provider. Cloud-specific concerns are isolated behind the following abstractions:

Layer	Abstraction
Infrastructure provisioning	Terraform modules per provider
Identity and auth	Workload Identity (AKS), IRSA (EKS), Workload Identity Federation (GKE)
Secret management	External Secrets Operator with provider-specific ClusterSecretStores
Storage classes	Provider-native CSI drivers mapped to common storage class names
Load balancers	Provider-native LB with NGINX Ingress Controller
DNS and TLS	cert-manager with a DNS-01 challenge solver per provider
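
As an illustration of the storage class row above, the sketch below defines the same storage class name on each provider, backed by that provider's CSI driver. The name `matih-ssd` and the chosen disk tiers are assumptions made for this example, not values taken from the MATIH charts. Only the manifest matching the cluster's provider would be applied.

```yaml
# Hypothetical common storage class name: matih-ssd (assumed for illustration).
# Apply exactly one of the three manifests, depending on the provider.
---
# AKS: Azure Disk CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: matih-ssd
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS        # assumed tier; pick per workload
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# EKS: Amazon EBS CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: matih-ssd
provisioner: ebs.csi.aws.com
parameters:
  type: gp3                   # assumed volume type
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
# GKE: Compute Engine Persistent Disk CSI driver
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: matih-ssd
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd                # assumed disk type
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```

With a shared name in place, a chart can request `storageClassName: matih-ssd` in its PersistentVolumeClaims without knowing which provider it runs on.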

Node Pool Architecture

The cluster uses purpose-specific node pools to isolate workloads by resource profile and criticality. Sizes in parentheses are given as Azure VM size / AWS instance type or family:

+------------------------------------------------------------------+
|                    Kubernetes Cluster                              |
|                                                                    |
|  +----------------+  +------------------+  +------------------+   |
|  | ctrlplane      |  | dataplane        |  | compute          |   |
|  | (D4s_v3/m5)    |  | (D8s_v3/m5.2xl)  |  | (E16s_v3/r5.4xl) |   |
|  | Control Plane  |  | Data Plane       |  | Query Engines    |   |
|  | Services       |  | Services         |  | Trino, Spark     |   |
|  +----------------+  +------------------+  +------------------+   |
|                                                                    |
|  +----------------+  +------------------+  +------------------+   |
|  | aicompute      |  | gpu              |  | playground       |   |
|  | (D8s_v3/m5.2xl)|  | (NC6s_v3/p3.2xl) |  | (D2s_v3/t3)      |   |
|  | AI Service     |  | vLLM, Triton     |  | Free Tier        |   |
|  | ML Workloads   |  | GPU Inference    |  | Sandboxed        |   |
|  +----------------+  +------------------+  +------------------+   |
+------------------------------------------------------------------+

Each node pool is labeled and tainted so that only workloads that select the pool and tolerate its taint are scheduled onto it:

# Example: pod spec fragment targeting the Data Plane node pool
nodeSelector:
  agentpool: dataplane
tolerations:
  - key: "matih.ai/data-plane"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"

Cluster Requirements

Requirement	Minimum (Dev)	Production
Kubernetes version	1.28+	1.29+
Total nodes	3	12-50+
Total vCPUs	16	128+
Total memory	64 GiB	512+ GiB
GPU nodes	0	2+ (NVIDIA A100/H100)
Storage	200 GiB SSD	2+ TiB SSD

Section Contents

Page	Description
Azure AKS	AKS cluster configuration, node pools, Azure CNI, and Workload Identity
AWS EKS	EKS cluster configuration, managed node groups, VPC CNI, and IRSA
Google GKE	GKE cluster configuration, node pools, and Workload Identity Federation
Node Pools	Node pool strategy, taints, labels, and autoscaling policies
Networking	CNI configuration, service mesh, DNS, and load balancing
