Cluster Architecture
The MATIH platform is designed to run on managed Kubernetes services from all three major cloud providers: Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS), and Google Kubernetes Engine (GKE). This section covers the cluster topology, node pool design, networking configuration, identity integration, and storage provisioning for each provider.
Cluster Topology Overview
Every MATIH cluster follows a standard topology regardless of the underlying cloud provider:
+---------------------------------------------------------------+
| Managed Kubernetes Cluster |
| |
| Control Plane (Managed by Cloud Provider) |
| +----------------------------------------------------------+ |
| | API Server | etcd | Scheduler | Controller Manager | |
| +----------------------------------------------------------+ |
| |
| Node Pools |
| +------------------+ +------------------+ +----------------+ |
| | system | | general | | aicompute | |
| | (3 nodes) | | (3-10 nodes) | | (2-8 nodes) | |
| | CriticalAddons | | App workloads | | AI/ML/GPU | |
| | Taints | | Control + Data | | Taints | |
| +------------------+ +------------------+ +----------------+ |
| |
| +------------------+ +------------------+ |
| | data | | monitoring | |
| | (3-6 nodes) | | (2-3 nodes) | |
| | StatefulSets | | Prometheus, Loki | |
| | Taints | | Grafana | |
| +------------------+ +------------------+ |
+---------------------------------------------------------------+
Node Pool Design
MATIH uses dedicated node pools to isolate different workload types and optimize resource allocation. Each node pool has a specific purpose, instance type, and scaling configuration.
Node Pool Specifications
| Node Pool | Purpose | Min/Max Nodes | Instance Type (AKS) | Instance Type (EKS) | Instance Type (GKE) | Taints |
|---|---|---|---|---|---|---|
| system | Cluster addons, kube-system | 3 / 3 | Standard_D4s_v5 | m6i.xlarge | e2-standard-4 | CriticalAddonsOnly=true:NoSchedule |
| general | Control plane + data plane services | 3 / 10 | Standard_D8s_v5 | m6i.2xlarge | e2-standard-8 | None |
| aicompute | AI/ML inference and training | 2 / 8 | Standard_NC6s_v3 | p3.2xlarge | n1-standard-8 (+ 1× T4 GPU) | matih.ai/ai-compute=true:NoSchedule |
| data | StatefulSet workloads (databases) | 3 / 6 | Standard_E8s_v5 | r6i.2xlarge | n2-highmem-8 | matih.ai/data=true:NoSchedule |
| monitoring | Observability stack | 2 / 3 | Standard_D4s_v5 | m6i.xlarge | e2-standard-4 | matih.ai/monitoring=true:NoSchedule |
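Because the non-general pools are tainted, a pod must carry both a nodeSelector for the pool's label and a matching toleration to land there. A minimal sketch for the aicompute pool, using the labels and taints from the table above (the workload name and replica count are hypothetical):

```yaml
# Sketch: scheduling a GPU workload onto the tainted aicompute pool
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-worker        # hypothetical workload name
  namespace: matih-data-plane
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference-worker
  template:
    metadata:
      labels:
        app: inference-worker
    spec:
      nodeSelector:
        agentpool: aicompute    # label carried by the AI compute pool
      tolerations:
        - key: matih.ai/ai-compute   # matches the pool's NoSchedule taint
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: worker
          image: matihlabsacr.azurecr.io/matih/ai-service:1.0.0-abc1234
          resources:
            limits:
              nvidia.com/gpu: 1  # request one GPU per replica
```

Without the toleration the scheduler rejects the pod for every aicompute node; without the nodeSelector it may drift onto the untainted general pool.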
Node Labels
Each node pool carries labels that enable workload scheduling via nodeSelector and affinity rules:
# System node pool labels
node.kubernetes.io/role: system
agentpool: system
# General node pool labels
node.kubernetes.io/role: general
agentpool: general
# AI compute node pool labels
node.kubernetes.io/role: aicompute
agentpool: aicompute
matih.ai/workload-type: ai-inference
nvidia.com/gpu.present: "true" # On GPU-enabled pools
# Data node pool labels
node.kubernetes.io/role: data
agentpool: data
matih.ai/workload-type: stateful
# Monitoring node pool labels
node.kubernetes.io/role: monitoring
agentpool: monitoring
matih.ai/workload-type: observability
Azure Kubernetes Service (AKS)
AKS is the primary deployment target for MATIH, used in the azure-matihlabs environment.
AKS Cluster Configuration
The AKS cluster is provisioned via Terraform using the infrastructure/terraform/modules/azure/kubernetes module:
# infrastructure/terraform/modules/azure/kubernetes/main.tf (simplified)
resource "azurerm_kubernetes_cluster" "main" {
name = "matih-${var.environment}-aks"
location = var.location
resource_group_name = var.resource_group_name
dns_prefix = "matih-${var.environment}"
kubernetes_version = var.kubernetes_version
# Managed identity for Azure integrations
identity {
type = "SystemAssigned"
}
# API server access profile
api_server_access_profile {
authorized_ip_ranges = var.authorized_ip_ranges
}
# Azure AD integration for RBAC
azure_active_directory_role_based_access_control {
managed = true
admin_group_object_ids = var.admin_group_ids
azure_rbac_enabled = true
}
# Network profile
network_profile {
network_plugin = "azure"
network_policy = "calico"
load_balancer_sku = "standard"
service_cidr = var.service_cidr
dns_service_ip = var.dns_service_ip
}
# Azure Monitor integration
oms_agent {
log_analytics_workspace_id = var.log_analytics_workspace_id
}
# Maintenance window
maintenance_window {
allowed {
day = "Saturday"
hours = [2, 3, 4, 5] # individual hour slots covering the 02:00-06:00 UTC window
}
}
tags = var.tags
}
AKS Node Pools
# System node pool (default)
resource "azurerm_kubernetes_cluster_node_pool" "system" {
name = "system"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = "Standard_D4s_v5"
node_count = 3
mode = "System"
os_disk_size_gb = 128
os_disk_type = "Managed"
os_sku = "AzureLinux"
node_labels = {
"agentpool" = "system"
"node.kubernetes.io/role" = "system"
}
node_taints = ["CriticalAddonsOnly=true:NoSchedule"]
upgrade_settings {
max_surge = "33%"
}
}
# General workload node pool
resource "azurerm_kubernetes_cluster_node_pool" "general" {
name = "general"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = "Standard_D8s_v5"
min_count = 3
max_count = 10
enable_auto_scaling = true
mode = "User"
os_disk_size_gb = 256
os_disk_type = "Managed"
os_sku = "AzureLinux"
node_labels = {
"agentpool" = "general"
"node.kubernetes.io/role" = "general"
}
zones = [1, 2, 3]
upgrade_settings {
max_surge = "33%"
}
}
# AI compute node pool with GPU
resource "azurerm_kubernetes_cluster_node_pool" "aicompute" {
name = "aicompute"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = "Standard_NC6s_v3"
min_count = 2
max_count = 8
enable_auto_scaling = true
mode = "User"
os_disk_size_gb = 512
os_disk_type = "Managed"
node_labels = {
"agentpool" = "aicompute"
"node.kubernetes.io/role" = "aicompute"
"matih.ai/workload-type" = "ai-inference"
}
node_taints = ["matih.ai/ai-compute=true:NoSchedule"]
zones = [1, 2]
upgrade_settings {
max_surge = "33%"
}
}
# Data node pool for stateful workloads
resource "azurerm_kubernetes_cluster_node_pool" "data" {
name = "data"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = "Standard_E8s_v5"
min_count = 3
max_count = 6
enable_auto_scaling = true
mode = "User"
os_disk_size_gb = 512
os_disk_type = "Managed"
node_labels = {
"agentpool" = "data"
"node.kubernetes.io/role" = "data"
"matih.ai/workload-type" = "stateful"
}
node_taints = ["matih.ai/data=true:NoSchedule"]
zones = [1, 2, 3]
}
AKS Networking
MATIH uses Azure CNI networking for full VNET integration:
+------------------------------------------------------------------+
| Azure Virtual Network (10.0.0.0/8) |
| |
| +----------------------------+ |
| | AKS Subnet (10.1.0.0/16) | |
| | - Pod IPs: 10.1.x.x | |
| | - Node IPs: 10.1.0.x | |
| +----------------------------+ |
| |
| +----------------------------+ |
| | Service CIDR (10.2.0.0/16) | |
| | - ClusterIP Services | |
| | - DNS: 10.2.0.10 | |
| +----------------------------+ |
| |
| +----------------------------+ +----------------------------+ |
| | PostgreSQL Subnet | | Redis Subnet | |
| | (10.3.0.0/24) | | (10.3.1.0/24) | |
| | - Azure DB for PostgreSQL | | - Azure Cache for Redis | |
| +----------------------------+ +----------------------------+ |
+------------------------------------------------------------------+
Key networking decisions:
| Setting | Value | Rationale |
|---|---|---|
| Network plugin | Azure CNI | Full VNET integration, no overlay network |
| Network policy | Calico | Rich policy support with namespace selectors |
| Load balancer SKU | Standard | Required for availability zones |
| Service CIDR | 10.2.0.0/16 | Non-overlapping with VNET |
| DNS service IP | 10.2.0.10 | Within service CIDR |
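With Calico as the policy engine, standard Kubernetes NetworkPolicy resources are enforced. A minimal sketch that restricts a namespace to ingress from control plane services only (the namespace names and labels here are illustrative assumptions, not confirmed MATIH policy):

```yaml
# Sketch: default-deny ingress except from the control plane namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-control-plane   # hypothetical policy name
  namespace: matih-data-plane
spec:
  podSelector: {}                  # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # assumes the standard auto-generated namespace label
              kubernetes.io/metadata.name: matih-control-plane
```

Selecting all pods with an Ingress policyType makes any traffic not matched by a rule drop by default, which is the usual starting point for namespace isolation.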
AKS Identity Integration
MATIH uses Azure Workload Identity for pod-to-Azure service authentication:
# Service account with workload identity annotation
apiVersion: v1
kind: ServiceAccount
metadata:
name: ai-service
namespace: matih-data-plane
annotations:
azure.workload.identity/client-id: "<managed-identity-client-id>"
labels:
azure.workload.identity/use: "true"
The Terraform module creates managed identities and federated credentials:
resource "azurerm_user_assigned_identity" "ai_service" {
name = "matih-${var.environment}-ai-service"
resource_group_name = var.resource_group_name
location = var.location
}
resource "azurerm_federated_identity_credential" "ai_service" {
name = "ai-service-federated"
resource_group_name = var.resource_group_name
parent_id = azurerm_user_assigned_identity.ai_service.id
audience = ["api://AzureADTokenExchange"]
issuer = azurerm_kubernetes_cluster.main.oidc_issuer_url
subject = "system:serviceaccount:matih-data-plane:ai-service"
}
Amazon Elastic Kubernetes Service (EKS)
EKS is supported for AWS deployments through the aws-dev and aws-prod environments.
EKS Cluster Configuration
# infrastructure/terraform/modules/aws/eks/main.tf (simplified)
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 19.0"
cluster_name = "matih-${var.environment}"
cluster_version = var.kubernetes_version
vpc_id = var.vpc_id
subnet_ids = var.private_subnet_ids
cluster_endpoint_private_access = true
cluster_endpoint_public_access = true
cluster_endpoint_public_access_cidrs = var.authorized_cidrs
# EKS managed node groups
eks_managed_node_groups = {
system = {
name = "system"
instance_types = ["m6i.xlarge"]
min_size = 3
max_size = 3
desired_size = 3
labels = {
"agentpool" = "system"
"node.kubernetes.io/role" = "system"
}
taints = {
critical = {
key = "CriticalAddonsOnly"
value = "true"
effect = "NO_SCHEDULE"
}
}
}
general = {
name = "general"
instance_types = ["m6i.2xlarge"]
min_size = 3
max_size = 10
desired_size = 3
labels = {
"agentpool" = "general"
"node.kubernetes.io/role" = "general"
}
}
aicompute = {
name = "aicompute"
instance_types = ["p3.2xlarge"]
min_size = 2
max_size = 8
desired_size = 2
ami_type = "AL2_x86_64_GPU"
labels = {
"agentpool" = "aicompute"
"node.kubernetes.io/role" = "aicompute"
"matih.ai/workload-type" = "ai-inference"
}
taints = {
ai_compute = {
key = "matih.ai/ai-compute"
value = "true"
effect = "NO_SCHEDULE"
}
}
}
data = {
name = "data"
instance_types = ["r6i.2xlarge"]
min_size = 3
max_size = 6
desired_size = 3
labels = {
"agentpool" = "data"
"node.kubernetes.io/role" = "data"
"matih.ai/workload-type" = "stateful"
}
taints = {
data = {
key = "matih.ai/data"
value = "true"
effect = "NO_SCHEDULE"
}
}
}
}
# IRSA for pod-level IAM
enable_irsa = true
tags = var.tags
}
EKS Networking
EKS uses the VPC CNI plugin for native VPC networking:
| Setting | Value | Rationale |
|---|---|---|
| CNI Plugin | amazon-vpc-cni-k8s | Native VPC IP addresses for pods |
| Pod networking | VPC secondary CIDR | Expanded IP space for pods |
| Service networking | In-cluster | Standard kube-proxy/iptables |
| Network policy | Calico (add-on) | Installed as EKS add-on |
EKS Identity: IAM Roles for Service Accounts (IRSA)
# IRSA role for ai-service to access S3, Bedrock, Secrets Manager
module "ai_service_irsa" {
source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
role_name = "matih-${var.environment}-ai-service"
oidc_providers = {
main = {
provider_arn = module.eks.oidc_provider_arn
namespace_service_accounts = ["matih-data-plane:ai-service"]
}
}
role_policy_arns = {
s3 = aws_iam_policy.ai_service_s3.arn
bedrock = aws_iam_policy.ai_service_bedrock.arn
secrets = aws_iam_policy.ai_service_secrets.arn
}
}
Google Kubernetes Engine (GKE)
GKE is supported for GCP deployments through the gcp-dev and gcp-prod environments.
GKE Cluster Configuration
# infrastructure/terraform/modules/gcp/gke/main.tf (simplified)
resource "google_container_cluster" "main" {
name = "matih-${var.environment}"
location = var.region
# Use regional cluster for HA
node_locations = var.zones
# Remove default node pool
remove_default_node_pool = true
initial_node_count = 1
# Private cluster
private_cluster_config {
enable_private_nodes = true
enable_private_endpoint = false
master_ipv4_cidr_block = var.master_cidr
}
# Workload Identity
workload_identity_config {
workload_pool = "${var.project_id}.svc.id.goog"
}
# Network policy
network_policy {
enabled = true
provider = "CALICO"
}
# Binary authorization
binary_authorization {
evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
}
# Maintenance policy
maintenance_policy {
recurring_window {
start_time = "2025-01-01T02:00:00Z"
end_time = "2025-01-01T06:00:00Z"
recurrence = "FREQ=WEEKLY;BYDAY=SA"
}
}
}
GKE Node Pools
resource "google_container_node_pool" "general" {
name = "general"
cluster = google_container_cluster.main.name
location = var.region
autoscaling {
min_node_count = 3
max_node_count = 10
}
node_config {
machine_type = "e2-standard-8"
disk_size_gb = 256
disk_type = "pd-ssd"
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform",
]
labels = {
"agentpool" = "general"
"node.kubernetes.io/role" = "general"
}
workload_metadata_config {
mode = "GKE_METADATA"
}
shielded_instance_config {
enable_secure_boot = true
enable_integrity_monitoring = true
}
}
management {
auto_repair = true
auto_upgrade = true
}
upgrade_settings {
max_surge = 3
max_unavailable = 1
}
}
resource "google_container_node_pool" "aicompute" {
name = "aicompute"
cluster = google_container_cluster.main.name
location = var.region
autoscaling {
min_node_count = 2
max_node_count = 8
}
node_config {
machine_type = "n1-standard-8"
disk_size_gb = 512
disk_type = "pd-ssd"
guest_accelerator {
type = "nvidia-tesla-t4"
count = 1
gpu_driver_installation_config {
gpu_driver_version = "LATEST"
}
}
labels = {
"agentpool" = "aicompute"
"node.kubernetes.io/role" = "aicompute"
"matih.ai/workload-type" = "ai-inference"
}
taint {
key = "matih.ai/ai-compute"
value = "true"
effect = "NO_SCHEDULE"
}
workload_metadata_config {
mode = "GKE_METADATA"
}
}
}
GKE Identity: Workload Identity
resource "google_service_account" "ai_service" {
account_id = "matih-ai-service"
display_name = "MATIH AI Service"
project = var.project_id
}
resource "google_service_account_iam_binding" "ai_service_workload_identity" {
service_account_id = google_service_account.ai_service.name
role = "roles/iam.workloadIdentityUser"
members = [
"serviceAccount:${var.project_id}.svc.id.goog[matih-data-plane/ai-service]",
]
}
resource "google_project_iam_member" "ai_service_vertex" {
project = var.project_id
role = "roles/aiplatform.user"
member = "serviceAccount:${google_service_account.ai_service.email}"
}
Storage Classes
MATIH provisions dedicated storage classes for different workload types:
AKS Storage Classes
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: matih-premium-ssd
provisioner: disk.csi.azure.com
parameters:
skuName: Premium_LRS
kind: managed
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: matih-standard-ssd
provisioner: disk.csi.azure.com
parameters:
skuName: StandardSSD_LRS
kind: managed
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: matih-files
provisioner: file.csi.azure.com
parameters:
skuName: Premium_LRS
reclaimPolicy: Retain
allowVolumeExpansion: true
mountOptions:
- dir_mode=0755
- file_mode=0644
- uid=1000
- gid=1000
Storage Class Usage by Workload
| Workload | Storage Class | Size | Rationale |
|---|---|---|---|
| PostgreSQL | matih-premium-ssd | 100Gi | High IOPS for database operations |
| Kafka (Strimzi) | matih-premium-ssd | 200Gi | High throughput for event streaming |
| Prometheus | matih-standard-ssd | 50Gi | Cost-effective metrics storage |
| Loki | matih-standard-ssd | 50Gi | Log aggregation, less IOPS sensitive |
| Grafana | matih-standard-ssd | 10Gi | Dashboard storage |
| Neo4j | matih-premium-ssd | 50Gi | Graph database operations |
| Qdrant | matih-premium-ssd | 50Gi | Vector search performance |
| Elasticsearch | matih-premium-ssd | 100Gi | Full-text search indexing |
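A workload requests one of these classes through a PersistentVolumeClaim. A sketch for the PostgreSQL sizing in the table above (the claim name is hypothetical; StatefulSets typically generate equivalent claims via volumeClaimTemplates):

```yaml
# Sketch: claiming premium SSD storage for a database volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data            # hypothetical claim name
  namespace: matih-data-plane
spec:
  accessModes:
    - ReadWriteOnce              # single-node attach, standard for block storage
  storageClassName: matih-premium-ssd
  resources:
    requests:
      storage: 100Gi             # matches the PostgreSQL row above
```

Because the class uses WaitForFirstConsumer, the disk is provisioned in the availability zone of the node that first schedules the pod, avoiding zone mismatches.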
Container Registry
MATIH uses Azure Container Registry (ACR) as the primary image registry:
Registry: matihlabsacr.azurecr.io
Image Naming Convention
matihlabsacr.azurecr.io/matih/<service-name>:<tag>
Examples:
- matihlabsacr.azurecr.io/matih/ai-service:1.0.0-abc1234
- matihlabsacr.azurecr.io/matih/iam-service:1.0.0-abc1234
- matihlabsacr.azurecr.io/matih/bi-workbench:1.0.0-abc1234
Base Images
| Base Image | Tag | Purpose |
|---|---|---|
| matih/base-java | 1.0.0 | Java Spring Boot services (control plane) |
| matih/base-python-ml | 1.0.0 | Python AI/ML services (data plane) |
| matih/base-node | 1.0.0 | Node.js services (render service) |
| matih/base-nginx | 1.25-alpine | Frontend static serving |
Image Pull Secrets
Each namespace has an acr-secret for authenticating to ACR:
imagePullSecrets:
- name: acr-secret
The secret is created during cluster provisioning and referenced by every service chart.
Cluster Add-ons
The following add-ons are deployed to every MATIH cluster:
| Add-on | Version | Namespace | Purpose |
|---|---|---|---|
| Calico | 3.26+ | kube-system | Network policy enforcement |
| cert-manager | 1.13+ | cert-manager | TLS certificate management |
| External Secrets Operator | 0.9+ | external-secrets | Secret synchronization from vault |
| NGINX Ingress Controller | 1.9+ | matih-ingress | HTTP(S) load balancing |
| External DNS | 0.14+ | matih-system | DNS record management |
| Strimzi Kafka Operator | 0.38+ | matih-system | Kafka cluster management |
| Prometheus Operator | 0.70+ | matih-observability | Monitoring CRD management |
| metrics-server | 0.6+ | kube-system | Resource metrics for HPA/VPA |
| KEDA | 2.12+ | kube-system | Event-driven autoscaling |
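KEDA extends the HPA with event-driven triggers such as Kafka consumer lag. A sketch of a ScaledObject for a data plane consumer (the topic, consumer group, and bootstrap address are illustrative assumptions; the Strimzi-style `<cluster>-kafka-bootstrap` service name would need to match the actual Kafka cluster name):

```yaml
# Sketch: scaling a consumer deployment on Kafka lag with KEDA
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-service-scaler        # hypothetical name
  namespace: matih-data-plane
spec:
  scaleTargetRef:
    name: ai-service             # deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: matih-kafka-bootstrap.matih-system:9092  # assumed address
        consumerGroup: ai-service          # assumed consumer group
        topic: inference-requests          # hypothetical topic
        lagThreshold: "50"                 # add a replica per ~50 messages of lag
```

KEDA manages the underlying HPA itself, so a workload should not define a separate HPA for the same deployment.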
Cluster Upgrade Strategy
MATIH follows a controlled upgrade strategy for Kubernetes version management:
Upgrade Process
1. Test in development: Upgrade the dev cluster first and run the full test suite
2. Control plane upgrade: Upgrade the managed control plane first, since the API server may run ahead of kubelets but not behind them
3. Canary node pool: Add a new node pool with the target version alongside the existing pool
4. Workload migration: Gradually cordon and drain old nodes, allowing pods to reschedule on new nodes
5. Validation: Run health checks (scripts/disaster-recovery/health-check.sh) after each step
6. Cleanup: Remove the old node pool after successful validation
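During the cordon-and-drain step, PodDisruptionBudgets keep a minimum number of replicas ready so the eviction API refuses to drain below a safe floor. A minimal sketch for a data plane service (the name and selector are hypothetical):

```yaml
# Sketch: protecting a service during node drains with a PDB
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ai-service-pdb           # hypothetical name
  namespace: matih-data-plane
spec:
  minAvailable: 2                # evictions pause if fewer than 2 replicas remain ready
  selector:
    matchLabels:
      app: ai-service
```

With a PDB in place, a node drain during upgrade blocks until replacement pods are ready elsewhere, rather than taking the service below its availability floor.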
Maintenance Windows
| Environment | Window | Frequency |
|---|---|---|
| Development | Any time | On-demand |
| Staging | Saturday 02:00-06:00 UTC | Weekly |
| Production | Saturday 02:00-06:00 UTC | Monthly |
Deep Dive: The Kubernetes version skew policy allows the API server to be one minor version ahead of kubelets. MATIH uses this to perform rolling upgrades: control plane first, then node pools one at a time. The max_surge setting on each node pool controls how many extra nodes can be created during an upgrade, with MATIH using 33% surge capacity for production stability.
Cloud Provider Comparison
| Feature | AKS | EKS | GKE |
|---|---|---|---|
| Network plugin | Azure CNI | VPC CNI | GKE-native (VPC) |
| Network policy | Calico | Calico (add-on) | Calico (built-in) |
| Identity | Workload Identity | IRSA | Workload Identity |
| GPU support | NC-series VMs | P3/P4 instances | T4/A100 accelerators |
| Storage | Azure Disks/Files | EBS/EFS | Persistent Disk/Filestore |
| Registry | ACR | ECR | Artifact Registry |
| Secret manager | Key Vault | Secrets Manager | Secret Manager |
| DNS | Azure DNS | Route 53 | Cloud DNS |
| Load balancer | Azure LB Standard | NLB/ALB | Cloud Load Balancing |
| Maintenance windows | Built-in | Managed node groups | Built-in |
| Max pods per node | 250 (Azure CNI) | 110 (default) | 110 (default) |
Troubleshooting
Common Cluster Issues
| Issue | Symptom | Resolution |
|---|---|---|
| Node NotReady | Pods stuck in Pending | Check node pool scaling limits; verify instance quota with cloud provider |
| ImagePullBackOff | Pod cannot start | Verify acr-secret exists in namespace; check image name and tag |
| DNS resolution failure | Service-to-service calls fail | Check CoreDNS pods in kube-system; verify service name and namespace |
| PVC Pending | StatefulSet pods stuck | Verify storage class exists; check cloud storage quota |
| GPU scheduling failure | AI pods in Pending | Verify GPU node pool has available capacity; check NVIDIA device plugin |
| Node pool at capacity | Cluster autoscaler not scaling | Check max_count on node pool; verify cloud provider instance quota |
Diagnostic Commands
All diagnostic operations must be performed through the approved scripts:
# Check platform status
./scripts/tools/platform-status.sh
# Run comprehensive health check
./scripts/disaster-recovery/health-check.sh
# Check AKS-specific health
./scripts/tools/aks-health-check.sh
Next Steps
With the cluster architecture established, the next section covers the namespace topology that organizes workloads within the cluster:
- Next: Namespace Topology
- Previous: Chapter Overview