MATIH Platform is in active MVP development. Documentation reflects current implementation status.
17. Kubernetes & Helm
Cluster Architecture

The MATIH platform is designed to run on managed Kubernetes services from all three major cloud providers: Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS), and Google Kubernetes Engine (GKE). This section covers the cluster topology, node pool design, networking configuration, identity integration, and storage provisioning for each provider.


Cluster Topology Overview

Every MATIH cluster follows a standard topology regardless of the underlying cloud provider:

+----------------------------------------------------------------+
|                   Managed Kubernetes Cluster                   |
|                                                                |
|  Control Plane (Managed by Cloud Provider)                     |
|  +----------------------------------------------------------+  |
|  | API Server | etcd | Scheduler | Controller Manager       |  |
|  +----------------------------------------------------------+  |
|                                                                |
|  Node Pools                                                    |
|  +------------------+ +------------------+ +----------------+  |
|  | system           | | general          | | aicompute      |  |
|  | (3 nodes)        | | (3-10 nodes)     | | (2-8 nodes)    |  |
|  | CriticalAddons   | | App workloads    | | AI/ML/GPU      |  |
|  | Taints           | | Control + Data   | | Taints         |  |
|  +------------------+ +------------------+ +----------------+  |
|                                                                |
|  +------------------+ +------------------+                     |
|  | data             | | monitoring       |                     |
|  | (3-6 nodes)      | | (2-3 nodes)      |                     |
|  | StatefulSets     | | Prometheus, Loki |                     |
|  | Taints           | | Grafana          |                     |
|  +------------------+ +------------------+                     |
+----------------------------------------------------------------+

Node Pool Design

MATIH uses dedicated node pools to isolate different workload types and optimize resource allocation. Each node pool has a specific purpose, instance type, and scaling configuration.

Node Pool Specifications

| Node Pool  | Purpose                             | Min/Max Nodes | Instance Type (AKS) | Instance Type (EKS) | Instance Type (GKE) | Taints                              |
|------------|-------------------------------------|---------------|---------------------|---------------------|---------------------|-------------------------------------|
| system     | Cluster addons, kube-system         | 3 / 3         | Standard_D4s_v5     | m6i.xlarge          | e2-standard-4       | CriticalAddonsOnly=true:NoSchedule  |
| general    | Control plane + data plane services | 3 / 10        | Standard_D8s_v5     | m6i.2xlarge         | e2-standard-8       | None                                |
| aicompute  | AI/ML inference and training        | 2 / 8         | Standard_NC6s_v3    | p3.2xlarge          | n1-standard-8 (+T4) | matih.ai/ai-compute=true:NoSchedule |
| data       | StatefulSet workloads (databases)   | 3 / 6         | Standard_E8s_v5     | r6i.2xlarge         | n2-highmem-8        | matih.ai/data=true:NoSchedule       |
| monitoring | Observability stack                 | 2 / 3         | Standard_D4s_v5     | m6i.xlarge          | e2-standard-4       | matih.ai/monitoring=true:NoSchedule |

Node Labels

Each node pool carries labels that enable workload scheduling via nodeSelector and affinity rules:

# System node pool labels
node.kubernetes.io/role: system
agentpool: system
 
# General node pool labels
node.kubernetes.io/role: general
agentpool: general
 
# AI compute node pool labels
node.kubernetes.io/role: aicompute
agentpool: aicompute
matih.ai/workload-type: ai-inference
nvidia.com/gpu.present: "true"  # On GPU-enabled pools
 
# Data node pool labels
node.kubernetes.io/role: data
agentpool: data
matih.ai/workload-type: stateful
 
# Monitoring node pool labels
node.kubernetes.io/role: monitoring
agentpool: monitoring
matih.ai/workload-type: observability
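
A workload targets a specific pool by combining a nodeSelector on these labels with a toleration for the pool's taint. A minimal sketch for an AI inference deployment (the deployment name and image tag are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-inference            # illustrative name
  namespace: matih-data-plane
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ai-inference
  template:
    metadata:
      labels:
        app: ai-inference
    spec:
      # Schedule only onto the aicompute pool...
      nodeSelector:
        agentpool: aicompute
      # ...and tolerate its dedicated taint.
      tolerations:
        - key: matih.ai/ai-compute
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: inference
          image: matihlabsacr.azurecr.io/matih/ai-service:1.0.0-abc1234
          resources:
            limits:
              nvidia.com/gpu: 1  # GPU request keeps the pod off non-GPU pools
```

Without the toleration the pod is repelled by the taint; without the nodeSelector it could still land on an untainted pool, so both are needed for strict placement.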

Azure Kubernetes Service (AKS)

AKS is the primary deployment target for MATIH, used in the azure-matihlabs environment.

AKS Cluster Configuration

The AKS cluster is provisioned via Terraform using the infrastructure/terraform/modules/azure/kubernetes module:

# infrastructure/terraform/modules/azure/kubernetes/main.tf (simplified)
 
resource "azurerm_kubernetes_cluster" "main" {
  name                = "matih-${var.environment}-aks"
  location            = var.location
  resource_group_name = var.resource_group_name
  dns_prefix          = "matih-${var.environment}"
  kubernetes_version  = var.kubernetes_version
 
  # OIDC issuer and workload identity (required for Azure Workload Identity)
  oidc_issuer_enabled       = true
  workload_identity_enabled = true
 
  # Managed identity for Azure integrations
  identity {
    type = "SystemAssigned"
  }
 
  # API server access profile
  api_server_access_profile {
    authorized_ip_ranges = var.authorized_ip_ranges
  }
 
  # Azure AD integration for RBAC
  azure_active_directory_role_based_access_control {
    managed                = true
    admin_group_object_ids = var.admin_group_ids
    azure_rbac_enabled     = true
  }
 
  # Network profile
  network_profile {
    network_plugin    = "azure"
    network_policy    = "calico"
    load_balancer_sku = "standard"
    service_cidr      = var.service_cidr
    dns_service_ip    = var.dns_service_ip
  }
 
  # Azure Monitor integration
  oms_agent {
    log_analytics_workspace_id = var.log_analytics_workspace_id
  }
 
  # Maintenance window
  # Maintenance window
  maintenance_window {
    allowed {
      day   = "Saturday"
      hours = [2, 3, 4, 5]  # individual hour slots covering 02:00-06:00 UTC
    }
  }
 
  tags = var.tags
}

AKS Node Pools

# System node pool (default)
resource "azurerm_kubernetes_cluster_node_pool" "system" {
  name                  = "system"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_D4s_v5"
  node_count            = 3
  mode                  = "System"
 
  os_disk_size_gb = 128
  os_disk_type    = "Managed"
  os_sku          = "AzureLinux"
 
  node_labels = {
    "agentpool"                = "system"
    "node.kubernetes.io/role"  = "system"
  }
 
  node_taints = ["CriticalAddonsOnly=true:NoSchedule"]
 
  upgrade_settings {
    max_surge = "33%"
  }
}
 
# General workload node pool
resource "azurerm_kubernetes_cluster_node_pool" "general" {
  name                  = "general"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_D8s_v5"
  min_count             = 3
  max_count             = 10
  enable_auto_scaling   = true
  mode                  = "User"
 
  os_disk_size_gb = 256
  os_disk_type    = "Managed"
  os_sku          = "AzureLinux"
 
  node_labels = {
    "agentpool"                = "general"
    "node.kubernetes.io/role"  = "general"
  }
 
  zones = ["1", "2", "3"]
 
  upgrade_settings {
    max_surge = "33%"
  }
}
 
# AI compute node pool with GPU
resource "azurerm_kubernetes_cluster_node_pool" "aicompute" {
  name                  = "aicompute"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_NC6s_v3"
  min_count             = 2
  max_count             = 8
  enable_auto_scaling   = true
  mode                  = "User"
 
  os_disk_size_gb = 512
  os_disk_type    = "Managed"
 
  node_labels = {
    "agentpool"                  = "aicompute"
    "node.kubernetes.io/role"    = "aicompute"
    "matih.ai/workload-type"     = "ai-inference"
  }
 
  node_taints = ["matih.ai/ai-compute=true:NoSchedule"]
 
  zones = ["1", "2"]
 
  upgrade_settings {
    max_surge = "33%"
  }
}
 
# Data node pool for stateful workloads
resource "azurerm_kubernetes_cluster_node_pool" "data" {
  name                  = "data"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_E8s_v5"
  min_count             = 3
  max_count             = 6
  enable_auto_scaling   = true
  mode                  = "User"
 
  os_disk_size_gb = 512
  os_disk_type    = "Managed"
 
  node_labels = {
    "agentpool"                  = "data"
    "node.kubernetes.io/role"    = "data"
    "matih.ai/workload-type"     = "stateful"
  }
 
  node_taints = ["matih.ai/data=true:NoSchedule"]
 
  zones = ["1", "2", "3"]
}

AKS Networking

MATIH uses Azure CNI networking for full VNET integration:

+------------------------------------------------------------------+
|  Azure Virtual Network (10.0.0.0/8)                              |
|                                                                  |
|  +----------------------------+                                  |
|  | AKS Subnet (10.1.0.0/16)   |                                  |
|  | - Pod IPs: 10.1.x.x        |                                  |
|  | - Node IPs: 10.1.0.x       |                                  |
|  +----------------------------+                                  |
|                                                                  |
|  +----------------------------+                                  |
|  | Service CIDR (10.2.0.0/16) |                                  |
|  | - ClusterIP Services       |                                  |
|  | - DNS: 10.2.0.10           |                                  |
|  +----------------------------+                                  |
|                                                                  |
|  +----------------------------+  +----------------------------+  |
|  | PostgreSQL Subnet          |  | Redis Subnet               |  |
|  | (10.3.0.0/24)              |  | (10.3.1.0/24)              |  |
|  | - Azure DB for PostgreSQL  |  | - Azure Cache for Redis    |  |
|  +----------------------------+  +----------------------------+  |
+------------------------------------------------------------------+

Key networking decisions:

| Setting           | Value       | Rationale                                    |
|-------------------|-------------|----------------------------------------------|
| Network plugin    | Azure CNI   | Full VNET integration, no overlay network    |
| Network policy    | Calico      | Rich policy support with namespace selectors |
| Load balancer SKU | Standard    | Required for availability zones              |
| Service CIDR      | 10.2.0.0/16 | Non-overlapping with VNET                    |
| DNS service IP    | 10.2.0.10   | Within service CIDR                          |
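
Calico enforces standard Kubernetes NetworkPolicy resources. A minimal sketch of a namespace-scoped ingress policy (the policy name and the `matih-control-plane` source namespace are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-control-plane   # illustrative name
  namespace: matih-data-plane
spec:
  podSelector: {}                  # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        # Only pods in the control-plane namespace may connect.
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: matih-control-plane
```

Because the empty podSelector selects all pods, any namespace carrying such a policy switches to default-deny for the listed policy types, with only the listed sources allowed.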

AKS Identity Integration

MATIH uses Azure Workload Identity for pod-to-Azure service authentication:

# Service account with workload identity annotation
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ai-service
  namespace: matih-data-plane
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"
  labels:
    azure.workload.identity/use: "true"

The Terraform module creates managed identities and federated credentials:

resource "azurerm_user_assigned_identity" "ai_service" {
  name                = "matih-${var.environment}-ai-service"
  resource_group_name = var.resource_group_name
  location            = var.location
}
 
resource "azurerm_federated_identity_credential" "ai_service" {
  name                = "ai-service-federated"
  resource_group_name = var.resource_group_name
  parent_id           = azurerm_user_assigned_identity.ai_service.id
  audience            = ["api://AzureADTokenExchange"]
  issuer              = azurerm_kubernetes_cluster.main.oidc_issuer_url
  subject             = "system:serviceaccount:matih-data-plane:ai-service"
}
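
With the federated credential in place, any pod that runs under the annotated service account (and carries the workload identity label) gets a projected OIDC token mounted automatically. A minimal sketch, with an illustrative pod name:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ai-service-smoke-test      # illustrative name
  namespace: matih-data-plane
  labels:
    azure.workload.identity/use: "true"
spec:
  serviceAccountName: ai-service
  containers:
    - name: main
      image: matihlabsacr.azurecr.io/matih/ai-service:1.0.0-abc1234
      # AZURE_CLIENT_ID, AZURE_TENANT_ID, and AZURE_FEDERATED_TOKEN_FILE
      # are injected by the workload identity mutating webhook; Azure SDK
      # credential chains pick them up without code changes.
```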

Amazon Elastic Kubernetes Service (EKS)

EKS is supported for AWS deployments through the aws-dev and aws-prod environments.

EKS Cluster Configuration

# infrastructure/terraform/modules/aws/eks/main.tf (simplified)
 
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 19.0"
 
  cluster_name    = "matih-${var.environment}"
  cluster_version = var.kubernetes_version
 
  vpc_id     = var.vpc_id
  subnet_ids = var.private_subnet_ids
 
  cluster_endpoint_private_access = true
  cluster_endpoint_public_access  = true
  cluster_endpoint_public_access_cidrs = var.authorized_cidrs
 
  # EKS managed node groups
  eks_managed_node_groups = {
    system = {
      name           = "system"
      instance_types = ["m6i.xlarge"]
      min_size       = 3
      max_size       = 3
      desired_size   = 3
 
      labels = {
        "agentpool"               = "system"
        "node.kubernetes.io/role" = "system"
      }
 
      taints = {
        critical = {
          key    = "CriticalAddonsOnly"
          value  = "true"
          effect = "NO_SCHEDULE"
        }
      }
    }
 
    general = {
      name           = "general"
      instance_types = ["m6i.2xlarge"]
      min_size       = 3
      max_size       = 10
      desired_size   = 3
 
      labels = {
        "agentpool"               = "general"
        "node.kubernetes.io/role" = "general"
      }
    }
 
    aicompute = {
      name           = "aicompute"
      instance_types = ["p3.2xlarge"]
      min_size       = 2
      max_size       = 8
      desired_size   = 2
 
      ami_type = "AL2_x86_64_GPU"
 
      labels = {
        "agentpool"                  = "aicompute"
        "node.kubernetes.io/role"    = "aicompute"
        "matih.ai/workload-type"     = "ai-inference"
      }
 
      taints = {
        ai_compute = {
          key    = "matih.ai/ai-compute"
          value  = "true"
          effect = "NO_SCHEDULE"
        }
      }
    }
 
    data = {
      name           = "data"
      instance_types = ["r6i.2xlarge"]
      min_size       = 3
      max_size       = 6
      desired_size   = 3
 
      labels = {
        "agentpool"                  = "data"
        "node.kubernetes.io/role"    = "data"
        "matih.ai/workload-type"     = "stateful"
      }
 
      taints = {
        data = {
          key    = "matih.ai/data"
          value  = "true"
          effect = "NO_SCHEDULE"
        }
      }
    }
  }
 
  # IRSA for pod-level IAM
  enable_irsa = true
 
  tags = var.tags
}

EKS Networking

EKS uses the VPC CNI plugin for native VPC networking:

| Setting            | Value              | Rationale                        |
|--------------------|--------------------|----------------------------------|
| CNI plugin         | amazon-vpc-cni-k8s | Native VPC IP addresses for pods |
| Pod networking     | VPC secondary CIDR | Expanded IP space for pods       |
| Service networking | In-cluster         | Standard kube-proxy/iptables     |
| Network policy     | Calico (add-on)    | Installed as EKS add-on          |

EKS Identity: IAM Roles for Service Accounts (IRSA)

# IRSA role for ai-service to access S3, Bedrock, Secrets Manager
module "ai_service_irsa" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
 
  role_name = "matih-${var.environment}-ai-service"
 
  oidc_providers = {
    main = {
      provider_arn               = module.eks.oidc_provider_arn
      namespace_service_accounts = ["matih-data-plane:ai-service"]
    }
  }
 
  role_policy_arns = {
    s3      = aws_iam_policy.ai_service_s3.arn
    bedrock = aws_iam_policy.ai_service_bedrock.arn
    secrets = aws_iam_policy.ai_service_secrets.arn
  }
}
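
On the Kubernetes side, IRSA only requires the role ARN annotation on the service account; the EKS pod identity webhook injects the web identity token. A minimal sketch (the account ID is a placeholder for the ARN output by the IRSA module):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ai-service
  namespace: matih-data-plane
  annotations:
    # Placeholder account ID; substitute the role ARN from module.ai_service_irsa.
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/matih-dev-ai-service
```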

Google Kubernetes Engine (GKE)

GKE is supported for GCP deployments through the gcp-dev and gcp-prod environments.

GKE Cluster Configuration

# infrastructure/terraform/modules/gcp/gke/main.tf (simplified)
 
resource "google_container_cluster" "main" {
  name     = "matih-${var.environment}"
  location = var.region
 
  # Use regional cluster for HA
  node_locations = var.zones
 
  # Remove default node pool
  remove_default_node_pool = true
  initial_node_count       = 1
 
  # Private cluster
  private_cluster_config {
    enable_private_nodes    = true
    enable_private_endpoint = false
    master_ipv4_cidr_block  = var.master_cidr
  }
 
  # Workload Identity
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }
 
  # Network policy
  network_policy {
    enabled  = true
    provider = "CALICO"
  }
 
  # Binary authorization
  binary_authorization {
    evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
  }
 
  # Maintenance policy
  maintenance_policy {
    recurring_window {
      start_time = "2025-01-01T02:00:00Z"
      end_time   = "2025-01-01T06:00:00Z"
      recurrence = "FREQ=WEEKLY;BYDAY=SA"
    }
  }
}

GKE Node Pools

resource "google_container_node_pool" "general" {
  name     = "general"
  cluster  = google_container_cluster.main.name
  location = var.region
 
  autoscaling {
    min_node_count = 3
    max_node_count = 10
  }
 
  node_config {
    machine_type = "e2-standard-8"
    disk_size_gb = 256
    disk_type    = "pd-ssd"
 
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform",
    ]
 
    labels = {
      "agentpool"               = "general"
      "node.kubernetes.io/role" = "general"
    }
 
    workload_metadata_config {
      mode = "GKE_METADATA"
    }
 
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
  }
 
  management {
    auto_repair  = true
    auto_upgrade = true
  }
 
  upgrade_settings {
    max_surge       = 3
    max_unavailable = 1
  }
}
 
resource "google_container_node_pool" "aicompute" {
  name     = "aicompute"
  cluster  = google_container_cluster.main.name
  location = var.region
 
  autoscaling {
    min_node_count = 2
    max_node_count = 8
  }
 
  node_config {
    machine_type = "n1-standard-8"
    disk_size_gb = 512
    disk_type    = "pd-ssd"
 
    guest_accelerator {
      type  = "nvidia-tesla-t4"
      count = 1
      gpu_driver_installation_config {
        gpu_driver_version = "LATEST"
      }
    }
 
    labels = {
      "agentpool"                  = "aicompute"
      "node.kubernetes.io/role"    = "aicompute"
      "matih.ai/workload-type"     = "ai-inference"
    }
 
    taint {
      key    = "matih.ai/ai-compute"
      value  = "true"
      effect = "NO_SCHEDULE"
    }
 
    workload_metadata_config {
      mode = "GKE_METADATA"
    }
  }
}

GKE Identity: Workload Identity

resource "google_service_account" "ai_service" {
  account_id   = "matih-ai-service"
  display_name = "MATIH AI Service"
  project      = var.project_id
}
 
resource "google_service_account_iam_binding" "ai_service_workload_identity" {
  service_account_id = google_service_account.ai_service.name
  role               = "roles/iam.workloadIdentityUser"
 
  members = [
    "serviceAccount:${var.project_id}.svc.id.goog[matih-data-plane/ai-service]",
  ]
}
 
resource "google_project_iam_member" "ai_service_vertex" {
  project = var.project_id
  role    = "roles/aiplatform.user"
  member  = "serviceAccount:${google_service_account.ai_service.email}"
}
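
The binding above is one half of GKE Workload Identity; the other half is an annotation on the Kubernetes service account naming the Google service account it impersonates. A minimal sketch (the project ID is a placeholder for var.project_id):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ai-service
  namespace: matih-data-plane
  annotations:
    # Placeholder project ID; must match the google_service_account email.
    iam.gke.io/gcp-service-account: matih-ai-service@my-project.iam.gserviceaccount.com
```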

Storage Classes

MATIH provisions dedicated storage classes for different workload types:

AKS Storage Classes

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: matih-premium-ssd
provisioner: disk.csi.azure.com
parameters:
  skuName: Premium_LRS
  kind: managed
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
 
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: matih-standard-ssd
provisioner: disk.csi.azure.com
parameters:
  skuName: StandardSSD_LRS
  kind: managed
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
 
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: matih-files
provisioner: file.csi.azure.com
parameters:
  skuName: Premium_LRS
reclaimPolicy: Retain
allowVolumeExpansion: true
mountOptions:
  - dir_mode=0755
  - file_mode=0644
  - uid=1000
  - gid=1000

Storage Class Usage by Workload

| Workload        | Storage Class      | Size  | Rationale                            |
|-----------------|--------------------|-------|--------------------------------------|
| PostgreSQL      | matih-premium-ssd  | 100Gi | High IOPS for database operations    |
| Kafka (Strimzi) | matih-premium-ssd  | 200Gi | High throughput for event streaming  |
| Prometheus      | matih-standard-ssd | 50Gi  | Cost-effective metrics storage       |
| Loki            | matih-standard-ssd | 50Gi  | Log aggregation, less IOPS sensitive |
| Grafana         | matih-standard-ssd | 10Gi  | Dashboard storage                    |
| Neo4j           | matih-premium-ssd  | 50Gi  | Graph database operations            |
| Qdrant          | matih-premium-ssd  | 50Gi  | Vector search performance            |
| Elasticsearch   | matih-premium-ssd  | 100Gi | Full-text search indexing            |
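
Stateful workloads request this storage through volumeClaimTemplates, which also pins them to the data pool. A hedged sketch matching the PostgreSQL sizing (the StatefulSet name and image are illustrative, not the platform's actual chart):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres                   # illustrative name
  namespace: matih-data-plane
spec:
  serviceName: postgres
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      # Pin to the data pool and tolerate its taint.
      nodeSelector:
        agentpool: data
      tolerations:
        - key: matih.ai/data
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: postgres
          image: postgres:16       # illustrative image
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  # One PVC per replica, provisioned on first consumption.
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: matih-premium-ssd
        resources:
          requests:
            storage: 100Gi
```

With `volumeBindingMode: WaitForFirstConsumer` on the storage class, each disk is created in the same availability zone as the node that schedules the pod.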

Container Registry

MATIH uses Azure Container Registry (ACR) as the primary image registry:

Registry: matihlabsacr.azurecr.io

Image Naming Convention

matihlabsacr.azurecr.io/matih/<service-name>:<tag>

Examples:

  • matihlabsacr.azurecr.io/matih/ai-service:1.0.0-abc1234
  • matihlabsacr.azurecr.io/matih/iam-service:1.0.0-abc1234
  • matihlabsacr.azurecr.io/matih/bi-workbench:1.0.0-abc1234

Base Images

| Base Image           | Tag         | Purpose                                   |
|----------------------|-------------|-------------------------------------------|
| matih/base-java      | 1.0.0       | Java Spring Boot services (control plane) |
| matih/base-python-ml | 1.0.0       | Python AI/ML services (data plane)        |
| matih/base-node      | 1.0.0       | Node.js services (render service)         |
| matih/base-nginx     | 1.25-alpine | Frontend static serving                   |

Image Pull Secrets

Each namespace has an acr-secret for authenticating to ACR:

imagePullSecrets:
  - name: acr-secret

The secret is created during cluster provisioning and referenced by every service chart.
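
The secret is a standard dockerconfigjson secret. A minimal sketch (the credential value is a placeholder, not real registry auth):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: acr-secret
  namespace: matih-data-plane
type: kubernetes.io/dockerconfigjson
data:
  # Base64-encoded Docker config containing ACR credentials (placeholder value).
  .dockerconfigjson: eyJhdXRocyI6e319
```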


Cluster Add-ons

The following add-ons are deployed to every MATIH cluster:

| Add-on                    | Version | Namespace           | Purpose                           |
|---------------------------|---------|---------------------|-----------------------------------|
| Calico                    | 3.26+   | kube-system         | Network policy enforcement        |
| cert-manager              | 1.13+   | cert-manager        | TLS certificate management        |
| External Secrets Operator | 0.9+    | external-secrets    | Secret synchronization from vault |
| NGINX Ingress Controller  | 1.9+    | matih-ingress       | HTTP(S) load balancing            |
| External DNS              | 0.14+   | matih-system        | DNS record management             |
| Strimzi Kafka Operator    | 0.38+   | matih-system        | Kafka cluster management          |
| Prometheus Operator       | 0.70+   | matih-observability | Monitoring CRD management         |
| metrics-server            | 0.6+    | kube-system         | Resource metrics for HPA/VPA      |
| KEDA                      | 2.12+   | kube-system         | Event-driven autoscaling          |

Cluster Upgrade Strategy

MATIH follows a controlled upgrade strategy for Kubernetes version management:

Upgrade Process

  1. Test in development: Upgrade the dev cluster first and run the full test suite
  2. Canary node pool: Add a new node pool with the target version alongside the existing pool
  3. Workload migration: Gradually cordon and drain old nodes, allowing pods to reschedule on new nodes
  4. Validation: Run health checks (scripts/disaster-recovery/health-check.sh) after each step
  5. Control plane upgrade: Upgrade the managed control plane after worker nodes are validated
  6. Cleanup: Remove the old node pool after successful validation
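
The cordon-and-drain in step 3 honors PodDisruptionBudgets, so each critical service should declare one to keep replicas available while nodes are replaced. A minimal sketch (the PDB name and app label are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ai-service-pdb             # illustrative name
  namespace: matih-data-plane
spec:
  # Drains block whenever evicting a pod would drop below this floor.
  minAvailable: 1
  selector:
    matchLabels:
      app: ai-service
```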

Maintenance Windows

| Environment | Window                   | Frequency |
|-------------|--------------------------|-----------|
| Development | Any time                 | On-demand |
| Staging     | Saturday 02:00-06:00 UTC | Weekly    |
| Production  | Saturday 02:00-06:00 UTC | Monthly   |

Deep Dive: Kubernetes version skew policy allows the API server to be one minor version ahead of kubelets. MATIH uses this to perform rolling upgrades: control plane first, then node pools one at a time. The max_surge setting on each node pool controls how many extra nodes can be created during an upgrade, with MATIH using 33% surge capacity for production stability.


Cloud Provider Comparison

| Feature             | AKS               | EKS                 | GKE                        |
|---------------------|-------------------|---------------------|----------------------------|
| Network plugin      | Azure CNI         | VPC CNI             | GKE native (VPC-native)    |
| Network policy      | Calico            | Calico (add-on)     | Calico (built-in)          |
| Identity            | Workload Identity | IRSA                | Workload Identity          |
| GPU support         | NC-series VMs     | P3/P4 instances     | T4/A100 accelerators       |
| Storage             | Azure Disks/Files | EBS/EFS             | Persistent Disk/Filestore  |
| Registry            | ACR               | ECR                 | Artifact Registry          |
| Secret manager      | Key Vault         | Secrets Manager     | Secret Manager             |
| DNS                 | Azure DNS         | Route 53            | Cloud DNS                  |
| Load balancer       | Azure LB Standard | NLB/ALB             | Cloud Load Balancing       |
| Maintenance windows | Built-in          | Managed node groups | Built-in                   |
| Max pods per node   | 250 (Azure CNI)   | 110 (default)       | 110 (default)              |

Troubleshooting

Common Cluster Issues

| Issue                  | Symptom                        | Resolution                                                                |
|------------------------|--------------------------------|---------------------------------------------------------------------------|
| Node NotReady          | Pods stuck in Pending          | Check node pool scaling limits; verify instance quota with cloud provider |
| ImagePullBackOff       | Pod cannot start               | Verify acr-secret exists in namespace; check image name and tag           |
| DNS resolution failure | Service-to-service calls fail  | Check CoreDNS pods in kube-system; verify service name and namespace      |
| PVC Pending            | StatefulSet pods stuck         | Verify storage class exists; check cloud storage quota                    |
| GPU scheduling failure | AI pods in Pending             | Verify GPU node pool has available capacity; check NVIDIA device plugin   |
| Node pool at capacity  | Cluster autoscaler not scaling | Check max_count on node pool; verify cloud provider instance quota        |

Diagnostic Commands

All diagnostic operations must be performed through the approved scripts:

# Check platform status
./scripts/tools/platform-status.sh
 
# Run comprehensive health check
./scripts/disaster-recovery/health-check.sh
 
# Check AKS-specific health
./scripts/tools/aks-health-check.sh

Next Steps

With the cluster architecture established, the next section covers the namespace topology that organizes workloads within the cluster.