Cluster Architecture
The MATIH platform is designed to run on managed Kubernetes services from all three major cloud providers: Azure Kubernetes Service (AKS), Amazon Elastic Kubernetes Service (EKS), and Google Kubernetes Engine (GKE). This section covers the cluster topology, node pool design, networking configuration, identity integration, and storage provisioning for each provider.
Cluster Topology Overview
Every MATIH cluster follows a standard topology regardless of the underlying cloud provider:
+---------------------------------------------------------------+
| Managed Kubernetes Cluster |
| |
| Control Plane (Managed by Cloud Provider) |
| +----------------------------------------------------------+ |
| | API Server | etcd | Scheduler | Controller Manager | |
| +----------------------------------------------------------+ |
| |
| Node Pools |
| +------------------+ +------------------+ +----------------+ |
| | system | | general | | aicompute | |
| | (3 nodes) | | (3-10 nodes) | | (2-8 nodes) | |
| | CriticalAddons | | App workloads | | AI/ML/GPU | |
| | Taints | | Control + Data | | Taints | |
| +------------------+ +------------------+ +----------------+ |
| |
| +------------------+ +------------------+ |
| | data | | monitoring | |
| | (3-6 nodes) | | (2-3 nodes) | |
| | StatefulSets | | Prometheus, Loki | |
| | Taints | | Grafana | |
| +------------------+ +------------------+ |
+---------------------------------------------------------------+
Node Pool Design
MATIH uses dedicated node pools to isolate different workload types and optimize resource allocation. Each node pool has a specific purpose, instance type, and scaling configuration.
Node Pool Specifications
| Node Pool | Purpose | Min/Max Nodes | Instance Type (AKS) | Instance Type (EKS) | Instance Type (GKE) | Taints |
|---|---|---|---|---|---|---|
| system | Cluster addons, kube-system | 3 / 3 | Standard_D4s_v5 | m6i.xlarge | e2-standard-4 | CriticalAddonsOnly=true:NoSchedule |
| general | Control plane + data plane services | 3 / 10 | Standard_D8s_v5 | m6i.2xlarge | e2-standard-8 | None |
| aicompute | AI/ML inference and training | 2 / 8 | Standard_NC6s_v3 | p3.2xlarge | n1-standard-8 (+ 1× T4 GPU) | matih.ai/ai-compute=true:NoSchedule |
| data | StatefulSet workloads (databases) | 3 / 6 | Standard_E8s_v5 | r6i.2xlarge | n2-highmem-8 | matih.ai/data=true:NoSchedule |
| monitoring | Observability stack | 2 / 3 | Standard_D4s_v5 | m6i.xlarge | e2-standard-4 | matih.ai/monitoring=true:NoSchedule |
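Because the non-general pools are tainted, a pod must carry both a nodeSelector for the pool's label and a matching toleration to land there. A minimal sketch for the aicompute pool, using the labels and taints from the table above (the workload name and replica count are hypothetical):

```yaml
# Sketch: scheduling a GPU workload onto the tainted aicompute pool
apiVersion: apps/v1
kind: Deployment
metadata:
  name: inference-worker        # hypothetical workload name
  namespace: matih-data-plane
spec:
  replicas: 2
  selector:
    matchLabels:
      app: inference-worker
  template:
    metadata:
      labels:
        app: inference-worker
    spec:
      nodeSelector:
        agentpool: aicompute    # label carried by the AI compute pool
      tolerations:
        - key: matih.ai/ai-compute   # matches the pool's NoSchedule taint
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: worker
          image: matihlabsacr.azurecr.io/matih/ai-service:1.0.0-abc1234
          resources:
            limits:
              nvidia.com/gpu: 1  # request one GPU per replica
```

Without the toleration the scheduler rejects the pod for every aicompute node; without the nodeSelector it may drift onto the untainted general pool.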
Node Labels
Each node pool carries labels that enable workload scheduling via nodeSelector and affinity rules:
# System node pool labels
node.kubernetes.io/role: system
agentpool: system
# General node pool labels
node.kubernetes.io/role: general
agentpool: general
# AI compute node pool labels
node.kubernetes.io/role: aicompute
agentpool: aicompute
matih.ai/workload-type: ai-inference
nvidia.com/gpu.present: "true" # On GPU-enabled pools
# Data node pool labels
node.kubernetes.io/role: data
agentpool: data
matih.ai/workload-type: stateful
# Monitoring node pool labels
node.kubernetes.io/role: monitoring
agentpool: monitoring
matih.ai/workload-type: observability
Azure Kubernetes Service (AKS)
AKS is the primary deployment target for MATIH, used in the azure-matihlabs environment.
AKS Cluster Configuration
The AKS cluster is provisioned via Terraform using the infrastructure/terraform/modules/azure/kubernetes module:
# infrastructure/terraform/modules/azure/kubernetes/main.tf (simplified)
resource "azurerm_kubernetes_cluster" "main" {
name = "matih-${var.environment}-aks"
location = var.location
resource_group_name = var.resource_group_name
dns_prefix = "matih-${var.environment}"
kubernetes_version = var.kubernetes_version
# Managed identity for Azure integrations
identity {
type = "SystemAssigned"
}
# API server access profile
api_server_access_profile {
authorized_ip_ranges = var.authorized_ip_ranges
}
# Azure AD integration for RBAC
azure_active_directory_role_based_access_control {
managed = true
admin_group_object_ids = var.admin_group_ids
azure_rbac_enabled = true
}
# Network profile
network_profile {
network_plugin = "azure"
network_policy = "calico"
load_balancer_sku = "standard"
service_cidr = var.service_cidr
dns_service_ip = var.dns_service_ip
}
# Azure Monitor integration
oms_agent {
log_analytics_workspace_id = var.log_analytics_workspace_id
}
# Maintenance window
maintenance_window {
allowed {
day = "Saturday"
hours = [2, 3, 4, 5] # individual hour slots covering the 02:00-06:00 UTC window
}
}
tags = var.tags
}
AKS Node Pools
# System node pool (default)
resource "azurerm_kubernetes_cluster_node_pool" "system" {
name = "system"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = "Standard_D4s_v5"
node_count = 3
mode = "System"
os_disk_size_gb = 128
os_disk_type = "Managed"
os_sku = "AzureLinux"
node_labels = {
"agentpool" = "system"
"node.kubernetes.io/role" = "system"
}
node_taints = ["CriticalAddonsOnly=true:NoSchedule"]
upgrade_settings {
max_surge = "33%"
}
}
# General workload node pool
resource "azurerm_kubernetes_cluster_node_pool" "general" {
name = "general"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = "Standard_D8s_v5"
min_count = 3
max_count = 10
enable_auto_scaling = true
mode = "User"
os_disk_size_gb = 256
os_disk_type = "Managed"
os_sku = "AzureLinux"
node_labels = {
"agentpool" = "general"
"node.kubernetes.io/role" = "general"
}
zones = [1, 2, 3]
upgrade_settings {
max_surge = "33%"
}
}
# AI compute node pool with GPU
resource "azurerm_kubernetes_cluster_node_pool" "aicompute" {
name = "aicompute"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = "Standard_NC6s_v3"
min_count = 2
max_count = 8
enable_auto_scaling = true
mode = "User"
os_disk_size_gb = 512
os_disk_type = "Managed"
node_labels = {
"agentpool" = "aicompute"
"node.kubernetes.io/role" = "aicompute"
"matih.ai/workload-type" = "ai-inference"
}
node_taints = ["matih.ai/ai-compute=true:NoSchedule"]
zones = [1, 2]
upgrade_settings {
max_surge = "33%"
}
}
# Data node pool for stateful workloads
resource "azurerm_kubernetes_cluster_node_pool" "data" {
name = "data"
kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
vm_size = "Standard_E8s_v5"
min_count = 3
max_count = 6
enable_auto_scaling = true
mode = "User"
os_disk_size_gb = 512
os_disk_type = "Managed"
node_labels = {
"agentpool" = "data"
"node.kubernetes.io/role" = "data"
"matih.ai/workload-type" = "stateful"
}
node_taints = ["matih.ai/data=true:NoSchedule"]
zones = [1, 2, 3]
}
AKS Networking
MATIH uses Azure CNI networking for full VNET integration:
+------------------------------------------------------------------+
| Azure Virtual Network (10.0.0.0/8) |
| |
| +----------------------------+ |
| | AKS Subnet (10.1.0.0/16) | |
| | - Pod IPs: 10.1.x.x | |
| | - Node IPs: 10.1.0.x | |
| +----------------------------+ |
| |
| +----------------------------+ |
| | Service CIDR (10.2.0.0/16) | |
| | - ClusterIP Services | |
| | - DNS: 10.2.0.10 | |
| +----------------------------+ |
| |
| +----------------------------+ +----------------------------+ |
| | PostgreSQL Subnet | | Redis Subnet | |
| | (10.3.0.0/24) | | (10.3.1.0/24) | |
| | - Azure DB for PostgreSQL | | - Azure Cache for Redis | |
| +----------------------------+ +----------------------------+ |
+------------------------------------------------------------------+
Key networking decisions:
| Setting | Value | Rationale |
|---|---|---|
| Network plugin | Azure CNI | Full VNET integration, no overlay network |
| Network policy | Calico | Rich policy support with namespace selectors |
| Load balancer SKU | Standard | Required for availability zones |
| Service CIDR | 10.2.0.0/16 | Non-overlapping with VNET |
| DNS service IP | 10.2.0.10 | Within service CIDR |
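With Calico as the policy engine, standard Kubernetes NetworkPolicy resources are enforced. A minimal sketch that restricts a namespace to ingress from control plane services only (the namespace names and labels here are illustrative assumptions, not confirmed MATIH policy):

```yaml
# Sketch: default-deny ingress except from the control plane namespace
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-control-plane   # hypothetical policy name
  namespace: matih-data-plane
spec:
  podSelector: {}                  # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # assumes the standard auto-generated namespace label
              kubernetes.io/metadata.name: matih-control-plane
```

Selecting all pods with an Ingress policyType makes any traffic not matched by a rule drop by default, which is the usual starting point for namespace isolation.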
AKS Identity Integration
MATIH uses Azure Workload Identity for pod-to-Azure service authentication:
# Service account with workload identity annotation
apiVersion: v1
kind: ServiceAccount
metadata:
name: ai-service
namespace: matih-data-plane
annotations:
azure.workload.identity/client-id: "<managed-identity-client-id>"
labels:
azure.workload.identity/use: "true"
The Terraform module creates managed identities and federated credentials:
resource "azurerm_user_assigned_identity" "ai_service" {
name = "matih-${var.environment}-ai-service"
resource_group_name = var.resource_group_name
location = var.location
}
resource "azurerm_federated_identity_credential" "ai_service" {
name = "ai-service-federated"
resource_group_name = var.resource_group_name
parent_id = azurerm_user_assigned_identity.ai_service.id
audience = ["api://AzureADTokenExchange"]
issuer = azurerm_kubernetes_cluster.main.oidc_issuer_url
subject = "system:serviceaccount:matih-data-plane:ai-service"
}
Amazon Elastic Kubernetes Service (EKS)
EKS is supported for AWS deployments through the aws-dev and aws-prod environments.
EKS Cluster Configuration
# infrastructure/terraform/modules/aws/eks/main.tf (simplified)
module "eks" {
source = "terraform-aws-modules/eks/aws"
version = "~> 19.0"
cluster_name = "matih-${var.environment}"
cluster_version = var.kubernetes_version
vpc_id = var.vpc_id
subnet_ids = var.private_subnet_ids
cluster_endpoint_private_access = true
cluster_endpoint_public_access = true
cluster_endpoint_public_access_cidrs = var.authorized_cidrs
# EKS managed node groups
eks_managed_node_groups = {
system = {
name = "system"
instance_types = ["m6i.xlarge"]
min_size = 3
max_size = 3
desired_size = 3
labels = {
"agentpool" = "system"
"node.kubernetes.io/role" = "system"
}
taints = {
critical = {
key = "CriticalAddonsOnly"
value = "true"
effect = "NO_SCHEDULE"
}
}
}
general = {
name = "general"
instance_types = ["m6i.2xlarge"]
min_size = 3
max_size = 10
desired_size = 3
labels = {
"agentpool" = "general"
"node.kubernetes.io/role" = "general"
}
}
aicompute = {
name = "aicompute"
instance_types = ["p3.2xlarge"]
min_size = 2
max_size = 8
desired_size = 2
ami_type = "AL2_x86_64_GPU"
labels = {
"agentpool" = "aicompute"
"node.kubernetes.io/role" = "aicompute"
"matih.ai/workload-type" = "ai-inference"
}
taints = {
ai_compute = {
key = "matih.ai/ai-compute"
value = "true"
effect = "NO_SCHEDULE"
}
}
}
data = {
name = "data"
instance_types = ["r6i.2xlarge"]
min_size = 3
max_size = 6
desired_size = 3
labels = {
"agentpool" = "data"
"node.kubernetes.io/role" = "data"
"matih.ai/workload-type" = "stateful"
}
taints = {
data = {
key = "matih.ai/data"
value = "true"
effect = "NO_SCHEDULE"
}
}
}
}
# IRSA for pod-level IAM
enable_irsa = true
tags = var.tags
}
EKS Networking
EKS uses the VPC CNI plugin for native VPC networking:
| Setting | Value | Rationale |
|---|---|---|
| CNI Plugin | amazon-vpc-cni-k8s | Native VPC IP addresses for pods |
| Pod networking | VPC secondary CIDR | Expanded IP space for pods |
| Service networking | In-cluster | Standard kube-proxy/iptables |
| Network policy | Calico (add-on) | Installed as EKS add-on |
EKS Identity: IAM Roles for Service Accounts (IRSA)
# IRSA role for ai-service to access S3, Bedrock, Secrets Manager
module "ai_service_irsa" {
source = "terraform-aws-modules/iam/aws//modules/iam-role-for-service-accounts-eks"
role_name = "matih-${var.environment}-ai-service"
oidc_providers = {
main = {
provider_arn = module.eks.oidc_provider_arn
namespace_service_accounts = ["matih-data-plane:ai-service"]
}
}
role_policy_arns = {
s3 = aws_iam_policy.ai_service_s3.arn
bedrock = aws_iam_policy.ai_service_bedrock.arn
secrets = aws_iam_policy.ai_service_secrets.arn
}
}
Google Kubernetes Engine (GKE)
GKE is supported for GCP deployments through the gcp-dev and gcp-prod environments.
GKE Cluster Configuration
# infrastructure/terraform/modules/gcp/gke/main.tf (simplified)
resource "google_container_cluster" "main" {
name = "matih-${var.environment}"
location = var.region
# Use regional cluster for HA
node_locations = var.zones
# Remove default node pool
remove_default_node_pool = true
initial_node_count = 1
# Private cluster
private_cluster_config {
enable_private_nodes = true
enable_private_endpoint = false
master_ipv4_cidr_block = var.master_cidr
}
# Workload Identity
workload_identity_config {
workload_pool = "${var.project_id}.svc.id.goog"
}
# Network policy
network_policy {
enabled = true
provider = "CALICO"
}
# Binary authorization
binary_authorization {
evaluation_mode = "PROJECT_SINGLETON_POLICY_ENFORCE"
}
# Maintenance policy
maintenance_policy {
recurring_window {
start_time = "2025-01-01T02:00:00Z"
end_time = "2025-01-01T06:00:00Z"
recurrence = "FREQ=WEEKLY;BYDAY=SA"
}
}
}
GKE Node Pools
resource "google_container_node_pool" "general" {
name = "general"
cluster = google_container_cluster.main.name
location = var.region
autoscaling {
min_node_count = 3
max_node_count = 10
}
node_config {
machine_type = "e2-standard-8"
disk_size_gb = 256
disk_type = "pd-ssd"
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform",
]
labels = {
"agentpool" = "general"
"node.kubernetes.io/role" = "general"
}
workload_metadata_config {
mode = "GKE_METADATA"
}
shielded_instance_config {
enable_secure_boot = true
enable_integrity_monitoring = true
}
}
management {
auto_repair = true
auto_upgrade = true
}
upgrade_settings {
max_surge = 3
max_unavailable = 1
}
}
resource "google_container_node_pool" "aicompute" {
name = "aicompute"
cluster = google_container_cluster.main.name
location = var.region
autoscaling {
min_node_count = 2
max_node_count = 8
}
node_config {
machine_type = "n1-standard-8"
disk_size_gb = 512
disk_type = "pd-ssd"
guest_accelerator {
type = "nvidia-tesla-t4"
count = 1
gpu_driver_installation_config {
gpu_driver_version = "LATEST"
}
}
labels = {
"agentpool" = "aicompute"
"node.kubernetes.io/role" = "aicompute"
"matih.ai/workload-type" = "ai-inference"
}
taint {
key = "matih.ai/ai-compute"
value = "true"
effect = "NO_SCHEDULE"
}
workload_metadata_config {
mode = "GKE_METADATA"
}
}
}
GKE Identity: Workload Identity
resource "google_service_account" "ai_service" {
account_id = "matih-ai-service"
display_name = "MATIH AI Service"
project = var.project_id
}
resource "google_service_account_iam_binding" "ai_service_workload_identity" {
service_account_id = google_service_account.ai_service.name
role = "roles/iam.workloadIdentityUser"
members = [
"serviceAccount:${var.project_id}.svc.id.goog[matih-data-plane/ai-service]",
]
}
resource "google_project_iam_member" "ai_service_vertex" {
project = var.project_id
role = "roles/aiplatform.user"
member = "serviceAccount:${google_service_account.ai_service.email}"
}
Storage Classes
MATIH provisions dedicated storage classes for different workload types:
AKS Storage Classes
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: matih-premium-ssd
provisioner: disk.csi.azure.com
parameters:
skuName: Premium_LRS
kind: managed
reclaimPolicy: Retain
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: matih-standard-ssd
provisioner: disk.csi.azure.com
parameters:
skuName: StandardSSD_LRS
kind: managed
reclaimPolicy: Delete
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: matih-files
provisioner: file.csi.azure.com
parameters:
skuName: Premium_LRS
reclaimPolicy: Retain
allowVolumeExpansion: true
mountOptions:
- dir_mode=0755
- file_mode=0644
- uid=1000
- gid=1000
Storage Class Usage by Workload
| Workload | Storage Class | Size | Rationale |
|---|---|---|---|
| PostgreSQL | matih-premium-ssd | 100Gi | High IOPS for database operations |
| Kafka (Strimzi) | matih-premium-ssd | 200Gi | High throughput for event streaming |
| Prometheus | matih-standard-ssd | 50Gi | Cost-effective metrics storage |
| Loki | matih-standard-ssd | 50Gi | Log aggregation, less IOPS sensitive |
| Grafana | matih-standard-ssd | 10Gi | Dashboard storage |
| Neo4j | matih-premium-ssd | 50Gi | Graph database operations |
| Qdrant | matih-premium-ssd | 50Gi | Vector search performance |
| Elasticsearch | matih-premium-ssd | 100Gi | Full-text search indexing |
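A workload requests one of these classes through a PersistentVolumeClaim. A sketch for the PostgreSQL sizing in the table above (the claim name is hypothetical; StatefulSets typically generate equivalent claims via volumeClaimTemplates):

```yaml
# Sketch: claiming premium SSD storage for a database volume
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: postgres-data            # hypothetical claim name
  namespace: matih-data-plane
spec:
  accessModes:
    - ReadWriteOnce              # single-node attach, standard for block storage
  storageClassName: matih-premium-ssd
  resources:
    requests:
      storage: 100Gi             # matches the PostgreSQL row above
```

Because the class uses WaitForFirstConsumer, the disk is provisioned in the availability zone of the node that first schedules the pod, avoiding zone mismatches.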
Container Registry
MATIH uses Azure Container Registry (ACR) as the primary image registry:
Registry: matihlabsacr.azurecr.io
Image Naming Convention
matihlabsacr.azurecr.io/matih/<service-name>:<tag>
Examples:
- matihlabsacr.azurecr.io/matih/ai-service:1.0.0-abc1234
- matihlabsacr.azurecr.io/matih/iam-service:1.0.0-abc1234
- matihlabsacr.azurecr.io/matih/bi-workbench:1.0.0-abc1234
Base Images
| Base Image | Tag | Purpose |
|---|---|---|
| matih/base-java | 1.0.0 | Java Spring Boot services (control plane) |
| matih/base-python-ml | 1.0.0 | Python AI/ML services (data plane) |
| matih/base-node | 1.0.0 | Node.js services (render service) |
| matih/base-nginx | 1.25-alpine | Frontend static serving |
Image Pull Secrets
Each namespace has an acr-secret for authenticating to ACR:
imagePullSecrets:
- name: acr-secret
The secret is created during cluster provisioning and referenced by every service chart.
Cluster Add-ons
The following add-ons are deployed to every MATIH cluster:
| Add-on | Version | Namespace | Purpose |
|---|---|---|---|
| Calico | 3.26+ | kube-system | Network policy enforcement |
| cert-manager | 1.13+ | cert-manager | TLS certificate management |
| External Secrets Operator | 0.9+ | external-secrets | Secret synchronization from vault |
| NGINX Ingress Controller | 1.9+ | matih-ingress | HTTP(S) load balancing |
| External DNS | 0.14+ | matih-system | DNS record management |
| Strimzi Kafka Operator | 0.38+ | matih-system | Kafka cluster management |
| Prometheus Operator | 0.70+ | matih-observability | Monitoring CRD management |
| metrics-server | 0.6+ | kube-system | Resource metrics for HPA/VPA |
| KEDA | 2.12+ | kube-system | Event-driven autoscaling |
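KEDA extends the HPA with event-driven triggers such as Kafka consumer lag. A sketch of a ScaledObject for a data plane consumer (the topic, consumer group, and bootstrap address are illustrative assumptions; the Strimzi-style `<cluster>-kafka-bootstrap` service name would need to match the actual Kafka cluster name):

```yaml
# Sketch: scaling a consumer deployment on Kafka lag with KEDA
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-service-scaler        # hypothetical name
  namespace: matih-data-plane
spec:
  scaleTargetRef:
    name: ai-service             # deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: matih-kafka-bootstrap.matih-system:9092  # assumed address
        consumerGroup: ai-service          # assumed consumer group
        topic: inference-requests          # hypothetical topic
        lagThreshold: "50"                 # add a replica per ~50 messages of lag
```

KEDA manages the underlying HPA itself, so a workload should not define a separate HPA for the same deployment.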
Cluster Upgrade Strategy
MATIH follows a controlled upgrade strategy for Kubernetes version management:
Upgrade Process
1. Test in development: Upgrade the dev cluster first and run the full test suite
2. Control plane upgrade: Upgrade the managed control plane first, since the API server may run ahead of kubelets but not behind them
3. Canary node pool: Add a new node pool with the target version alongside the existing pool
4. Workload migration: Gradually cordon and drain old nodes, allowing pods to reschedule on new nodes
5. Validation: Run health checks (scripts/disaster-recovery/health-check.sh) after each step
6. Cleanup: Remove the old node pool after successful validation
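During the cordon-and-drain step, PodDisruptionBudgets keep a minimum number of replicas ready so the eviction API refuses to drain below a safe floor. A minimal sketch for a data plane service (the name and selector are hypothetical):

```yaml
# Sketch: protecting a service during node drains with a PDB
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ai-service-pdb           # hypothetical name
  namespace: matih-data-plane
spec:
  minAvailable: 2                # evictions pause if fewer than 2 replicas remain ready
  selector:
    matchLabels:
      app: ai-service
```

With a PDB in place, a node drain during upgrade blocks until replacement pods are ready elsewhere, rather than taking the service below its availability floor.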
Maintenance Windows
| Environment | Window | Frequency |
|---|---|---|
| Development | Any time | On-demand |
| Staging | Saturday 02:00-06:00 UTC | Weekly |
| Production | Saturday 02:00-06:00 UTC | Monthly |
Deep Dive: The Kubernetes version skew policy allows the API server to be one minor version ahead of kubelets. MATIH uses this to perform rolling upgrades: control plane first, then node pools one at a time. The max_surge setting on each node pool controls how many extra nodes can be created during an upgrade, with MATIH using 33% surge capacity for production stability.
Cloud Provider Comparison
| Feature | AKS | EKS | GKE |
|---|---|---|---|
| Network plugin | Azure CNI | VPC CNI | GKE-native (VPC) |
| Network policy | Calico | Calico (add-on) | Calico (built-in) |
| Identity | Workload Identity | IRSA | Workload Identity |
| GPU support | NC-series VMs | P3/P4 instances | T4/A100 accelerators |
| Storage | Azure Disks/Files | EBS/EFS | Persistent Disk/Filestore |
| Registry | ACR | ECR | Artifact Registry |
| Secret manager | Key Vault | Secrets Manager | Secret Manager |
| DNS | Azure DNS | Route 53 | Cloud DNS |
| Load balancer | Azure LB Standard | NLB/ALB | Cloud Load Balancing |
| Maintenance windows | Built-in | Managed node groups | Built-in |
| Max pods per node | 250 (Azure CNI) | 110 (default) | 110 (default) |
Troubleshooting
Common Cluster Issues
| Issue | Symptom | Resolution |
|---|---|---|
| Node NotReady | Pods stuck in Pending | Check node pool scaling limits; verify instance quota with cloud provider |
| ImagePullBackOff | Pod cannot start | Verify acr-secret exists in namespace; check image name and tag |
| DNS resolution failure | Service-to-service calls fail | Check CoreDNS pods in kube-system; verify service name and namespace |
| PVC Pending | StatefulSet pods stuck | Verify storage class exists; check cloud storage quota |
| GPU scheduling failure | AI pods in Pending | Verify GPU node pool has available capacity; check NVIDIA device plugin |
| Node pool at capacity | Cluster autoscaler not scaling | Check max_count on node pool; verify cloud provider instance quota |
Diagnostic Commands
All diagnostic operations must be performed through the approved scripts:
# Check platform status
./scripts/tools/platform-status.sh
# Run comprehensive health check
./scripts/disaster-recovery/health-check.sh
# Check AKS-specific health
./scripts/tools/aks-health-check.sh
Next Steps
With the cluster architecture established, the next section covers the namespace topology that organizes workloads within the cluster:
- Next: Namespace Topology
- Previous: Chapter Overview