Azure Kubernetes Service (AKS)
AKS is the primary deployment target for MATIH. The cluster is provisioned with Terraform and uses Azure CNI networking, Workload Identity for pod-level access to Azure resources, and Azure Key Vault for secret management.
Cluster Configuration
The AKS cluster is provisioned through the Terraform module at infrastructure/terraform/modules/azure/aks/:
resource "azurerm_kubernetes_cluster" "aks" {
  name                = "matih-${var.environment}"
  location            = var.location
  resource_group_name = var.resource_group_name
  dns_prefix          = "matih-${var.environment}"
  kubernetes_version  = "1.29"

  default_node_pool {
    name            = "system"
    vm_size         = "Standard_D4s_v3"
    node_count      = 3
    vnet_subnet_id  = var.subnet_id
    os_disk_size_gb = 128
    os_disk_type    = "Managed"
    max_pods        = 110
    zones           = [1, 2, 3]
  }

  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = "azure"
    network_policy = "calico"
    service_cidr   = "10.1.0.0/16"
    dns_service_ip = "10.1.0.10"
  }

  oidc_issuer_enabled       = true
  workload_identity_enabled = true
}
Node Pools
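The cluster uses a dedicated node pool for each workload class. Min/Max gives the node-count bounds for each pool; the system pool is pinned at three nodes, and the other pools autoscale between their bounds:
| Node Pool | VM Size | Min/Max Nodes | Purpose | Taint |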
|---|---|---|---|---|
| system | Standard_D4s_v3 | 3/3 | System components | None |
| ctrlplane | Standard_D4s_v3 | 2/5 | Control plane services | matih.ai/control-plane=true:NoSchedule |
| dataplane | Standard_D8s_v3 | 2/10 | Data plane services | matih.ai/data-plane=true:NoSchedule |
| compute | Standard_E16s_v3 | 2/10 | Trino, Spark workers | matih.ai/compute=true:NoSchedule |
| aicompute | Standard_D8s_v3 | 1/8 | AI/ML workloads | matih.ai/ai-compute=true:NoSchedule |
| gpu | Standard_NC6s_v3 | 0/4 | GPU inference | nvidia.com/gpu=true:NoSchedule |
| playground | Standard_D2s_v3 | 1/3 | Playground/free tier | matih.io/playground=true:NoSchedule |
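A taint only keeps unrelated pods off a pool; a workload that should run there needs a matching toleration, and usually a node selector so it is not scheduled elsewhere. The manifest below is a minimal sketch of that pairing for the dataplane pool. The Deployment name, namespace, and image tag are hypothetical, and the selector assumes the standard `kubernetes.azure.com/agentpool` label that AKS applies to nodes with the pool name as its value.

```yaml
# Sketch: schedule a workload onto the tainted "dataplane" pool.
# Name, namespace, and image are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-data-service
  namespace: matih-data-plane
spec:
  replicas: 2
  selector:
    matchLabels:
      app: example-data-service
  template:
    metadata:
      labels:
        app: example-data-service
    spec:
      # Attract the pod to the pool (AKS labels nodes with the pool name).
      nodeSelector:
        kubernetes.azure.com/agentpool: dataplane
      # Permit scheduling despite the pool's NoSchedule taint.
      tolerations:
        - key: matih.ai/data-plane
          operator: Equal
          value: "true"
          effect: NoSchedule
      containers:
        - name: app
          image: matihlabsacr.azurecr.io/matih/example-data-service:latest
```

The same pattern applies to the other pools by swapping the pool name and the taint key from the table.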
Workload Identity
AKS Workload Identity allows pods to authenticate to Azure services without embedded credentials:
# ServiceAccount with Workload Identity annotation
apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-secrets
  namespace: external-secrets
  annotations:
    azure.workload.identity/client-id: "${AKS_IDENTITY_CLIENT_ID}"
The following services use Workload Identity:
| Service | Azure Resource | Purpose |
|---|---|---|
| external-secrets | Azure Key Vault | Secret synchronization |
| cert-manager | Azure DNS | DNS01 ACME challenges |
| infrastructure-service | ARM API | Tenant infrastructure provisioning |
| ai-service | Azure OpenAI | LLM inference API calls |
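The ServiceAccount annotation is only half of the wiring: the pod itself must opt in with the `azure.workload.identity/use: "true"` label so that the Workload Identity webhook injects the federated token and the `AZURE_*` environment variables. The following is a minimal sketch of such a pod; the pod name, container name, and image are hypothetical. It also assumes a federated identity credential exists on the Azure side for the subject `system:serviceaccount:external-secrets:external-secrets`.

```yaml
# Sketch: a pod that opts in to Workload Identity.
# Pod name, container name, and image are hypothetical placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: workload-identity-example
  namespace: external-secrets
  labels:
    # Required for the webhook to mutate this pod.
    azure.workload.identity/use: "true"
spec:
  # Must reference the ServiceAccount carrying the client-id annotation.
  serviceAccountName: external-secrets
  containers:
    - name: app
      image: matihlabsacr.azurecr.io/matih/example:latest
      # The webhook injects AZURE_CLIENT_ID, AZURE_TENANT_ID,
      # AZURE_FEDERATED_TOKEN_FILE, and AZURE_AUTHORITY_HOST,
      # which the Azure SDK credential chain reads automatically.
```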
Azure Container Registry
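Images are stored in Azure Container Registry (ACR), and the AKS kubelet identity is granted pull permission on the registry. The global image configuration sets the registry and the pull secrets: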
# Global image configuration
global:
  imageRegistry: matihlabsacr.azurecr.io/matih
  imagePullSecrets:
    - name: acr-secret
    - name: platform-acr-secret
For multi-tenant deployments, each tenant can have its own ACR with images synced from the platform ACR by the CD pipeline stage 04a.
Network Configuration
AKS uses Azure CNI for pod networking, providing each pod with a routable IP from the VNet:
| Setting | Value |
|---|---|
| Network plugin | Azure CNI |
| Network policy | Calico |
| Service CIDR | 10.1.0.0/16 |
| DNS service IP | 10.1.0.10 |
| Pod CIDR | Allocated from VNet subnet |
| Max pods per node | 110 |
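With Calico as the policy engine, standard Kubernetes NetworkPolicy objects are enforced on pod traffic. As an illustration only (not a description of the policies MATIH actually ships), the sketch below denies all ingress to a namespace by default and then re-allows traffic between pods in that same namespace. The namespace name is hypothetical.

```yaml
# Sketch: default-deny ingress for a namespace, enforced by Calico.
# The namespace name is a hypothetical placeholder.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: matih-data-plane
spec:
  podSelector: {}        # empty selector matches every pod in the namespace
  policyTypes:
    - Ingress            # no ingress rules listed, so all ingress is denied
---
# Sketch: allow traffic between pods inside the same namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: matih-data-plane
spec:
  podSelector: {}
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}   # any pod in this namespace
```

Policies are additive, so the second object opens only the intra-namespace path while everything else stays blocked by the first.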
Monitoring Integration
AKS integrates with Azure Monitor and the platform observability stack:
- Container Insights: Azure-native monitoring (optional, can be disabled if Prometheus is preferred)
- Prometheus: Platform-deployed Prometheus scrapes all service metrics via ServiceMonitor CRDs
- Log Analytics: Optional forwarding to Azure Log Analytics workspace
- Grafana: Platform Grafana with Azure Monitor data source for infrastructure metrics
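For the Prometheus path, a service becomes a scrape target once a ServiceMonitor selects its Service. The sketch below shows the general shape of such an object. Every name, label, port, and namespace in it is hypothetical, and the `release: prometheus` label assumes the Prometheus instance is configured to select ServiceMonitors carrying that label.

```yaml
# Sketch: ServiceMonitor for a hypothetical service.
# All names, labels, ports, and namespaces are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-service
  namespace: monitoring
  labels:
    release: prometheus          # must match the Prometheus serviceMonitorSelector
spec:
  namespaceSelector:
    matchNames:
      - matih-data-plane         # namespace where the target Service lives
  selector:
    matchLabels:
      app.kubernetes.io/name: example-service   # labels on the target Service
  endpoints:
    - port: http-metrics         # named port on the Service, not a number
      path: /metrics
      interval: 30s
```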