Cluster Autoscaler
The Cluster Autoscaler automatically adjusts the number of nodes in the Kubernetes cluster based on pod scheduling demands. When pods cannot be scheduled due to insufficient resources, the Cluster Autoscaler adds nodes. When nodes are underutilized, it removes them to reduce costs.
Cluster Autoscaler Architecture
```
Unschedulable Pods  --> Cluster Autoscaler --> Cloud Provider API --> Add Nodes
Underutilized Nodes --> Cluster Autoscaler --> Cloud Provider API --> Remove Nodes
```

Scaling Triggers
Scale Up
The Cluster Autoscaler adds nodes when:
| Condition | Description |
|---|---|
| Unschedulable pods | Pods in Pending state due to insufficient CPU/memory |
| HPA ceiling | HPA wants more replicas but no node capacity |
| PVC pending | Persistent volumes cannot be provisioned in current zone |
Scale Down
The Cluster Autoscaler removes nodes when:
| Condition | Description |
|---|---|
| Low utilization | Node resource utilization below threshold for 10+ minutes |
| Pods movable | All pods on the node can be rescheduled elsewhere |
| No constraints | No PDBs, local storage, or system pods preventing eviction |
Node Pool Configuration
The MATIH platform uses multiple node pools for workload isolation:
| Node Pool | Instance Type | Min | Max | Autoscale | Purpose |
|---|---|---|---|---|---|
| system | Standard_D4s_v3 | 2 | 4 | Yes | Control plane services |
| dataplane | Standard_D8s_v3 | 2 | 10 | Yes | Data plane services |
| ml-compute | Standard_D16s_v3 | 0 | 6 | Yes | ML training and inference |
| gpu | Standard_NC6s_v3 | 0 | 4 | Yes | GPU workloads (LLM, Triton) |
| monitoring | Standard_D4s_v3 | 1 | 3 | Yes | Prometheus, Grafana, Loki |
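Workloads opt into a pool through node selectors. As a sketch, a training Job pinned to the ml-compute pool might look like the following, assuming every pool carries a `matih.io/node-pool` label (the Terraform below applies that label to the dataplane pool; the same pattern is assumed for the others):

```yaml
# Hypothetical Job pinned to the ml-compute pool via an assumed
# matih.io/node-pool label; image and resource figures are illustrative.
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training
spec:
  template:
    spec:
      nodeSelector:
        matih.io/node-pool: ml-compute
      containers:
        - name: trainer
          image: registry.example.com/trainer:latest
          resources:
            requests:
              cpu: "8"
              memory: 32Gi
      restartPolicy: Never
```

Because ml-compute has a minimum of 0 nodes, submitting this Job to an empty pool leaves the pod unschedulable, which is exactly the condition that triggers a scale-up.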
Configuration Parameters
| Parameter | Value | Description |
|---|---|---|
| `scan-interval` | 10s | How often the autoscaler checks for unschedulable pods |
| `scale-down-delay-after-add` | 10m | Cooldown after adding a node |
| `scale-down-delay-after-delete` | 0s | Cooldown after removing a node |
| `scale-down-unneeded-time` | 10m | Time a node must be underutilized before removal |
| `scale-down-utilization-threshold` | 0.5 | Node utilization below which scale-down is considered |
| `max-graceful-termination-sec` | 600 | Max time for pod graceful termination during scale-down |
| `skip-nodes-with-system-pods` | true | Protect nodes running kube-system pods |
| `skip-nodes-with-local-storage` | true | Protect nodes with local PVs |
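These parameters correspond one-to-one to cluster-autoscaler command-line flags. On a self-managed deployment they would be passed roughly as below (image tag illustrative; on AKS the managed autoscaler profile sets them instead, as described in the next section):

```yaml
# Illustrative container spec fragment for a self-managed
# cluster-autoscaler Deployment; flag values mirror the table above.
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0
    command:
      - ./cluster-autoscaler
      - --scan-interval=10s
      - --scale-down-delay-after-add=10m
      - --scale-down-delay-after-delete=0s
      - --scale-down-unneeded-time=10m
      - --scale-down-utilization-threshold=0.5
      - --max-graceful-termination-sec=600
      - --skip-nodes-with-system-pods=true
      - --skip-nodes-with-local-storage=true
```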
Cloud Provider Integration
| Provider | Managed Offering | API |
|---|---|---|
| Azure | AKS Cluster Autoscaler | Azure VMSS |
| AWS | EKS Cluster Autoscaler | AWS ASG |
| GCP | GKE Cluster Autoscaler | GCE MIG |
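On Azure, the parameters above are not passed as raw flags but tuned through the cluster's autoscaler profile. A hedged Terraform sketch, assuming the azurerm provider's `auto_scaler_profile` block (verify attribute names against your provider version):

```hcl
# Sketch: cluster-wide autoscaler tuning on AKS via auto_scaler_profile;
# values mirror the Configuration Parameters table above.
resource "azurerm_kubernetes_cluster" "main" {
  # ... name, location, default_node_pool, identity omitted ...

  auto_scaler_profile {
    scan_interval                    = "10s"
    scale_down_delay_after_add       = "10m"
    scale_down_delay_after_delete    = "0s"
    scale_down_unneeded              = "10m"
    scale_down_utilization_threshold = "0.5"
    max_graceful_termination_sec     = "600"
    skip_nodes_with_system_pods      = true
    skip_nodes_with_local_storage    = true
  }
}
```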
Azure AKS Configuration
For the MATIH Azure deployment, Cluster Autoscaler is managed natively by AKS:
```hcl
# Node pool autoscaling is configured via Terraform
resource "azurerm_kubernetes_cluster_node_pool" "dataplane" {
  name                  = "dataplane"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_D8s_v3"
  enable_auto_scaling   = true
  min_count             = 2
  max_count             = 10

  node_labels = {
    "matih.io/node-pool" = "dataplane"
  }
}
```

Pod Disruption Budgets
Critical services define Pod Disruption Budgets (PDBs) so the autoscaler cannot drain a node if doing so would push a service below its configured availability floor:
| Service | MinAvailable | MaxUnavailable |
|---|---|---|
| AI Service | 1 | N/A |
| Query Engine | 1 | N/A |
| API Gateway | 1 | N/A |
| PostgreSQL | N/A | 1 |
| Redis | N/A | 1 |
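As a sketch, the AI Service budget above might be expressed as follows (the namespace and label selector are assumptions, not taken from the deployment manifests):

```yaml
# Hypothetical PDB for the AI Service; namespace and labels are assumed.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ai-service
  namespace: matih
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: ai-service
```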
Monitoring
| Metric | Description |
|---|---|
| `cluster_autoscaler_nodes_count` | Current node count by pool |
| `cluster_autoscaler_scaled_up_nodes_total` | Nodes added by the autoscaler |
| `cluster_autoscaler_scaled_down_nodes_total` | Nodes removed by the autoscaler |
| `cluster_autoscaler_unschedulable_pods_count` | Pending unschedulable pods |
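These metrics can be queried directly for alerting. Two illustrative PromQL expressions (metric names as emitted by cluster-autoscaler; thresholds and windows are assumptions to adapt to your alerting policy):

```promql
# Fire when pods sit unschedulable, i.e. the autoscaler cannot place them
cluster_autoscaler_unschedulable_pods_count > 0

# Nodes added across all pools over the last hour
sum(increase(cluster_autoscaler_scaled_up_nodes_total[1h]))
```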
Troubleshooting
| Issue | Symptom | Resolution |
|---|---|---|
| Pods stuck Pending | Nodes not added | Check node pool max limit and quotas |
| Slow scale-up | 5+ minutes to add capacity | Check cloud API response time |
| Nodes not removed | Underutilized nodes remain | Check PDBs and local storage constraints |
| Budget exceeded | Too many nodes running | Lower max node count, or raise the utilization threshold so more nodes qualify for scale-down |
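For most of these issues, the first diagnostic steps are the pending pod's scheduling events and the autoscaler's own status. A sketch (the `cluster-autoscaler-status` ConfigMap is written by the autoscaler in kube-system; on managed AKS its availability can vary by version, so verify on your cluster):

```shell
# Why is the pod pending? Look for FailedScheduling events.
kubectl describe pod <pending-pod> -n <namespace>

# Autoscaler's view of each node group: health, scale-up/scale-down state.
kubectl -n kube-system get configmap cluster-autoscaler-status -o yaml

# Scale-up decisions recorded as events on the pods that triggered them.
kubectl get events -A --field-selector reason=TriggeredScaleUp
```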