Horizontal Pod Autoscaler
The Horizontal Pod Autoscaler (HPA) automatically scales the number of pod replicas for MATIH services based on CPU utilization, memory utilization, or custom metrics. HPAs are defined in the Helm chart templates and configured through values files for each environment.
HPA Configuration
The AI Service HPA template in infrastructure/helm/ai-service/templates/hpa.yaml uses the autoscaling/v2 API:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-service
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
Service HPA Configurations
| Service | Min | Max | CPU Target | Memory Target | Custom Metric |
|---|---|---|---|---|---|
| AI Service | 2 | 10 | 70% | 80% | LLM queue depth |
| Query Engine | 2 | 8 | 70% | N/A | Query backlog |
| API Gateway | 2 | 10 | 60% | N/A | Request rate |
| ML Service | 1 | 6 | 70% | 80% | N/A |
| BI Service | 1 | 4 | 70% | N/A | N/A |
| Catalog Service | 1 | 3 | 70% | N/A | N/A |
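The custom metrics in the last column are not part of the core resource-metrics pipeline; they require a metrics adapter (such as prometheus-adapter) to expose them through the custom metrics API. As a hedged sketch, a queue-depth metric like the AI Service's could be added to the HPA metrics list as a Pods metric — the metric name llm_queue_depth and the target value below are illustrative assumptions, not values taken from the chart:

```yaml
  metrics:
    - type: Pods
      pods:
        metric:
          name: llm_queue_depth        # hypothetical metric name, served by a metrics adapter
        target:
          type: AverageValue
          averageValue: "10"           # illustrative target: scale up above 10 queued requests per pod
```

With an AverageValue target, the HPA divides the metric total by the current replica count, so scaling tracks per-pod queue depth rather than the aggregate.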
Scaling Behavior
The MATIH platform uses asymmetric scaling behavior for all HPAs:
Scale Up (Aggressive)
| Parameter | Value | Rationale |
|---|---|---|
| Stabilization window | 0 seconds | React immediately to load spikes |
| Max pods per 15s | 4 | Add capacity quickly |
| Max percentage per 15s | 100% | Double capacity if needed |
| Policy selection | Max | Use whichever policy adds more pods |
Scale Down (Conservative)
| Parameter | Value | Rationale |
|---|---|---|
| Stabilization window | 300 seconds | Wait 5 minutes before scaling down |
| Max percentage per 60s | 10% | Scale down gradually |
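Expressed as an autoscaling/v2 behavior stanza, the two tables above map directly onto the HPA spec. This is a sketch of what the chart templates would render, with values taken from the tables:

```yaml
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0    # react immediately to load spikes
      selectPolicy: Max                # use whichever policy adds more pods
      policies:
        - type: Pods
          value: 4                     # add at most 4 pods per 15s window
          periodSeconds: 15
        - type: Percent
          value: 100                   # or double capacity per 15s window
          periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 300  # wait 5 minutes before scaling down
      policies:
        - type: Percent
          value: 10                    # remove at most 10% of pods per 60s window
          periodSeconds: 60
```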
Helm Values Configuration
HPA settings are configured in the Helm values files:
# values.yaml (base defaults)
autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70
  targetMemoryUtilizationPercentage: 80
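These values are typically consumed by the chart's HPA template behind an enabled guard. A minimal sketch, assuming the field names above (the guard pattern is standard Helm practice, not confirmed from the repo):

```yaml
# infrastructure/helm/ai-service/templates/hpa.yaml (sketch)
{{- if .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-service
spec:
  minReplicas: {{ .Values.autoscaling.minReplicas }}
  maxReplicas: {{ .Values.autoscaling.maxReplicas }}
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: {{ .Values.autoscaling.targetCPUUtilizationPercentage }}
{{- end }}
```

Because the resource is wrapped in the guard, setting autoscaling.enabled to false (as in the dev overrides) omits the HPA entirely rather than creating one with inert settings.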
# values-dev.yaml (development overrides)
autoscaling:
  enabled: false # Disabled in dev to save resources
  minReplicas: 1
  maxReplicas: 3
Prerequisites
HPA requires the Metrics Server to be deployed in the cluster:
Kubernetes Metrics Server --> kubelet cAdvisor --> Pod CPU/Memory metrics
          |
          v
HPA Controller --> Scale Deployment replicas
Troubleshooting
| Issue | Symptom | Resolution |
|---|---|---|
| HPA not scaling | TARGETS column shows unknown/70% in kubectl get hpa | Verify Metrics Server is running |
| Scaling too slow | Pods not added during load spike | Reduce stabilization window |
| Scaling oscillation | Replicas rapidly increase and decrease | Increase stabilization window |
| At max replicas | HPA at maxReplicas during peak | Increase maxReplicas or optimize pod resources |
Monitoring
| Metric | Description |
|---|---|
| kube_hpa_status_current_replicas | Current replica count |
| kube_hpa_status_desired_replicas | Desired replica count |
| kube_hpa_status_condition | HPA condition (ScalingActive, AbleToScale) |
Alert rules should be configured for:
- HPA at maximum replicas for more than 30 minutes
- HPA unable to scale (metrics unavailable)
- Rapid scaling events (more than 5 scale changes in 10 minutes)
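As one concrete example, the first rule can be expressed as a Prometheus alerting rule. This sketch uses the kube-state-metrics series from the table above; kube_hpa_spec_max_replicas is an assumed companion metric and the threshold follows the list above:

```yaml
groups:
  - name: hpa-alerts
    rules:
      - alert: HPAAtMaxReplicas
        # assumed companion metric kube_hpa_spec_max_replicas; fires when current == max for 30 minutes
        expr: kube_hpa_status_current_replicas >= kube_hpa_spec_max_replicas
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "HPA {{ $labels.hpa }} has been at maxReplicas for 30 minutes"
```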