Autoscaling Overview

The MATIH platform implements a multi-layered autoscaling strategy that automatically adjusts compute resources in response to workload demands. This includes Horizontal Pod Autoscaling (HPA) for pod replicas, Vertical Pod Autoscaling (VPA) for resource requests, custom metrics-based scaling for application-specific signals, and Cluster Autoscaler for node-level capacity management.

Autoscaling Layers

Layer	Component	Scope	Response Time
Pod horizontal	HPA	Pod replicas	15-60 seconds
Pod vertical	VPA	CPU/memory requests	Minutes (on restart)
Application	Custom metrics	Application-specific	15-60 seconds
Cluster	Cluster Autoscaler	Node count	2-5 minutes

Autoscaling Architecture

Prometheus (metrics) --> Metrics Server / Prometheus Adapter
                                |
                    +-----------+-----------+
                    |           |           |
                   HPA         VPA    Custom Metrics
                    |           |           |
               Pod Replicas  Pod Resources  Pod Replicas
                    |           |           |
                    +-----------+-----------+
                                |
                        Cluster Autoscaler
                                |
                           Node Count

Services with Autoscaling

Service	HPA	VPA	Custom Metrics	Min Replicas	Max Replicas
AI Service	Yes	No	LLM queue depth	2	10
Query Engine	Yes	No	Query queue depth	2	8
API Gateway	Yes	No	Request rate	2	10
ML Service	Yes	No	Training job count	1	6
BI Service	Yes	No	Active sessions	1	4
Catalog Service	Yes	No	None	1	3
Render Service	Yes	No	Render queue	1	4

Scaling Behavior

All HPAs use controlled scaling behavior to prevent flapping:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
      - type: Percent
        value: 10
        periodSeconds: 60
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
      - type: Percent
        value: 100
        periodSeconds: 15
      - type: Pods
        value: 4
        periodSeconds: 15
    selectPolicy: Max

Setting	Scale Up	Scale Down
Stabilization window	0 seconds (immediate)	300 seconds (5 minutes)
Max change rate	100% per 15s or 4 pods per 15s	10% per 60s
Policy selection	Maximum (aggressive scale up)	Default

Monitoring

Metric	Description
`kube_hpa_status_current_replicas`	Current replica count per HPA
`kube_hpa_status_desired_replicas`	Target replica count per HPA
`kube_hpa_spec_min_replicas`	Minimum replicas configured
`kube_hpa_spec_max_replicas`	Maximum replicas configured

Detailed Sections

Section	Content
Horizontal Pod Autoscaler	CPU/memory-based pod scaling
Vertical Pod Autoscaler	Automatic resource right-sizing
Custom Metrics	Application-specific scaling signals
Cluster Autoscaler	Node-level capacity management

Autoscaling Horizontal Pod Autoscaler