# Ray
Ray provides distributed computing for ML training, hyperparameter tuning, and model serving through Ray Serve.
## Architecture
```
+------------------+     +------------------+
|  Ray Head Node   |     |   Ray Workers    |
| Port: 6379, 8265 |---->|   (Autoscaled)   |
+------------------+     +------------------+
```

## Configuration
```yaml
config:
  ray:
    headAddress: "ray-head:6379"
    dashboardUrl: "http://ray-head:8265"
```

## Components
| Component | Port | Purpose |
|---|---|---|
| Head Node | 6379 | Cluster coordination (GCS) |
| Dashboard | 8265 | Web UI for job monitoring |
| Client | 10001 | Ray client connection |
| Serve | 8000 | Model serving endpoint |
## Autoscaling
Ray workers autoscale based on pending task queue depth: the Ray autoscaler requests additional worker pods when tasks cannot be placed on existing workers. If those pods are unschedulable, the Kubernetes cluster autoscaler in turn provisions new GPU nodes to fit them.
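Scaling bounds are typically declared per worker group. A hypothetical sketch; the `minReplicas`/`maxReplicas` field names follow KubeRay's RayCluster conventions and may differ in this chart:

```yaml
worker:
  replicas: 1      # initial pool size
  minReplicas: 1   # autoscaler never scales below this
  maxReplicas: 8   # upper bound; caps GPU node provisioning
```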
## GPU Support
Ray workers can request GPU resources for training:
```yaml
worker:
  resources:
    limits:
      nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu.present: "true"
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
```