MATIH Platform is in active MVP development. Documentation reflects current implementation status.
ML Infrastructure

Ray
Ray provides distributed computing for ML training, hyperparameter tuning, and model serving through Ray Serve.


Architecture

+------------------+     +------------------+
| Ray Head Node    |     | Ray Workers      |
| Port: 6379, 8265 |---->| (Autoscaled)     |
+------------------+     +------------------+

Configuration

config:
  ray:
    headAddress: "ray-head:6379"
    dashboardUrl: "http://ray-head:8265"

Components

| Component | Port | Purpose |
|-----------|------|---------|
| Head Node | 6379 | Cluster coordination (GCS) |
| Dashboard | 8265 | Web UI for job monitoring |
| Client | 10001 | Ray client connection |
| Serve | 8000 | Model serving endpoint |
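
As a sketch, these ports could be exposed through a Kubernetes Service like the one below (the selector label is an assumption; a KubeRay-managed head service exposes the same ports):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ray-head   # matches the headAddress/dashboardUrl host in the config above
spec:
  selector:
    ray.io/node-type: head   # label assumed for illustration
  ports:
    - name: gcs
      port: 6379        # cluster coordination (GCS)
    - name: dashboard
      port: 8265        # web UI for job monitoring
    - name: client
      port: 10001       # Ray client connections
    - name: serve
      port: 8000        # Ray Serve model endpoint
```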

Autoscaling

Ray workers autoscale based on pending task queue depth. The Ray autoscaler adds or removes worker pods to match the resource demand of queued tasks; when new pods cannot be scheduled on existing nodes, the Kubernetes cluster autoscaler provisions additional GPU nodes in response.
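
A minimal sketch of the corresponding worker-group bounds, assuming a KubeRay RayCluster spec (the group name and replica counts are illustrative):

```yaml
spec:
  enableInTreeAutoscaling: true   # run the Ray autoscaler alongside the head
  workerGroupSpecs:
    - groupName: gpu-workers      # illustrative name
      replicas: 0                 # start with no workers; scale up on demand
      minReplicas: 0
      maxReplicas: 8              # upper bound; tune to cluster capacity
```

Setting `minReplicas: 0` lets the cluster scale GPU workers down to zero when the task queue is empty, which avoids paying for idle GPU nodes.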


GPU Support

Ray workers can request GPU resources for training:

worker:
  resources:
    limits:
      nvidia.com/gpu: 1
  nodeSelector:
    nvidia.com/gpu.present: "true"
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"
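
Ray normally detects the GPU from the container resource limit above; if the count advertised to Ray's scheduler needs to be set explicitly, KubeRay's `rayStartParams` can override it (a sketch; the value is illustrative):

```yaml
worker:
  rayStartParams:
    num-gpus: "1"   # advertise one GPU per worker to Ray's scheduler
```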