Stage 14: ML Infrastructure
Stage 14 deploys the machine learning infrastructure stack: Ray (operator and cluster), MLflow for experiment tracking, Feast as the feature store, and JupyterHub for notebook environments.
Source file: scripts/stages/14-ml-infrastructure.sh
Components Deployed
| Component | Chart | Purpose |
|---|---|---|
| KubeRay Operator | kuberay/kuberay-operator (bundled in matih-ray) | Manages RayCluster CRDs |
| RayCluster | matih-ray subchart | Distributed ML training and serving |
| MLflow | matih-mlflow | Experiment tracking, model registry |
| Feast | Custom chart | Feature store for online/offline serving |
| JupyterHub | jupyterhub/jupyterhub | Notebook environments for data scientists |
Ray Deployment
The matih-ray chart bundles the KubeRay operator as a subchart dependency. Legacy standalone operator releases are cleaned up automatically:
# Remove the legacy standalone kuberay-operator release if it exists
if helm status kuberay-operator -n matih-data-plane >/dev/null 2>&1; then
  helm uninstall kuberay-operator -n matih-data-plane --wait
fi
# Deploy the bundled chart
helm upgrade --install matih-ray \
  infrastructure/helm/ray \
  --namespace matih-data-plane \
  --values infrastructure/helm/ray/values-dev.yaml
MLflow Configuration
MLflow stores experiment metadata in PostgreSQL and artifacts in MinIO (dev) or cloud object storage (production):
| Setting | Dev | Production |
|---|---|---|
| Backend store | PostgreSQL via K8s Secret | PostgreSQL via K8s Secret |
| Artifact store | MinIO (S3-compatible) | Azure Blob / S3 |
| Credentials | secretKeyRef from dev secrets | ESO from Key Vault |
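As a rough illustration of how the dev column might translate into an MLflow server invocation, the sketch below assembles a PostgreSQL backend URI and points the artifact store at an S3-compatible endpoint. The host names, database name, bucket, and the `PG_*` and `ARTIFACT_BUCKET` variables are hypothetical placeholders, not values taken from the stage script. `MLFLOW_S3_ENDPOINT_URL`, `--backend-store-uri`, and `--default-artifact-root` are standard MLflow settings. The script only prints the command so it can be inspected without MLflow installed.

```shell
#!/usr/bin/env bash
# Sketch only: every host, database, and bucket name here is a hypothetical placeholder.

# Build a PostgreSQL backend-store URI from its parts.
# Note: passwords with special characters would need URL-encoding first.
build_backend_uri() {
  local user="$1" password="$2" host="$3" port="$4" db="$5"
  printf 'postgresql://%s:%s@%s:%s/%s\n' "$user" "$password" "$host" "$port" "$db"
}

# Print (rather than run) the mlflow server command for inspection.
print_mlflow_command() {
  local backend_uri="$1" bucket="$2"
  printf 'mlflow server --backend-store-uri %s --default-artifact-root s3://%s/\n' \
    "$backend_uri" "$bucket"
}

# Dev profile: MinIO is S3-compatible, so MLflow's S3 client is pointed at it
# through MLFLOW_S3_ENDPOINT_URL (hypothetical in-cluster address).
export MLFLOW_S3_ENDPOINT_URL="${MLFLOW_S3_ENDPOINT_URL:-http://minio.example.svc:9000}"

backend_uri="$(build_backend_uri \
  "${PG_USER:-mlflow}" "${PG_PASSWORD:-changeme}" \
  "${PG_HOST:-postgres.example.svc}" "${PG_PORT:-5432}" "${PG_DB:-mlflow}")"

print_mlflow_command "$backend_uri" "${ARTIFACT_BUCKET:-mlflow-artifacts}"
```

In a cluster, the credentials would be injected from the Kubernetes Secret (via `secretKeyRef` in dev) rather than supplied as inline defaults.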
Libraries Used
| Library | Purpose |
|---|---|
| core/config.sh | Terraform output access |
| k8s/namespace.sh | Namespace management |
| helm/repo.sh | Repository management |
| helm/deploy.sh | Deployment functions |
| k8s/dev-secrets.sh | Dev secrets |
Dependencies
- Requires: 05b-data-plane-infrastructure, 11-compute-engines
- Required by: 15-ai-infrastructure
Dependency Verification
kubectl get pods -n matih-data-plane -l app.kubernetes.io/name=kuberay-operator
kubectl get raycluster -n matih-data-plane
kubectl get pods -n matih-data-plane -l app=mlflow
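The checks above can also be wrapped in a small function that fails fast when something is not ready. This is a minimal sketch, not part of the stage script: the function name `verify_stage14`, the `KUBECTL` override variable, and the 120-second timeout are assumptions chosen for illustration. It relies on `kubectl wait --for=condition=Ready`, which returns a non-zero status if the pods do not become ready in time or if no pods match the selector.

```shell
#!/usr/bin/env bash
# Sketch only: verify_stage14 and the KUBECTL override are hypothetical helpers.

# Allow the kubectl binary to be swapped out (for example, with a stub in tests).
KUBECTL="${KUBECTL:-kubectl}"

# Wait for the operator and MLflow pods, then confirm RayCluster resources can be listed.
# Returns 0 when every check passes, 1 otherwise.
verify_stage14() {
  local ns="${1:-matih-data-plane}"
  local failed=0
  local selector

  for selector in "app.kubernetes.io/name=kuberay-operator" "app=mlflow"; do
    if ! "$KUBECTL" wait --for=condition=Ready pod \
        -l "$selector" -n "$ns" --timeout=120s; then
      echo "not ready: pods with label $selector in $ns" >&2
      failed=1
    fi
  done

  # Listing RayCluster resources also confirms that the CRD is installed.
  if ! "$KUBECTL" get raycluster -n "$ns"; then
    echo "could not list RayCluster resources in $ns" >&2
    failed=1
  fi

  return "$failed"
}
```

After sourcing the snippet, calling `verify_stage14 matih-data-plane` runs all three checks and reports each failing selector on stderr.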