ML Flow
The ML flow traces the lifecycle of a machine learning model from experiment definition through training, evaluation, registration, deployment, and monitoring. It involves the ML Service, Ray for distributed compute, MLflow for experiment tracking and the model registry, and Feast for feature retrieval.
ML Lifecycle Path
ML Workbench / API
|
v
ML Service (Port 8000)
| 1. Validate experiment configuration
| 2. Apply tenant context
| 3. Prepare training job
|
v
Feature Engineering
| 4. Feast: retrieve training features
| 5. Point-in-time join for historical features
|
v
Distributed Training
| 6. Ray Train: distribute across GPU/CPU nodes
| 7. Log metrics to MLflow per epoch
| 8. Save checkpoints to MinIO
|
v
Evaluation
| 9. Run evaluation on holdout set
| 10. Compare against baseline model
|
v
Registration
| 11. Register model in MLflow Model Registry
| 12. Stage: Staging --> Production
|
v
Deployment
| 13. Deploy to Ray Serve / Triton
| 14. Configure traffic routing
|
v
Monitoring
| 15. Track prediction drift
| 16. Monitor latency and throughput
| 17. Alert on quality degradation
Experiment Tracking
MLflow tracks every training run:
| Tracked Item | Example |
|---|---|
| Parameters | learning_rate: 0.001, batch_size: 32, epochs: 50 |
| Metrics | accuracy: 0.94, f1_score: 0.91, loss: 0.23 |
| Artifacts | Model files, training logs, evaluation reports |
| Tags | experiment_name, model_type, data_version |
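Conceptually, each run bundles these four kinds of data. A minimal in-memory stand-in for a run record (illustrative only, not the real MLflow client API) might look like:

```python
from dataclasses import dataclass, field

@dataclass
class TrainingRun:
    """Illustrative stand-in for an MLflow run: params, stepped metrics,
    tags, and artifact references. Not the actual MLflow client."""
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)   # name -> [(step, value), ...]
    tags: dict = field(default_factory=dict)
    artifacts: list = field(default_factory=list)

    def log_param(self, key, value):
        self.params[key] = value

    def log_metric(self, key, value, step):
        # Metrics are logged per step/epoch, so history is preserved.
        self.metrics.setdefault(key, []).append((step, value))

run = TrainingRun()
run.log_param("learning_rate", 0.001)
run.log_param("batch_size", 32)
run.tags["model_type"] = "classifier"
for epoch in range(3):
    run.log_metric("loss", 1.0 / (epoch + 1), step=epoch)
```

The real MLflow client follows the same shape: parameters are logged once, while metrics carry a step so per-epoch curves can be reconstructed.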
Training Flow
| Step | Component | Description |
|---|---|---|
| Feature retrieval | Feast | Point-in-time correct features from offline store |
| Data preprocessing | Ray Data | Distributed data transformation |
| Model training | Ray Train | Distributed training across cluster |
| Hyperparameter tuning | Ray Tune | Bayesian optimization with early stopping |
| Metric logging | MLflow | Real-time metric tracking per training step |
| Checkpoint saving | MinIO | Model checkpoints for fault tolerance |
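The point-in-time correctness in the first step means each training label is joined only with feature values known at or before the label's timestamp, never later ones. A plain-Python sketch of the idea (a simplified illustration of what Feast's offline store does, with hypothetical data):

```python
from bisect import bisect_right

def point_in_time_join(label_events, feature_history):
    """For each (entity_id, event_ts) label row, attach the latest feature
    value recorded at or before event_ts, avoiding label leakage.

    feature_history: entity_id -> list of (ts, value), sorted by ts.
    """
    joined = []
    for entity_id, event_ts in label_events:
        history = feature_history.get(entity_id, [])
        timestamps = [ts for ts, _ in history]
        idx = bisect_right(timestamps, event_ts)  # rightmost ts <= event_ts
        value = history[idx - 1][1] if idx > 0 else None
        joined.append((entity_id, event_ts, value))
    return joined

# Hypothetical feature history for one entity.
features = {"user_1": [(100, 0.2), (200, 0.5), (300, 0.9)]}
rows = point_in_time_join([("user_1", 250), ("user_1", 50)], features)
# The label at ts=250 sees the value from ts=200; the label at ts=50
# predates all feature values, so it gets None.
```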
Model Registry
MLflow Model Registry manages model versions and stages:
| Stage | Description | Access |
|---|---|---|
| None | Newly registered model | Internal use only |
| Staging | Undergoing testing and validation | Pre-production testing |
| Production | Serving live predictions | Production traffic |
| Archived | Retired from service | Historical reference |
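The stage lifecycle above can be enforced with a small transition table. This is an illustrative policy layered on top of the registry, not MLflow's own behavior (MLflow itself does not restrict which transitions are allowed):

```python
# Allowed stage transitions, mirroring the table above.
ALLOWED = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),  # terminal stage
}

def transition(current, target):
    """Validate and apply a stage transition; reject anything not in the
    policy table (e.g. promoting an archived model back to Production)."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"Illegal transition: {current} -> {target}")
    return target

stage = transition("None", "Staging")       # after registration
stage = transition(stage, "Production")     # after validation passes
```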
Deployment Options
| Option | Technology | Use Case | Latency |
|---|---|---|---|
| Real-time serving | Ray Serve | Low-latency predictions | Less than 50ms |
| Batch inference | Ray / Spark | Large-scale batch predictions | Minutes |
| LLM serving | vLLM | Large language model inference | Less than 100ms |
| GPU-optimized | Triton | High-throughput GPU inference | Less than 20ms |
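Traffic routing for a newly deployed version (step 14 in the flow) is commonly done with a deterministic hash split, so a fixed fraction of requests reaches the new model and each request routes consistently. A minimal sketch with hypothetical model names:

```python
import hashlib

def route(request_id, canary_weight=0.1):
    """Send roughly canary_weight of traffic to the new version. Hashing
    the request ID makes routing deterministic (sticky) per request."""
    digest = hashlib.sha256(request_id.encode()).hexdigest()
    bucket = int(digest, 16) % 1000
    return "model_v2" if bucket < canary_weight * 1000 else "model_v1"

counts = {"model_v1": 0, "model_v2": 0}
for i in range(10000):
    counts[route(f"req-{i}")] += 1
# counts["model_v2"] lands near 1000 (about 10% of traffic).
```

In Ray Serve or a service mesh the same idea is expressed as weighted routing config; the point here is only the split mechanism.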
Model Monitoring
Post-deployment monitoring tracks model health:
| Metric | Detection Method | Alert Threshold |
|---|---|---|
| Prediction drift | Statistical divergence test | Distribution shift exceeding 2 standard deviations |
| Feature drift | Input feature distribution monitoring | Significant feature value change |
| Latency degradation | p95 inference latency tracking | Exceeding SLA target |
| Error rate | Prediction failure tracking | Error rate exceeding 1% |
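The "2 standard deviations" drift rule can be sketched as a mean-shift check, a simplified stand-in for the divergence tests (e.g. KS statistic or PSI) a production monitor would run:

```python
from statistics import mean, stdev

def drift_alert(baseline, live, threshold=2.0):
    """Flag drift when the live mean deviates from the baseline mean by
    more than `threshold` baseline standard deviations."""
    mu, sigma = mean(baseline), stdev(baseline)
    return abs(mean(live) - mu) / sigma > threshold

# Hypothetical prediction scores from training time vs. live traffic.
baseline_scores = [0.1, 0.2, 0.15, 0.25, 0.2]
drifted = drift_alert(baseline_scores, [0.9, 0.95, 1.0])  # large shift
```

A real monitor would compare full distributions in sliding windows rather than single means, but the alerting logic is the same: compute a divergence score, compare against a threshold, fire an alert.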
Event Publishing
ML lifecycle events are published to Kafka:
| Event | Published When | Consumers |
|---|---|---|
| MODEL_TRAINING_STARTED | Training job begins | Audit, monitoring |
| MODEL_TRAINING_COMPLETED | Training finishes | Audit, notification |
| MODEL_REGISTERED | Model added to registry | Catalog, audit |
| MODEL_DEPLOYED | Model serving begins | Audit, notification |
| MODEL_DRIFT_DETECTED | Drift alert triggered | Notification, alerting |
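An event envelope for these topics might be built as below. The field names are assumptions for illustration, not the platform's actual schema:

```python
import json
import time
import uuid

def make_lifecycle_event(event_type, model_name, tenant_id, details=None):
    """Build a lifecycle event envelope ready for Kafka (illustrative
    schema: field names are assumed, not taken from the platform)."""
    return {
        "event_id": str(uuid.uuid4()),   # idempotency key for consumers
        "event_type": event_type,        # e.g. MODEL_REGISTERED
        "model_name": model_name,
        "tenant_id": tenant_id,          # tenant context from step 2
        "timestamp": time.time(),
        "details": details or {},
    }

event = make_lifecycle_event("MODEL_REGISTERED", "churn-classifier", "tenant-42")
payload = json.dumps(event)  # serialized value produced to the Kafka topic
```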
Related Pages
- Pipeline Flow -- Data pipeline orchestration
- ML Engineer Persona -- ML Engineer workflow
- ML Infrastructure -- ML technology stack