MATIH Platform is in active MVP development. Documentation reflects current implementation status.
ML Flow

The ML flow traces the lifecycle of a machine learning model from experiment definition through training, evaluation, registration, deployment, and monitoring. This flow involves the ML Service, Ray for distributed compute, MLflow for experiment tracking, and Feast for feature engineering.


ML Lifecycle Path

ML Workbench / API
  |
  v
ML Service (Port 8000)
  | 1. Validate experiment configuration
  | 2. Apply tenant context
  | 3. Prepare training job
  |
  v
Feature Engineering
  | 4. Feast: retrieve training features
  | 5. Point-in-time join for historical features
  |
  v
Distributed Training
  | 6. Ray Train: distribute across GPU/CPU nodes
  | 7. Log metrics to MLflow per epoch
  | 8. Save checkpoints to MinIO
  |
  v
Evaluation
  | 9. Run evaluation on holdout set
  | 10. Compare against baseline model
  |
  v
Registration
  | 11. Register model in MLflow Model Registry
  | 12. Stage: Staging --> Production
  |
  v
Deployment
  | 13. Deploy to Ray Serve / Triton
  | 14. Configure traffic routing
  |
  v
Monitoring
  | 15. Track prediction drift
  | 16. Monitor latency and throughput
  | 17. Alert on quality degradation
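
The point-in-time join in steps 4-5 is what keeps training features leakage-free: each training row only sees feature values that were known at its event timestamp, never future values. A minimal pure-Python sketch of the idea (entity names and feature values are invented for illustration; in the platform this is handled by Feast's offline store):

```python
from bisect import bisect_right
from datetime import datetime

def point_in_time_join(entity_rows, feature_history):
    """For each (entity_id, event_timestamp) training row, attach the most
    recent feature values observed at or before that timestamp -- never
    values from the future, which would leak label information."""
    # feature_history: entity_id -> list of (timestamp, features), sorted by timestamp
    joined = []
    for entity_id, event_ts in entity_rows:
        history = feature_history.get(entity_id, [])
        timestamps = [ts for ts, _ in history]
        idx = bisect_right(timestamps, event_ts) - 1  # latest row at or before event_ts
        features = history[idx][1] if idx >= 0 else None
        joined.append((entity_id, event_ts, features))
    return joined

history = {
    "user_1": [
        (datetime(2024, 1, 1), {"avg_spend": 10.0}),
        (datetime(2024, 2, 1), {"avg_spend": 25.0}),
    ]
}
rows = [("user_1", datetime(2024, 1, 15)), ("user_1", datetime(2024, 3, 1))]
joined_rows = point_in_time_join(rows, history)
```

The January 15 row receives the January 1 feature values even though a newer February row exists, which is exactly the "point-in-time correct" guarantee referenced in the training flow below.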

Experiment Tracking

MLflow tracks every training run:

| Tracked Item | Example |
| --- | --- |
| Parameters | learning_rate: 0.001, batch_size: 32, epochs: 50 |
| Metrics | accuracy: 0.94, f1_score: 0.91, loss: 0.23 |
| Artifacts | Model files, training logs, evaluation reports |
| Tags | experiment_name, model_type, data_version |
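
Concretely, a single run's record resembles the following plain-Python stand-in (values are invented; MLflow's actual client records these fields through its tracking API rather than a dict):

```python
# Illustrative only: the shape of what MLflow tracks for one training run.
run_a = {
    "params": {"learning_rate": 0.001, "batch_size": 32, "epochs": 50},
    "metrics": {"accuracy": 0.94, "f1_score": 0.91, "loss": 0.23},
    "artifacts": ["model/", "logs/train.log", "reports/eval.html"],
    "tags": {"experiment_name": "demo", "model_type": "classifier",
             "data_version": "v1"},
}

def best_run(runs, metric="f1_score"):
    """Compare tracked runs by a chosen metric (higher is better)."""
    return max(runs, key=lambda r: r["metrics"][metric])
```

Tracking every run in this shape is what makes cross-run comparison (e.g. picking the best run by f1_score before registration) a simple query rather than a manual bookkeeping exercise.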

Training Flow

| Step | Component | Description |
| --- | --- | --- |
| Feature retrieval | Feast | Point-in-time correct features from offline store |
| Data preprocessing | Ray Data | Distributed data transformation |
| Model training | Ray Train | Distributed training across cluster |
| Hyperparameter tuning | Ray Tune | Bayesian optimization with early stopping |
| Metric logging | MLflow | Real-time metric tracking per training step |
| Checkpoint saving | MinIO | Model checkpoints for fault tolerance |
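
Steps 7-8 of the lifecycle (per-epoch metric logging plus periodic checkpointing) can be sketched as a toy loop with stand-in callbacks; in the platform the `log_metric` role is played by MLflow and `save_checkpoint` persists to MinIO:

```python
def train(epochs, checkpoint_every, log_metric, save_checkpoint):
    """Toy training loop shaped like the flow above: log a metric every
    epoch and persist a checkpoint every `checkpoint_every` epochs so a
    failed job can resume instead of restarting from scratch."""
    loss = 1.0
    for epoch in range(1, epochs + 1):
        loss *= 0.8  # stand-in for a real optimization step
        log_metric("loss", loss, step=epoch)
        if epoch % checkpoint_every == 0:
            save_checkpoint(epoch, {"loss": loss})
    return loss

metrics, checkpoints = [], {}
final = train(
    epochs=5,
    checkpoint_every=2,
    log_metric=lambda name, value, step: metrics.append((name, step, value)),
    save_checkpoint=lambda epoch, state: checkpoints.update({epoch: state}),
)
```

Injecting the logging and checkpointing sinks as callables is also roughly how the real components stay swappable: the training loop does not need to know whether checkpoints land on local disk or in an object store.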

Model Registry

MLflow Model Registry manages model versions and stages:

| Stage | Description | Access |
| --- | --- | --- |
| None | Newly registered model | Internal use only |
| Staging | Undergoing testing and validation | Pre-production testing |
| Production | Serving live predictions | Production traffic |
| Archived | Retired from service | Historical reference |
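
A guard for these stage transitions might look like the sketch below. The allowed-transition map is an assumption of this sketch (MLflow's own API permits arbitrary stage changes, so any such policy lives in platform code around it):

```python
# Hypothetical stage-transition policy; stage names follow the table above.
ALLOWED = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}

def transition(current, target):
    """Move a model version to a new stage, rejecting moves the policy
    forbids (e.g. promoting an archived model straight to Production)."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move {current} -> {target}")
    return target
```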

Deployment Options

| Option | Technology | Use Case | Latency |
| --- | --- | --- | --- |
| Real-time serving | Ray Serve | Low-latency predictions | < 50 ms |
| Batch inference | Ray / Spark | Large-scale batch predictions | Minutes |
| LLM serving | vLLM | Large language model inference | < 100 ms |
| GPU-optimized | Triton | High-throughput GPU inference | < 20 ms |
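
One way to reason about choosing among the general-purpose backends is to pick the least specialised real-time option whose latency bound still fits the budget, falling back to batch when nothing real-time qualifies and the workload can wait. The helper below is purely illustrative (thresholds mirror the table; this is not a platform API):

```python
# Illustrative only: map a p95 latency budget (ms) to a serving backend.
REALTIME_BACKENDS = [("Ray Serve", 50), ("Triton", 20)]  # (name, typical p95)

def pick_backend(latency_budget_ms):
    """Return the least specialised real-time backend that fits the budget;
    otherwise fall back to batch inference (Ray / Spark)."""
    fitting = [(n, p) for n, p in REALTIME_BACKENDS if p <= latency_budget_ms]
    if fitting:
        # Loosest option that still meets the budget, reserving Triton
        # for workloads that genuinely need GPU-level latency.
        return max(fitting, key=lambda b: b[1])[0]
    return "Ray / Spark batch"
```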

Model Monitoring

Post-deployment monitoring tracks model health:

| Metric | Detection Method | Alert Threshold |
| --- | --- | --- |
| Prediction drift | Statistical divergence test | Distribution shift > 2 standard deviations |
| Feature drift | Input feature distribution monitoring | Significant feature value change |
| Latency degradation | p95 inference latency tracking | Exceeds SLA target |
| Error rate | Prediction failure tracking | Error rate > 1% |
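
The 2-standard-deviation drift rule can be sketched as a mean-shift check (a production deployment would use a proper divergence test such as KS or PSI; the sample scores below are invented):

```python
from statistics import mean, stdev

def drift_detected(baseline, current, threshold_std=2.0):
    """Flag drift when the current mean prediction shifts more than
    `threshold_std` baseline standard deviations from the baseline mean,
    matching the table's 2-sigma alert threshold."""
    shift = abs(mean(current) - mean(baseline))
    return shift > threshold_std * stdev(baseline)

baseline = [0.4, 0.5, 0.6, 0.5, 0.4, 0.6]  # reference prediction scores
shifted = [0.9, 0.85, 0.95]                # distribution has moved
stable = [0.5, 0.52, 0.48]                 # distribution unchanged
```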

Event Publishing

ML lifecycle events are published to Kafka:

| Event | Published When | Consumers |
| --- | --- | --- |
| MODEL_TRAINING_STARTED | Training job begins | Audit, monitoring |
| MODEL_TRAINING_COMPLETED | Training finishes | Audit, notification |
| MODEL_REGISTERED | Model added to registry | Catalog, audit |
| MODEL_DEPLOYED | Model serving begins | Audit, notification |
| MODEL_DRIFT_DETECTED | Drift alert triggered | Notification, alerting |
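
Assembling one of these events before handing it to a Kafka producer might look like the following sketch; the envelope field names, model name, and tenant id are illustrative, not a documented MATIH schema:

```python
import json
from datetime import datetime, timezone

def build_event(event_type, model_name, tenant_id, details=None):
    """Assemble an ML lifecycle event envelope for publishing to Kafka."""
    return {
        "event_type": event_type,
        "model_name": model_name,
        "tenant_id": tenant_id,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "details": details or {},
    }

event = build_event("MODEL_DEPLOYED", "demand-forecast", "tenant-42",
                    {"serving_backend": "Ray Serve"})
payload = json.dumps(event).encode("utf-8")  # bytes a Kafka producer would send
```

Carrying the tenant id on every event keeps downstream consumers (audit, notification) tenant-aware without having to look the context up again.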
