# ML Engineer
The ML Engineer trains, evaluates, deploys, and monitors machine learning models within the MATIH Platform. ML Engineers work primarily in the ML Workbench (port 3001) and interact with experiment tracking, model registry, distributed training, and feature store services.
## Role Summary
| Attribute | Details |
|---|---|
| Primary workbench | ML Workbench (3001) |
| Key services | ML Service, Ray, MLflow, Feast, vLLM |
| Common tasks | Train models, run experiments, deploy to production, monitor drift |
| Technical depth | High -- Python, ML frameworks, distributed computing |
## Day-in-the-Life Workflow
| Time | Activity | Platform Feature |
|---|---|---|
| 9:00 AM | Review model performance metrics | MLflow dashboard, model monitoring |
| 9:30 AM | Check feature store freshness | Feast feature status, data quality |
| 10:00 AM | Launch distributed training job | Ray Train, GPU cluster allocation |
| 11:30 AM | Compare experiment results | MLflow experiment comparison |
| 1:00 PM | Register best model to registry | MLflow model registry, versioning |
| 2:00 PM | Deploy model to serving endpoint | Ray Serve or Triton deployment |
| 3:00 PM | Configure A/B test for new model | Model routing, traffic splitting |
| 4:00 PM | Monitor prediction drift | Drift detection alerts, retraining triggers |
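The 4:00 PM drift check above boils down to comparing the live prediction (or feature) distribution against the training distribution. A minimal sketch using the Population Stability Index, a common drift metric — the data and thresholds here are illustrative, not platform defaults:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of a numeric value.

    Bin edges come from the expected (training) distribution; a small
    floor avoids log(0) for empty bins.
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            idx = sum(x > e for e in edges)  # which bin x falls into
            counts[idx] += 1
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]             # uniform training scores
live_ok = [i / 100 for i in range(100)]           # same distribution
live_shift = [0.5 + i / 200 for i in range(100)]  # shifted upward

assert psi(train, live_ok) < 0.1     # common "no drift" rule of thumb
assert psi(train, live_shift) > 0.25  # common "significant drift" rule of thumb
```

A retraining trigger is then just a threshold on this score; the platform's drift-detection alerts encapsulate the same comparison.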
## Key Capabilities
### Experiment Tracking
MLflow provides comprehensive experiment tracking:
| Feature | Description |
|---|---|
| Experiment logging | Track parameters, metrics, and artifacts per run |
| Run comparison | Side-by-side comparison of model runs |
| Artifact storage | Model files stored in MinIO (S3-compatible) |
| Versioning | Semantic versioning of registered models |
| Lineage | Track data inputs to model outputs |
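How run comparison, registry promotion, and semantic versioning fit together can be sketched framework-agnostically. The platform uses MLflow for this; the run records, metric, and promotion rule below are illustrative assumptions, not MLflow's API:

```python
# Each run records its parameters, metrics, and artifact location,
# mirroring what a tracking server stores per run (values made up).
runs = [
    {"run_id": "a1", "params": {"lr": 0.10}, "metrics": {"auc": 0.88},
     "artifact": "s3://mlflow/a1/model"},
    {"run_id": "b2", "params": {"lr": 0.01}, "metrics": {"auc": 0.93},
     "artifact": "s3://mlflow/b2/model"},
    {"run_id": "c3", "params": {"lr": 0.03}, "metrics": {"auc": 0.91},
     "artifact": "s3://mlflow/c3/model"},
]

# Side-by-side comparison: rank runs by the target metric.
ranked = sorted(runs, key=lambda r: r["metrics"]["auc"], reverse=True)
best = ranked[0]

# Registering the winner bumps the model's semantic version
# (minor bump here; the bump policy is an assumption).
registry = {"churn-model": {"version": "1.2.0", "artifact": "s3://mlflow/old"}}

def promote(name, run, registry):
    major, minor, _ = map(int, registry[name]["version"].split("."))
    registry[name] = {"version": f"{major}.{minor + 1}.0",
                      "artifact": run["artifact"]}

promote("churn-model", best, registry)
assert best["run_id"] == "b2"
assert registry["churn-model"]["version"] == "1.3.0"
```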
### Distributed Training
Ray provides distributed compute for ML workloads:
| Feature | Description |
|---|---|
| Ray Train | Distributed training across GPU nodes |
| Ray Tune | Hyperparameter optimization with early stopping |
| Ray Data | Distributed data preprocessing pipelines |
| Auto-scaling | Dynamic cluster scaling based on workload |
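The early stopping behind Ray Tune can be illustrated with a minimal successive-halving loop in plain Python (not Ray's API); the toy training curve and budgets are assumptions:

```python
def train_step(lr, step):
    # Toy "validation score": higher is better, peaks near lr = 0.1
    # and improves with more training steps.
    return 1.0 - abs(lr - 0.1) - 1.0 / (step + 1)

def successive_halving(lrs, rounds=3, steps_per_round=5):
    """After each round of training, keep only the best half of the
    trials and stop the rest early -- the budget goes to the leaders."""
    trials = [{"lr": lr, "step": 0, "score": float("-inf")} for lr in lrs]
    for _ in range(rounds):
        for t in trials:
            t["step"] += steps_per_round
            t["score"] = train_step(t["lr"], t["step"])
        trials.sort(key=lambda t: t["score"], reverse=True)
        trials = trials[: max(1, len(trials) // 2)]  # early-stop the rest
    return trials[0]

best = successive_halving([0.001, 0.01, 0.1, 0.5, 1.0, 2.0])
assert best["lr"] == 0.1
```

Ray Tune's schedulers (e.g. ASHA) apply the same idea asynchronously across a cluster instead of in a single loop.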
### Model Serving
Multiple serving options for different model types:
| Serving Option | Use Case | Technology |
|---|---|---|
| Real-time inference | Low-latency predictions | Ray Serve |
| Batch inference | Large-scale batch predictions | Ray or Spark |
| LLM inference | Large language model serving | vLLM |
| GPU-optimized | High-throughput GPU inference | Triton Inference Server |
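The high-throughput options above all rely on some form of batching: buffering requests so the model runs once per batch rather than once per request. A toy sketch of that logic — the model function and batch size are assumptions, not Triton or vLLM internals:

```python
def dynamic_batch(requests, max_batch_size=4):
    """Group pending requests into batches so the model executes one
    forward pass per batch instead of one per request."""
    return [requests[i : i + max_batch_size]
            for i in range(0, len(requests), max_batch_size)]

def model(batch):
    # Stand-in for one GPU forward pass over a whole batch.
    return [x * 2 for x in batch]

requests = list(range(10))
results = []
for batch in dynamic_batch(requests):
    results.extend(model(batch))

assert len(dynamic_batch(requests)) == 3  # batches of 4 + 4 + 2
assert results == [x * 2 for x in range(10)]
```

Real servers add a small wait window so a batch can fill up before launching, trading a few milliseconds of latency for much higher GPU utilization.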
### Feature Store
Feast manages feature engineering and serving:
| Feature | Description |
|---|---|
| Feature definitions | Declarative feature specifications |
| Online store | Low-latency feature serving for inference |
| Offline store | Historical features for training |
| Point-in-time joins | Correct feature values at training time |
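Point-in-time correctness means each training row sees only the latest feature value recorded at or before its event timestamp, never a later one, which prevents label leakage. A minimal sketch of the join Feast performs under the hood (the data and tuple layout are illustrative):

```python
def point_in_time_join(labels, feature_rows):
    """For each labeled event, pick the most recent feature value
    recorded at or before the event's timestamp (no future leakage)."""
    out = []
    for entity, event_ts, label in labels:
        candidates = [(ts, val) for e, ts, val in feature_rows
                      if e == entity and ts <= event_ts]
        value = max(candidates)[1] if candidates else None
        out.append((entity, event_ts, value, label))
    return out

# (entity_id, timestamp, feature_value) from the offline store
features = [("u1", 1, 0.2), ("u1", 5, 0.7), ("u1", 9, 0.9)]
# (entity_id, event_timestamp, label) -- the training events
labels = [("u1", 4, 0), ("u1", 6, 1), ("u1", 10, 1)]

rows = point_in_time_join(labels, features)
assert [r[2] for r in rows] == [0.2, 0.7, 0.9]
```

Note the event at timestamp 6 gets the value written at 5, not the "better" value written at 9 — exactly what a naive latest-value join would get wrong.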
## Backend Services
| Service | Port | Interaction |
|---|---|---|
| ml-service | 8000 | Model management, experiment orchestration |
| MLflow | -- | Experiment tracking, model registry |
| Ray | -- | Distributed training and serving |
| vLLM | -- | LLM inference serving |
| Feast | -- | Feature store management |
| ai-service | 8000 | Natural language model diagnostics |
## Related Chapters
- ML Platform Capabilities -- Full capability description
- ML Flow -- Model training and serving lifecycle
- Technology Stack: ML Infrastructure -- ML/AI technologies
- Compute Engines -- Ray, Spark, Flink