# ML Engineer
The ML Engineer trains, evaluates, deploys, and monitors machine learning models within the MATIH Platform. ML Engineers work primarily in the ML Workbench (port 3001) and interact with experiment tracking, model registry, distributed training, and feature store services.
## Role Summary
| Attribute | Details |
|---|---|
| Primary workbench | ML Workbench (3001) |
| Key services | ML Service, Ray, MLflow, Feast, vLLM |
| Common tasks | Train models, run experiments, deploy to production, monitor drift |
| Technical depth | High -- Python, ML frameworks, distributed computing |
## Day-in-the-Life Workflow
| Time | Activity | Platform Feature |
|---|---|---|
| 9:00 AM | Review model performance metrics | MLflow dashboard, model monitoring |
| 9:30 AM | Check feature store freshness | Feast feature status, data quality |
| 10:00 AM | Launch distributed training job | Ray Train, GPU cluster allocation |
| 11:30 AM | Compare experiment results | MLflow experiment comparison |
| 1:00 PM | Register best model to registry | MLflow model registry, versioning |
| 2:00 PM | Deploy model to serving endpoint | Ray Serve or Triton deployment |
| 3:00 PM | Configure A/B test for new model | Model routing, traffic splitting |
| 4:00 PM | Monitor prediction drift | Drift detection alerts, retraining triggers |
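The 4:00 PM drift check above boils down to comparing the live prediction (or feature) distribution against the training distribution. A minimal sketch using the Population Stability Index, a common drift metric — the data and thresholds here are illustrative, not platform defaults:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples of a numeric value.

    Bin edges come from the expected (training) distribution; a small
    floor avoids log(0) for empty bins.
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def frac(sample):
        counts = [0] * bins
        for x in sample:
            idx = sum(x > e for e in edges)  # which bin x falls into
            counts[idx] += 1
        return [max(c / len(sample), 1e-6) for c in counts]

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [i / 100 for i in range(100)]             # uniform training scores
live_ok = [i / 100 for i in range(100)]           # same distribution
live_shift = [0.5 + i / 200 for i in range(100)]  # shifted upward

assert psi(train, live_ok) < 0.1     # common "no drift" rule of thumb
assert psi(train, live_shift) > 0.25  # common "significant drift" rule of thumb
```

A retraining trigger is then just a threshold on this score; the platform's drift-detection alerts encapsulate the same comparison.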
## Key Capabilities
### Experiment Tracking
MLflow provides comprehensive experiment tracking:
| Feature | Description |
|---|---|
| Experiment logging | Track parameters, metrics, and artifacts per run |
| Run comparison | Side-by-side comparison of model runs |
| Artifact storage | Model files stored in MinIO (S3-compatible) |
| Versioning | Semantic versioning of registered models |
| Lineage | Track data inputs to model outputs |
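How run comparison, registry promotion, and semantic versioning fit together can be sketched framework-agnostically. The platform uses MLflow for this; the run records, metric, and promotion rule below are illustrative assumptions, not MLflow's API:

```python
# Each run records its parameters, metrics, and artifact location,
# mirroring what a tracking server stores per run (values made up).
runs = [
    {"run_id": "a1", "params": {"lr": 0.10}, "metrics": {"auc": 0.88},
     "artifact": "s3://mlflow/a1/model"},
    {"run_id": "b2", "params": {"lr": 0.01}, "metrics": {"auc": 0.93},
     "artifact": "s3://mlflow/b2/model"},
    {"run_id": "c3", "params": {"lr": 0.03}, "metrics": {"auc": 0.91},
     "artifact": "s3://mlflow/c3/model"},
]

# Side-by-side comparison: rank runs by the target metric.
ranked = sorted(runs, key=lambda r: r["metrics"]["auc"], reverse=True)
best = ranked[0]

# Registering the winner bumps the model's semantic version
# (minor bump here; the bump policy is an assumption).
registry = {"churn-model": {"version": "1.2.0", "artifact": "s3://mlflow/old"}}

def promote(name, run, registry):
    major, minor, _ = map(int, registry[name]["version"].split("."))
    registry[name] = {"version": f"{major}.{minor + 1}.0",
                      "artifact": run["artifact"]}

promote("churn-model", best, registry)
assert best["run_id"] == "b2"
assert registry["churn-model"]["version"] == "1.3.0"
```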
### Distributed Training
Ray provides distributed compute for ML workloads:
| Feature | Description |
|---|---|
| Ray Train | Distributed training across GPU nodes |
| Ray Tune | Hyperparameter optimization with early stopping |
| Ray Data | Distributed data preprocessing pipelines |
| Auto-scaling | Dynamic cluster scaling based on workload |
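The early stopping behind Ray Tune can be illustrated with a minimal successive-halving loop in plain Python (not Ray's API); the toy training curve and budgets are assumptions:

```python
def train_step(lr, step):
    # Toy "validation score": higher is better, peaks near lr = 0.1
    # and improves with more training steps.
    return 1.0 - abs(lr - 0.1) - 1.0 / (step + 1)

def successive_halving(lrs, rounds=3, steps_per_round=5):
    """After each round of training, keep only the best half of the
    trials and stop the rest early -- the budget goes to the leaders."""
    trials = [{"lr": lr, "step": 0, "score": float("-inf")} for lr in lrs]
    for _ in range(rounds):
        for t in trials:
            t["step"] += steps_per_round
            t["score"] = train_step(t["lr"], t["step"])
        trials.sort(key=lambda t: t["score"], reverse=True)
        trials = trials[: max(1, len(trials) // 2)]  # early-stop the rest
    return trials[0]

best = successive_halving([0.001, 0.01, 0.1, 0.5, 1.0, 2.0])
assert best["lr"] == 0.1
```

Ray Tune's schedulers (e.g. ASHA) apply the same idea asynchronously across a cluster instead of in a single loop.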
### Model Serving
Multiple serving options for different model types:
| Serving Option | Use Case | Technology |
|---|---|---|
| Real-time inference | Low-latency predictions | Ray Serve |
| Batch inference | Large-scale batch predictions | Ray or Spark |
| LLM inference | Large language model serving | vLLM |
| GPU-optimized | High-throughput GPU inference | Triton Inference Server |
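The high-throughput options above all rely on some form of batching: buffering requests so the model runs once per batch rather than once per request. A toy sketch of that logic — the model function and batch size are assumptions, not Triton or vLLM internals:

```python
def dynamic_batch(requests, max_batch_size=4):
    """Group pending requests into batches so the model executes one
    forward pass per batch instead of one per request."""
    return [requests[i : i + max_batch_size]
            for i in range(0, len(requests), max_batch_size)]

def model(batch):
    # Stand-in for one GPU forward pass over a whole batch.
    return [x * 2 for x in batch]

requests = list(range(10))
results = []
for batch in dynamic_batch(requests):
    results.extend(model(batch))

assert len(dynamic_batch(requests)) == 3  # batches of 4 + 4 + 2
assert results == [x * 2 for x in range(10)]
```

Real servers add a small wait window so a batch can fill up before launching, trading a few milliseconds of latency for much higher GPU utilization.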
### Feature Store
Feast manages feature engineering and serving:
| Feature | Description |
|---|---|
| Feature definitions | Declarative feature specifications |
| Online store | Low-latency feature serving for inference |
| Offline store | Historical features for training |
| Point-in-time joins | Correct feature values at training time |
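Point-in-time correctness means each training row sees only the latest feature value recorded at or before its event timestamp, never a later one, which prevents label leakage. A minimal sketch of the join Feast performs under the hood (the data and tuple layout are illustrative):

```python
def point_in_time_join(labels, feature_rows):
    """For each labeled event, pick the most recent feature value
    recorded at or before the event's timestamp (no future leakage)."""
    out = []
    for entity, event_ts, label in labels:
        candidates = [(ts, val) for e, ts, val in feature_rows
                      if e == entity and ts <= event_ts]
        value = max(candidates)[1] if candidates else None
        out.append((entity, event_ts, value, label))
    return out

# (entity_id, timestamp, feature_value) from the offline store
features = [("u1", 1, 0.2), ("u1", 5, 0.7), ("u1", 9, 0.9)]
# (entity_id, event_timestamp, label) -- the training events
labels = [("u1", 4, 0), ("u1", 6, 1), ("u1", 10, 1)]

rows = point_in_time_join(labels, features)
assert [r[2] for r in rows] == [0.2, 0.7, 0.9]
```

Note the event at timestamp 6 gets the value written at 5, not the "better" value written at 9 — exactly what a naive latest-value join would get wrong.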
## Backend Services
| Service | Port | Interaction |
|---|---|---|
| ml-service | 8000 | Model management, experiment orchestration |
| MLflow | -- | Experiment tracking, model registry |
| Ray | -- | Distributed training and serving |
| vLLM | -- | LLM inference serving |
| Feast | -- | Feature store management |
| ai-service | 8000 | Natural language model diagnostics |
## Related Chapters
- ML Platform Capabilities -- Full capability description
- ML Flow -- Model training and serving lifecycle
- Technology Stack: ML Infrastructure -- ML/AI technologies
- Compute Engines -- Ray, Spark, Flink