MATIH Platform is in active MVP development. Documentation reflects current implementation status.
ML Engineer
The ML Engineer trains, evaluates, deploys, and monitors machine learning models within the MATIH Platform. ML Engineers work primarily in the ML Workbench (port 3001) and interact with experiment tracking, model registry, distributed training, and feature store services.


Role Summary

| Attribute | Details |
| --- | --- |
| Primary workbench | ML Workbench (3001) |
| Key services | ML Service, Ray, MLflow, Feast, vLLM |
| Common tasks | Train models, run experiments, deploy to production, monitor drift |
| Technical depth | High -- Python, ML frameworks, distributed computing |

Day-in-the-Life Workflow

| Time | Activity | Platform Feature |
| --- | --- | --- |
| 9:00 AM | Review model performance metrics | MLflow dashboard, model monitoring |
| 9:30 AM | Check feature store freshness | Feast feature status, data quality |
| 10:00 AM | Launch distributed training job | Ray Train, GPU cluster allocation |
| 11:30 AM | Compare experiment results | MLflow experiment comparison |
| 1:00 PM | Register best model to registry | MLflow model registry, versioning |
| 2:00 PM | Deploy model to serving endpoint | Ray Serve or Triton deployment |
| 3:00 PM | Configure A/B test for new model | Model routing, traffic splitting |
| 4:00 PM | Monitor prediction drift | Drift detection alerts, retraining triggers |
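The A/B test step in the schedule above depends on deterministic traffic splitting: each user must stay pinned to one model variant across requests for the comparison to be valid. A minimal plain-Python sketch of hash-based routing (illustrative only, not the platform's actual routing code; `route_model` and the 10% share are invented for the example):

```python
import hashlib

def route_model(user_id: str, candidate_share: float = 0.1) -> str:
    """Deterministically route a user to 'candidate' or 'baseline'.

    Hashing the user ID (instead of random sampling per request) keeps
    each user on the same variant for the life of the experiment.
    """
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # uniform in [0, 1)
    return "candidate" if bucket < candidate_share else "baseline"

# The same user always lands in the same bucket:
assert route_model("user-42") == route_model("user-42")
```

Because the split is a pure function of the user ID, rolling the candidate share from 10% to 50% only grows the candidate bucket; no user silently flips from candidate back to baseline.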

Key Capabilities

Experiment Tracking

MLflow provides comprehensive experiment tracking:

| Feature | Description |
| --- | --- |
| Experiment logging | Track parameters, metrics, and artifacts per run |
| Run comparison | Side-by-side comparison of model runs |
| Artifact storage | Model files stored in MinIO (S3-compatible) |
| Versioning | Semantic versioning of registered models |
| Lineage | Track data inputs to model outputs |

Distributed Training

Ray provides distributed compute for ML workloads:

| Feature | Description |
| --- | --- |
| Ray Train | Distributed training across GPU nodes |
| Ray Tune | Hyperparameter optimization with early stopping |
| Ray Data | Distributed data preprocessing pipelines |
| Auto-scaling | Dynamic cluster scaling based on workload |
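Ray Tune's core loop (sample a config, train incrementally, abandon unpromising trials early) can be illustrated without a cluster. The sketch below is plain Python showing that pattern, not Ray's API; the toy `objective` function and pruning threshold are invented for illustration:

```python
import random

def objective(lr: float, step: int) -> float:
    """Toy stand-in for a validation score that improves with training."""
    return (1 - abs(lr - 0.1)) * (step / 10)

def random_search(num_trials: int = 20, max_steps: int = 10, seed: int = 0):
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(num_trials):
        lr = rng.uniform(0.001, 0.5)  # sample a hyperparameter
        score = 0.0
        for step in range(1, max_steps + 1):
            score = objective(lr, step)
            # Early stopping: drop trials that badly lag the best at
            # mid-run -- the pruning idea Ray Tune's schedulers
            # (e.g. ASHA) apply across distributed workers.
            if step == max_steps // 2 and score < best_score * 0.25:
                break
        if score > best_score:
            best_config, best_score = lr, score
    return best_config, best_score
```

What Ray adds over this single-process loop is running the trials in parallel across the cluster and checkpointing them, so pruning frees GPU capacity for new samples.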

Model Serving

Multiple serving options for different model types:

| Serving Option | Use Case | Technology |
| --- | --- | --- |
| Real-time inference | Low-latency predictions | Ray Serve |
| Batch inference | Large-scale batch predictions | Ray or Spark |
| LLM inference | Large language model serving | vLLM |
| GPU-optimized | High-throughput GPU inference | Triton Inference Server |
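High-throughput GPU serving (the Triton and vLLM rows above) rests on dynamic batching: grouping requests that arrive close together into one forward pass. A minimal stdlib sketch of just the batching decision, not either server's implementation; the batch size and timeout values are illustrative:

```python
from queue import Queue, Empty

def drain_batch(requests: Queue, max_batch: int = 8, timeout_s: float = 0.005):
    """Collect up to max_batch requests, waiting at most timeout_s for more.

    Trades a bounded latency cost (timeout_s) for throughput: the GPU
    runs one large forward pass instead of many small ones.
    """
    batch = [requests.get()]            # block until at least one request
    while len(batch) < max_batch:
        try:
            batch.append(requests.get(timeout=timeout_s))
        except Empty:
            break                       # deadline hit -- serve what we have
    return batch

q = Queue()
for i in range(20):
    q.put(f"req-{i}")
first = drain_batch(q)   # under load, a full batch of 8
```

The `timeout_s` knob is the latency/throughput dial: larger values fill batches more reliably but add tail latency to lightly loaded endpoints.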

Feature Store

Feast manages feature engineering and serving:

| Feature | Description |
| --- | --- |
| Feature definitions | Declarative feature specifications |
| Online store | Low-latency feature serving for inference |
| Offline store | Historical features for training |
| Point-in-time joins | Correct feature values at training time |
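Point-in-time correctness means each training row sees the feature value that was current at that row's event time, never a later one, which would leak future information into training. A plain-Python sketch of the lookup Feast's offline store performs during a point-in-time join (the balance history is toy data):

```python
def point_in_time_value(history, event_ts):
    """Return the latest feature value recorded at or before event_ts.

    history: list of (timestamp, value) pairs sorted by timestamp.
    """
    latest = None
    for ts, value in history:
        if ts <= event_ts:
            latest = value
        else:
            break  # later values would leak future information
    return latest

# Feature history for one entity: account balance over time.
balance_history = [(1, 100.0), (5, 250.0), (9, 80.0)]

# Training events at different times see different balances:
assert point_in_time_value(balance_history, 4) == 100.0
assert point_in_time_value(balance_history, 5) == 250.0
assert point_in_time_value(balance_history, 10) == 80.0
```

Feast applies this rule per entity and per feature across the whole training dataframe, which is why training on offline features matches what the online store would have served at inference time.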

Backend Services

| Service | Port | Interaction |
| --- | --- | --- |
| ml-service | 8000 | Model management, experiment orchestration |
| MLflow | -- | Experiment tracking, model registry |
| Ray | -- | Distributed training and serving |
| vLLM | -- | LLM inference serving |
| Feast | -- | Feature store management |
| ai-service | 8000 | Natural language model diagnostics |

Related Chapters