MATIH Platform is in active MVP development. Documentation reflects current implementation status.
ML Flow

The ML flow traces the lifecycle of a machine learning model from experiment definition through training, evaluation, registration, deployment, and monitoring. This flow involves the ML Service, Ray for distributed compute, MLflow for experiment tracking, and Feast for feature engineering.


ML Lifecycle Path

ML Workbench / API
  |
  v
ML Service (Port 8000)
  | 1. Validate experiment configuration
  | 2. Apply tenant context
  | 3. Prepare training job
  |
  v
Feature Engineering
  | 4. Feast: retrieve training features
  | 5. Point-in-time join for historical features
  |
  v
Distributed Training
  | 6. Ray Train: distribute across GPU/CPU nodes
  | 7. Log metrics to MLflow per epoch
  | 8. Save checkpoints to MinIO
  |
  v
Evaluation
  | 9. Run evaluation on holdout set
  | 10. Compare against baseline model
  |
  v
Registration
  | 11. Register model in MLflow Model Registry
  | 12. Stage: Staging --> Production
  |
  v
Deployment
  | 13. Deploy to Ray Serve / Triton
  | 14. Configure traffic routing
  |
  v
Monitoring
  | 15. Track prediction drift
  | 16. Monitor latency and throughput
  | 17. Alert on quality degradation
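
The point-in-time join in steps 4-5 is what keeps training features leakage-free: each training row only sees feature values that were known at its event timestamp, never future values. A minimal pure-Python sketch of the idea (entity names and feature values are invented for illustration; in the platform this is handled by Feast's offline store):

```python
from bisect import bisect_right
from datetime import datetime

def point_in_time_join(entity_rows, feature_history):
    """For each (entity_id, event_timestamp) training row, attach the most
    recent feature values observed at or before that timestamp -- never
    values from the future, which would leak label information."""
    # feature_history: entity_id -> list of (timestamp, features), sorted by timestamp
    joined = []
    for entity_id, event_ts in entity_rows:
        history = feature_history.get(entity_id, [])
        timestamps = [ts for ts, _ in history]
        idx = bisect_right(timestamps, event_ts) - 1  # latest row at or before event_ts
        features = history[idx][1] if idx >= 0 else None
        joined.append((entity_id, event_ts, features))
    return joined

history = {
    "user_1": [
        (datetime(2024, 1, 1), {"avg_spend": 10.0}),
        (datetime(2024, 2, 1), {"avg_spend": 25.0}),
    ]
}
rows = [("user_1", datetime(2024, 1, 15)), ("user_1", datetime(2024, 3, 1))]
joined_rows = point_in_time_join(rows, history)
```

The January 15 row receives the January 1 feature values even though a newer February row exists, which is exactly the "point-in-time correct" guarantee referenced in the training flow below.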

Experiment Tracking

MLflow tracks every training run:

| Tracked Item | Example |
| --- | --- |
| Parameters | learning_rate: 0.001, batch_size: 32, epochs: 50 |
| Metrics | accuracy: 0.94, f1_score: 0.91, loss: 0.23 |
| Artifacts | Model files, training logs, evaluation reports |
| Tags | experiment_name, model_type, data_version |
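
Concretely, a single run's record resembles the following plain-Python stand-in (values are invented; MLflow's actual client records these fields through its tracking API rather than a dict):

```python
# Illustrative only: the shape of what MLflow tracks for one training run.
run_a = {
    "params": {"learning_rate": 0.001, "batch_size": 32, "epochs": 50},
    "metrics": {"accuracy": 0.94, "f1_score": 0.91, "loss": 0.23},
    "artifacts": ["model/", "logs/train.log", "reports/eval.html"],
    "tags": {"experiment_name": "demo", "model_type": "classifier",
             "data_version": "v1"},
}

def best_run(runs, metric="f1_score"):
    """Compare tracked runs by a chosen metric (higher is better)."""
    return max(runs, key=lambda r: r["metrics"][metric])
```

Tracking every run in this shape is what makes cross-run comparison (e.g. picking the best run by f1_score before registration) a simple query rather than a manual bookkeeping exercise.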

Training Flow

| Step | Component | Description |
| --- | --- | --- |
| Feature retrieval | Feast | Point-in-time correct features from offline store |
| Data preprocessing | Ray Data | Distributed data transformation |
| Model training | Ray Train | Distributed training across cluster |
| Hyperparameter tuning | Ray Tune | Bayesian optimization with early stopping |
| Metric logging | MLflow | Real-time metric tracking per training step |
| Checkpoint saving | MinIO | Model checkpoints for fault tolerance |
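
Steps 7-8 of the lifecycle (per-epoch metric logging plus periodic checkpointing) can be sketched as a toy loop with stand-in callbacks; in the platform the `log_metric` role is played by MLflow and `save_checkpoint` persists to MinIO:

```python
def train(epochs, checkpoint_every, log_metric, save_checkpoint):
    """Toy training loop shaped like the flow above: log a metric every
    epoch and persist a checkpoint every `checkpoint_every` epochs so a
    failed job can resume instead of restarting from scratch."""
    loss = 1.0
    for epoch in range(1, epochs + 1):
        loss *= 0.8  # stand-in for a real optimization step
        log_metric("loss", loss, step=epoch)
        if epoch % checkpoint_every == 0:
            save_checkpoint(epoch, {"loss": loss})
    return loss

metrics, checkpoints = [], {}
final = train(
    epochs=5,
    checkpoint_every=2,
    log_metric=lambda name, value, step: metrics.append((name, step, value)),
    save_checkpoint=lambda epoch, state: checkpoints.update({epoch: state}),
)
```

Injecting the logging and checkpointing sinks as callables is also roughly how the real components stay swappable: the training loop does not need to know whether checkpoints land on local disk or in an object store.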

Model Registry

MLflow Model Registry manages model versions and stages:

| Stage | Description | Access |
| --- | --- | --- |
| None | Newly registered model | Internal use only |
| Staging | Undergoing testing and validation | Pre-production testing |
| Production | Serving live predictions | Production traffic |
| Archived | Retired from service | Historical reference |
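
A guard for these stage transitions might look like the sketch below. The allowed-transition map is an assumption of this sketch (MLflow's own API permits arbitrary stage changes, so any such policy lives in platform code around it):

```python
# Hypothetical stage-transition policy; stage names follow the table above.
ALLOWED = {
    "None": {"Staging", "Archived"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}

def transition(current, target):
    """Move a model version to a new stage, rejecting moves the policy
    forbids (e.g. promoting an archived model straight to Production)."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move {current} -> {target}")
    return target
```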

Deployment Options

| Option | Technology | Use Case | Latency |
| --- | --- | --- | --- |
| Real-time serving | Ray Serve | Low-latency predictions | < 50 ms |
| Batch inference | Ray / Spark | Large-scale batch predictions | Minutes |
| LLM serving | vLLM | Large language model inference | < 100 ms |
| GPU-optimized | Triton | High-throughput GPU inference | < 20 ms |
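
One way to reason about choosing among the general-purpose backends is to pick the least specialised real-time option whose latency bound still fits the budget, falling back to batch when nothing real-time qualifies and the workload can wait. The helper below is purely illustrative (thresholds mirror the table; this is not a platform API):

```python
# Illustrative only: map a p95 latency budget (ms) to a serving backend.
REALTIME_BACKENDS = [("Ray Serve", 50), ("Triton", 20)]  # (name, typical p95)

def pick_backend(latency_budget_ms):
    """Return the least specialised real-time backend that fits the budget;
    otherwise fall back to batch inference (Ray / Spark)."""
    fitting = [(n, p) for n, p in REALTIME_BACKENDS if p <= latency_budget_ms]
    if fitting:
        # Loosest option that still meets the budget, reserving Triton
        # for workloads that genuinely need GPU-level latency.
        return max(fitting, key=lambda b: b[1])[0]
    return "Ray / Spark batch"
```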

Model Monitoring

Post-deployment monitoring tracks model health:

| Metric | Detection Method | Alert Threshold |
| --- | --- | --- |
| Prediction drift | Statistical divergence test | Distribution shift > 2 standard deviations |
| Feature drift | Input feature distribution monitoring | Significant feature value change |
| Latency degradation | p95 inference latency tracking | Exceeds SLA target |
| Error rate | Prediction failure tracking | Error rate > 1% |
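
The 2-standard-deviation drift rule can be sketched as a mean-shift check (a production deployment would use a proper divergence test such as KS or PSI; the sample scores below are invented):

```python
from statistics import mean, stdev

def drift_detected(baseline, current, threshold_std=2.0):
    """Flag drift when the current mean prediction shifts more than
    `threshold_std` baseline standard deviations from the baseline mean,
    matching the table's 2-sigma alert threshold."""
    shift = abs(mean(current) - mean(baseline))
    return shift > threshold_std * stdev(baseline)

baseline = [0.4, 0.5, 0.6, 0.5, 0.4, 0.6]  # reference prediction scores
shifted = [0.9, 0.85, 0.95]                # distribution has moved
stable = [0.5, 0.52, 0.48]                 # distribution unchanged
```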

Event Publishing

ML lifecycle events are published to Kafka:

| Event | Published When | Consumers |
| --- | --- | --- |
| MODEL_TRAINING_STARTED | Training job begins | Audit, monitoring |
| MODEL_TRAINING_COMPLETED | Training finishes | Audit, notification |
| MODEL_REGISTERED | Model added to registry | Catalog, audit |
| MODEL_DEPLOYED | Model serving begins | Audit, notification |
| MODEL_DRIFT_DETECTED | Drift alert triggered | Notification, alerting |
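
Assembling one of these events before handing it to a Kafka producer might look like the following sketch; the envelope field names, model name, and tenant id are illustrative, not a documented MATIH schema:

```python
import json
from datetime import datetime, timezone

def build_event(event_type, model_name, tenant_id, details=None):
    """Assemble an ML lifecycle event envelope for publishing to Kafka."""
    return {
        "event_type": event_type,
        "model_name": model_name,
        "tenant_id": tenant_id,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "details": details or {},
    }

event = build_event("MODEL_DEPLOYED", "demand-forecast", "tenant-42",
                    {"serving_backend": "Ray Serve"})
payload = json.dumps(event).encode("utf-8")  # bytes a Kafka producer would send
```

Carrying the tenant id on every event keeps downstream consumers (audit, notification) tenant-aware without having to look the context up again.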
