MATIH Platform is in active MVP development. Documentation reflects current implementation status.
13. ML Service & MLOps
Training Overview

The MATIH training subsystem provides distributed model training using Ray Train, supporting PyTorch, TensorFlow, XGBoost, LightGBM, and scikit-learn models with automatic data sharding, checkpoint management, and MLflow integration.
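The automatic data sharding mentioned above can be illustrated with a minimal sketch. The helper below is hypothetical, a stand-in for the kind of even-split-with-remainder partitioning a distributed trainer performs when dividing a dataset across workers; it is not the MATIH implementation.

```python
def shard_dataset(items, world_size, rank):
    """Return the contiguous slice of `items` assigned to worker `rank`.

    Uses an even-split-with-remainder strategy: the first
    `len(items) % world_size` workers each receive one extra item.
    """
    n, remainder = divmod(len(items), world_size)
    start = rank * n + min(rank, remainder)
    end = start + n + (1 if rank < remainder else 0)
    return items[start:end]

# Ten samples across three workers: shards of size 4, 3, 3.
data = list(range(10))
shards = [shard_dataset(data, world_size=3, rank=r) for r in range(3)]
```

Every item lands in exactly one shard, so no worker trains on duplicated or missing data.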


Key Components

| Component | Role | Purpose |
| --- | --- | --- |
| DistributedTrainer | Ray Train integration | Multi-framework distributed training |
| HyperparameterTuner | Ray Tune integration | Bayesian, grid, and random search |
| CheckpointManager | Checkpoint lifecycle | Save, resume, cleanup |
| JobManager | Job orchestration | Scheduling, monitoring, resource allocation |
| CostCalculator | Cost tracking | GPU utilization, training budgets |
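As a rough illustration of the save/resume/cleanup lifecycle that CheckpointManager handles, the class below is a simplified stand-in, not the MATIH implementation:

```python
import json
import os
import tempfile

class SimpleCheckpointManager:
    """Minimal sketch of a save/resume/cleanup checkpoint lifecycle."""

    def __init__(self, root, keep_last=3):
        self.root = root
        self.keep_last = keep_last

    def _path(self, step):
        # Zero-padded step number so lexicographic sort matches step order.
        return os.path.join(self.root, f"checkpoint_{step:08d}.json")

    def save(self, step, state):
        with open(self._path(step), "w") as f:
            json.dump({"step": step, "state": state}, f)
        self._cleanup()

    def latest(self):
        """Resume point: the highest-numbered checkpoint, or None."""
        files = sorted(os.listdir(self.root))
        if not files:
            return None
        with open(os.path.join(self.root, files[-1])) as f:
            return json.load(f)

    def _cleanup(self):
        """Retain only the most recent `keep_last` checkpoints."""
        files = sorted(os.listdir(self.root))
        for name in files[:-self.keep_last]:
            os.remove(os.path.join(self.root, name))

root = tempfile.mkdtemp()
mgr = SimpleCheckpointManager(root, keep_last=2)
for step in range(5):
    mgr.save(step, {"loss": 1.0 / (step + 1)})
```

After five saves with `keep_last=2`, only the two newest checkpoints remain on disk, and `latest()` returns the resume point.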

Supported Frameworks

| Framework | Ray Trainer | Use Case |
| --- | --- | --- |
| PyTorch | TorchTrainer | Deep learning, NLP, computer vision |
| TensorFlow/Keras | TensorflowTrainer | Deep learning, production models |
| XGBoost | XGBoostTrainer | Gradient boosting for tabular data |
| LightGBM | LightGBMTrainer | Fast gradient boosting |
| scikit-learn | Custom trainer | Classical ML, small-to-medium datasets |
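The framework-to-trainer mapping above can be expressed as a simple registry. The dictionary below mirrors the table, while the lookup helper itself (and the "CustomTrainer" placeholder name for the scikit-learn path) is a hypothetical sketch rather than MATIH's actual dispatch code:

```python
# Maps a user-facing framework name to its Ray trainer class name,
# mirroring the Supported Frameworks table.
RAY_TRAINER_REGISTRY = {
    "pytorch": "TorchTrainer",
    "tensorflow": "TensorflowTrainer",
    "keras": "TensorflowTrainer",
    "xgboost": "XGBoostTrainer",
    "lightgbm": "LightGBMTrainer",
    "sklearn": "CustomTrainer",  # scikit-learn uses a custom trainer
}

def resolve_trainer(framework: str) -> str:
    """Return the Ray trainer name for a framework, case-insensitively."""
    try:
        return RAY_TRAINER_REGISTRY[framework.lower()]
    except KeyError:
        supported = ", ".join(sorted(RAY_TRAINER_REGISTRY))
        raise ValueError(
            f"Unsupported framework {framework!r}; expected one of: {supported}"
        ) from None
```

A registry like this keeps framework dispatch in one place, so adding a framework means one new entry rather than another branch in the training code.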

Section Contents

| Page | Description |
| --- | --- |
| Distributed Training | Ray distributed training strategies |
| Hyperparameter Tuning | Search strategies and optimization |
| Checkpoint Management | Checkpoint lifecycle |
| Job Management | Training job orchestration |
| Cost Management | GPU utilization and budgets |
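As a sketch of the kind of accounting covered under Cost Management, the class, rate, and budget check below are illustrative assumptions, not MATIH's actual pricing logic:

```python
from dataclasses import dataclass

@dataclass
class GpuCostEstimate:
    """Illustrative GPU training cost estimate checked against a budget."""
    gpu_count: int
    hours: float
    hourly_rate_usd: float  # assumed per-GPU-hour rate

    @property
    def total_usd(self) -> float:
        # Total cost scales with GPUs reserved and wall-clock hours used.
        return self.gpu_count * self.hours * self.hourly_rate_usd

    def within_budget(self, budget_usd: float) -> bool:
        return self.total_usd <= budget_usd

# Example: 4 GPUs for 10 hours at an assumed $2.50/GPU-hour.
estimate = GpuCostEstimate(gpu_count=4, hours=10.0, hourly_rate_usd=2.50)
```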

Source Files

| File | Path |
| --- | --- |
| DistributedTrainer | data-plane/ml-service/src/training/distributed_trainer.py |
| HyperparameterTuner | data-plane/ml-service/src/training/hyperparameter_tuner.py |
| JobManager | data-plane/ml-service/src/training/job_manager.py |
| Training API | data-plane/ml-service/src/api/training.py |