# Training Overview
The MATIH training subsystem provides distributed model training on Ray Train. It supports PyTorch, TensorFlow, XGBoost, LightGBM, and scikit-learn models, with automatic data sharding, checkpoint management, and MLflow integration.
## Key Components
| Component | Role | Purpose |
|---|---|---|
| DistributedTrainer | Ray Train integration | Multi-framework distributed training |
| HyperparameterTuner | Ray Tune integration | Bayesian, grid, and random search |
| CheckpointManager | Checkpoint lifecycle | Save, resume, cleanup |
| JobManager | Job orchestration | Scheduling, monitoring, resource allocation |
| CostCalculator | Cost tracking | GPU utilization, training budgets |
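To illustrate the CostCalculator's responsibility, here is a minimal sketch of per-job GPU-hour accounting against a training budget. The class and field names are assumptions for illustration, not the actual CostCalculator API:

```python
from dataclasses import dataclass

@dataclass
class GpuCostTracker:
    """Hypothetical sketch of GPU cost tracking for one training job."""
    hourly_rate_usd: float   # price of one GPU-hour
    budget_usd: float        # training budget for the job
    gpu_hours: float = 0.0   # accumulated GPU time

    def record(self, num_gpus: int, hours: float) -> None:
        """Accumulate GPU-hours from one training interval."""
        self.gpu_hours += num_gpus * hours

    @property
    def spend_usd(self) -> float:
        return self.gpu_hours * self.hourly_rate_usd

    def over_budget(self) -> bool:
        return self.spend_usd > self.budget_usd

tracker = GpuCostTracker(hourly_rate_usd=2.50, budget_usd=100.0)
tracker.record(num_gpus=4, hours=8.0)   # 32 GPU-hours
print(tracker.spend_usd, tracker.over_budget())  # 80.0 False
```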
## Supported Frameworks
| Framework | Ray Trainer | Use Case |
|---|---|---|
| PyTorch | TorchTrainer | Deep learning, NLP, computer vision |
| TensorFlow/Keras | TensorflowTrainer | Deep learning, production models |
| XGBoost | XGBoostTrainer | Gradient boosting for tabular data |
| LightGBM | LightGBMTrainer | Fast gradient boosting |
| scikit-learn | Custom trainer | Classical ML, small-medium datasets |
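The framework-to-trainer mapping above can be expressed as a simple lookup. This is an illustrative sketch mirroring the table, not necessarily how DistributedTrainer selects trainers internally; the dotted paths are the standard Ray Train import locations for each trainer class:

```python
# Maps a framework name to the Ray Train trainer that handles it.
# scikit-learn has no Ray Trainer and falls back to a custom trainer.
RAY_TRAINER_FOR_FRAMEWORK = {
    "pytorch": "ray.train.torch.TorchTrainer",
    "tensorflow": "ray.train.tensorflow.TensorflowTrainer",
    "xgboost": "ray.train.xgboost.XGBoostTrainer",
    "lightgbm": "ray.train.lightgbm.LightGBMTrainer",
    "sklearn": None,
}

def trainer_path(framework: str) -> "str | None":
    """Return the Ray trainer import path, or None for the custom trainer."""
    try:
        return RAY_TRAINER_FOR_FRAMEWORK[framework.lower()]
    except KeyError:
        raise ValueError(f"unsupported framework: {framework}") from None

print(trainer_path("XGBoost"))  # ray.train.xgboost.XGBoostTrainer
```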
## Section Contents
| Page | Description |
|---|---|
| Distributed Training | Ray distributed training strategies |
| Hyperparameter Tuning | Search strategies and optimization |
| Checkpoint Management | Checkpoint lifecycle |
| Job Management | Training job orchestration |
| Cost Management | GPU utilization and budgets |
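The Checkpoint Management page above covers the save/resume/cleanup lifecycle. As a minimal sketch of the cleanup step, here is a retention policy that keeps only the newest checkpoints; the `epoch-NNN` directory layout and `keep_last` policy are assumptions, not CheckpointManager's actual behavior:

```python
import shutil
import tempfile
from pathlib import Path

def cleanup_checkpoints(root: Path, keep_last: int = 3) -> "list[str]":
    """Delete all but the newest `keep_last` checkpoint dirs named epoch-NNN."""
    ckpts = sorted(root.glob("epoch-*"), key=lambda p: int(p.name.split("-")[1]))
    removed = []
    for stale in ckpts[:-keep_last] if keep_last else ckpts:
        shutil.rmtree(stale)
        removed.append(stale.name)
    return removed

# Usage: after saving epoch-000 .. epoch-004, keep only the last 3.
root = Path(tempfile.mkdtemp())
for epoch in range(5):
    (root / f"epoch-{epoch:03d}").mkdir()
print(cleanup_checkpoints(root, keep_last=3))  # ['epoch-000', 'epoch-001']
```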
## Source Files
| File | Path |
|---|---|
| DistributedTrainer | data-plane/ml-service/src/training/distributed_trainer.py |
| HyperparameterTuner | data-plane/ml-service/src/training/hyperparameter_tuner.py |
| JobManager | data-plane/ml-service/src/training/job_manager.py |
| Training API | data-plane/ml-service/src/api/training.py |