# Checkpoint Management

Checkpoint management provides fault-tolerant training: model checkpoints can be saved, resumed from, and cleaned up throughout the training lifecycle.
## Checkpoint Configuration

```python
TrainingConfig(
    checkpoint_frequency=1,  # Save a checkpoint every N epochs
    keep_checkpoints=3,      # Retain only the latest N checkpoints
)
```

## Checkpoint Creation
During training, checkpoints are created at the configured frequency using `ray.train.Checkpoint`:

```python
import os
import tempfile

import joblib
from ray.train import Checkpoint


def _create_checkpoint(self, model, epoch):
    # Write the model into a scratch directory, then wrap it as a Ray checkpoint.
    # tempfile.mkdtemp() is used rather than a TemporaryDirectory context manager
    # so the directory still exists when the returned Checkpoint is consumed
    # (e.g. by ray.train.report, which persists the checkpoint data).
    tmpdir = tempfile.mkdtemp()
    model_path = os.path.join(tmpdir, "model.pkl")
    joblib.dump(model, model_path)  # sklearn models
    # or: torch.save(model.state_dict(), model_path)  # PyTorch models
    return Checkpoint.from_directory(tmpdir)
```

## Checkpoint Storage
Checkpoints are stored at:
```
{storage_path}/{tenant_id}/{run_id}/checkpoints/
```

The `storage_path` defaults to `MATIH_STORAGE_PATH` or `/tmp/matih/training`.
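Resolving this template in code might look like the following sketch. It assumes `MATIH_STORAGE_PATH` is an environment variable, and the helper name is hypothetical:

```python
import os


def checkpoint_dir(tenant_id: str, run_id: str) -> str:
    """Resolve {storage_path}/{tenant_id}/{run_id}/checkpoints/ (hypothetical helper)."""
    # storage_path defaults to MATIH_STORAGE_PATH, falling back to /tmp/matih/training
    storage_path = os.environ.get("MATIH_STORAGE_PATH", "/tmp/matih/training")
    return os.path.join(storage_path, tenant_id, run_id, "checkpoints")


# e.g. checkpoint_dir("acme", "run-42")
# -> "/tmp/matih/training/acme/run-42/checkpoints" when MATIH_STORAGE_PATH is unset
```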
## Source Files

| File | Path |
|---|---|
| CheckpointManager | `data-plane/ml-service/src/training/checkpoint_manager.py` |
| Enhanced Checkpoint Service | `data-plane/ml-service/src/training/enhanced_checkpoint_service.py` |