# Checkpoint Management

Checkpoint management provides fault-tolerant training: model checkpoints can be saved, resumed from, and cleaned up throughout the training lifecycle.
## Checkpoint Configuration

```python
TrainingConfig(
    checkpoint_frequency=1,  # Save a checkpoint every N epochs
    keep_checkpoints=3,      # Retain only the latest N checkpoints
)
```

## Checkpoint Creation
During training, checkpoints are created at the configured frequency using `ray.train.Checkpoint`:

```python
import os
import tempfile

import joblib
from ray.train import Checkpoint


def _create_checkpoint(self, model, epoch):
    # Write the model into a scratch directory, then wrap it as a Ray checkpoint.
    # tempfile.mkdtemp() is used rather than a TemporaryDirectory context manager
    # so the directory still exists when the returned Checkpoint is consumed
    # (e.g. by ray.train.report, which persists the checkpoint data).
    tmpdir = tempfile.mkdtemp()
    model_path = os.path.join(tmpdir, "model.pkl")
    joblib.dump(model, model_path)  # sklearn models
    # or: torch.save(model.state_dict(), model_path)  # PyTorch models
    return Checkpoint.from_directory(tmpdir)
```

## Checkpoint Storage
Checkpoints are stored at:
```
{storage_path}/{tenant_id}/{run_id}/checkpoints/
```

The `storage_path` defaults to `MATIH_STORAGE_PATH` or `/tmp/matih/training`.
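Resolving this template in code might look like the following sketch. It assumes `MATIH_STORAGE_PATH` is an environment variable, and the helper name is hypothetical:

```python
import os


def checkpoint_dir(tenant_id: str, run_id: str) -> str:
    """Resolve {storage_path}/{tenant_id}/{run_id}/checkpoints/ (hypothetical helper)."""
    # storage_path defaults to MATIH_STORAGE_PATH, falling back to /tmp/matih/training
    storage_path = os.environ.get("MATIH_STORAGE_PATH", "/tmp/matih/training")
    return os.path.join(storage_path, tenant_id, run_id, "checkpoints")


# e.g. checkpoint_dir("acme", "run-42")
# -> "/tmp/matih/training/acme/run-42/checkpoints" when MATIH_STORAGE_PATH is unset
```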
## Source Files

| File | Path |
|---|---|
| CheckpointManager | `data-plane/ml-service/src/training/checkpoint_manager.py` |
| Enhanced Checkpoint Service | `data-plane/ml-service/src/training/enhanced_checkpoint_service.py` |