MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Job Management

The training job manager handles the full lifecycle of a training job, including scheduling, resource allocation, monitoring, and failure recovery.


Job Lifecycle

State       Description
pending     Job created, awaiting resources
queued      Job queued for execution
preparing   Loading data and initializing model
running     Actively training
completed   Training finished successfully
failed      Training failed with error
cancelled   Job was manually cancelled
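
The states above can be modeled as a small state machine. The following is a minimal sketch, not the code in job_manager.py: the state names come from the table, but the transition map, `JobState`, `ALLOWED_TRANSITIONS`, and `can_transition` are assumptions made for illustration. In particular, it treats `failed` as terminal, although failure recovery in the real service may re-queue a failed job.

```python
from enum import Enum


class JobState(str, Enum):
    """Lifecycle states, named as in the table above."""
    PENDING = "pending"
    QUEUED = "queued"
    PREPARING = "preparing"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# ASSUMPTION: a linear happy path, with failure or cancellation possible
# from any non-terminal state. The real JobManager may allow other moves.
ALLOWED_TRANSITIONS = {
    JobState.PENDING: {JobState.QUEUED, JobState.FAILED, JobState.CANCELLED},
    JobState.QUEUED: {JobState.PREPARING, JobState.FAILED, JobState.CANCELLED},
    JobState.PREPARING: {JobState.RUNNING, JobState.FAILED, JobState.CANCELLED},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED, JobState.CANCELLED},
    JobState.COMPLETED: set(),
    JobState.FAILED: set(),
    JobState.CANCELLED: set(),
}

# States with no outgoing transitions in this sketch.
TERMINAL_STATES = {s for s, nxt in ALLOWED_TRANSITIONS.items() if not nxt}


def can_transition(current: JobState, target: JobState) -> bool:
    """Return True if moving from `current` to `target` is allowed here."""
    return target in ALLOWED_TRANSITIONS[current]
```

Because `JobState` subclasses `str`, a state string read from an API response can be converted directly, for example `JobState("running")`.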

Job Monitoring

GET /api/v1/training/jobs/{job_id}

The response includes real-time metrics, resource utilization, and the estimated completion time.
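
A client could poll this endpoint until the job reaches a final state. The sketch below uses only the Python standard library. Only the endpoint path comes from this page; the base URL, the `state` field name in the response, and the helpers `job_url`, `http_get_json`, and `wait_for_job` are hypothetical, and authentication is omitted.

```python
import json
import time
from typing import Callable
from urllib.request import urlopen

# States after which a job no longer changes (per the lifecycle table).
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def job_url(base_url: str, job_id: str) -> str:
    """Build the job status URL from the documented endpoint path."""
    return f"{base_url.rstrip('/')}/api/v1/training/jobs/{job_id}"


def http_get_json(url: str) -> dict:
    """Issue a GET request and decode the JSON body."""
    with urlopen(url, timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_job(
    base_url: str,
    job_id: str,
    fetch: Callable[[str], dict] = http_get_json,
    interval: float = 5.0,
    max_polls: int = 120,
) -> dict:
    """Poll the job until it reaches a terminal state, then return it.

    ASSUMPTION: the response is a JSON object with a "state" field
    holding one of the lifecycle state names.
    """
    url = job_url(base_url, job_id)
    for _ in range(max_polls):
        job = fetch(url)
        if job.get("state") in TERMINAL_STATES:
            return job
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")
```

The `fetch` parameter is injectable so the polling loop can be exercised without a running service, for example by passing a function that returns canned responses.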


Source Files

File                     Path
JobManager               data-plane/ml-service/src/training/job_manager.py
Job Scheduler            data-plane/ml-service/src/training/job_scheduler.py
Job Monitoring Service   data-plane/ml-service/src/training/job_monitoring_service.py