Job Management
The training job manager handles the full lifecycle of training jobs: scheduling, resource allocation, monitoring, and failure recovery.
Job Lifecycle
| State | Description |
|---|---|
| pending | Job created, awaiting resources |
| queued | Job queued for execution |
| preparing | Loading data and initializing model |
| running | Actively training |
| completed | Training finished successfully |
| failed | Training failed with an error |
| cancelled | Job was manually cancelled |
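
The states in the table can be modeled as a small state machine. The sketch below is illustrative only: the state names come from the table, but `JobState`, `ALLOWED_TRANSITIONS`, `can_transition`, and `is_terminal` are hypothetical names, and the transition rules are an assumption about a typical lifecycle, not a description of what `job_manager.py` actually enforces.

```python
from enum import Enum


class JobState(Enum):
    """Job states as listed in the lifecycle table."""
    PENDING = "pending"
    QUEUED = "queued"
    PREPARING = "preparing"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


# States from which a job does not move again.
TERMINAL_STATES = frozenset(
    {JobState.COMPLETED, JobState.FAILED, JobState.CANCELLED}
)

# ASSUMPTION: a plausible forward-only transition map; the real
# job manager may allow other moves (for example, retries).
ALLOWED_TRANSITIONS = {
    JobState.PENDING: {JobState.QUEUED, JobState.CANCELLED},
    JobState.QUEUED: {JobState.PREPARING, JobState.CANCELLED},
    JobState.PREPARING: {JobState.RUNNING, JobState.FAILED, JobState.CANCELLED},
    JobState.RUNNING: {JobState.COMPLETED, JobState.FAILED, JobState.CANCELLED},
}


def can_transition(current: JobState, target: JobState) -> bool:
    """Return True if moving from current to target is allowed by the map."""
    return target in ALLOWED_TRANSITIONS.get(current, set())


def is_terminal(state: JobState) -> bool:
    """Return True for completed, failed, or cancelled jobs."""
    return state in TERMINAL_STATES
```

Keeping the transitions in a single dictionary makes it easy to adjust the rules if the actual lifecycle differs, for instance if failed jobs can be re-queued during failure recovery.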
Job Monitoring
GET /api/v1/training/jobs/{job_id}

The response includes real-time metrics, resource utilization, and estimated completion time.
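
A client might poll this endpoint until the job reaches a terminal state. The following is a minimal sketch using only the Python standard library. The endpoint path comes from this document; the base URL, the JSON response format, the `state` field name, and the helper names `job_url`, `fetch_job`, and `wait_for_job` are assumptions made for illustration.

```python
import json
import time
import urllib.parse
import urllib.request

# Terminal states taken from the Job Lifecycle table.
TERMINAL_STATES = {"completed", "failed", "cancelled"}


def job_url(base_url: str, job_id: str) -> str:
    """Build the monitoring URL for a job (base_url is deployment-specific)."""
    safe_id = urllib.parse.quote(job_id, safe="")
    return f"{base_url.rstrip('/')}/api/v1/training/jobs/{safe_id}"


def fetch_job(base_url: str, job_id: str, timeout: float = 10.0) -> dict:
    """GET the job resource. ASSUMPTION: the body is a JSON object."""
    with urllib.request.urlopen(job_url(base_url, job_id), timeout=timeout) as resp:
        return json.load(resp)


def wait_for_job(fetch, job_id, poll_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Poll fetch(job_id) until the job is terminal or max_polls is reached.

    ASSUMPTION: the response carries the lifecycle state in a "state" field.
    """
    for attempt in range(max_polls):
        job = fetch(job_id)
        if job.get("state") in TERMINAL_STATES:
            return job
        if attempt < max_polls - 1:
            sleep(poll_seconds)
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")
```

A caller could bind the base URL and wait, for example `wait_for_job(lambda jid: fetch_job("http://localhost:8000", jid), "my-job-id")`, where the host and job ID are placeholders. Passing the fetch function in as a parameter keeps the polling loop testable without a running service.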
Source Files
| Component | Path |
|---|---|
| JobManager | data-plane/ml-service/src/training/job_manager.py |
| Job Scheduler | data-plane/ml-service/src/training/job_scheduler.py |
| Job Monitoring Service | data-plane/ml-service/src/training/job_monitoring_service.py |