Cost Management
Training cost management tracks GPU utilization, compute hours, and training budgets to help teams optimize their ML spend.
Cost Tracking
The CostCalculator tracks per-job costs based on resource consumption:
- GPU hours consumed per training job
- CPU core-hours for distributed workers
- Storage costs for checkpoints and artifacts
It also handles per-tenant budget enforcement.
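The per-job accounting described above can be illustrated with a minimal sketch. This is not the actual CostCalculator code: the `Rates`, `JobUsage`, and `job_cost` names and every price shown are hypothetical, chosen only to show how GPU hours, CPU core-hours, and storage might combine into one job total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    # Hypothetical unit prices; the real rates are not given in this document.
    gpu_hour: float = 2.50          # cost per GPU hour
    cpu_core_hour: float = 0.05     # cost per CPU core-hour
    storage_gb_month: float = 0.02  # cost per GB-month of checkpoints/artifacts


@dataclass(frozen=True)
class JobUsage:
    # Resource consumption recorded for one training job (assumed fields).
    gpu_hours: float
    cpu_core_hours: float
    storage_gb_months: float


def job_cost(usage: JobUsage, rates: Rates = Rates()) -> dict[str, float]:
    """Return a per-resource cost breakdown plus the job total."""
    gpu = usage.gpu_hours * rates.gpu_hour
    cpu = usage.cpu_core_hours * rates.cpu_core_hour
    storage = usage.storage_gb_months * rates.storage_gb_month
    return {"gpu": gpu, "cpu": cpu, "storage": storage, "total": gpu + cpu + storage}


# Example: 8 GPU hours, 64 CPU core-hours, 50 GB-months of checkpoints.
usage = JobUsage(gpu_hours=8.0, cpu_core_hours=64.0, storage_gb_months=50.0)
breakdown = job_cost(usage)
print(breakdown)
```

Keeping the breakdown per resource, rather than returning only a total, makes it easier to see which resource dominates a job's cost.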
Budget Controls
| Setting | Description |
|---|---|
| Monthly GPU budget | Maximum GPU hours per tenant per month |
| Per-job limit | Maximum duration for a single training job |
| Resource quotas | Maximum concurrent workers per tenant |
| Cost alerts | Notifications at budget thresholds |
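As a rough illustration of how the four settings in the table might be checked together, the sketch below validates a job request against a tenant's limits and reports which alert thresholds a new usage figure crosses. All names (`TenantBudget`, `admission_errors`, `crossed_alerts`) and the 50/80/100 percent thresholds are assumptions made for this example, not the service's actual interface or defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantBudget:
    monthly_gpu_hours: float     # Monthly GPU budget
    max_job_hours: float         # Per-job limit (maximum job duration)
    max_concurrent_workers: int  # Resource quota
    # Assumed alert points, as fractions of the monthly GPU budget.
    alert_thresholds: tuple[float, ...] = (0.5, 0.8, 1.0)


def admission_errors(budget: TenantBudget, used_gpu_hours: float, active_workers: int,
                     req_gpu_hours: float, req_job_hours: float,
                     req_workers: int) -> list[str]:
    """Return the reasons a job request would exceed the tenant's limits."""
    errors = []
    if used_gpu_hours + req_gpu_hours > budget.monthly_gpu_hours:
        errors.append("monthly GPU budget exceeded")
    if req_job_hours > budget.max_job_hours:
        errors.append("per-job duration limit exceeded")
    if active_workers + req_workers > budget.max_concurrent_workers:
        errors.append("concurrent worker quota exceeded")
    return errors


def crossed_alerts(budget: TenantBudget, used_before: float,
                   used_after: float) -> list[float]:
    """Return thresholds crossed when usage moves from used_before to used_after."""
    return [t for t in budget.alert_thresholds
            if used_before < t * budget.monthly_gpu_hours <= used_after]


budget = TenantBudget(monthly_gpu_hours=1000, max_job_hours=72, max_concurrent_workers=16)

# A request that fits within every limit.
ok = admission_errors(budget, used_gpu_hours=780, active_workers=12,
                      req_gpu_hours=40, req_job_hours=10, req_workers=4)

# A request that violates all three limits.
rejected = admission_errors(budget, used_gpu_hours=780, active_workers=12,
                            req_gpu_hours=300, req_job_hours=100, req_workers=8)

# Moving from 780 to 820 GPU hours crosses the assumed 80 percent threshold.
alerts = crossed_alerts(budget, used_before=780, used_after=820)
print(ok, rejected, alerts)
```

Returning a list of reasons instead of a single boolean lets a caller report every violated limit at once.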
Source Files
| Component | Path |
|---|---|
| Cost Calculator | data-plane/ml-service/src/training/cost_calculator.py |
| Cost Management Service | data-plane/ml-service/src/cost/cost_management_service.py |
| GPU Manager | data-plane/ml-service/src/training/gpu_manager.py |
| GPU Tracker | data-plane/ml-service/src/training/gpu_tracker.py |