MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Cost Management

Training cost management tracks GPU utilization, compute hours, and training budgets to help teams optimize their ML spend.


Cost Tracking

The CostCalculator tracks per-job costs based on resource consumption:

  • GPU hours consumed per training job
  • CPU core-hours for distributed workers
  • Storage costs for checkpoints and artifacts
  • Per-tenant budget enforcement
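As a rough illustration of how the first three items might combine into one per-job figure, here is a minimal sketch. It is not the actual CostCalculator code: the `Rates`, `JobUsage`, and `job_cost` names and all unit prices are hypothetical, chosen only to show the arithmetic.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Rates:
    """Hypothetical unit prices; not taken from the MATIH source."""
    gpu_hour: float = 2.50          # price per GPU hour
    cpu_core_hour: float = 0.04     # price per CPU core-hour
    storage_gb_month: float = 0.02  # price per GB-month of stored data


@dataclass(frozen=True)
class JobUsage:
    """Resources one training job consumed (illustrative shape)."""
    gpu_hours: float
    cpu_core_hours: float
    storage_gb_months: float


def job_cost(usage: JobUsage, rates: Rates = Rates()) -> Dict[str, float]:
    """Return a cost breakdown per resource type plus the total."""
    gpu = usage.gpu_hours * rates.gpu_hour
    cpu = usage.cpu_core_hours * rates.cpu_core_hour
    storage = usage.storage_gb_months * rates.storage_gb_month
    return {"gpu": gpu, "cpu": cpu, "storage": storage,
            "total": gpu + cpu + storage}


# Example: 8 GPU hours, 64 CPU core-hours, 50 GB-months of checkpoints.
example = job_cost(JobUsage(gpu_hours=8, cpu_core_hours=64,
                            storage_gb_months=50))
```

Keeping the breakdown per resource type, rather than returning only a total, makes it easier to see which resource dominates a job's cost.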

Budget Controls

Setting              Description
Monthly GPU budget   Maximum GPU hours per tenant per month
Per-job limit        Maximum duration for a single training job
Resource quotas      Maximum concurrent workers per tenant
Cost alerts          Notifications at budget thresholds
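
The four settings above could be checked before a job is admitted, roughly as sketched below. This is an assumption-laden example, not the platform's implementation: `TenantBudget`, `check_job`, `crossed_alerts`, and the default alert thresholds are all hypothetical names and values.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TenantBudget:
    """Hypothetical container for the four budget settings."""
    monthly_gpu_hours: float       # Monthly GPU budget
    max_job_hours: float           # Per-job limit (duration)
    max_concurrent_workers: int    # Resource quota
    alert_thresholds: Tuple[float, ...] = (0.5, 0.8, 1.0)  # Cost alerts


def check_job(budget: TenantBudget, used_gpu_hours: float,
              active_workers: int, requested_gpu_hours: float,
              requested_job_hours: float,
              requested_workers: int) -> List[str]:
    """Return the list of violated settings; empty means the job fits."""
    violations = []
    if used_gpu_hours + requested_gpu_hours > budget.monthly_gpu_hours:
        violations.append("monthly_gpu_budget")
    if requested_job_hours > budget.max_job_hours:
        violations.append("per_job_limit")
    if active_workers + requested_workers > budget.max_concurrent_workers:
        violations.append("resource_quota")
    return violations


def crossed_alerts(budget: TenantBudget, used_before: float,
                   used_after: float) -> List[float]:
    """Return thresholds crossed when usage moves from before to after."""
    limit = budget.monthly_gpu_hours
    return [t for t in budget.alert_thresholds
            if used_before / limit < t <= used_after / limit]


# Example: a tenant with 100 GPU hours per month, 24-hour jobs, 8 workers.
budget = TenantBudget(monthly_gpu_hours=100, max_job_hours=24,
                      max_concurrent_workers=8)
problems = check_job(budget, used_gpu_hours=90, active_workers=6,
                     requested_gpu_hours=20, requested_job_hours=30,
                     requested_workers=4)
```

Returning every violated setting, instead of stopping at the first, lets a caller report all the reasons a job was rejected in one response.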

Source Files

File                      Path
Cost Calculator           data-plane/ml-service/src/training/cost_calculator.py
Cost Management Service   data-plane/ml-service/src/cost/cost_management_service.py
GPU Manager               data-plane/ml-service/src/training/gpu_manager.py
GPU Tracker               data-plane/ml-service/src/training/gpu_tracker.py