Model Training
The Model Training integration lets users submit, monitor, and evaluate training jobs through the AI Service. The AI Service forwards each training request to the ML Service, which runs it on Ray AIR with distributed compute. The AI Service itself provides the conversational interface, progress tracking, and result visualization.
Training Workflow
- Configuration: User specifies algorithm, dataset, target, and hyperparameters
- Validation: AI Service validates the configuration against the schema context
- Submission: Training job is submitted to the ML Service
- Monitoring: Progress is tracked via polling or WebSocket streaming
- Evaluation: Results are returned with metrics, feature importance, and model artifacts
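The Monitoring step can be sketched as a simple polling loop. This is a minimal illustration, not the service's own client: `wait_for_job` and its parameters are hypothetical names, the status fetcher is injected so any HTTP library can be used, and only the `running` and `completed` statuses appear in this document (the other terminal states are assumptions). The default timeout mirrors the 4-hour default training limit.

```python
import time
from typing import Callable, Dict

# Assumption: "failed" and "cancelled" are hypothetical terminal states;
# this document only shows "running" and "completed".
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(
    fetch_status: Callable[[str], Dict],
    job_id: str,
    poll_interval: float = 5.0,
    timeout: float = 4 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll until the job reaches a terminal status or the timeout elapses.

    fetch_status is any callable returning the status payload for a job,
    for example a thin wrapper around GET /api/v1/ml/train/:job_id.
    """
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status(job_id)
        if payload.get("status") in TERMINAL_STATUSES:
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {job_id} still {payload.get('status')!r} after {timeout}s"
            )
        sleep(poll_interval)
```

Injecting `fetch_status` and `sleep` keeps the loop independent of the transport and makes it easy to exercise with canned responses. WebSocket streaming, the other monitoring option, is not covered by this sketch.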
Supported Algorithms
| Algorithm | Task | Framework |
|---|---|---|
| XGBoost | Classification, Regression | XGBoost + Ray |
| LightGBM | Classification, Regression | LightGBM + Ray |
| Random Forest | Classification, Regression | scikit-learn + Ray |
| Linear/Logistic Regression | Regression, Classification | scikit-learn + Ray |
| Neural Network | Classification, Regression, NLP | PyTorch + Ray Train |
| Prophet | Time Series Forecasting | Prophet |
| ARIMA | Time Series Forecasting | statsmodels |
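The table above can be mirrored in a small lookup so that an unsupported algorithm/task pair is rejected before submission. This is a sketch under assumptions: only the identifiers `xgboost` and `classification` appear in this document, so every other key and task name below is a hypothetical spelling, and reading the Linear/Logistic row as "linear for regression, logistic for classification" is an interpretation.

```python
from typing import Dict, FrozenSet

# Assumption: apart from "xgboost" and "classification", these identifiers
# are hypothetical; the real service may spell them differently.
SUPPORTED_TASKS: Dict[str, FrozenSet[str]] = {
    "xgboost": frozenset({"classification", "regression"}),
    "lightgbm": frozenset({"classification", "regression"}),
    "random_forest": frozenset({"classification", "regression"}),
    "linear": frozenset({"regression"}),
    "logistic": frozenset({"classification"}),
    "neural_network": frozenset({"classification", "regression", "nlp"}),
    "prophet": frozenset({"time_series_forecasting"}),
    "arima": frozenset({"time_series_forecasting"}),
}


def check_algorithm(algorithm: str, task_type: str) -> None:
    """Raise ValueError if the algorithm/task pair is not in the table."""
    tasks = SUPPORTED_TASKS.get(algorithm)
    if tasks is None:
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    if task_type not in tasks:
        raise ValueError(
            f"{algorithm!r} does not support {task_type!r}; "
            f"supported: {sorted(tasks)}"
        )
```

For example, `check_algorithm("xgboost", "classification")` passes silently, while `check_algorithm("prophet", "classification")` raises `ValueError`.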
Training Configuration
{
  "name": "churn-predictor-v2",
  "algorithm": "xgboost",
  "task_type": "classification",
  "dataset": {
    "source": "sql",
    "query": "SELECT * FROM ml_features.customer_churn",
    "tenant_id": "acme-corp"
  },
  "target_column": "churned",
  "feature_columns": ["tenure", "monthly_charges", "total_charges"],
  "split": {
    "strategy": "stratified",
    "test_size": 0.2,
    "random_state": 42
  },
  "hyperparameters": {
    "n_estimators": 100,
    "max_depth": 6,
    "learning_rate": 0.1
  }
}
Training API
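The endpoints in this section can be wrapped in small request builders. The sketch below only constructs `urllib.request.Request` objects and sends nothing. `BASE_URL` is a placeholder host, the function names are hypothetical, and sending the training configuration as a JSON body is an assumption, as is the absence of authentication headers.

```python
import json
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request

# Hypothetical base URL; replace with the real AI Service host.
BASE_URL = "https://ai-service.example.com"
TRAIN_PATH = "/api/v1/ml/train"


def submit_request(config: dict) -> Request:
    """POST /api/v1/ml/train (assumes a JSON request body)."""
    body = json.dumps(config).encode("utf-8")
    return Request(
        BASE_URL + TRAIN_PATH,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def status_request(job_id: str) -> Request:
    """GET /api/v1/ml/train/:job_id"""
    return Request(f"{BASE_URL}{TRAIN_PATH}/{quote(job_id, safe='')}", method="GET")


def cancel_request(job_id: str) -> Request:
    """DELETE /api/v1/ml/train/:job_id"""
    return Request(f"{BASE_URL}{TRAIN_PATH}/{quote(job_id, safe='')}", method="DELETE")


def list_request(tenant_id: str, status: Optional[str] = None) -> Request:
    """GET /api/v1/ml/train with tenant_id and an optional status filter."""
    params = {"tenant_id": tenant_id}
    if status is not None:
        params["status"] = status
    return Request(f"{BASE_URL}{TRAIN_PATH}?{urlencode(params)}", method="GET")
```

Each builder returns a request that can be passed to `urllib.request.urlopen` once the host and any required credentials are known.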
Submit Training Job
POST /api/v1/ml/train
Get Training Status
GET /api/v1/ml/train/:job_id
Cancel Training Job
DELETE /api/v1/ml/train/:job_id
List Training Jobs
GET /api/v1/ml/train?tenant_id=acme-corp&status=running
Training Results
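A client may want to cross-check a few fields of a completed job's payload before using it. The following sketch relies on the field names from the example payload in this section. It recomputes F1 from precision and recall, confirms that the train and test counts add up to the total, and ranks features by importance. The helper name and the tolerance are assumptions.

```python
import math
from typing import Dict, List, Tuple


def summarize_result(result: Dict, tol: float = 5e-3) -> List[Tuple[str, float]]:
    """Check internal consistency, then return features ranked by importance."""
    if result.get("status") != "completed":
        raise ValueError(f"job is not completed: {result.get('status')!r}")

    metrics = result["metrics"]
    p, r = metrics["precision"], metrics["recall"]
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    if not math.isclose(f1, metrics["f1_score"], abs_tol=tol):
        raise ValueError(f"f1_score {metrics['f1_score']} != recomputed {f1:.4f}")

    summary = result["data_summary"]
    if summary["train_samples"] + summary["test_samples"] != summary["total_samples"]:
        raise ValueError("train_samples + test_samples != total_samples")

    # Highest importance first.
    return sorted(
        result["feature_importance"].items(),
        key=lambda item: item[1],
        reverse=True,
    )
```

The tolerance absorbs rounding in the reported metrics; tighten it if the service returns unrounded values.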
A completed training job returns evaluation metrics, feature importance, the model artifact ID, the training duration, and a data summary:
{
  "job_id": "train-abc123",
  "status": "completed",
  "metrics": {
    "accuracy": 0.94,
    "precision": 0.91,
    "recall": 0.88,
    "f1_score": 0.895,
    "auc_roc": 0.96,
    "confusion_matrix": [[450, 30], [25, 95]]
  },
  "feature_importance": {
    "tenure": 0.35,
    "monthly_charges": 0.28,
    "total_charges": 0.22,
    "contract_type": 0.15
  },
  "model_artifact_id": "model-xyz789",
  "training_duration_seconds": 842,
  "data_summary": {
    "total_samples": 5000,
    "train_samples": 4000,
    "test_samples": 1000
  }
}
Resource Management
Each tenant's training jobs are bounded by the following limits, all of which are configurable:
| Resource | Default Limit | Configurable |
|---|---|---|
| Max concurrent jobs | 5 | Yes |
| Max training duration | 4 hours | Yes |
| Max dataset size | 10 GB | Yes |
| CPU per job | 4 cores | Yes |
| Memory per job | 16 GB | Yes |
| GPU per job | 0 (CPU only) | Yes |
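The defaults above lend themselves to a pre-flight check on the client side. The sketch below is illustrative only: `TenantLimits`, `JobRequest`, and `limit_violations` are hypothetical names, and how the service itself enforces or reports these limits is not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TenantLimits:
    """Default per-tenant limits from the table; override fields as needed."""
    max_concurrent_jobs: int = 5
    max_duration_hours: float = 4.0
    max_dataset_gb: float = 10.0
    cpu_cores_per_job: int = 4
    memory_gb_per_job: int = 16
    gpus_per_job: int = 0


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical description of the resources a job asks for."""
    dataset_gb: float
    cpu_cores: int = 4
    memory_gb: int = 16
    gpus: int = 0


def limit_violations(
    job: JobRequest, running_jobs: int, limits: TenantLimits = TenantLimits()
) -> List[str]:
    """Return a human-readable message for every limit the job would exceed."""
    problems = []
    if running_jobs >= limits.max_concurrent_jobs:
        problems.append(f"already running {running_jobs} of {limits.max_concurrent_jobs} jobs")
    if job.dataset_gb > limits.max_dataset_gb:
        problems.append(f"dataset {job.dataset_gb} GB exceeds {limits.max_dataset_gb} GB")
    if job.cpu_cores > limits.cpu_cores_per_job:
        problems.append(f"{job.cpu_cores} CPU cores exceeds {limits.cpu_cores_per_job}")
    if job.memory_gb > limits.memory_gb_per_job:
        problems.append(f"{job.memory_gb} GB memory exceeds {limits.memory_gb_per_job} GB")
    if job.gpus > limits.gpus_per_job:
        problems.append(f"{job.gpus} GPUs exceeds {limits.gpus_per_job}")
    return problems
```

A tenant with a raised GPU allowance could be modelled as `TenantLimits(gpus_per_job=1)`. The training duration limit is not checked here because it applies while a job runs, not at submission.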