Model Training

The Model Training integration lets users submit, monitor, and evaluate training jobs through the AI Service. Training requests are forwarded to the ML Service, which executes them on Ray AIR with distributed compute. The AI Service provides the conversational interface, progress tracking, and result visualization.

Training Workflow

1. Configuration: The user specifies the algorithm, dataset, target, and hyperparameters.
2. Validation: The AI Service validates the configuration against the schema context.
3. Submission: The training job is submitted to the ML Service.
4. Monitoring: Progress is tracked via polling or WebSocket streaming.
5. Evaluation: Results are returned with metrics, feature importance, and model artifacts.

Supported Algorithms

Algorithm	Task	Framework
XGBoost	Classification, Regression	XGBoost + Ray
LightGBM	Classification, Regression	LightGBM + Ray
Random Forest	Classification, Regression	scikit-learn + Ray
Linear/Logistic	Regression, Classification	scikit-learn + Ray
Neural Network	Classification, Regression, NLP	PyTorch + Ray Train
Prophet	Time Series Forecasting	Prophet
ARIMA	Time Series Forecasting	statsmodels
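
The table above can be mirrored in a small lookup that rejects unsupported algorithm/task pairs before a job is submitted. This is a minimal sketch, not the service's own check: only `xgboost` and `classification` are identifiers that appear in this document's sample configuration, so every other key and task string below is an assumed spelling, as is the helper name `is_supported`.

```python
# Supported tasks per algorithm, transcribed from the table above.
# Only "xgboost" and "classification" are confirmed by the sample
# configuration; all other identifiers are assumed spellings.
SUPPORTED_TASKS = {
    "xgboost": {"classification", "regression"},
    "lightgbm": {"classification", "regression"},
    "random_forest": {"classification", "regression"},
    "linear": {"regression"},        # assumed split of the "Linear/Logistic" row
    "logistic": {"classification"},  # assumed split of the "Linear/Logistic" row
    "neural_network": {"classification", "regression", "nlp"},
    "prophet": {"time_series_forecasting"},
    "arima": {"time_series_forecasting"},
}


def is_supported(algorithm: str, task_type: str) -> bool:
    """Return True if the algorithm/task pair appears in the lookup."""
    return task_type in SUPPORTED_TASKS.get(algorithm, set())
```

For example, `is_supported("xgboost", "classification")` is true, while `is_supported("prophet", "classification")` is false because Prophet is listed only for time series forecasting.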

Training Configuration

{
  "name": "churn-predictor-v2",
  "algorithm": "xgboost",
  "task_type": "classification",
  "dataset": {
    "source": "sql",
    "query": "SELECT * FROM ml_features.customer_churn",
    "tenant_id": "acme-corp"
  },
  "target_column": "churned",
  "feature_columns": ["tenure", "monthly_charges", "total_charges"],
  "split": {
    "strategy": "stratified",
    "test_size": 0.2,
    "random_state": 42
  },
  "hyperparameters": {
    "n_estimators": 100,
    "max_depth": 6,
    "learning_rate": 0.1
  }
}
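
A client can catch obvious mistakes in a configuration like the one above before sending it. The sketch below is a hypothetical pre-check, not the AI Service's schema validation: the set of required keys and the three rules it applies are assumptions chosen for illustration.

```python
# Keys assumed to be mandatory; the document does not state which are required.
REQUIRED_KEYS = ("name", "algorithm", "task_type", "dataset", "target_column")


def validate_config(config: dict) -> list[str]:
    """Return a list of problems found; an empty list means none were found."""
    errors = [
        f"missing required key: {key}"
        for key in REQUIRED_KEYS
        if key not in config
    ]

    # Training on the target itself would leak the label into the features.
    target = config.get("target_column")
    if target is not None and target in config.get("feature_columns", []):
        errors.append("target_column must not also be a feature column")

    # A test fraction only makes sense strictly between 0 and 1.
    test_size = config.get("split", {}).get("test_size")
    if test_size is not None and not (0 < test_size < 1):
        errors.append("split.test_size must be strictly between 0 and 1")

    return errors
```

Returning a list of messages rather than raising on the first failure lets a conversational interface report every problem in one reply.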

Training API

Submit Training Job

POST /api/v1/ml/train
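
As a rough illustration, the request could be built with Python's standard library as follows. The host name is a placeholder, and the JSON request body is assumed to be the training configuration shown earlier. Authentication headers are omitted because this page does not describe them.

```python
import json
import urllib.request

BASE_URL = "https://ai-service.example.com"  # hypothetical host


def build_submit_request(config: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the training config."""
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/v1/ml/train",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

Passing the returned object to `urllib.request.urlopen` would send it. The shape of the submission response is not specified here, so the sketch stops at building the request.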

Get Training Status

GET /api/v1/ml/train/:job_id
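
Polling this endpoint might look like the sketch below. It takes the HTTP call as a `fetch_status` callable so the loop can be exercised without a network. The statuses `completed` and `running` appear on this page, while `failed` and `cancelled` are assumed names for the other terminal states. The default timeout mirrors the 4-hour default maximum training duration.

```python
import time
from typing import Callable

# "completed" appears in the sample result; "failed" and "cancelled" are assumed.
TERMINAL_STATUSES = {"completed", "failed", "cancelled"}


def wait_for_job(
    fetch_status: Callable[[], dict],
    interval_seconds: float = 5.0,
    timeout_seconds: float = 4 * 60 * 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job reaches a terminal status or times out."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("training job did not finish before the timeout")
        sleep(interval_seconds)
```

In practice `fetch_status` would wrap a GET request for one job ID and decode the JSON body. A fixed polling interval is the simplest choice, and WebSocket streaming avoids the polling delay altogether.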

Cancel Training Job

DELETE /api/v1/ml/train/:job_id

List Training Jobs

GET /api/v1/ml/train?tenant_id=acme-corp&status=running
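
The query string is easiest to build with `urllib.parse.urlencode`, which also escapes values that contain reserved characters. A small sketch follows. The function name is hypothetical, and treating `status` as an optional filter is an assumption.

```python
from typing import Optional
from urllib.parse import urlencode


def build_list_url(base_url: str, tenant_id: str, status: Optional[str] = None) -> str:
    """Return the list-jobs URL, adding the status filter only when given."""
    params = {"tenant_id": tenant_id}
    if status is not None:
        params["status"] = status
    return f"{base_url}/api/v1/ml/train?{urlencode(params)}"
```

With a placeholder host, `build_list_url("https://ai-service.example.com", "acme-corp", "running")` produces the same path and query string as the example request above.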

Training Results

A completed training job returns evaluation metrics, feature importance, the model artifact ID, the training duration, and a summary of the data split:

{
  "job_id": "train-abc123",
  "status": "completed",
  "metrics": {
    "accuracy": 0.94,
    "precision": 0.91,
    "recall": 0.88,
    "f1_score": 0.895,
    "auc_roc": 0.96,
    "confusion_matrix": [[450, 30], [25, 95]]
  },
  "feature_importance": {
    "tenure": 0.35,
    "monthly_charges": 0.28,
    "total_charges": 0.22,
    "contract_type": 0.15
  },
  "model_artifact_id": "model-xyz789",
  "training_duration_seconds": 842,
  "data_summary": {
    "total_samples": 5000,
    "train_samples": 4000,
    "test_samples": 1000
  }
}
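
A client might post-process such a result as sketched below: rank the features by importance, recompute F1 as the harmonic mean of precision and recall as a sanity check, and derive the test fraction from the data summary. The function and output key names are hypothetical. The input field names are taken from the sample above.

```python
def summarize_result(result: dict) -> dict:
    """Derive a few convenience values from a completed training result."""
    # Rank features from most to least important.
    importance = result.get("feature_importance", {})
    ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)

    # F1 is the harmonic mean of precision and recall.
    precision = result["metrics"]["precision"]
    recall = result["metrics"]["recall"]
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0

    summary = result["data_summary"]
    return {
        "top_features": [name for name, _ in ranked],
        "f1_recomputed": f1,
        "test_fraction": summary["test_samples"] / summary["total_samples"],
    }
```

Applied to the sample result, the recomputed F1 rounds to the reported 0.895, and the test fraction of 0.2 matches the `test_size` in the training configuration.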

Resource Management

Training jobs are subject to per-tenant resource limits. Each default below can be overridden:

Resource	Default Limit	Configurable
Max concurrent jobs	5	Yes
Max training duration	4 hours	Yes
Max dataset size	10 GB	Yes
CPU per job	4 cores	Yes
Memory per job	16 GB	Yes
GPU per job	0 (CPU only)	Yes
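
To make the defaults concrete, the sketch below models them as a dataclass and checks two of the limits before a job starts. How the service actually enforces or overrides limits is not described on this page, so the class, field, and function names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantLimits:
    """Default per-tenant limits, copied from the table above."""
    max_concurrent_jobs: int = 5
    max_duration_hours: float = 4.0
    max_dataset_gb: float = 10.0
    cpu_cores_per_job: int = 4
    memory_gb_per_job: int = 16
    gpus_per_job: int = 0  # CPU only by default


def can_start_job(
    limits: TenantLimits, running_jobs: int, dataset_gb: float
) -> tuple[bool, str]:
    """Check the concurrency and dataset-size limits for a new job."""
    if running_jobs >= limits.max_concurrent_jobs:
        return False, "concurrent job limit reached"
    if dataset_gb > limits.max_dataset_gb:
        return False, "dataset exceeds size limit"
    return True, "ok"
```

A tenant with a raised limit could be represented as `TenantLimits(max_concurrent_jobs=10)`, leaving the other defaults untouched.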
