Managing Runs
Runs represent individual training iterations within an experiment. Each run captures a complete record of a training attempt, including its parameters, metrics over time, artifacts, and status.
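To make the shape of a run concrete, here is an illustrative sketch of a run record as a Python dataclass. The field names mirror the concepts above but are assumptions for illustration, not the service's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class RunRecord:
    """Illustrative shape of a run record (field names are assumptions)."""
    run_id: str
    experiment_id: str
    name: str
    status: str = "running"          # see the status lifecycle below
    params: dict = field(default_factory=dict)    # hyperparameters
    metrics: dict = field(default_factory=dict)   # latest metric values
    tags: dict = field(default_factory=dict)
    artifacts: list = field(default_factory=list)
```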
Creating a Run
```http
POST /api/v1/experiments/runs
Content-Type: application/json
X-Tenant-ID: acme-corp

{
  "experiment_id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "xgboost-lr-0.01-depth-6",
  "description": "XGBoost with learning rate 0.01 and max depth 6",
  "tags": {
    "model_type": "xgboost",
    "dataset_version": "v2.1"
  }
}
```

Run Status Lifecycle
Runs progress through the following statuses:
| Status | Description |
|---|---|
| running | Run is actively in progress |
| finished | Run completed successfully |
| failed | Run encountered an error |
| killed | Run was manually terminated |
Logging Metrics
Log metrics at each training step for time-series visualization:
```http
POST /api/v1/experiments/runs/{run_id}/log-metrics
Content-Type: application/json

{
  "metrics": {
    "train_loss": 0.342,
    "val_loss": 0.389,
    "val_accuracy": 0.876,
    "val_f1": 0.851
  },
  "step": 15,
  "timestamp": 1707723600000
}
```

Using the SDK context manager:
```python
from src.tracking.experiment_tracker import (
    ExperimentTracker, RunConfig, RunMetrics
)

tracker = ExperimentTracker()

with tracker.start_run(
    experiment_name="fraud-detection-v3",
    run_config=RunConfig(
        run_name="xgboost-baseline",
        tenant_id="acme-corp",
        user_id="alice@acme.com",
    ),
) as run:
    # Log metrics at each epoch
    for epoch in range(100):
        train_loss = train_epoch(model, train_loader)
        val_metrics = validate(model, val_loader)
        run.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_metrics["loss"],
            "val_accuracy": val_metrics["accuracy"],
        }, step=epoch)
```

Logging Parameters
Parameters capture the hyperparameters and configuration used for a run:
```http
POST /api/v1/experiments/runs/{run_id}/log-params
Content-Type: application/json

{
  "params": {
    "learning_rate": "0.01",
    "max_depth": "6",
    "n_estimators": "200",
    "subsample": "0.8",
    "colsample_bytree": "0.7"
  }
}
```

Via the SDK:
```python
run.log_params({
    "learning_rate": 0.01,
    "max_depth": 6,
    "n_estimators": 200,
    "subsample": 0.8,
})
```

All parameter values are converted to strings internally (MLflow requirement).
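For example, a client can normalize values up front with plain `str()` conversion. This is a hedged sketch of the equivalent step the service performs internally, not SDK code:

```python
def stringify_params(params: dict) -> dict:
    # MLflow stores all param values as strings, so convert before sending
    return {key: str(value) for key, value in params.items()}

payload = stringify_params({"learning_rate": 0.01, "max_depth": 6})
# Every value is now a string: {"learning_rate": "0.01", "max_depth": "6"}
```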
Batch Logging
For high-throughput scenarios, use batch logging to submit metrics, parameters, and tags in a single request:
```http
POST /api/v1/experiments/runs/{run_id}/log-batch
Content-Type: application/json

{
  "metrics": [
    {"key": "loss", "value": 0.342, "step": 1},
    {"key": "loss", "value": 0.298, "step": 2},
    {"key": "loss", "value": 0.256, "step": 3}
  ],
  "params": [
    {"key": "optimizer", "value": "adam"},
    {"key": "batch_size", "value": "64"}
  ],
  "tags": [
    {"key": "run_type", "value": "hyperparameter_search"},
    {"key": "gpu_model", "value": "A100"}
  ]
}
```

Ending a Run
```http
POST /api/v1/experiments/runs/{run_id}/end?status=finished
```

The response includes the total duration:

```json
{
  "message": "Run ended",
  "run_id": "...",
  "status": "finished",
  "duration_seconds": 342.5
}
```

Listing Runs
```http
GET /api/v1/experiments/runs?experiment_id={id}&status=finished&limit=50&offset=0
X-Tenant-ID: acme-corp
```

Metric History
Retrieve the full time series for a specific metric using the tracker SDK:
```python
history = tracker.get_metric_history(run_id="...", key="val_loss")
# Returns: [{"key": "val_loss", "value": 0.5, "timestamp": ..., "step": 0}, ...]
```

Source Files
| File | Path |
|---|---|
| Run Endpoints | data-plane/ml-service/src/api/experiments.py |
| ExperimentTracker | data-plane/ml-service/src/tracking/experiment_tracker.py |
| ActiveRun | data-plane/ml-service/src/tracking/experiment_tracker.py |