MATIH Platform is in active MVP development. Documentation reflects current implementation status.
13. ML Service & MLOps
Experiment Tracking
Managing Runs

Runs represent individual training iterations within an experiment. Each run captures a complete record of a training attempt: parameters, metrics over time, artifacts, and status.


Creating a Run

POST /api/v1/experiments/runs
Content-Type: application/json
X-Tenant-ID: acme-corp
 
{
  "experiment_id": "550e8400-e29b-41d4-a716-446655440000",
  "name": "xgboost-lr-0.01-depth-6",
  "description": "XGBoost with learning rate 0.01 and max depth 6",
  "tags": {
    "model_type": "xgboost",
    "dataset_version": "v2.1"
  }
}
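The same request can be issued from Python. The sketch below assembles it with the standard library; the helper name, base URL, and response handling are illustrative and not part of the SDK.

```python
import json
import urllib.request

API_BASE = "http://localhost:8000"  # assumed local ml-service address

def build_create_run_request(experiment_id, name, tenant_id,
                             description="", tags=None):
    """Assemble a POST /api/v1/experiments/runs request (illustrative helper)."""
    payload = {
        "experiment_id": experiment_id,
        "name": name,
        "description": description,
        "tags": tags or {},
    }
    return urllib.request.Request(
        f"{API_BASE}/api/v1/experiments/runs",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "X-Tenant-ID": tenant_id},
        method="POST",
    )

req = build_create_run_request(
    experiment_id="550e8400-e29b-41d4-a716-446655440000",
    name="xgboost-lr-0.01-depth-6",
    tenant_id="acme-corp",
    tags={"model_type": "xgboost"},
)
# urllib.request.urlopen(req) would submit it against a running service
```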

Run Status Lifecycle

Runs progress through the following statuses:

| Status | Description |
|---|---|
| `running` | Run is actively in progress |
| `finished` | Run completed successfully |
| `failed` | Run encountered an error |
| `killed` | Run was manually terminated |
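The lifecycle can be sketched as an enum with a terminal-state check. The enum class itself is illustrative; only the four status strings come from the API above.

```python
from enum import Enum

class RunStatus(str, Enum):
    RUNNING = "running"
    FINISHED = "finished"
    FAILED = "failed"
    KILLED = "killed"

    @property
    def is_terminal(self):
        # Every status except RUNNING is final; a terminal run cannot resume.
        return self is not RunStatus.RUNNING

print(RunStatus("finished").is_terminal)  # -> True
```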

Logging Metrics

Log metrics at each training step for time-series visualization:

POST /api/v1/experiments/runs/{run_id}/log-metrics
Content-Type: application/json
 
{
  "metrics": {
    "train_loss": 0.342,
    "val_loss": 0.389,
    "val_accuracy": 0.876,
    "val_f1": 0.851
  },
  "step": 15,
  "timestamp": 1707723600000
}

Using the SDK context manager:

from src.tracking.experiment_tracker import (
    ExperimentTracker, RunConfig, RunMetrics
)
 
tracker = ExperimentTracker()
 
with tracker.start_run(
    experiment_name="fraud-detection-v3",
    run_config=RunConfig(
        run_name="xgboost-baseline",
        tenant_id="acme-corp",
        user_id="alice@acme.com",
    ),
) as run:
    # Log metrics at each epoch
    for epoch in range(100):
        train_loss = train_epoch(model, train_loader)
        val_metrics = validate(model, val_loader)
 
        run.log_metrics({
            "train_loss": train_loss,
            "val_loss": val_metrics["loss"],
            "val_accuracy": val_metrics["accuracy"],
        }, step=epoch)

Logging Parameters

Parameters capture the hyperparameters and configuration used for a run:

POST /api/v1/experiments/runs/{run_id}/log-params
Content-Type: application/json
 
{
  "params": {
    "learning_rate": "0.01",
    "max_depth": "6",
    "n_estimators": "200",
    "subsample": "0.8",
    "colsample_bytree": "0.7"
  }
}

Via the SDK:

run.log_params({
    "learning_rate": 0.01,
    "max_depth": 6,
    "n_estimators": 200,
    "subsample": 0.8,
})

All parameter values are converted to strings internally (MLflow requirement).
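What that conversion looks like, as a one-line sketch rather than the tracker's actual code:

```python
params = {"learning_rate": 0.01, "max_depth": 6, "subsample": 0.8}
# MLflow stores params as strings, so numeric values are stringified on log.
as_stored = {k: str(v) for k, v in params.items()}
print(as_stored)  # -> {'learning_rate': '0.01', 'max_depth': '6', 'subsample': '0.8'}
```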


Batch Logging

For high-throughput scenarios, use batch logging to submit metrics, parameters, and tags in a single request:

POST /api/v1/experiments/runs/{run_id}/log-batch
Content-Type: application/json
 
{
  "metrics": [
    {"key": "loss", "value": 0.342, "step": 1},
    {"key": "loss", "value": 0.298, "step": 2},
    {"key": "loss", "value": 0.256, "step": 3}
  ],
  "params": [
    {"key": "optimizer", "value": "adam"},
    {"key": "batch_size", "value": "64"}
  ],
  "tags": [
    {"key": "run_type", "value": "hyperparameter_search"},
    {"key": "gpu_model", "value": "A100"}
  ]
}
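A client-side sketch of how a training loop might buffer metrics and flush them as one batch payload. The buffer class is illustrative; only the payload shape comes from the endpoint above.

```python
class MetricBuffer:
    """Accumulate per-step metrics and emit a log-batch payload."""
    def __init__(self):
        self._metrics = []

    def log(self, key, value, step):
        self._metrics.append({"key": key, "value": value, "step": step})

    def flush(self, params=None, tags=None):
        payload = {
            "metrics": self._metrics,
            "params": [{"key": k, "value": str(v)} for k, v in (params or {}).items()],
            "tags": [{"key": k, "value": v} for k, v in (tags or {}).items()],
        }
        self._metrics = []
        return payload  # POST this to /api/v1/experiments/runs/{run_id}/log-batch

buf = MetricBuffer()
for step, loss in enumerate([0.342, 0.298, 0.256], start=1):
    buf.log("loss", loss, step)
payload = buf.flush(params={"batch_size": 64}, tags={"gpu_model": "A100"})
```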

Ending a Run

POST /api/v1/experiments/runs/{run_id}/end?status=finished

The response includes the total duration:

{
  "message": "Run ended",
  "run_id": "...",
  "status": "finished",
  "duration_seconds": 342.5
}
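Building the end-run URL from Python can be sketched as follows; the run ID is a placeholder, and the helper is illustrative.

```python
from urllib.parse import urlencode

def end_run_url(run_id, status="finished"):
    # status is typically a terminal state: finished, failed, or killed
    return f"/api/v1/experiments/runs/{run_id}/end?{urlencode({'status': status})}"

print(end_run_url("1234", status="killed"))
# -> /api/v1/experiments/runs/1234/end?status=killed
```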

Listing Runs

GET /api/v1/experiments/runs?experiment_id={id}&status=finished&limit=50&offset=0
X-Tenant-ID: acme-corp
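Paging through a long run list can be sketched with a generator that advances `offset` until a short page comes back. The `fetch_page` callable below is an assumed stand-in for the HTTP call.

```python
def iter_runs(fetch_page, limit=50):
    """Yield runs page by page; fetch_page(limit, offset) returns a list of runs."""
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:  # a short page means we've reached the end
            break
        offset += limit

# Stand-in for GET /api/v1/experiments/runs?...&limit={limit}&offset={offset}
fake_store = [{"run_id": i} for i in range(120)]
fetch_page = lambda limit, offset: fake_store[offset:offset + limit]
runs = list(iter_runs(fetch_page))
print(len(runs))  # -> 120
```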

Metric History

Retrieve the full time series for a specific metric using the tracker SDK:

history = tracker.get_metric_history(run_id="...", key="val_loss")
# Returns: [{"key": "val_loss", "value": 0.5, "timestamp": ..., "step": 0}, ...]
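One common use of the history is to find the best checkpoint, for example the step with the lowest validation loss. The sample data below is illustrative.

```python
history = [
    {"key": "val_loss", "value": 0.50, "timestamp": 0, "step": 0},
    {"key": "val_loss", "value": 0.39, "timestamp": 1, "step": 1},
    {"key": "val_loss", "value": 0.41, "timestamp": 2, "step": 2},
]
best = min(history, key=lambda point: point["value"])
print(best["step"])  # -> 1
```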

Source Files

| File | Path |
|---|---|
| Run Endpoints | data-plane/ml-service/src/api/experiments.py |
| ExperimentTracker | data-plane/ml-service/src/tracking/experiment_tracker.py |
| ActiveRun | data-plane/ml-service/src/tracking/experiment_tracker.py |