Experiment Tracking
The MATIH ML Service provides a comprehensive experiment tracking system built on MLflow, enabling data scientists and ML engineers to organize, track, and compare machine learning experiments across the platform. Every experiment is tenant-isolated and integrates with the broader MLOps pipeline.
What is Experiment Tracking?
Experiment tracking captures the full context of each machine learning training run: hyperparameters, metrics over time, model artifacts, code versions, and environment details. This provides reproducibility, comparability, and auditability for all ML work.
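Conceptually, a tracked run is just a structured record of these elements. The sketch below is a plain-Python illustration of that idea, independent of the actual MLflow-backed implementation; the `RunRecord` class and its method names are illustrative, not part of the service:

```python
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Illustrative container for what a tracking system captures per run."""

    run_id: str
    params: dict[str, str] = field(default_factory=dict)    # hyperparameters
    # metric name -> list of (step, value) pairs, i.e. metrics over time
    metrics: dict[str, list[tuple[int, float]]] = field(default_factory=dict)
    artifacts: list[str] = field(default_factory=list)      # model files, plots, etc.
    tags: dict[str, str] = field(default_factory=dict)      # code version, environment

    def log_param(self, key: str, value: str) -> None:
        self.params[key] = value

    def log_metric(self, key: str, value: float, step: int = 0) -> None:
        self.metrics.setdefault(key, []).append((step, value))


run = RunRecord(run_id="run-001", tags={"git_commit": "abc123"})
run.log_param("learning_rate", "0.1")
run.log_metric("val_auc", 0.81, step=1)
run.log_metric("val_auc", 0.84, step=2)
```

Because every run carries the same structure, runs become directly comparable and reproducible from their recorded parameters and tags.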
Architecture
```
+-----------------------+
|    ML Workbench UI    |
+-----------+-----------+
            |
+-----------v-----------+
|    Experiments API    |
|  /api/v1/experiments  |
+-----------+-----------+
            |
+-------------------------+      +------------------+
|   ExperimentTracker     +------>   MLflow Server  |
| (experiment_tracker.py) |      |  (Tracking URI)  |
+-----------+-------------+      +------------------+
            |
+-----------v-----------+
|   Artifact Storage    |
|     (S3 / MinIO)      |
+-----------------------+
```
Key Components
| Component | File | Purpose |
|---|---|---|
| Experiments API | src/api/experiments.py | REST endpoints for CRUD, run management, metrics logging |
| ExperimentTracker | src/tracking/experiment_tracker.py | MLflow client wrapper with tenant isolation |
| EnhancedExperimentService | src/tracking/enhanced_experiment_service.py | Advanced tracking with system metrics |
| ArtifactManager | src/tracking/artifact_manager.py | Artifact upload, storage, retrieval |
Core Data Models
ExperimentCreate
```python
from typing import Optional

from pydantic import BaseModel, Field


class ExperimentCreate(BaseModel):
    """Request to create an experiment."""

    name: str = Field(..., min_length=1, max_length=255)
    description: str = Field(default="", max_length=1000)
    artifact_location: Optional[str] = None
    tags: dict[str, str] = Field(default_factory=dict)
```
ExperimentResponse
```python
class ExperimentResponse(BaseModel):
    """Experiment response."""

    id: str
    name: str
    description: str
    artifact_location: str
    lifecycle_stage: str  # "active" or "deleted"
    created_at: str
    last_updated: str
    tags: dict[str, str]
```
RunResponse
```python
class RunResponse(BaseModel):
    """Run response."""

    id: str
    experiment_id: str
    name: Optional[str]
    status: str  # "running", "finished", "failed", "killed"
    start_time: str
    end_time: Optional[str]
    duration_seconds: Optional[float]
    tags: dict[str, str]
    params: dict[str, str]
    metrics: dict[str, float]
    artifacts: list[str]
```
Experiment Lifecycle
Experiments follow a simple lifecycle:
- Create -- Define an experiment with a name, description, and tags
- Run -- Create runs within the experiment to track individual training iterations
- Log -- Record parameters, metrics (at each step), and artifacts to runs
- Compare -- Compare metrics across runs to identify best configurations
- Archive -- Soft-delete experiments that are no longer active
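The five steps can be walked through with a minimal in-memory sketch. This is a hypothetical illustration of the flow only; in practice each step is performed through the `/api/v1/experiments` REST endpoints, and the dictionaries here stand in for the service's storage:

```python
experiments: dict[str, dict] = {}
runs: dict[str, dict] = {}

# 1. Create -- an experiment with a name and lifecycle stage
experiments["exp-1"] = {"name": "churn-prediction-v2", "lifecycle_stage": "active"}

# 2. Run -- two training iterations within the experiment
runs["run-a"] = {"experiment_id": "exp-1", "params": {}, "metrics": {}}
runs["run-b"] = {"experiment_id": "exp-1", "params": {}, "metrics": {}}

# 3. Log -- parameters and metrics per run
runs["run-a"]["params"]["max_depth"] = "6"
runs["run-a"]["metrics"]["val_auc"] = 0.84
runs["run-b"]["params"]["max_depth"] = "8"
runs["run-b"]["metrics"]["val_auc"] = 0.87

# 4. Compare -- pick the run with the best metric
best = max(runs, key=lambda r: runs[r]["metrics"]["val_auc"])

# 5. Archive -- soft-delete by flipping the lifecycle stage
experiments["exp-1"]["lifecycle_stage"] = "deleted"
```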
Tenant Isolation
All experiments are scoped by tenant. The ExperimentTracker prefixes experiment names with the tenant ID:
```python
def get_or_create_experiment(self, name: str, tenant_id: Optional[str] = None) -> str:
    if tenant_id:
        full_name = f"{tenant_id}/{name}"
    else:
        full_name = name
    experiment = self._client.get_experiment_by_name(full_name)
    if experiment:
        return experiment.experiment_id
    return self.create_experiment(ExperimentConfig(name=full_name, tenant_id=tenant_id))
```
Tenant tags (matih.tenant_id) are automatically applied to all experiments and runs, enabling filtered queries across the MLflow backend.
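The naming scheme itself is easy to verify in isolation. The standalone `tenant_scoped_name` helper below is hypothetical (it is not part of the service); it simply mirrors the prefixing rule from `get_or_create_experiment` above:

```python
from typing import Optional


def tenant_scoped_name(name: str, tenant_id: Optional[str] = None) -> str:
    """Mirror ExperimentTracker's naming rule: prefix with the tenant ID when present."""
    return f"{tenant_id}/{name}" if tenant_id else name


# A tenant-scoped experiment and an unscoped one never collide in the MLflow backend.
scoped = tenant_scoped_name("churn-prediction-v2", "acme-corp")  # "acme-corp/churn-prediction-v2"
unscoped = tenant_scoped_name("churn-prediction-v2")             # "churn-prediction-v2"
```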
Quick Start
```bash
# Create an experiment
curl -X POST http://localhost:8000/api/v1/experiments \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: acme-corp" \
  -d '{
    "name": "churn-prediction-v2",
    "description": "Customer churn prediction with gradient boosting",
    "tags": {"team": "data-science", "domain": "retention"}
  }'

# Create a run
curl -X POST http://localhost:8000/api/v1/experiments/runs \
  -H "Content-Type: application/json" \
  -H "X-Tenant-ID: acme-corp" \
  -d '{
    "experiment_id": "<experiment-id>",
    "name": "xgboost-baseline",
    "tags": {"model_type": "xgboost"}
  }'
```
Section Contents
| Page | Description |
|---|---|
| Creating Experiments | Experiment creation, naming conventions, metadata management |
| Managing Runs | Run lifecycle, metrics logging, parameter tracking |
| Comparing Runs | Multi-run comparison and visualization |
| Artifacts | Artifact upload, storage backends, retrieval |
| MLflow Integration | MLflow compatibility and migration guide |
| API Reference | Complete endpoint documentation |