MATIH Platform is in active MVP development. Documentation reflects current implementation status.
5. Quickstart Tutorials
Tutorial: Training a Model

In this tutorial, you will use the MATIH ML Workbench to train a machine learning model that predicts customer churn. You will prepare data, select an algorithm, train the model, evaluate its performance, and register it in the model registry for deployment.


What You Will Learn

  • How to navigate the ML Workbench interface
  • How to create an ML experiment and prepare training data
  • How to configure and train a classification model
  • How to evaluate model performance with standard metrics
  • How to register a trained model in the MLflow model registry
  • How to deploy a registered model for inference

Prerequisites

| Requirement | How to Verify |
|---|---|
| MATIH platform running | `./scripts/tools/platform-status.sh` returns healthy |
| ML service operational | Health check on `ml-service` passes |
| MLflow tracking server | MLflow UI accessible |
| Sample data loaded | The `customers`, `orders`, and `returns` tables are available |

Step 1: Open the ML Workbench

Navigate to the ML Workbench:

  • Local development: http://localhost:3001
  • Cloud deployment: https://ml.{your-tenant}.matih.ai

Log in with your tenant credentials. The ML Workbench home page shows:

| Section | Description |
|---|---|
| Experiments | List of ML experiments and their runs |
| Models | Model registry with registered models |
| Datasets | Available datasets for training |
| Notebooks | Interactive Jupyter-style notebooks |
| Pipelines | Automated ML pipelines |

Step 2: Create a New Experiment

  1. Click Experiments in the sidebar.
  2. Click New Experiment.
  3. Configure the experiment:
    • Name: Customer Churn Prediction
    • Description: Predict which customers are likely to churn based on purchase history and demographics
    • Tags: churn, classification, retail
  4. Click Create.

An experiment in MLflow is a container for multiple training runs. Each run tracks parameters, metrics, and artifacts.


Step 3: Prepare the Training Data

Option A: Using the Data Preparation UI

  1. In the experiment view, click Prepare Data.
  2. Select SQL Query as the data source.
  3. Enter the feature engineering query:
SELECT
    c.id AS customer_id,
    c.segment,
    c.age,
    c.region,
    COALESCE(order_stats.total_orders, 0) AS total_orders,
    COALESCE(order_stats.total_spend, 0) AS total_spend,
    COALESCE(order_stats.avg_order_value, 0) AS avg_order_value,
    COALESCE(order_stats.days_since_last_order, 999) AS days_since_last_order,
    COALESCE(return_stats.return_count, 0) AS return_count,
    COALESCE(return_stats.return_rate, 0) AS return_rate,
    CASE
        WHEN order_stats.days_since_last_order > 90 THEN 1
        ELSE 0
    END AS churned
FROM customers c
LEFT JOIN (
    SELECT
        customer_id,
        COUNT(*) AS total_orders,
        SUM(total_amount) AS total_spend,
        AVG(total_amount) AS avg_order_value,
        EXTRACT(DAY FROM CURRENT_DATE - MAX(order_date)) AS days_since_last_order
    FROM orders
    GROUP BY customer_id
) order_stats ON c.id = order_stats.customer_id
LEFT JOIN (
    SELECT
        o.customer_id,
        COUNT(r.id) AS return_count,
        COUNT(r.id)::FLOAT / NULLIF(COUNT(o.id), 0) AS return_rate
    FROM orders o
    LEFT JOIN returns r ON o.id = r.order_id
    GROUP BY o.customer_id
) return_stats ON c.id = return_stats.customer_id
  4. Click Preview to verify the data (shows the first 100 rows).
  5. Configure the dataset:
    • Name: churn_features_v1
    • Target column: churned
    • Feature columns: All except customer_id and churned
    • Train/Test split: 80% / 20%
    • Stratify by: churned (to maintain class balance)
  6. Click Save Dataset.

Option B: Using a Notebook

Create a new notebook and use Python for data preparation:

import pandas as pd
from matih_ml import DataSource, Dataset
 
# Connect to the tenant's data source
ds = DataSource.from_tenant()
 
# Execute the feature query
df = ds.query("""
    SELECT
        c.id AS customer_id,
        c.segment,
        c.age,
        c.region,
        -- ... (same query as above)
    FROM customers c
    LEFT JOIN ...
""")
 
# Explore the data
print(f"Shape: {df.shape}")
print(f"Churn rate: {df['churned'].mean():.2%}")
print(df.describe())
 
# Save as a dataset
dataset = Dataset.create(
    name="churn_features_v1",
    dataframe=df,
    target="churned",
    test_size=0.2,
    stratify="churned"
)

Step 4: Configure the Model

  1. In the experiment view, click New Run.
  2. Select the dataset: churn_features_v1.
  3. Choose the algorithm:

Available Algorithms

| Algorithm | Type | Best For |
|---|---|---|
| Logistic Regression | Classification | Interpretable baseline |
| Random Forest | Classification | Robust, handles mixed features |
| Gradient Boosting (XGBoost) | Classification | High performance |
| LightGBM | Classification | Fast training on large datasets |
| Support Vector Machine | Classification | High-dimensional data |
| Neural Network (MLP) | Classification | Complex non-linear patterns |

For this tutorial, select Random Forest.

  4. Configure hyperparameters:

| Parameter | Value | Description |
|---|---|---|
| n_estimators | 100 | Number of trees in the forest |
| max_depth | 10 | Maximum tree depth |
| min_samples_split | 5 | Minimum samples to split a node |
| min_samples_leaf | 2 | Minimum samples in a leaf node |
| class_weight | balanced | Handle class imbalance |

  5. Configure categorical encoding:
    • segment: One-hot encoding
    • region: One-hot encoding
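This configuration corresponds roughly to the following scikit-learn pipeline: one-hot encoding for the two categorical columns, the numeric features passed through, and a RandomForestClassifier with the hyperparameters above. The tiny DataFrame is synthetic, for illustration only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the churn_features_v1 dataset
df = pd.DataFrame({
    "segment": ["Premium", "Standard", "Premium", "Standard"] * 25,
    "region": ["West", "East", "North", "South"] * 25,
    "total_orders": range(100),
    "churned": [0, 1] * 50,
})

preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["segment", "region"])],
    remainder="passthrough",  # numeric features pass through unchanged
)
model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=2,
        class_weight="balanced",
        random_state=42,
    )),
])
model.fit(df.drop(columns="churned"), df["churned"])
```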

Step 5: Train the Model

Click Start Training. The ML Workbench submits the training job to the ML service.

Training Progress

The training progress view shows:

| Metric | Real-Time Display |
|---|---|
| Training progress | Progress bar with percentage |
| Current accuracy | Updated per epoch/iteration |
| Training loss | Loss curve chart |
| Resource usage | CPU and memory consumption |
| Estimated time remaining | Based on current pace |

Training a Random Forest on 5,000 samples typically completes in under 30 seconds.

What Happens Behind the Scenes

  1. The ML service retrieves the dataset from the feature store.
  2. Categorical features are encoded using the configured strategy.
  3. The data is split into train (80%) and test (20%) sets.
  4. The model is trained using scikit-learn's RandomForestClassifier.
  5. All parameters, metrics, and the model artifact are logged to MLflow.
  6. The model is serialized and stored in the artifact store (MinIO/S3).
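The core of this sequence can be sketched with scikit-learn directly. This is a condensed, stand-alone illustration on synthetic data (MLflow logging and the MinIO/S3 upload are omitted), not the platform's actual implementation:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for the feature-store dataset (synthetic features and labels)
X, y = make_classification(n_samples=1000, n_features=9, random_state=42)

# 80/20 split, stratified to preserve the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

# Serialize the model artifact (the platform stores this in MinIO/S3)
artifact = pickle.dumps(model)
```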

Step 6: Evaluate the Model

After training completes, the evaluation view appears automatically.

Classification Metrics

| Metric | Value | Interpretation |
|---|---|---|
| Accuracy | 0.87 | 87% of predictions are correct |
| Precision | 0.82 | 82% of predicted churners actually churned |
| Recall | 0.79 | 79% of actual churners were identified |
| F1 Score | 0.80 | Harmonic mean of precision and recall |
| AUC-ROC | 0.91 | Excellent discrimination ability |

Confusion Matrix

|  | Predicted: Not Churned | Predicted: Churned |
|---|---|---|
| Actual: Not Churned | 602 (TN) | 59 (FP) |
| Actual: Churned | 71 (FN) | 268 (TP) |
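The metrics above are all derived from the four confusion-matrix counts. A minimal sketch of the arithmetic (the counts in the example call are illustrative, not the tutorial's values):

```python
def metrics_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive the standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of predicted churners, how many actually churned
    recall = tp / (tp + fn)      # of actual churners, how many were identified
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts on a 200-row test set
m = metrics_from_confusion(tn=85, fp=15, fn=20, tp=80)
```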

Feature Importance

The workbench displays a ranked list of feature importance:

| Feature | Importance |
|---|---|
| days_since_last_order | 0.32 |
| total_orders | 0.18 |
| total_spend | 0.15 |
| avg_order_value | 0.12 |
| return_rate | 0.09 |
| age | 0.06 |
| return_count | 0.04 |
| segment (encoded) | 0.03 |
| region (encoded) | 0.01 |
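For Random Forests, this ranking comes from the fitted estimator's `feature_importances_` attribute, paired with the column names and sorted. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = [
    "days_since_last_order", "total_orders", "total_spend", "avg_order_value",
    "return_rate", "age", "return_count", "segment_enc", "region_enc",
]
# Synthetic stand-in for the encoded training data
X, y = make_classification(n_samples=500, n_features=len(feature_names),
                           random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each name with its importance and sort descending
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
```

The importances are normalized, so they sum to 1.0 across all features.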

ROC Curve

The ROC curve is displayed interactively. You can hover over points to see the threshold, TPR, and FPR at each point.
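The curve itself is computed from the model's predicted probabilities: one (FPR, TPR) point per threshold. A minimal scikit-learn sketch with hand-picked example scores:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted churn probabilities

# fpr/tpr at each threshold; the thresholds are what the UI shows on hover
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
```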


Step 7: Compare Runs (Optional)

Train additional models with different algorithms or hyperparameters, then compare:

  1. Click New Run and train a Gradient Boosting model.
  2. After training, go to the Experiment view.
  3. Select multiple runs and click Compare.

The comparison view shows:

| Run | Algorithm | Accuracy | F1 | AUC | Training Time |
|---|---|---|---|---|---|
| Run 1 | Random Forest | 0.87 | 0.80 | 0.91 | 28s |
| Run 2 | XGBoost | 0.89 | 0.83 | 0.93 | 45s |
| Run 3 | Logistic Regression | 0.81 | 0.72 | 0.85 | 3s |

Step 8: Register the Best Model

Once you have identified the best model:

  1. Open the run details for the winning model.
  2. Click Register Model.
  3. Configure:
    • Model name: customer-churn-predictor
    • Version: Auto-assigned (v1)
    • Description: Random Forest classifier for predicting customer churn based on purchase behavior
    • Tags: production-candidate, churn, retail
  4. Click Register.

The model is now in the MLflow model registry with version tracking.

Model Registry Stages

| Stage | Description |
|---|---|
| None | Newly registered, not assigned to any stage |
| Staging | Being validated for production use |
| Production | Actively serving predictions |
| Archived | Retired from active use |

Transition the model to staging:

  1. In the Model Registry, find customer-churn-predictor.
  2. Click on Version 1.
  3. Click Transition to Staging.
  4. Add a note: "Passed initial evaluation with AUC 0.91."

Step 9: Test Inference

Test the registered model with sample data:

  1. In the model version view, click Test Prediction.
  2. Enter feature values:
{
  "segment": "Premium",
  "age": 42,
  "region": "West",
  "total_orders": 3,
  "total_spend": 450.00,
  "avg_order_value": 150.00,
  "days_since_last_order": 120,
  "return_count": 1,
  "return_rate": 0.33
}
  3. Click Predict.

Result:

{
  "prediction": 1,
  "probability": {
    "not_churned": 0.23,
    "churned": 0.77
  },
  "model": "customer-churn-predictor",
  "version": 1
}

This customer has a 77% probability of churning.
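The same prediction can be requested over REST. MLflow model serving accepts a `dataframe_split` payload; the sketch below only builds that payload, and the endpoint URL in the comment is illustrative:

```python
import json

features = {
    "segment": "Premium", "age": 42, "region": "West",
    "total_orders": 3, "total_spend": 450.00, "avg_order_value": 150.00,
    "days_since_last_order": 120, "return_count": 1, "return_rate": 0.33,
}

# MLflow's scoring protocol: column names plus rows of values
payload = json.dumps({
    "dataframe_split": {
        "columns": list(features.keys()),
        "data": [list(features.values())],
    }
})

# POST to the serving endpoint (not executed here; URL is illustrative):
# requests.post("http://localhost:5002/invocations",
#               headers={"Content-Type": "application/json"}, data=payload)
```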


Step 10: Deploy for Batch Scoring

To score all current customers:

  1. Go to Pipelines in the sidebar.
  2. Click Create Pipeline.
  3. Select Batch Scoring template.
  4. Configure:
    • Model: customer-churn-predictor (Staging)
    • Input: The same feature query from Step 3
    • Output table: churn_predictions
    • Schedule: Daily at 6:00 AM
  5. Click Create and Run.

The pipeline executes the feature query, runs inference, and writes predictions to the output table:

SELECT * FROM churn_predictions ORDER BY churn_probability DESC LIMIT 10;
| customer_id | churn_probability | prediction | scored_at |
|---|---|---|---|
| cust_4823 | 0.94 | 1 | 2026-02-12 06:00:00 |
| cust_1247 | 0.91 | 1 | 2026-02-12 06:00:00 |
| cust_3891 | 0.88 | 1 | 2026-02-12 06:00:00 |

MLflow Integration

The ML Workbench is powered by MLflow for experiment tracking and model management:

| MLflow Component | MATIH Usage |
|---|---|
| Tracking | Log parameters, metrics, and artifacts for every run |
| Models | Model registry with version control and stage transitions |
| Projects | Reproducible training definitions |
| Model Serving | REST API for real-time inference |

Access the MLflow UI directly:

  • Local: http://localhost:5000
  • Cloud: https://mlflow.{your-tenant}.matih.ai

Troubleshooting

| Issue | Cause | Resolution |
|---|---|---|
| "Dataset not found" | Data query failed | Check data source connectivity and query syntax |
| Training hangs | Insufficient resources | Reduce dataset size or simplify the model |
| Low accuracy | Poor features or class imbalance | Engineer better features, try class_weight='balanced' |
| "Model registration failed" | MLflow tracking server down | Check MLflow service health |
| Batch scoring fails | Feature schema mismatch | Ensure the input query matches the training schema |

Next Steps

Continue to Data Quality Exploration to learn how to profile your data and detect quality issues before they affect your models and dashboards.