Tutorial: Training a Model
In this tutorial, you will use the MATIH ML Workbench to train a machine learning model that predicts customer churn. You will prepare data, select an algorithm, train the model, evaluate its performance, and register it in the model registry for deployment.
What You Will Learn
- How to navigate the ML Workbench interface
- How to create an ML experiment and prepare training data
- How to configure and train a classification model
- How to evaluate model performance with standard metrics
- How to register a trained model in the MLflow model registry
- How to deploy a registered model for inference
Prerequisites
| Requirement | How to Verify |
|---|---|
| MATIH platform running | ./scripts/tools/platform-status.sh returns healthy |
| ML service operational | Health check on ml-service passes |
| MLflow tracking server | MLflow UI accessible |
| Sample data loaded | The customers and orders tables are available |
Step 1: Open the ML Workbench
Navigate to the ML Workbench:
- Local development: http://localhost:3001
- Cloud deployment: https://ml.{your-tenant}.matih.ai
Log in with your tenant credentials. The ML Workbench home page shows:
| Section | Description |
|---|---|
| Experiments | List of ML experiments and their runs |
| Models | Model registry with registered models |
| Datasets | Available datasets for training |
| Notebooks | Interactive Jupyter-style notebooks |
| Pipelines | Automated ML pipelines |
Step 2: Create a New Experiment
- Click Experiments in the sidebar.
- Click New Experiment.
- Configure the experiment:
- Name: Customer Churn Prediction
- Description: Predict which customers are likely to churn based on purchase history and demographics
- Tags: churn, classification, retail
- Click Create.
An experiment in MLflow is a container for multiple training runs. Each run tracks parameters, metrics, and artifacts.
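The experiment/run hierarchy can be pictured with plain data structures (an illustrative sketch only, not MLflow's internals; the `log_run` helper is hypothetical and stands in for what the Workbench does when a run is recorded):

```python
# Illustrative model of the MLflow tracking hierarchy: an experiment is a
# container of runs, and each run records its own params, metrics, artifacts.
experiment = {
    "name": "Customer Churn Prediction",
    "tags": ["churn", "classification", "retail"],
    "runs": [],
}

def log_run(experiment, params, metrics, artifacts):
    # Hypothetical helper -- each run gets its own id and payload.
    run = {
        "run_id": len(experiment["runs"]) + 1,
        "params": params,
        "metrics": metrics,
        "artifacts": artifacts,
    }
    experiment["runs"].append(run)
    return run

run = log_run(
    experiment,
    params={"n_estimators": 100, "max_depth": 10},
    metrics={"auc": 0.91},
    artifacts=["model.pkl"],
)
```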
Step 3: Prepare the Training Data
Option A: Using the Data Preparation UI
- In the experiment view, click Prepare Data.
- Select SQL Query as the data source.
- Enter the feature engineering query:
```sql
SELECT
    c.id AS customer_id,
    c.segment,
    c.age,
    c.region,
    COALESCE(order_stats.total_orders, 0) AS total_orders,
    COALESCE(order_stats.total_spend, 0) AS total_spend,
    COALESCE(order_stats.avg_order_value, 0) AS avg_order_value,
    COALESCE(order_stats.days_since_last_order, 999) AS days_since_last_order,
    COALESCE(return_stats.return_count, 0) AS return_count,
    COALESCE(return_stats.return_rate, 0) AS return_rate,
    CASE
        -- COALESCE so customers with no orders (NULL) are also labeled churned
        WHEN COALESCE(order_stats.days_since_last_order, 999) > 90 THEN 1
        ELSE 0
    END AS churned
FROM customers c
LEFT JOIN (
    SELECT
        customer_id,
        COUNT(*) AS total_orders,
        SUM(total_amount) AS total_spend,
        AVG(total_amount) AS avg_order_value,
        EXTRACT(DAY FROM CURRENT_DATE - MAX(order_date)) AS days_since_last_order
    FROM orders
    GROUP BY customer_id
) order_stats ON c.id = order_stats.customer_id
LEFT JOIN (
    SELECT
        o.customer_id,
        COUNT(r.id) AS return_count,
        COUNT(r.id)::FLOAT / NULLIF(COUNT(o.id), 0) AS return_rate
    FROM orders o
    LEFT JOIN returns r ON o.id = r.order_id
    GROUP BY o.customer_id
) return_stats ON c.id = return_stats.customer_id;
```
- Click Preview to verify the data (shows the first 100 rows).
- Configure the dataset:
  - Name: churn_features_v1
  - Target column: churned
  - Feature columns: All except customer_id and churned
  - Train/Test split: 80% / 20%
  - Stratify by: churned (to maintain class balance)
- Click Save Dataset.
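The churned label produced by the feature query can be sanity-checked quickly in pandas (a toy frame with hypothetical values; 999 mirrors the COALESCE default for customers with no orders):

```python
import pandas as pd

# Toy data illustrating the labeling rule: a customer is marked churned
# when their last order is more than 90 days old.
df = pd.DataFrame({
    "customer_id": ["c1", "c2", "c3"],
    "days_since_last_order": [12, 95, 999],  # 999 = no orders on record
})
df["churned"] = (df["days_since_last_order"] > 90).astype(int)
```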
Option B: Using a Notebook
Create a new notebook and use Python for data preparation:
```python
import pandas as pd
from matih_ml import DataSource, Dataset

# Connect to the tenant's data source
ds = DataSource.from_tenant()

# Execute the feature query
df = ds.query("""
    SELECT
        c.id AS customer_id,
        c.segment,
        c.age,
        c.region,
        -- ... (same query as above)
    FROM customers c
    LEFT JOIN ...
""")

# Explore the data
print(f"Shape: {df.shape}")
print(f"Churn rate: {df['churned'].mean():.2%}")
print(df.describe())

# Save as a dataset
dataset = Dataset.create(
    name="churn_features_v1",
    dataframe=df,
    target="churned",
    test_size=0.2,
    stratify="churned"
)
```
Step 4: Configure the Model
- In the experiment view, click New Run.
- Select the dataset: churn_features_v1.
- Choose the algorithm:
Available Algorithms
| Algorithm | Type | Best For |
|---|---|---|
| Logistic Regression | Classification | Interpretable baseline |
| Random Forest | Classification | Robust, handles mixed features |
| Gradient Boosting (XGBoost) | Classification | High performance |
| LightGBM | Classification | Fast training on large datasets |
| Support Vector Machine | Classification | High-dimensional data |
| Neural Network (MLP) | Classification | Complex non-linear patterns |
For this tutorial, select Random Forest.
- Configure hyperparameters:
| Parameter | Value | Description |
|---|---|---|
| n_estimators | 100 | Number of trees in the forest |
| max_depth | 10 | Maximum tree depth |
| min_samples_split | 5 | Minimum samples to split a node |
| min_samples_leaf | 2 | Minimum samples in a leaf node |
| class_weight | balanced | Handle class imbalance |
- Configure categorical encoding:
- segment: One-hot encoding
- region: One-hot encoding
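The configuration above can be sketched as a scikit-learn pipeline (a sketch only: scikit-learn is the backend named in Step 5, but the preprocessing layout here is illustrative, not the Workbench's exact internals):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the two categorical columns, pass numeric features through,
# then train a Random Forest with the tutorial's hyperparameters.
preprocess = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"), ["segment", "region"])],
    remainder="passthrough",
)
model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=2,
        class_weight="balanced",
        random_state=42,
    )),
])
```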
Step 5: Train the Model
Click Start Training. The ML Workbench submits the training job to the ML service.
Training Progress
The training progress view shows:
| Metric | Real-Time Display |
|---|---|
| Training progress | Progress bar with percentage |
| Current accuracy | Updated per epoch/iteration |
| Training loss | Loss curve chart |
| Resource usage | CPU and memory consumption |
| Estimated time remaining | Based on current pace |
Training a Random Forest on 5,000 samples typically completes in under 30 seconds.
What Happens Behind the Scenes
- The ML service retrieves the dataset from the feature store.
- Categorical features are encoded using the configured strategy.
- The data is split into train (80%) and test (20%) sets.
- The model is trained using scikit-learn's RandomForestClassifier.
- All parameters, metrics, and the model artifact are logged to MLflow.
- The model is serialized and stored in the artifact store (MinIO/S3).
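The 80/20 stratified split in step 3 above can be sketched with scikit-learn (toy data below; the real split operates on the churn_features_v1 dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a balanced target to show stratification.
df = pd.DataFrame({
    "total_orders": range(10),
    "churned": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})
X, y = df.drop(columns=["churned"]), df["churned"]

# Stratifying on the target keeps the churn rate equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```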
Step 6: Evaluate the Model
After training completes, the evaluation view appears automatically.
Classification Metrics
| Metric | Value | Interpretation |
|---|---|---|
| Accuracy | 0.87 | 87% of predictions are correct |
| Precision | 0.82 | 82% of predicted churners actually churned |
| Recall | 0.79 | 79% of actual churners were identified |
| F1 Score | 0.80 | Harmonic mean of precision and recall |
| AUC-ROC | 0.91 | Excellent discrimination ability |
Confusion Matrix
| | Predicted: Not Churned | Predicted: Churned |
|---|---|---|
| Actual: Not Churned | 720 (TN) | 30 (FP) |
| Actual: Churned | 53 (FN) | 197 (TP) |
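Each headline metric is a simple function of the four confusion-matrix counts. A minimal sketch (the example figures in this tutorial are illustrative, so derived values need not match the metrics table exactly):

```python
# Derive standard classification metrics from confusion-matrix counts.
def classification_metrics(tn, fp, fn, tp):
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts from the matrix above: recall = 197 / (197 + 53) = 0.79
accuracy, precision, recall, f1 = classification_metrics(tn=720, fp=30, fn=53, tp=197)
```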
Feature Importance
The workbench displays a ranked list of feature importance:
| Feature | Importance |
|---|---|
| days_since_last_order | 0.32 |
| total_orders | 0.18 |
| total_spend | 0.15 |
| avg_order_value | 0.12 |
| return_rate | 0.09 |
| age | 0.06 |
| return_count | 0.04 |
| segment (encoded) | 0.03 |
| region (encoded) | 0.01 |
ROC Curve
The ROC curve is displayed interactively. You can hover over points to see the threshold, TPR, and FPR at each point.
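What the interactive view plots can be reproduced with scikit-learn on toy scores (illustrative values, not this model's actual outputs):

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted churn probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# roc_curve sweeps the decision threshold and returns the FPR/TPR at each one;
# roc_auc_score summarizes the curve as a single area value.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
```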
Step 7: Compare Runs (Optional)
Train additional models with different algorithms or hyperparameters, then compare:
- Click New Run and train a Gradient Boosting model.
- After training, go to the Experiment view.
- Select multiple runs and click Compare.
The comparison view shows:
| Run | Algorithm | Accuracy | F1 | AUC | Training Time |
|---|---|---|---|---|---|
| Run 1 | Random Forest | 0.87 | 0.80 | 0.91 | 28s |
| Run 2 | XGBoost | 0.89 | 0.83 | 0.93 | 45s |
| Run 3 | Logistic Regression | 0.81 | 0.72 | 0.85 | 3s |
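Picking a winner from a comparison like this is a one-line reduction (a sketch over plain dicts mirroring the table above, not the Workbench API):

```python
# Runs from the comparison table, keyed by the metric you care about.
runs = [
    {"run": "Run 1", "algorithm": "Random Forest", "auc": 0.91},
    {"run": "Run 2", "algorithm": "XGBoost", "auc": 0.93},
    {"run": "Run 3", "algorithm": "Logistic Regression", "auc": 0.85},
]

# Select the run with the highest AUC.
best = max(runs, key=lambda r: r["auc"])
```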
Step 8: Register the Best Model
Once you have identified the best model:
- Open the run details for the winning model.
- Click Register Model.
- Configure:
  - Model name: customer-churn-predictor
  - Version: Auto-assigned (v1)
  - Description: Random Forest classifier for predicting customer churn based on purchase behavior
  - Tags: production-candidate, churn, retail
- Click Register.
The model is now in the MLflow model registry with version tracking.
Model Registry Stages
| Stage | Description |
|---|---|
| None | Newly registered, not assigned to any stage |
| Staging | Being validated for production use |
| Production | Actively serving predictions |
| Archived | Retired from active use |
Transition the model to staging:
- In the Model Registry, find customer-churn-predictor.
- Click on Version 1.
- Click Transition to Staging.
- Add a note: "Passed initial evaluation with AUC 0.91."
Step 9: Test Inference
Test the registered model with sample data:
- In the model version view, click Test Prediction.
- Enter feature values:
```json
{
  "segment": "Premium",
  "age": 42,
  "region": "West",
  "total_orders": 3,
  "total_spend": 450.00,
  "avg_order_value": 150.00,
  "days_since_last_order": 120,
  "return_count": 1,
  "return_rate": 0.33
}
```
- Click Predict.
Result:
```json
{
  "prediction": 1,
  "probability": {
    "not_churned": 0.23,
    "churned": 0.77
  },
  "model": "customer-churn-predictor",
  "version": 1
}
```
This customer has a 77% probability of churning.
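The hard prediction in the response follows from the class probabilities; a sketch assuming the default 0.5 decision threshold:

```python
# Probabilities from the response above; predict churn when the churn
# probability crosses the (assumed) 0.5 threshold.
probability = {"not_churned": 0.23, "churned": 0.77}
prediction = 1 if probability["churned"] >= 0.5 else 0
```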
Step 10: Deploy for Batch Scoring
To score all current customers:
- Go to Pipelines in the sidebar.
- Click Create Pipeline.
- Select Batch Scoring template.
- Configure:
  - Model: customer-churn-predictor (Staging)
  - Input: The same feature query from Step 3
  - Output table: churn_predictions
  - Schedule: Daily at 6:00 AM
- Click Create and Run.
The pipeline executes the feature query, runs inference, and writes predictions to the output table:
```sql
SELECT * FROM churn_predictions ORDER BY churn_probability DESC LIMIT 10;
```
| customer_id | churn_probability | prediction | scored_at |
|---|---|---|---|
| cust_4823 | 0.94 | 1 | 2026-02-12 06:00:00 |
| cust_1247 | 0.91 | 1 | 2026-02-12 06:00:00 |
| cust_3891 | 0.88 | 1 | 2026-02-12 06:00:00 |
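The scoring step can be sketched in pandas (score_batch and the model interface here are hypothetical, not the MATIH pipeline API; the output shape mirrors churn_predictions):

```python
from datetime import datetime, timezone

import pandas as pd

# Hypothetical batch-scoring step: run the model over the feature frame
# and emit one row per customer, shaped like the churn_predictions table.
def score_batch(model, features: pd.DataFrame) -> pd.DataFrame:
    proba = model.predict_proba(features.drop(columns=["customer_id"]))[:, 1]
    return pd.DataFrame({
        "customer_id": features["customer_id"],
        "churn_probability": proba,
        "prediction": (proba >= 0.5).astype(int),
        "scored_at": datetime.now(timezone.utc),
    })
```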
MLflow Integration
The ML Workbench is powered by MLflow for experiment tracking and model management:
| MLflow Component | MATIH Usage |
|---|---|
| Tracking | Log parameters, metrics, and artifacts for every run |
| Models | Model registry with version control and stage transitions |
| Projects | Reproducible training definitions |
| Model Serving | REST API for real-time inference |
Access the MLflow UI directly:
- Local: http://localhost:5000
- Cloud: https://mlflow.{your-tenant}.matih.ai
Troubleshooting
| Issue | Cause | Resolution |
|---|---|---|
| "Dataset not found" | Data query failed | Check data source connectivity and query syntax |
| Training hangs | Insufficient resources | Reduce dataset size or simplify the model |
| Low accuracy | Poor features or class imbalance | Engineer better features, try class_weight='balanced' |
| "Model registration failed" | MLflow tracking server down | Check MLflow service health |
| Batch scoring fails | Feature schema mismatch | Ensure the input query matches the training schema |
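The last row's check can be automated with a quick schema comparison before scoring (a hypothetical helper with illustrative column lists):

```python
# Compare the scoring query's columns with the columns seen at training time;
# any missing or extra columns will make batch scoring fail.
def schema_mismatch(training_cols, scoring_cols):
    missing = sorted(set(training_cols) - set(scoring_cols))
    extra = sorted(set(scoring_cols) - set(training_cols))
    return missing, extra

missing, extra = schema_mismatch(
    ["segment", "age", "total_orders"],
    ["segment", "age", "total_spend"],
)
```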
Next Steps
Continue to Data Quality Exploration to learn how to profile your data and detect quality issues before they affect your models and dashboards.