MATIH Platform is in active MVP development. Documentation reflects current implementation status.
5. Quickstart Tutorials
Tutorial: Training a Model

In this tutorial, you will use the MATIH ML Workbench to train a machine learning model that predicts customer churn. You will prepare data, select an algorithm, train the model, evaluate its performance, and register it in the model registry for deployment.


What You Will Learn

  • How to navigate the ML Workbench interface
  • How to create an ML experiment and prepare training data
  • How to configure and train a classification model
  • How to evaluate model performance with standard metrics
  • How to register a trained model in the MLflow model registry
  • How to deploy a registered model for inference

Prerequisites

| Requirement | How to Verify |
|---|---|
| MATIH platform running | `./scripts/tools/platform-status.sh` returns healthy |
| ML service operational | Health check on `ml-service` passes |
| MLflow tracking server | MLflow UI accessible |
| Sample data loaded | The `customers`, `orders`, and `returns` tables are available |

Step 1: Open the ML Workbench

Navigate to the ML Workbench:

  • Local development: http://localhost:3001
  • Cloud deployment: https://ml.{your-tenant}.matih.ai

Log in with your tenant credentials. The ML Workbench home page shows:

| Section | Description |
|---|---|
| Experiments | List of ML experiments and their runs |
| Models | Model registry with registered models |
| Datasets | Available datasets for training |
| Notebooks | Interactive Jupyter-style notebooks |
| Pipelines | Automated ML pipelines |

Step 2: Create a New Experiment

  1. Click Experiments in the sidebar.
  2. Click New Experiment.
  3. Configure the experiment:
    • Name: Customer Churn Prediction
    • Description: Predict which customers are likely to churn based on purchase history and demographics
    • Tags: churn, classification, retail
  4. Click Create.

An experiment in MLflow is a container for multiple training runs. Each run tracks parameters, metrics, and artifacts.


Step 3: Prepare the Training Data

Option A: Using the Data Preparation UI

  1. In the experiment view, click Prepare Data.
  2. Select SQL Query as the data source.
  3. Enter the feature engineering query:
SELECT
    c.id AS customer_id,
    c.segment,
    c.age,
    c.region,
    COALESCE(order_stats.total_orders, 0) AS total_orders,
    COALESCE(order_stats.total_spend, 0) AS total_spend,
    COALESCE(order_stats.avg_order_value, 0) AS avg_order_value,
    COALESCE(order_stats.days_since_last_order, 999) AS days_since_last_order,
    COALESCE(return_stats.return_count, 0) AS return_count,
    COALESCE(return_stats.return_rate, 0) AS return_rate,
    CASE
        WHEN order_stats.days_since_last_order > 90 THEN 1
        ELSE 0
    END AS churned
FROM customers c
LEFT JOIN (
    SELECT
        customer_id,
        COUNT(*) AS total_orders,
        SUM(total_amount) AS total_spend,
        AVG(total_amount) AS avg_order_value,
        EXTRACT(DAY FROM CURRENT_DATE - MAX(order_date)) AS days_since_last_order
    FROM orders
    GROUP BY customer_id
) order_stats ON c.id = order_stats.customer_id
LEFT JOIN (
    SELECT
        o.customer_id,
        COUNT(r.id) AS return_count,
        COUNT(r.id)::FLOAT / NULLIF(COUNT(o.id), 0) AS return_rate
    FROM orders o
    LEFT JOIN returns r ON o.id = r.order_id
    GROUP BY o.customer_id
) return_stats ON c.id = return_stats.customer_id
  4. Click Preview to verify the data (shows the first 100 rows).
  5. Configure the dataset:
    • Name: churn_features_v1
    • Target column: churned
    • Feature columns: All except customer_id and churned
    • Train/Test split: 80% / 20%
    • Stratify by: churned (to maintain class balance)
  6. Click Save Dataset.

Option B: Using a Notebook

Create a new notebook and use Python for data preparation:

import pandas as pd
from matih_ml import DataSource, Dataset
 
# Connect to the tenant's data source
ds = DataSource.from_tenant()
 
# Execute the feature query
df = ds.query("""
    SELECT
        c.id AS customer_id,
        c.segment,
        c.age,
        c.region,
        -- ... (same query as above)
    FROM customers c
    LEFT JOIN ...
""")
 
# Explore the data
print(f"Shape: {df.shape}")
print(f"Churn rate: {df['churned'].mean():.2%}")
print(df.describe())
 
# Save as a dataset
dataset = Dataset.create(
    name="churn_features_v1",
    dataframe=df,
    target="churned",
    test_size=0.2,
    stratify="churned"
)

Step 4: Configure the Model

  1. In the experiment view, click New Run.
  2. Select the dataset: churn_features_v1.
  3. Choose the algorithm:

Available Algorithms

| Algorithm | Type | Best For |
|---|---|---|
| Logistic Regression | Classification | Interpretable baseline |
| Random Forest | Classification | Robust, handles mixed features |
| Gradient Boosting (XGBoost) | Classification | High performance |
| LightGBM | Classification | Fast training on large datasets |
| Support Vector Machine | Classification | High-dimensional data |
| Neural Network (MLP) | Classification | Complex non-linear patterns |

For this tutorial, select Random Forest.

  4. Configure hyperparameters:

| Parameter | Value | Description |
|---|---|---|
| n_estimators | 100 | Number of trees in the forest |
| max_depth | 10 | Maximum tree depth |
| min_samples_split | 5 | Minimum samples to split a node |
| min_samples_leaf | 2 | Minimum samples in a leaf node |
| class_weight | balanced | Handle class imbalance |

  5. Configure categorical encoding:
    • segment: One-hot encoding
    • region: One-hot encoding
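This configuration corresponds roughly to the following scikit-learn pipeline: one-hot encoding for the two categorical columns, the numeric features passed through, and a RandomForestClassifier with the hyperparameters above. The tiny DataFrame is synthetic, for illustration only:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the churn_features_v1 dataset
df = pd.DataFrame({
    "segment": ["Premium", "Standard", "Premium", "Standard"] * 25,
    "region": ["West", "East", "North", "South"] * 25,
    "total_orders": range(100),
    "churned": [0, 1] * 50,
})

preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["segment", "region"])],
    remainder="passthrough",  # numeric features pass through unchanged
)
model = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestClassifier(
        n_estimators=100,
        max_depth=10,
        min_samples_split=5,
        min_samples_leaf=2,
        class_weight="balanced",
        random_state=42,
    )),
])
model.fit(df.drop(columns="churned"), df["churned"])
```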

Step 5: Train the Model

Click Start Training. The ML Workbench submits the training job to the ML service.

Training Progress

The training progress view shows:

| Metric | Real-Time Display |
|---|---|
| Training progress | Progress bar with percentage |
| Current accuracy | Updated per epoch/iteration |
| Training loss | Loss curve chart |
| Resource usage | CPU and memory consumption |
| Estimated time remaining | Based on current pace |

Training a Random Forest on 5,000 samples typically completes in under 30 seconds.

What Happens Behind the Scenes

  1. The ML service retrieves the dataset from the feature store.
  2. Categorical features are encoded using the configured strategy.
  3. The data is split into train (80%) and test (20%) sets.
  4. The model is trained using scikit-learn's RandomForestClassifier.
  5. All parameters, metrics, and the model artifact are logged to MLflow.
  6. The model is serialized and stored in the artifact store (MinIO/S3).
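The core of this sequence can be sketched with scikit-learn directly. This is a condensed, stand-alone illustration on synthetic data (MLflow logging and the MinIO/S3 upload are omitted), not the platform's actual implementation:

```python
import pickle
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in for the feature-store dataset (synthetic features and labels)
X, y = make_classification(n_samples=1000, n_features=9, random_state=42)

# 80/20 split, stratified to preserve the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

# Serialize the model artifact (the platform stores this in MinIO/S3)
artifact = pickle.dumps(model)
```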

Step 6: Evaluate the Model

After training completes, the evaluation view appears automatically.

Classification Metrics

| Metric | Value | Interpretation |
|---|---|---|
| Accuracy | 0.87 | 87% of predictions are correct |
| Precision | 0.82 | 82% of predicted churners actually churned |
| Recall | 0.79 | 79% of actual churners were identified |
| F1 Score | 0.80 | Harmonic mean of precision and recall |
| AUC-ROC | 0.91 | Excellent discrimination ability |

Confusion Matrix

|  | Predicted: Not Churned | Predicted: Churned |
|---|---|---|
| Actual: Not Churned | 602 (TN) | 59 (FP) |
| Actual: Churned | 71 (FN) | 268 (TP) |
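The metrics above are all derived from the four confusion-matrix counts. A minimal sketch of the arithmetic (the counts in the example call are illustrative, not the tutorial's values):

```python
def metrics_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive the standard classification metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # of predicted churners, how many actually churned
    recall = tp / (tp + fn)      # of actual churners, how many were identified
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Illustrative counts on a 200-row test set
m = metrics_from_confusion(tn=85, fp=15, fn=20, tp=80)
```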

Feature Importance

The workbench displays a ranked list of feature importance:

| Feature | Importance |
|---|---|
| days_since_last_order | 0.32 |
| total_orders | 0.18 |
| total_spend | 0.15 |
| avg_order_value | 0.12 |
| return_rate | 0.09 |
| age | 0.06 |
| return_count | 0.04 |
| segment (encoded) | 0.03 |
| region (encoded) | 0.01 |
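For Random Forests, this ranking comes from the fitted estimator's `feature_importances_` attribute, paired with the column names and sorted. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

feature_names = [
    "days_since_last_order", "total_orders", "total_spend", "avg_order_value",
    "return_rate", "age", "return_count", "segment_enc", "region_enc",
]
# Synthetic stand-in for the encoded training data
X, y = make_classification(n_samples=500, n_features=len(feature_names),
                           random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Pair each name with its importance and sort descending
ranking = sorted(zip(feature_names, rf.feature_importances_),
                 key=lambda pair: pair[1], reverse=True)
```

The importances are normalized, so they sum to 1.0 across all features.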

ROC Curve

The ROC curve is displayed interactively. You can hover over points to see the threshold, TPR, and FPR at each point.
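The curve itself is computed from the model's predicted probabilities: one (FPR, TPR) point per threshold. A minimal scikit-learn sketch with hand-picked example scores:

```python
import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted churn probabilities

# fpr/tpr at each threshold; the thresholds are what the UI shows on hover
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)
```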


Step 7: Compare Runs (Optional)

Train additional models with different algorithms or hyperparameters, then compare:

  1. Click New Run and train a Gradient Boosting model.
  2. After training, go to the Experiment view.
  3. Select multiple runs and click Compare.

The comparison view shows:

| Run | Algorithm | Accuracy | F1 | AUC | Training Time |
|---|---|---|---|---|---|
| Run 1 | Random Forest | 0.87 | 0.80 | 0.91 | 28s |
| Run 2 | XGBoost | 0.89 | 0.83 | 0.93 | 45s |
| Run 3 | Logistic Regression | 0.81 | 0.72 | 0.85 | 3s |

Step 8: Register the Best Model

Once you have identified the best model:

  1. Open the run details for the winning model.
  2. Click Register Model.
  3. Configure:
    • Model name: customer-churn-predictor
    • Version: Auto-assigned (v1)
    • Description: Random Forest classifier for predicting customer churn based on purchase behavior
    • Tags: production-candidate, churn, retail
  4. Click Register.

The model is now in the MLflow model registry with version tracking.

Model Registry Stages

| Stage | Description |
|---|---|
| None | Newly registered, not assigned to any stage |
| Staging | Being validated for production use |
| Production | Actively serving predictions |
| Archived | Retired from active use |

Transition the model to staging:

  1. In the Model Registry, find customer-churn-predictor.
  2. Click on Version 1.
  3. Click Transition to Staging.
  4. Add a note: "Passed initial evaluation with AUC 0.91."

Step 9: Test Inference

Test the registered model with sample data:

  1. In the model version view, click Test Prediction.
  2. Enter feature values:
{
  "segment": "Premium",
  "age": 42,
  "region": "West",
  "total_orders": 3,
  "total_spend": 450.00,
  "avg_order_value": 150.00,
  "days_since_last_order": 120,
  "return_count": 1,
  "return_rate": 0.33
}
  3. Click Predict.

Result:

{
  "prediction": 1,
  "probability": {
    "not_churned": 0.23,
    "churned": 0.77
  },
  "model": "customer-churn-predictor",
  "version": 1
}

This customer has a 77% probability of churning.
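The same prediction can be requested over REST. MLflow model serving accepts a `dataframe_split` payload; the sketch below only builds that payload, and the endpoint URL in the comment is illustrative:

```python
import json

features = {
    "segment": "Premium", "age": 42, "region": "West",
    "total_orders": 3, "total_spend": 450.00, "avg_order_value": 150.00,
    "days_since_last_order": 120, "return_count": 1, "return_rate": 0.33,
}

# MLflow's scoring protocol: column names plus rows of values
payload = json.dumps({
    "dataframe_split": {
        "columns": list(features.keys()),
        "data": [list(features.values())],
    }
})

# POST to the serving endpoint (not executed here; URL is illustrative):
# requests.post("http://localhost:5002/invocations",
#               headers={"Content-Type": "application/json"}, data=payload)
```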


Step 10: Deploy for Batch Scoring

To score all current customers:

  1. Go to Pipelines in the sidebar.
  2. Click Create Pipeline.
  3. Select Batch Scoring template.
  4. Configure:
    • Model: customer-churn-predictor (Staging)
    • Input: The same feature query from Step 3
    • Output table: churn_predictions
    • Schedule: Daily at 6:00 AM
  5. Click Create and Run.

The pipeline executes the feature query, runs inference, and writes predictions to the output table:

SELECT * FROM churn_predictions ORDER BY churn_probability DESC LIMIT 10;
| customer_id | churn_probability | prediction | scored_at |
|---|---|---|---|
| cust_4823 | 0.94 | 1 | 2026-02-12 06:00:00 |
| cust_1247 | 0.91 | 1 | 2026-02-12 06:00:00 |
| cust_3891 | 0.88 | 1 | 2026-02-12 06:00:00 |

MLflow Integration

The ML Workbench is powered by MLflow for experiment tracking and model management:

| MLflow Component | MATIH Usage |
|---|---|
| Tracking | Log parameters, metrics, and artifacts for every run |
| Models | Model registry with version control and stage transitions |
| Projects | Reproducible training definitions |
| Model Serving | REST API for real-time inference |

Access the MLflow UI directly:

  • Local: http://localhost:5000
  • Cloud: https://mlflow.{your-tenant}.matih.ai

Troubleshooting

| Issue | Cause | Resolution |
|---|---|---|
| "Dataset not found" | Data query failed | Check data source connectivity and query syntax |
| Training hangs | Insufficient resources | Reduce dataset size or simplify the model |
| Low accuracy | Poor features or class imbalance | Engineer better features, try class_weight='balanced' |
| "Model registration failed" | MLflow tracking server down | Check MLflow service health |
| Batch scoring fails | Feature schema mismatch | Ensure the input query matches the training schema |

Next Steps

Continue to Data Quality Exploration to learn how to profile your data and detect quality issues before they affect your models and dashboards.