MATIH Platform is in active MVP development. Documentation reflects current implementation status.
21. Industry Examples & Walkthroughs
Retail & E-Commerce
Data Scientist Journey

Data Scientist Journey: Predicting Customer Churn

Persona: Priya, Senior Data Scientist at NovaMart
Objective: Build and deploy a customer churn prediction model to reduce NovaMart's 18.3% churn rate
Timeline: 3 weeks from exploration to production
Primary Workbenches: Data Workbench, ML Workbench


Stage 1: Ingestion

Priya starts in the Data Workbench by connecting NovaMart's primary transactional database and importing supplementary data.

Connecting the PostgreSQL Source

She navigates to Data Workbench > Sources > Add Connection and selects the Airbyte PostgreSQL connector. The Ingestion Service manages the connection lifecycle.

{
  "connector": "source-postgres",
  "config": {
    "host": "novamart-prod.db.internal",
    "port": 5432,
    "database": "novamart",
    "username": "${POSTGRES_USER}",
    "password": "${POSTGRES_PASSWORD}",
    "schemas": ["public"],
    "replication_method": {
      "method": "CDC",
      "replication_slot": "matih_cdc_slot",
      "publication": "matih_pub"
    },
    "ssl_mode": "require"
  },
  "streams": [
    { "name": "customers", "sync_mode": "incremental", "cursor_field": "updated_at" },
    { "name": "orders", "sync_mode": "incremental", "cursor_field": "updated_at" },
    { "name": "order_items", "sync_mode": "incremental", "cursor_field": "created_at" },
    { "name": "products", "sync_mode": "full_refresh" },
    { "name": "returns", "sync_mode": "incremental", "cursor_field": "return_date" }
  ],
  "schedule": {
    "type": "cron",
    "expression": "0 * * * *"
  }
}

The CDC connector uses PostgreSQL's Write-Ahead Log (WAL) for change capture, meaning Priya gets near-real-time updates without polling the production database.
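For the non-CDC streams (such as `order_items`, cursored on `created_at`), the `cursor_field` incremental pattern is straightforward: emit only rows whose cursor value is newer than the saved state, then advance the state. A minimal Python sketch of that semantics, not the connector's actual implementation:

```python
def incremental_sync(rows, state, cursor_field):
    """Return rows newer than the saved cursor plus the advanced state."""
    cursor = state.get("cursor")
    fresh = [r for r in rows if cursor is None or r[cursor_field] > cursor]
    if fresh:
        state = {**state, "cursor": max(r[cursor_field] for r in fresh)}
    return fresh, state

# First sync sees everything; a re-run with the same batch yields nothing,
# so already-synced rows are never re-read from the production database.
batch = [
    {"order_id": 1, "created_at": "2025-12-01T10:00:00"},
    {"order_id": 2, "created_at": "2025-12-01T11:00:00"},
]
fresh, state = incremental_sync(batch, {}, "created_at")
```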

Importing the Customer Survey CSV

Priya also has a CSV export of customer NPS survey scores from a third-party survey tool. She uses Data Workbench > Sources > File Import to upload it.

File: customer_surveys_2025.csv
Format: CSV (UTF-8, comma-delimited, header row)
Records: 48,217
Columns: customer_id (INT), survey_date (DATE), nps_score (INT), satisfaction_rating (INT), comments (TEXT)
Destination: novamart.surveys.customer_nps

The platform validates the schema, shows a 10-row preview, and ingests the file. The Catalog Service automatically registers the new table with column-level metadata.
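The validation step the upload flow performs can be approximated as a header check plus a bounded preview. A hedged sketch (the expected-column list comes from the file description above; the platform's real validator is not shown here):

```python
import csv
import io

EXPECTED_COLUMNS = ["customer_id", "survey_date", "nps_score",
                    "satisfaction_rating", "comments"]

def validate_csv_header(text, expected=EXPECTED_COLUMNS, preview_rows=10):
    """Reject files whose header differs; return up to a 10-row preview."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if header != expected:
        raise ValueError(f"schema mismatch: {header} != {expected}")
    return [row for _, row in zip(range(preview_rows), reader)]

sample = ("customer_id,survey_date,nps_score,satisfaction_rating,comments\n"
          "1042587,2025-11-02,9,5,Great service\n")
preview = validate_csv_header(sample)
```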


Stage 2: Discovery

With data flowing, Priya explores the catalog to understand what she has to work with.

Browsing the Data Catalog

In Data Workbench > Catalog, she searches for "customer" and finds:

Table               Schema             Columns  Rows   Last Updated  Quality Score
──────────────────────────────────────────────────────────────────────────────────
customers           novamart.public    14       2.1M   12 min ago    87%
customer_nps        novamart.surveys   5        48K    2 hours ago   94%
customer_segments   analytics.derived  8        2.1M   1 day ago     91%

She clicks into the customers table and reviews the data profile:

Column Profiling: novamart.public.customers
─────────────────────────────────────────────────────────
Column            Type       Nulls    Unique     Distribution
─────────────────────────────────────────────────────────
customer_id       BIGINT     0.0%     2,103,847  Uniform
email             VARCHAR    12.1%    1,849,384  -
first_name        VARCHAR    0.2%     148,293    -
last_name         VARCHAR    0.3%     312,847    -
signup_date       DATE       0.0%     1,847      2019-01 to 2025-12
last_order_date   DATE       4.7%     1,612      2020-03 to 2025-12
segment           VARCHAR    0.0%     5          [Gold, Silver, Bronze, New, Inactive]
lifetime_value    DECIMAL    0.0%     -          min=0, median=287, max=14,329
channel           VARCHAR    0.1%     3          [web: 61%, mobile: 28%, store: 11%]
city              VARCHAR    2.3%     4,891      -
state             VARCHAR    1.8%     51         -
created_at        TIMESTAMP  0.0%     -          -
updated_at        TIMESTAMP  0.0%     -          -
is_active         BOOLEAN    0.0%     2          [true: 81.7%, false: 18.3%]

Key finding: The 12.1% null rate on email means Priya cannot use email-based features for all customers. She notes this for feature engineering.

Tracing Data Lineage

She clicks the Lineage tab to see how data flows between tables:

orders ──────┐
             ├──▶ order_items ──▶ product_revenue (derived)
products ────┘

customers ───┬──────────────────▶ customer_segments (derived)
             │                        │
customer_nps ┘                        ▼
                                 churn_features (Priya will create this)

She confirms that order_items joins to both orders (via order_id) and products (via product_id), giving her the full purchase graph she needs.


Stage 3: Query

Priya switches to the Query Editor in the Data Workbench to build her churn feature set using SQL.

Defining Churn

NovaMart defines churn as a customer who has not placed an order in the last 90 days. She writes a feature engineering query that joins PostgreSQL and Snowflake data via the federated Query Engine (Trino):

-- Churn feature set: one row per customer with behavioral features
-- Federated query: PostgreSQL (orders, customers) + Snowflake (clickstream)
CREATE TABLE analytics.derived.churn_features AS
WITH order_features AS (
    SELECT
        c.customer_id,
        c.signup_date,
        c.segment,
        c.channel AS signup_channel,
        COUNT(DISTINCT o.order_id) AS total_orders,
        SUM(o.total_amount) AS total_revenue,
        AVG(o.total_amount) AS avg_order_value,
        MAX(o.order_date) AS last_order_date,
        DATE_DIFF('day', MAX(o.order_date), CURRENT_DATE) AS days_since_last_order,
        DATE_DIFF('day', c.signup_date, CURRENT_DATE) AS customer_age_days,
        COUNT(DISTINCT DATE_TRUNC('month', o.order_date)) AS active_months,
        STDDEV(o.total_amount) AS order_value_stddev
    FROM postgresql.novamart.customers c
    LEFT JOIN postgresql.novamart.orders o ON c.customer_id = o.customer_id
    WHERE c.signup_date < CURRENT_DATE - INTERVAL '90' DAY
    GROUP BY c.customer_id, c.signup_date, c.segment, c.channel
),
clickstream_features AS (
    SELECT
        customer_id,
        COUNT(*) AS total_sessions_90d,
        COUNT(DISTINCT DATE(event_timestamp)) AS active_days_90d,
        SUM(CASE WHEN event_type = 'product_view' THEN 1 ELSE 0 END) AS product_views_90d,
        SUM(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END) AS cart_adds_90d,
        SUM(CASE WHEN event_type = 'search' THEN 1 ELSE 0 END) AS searches_90d
    FROM snowflake.analytics.clickstream
    WHERE event_timestamp >= CURRENT_DATE - INTERVAL '90' DAY
    GROUP BY customer_id
),
return_features AS (
    SELECT
        o.customer_id,
        COUNT(r.return_id) AS total_returns,
        SUM(r.refund_amount) AS total_refund_amount,
        CAST(COUNT(r.return_id) AS DOUBLE) / NULLIF(COUNT(DISTINCT o.order_id), 0) AS return_rate
    FROM postgresql.novamart.orders o
    LEFT JOIN postgresql.novamart.returns r ON o.order_id = r.order_id
    GROUP BY o.customer_id
)
SELECT
    of.*,
    COALESCE(cf.total_sessions_90d, 0) AS total_sessions_90d,
    COALESCE(cf.active_days_90d, 0) AS active_days_90d,
    COALESCE(cf.product_views_90d, 0) AS product_views_90d,
    COALESCE(cf.cart_adds_90d, 0) AS cart_adds_90d,
    COALESCE(cf.searches_90d, 0) AS searches_90d,
    COALESCE(rf.total_returns, 0) AS total_returns,
    COALESCE(rf.return_rate, 0) AS return_rate,
    -- Label: churned if no order in 90 days
    CASE WHEN of.days_since_last_order > 90 THEN 1 ELSE 0 END AS is_churned
FROM order_features of
LEFT JOIN clickstream_features cf ON of.customer_id = cf.customer_id
LEFT JOIN return_features rf ON of.customer_id = rf.customer_id;

Query result: 1,987,432 rows, 21 columns. Execution time: 34 seconds (federated across PostgreSQL and Snowflake via the Trino query engine).
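The label in the final SELECT reduces to a one-line rule: churned if the last order is more than 90 days old. A Python sketch that mirrors the SQL CASE expression, including its edge case, where a customer with no orders has a NULL `days_since_last_order`, fails the `> 90` comparison, and falls through to "not churned":

```python
def churn_label(days_since_last_order):
    """1 if the last order is more than 90 days old, else 0.

    Mirrors the SQL CASE expression: None (a customer with no orders)
    does not satisfy `> 90`, so it falls through to 0.
    """
    if days_since_last_order is None:
        return 0
    return 1 if days_since_last_order > 90 else 0
```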


Stage 4: Orchestration

Priya needs this feature table refreshed daily. She creates a Temporal pipeline in the Data Workbench > Pipelines section.

Pipeline DAG

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Extract        │     │  Quality        │     │  Load to        │
│  Customer       │────▶│  Validation     │────▶│  Feature        │
│  Features       │     │  Gate           │     │  Store          │
│                 │     │                 │     │                 │
│  Run churn SQL  │     │  Check nulls,   │     │  Write to ML    │
│  query above    │     │  duplicates,    │     │  Workbench      │
│  against live   │     │  ranges         │     │  feature store  │
│  sources        │     │                 │     │                 │
└─────────────────┘     └─────────────────┘     └─────────────────┘

Pipeline Configuration

{
  "pipeline_id": "churn-feature-refresh",
  "name": "Daily Churn Feature Refresh",
  "schedule": "0 4 * * *",
  "tasks": [
    {
      "task_id": "extract_features",
      "type": "sql_execute",
      "config": {
        "query_ref": "analytics.queries.churn_feature_set",
        "target_table": "analytics.derived.churn_features",
        "write_mode": "overwrite"
      },
      "retry_policy": {
        "max_attempts": 3,
        "backoff_coefficient": 2,
        "initial_interval": "30s"
      }
    },
    {
      "task_id": "quality_gate",
      "type": "data_quality_check",
      "depends_on": ["extract_features"],
      "config": {
        "dataset": "analytics.derived.churn_features",
        "suite": "churn_features_validation",
        "fail_pipeline_on_warning": false,
        "fail_pipeline_on_error": true
      }
    },
    {
      "task_id": "load_feature_store",
      "type": "feature_store_sync",
      "depends_on": ["quality_gate"],
      "config": {
        "source_table": "analytics.derived.churn_features",
        "feature_group": "customer_churn_features",
        "entity_key": "customer_id",
        "timestamp_column": "last_order_date"
      }
    }
  ],
  "notifications": {
    "on_failure": { "channel": "slack", "target": "#data-alerts" },
    "on_success": { "channel": "slack", "target": "#data-pipelines" }
  }
}
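With `max_attempts: 3`, `initial_interval: "30s"`, and `backoff_coefficient: 2`, exponential backoff waits 30s before the second attempt and 60s before the third. A sketch of the interval schedule implied by the retry policy (an illustration of the semantics, not Temporal's SDK):

```python
def retry_intervals(max_attempts, initial_interval_s, backoff_coefficient):
    """Seconds to wait before each retry; the first attempt has no wait."""
    return [initial_interval_s * backoff_coefficient ** i
            for i in range(max_attempts - 1)]

# 3 attempts => 2 retries, waiting 30s then 60s.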

Stage 5: Analysis

Before training a model, Priya validates the feature data using the Data Quality Service (powered by Great Expectations).

Quality Rule Configuration

{
  "suite_name": "churn_features_validation",
  "dataset": "analytics.derived.churn_features",
  "expectations": [
    {
      "type": "expect_table_row_count_to_be_between",
      "kwargs": { "min_value": 1800000, "max_value": 2200000 }
    },
    {
      "type": "expect_column_values_to_be_unique",
      "kwargs": { "column": "customer_id" }
    },
    {
      "type": "expect_column_values_to_not_be_null",
      "kwargs": { "column": "customer_id" }
    },
    {
      "type": "expect_column_values_to_be_between",
      "kwargs": { "column": "days_since_last_order", "min_value": 0, "max_value": 3650 }
    },
    {
      "type": "expect_column_values_to_be_between",
      "kwargs": { "column": "avg_order_value", "min_value": 0, "max_value": 50000 }
    },
    {
      "type": "expect_column_proportion_of_unique_values_to_be_between",
      "kwargs": { "column": "customer_id", "min_value": 0.99, "max_value": 1.0 }
    },
    {
      "type": "expect_column_values_to_be_in_set",
      "kwargs": { "column": "segment", "value_set": ["Gold", "Silver", "Bronze", "New", "Inactive"] }
    }
  ]
}
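The core semantics of the range expectations are easy to state precisely. A hedged sketch of how `expect_column_values_to_be_between` evaluates (Great Expectations' real implementation also handles batching, `mostly` thresholds, and richer result payloads; this is only the central check):

```python
def expect_column_values_to_be_between(values, min_value, max_value):
    """Pass iff every non-null value lies in [min_value, max_value]."""
    unexpected = [v for v in values
                  if v is not None and not (min_value <= v <= max_value)]
    return {"success": not unexpected, "unexpected_count": len(unexpected)}
```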

Quality Check Results

Suite: churn_features_validation
Run Date: 2025-12-15 04:12:33 UTC
Status: PASSED (7/7 expectations met)

  [PASS] Row count: 1,987,432 (expected 1.8M - 2.2M)
  [PASS] customer_id uniqueness: 100%
  [PASS] customer_id nulls: 0.0%
  [PASS] days_since_last_order range: all values in [0, 2,187]
  [PASS] avg_order_value range: all values in [4.99, 14,329.00]
  [PASS] customer_id uniqueness proportion: 1.0
  [PASS] segment values: all in expected set

Feature Distribution Profiling

Priya profiles the target variable and discovers an important pattern:

Feature: days_since_last_order
──────────────────────────────
Count:     1,987,432
Mean:      67.3 days
Median:    42.0 days
Std Dev:   89.1 days
Min:       0 days
Max:       2,187 days

Distribution (bimodal):
  0-30 days:   ████████████████████ 41.2%  (active buyers)
  31-60 days:  ████████             16.3%
  61-90 days:  ██████               12.1%
  91-180 days: ████████████         18.7%  (recently churned)
  181+ days:   ██████               11.7%  (long-term inactive)

Churn label distribution:
  Not churned (0): 69.6%  |  Churned (1): 30.4%

The bimodal distribution confirms that customers either buy regularly (< 30 days between orders) or drift away sharply after 90 days. This informs Priya's model design: there is a clear decision boundary.


Stage 6: Productionization

Priya switches to the ML Workbench to train, evaluate, and deploy her model.

Experiment Tracking

She trains three models and logs all experiments to MLflow via the ML Workbench:

Experiment      Model                          Features  AUC    Precision  Recall  F1    Training Time
──────────────────────────────────────────────────────────────────────────────────────────────────────
churn-exp-001   Logistic Regression            All 18    0.823  0.71       0.68    0.69  12s
churn-exp-002   Random Forest                  All 18    0.856  0.76       0.73    0.74  2m 18s
churn-exp-003   Gradient Boosting (LightGBM)   All 18    0.871  0.79       0.77    0.78  4m 42s

Model Registration

She registers the best model (churn-exp-003) in the ML Workbench model registry:

{
  "model_name": "customer-churn-predictor",
  "version": "1.0.0",
  "source_experiment": "churn-exp-003",
  "framework": "lightgbm",
  "metrics": {
    "auc": 0.871,
    "precision": 0.79,
    "recall": 0.77,
    "f1": 0.78
  },
  "feature_set": "customer_churn_features",
  "input_schema": {
    "total_orders": "int",
    "avg_order_value": "float",
    "days_since_last_order": "int",
    "customer_age_days": "int",
    "total_sessions_90d": "int",
    "product_views_90d": "int",
    "return_rate": "float",
    "segment": "string"
  },
  "tags": {
    "team": "data-science",
    "use_case": "customer-retention",
    "owner": "priya@novamart.com"
  }
}

Ray Serve Deployment

She deploys the model to Ray Serve for real-time scoring via the ML Workbench deployment interface:

{
  "deployment_name": "churn-predictor-v1",
  "model_ref": "customer-churn-predictor/1.0.0",
  "runtime": "ray_serve",
  "config": {
    "num_replicas": 2,
    "max_concurrent_queries": 100,
    "ray_actor_options": {
      "num_cpus": 1,
      "memory": 2147483648
    },
    "autoscaling_config": {
      "min_replicas": 2,
      "max_replicas": 6,
      "target_num_ongoing_requests_per_replica": 10
    }
  },
  "health_check": {
    "path": "/health",
    "interval_s": 30
  }
}

After deployment, the model is accessible at the internal endpoint:

# Test the deployed model
curl -X POST https://ml.novamart.matih.internal/serve/churn-predictor-v1/predict \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${JWT_TOKEN}" \
  -d '{
    "customer_id": 1042587,
    "total_orders": 3,
    "avg_order_value": 45.20,
    "days_since_last_order": 78,
    "customer_age_days": 412,
    "total_sessions_90d": 2,
    "product_views_90d": 8,
    "return_rate": 0.33,
    "segment": "Silver"
  }'
 
# Response
{
  "customer_id": 1042587,
  "churn_probability": 0.73,
  "churn_risk": "HIGH",
  "top_factors": [
    { "feature": "days_since_last_order", "importance": 0.31 },
    { "feature": "total_sessions_90d", "importance": 0.22 },
    { "feature": "return_rate", "importance": 0.18 }
  ]
}
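The `churn_risk` field appears to follow the same bands the monitoring dashboard later reports (low < 0.3, medium 0.3–0.7, high > 0.7). A sketch of that bucketing; the exact cutoffs are an assumption inferred from the dashboard, not a documented contract:

```python
def churn_risk(probability):
    """Map a churn probability to the dashboard's risk bands (assumed cutoffs)."""
    if probability > 0.7:
        return "HIGH"
    if probability >= 0.3:
        return "MEDIUM"
    return "LOW"

# churn_risk(0.73) -> "HIGH", matching the example response above.
```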

Stage 7: Feedback

With the model in production, Priya configures monitoring to catch degradation early.

Drift Detection Configuration

In the ML Workbench > Monitoring section, she configures feature and prediction drift alerts:

{
  "monitor_name": "churn-predictor-drift",
  "deployment": "churn-predictor-v1",
  "monitors": [
    {
      "type": "feature_drift",
      "method": "psi",
      "features": ["avg_order_value", "days_since_last_order", "total_sessions_90d"],
      "threshold": 0.1,
      "window": "7d",
      "reference_dataset": "analytics.derived.churn_features_baseline_2025q4"
    },
    {
      "type": "prediction_drift",
      "method": "ks_test",
      "column": "churn_probability",
      "threshold": 0.05,
      "window": "7d"
    },
    {
      "type": "accuracy_tracking",
      "metric": "precision_at_k",
      "k": 1000,
      "ground_truth_table": "analytics.derived.actual_churn_labels",
      "evaluation_schedule": "0 0 * * MON",
      "alert_threshold": 0.70
    }
  ],
  "alerts": {
    "channels": ["slack:#ml-alerts", "email:priya@novamart.com"],
    "severity_mapping": {
      "feature_drift": "warning",
      "prediction_drift": "critical",
      "accuracy_drop": "critical"
    }
  }
}
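The Population Stability Index (PSI) behind the feature-drift monitor compares binned proportions of a feature in the reference dataset against the live window; values below roughly 0.1 (the configured threshold) are conventionally read as "no meaningful drift". A self-contained sketch of the calculation:

```python
import math

def psi(expected_props, actual_props, eps=1e-6):
    """Population Stability Index over matched histogram bins.

    PSI = sum over bins of (actual - expected) * ln(actual / expected).
    Each term is non-negative, so identical distributions score 0.
    """
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e, a = max(e, eps), max(a, eps)   # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

# Binned proportions for days_since_last_order, from the earlier profile.
baseline = [0.412, 0.163, 0.121, 0.187, 0.117]
```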

Monitoring Dashboard

After two weeks in production, the monitoring dashboard shows:

Churn Predictor v1 - Production Monitoring
──────────────────────────────────────────
Period: 2025-12-15 to 2025-12-29

Prediction Volume:     142,847 requests
Avg Latency (p50):     23ms
Avg Latency (p99):     87ms
Error Rate:            0.02%

Feature Drift (PSI):
  avg_order_value:       0.03  [OK]
  days_since_last_order: 0.04  [OK]
  total_sessions_90d:    0.07  [OK]

Prediction Distribution:
  Low risk  (< 0.3):    62.1%
  Med risk  (0.3-0.7):  24.3%
  High risk (> 0.7):    13.6%

Weekly Accuracy (vs actual churn labels):
  Week 1:  Precision@1000 = 0.81  [OK]
  Week 2:  Precision@1000 = 0.79  [OK]
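The precision@k metric behind the weekly accuracy rows takes the k customers with the highest predicted churn probability and measures the fraction that actually churned. A sketch:

```python
def precision_at_k(scored, k):
    """scored: list of (churn_probability, actually_churned 0/1) pairs."""
    top_k = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    return sum(label for _, label in top_k) / k

# Of the top-2 scored customers below, one actually churned => 0.5.
example = [(0.9, 1), (0.8, 0), (0.2, 1), (0.1, 0)]
```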

Stage 8: Experimentation

After the initial deployment, Priya runs a controlled experiment to compare her gradient boosting model against the original logistic regression baseline.

A/B Test Configuration

She sets up an A/B test via the ML Workbench > Experiments > A/B Tests panel:

{
  "experiment_name": "churn-model-comparison-v1-v2",
  "duration_days": 14,
  "traffic_split": {
    "control": {
      "model": "customer-churn-predictor/0.1.0",
      "description": "Logistic Regression (AUC 0.823)",
      "traffic_percentage": 50
    },
    "treatment": {
      "model": "customer-churn-predictor/1.0.0",
      "description": "LightGBM (AUC 0.871)",
      "traffic_percentage": 50
    }
  },
  "assignment_method": "customer_id_hash",
  "primary_metric": "churned_customers_identified",
  "secondary_metrics": [
    "false_positive_rate",
    "retention_campaign_cost",
    "revenue_saved"
  ],
  "guardrail_metrics": [
    { "metric": "false_positive_rate", "max_value": 0.30 }
  ]
}
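The `customer_id_hash` assignment method deterministically maps each customer to one arm, so a given customer sees the same model for the entire 14 days. A sketch using a stable hash (MD5 here; the platform's actual hash function is not specified):

```python
import hashlib

def assign_arm(customer_id, treatment_percentage=50):
    """Deterministically bucket a customer into 'control' or 'treatment'."""
    digest = hashlib.md5(str(customer_id).encode()).hexdigest()
    bucket = int(digest, 16) % 100            # stable value in [0, 100)
    return "treatment" if bucket < treatment_percentage else "control"

# Repeated calls always agree for the same customer_id, which keeps the
# experiment consistent without storing per-customer assignments.
```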

A/B Test Results (After 14 Days)

Metric                           Control (LR)  Treatment (LightGBM)  Lift     p-value
─────────────────────────────────────────────────────────────────────────────────────
Churners identified (top 1000)   614           756                   +23.1%   0.0003
False positive rate              0.22          0.19                  -13.6%   0.012
Retention campaigns sent         1,000         1,000                 --       --
Customers retained               187           231                   +23.5%   0.008
Revenue saved (est.)             $48,200       $62,700               +30.1%   0.004
Campaign cost                    $15,000       $15,000               --       --
Net ROI                          $33,200       $47,700               +43.7%   --

Conclusion: The LightGBM model (treatment) identifies 23% more true churners with a lower false positive rate. At the same campaign budget, it generates an estimated $14,500 more in retained revenue per cohort of 1,000 targeted customers.

Priya promotes the LightGBM model to 100% traffic and archives the logistic regression variant.
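The ROI rows are simply revenue saved minus campaign cost, and the lift is the relative difference between arms. Verifying the arithmetic:

```python
def net_roi(revenue_saved, campaign_cost):
    """Net return: estimated retained revenue minus what the campaign cost."""
    return revenue_saved - campaign_cost

control = net_roi(48_200, 15_000)       # control arm net ROI
treatment = net_roi(62_700, 15_000)     # treatment arm net ROI
lift = (treatment - control) / control  # relative improvement
```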

Iteration Plan

Based on the experiment results, Priya documents next steps:

  1. Feature expansion -- add customer_support_tickets data (new Zendesk connector)
  2. Segment-specific models -- Gold customers may have different churn signals than Bronze
  3. Real-time scoring -- trigger retention offers at checkout for high-risk customers
  4. Feedback loop -- connect retention campaign outcomes back to the training data

Summary

Stage               What Priya Did                                      Platform Components Used
─────────────────────────────────────────────────────────────────────────────────────────────────
Ingestion           Connected PostgreSQL CDC, imported survey CSV       Ingestion Service, Data Workbench
Discovery           Browsed catalog, profiled columns, traced lineage   Catalog Service, Data Workbench
Query               Wrote federated SQL joining PostgreSQL + Snowflake  Query Engine (Trino), Data Workbench
Orchestration       Built daily pipeline with quality gate              Pipeline Service (Temporal), Data Workbench
Analysis            Validated features with Great Expectations          Data Quality Service, Data Workbench
Productionization   Trained LightGBM, deployed to Ray Serve             ML Workbench, ML Service (MLflow, Ray Serve)
Feedback            Configured drift detection and accuracy tracking    ML Workbench Monitoring
Experimentation     Ran A/B test; LightGBM +23% churner identification  ML Workbench Experiments

Related Walkthroughs