Data Scientist Journey: Predicting Customer Churn
Persona: Priya, Senior Data Scientist at NovaMart
Objective: Build and deploy a customer churn prediction model to reduce NovaMart's 18.3% churn rate
Timeline: 3 weeks from exploration to production
Primary Workbenches: Data Workbench, ML Workbench
Stage 1: Ingestion
Priya starts in the Data Workbench by connecting NovaMart's primary transactional database and importing supplementary data.
Connecting the PostgreSQL Source
She navigates to Data Workbench > Sources > Add Connection and selects the Airbyte PostgreSQL connector. The Ingestion Service manages the connection lifecycle.
{
"connector": "source-postgres",
"config": {
"host": "novamart-prod.db.internal",
"port": 5432,
"database": "novamart",
"username": "${POSTGRES_USER}",
"password": "${POSTGRES_PASSWORD}",
"schemas": ["public"],
"replication_method": {
"method": "CDC",
"replication_slot": "matih_cdc_slot",
"publication": "matih_pub"
},
"ssl_mode": "require"
},
"streams": [
{ "name": "customers", "sync_mode": "incremental", "cursor_field": "updated_at" },
{ "name": "orders", "sync_mode": "incremental", "cursor_field": "updated_at" },
{ "name": "order_items", "sync_mode": "incremental", "cursor_field": "created_at" },
{ "name": "products", "sync_mode": "full_refresh" },
{ "name": "returns", "sync_mode": "incremental", "cursor_field": "return_date" }
],
"schedule": {
"type": "cron",
"expression": "0 * * * *"
}
}

The CDC connector uses PostgreSQL's Write-Ahead Log (WAL) for change capture, so Priya gets near-real-time updates without polling the production database.
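To see what the stream configuration above implies in practice, here is a minimal sketch that parses the `streams` section and separates incremental streams (which need a cursor field) from full-refresh ones. The helper name `incremental_streams` is illustrative, not a platform API:

```python
import json

# Hypothetical excerpt of the connector config shown above.
config = json.loads("""
{
  "streams": [
    {"name": "customers", "sync_mode": "incremental", "cursor_field": "updated_at"},
    {"name": "orders", "sync_mode": "incremental", "cursor_field": "updated_at"},
    {"name": "order_items", "sync_mode": "incremental", "cursor_field": "created_at"},
    {"name": "products", "sync_mode": "full_refresh"},
    {"name": "returns", "sync_mode": "incremental", "cursor_field": "return_date"}
  ]
}
""")

def incremental_streams(cfg):
    """Return {stream_name: cursor_field} for every incremental stream."""
    return {
        s["name"]: s["cursor_field"]
        for s in cfg["streams"]
        if s["sync_mode"] == "incremental"
    }

print(incremental_streams(config))
```

Only `products` is synced with `full_refresh`; the other four streams advance a cursor column on each hourly run.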
Importing the Customer Survey CSV
Priya also has a CSV export of customer NPS survey scores from a third-party survey tool. She uses Data Workbench > Sources > File Import to upload it.
File: customer_surveys_2025.csv
Format: CSV (UTF-8, comma-delimited, header row)
Records: 48,217
Columns: customer_id (INT), survey_date (DATE), nps_score (INT), satisfaction_rating (INT), comments (TEXT)
Destination: novamart.surveys.customer_nps

The platform validates the schema, shows a 10-row preview, and ingests the file. The Catalog Service automatically registers the new table with column-level metadata.
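The kind of schema check the platform runs on upload can be sketched with the standard library. This is a simplified illustration, not the Ingestion Service's actual code; the `SCHEMA` mapping mirrors the five columns listed above:

```python
import csv
import io
from datetime import date

# Expected schema for customer_surveys_2025.csv (column -> cast function).
SCHEMA = {
    "customer_id": int,
    "survey_date": date.fromisoformat,
    "nps_score": int,
    "satisfaction_rating": int,
    "comments": str,
}

def validate_rows(text):
    """Yield typed rows; raise ValueError on a header or type mismatch."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != list(SCHEMA):
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    for row in reader:
        # Any cast failure (e.g. non-integer nps_score) surfaces as an error.
        yield {col: cast(row[col]) for col, cast in SCHEMA.items()}

sample = (
    "customer_id,survey_date,nps_score,satisfaction_rating,comments\n"
    "1042587,2025-11-03,9,4,Great delivery speed\n"
)
rows = list(validate_rows(sample))
```

A bad row fails fast at cast time, which is the behavior Priya relies on before the 48,217 records land in `novamart.surveys.customer_nps`.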
Stage 2: Discovery
With data flowing, Priya explores the catalog to understand what she has to work with.
Browsing the Data Catalog
In Data Workbench > Catalog, she searches for "customer" and finds:
| Table | Schema | Columns | Rows | Last Updated | Quality Score |
|---|---|---|---|---|---|
| customers | novamart.public | 14 | 2.1M | 12 min ago | 87% |
| customer_nps | novamart.surveys | 5 | 48K | 2 hours ago | 94% |
| customer_segments | analytics.derived | 8 | 2.1M | 1 day ago | 91% |
She clicks into the customers table and reviews the data profile:
Column Profiling: novamart.public.customers
─────────────────────────────────────────────────────────
Column Type Nulls Unique Distribution
─────────────────────────────────────────────────────────
customer_id BIGINT 0.0% 2,103,847 Uniform
email VARCHAR 12.1% 1,849,384 -
first_name VARCHAR 0.2% 148,293 -
last_name VARCHAR 0.3% 312,847 -
signup_date DATE 0.0% 1,847 2019-01 to 2025-12
last_order_date DATE 4.7% 1,612 2020-03 to 2025-12
segment VARCHAR 0.0% 5 [Gold, Silver, Bronze, New, Inactive]
lifetime_value DECIMAL 0.0% - min=0, median=287, max=14,329
channel VARCHAR 0.1% 3 [web: 61%, mobile: 28%, store: 11%]
city VARCHAR 2.3% 4,891 -
state VARCHAR 1.8% 51 -
created_at TIMESTAMP 0.0% - -
updated_at TIMESTAMP 0.0% - -
is_active BOOLEAN 0.0% 2 [true: 81.7%, false: 18.3%]

Key finding: The 12.1% null rate on email means Priya cannot use email-based features for all customers. She notes this for feature engineering.
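The null-rate figures in the profile reduce to a simple per-column count. A stdlib sketch with a tiny hypothetical sample (the real profiler runs over all 2.1M rows):

```python
# Hypothetical sample rows standing in for novamart.public.customers.
rows = [
    {"customer_id": 1, "email": "a@x.com", "city": "Austin"},
    {"customer_id": 2, "email": None,      "city": "Boston"},
    {"customer_id": 3, "email": None,      "city": None},
    {"customer_id": 4, "email": "d@x.com", "city": "Denver"},
]

def null_rates(rows):
    """Fraction of NULL values per column, as the catalog profile reports."""
    cols = rows[0].keys()
    n = len(rows)
    return {c: sum(r[c] is None for r in rows) / n for c in cols}

print(null_rates(rows))  # email: 0.5, city: 0.25, customer_id: 0.0
```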
Tracing Data Lineage
She clicks the Lineage tab to see how data flows between tables:
orders ──────┐
├──▶ order_items ──▶ product_revenue (derived)
products ────┘ │
▼
customers ──────────────────────▶ customer_segments (derived)
▲ │
│ ▼
customer_nps ┘ churn_features (Priya will create this)

She confirms that order_items joins to both orders (via order_id) and products (via product_id), giving her the full purchase graph she needs.
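The two joins she confirmed can be illustrated with tiny in-memory tables; the data is invented and the shape is simplified to the three columns that matter for the purchase graph:

```python
# Hypothetical miniature versions of the three tables.
orders = {101: {"customer_id": 1}, 102: {"customer_id": 2}}
products = {"P1": {"name": "Mug"}, "P2": {"name": "Lamp"}}
order_items = [
    {"order_id": 101, "product_id": "P1", "qty": 2},
    {"order_id": 101, "product_id": "P2", "qty": 1},
    {"order_id": 102, "product_id": "P1", "qty": 1},
]

def purchases():
    """Resolve each line item to (customer_id, product name, qty)
    via the order_id and product_id foreign keys."""
    return [
        (orders[i["order_id"]]["customer_id"],
         products[i["product_id"]]["name"],
         i["qty"])
        for i in order_items
    ]
```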
Stage 3: Query
Priya switches to the Query Editor in the Data Workbench to build her churn feature set using SQL.
Defining Churn
NovaMart defines churn as a customer who has not placed an order in the last 90 days. She writes a feature engineering query that joins PostgreSQL and Snowflake data via the federated Query Engine (Trino):
-- Churn feature set: one row per customer with behavioral features
-- Federated query: PostgreSQL (orders, customers) + Snowflake (clickstream)
CREATE TABLE analytics.derived.churn_features AS
WITH order_features AS (
SELECT
c.customer_id,
c.signup_date,
c.segment,
c.channel AS signup_channel,
COUNT(DISTINCT o.order_id) AS total_orders,
SUM(o.total_amount) AS total_revenue,
AVG(o.total_amount) AS avg_order_value,
MAX(o.order_date) AS last_order_date,
DATE_DIFF('day', MAX(o.order_date), CURRENT_DATE) AS days_since_last_order,
DATE_DIFF('day', c.signup_date, CURRENT_DATE) AS customer_age_days,
COUNT(DISTINCT DATE_TRUNC('month', o.order_date)) AS active_months,
STDDEV(o.total_amount) AS order_value_stddev
FROM postgresql.novamart.customers c
LEFT JOIN postgresql.novamart.orders o ON c.customer_id = o.customer_id
WHERE c.signup_date < CURRENT_DATE - INTERVAL '90' DAY
GROUP BY c.customer_id, c.signup_date, c.segment, c.channel
),
clickstream_features AS (
SELECT
customer_id,
COUNT(*) AS total_sessions_90d,
COUNT(DISTINCT DATE(event_timestamp)) AS active_days_90d,
SUM(CASE WHEN event_type = 'product_view' THEN 1 ELSE 0 END) AS product_views_90d,
SUM(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END) AS cart_adds_90d,
SUM(CASE WHEN event_type = 'search' THEN 1 ELSE 0 END) AS searches_90d
FROM snowflake.analytics.clickstream
WHERE event_timestamp >= CURRENT_DATE - INTERVAL '90' DAY
GROUP BY customer_id
),
return_features AS (
SELECT
o.customer_id,
COUNT(r.return_id) AS total_returns,
SUM(r.refund_amount) AS total_refund_amount,
CAST(COUNT(r.return_id) AS DOUBLE) / NULLIF(COUNT(DISTINCT o.order_id), 0) AS return_rate
FROM postgresql.novamart.orders o
LEFT JOIN postgresql.novamart.returns r ON o.order_id = r.order_id
GROUP BY o.customer_id
)
SELECT
of.*,
COALESCE(cf.total_sessions_90d, 0) AS total_sessions_90d,
COALESCE(cf.active_days_90d, 0) AS active_days_90d,
COALESCE(cf.product_views_90d, 0) AS product_views_90d,
COALESCE(cf.cart_adds_90d, 0) AS cart_adds_90d,
COALESCE(cf.searches_90d, 0) AS searches_90d,
COALESCE(rf.total_returns, 0) AS total_returns,
COALESCE(rf.return_rate, 0) AS return_rate,
-- Label: churned if no order in 90 days
CASE WHEN of.days_since_last_order > 90 THEN 1 ELSE 0 END AS is_churned
FROM order_features of
LEFT JOIN clickstream_features cf ON of.customer_id = cf.customer_id
LEFT JOIN return_features rf ON of.customer_id = rf.customer_id;

Query result: 1,987,432 rows, 21 columns. Execution time: 34 seconds (federated across PostgreSQL and Snowflake via the Trino query engine).
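The labeling rule in the final CASE expression is worth stating on its own: a customer is churned exactly when their last order is more than 90 days old. A Python restatement (the reference date is illustrative):

```python
from datetime import date

CHURN_WINDOW_DAYS = 90  # NovaMart's churn definition

def is_churned(last_order_date, today=date(2025, 12, 15)):
    """Return 1 if no order landed within the last 90 days, else 0."""
    if last_order_date is None:  # customer has never ordered
        return 1
    return 1 if (today - last_order_date).days > CHURN_WINDOW_DAYS else 0

print(is_churned(date(2025, 12, 1)))  # ordered 14 days ago -> 0
print(is_churned(date(2025, 8, 1)))   # 136 days ago -> 1
```

Note the SQL filters out customers who signed up within the last 90 days, so a "new customer who hasn't bought yet" cannot be mislabeled as churned.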
Stage 4: Orchestration
Priya needs this feature table refreshed daily. She creates a Temporal pipeline in the Data Workbench > Pipelines section.
Pipeline DAG
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Extract │ │ Quality │ │ Load to │
│ Customer │────▶│ Validation │────▶│ Feature │
│ Features │ │ Gate │ │ Store │
│ │ │ │ │ │
│ Run churn SQL │ │ Check nulls, │ │ Write to ML │
│ query above │ │ duplicates, │ │ Workbench │
│ against live │ │ ranges │ │ feature store │
│ sources │ │ │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘

Pipeline Configuration
{
"pipeline_id": "churn-feature-refresh",
"name": "Daily Churn Feature Refresh",
"schedule": "0 4 * * *",
"tasks": [
{
"task_id": "extract_features",
"type": "sql_execute",
"config": {
"query_ref": "analytics.queries.churn_feature_set",
"target_table": "analytics.derived.churn_features",
"write_mode": "overwrite"
},
"retry_policy": {
"max_attempts": 3,
"backoff_coefficient": 2,
"initial_interval": "30s"
}
},
{
"task_id": "quality_gate",
"type": "data_quality_check",
"depends_on": ["extract_features"],
"config": {
"dataset": "analytics.derived.churn_features",
"suite": "churn_features_validation",
"fail_pipeline_on_warning": false,
"fail_pipeline_on_error": true
}
},
{
"task_id": "load_feature_store",
"type": "feature_store_sync",
"depends_on": ["quality_gate"],
"config": {
"source_table": "analytics.derived.churn_features",
"feature_group": "customer_churn_features",
"entity_key": "customer_id",
"timestamp_column": "last_order_date"
}
}
],
"notifications": {
"on_failure": { "channel": "slack", "target": "#data-alerts" },
"on_success": { "channel": "slack", "target": "#data-pipelines" }
}
}

Stage 5: Analysis
Before training a model, Priya validates the feature data using the Data Quality Service (powered by Great Expectations).
Quality Rule Configuration
{
"suite_name": "churn_features_validation",
"dataset": "analytics.derived.churn_features",
"expectations": [
{
"type": "expect_table_row_count_to_be_between",
"kwargs": { "min_value": 1800000, "max_value": 2200000 }
},
{
"type": "expect_column_values_to_be_unique",
"kwargs": { "column": "customer_id" }
},
{
"type": "expect_column_values_to_not_be_null",
"kwargs": { "column": "customer_id" }
},
{
"type": "expect_column_values_to_be_between",
"kwargs": { "column": "days_since_last_order", "min_value": 0, "max_value": 3650 }
},
{
"type": "expect_column_values_to_be_between",
"kwargs": { "column": "avg_order_value", "min_value": 0, "max_value": 50000 }
},
{
"type": "expect_column_proportion_of_unique_values_to_be_between",
"kwargs": { "column": "customer_id", "min_value": 0.99, "max_value": 1.0 }
},
{
"type": "expect_column_values_to_be_in_set",
"kwargs": { "column": "segment", "value_set": ["Gold", "Silver", "Bronze", "New", "Inactive"] }
}
]
}

Quality Check Results
Suite: churn_features_validation
Run Date: 2025-12-15 04:12:33 UTC
Status: PASSED (7/7 expectations met)
[PASS] Row count: 1,987,432 (expected 1.8M - 2.2M)
[PASS] customer_id uniqueness: 100%
[PASS] customer_id nulls: 0.0%
[PASS] days_since_last_order range: all values in [0, 2,187]
[PASS] avg_order_value range: all values in [4.99, 14,329.00]
[PASS] customer_id uniqueness proportion: 1.0
[PASS] segment values: all in expected set

Feature Distribution Profiling
Priya profiles the target variable and discovers an important pattern:
Feature: days_since_last_order
──────────────────────────────
Count: 1,987,432
Mean: 67.3 days
Median: 42.0 days
Std Dev: 89.1 days
Min: 0 days
Max: 2,187 days
Distribution (bimodal):
0-30 days: ████████████████████ 41.2% (active buyers)
31-60 days: ████████ 16.3%
61-90 days: ██████ 12.1%
91-180 days: ████████████ 18.7% (recently churned)
181+ days: ██████ 11.7% (long-term inactive)
Churn label distribution:
Not churned (0): 69.6% | Churned (1): 30.4%

The bimodal distribution confirms that customers either buy regularly (< 30 days between orders) or drift away sharply after 90 days. This informs Priya's model design: there is a clear decision boundary.
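Under the hood, the suite's range and uniqueness expectations reduce to simple predicates over a column's values. A stdlib-only sketch (not the Great Expectations implementation, just the logic the names describe):

```python
def expect_between(vals, lo, hi):
    """Pass iff every value falls in [lo, hi];
    mirrors expect_column_values_to_be_between."""
    return all(lo <= v <= hi for v in vals)

def expect_unique(vals):
    """Pass iff there are no duplicates;
    mirrors expect_column_values_to_be_unique."""
    return len(vals) == len(set(vals))

# days_since_last_order observed range [0, 2187] vs. allowed [0, 3650]
print(expect_between([0, 42, 2187], 0, 3650))  # True
print(expect_unique([1, 2, 3]))                # True
```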
Stage 6: Productionization
Priya switches to the ML Workbench to train, evaluate, and deploy her model.
Experiment Tracking
She trains three models and logs all experiments to MLflow via the ML Workbench:
| Experiment | Model | Features | AUC | Precision | Recall | F1 | Training Time |
|---|---|---|---|---|---|---|---|
| churn-exp-001 | Logistic Regression | All 18 | 0.823 | 0.71 | 0.68 | 0.69 | 12s |
| churn-exp-002 | Random Forest | All 18 | 0.856 | 0.76 | 0.73 | 0.74 | 2m 18s |
| churn-exp-003 | Gradient Boosting (LightGBM) | All 18 | 0.871 | 0.79 | 0.77 | 0.78 | 4m 42s |
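As a sanity check on the table, the F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

experiments = {
    "churn-exp-001": (0.71, 0.68),
    "churn-exp-002": (0.76, 0.73),
    "churn-exp-003": (0.79, 0.77),
}
for name, (p, r) in experiments.items():
    print(name, round(f1(p, r), 2))  # 0.69, 0.74, 0.78 -- matches the table
```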
Model Registration
She registers the best model (churn-exp-003) in the ML Workbench model registry:
{
"model_name": "customer-churn-predictor",
"version": "1.0.0",
"source_experiment": "churn-exp-003",
"framework": "lightgbm",
"metrics": {
"auc": 0.871,
"precision": 0.79,
"recall": 0.77,
"f1": 0.78
},
"feature_set": "customer_churn_features",
"input_schema": {
"total_orders": "int",
"avg_order_value": "float",
"days_since_last_order": "int",
"customer_age_days": "int",
"total_sessions_90d": "int",
"product_views_90d": "int",
"return_rate": "float",
"segment": "string"
},
"tags": {
"team": "data-science",
"use_case": "customer-retention",
"owner": "priya@novamart.com"
}
}

Ray Serve Deployment
She deploys the model to Ray Serve for real-time scoring via the ML Workbench deployment interface:
{
"deployment_name": "churn-predictor-v1",
"model_ref": "customer-churn-predictor/1.0.0",
"runtime": "ray_serve",
"config": {
"num_replicas": 2,
"max_concurrent_queries": 100,
"ray_actor_options": {
"num_cpus": 1,
"memory": 2147483648
},
"autoscaling_config": {
"min_replicas": 2,
"max_replicas": 6,
"target_num_ongoing_requests_per_replica": 10
}
},
"health_check": {
"path": "/health",
"interval_s": 30
}
}

After deployment, the model is accessible at the internal endpoint:
# Test the deployed model
curl -X POST https://ml.novamart.matih.internal/serve/churn-predictor-v1/predict \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${JWT_TOKEN}" \
-d '{
"customer_id": 1042587,
"total_orders": 3,
"avg_order_value": 45.20,
"days_since_last_order": 78,
"customer_age_days": 412,
"total_sessions_90d": 2,
"product_views_90d": 8,
"return_rate": 0.33,
"segment": "Silver"
}'
# Response
{
"customer_id": 1042587,
"churn_probability": 0.73,
"churn_risk": "HIGH",
"top_factors": [
{ "feature": "days_since_last_order", "importance": 0.31 },
{ "feature": "total_sessions_90d", "importance": 0.22 },
{ "feature": "return_rate", "importance": 0.18 }
]
}

Stage 7: Feedback
With the model in production, Priya configures monitoring to catch degradation early.
Drift Detection Configuration
In the ML Workbench > Monitoring section, she configures feature and prediction drift alerts:
{
"monitor_name": "churn-predictor-drift",
"deployment": "churn-predictor-v1",
"monitors": [
{
"type": "feature_drift",
"method": "psi",
"features": ["avg_order_value", "days_since_last_order", "total_sessions_90d"],
"threshold": 0.1,
"window": "7d",
"reference_dataset": "analytics.derived.churn_features_baseline_2025q4"
},
{
"type": "prediction_drift",
"method": "ks_test",
"column": "churn_probability",
"threshold": 0.05,
"window": "7d"
},
{
"type": "accuracy_tracking",
"metric": "precision_at_k",
"k": 1000,
"ground_truth_table": "analytics.derived.actual_churn_labels",
"evaluation_schedule": "0 0 * * MON",
"alert_threshold": 0.70
}
],
"alerts": {
"channels": ["slack:#ml-alerts", "email:priya@novamart.com"],
"severity_mapping": {
"feature_drift": "warning",
"prediction_drift": "critical",
"accuracy_drop": "critical"
}
}
}

Monitoring Dashboard
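The PSI (population stability index) metric used by the feature-drift monitor compares bucketed proportions of a reference distribution against the current one; values below 0.1 are conventionally treated as stable. A standard formulation in stdlib Python (bucket proportions here are illustrative):

```python
import math

def psi(expected, actual):
    """Population Stability Index over pre-bucketed proportions.
    Each argument is a list of bucket proportions summing to 1."""
    eps = 1e-6  # guard against log(0) on empty buckets
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

baseline = [0.40, 0.30, 0.20, 0.10]   # e.g. churn_features_baseline_2025q4
current  = [0.38, 0.31, 0.21, 0.10]   # mild, harmless shift
print(round(psi(baseline, current), 4))  # well under the 0.1 alert threshold
```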
After two weeks in production, the monitoring dashboard shows:
Churn Predictor v1 - Production Monitoring
──────────────────────────────────────────
Period: 2025-12-15 to 2025-12-29
Prediction Volume: 142,847 requests
Avg Latency (p50): 23ms
Avg Latency (p99): 87ms
Error Rate: 0.02%
Feature Drift (PSI):
avg_order_value: 0.03 [OK]
days_since_last_order: 0.04 [OK]
total_sessions_90d: 0.07 [OK]
Prediction Distribution:
Low risk (< 0.3): 62.1%
Med risk (0.3-0.7): 24.3%
High risk (> 0.7): 13.6%
Weekly Accuracy (vs actual churn labels):
Week 1: Precision@1000 = 0.81 [OK]
Week 2: Precision@1000 = 0.79 [OK]

Stage 8: Experimentation
After the initial deployment, Priya runs a controlled experiment to compare her gradient boosting model against the original logistic regression baseline.
A/B Test Configuration
She sets up an A/B test via the ML Workbench > Experiments > A/B Tests panel:
{
"experiment_name": "churn-model-comparison-v1-v2",
"duration_days": 14,
"traffic_split": {
"control": {
"model": "customer-churn-predictor/0.1.0",
"description": "Logistic Regression (AUC 0.823)",
"traffic_percentage": 50
},
"treatment": {
"model": "customer-churn-predictor/1.0.0",
"description": "LightGBM (AUC 0.871)",
"traffic_percentage": 50
}
},
"assignment_method": "customer_id_hash",
"primary_metric": "churned_customers_identified",
"secondary_metrics": [
"false_positive_rate",
"retention_campaign_cost",
"revenue_saved"
],
"guardrail_metrics": [
{ "metric": "false_positive_rate", "max_value": 0.30 }
]
}

A/B Test Results (After 14 Days)
| Metric | Control (LR) | Treatment (LightGBM) | Lift | p-value |
|---|---|---|---|---|
| Churners identified (top 1000) | 614 | 756 | +23.1% | 0.0003 |
| False positive rate | 0.22 | 0.19 | -13.6% | 0.012 |
| Retention campaigns sent | 1,000 | 1,000 | -- | -- |
| Customers retained | 187 | 231 | +23.5% | 0.008 |
| Revenue saved (est.) | $48,200 | $62,700 | +30.1% | 0.004 |
| Campaign cost | $15,000 | $15,000 | -- | -- |
| Net ROI | $33,200 | $47,700 | +43.7% | -- |
Conclusion: The LightGBM model (treatment) identifies 23% more true churners with a lower false positive rate. At the same campaign budget, it generates an estimated $14,500 more in retained revenue per cohort of 1,000 targeted customers.
Priya promotes the LightGBM model to 100% traffic and archives the logistic regression variant.
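The experiment's `customer_id_hash` assignment method can be sketched as follows; the exact hashing scheme is an assumption, but the key properties hold for any such design: deterministic (a customer always sees the same model) and approximately 50/50:

```python
import hashlib

def assign_variant(customer_id, experiment="churn-model-comparison-v1-v2"):
    """Deterministic 50/50 split: hash (experiment, customer_id) to a bucket."""
    digest = hashlib.sha256(f"{experiment}:{customer_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

# The same customer is always routed to the same variant across requests,
# and both arms receive traffic.
print(assign_variant(1042587))
```

Salting the hash with the experiment name keeps assignments independent across concurrent experiments.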
Iteration Plan
Based on the experiment results, Priya documents next steps:
- Feature expansion -- add customer_support_tickets data (new Zendesk connector)
- Segment-specific models -- Gold customers may have different churn signals than Bronze
- Real-time scoring -- trigger retention offers at checkout for high-risk customers
- Feedback loop -- connect retention campaign outcomes back to the training data
Summary
| Stage | What Priya Did | Platform Components Used |
|---|---|---|
| Ingestion | Connected PostgreSQL CDC, imported survey CSV | Ingestion Service, Data Workbench |
| Discovery | Browsed catalog, profiled columns, traced lineage | Catalog Service, Data Workbench |
| Query | Wrote federated SQL joining PostgreSQL + Snowflake | Query Engine (Trino), Data Workbench |
| Orchestration | Built daily pipeline with quality gate | Pipeline Service (Temporal), Data Workbench |
| Analysis | Validated features with Great Expectations | Data Quality Service, Data Workbench |
| Productionization | Trained LightGBM, deployed to Ray Serve | ML Workbench, ML Service (MLflow, Ray Serve) |
| Feedback | Configured drift detection and accuracy tracking | ML Workbench Monitoring |
| Experimentation | Ran A/B test, LightGBM +23% churner identification | ML Workbench Experiments |
Related Walkthroughs
- ML Engineer Journey -- Marcus operationalizes the demand forecasting pipeline
- BI Lead Journey -- Sofia builds dashboards that consume Priya's churn scores
- Executive Journey -- David uses AI to ask strategic questions about churn
- Retail Overview -- NovaMart company profile and dataset descriptions