Data Scientist Journey: Predicting Customer Churn
Persona: Priya, Senior Data Scientist at NovaMart
Objective: Build and deploy a customer churn prediction model to reduce NovaMart's 18.3% churn rate
Timeline: 3 weeks from exploration to production
Primary Workbenches: Data Workbench, ML Workbench
Stage 1: Ingestion
Priya starts in the Data Workbench by connecting NovaMart's primary transactional database and importing supplementary data.
Connecting the PostgreSQL Source
She navigates to Data Workbench > Sources > Add Connection and selects the Airbyte PostgreSQL connector. The Ingestion Service manages the connection lifecycle.
{
"connector": "source-postgres",
"config": {
"host": "novamart-prod.db.internal",
"port": 5432,
"database": "novamart",
"username": "${POSTGRES_USER}",
"password": "${POSTGRES_PASSWORD}",
"schemas": ["public"],
"replication_method": {
"method": "CDC",
"replication_slot": "matih_cdc_slot",
"publication": "matih_pub"
},
"ssl_mode": "require"
},
"streams": [
{ "name": "customers", "sync_mode": "incremental", "cursor_field": "updated_at" },
{ "name": "orders", "sync_mode": "incremental", "cursor_field": "updated_at" },
{ "name": "order_items", "sync_mode": "incremental", "cursor_field": "created_at" },
{ "name": "products", "sync_mode": "full_refresh" },
{ "name": "returns", "sync_mode": "incremental", "cursor_field": "return_date" }
],
"schedule": {
"type": "cron",
"expression": "0 * * * *"
}
}

The CDC connector uses PostgreSQL's Write-Ahead Log (WAL) for change capture, so Priya gets near-real-time updates without polling the production database.
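To see what the stream configuration above implies in practice, here is a minimal sketch that parses the `streams` section and separates incremental streams (which need a cursor field) from full-refresh ones. The helper name `incremental_streams` is illustrative, not a platform API:

```python
import json

# Hypothetical excerpt of the connector config shown above.
config = json.loads("""
{
  "streams": [
    {"name": "customers", "sync_mode": "incremental", "cursor_field": "updated_at"},
    {"name": "orders", "sync_mode": "incremental", "cursor_field": "updated_at"},
    {"name": "order_items", "sync_mode": "incremental", "cursor_field": "created_at"},
    {"name": "products", "sync_mode": "full_refresh"},
    {"name": "returns", "sync_mode": "incremental", "cursor_field": "return_date"}
  ]
}
""")

def incremental_streams(cfg):
    """Return {stream_name: cursor_field} for every incremental stream."""
    return {
        s["name"]: s["cursor_field"]
        for s in cfg["streams"]
        if s["sync_mode"] == "incremental"
    }

print(incremental_streams(config))
```

Only `products` is synced with `full_refresh`; the other four streams advance a cursor column on each hourly run.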
Importing the Customer Survey CSV
Priya also has a CSV export of customer NPS survey scores from a third-party survey tool. She uses Data Workbench > Sources > File Import to upload it.
File: customer_surveys_2025.csv
Format: CSV (UTF-8, comma-delimited, header row)
Records: 48,217
Columns: customer_id (INT), survey_date (DATE), nps_score (INT), satisfaction_rating (INT), comments (TEXT)
Destination: novamart.surveys.customer_nps

The platform validates the schema, shows a 10-row preview, and ingests the file. The Catalog Service automatically registers the new table with column-level metadata.
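The kind of schema check the platform runs on upload can be sketched with the standard library. This is a simplified illustration, not the Ingestion Service's actual code; the `SCHEMA` mapping mirrors the five columns listed above:

```python
import csv
import io
from datetime import date

# Expected schema for customer_surveys_2025.csv (column -> cast function).
SCHEMA = {
    "customer_id": int,
    "survey_date": date.fromisoformat,
    "nps_score": int,
    "satisfaction_rating": int,
    "comments": str,
}

def validate_rows(text):
    """Yield typed rows; raise ValueError on a header or type mismatch."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != list(SCHEMA):
        raise ValueError(f"unexpected header: {reader.fieldnames}")
    for row in reader:
        # Any cast failure (e.g. non-integer nps_score) surfaces as an error.
        yield {col: cast(row[col]) for col, cast in SCHEMA.items()}

sample = (
    "customer_id,survey_date,nps_score,satisfaction_rating,comments\n"
    "1042587,2025-11-03,9,4,Great delivery speed\n"
)
rows = list(validate_rows(sample))
```

A bad row fails fast at cast time, which is the behavior Priya relies on before the 48,217 records land in `novamart.surveys.customer_nps`.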
Stage 2: Discovery
With data flowing, Priya explores the catalog to understand what she has to work with.
Browsing the Data Catalog
In Data Workbench > Catalog, she searches for "customer" and finds:
| Table | Schema | Columns | Rows | Last Updated | Quality Score |
|---|---|---|---|---|---|
| customers | novamart.public | 14 | 2.1M | 12 min ago | 87% |
| customer_nps | novamart.surveys | 5 | 48K | 2 hours ago | 94% |
| customer_segments | analytics.derived | 8 | 2.1M | 1 day ago | 91% |
She clicks into the customers table and reviews the data profile:
Column Profiling: novamart.public.customers
─────────────────────────────────────────────────────────
Column Type Nulls Unique Distribution
─────────────────────────────────────────────────────────
customer_id BIGINT 0.0% 2,103,847 Uniform
email VARCHAR 12.1% 1,849,384 -
first_name VARCHAR 0.2% 148,293 -
last_name VARCHAR 0.3% 312,847 -
signup_date DATE 0.0% 1,847 2019-01 to 2025-12
last_order_date DATE 4.7% 1,612 2020-03 to 2025-12
segment VARCHAR 0.0% 5 [Gold, Silver, Bronze, New, Inactive]
lifetime_value DECIMAL 0.0% - min=0, median=287, max=14,329
channel VARCHAR 0.1% 3 [web: 61%, mobile: 28%, store: 11%]
city VARCHAR 2.3% 4,891 -
state VARCHAR 1.8% 51 -
created_at TIMESTAMP 0.0% - -
updated_at TIMESTAMP 0.0% - -
is_active BOOLEAN 0.0% 2 [true: 81.7%, false: 18.3%]

Key finding: The 12.1% null rate on email means Priya cannot use email-based features for all customers. She notes this for feature engineering.
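The null-rate figures in the profile reduce to a simple per-column count. A stdlib sketch with a tiny hypothetical sample (the real profiler runs over all 2.1M rows):

```python
# Hypothetical sample rows standing in for novamart.public.customers.
rows = [
    {"customer_id": 1, "email": "a@x.com", "city": "Austin"},
    {"customer_id": 2, "email": None,      "city": "Boston"},
    {"customer_id": 3, "email": None,      "city": None},
    {"customer_id": 4, "email": "d@x.com", "city": "Denver"},
]

def null_rates(rows):
    """Fraction of NULL values per column, as the catalog profile reports."""
    cols = rows[0].keys()
    n = len(rows)
    return {c: sum(r[c] is None for r in rows) / n for c in cols}

print(null_rates(rows))  # email: 0.5, city: 0.25, customer_id: 0.0
```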
Tracing Data Lineage
She clicks the Lineage tab to see how data flows between tables:
orders ──────┐
├──▶ order_items ──▶ product_revenue (derived)
products ────┘ │
▼
customers ──────────────────────▶ customer_segments (derived)
▲ │
│ ▼
customer_nps ┘ churn_features (Priya will create this)

She confirms that order_items joins to both orders (via order_id) and products (via product_id), giving her the full purchase graph she needs.
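The two joins she confirmed can be illustrated with tiny in-memory tables; the data is invented and the shape is simplified to the three columns that matter for the purchase graph:

```python
# Hypothetical miniature versions of the three tables.
orders = {101: {"customer_id": 1}, 102: {"customer_id": 2}}
products = {"P1": {"name": "Mug"}, "P2": {"name": "Lamp"}}
order_items = [
    {"order_id": 101, "product_id": "P1", "qty": 2},
    {"order_id": 101, "product_id": "P2", "qty": 1},
    {"order_id": 102, "product_id": "P1", "qty": 1},
]

def purchases():
    """Resolve each line item to (customer_id, product name, qty)
    via the order_id and product_id foreign keys."""
    return [
        (orders[i["order_id"]]["customer_id"],
         products[i["product_id"]]["name"],
         i["qty"])
        for i in order_items
    ]
```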
Stage 3: Query
Priya switches to the Query Editor in the Data Workbench to build her churn feature set using SQL.
Defining Churn
NovaMart defines churn as a customer who has not placed an order in the last 90 days. She writes a feature engineering query that joins PostgreSQL and Snowflake data via the federated Query Engine (Trino):
-- Churn feature set: one row per customer with behavioral features
-- Federated query: PostgreSQL (orders, customers) + Snowflake (clickstream)
CREATE TABLE analytics.derived.churn_features AS
WITH order_features AS (
SELECT
c.customer_id,
c.signup_date,
c.segment,
c.channel AS signup_channel,
COUNT(DISTINCT o.order_id) AS total_orders,
SUM(o.total_amount) AS total_revenue,
AVG(o.total_amount) AS avg_order_value,
MAX(o.order_date) AS last_order_date,
DATE_DIFF('day', MAX(o.order_date), CURRENT_DATE) AS days_since_last_order,
DATE_DIFF('day', c.signup_date, CURRENT_DATE) AS customer_age_days,
COUNT(DISTINCT DATE_TRUNC('month', o.order_date)) AS active_months,
STDDEV(o.total_amount) AS order_value_stddev
FROM postgresql.novamart.customers c
LEFT JOIN postgresql.novamart.orders o ON c.customer_id = o.customer_id
WHERE c.signup_date < CURRENT_DATE - INTERVAL '90' DAY
GROUP BY c.customer_id, c.signup_date, c.segment, c.channel
),
clickstream_features AS (
SELECT
customer_id,
COUNT(*) AS total_sessions_90d,
COUNT(DISTINCT DATE(event_timestamp)) AS active_days_90d,
SUM(CASE WHEN event_type = 'product_view' THEN 1 ELSE 0 END) AS product_views_90d,
SUM(CASE WHEN event_type = 'add_to_cart' THEN 1 ELSE 0 END) AS cart_adds_90d,
SUM(CASE WHEN event_type = 'search' THEN 1 ELSE 0 END) AS searches_90d
FROM snowflake.analytics.clickstream
WHERE event_timestamp >= CURRENT_DATE - INTERVAL '90' DAY
GROUP BY customer_id
),
return_features AS (
SELECT
o.customer_id,
COUNT(r.return_id) AS total_returns,
SUM(r.refund_amount) AS total_refund_amount,
CAST(COUNT(r.return_id) AS DOUBLE) / NULLIF(COUNT(DISTINCT o.order_id), 0) AS return_rate
FROM postgresql.novamart.orders o
LEFT JOIN postgresql.novamart.returns r ON o.order_id = r.order_id
GROUP BY o.customer_id
)
SELECT
of.*,
COALESCE(cf.total_sessions_90d, 0) AS total_sessions_90d,
COALESCE(cf.active_days_90d, 0) AS active_days_90d,
COALESCE(cf.product_views_90d, 0) AS product_views_90d,
COALESCE(cf.cart_adds_90d, 0) AS cart_adds_90d,
COALESCE(cf.searches_90d, 0) AS searches_90d,
COALESCE(rf.total_returns, 0) AS total_returns,
COALESCE(rf.return_rate, 0) AS return_rate,
-- Label: churned if no order in 90 days
CASE WHEN of.days_since_last_order > 90 THEN 1 ELSE 0 END AS is_churned
FROM order_features of
LEFT JOIN clickstream_features cf ON of.customer_id = cf.customer_id
LEFT JOIN return_features rf ON of.customer_id = rf.customer_id;

Query result: 1,987,432 rows, 21 columns. Execution time: 34 seconds (federated across PostgreSQL and Snowflake via the Trino query engine).
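The labeling rule in the final CASE expression is worth stating on its own: a customer is churned exactly when their last order is more than 90 days old. A Python restatement (the reference date is illustrative):

```python
from datetime import date

CHURN_WINDOW_DAYS = 90  # NovaMart's churn definition

def is_churned(last_order_date, today=date(2025, 12, 15)):
    """Return 1 if no order landed within the last 90 days, else 0."""
    if last_order_date is None:  # customer has never ordered
        return 1
    return 1 if (today - last_order_date).days > CHURN_WINDOW_DAYS else 0

print(is_churned(date(2025, 12, 1)))  # ordered 14 days ago -> 0
print(is_churned(date(2025, 8, 1)))   # 136 days ago -> 1
```

Note the SQL filters out customers who signed up within the last 90 days, so a "new customer who hasn't bought yet" cannot be mislabeled as churned.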
Stage 4: Orchestration
Priya needs this feature table refreshed daily. She creates a Temporal pipeline in the Data Workbench > Pipelines section.
Pipeline DAG
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Extract │ │ Quality │ │ Load to │
│ Customer │────▶│ Validation │────▶│ Feature │
│ Features │ │ Gate │ │ Store │
│ │ │ │ │ │
│ Run churn SQL │ │ Check nulls, │ │ Write to ML │
│ query above │ │ duplicates, │ │ Workbench │
│ against live │ │ ranges │ │ feature store │
│ sources │ │ │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘

Pipeline Configuration
{
"pipeline_id": "churn-feature-refresh",
"name": "Daily Churn Feature Refresh",
"schedule": "0 4 * * *",
"tasks": [
{
"task_id": "extract_features",
"type": "sql_execute",
"config": {
"query_ref": "analytics.queries.churn_feature_set",
"target_table": "analytics.derived.churn_features",
"write_mode": "overwrite"
},
"retry_policy": {
"max_attempts": 3,
"backoff_coefficient": 2,
"initial_interval": "30s"
}
},
{
"task_id": "quality_gate",
"type": "data_quality_check",
"depends_on": ["extract_features"],
"config": {
"dataset": "analytics.derived.churn_features",
"suite": "churn_features_validation",
"fail_pipeline_on_warning": false,
"fail_pipeline_on_error": true
}
},
{
"task_id": "load_feature_store",
"type": "feature_store_sync",
"depends_on": ["quality_gate"],
"config": {
"source_table": "analytics.derived.churn_features",
"feature_group": "customer_churn_features",
"entity_key": "customer_id",
"timestamp_column": "last_order_date"
}
}
],
"notifications": {
"on_failure": { "channel": "slack", "target": "#data-alerts" },
"on_success": { "channel": "slack", "target": "#data-pipelines" }
}
}

Stage 5: Analysis
Before training a model, Priya validates the feature data using the Data Quality Service (powered by Great Expectations).
Quality Rule Configuration
{
"suite_name": "churn_features_validation",
"dataset": "analytics.derived.churn_features",
"expectations": [
{
"type": "expect_table_row_count_to_be_between",
"kwargs": { "min_value": 1800000, "max_value": 2200000 }
},
{
"type": "expect_column_values_to_be_unique",
"kwargs": { "column": "customer_id" }
},
{
"type": "expect_column_values_to_not_be_null",
"kwargs": { "column": "customer_id" }
},
{
"type": "expect_column_values_to_be_between",
"kwargs": { "column": "days_since_last_order", "min_value": 0, "max_value": 3650 }
},
{
"type": "expect_column_values_to_be_between",
"kwargs": { "column": "avg_order_value", "min_value": 0, "max_value": 50000 }
},
{
"type": "expect_column_proportion_of_unique_values_to_be_between",
"kwargs": { "column": "customer_id", "min_value": 0.99, "max_value": 1.0 }
},
{
"type": "expect_column_values_to_be_in_set",
"kwargs": { "column": "segment", "value_set": ["Gold", "Silver", "Bronze", "New", "Inactive"] }
}
]
}

Quality Check Results
Suite: churn_features_validation
Run Date: 2025-12-15 04:12:33 UTC
Status: PASSED (7/7 expectations met)
[PASS] Row count: 1,987,432 (expected 1.8M - 2.2M)
[PASS] customer_id uniqueness: 100%
[PASS] customer_id nulls: 0.0%
[PASS] days_since_last_order range: all values in [0, 2,187]
[PASS] avg_order_value range: all values in [4.99, 14,329.00]
[PASS] customer_id uniqueness proportion: 1.0
[PASS] segment values: all in expected set

Feature Distribution Profiling
Priya profiles the target variable and discovers an important pattern:
Feature: days_since_last_order
──────────────────────────────
Count: 1,987,432
Mean: 67.3 days
Median: 42.0 days
Std Dev: 89.1 days
Min: 0 days
Max: 2,187 days
Distribution (bimodal):
0-30 days: ████████████████████ 41.2% (active buyers)
31-60 days: ████████ 16.3%
61-90 days: ██████ 12.1%
91-180 days: ████████████ 18.7% (recently churned)
181+ days: ██████ 11.7% (long-term inactive)
Churn label distribution:
Not churned (0): 69.6% | Churned (1): 30.4%

The bimodal distribution confirms that customers either buy regularly (< 30 days between orders) or drift away sharply after 90 days. This informs Priya's model design: there is a clear decision boundary.
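Under the hood, the suite's range and uniqueness expectations reduce to simple predicates over a column's values. A stdlib-only sketch (not the Great Expectations implementation, just the logic the names describe):

```python
def expect_between(vals, lo, hi):
    """Pass iff every value falls in [lo, hi];
    mirrors expect_column_values_to_be_between."""
    return all(lo <= v <= hi for v in vals)

def expect_unique(vals):
    """Pass iff there are no duplicates;
    mirrors expect_column_values_to_be_unique."""
    return len(vals) == len(set(vals))

# days_since_last_order observed range [0, 2187] vs. allowed [0, 3650]
print(expect_between([0, 42, 2187], 0, 3650))  # True
print(expect_unique([1, 2, 3]))                # True
```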
Stage 6: Productionization
Priya switches to the ML Workbench to train, evaluate, and deploy her model.
Experiment Tracking
She trains three models and logs all experiments to MLflow via the ML Workbench:
| Experiment | Model | Features | AUC | Precision | Recall | F1 | Training Time |
|---|---|---|---|---|---|---|---|
| churn-exp-001 | Logistic Regression | All 18 | 0.823 | 0.71 | 0.68 | 0.69 | 12s |
| churn-exp-002 | Random Forest | All 18 | 0.856 | 0.76 | 0.73 | 0.74 | 2m 18s |
| churn-exp-003 | Gradient Boosting (LightGBM) | All 18 | 0.871 | 0.79 | 0.77 | 0.78 | 4m 42s |
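As a sanity check on the table, the F1 column is the harmonic mean of precision and recall, F1 = 2PR / (P + R):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

experiments = {
    "churn-exp-001": (0.71, 0.68),
    "churn-exp-002": (0.76, 0.73),
    "churn-exp-003": (0.79, 0.77),
}
for name, (p, r) in experiments.items():
    print(name, round(f1(p, r), 2))  # 0.69, 0.74, 0.78 -- matches the table
```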
Model Registration
She registers the best model (churn-exp-003) in the ML Workbench model registry:
{
"model_name": "customer-churn-predictor",
"version": "1.0.0",
"source_experiment": "churn-exp-003",
"framework": "lightgbm",
"metrics": {
"auc": 0.871,
"precision": 0.79,
"recall": 0.77,
"f1": 0.78
},
"feature_set": "customer_churn_features",
"input_schema": {
"total_orders": "int",
"avg_order_value": "float",
"days_since_last_order": "int",
"customer_age_days": "int",
"total_sessions_90d": "int",
"product_views_90d": "int",
"return_rate": "float",
"segment": "string"
},
"tags": {
"team": "data-science",
"use_case": "customer-retention",
"owner": "priya@novamart.com"
}
}

Ray Serve Deployment
She deploys the model to Ray Serve for real-time scoring via the ML Workbench deployment interface:
{
"deployment_name": "churn-predictor-v1",
"model_ref": "customer-churn-predictor/1.0.0",
"runtime": "ray_serve",
"config": {
"num_replicas": 2,
"max_concurrent_queries": 100,
"ray_actor_options": {
"num_cpus": 1,
"memory": 2147483648
},
"autoscaling_config": {
"min_replicas": 2,
"max_replicas": 6,
"target_num_ongoing_requests_per_replica": 10
}
},
"health_check": {
"path": "/health",
"interval_s": 30
}
}

After deployment, the model is accessible at the internal endpoint:
# Test the deployed model
curl -X POST https://ml.novamart.matih.internal/serve/churn-predictor-v1/predict \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${JWT_TOKEN}" \
-d '{
"customer_id": 1042587,
"total_orders": 3,
"avg_order_value": 45.20,
"days_since_last_order": 78,
"customer_age_days": 412,
"total_sessions_90d": 2,
"product_views_90d": 8,
"return_rate": 0.33,
"segment": "Silver"
}'
# Response
{
"customer_id": 1042587,
"churn_probability": 0.73,
"churn_risk": "HIGH",
"top_factors": [
{ "feature": "days_since_last_order", "importance": 0.31 },
{ "feature": "total_sessions_90d", "importance": 0.22 },
{ "feature": "return_rate", "importance": 0.18 }
]
}

Stage 7: Feedback
With the model in production, Priya configures monitoring to catch degradation early.
Drift Detection Configuration
In the ML Workbench > Monitoring section, she configures feature and prediction drift alerts:
{
"monitor_name": "churn-predictor-drift",
"deployment": "churn-predictor-v1",
"monitors": [
{
"type": "feature_drift",
"method": "psi",
"features": ["avg_order_value", "days_since_last_order", "total_sessions_90d"],
"threshold": 0.1,
"window": "7d",
"reference_dataset": "analytics.derived.churn_features_baseline_2025q4"
},
{
"type": "prediction_drift",
"method": "ks_test",
"column": "churn_probability",
"threshold": 0.05,
"window": "7d"
},
{
"type": "accuracy_tracking",
"metric": "precision_at_k",
"k": 1000,
"ground_truth_table": "analytics.derived.actual_churn_labels",
"evaluation_schedule": "0 0 * * MON",
"alert_threshold": 0.70
}
],
"alerts": {
"channels": ["slack:#ml-alerts", "email:priya@novamart.com"],
"severity_mapping": {
"feature_drift": "warning",
"prediction_drift": "critical",
"accuracy_drop": "critical"
}
}
}

Monitoring Dashboard
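The PSI (population stability index) metric used by the feature-drift monitor compares bucketed proportions of a reference distribution against the current one; values below 0.1 are conventionally treated as stable. A standard formulation in stdlib Python (bucket proportions here are illustrative):

```python
import math

def psi(expected, actual):
    """Population Stability Index over pre-bucketed proportions.
    Each argument is a list of bucket proportions summing to 1."""
    eps = 1e-6  # guard against log(0) on empty buckets
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

baseline = [0.40, 0.30, 0.20, 0.10]   # e.g. churn_features_baseline_2025q4
current  = [0.38, 0.31, 0.21, 0.10]   # mild, harmless shift
print(round(psi(baseline, current), 4))  # well under the 0.1 alert threshold
```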
After two weeks in production, the monitoring dashboard shows:
Churn Predictor v1 - Production Monitoring
──────────────────────────────────────────
Period: 2025-12-15 to 2025-12-29
Prediction Volume: 142,847 requests
Avg Latency (p50): 23ms
Avg Latency (p99): 87ms
Error Rate: 0.02%
Feature Drift (PSI):
avg_order_value: 0.03 [OK]
days_since_last_order: 0.04 [OK]
total_sessions_90d: 0.07 [OK]
Prediction Distribution:
Low risk (< 0.3): 62.1%
Med risk (0.3-0.7): 24.3%
High risk (> 0.7): 13.6%
Weekly Accuracy (vs actual churn labels):
Week 1: Precision@1000 = 0.81 [OK]
Week 2: Precision@1000 = 0.79 [OK]

Stage 8: Experimentation
After the initial deployment, Priya runs a controlled experiment to compare her gradient boosting model against the original logistic regression baseline.
A/B Test Configuration
She sets up an A/B test via the ML Workbench > Experiments > A/B Tests panel:
{
"experiment_name": "churn-model-comparison-v1-v2",
"duration_days": 14,
"traffic_split": {
"control": {
"model": "customer-churn-predictor/0.1.0",
"description": "Logistic Regression (AUC 0.823)",
"traffic_percentage": 50
},
"treatment": {
"model": "customer-churn-predictor/1.0.0",
"description": "LightGBM (AUC 0.871)",
"traffic_percentage": 50
}
},
"assignment_method": "customer_id_hash",
"primary_metric": "churned_customers_identified",
"secondary_metrics": [
"false_positive_rate",
"retention_campaign_cost",
"revenue_saved"
],
"guardrail_metrics": [
{ "metric": "false_positive_rate", "max_value": 0.30 }
]
}

A/B Test Results (After 14 Days)
| Metric | Control (LR) | Treatment (LightGBM) | Lift | p-value |
|---|---|---|---|---|
| Churners identified (top 1000) | 614 | 756 | +23.1% | 0.0003 |
| False positive rate | 0.22 | 0.19 | -13.6% | 0.012 |
| Retention campaigns sent | 1,000 | 1,000 | -- | -- |
| Customers retained | 187 | 231 | +23.5% | 0.008 |
| Revenue saved (est.) | $48,200 | $62,700 | +30.1% | 0.004 |
| Campaign cost | $15,000 | $15,000 | -- | -- |
| Net ROI | $33,200 | $47,700 | +43.7% | -- |
Conclusion: The LightGBM model (treatment) identifies 23% more true churners with a lower false positive rate. At the same campaign budget, it generates an estimated $14,500 more in retained revenue per cohort of 1,000 targeted customers.
Priya promotes the LightGBM model to 100% traffic and archives the logistic regression variant.
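The experiment's `customer_id_hash` assignment method can be sketched as follows; the exact hashing scheme is an assumption, but the key properties hold for any such design: deterministic (a customer always sees the same model) and approximately 50/50:

```python
import hashlib

def assign_variant(customer_id, experiment="churn-model-comparison-v1-v2"):
    """Deterministic 50/50 split: hash (experiment, customer_id) to a bucket."""
    digest = hashlib.sha256(f"{experiment}:{customer_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

# The same customer is always routed to the same variant across requests,
# and both arms receive traffic.
print(assign_variant(1042587))
```

Salting the hash with the experiment name keeps assignments independent across concurrent experiments.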
Iteration Plan
Based on the experiment results, Priya documents next steps:
- Feature expansion -- add customer_support_tickets data (new Zendesk connector)
- Segment-specific models -- Gold customers may have different churn signals than Bronze
- Real-time scoring -- trigger retention offers at checkout for high-risk customers
- Feedback loop -- connect retention campaign outcomes back to the training data
Summary
| Stage | What Priya Did | Platform Components Used |
|---|---|---|
| Ingestion | Connected PostgreSQL CDC, imported survey CSV | Ingestion Service, Data Workbench |
| Discovery | Browsed catalog, profiled columns, traced lineage | Catalog Service, Data Workbench |
| Query | Wrote federated SQL joining PostgreSQL + Snowflake | Query Engine (Trino), Data Workbench |
| Orchestration | Built daily pipeline with quality gate | Pipeline Service (Temporal), Data Workbench |
| Analysis | Validated features with Great Expectations | Data Quality Service, Data Workbench |
| Productionization | Trained LightGBM, deployed to Ray Serve | ML Workbench, ML Service (MLflow, Ray Serve) |
| Feedback | Configured drift detection and accuracy tracking | ML Workbench Monitoring |
| Experimentation | Ran A/B test, LightGBM +23% churner identification | ML Workbench Experiments |
Related Walkthroughs
- ML Engineer Journey -- Marcus operationalizes the demand forecasting pipeline
- BI Lead Journey -- Sofia builds dashboards that consume Priya's churn scores
- Executive Journey -- David uses AI to ask strategic questions about churn
- Retail Overview -- NovaMart company profile and dataset descriptions