Production Models Don't Die. They Drift.
March 2026 · 11 min read
This is Part 2 of the Product Intelligence Series — a 10-part deep dive into treating every data, ML, AI, and BI asset as a living product with health, ownership, and lifecycle management.
The Model That Forgot
Six months ago, a fintech company deployed a churn prediction model. It was good — 0.89 AUC on the holdout set, validated by the data science team, blessed by the product manager, deployed with a celebration Slack emoji. It predicted which customers were likely to cancel their subscription within the next 30 days, and the retention team used those predictions to prioritize outreach.
For two months, it worked beautifully. Retention rates improved 12%. The team moved on to the next project.
Then the company launched a new pricing tier. Customer behavior shifted. Users who previously showed "churn signals" — reduced login frequency, fewer transactions — were actually just migrating to the new tier, not leaving. The model did not know this. It kept predicting churn for customers who were upgrading. The retention team spent six weeks calling happy customers with discounts they did not need, eroding margin on the company's most engaged users.
Nobody checked the model. There was no system to check the model. The MLflow experiment tracker showed the training run from six months ago. The model registry showed "Production." The monitoring dashboard — there was no monitoring dashboard.
This is the default state of ML in the enterprise. Models are artifacts. They are trained, evaluated once, deployed, and forgotten. They do not know when the world has changed. They do not know when they are wrong.
The Artifact Trap
The ML industry has optimized for the wrong phase. Experiment tracking is solved. Model registries are solved. Feature stores are maturing. But these tools address model creation, not model operation. They answer "how was this model built?" but not "is this model still working?"
The distinction matters because ML models degrade by default. Unlike software, which breaks loudly (crashes, errors, exceptions), ML models degrade silently. A model producing predictions with 0.72 AUC looks exactly the same as a model producing predictions with 0.89 AUC — both return a float between 0 and 1 with high confidence. The outputs are well-formed. The outputs are wrong.
This is the artifact trap: treating a model as a static thing that was validated at training time and remains valid forever. In reality, a model is a hypothesis about the relationship between inputs and outputs — and that relationship is constantly changing.
ML Products: The Living Model
An ML Product wraps a model artifact with the same product disciplines introduced in Part 1 for Data Products: identity, health, lifecycle, ownership, and consumer contracts. But ML health is measured across different dimensions than data health.
Six Dimensions of ML Health
| Dimension | What It Measures | Alert Threshold Example |
|---|---|---|
| Accuracy | Live performance against ground truth labels (when available) or proxy metrics | AUC drops below 0.80 on rolling 7-day window |
| Drift | Statistical divergence between training distribution and live input distribution | PSI > 0.25 on any top-10 feature |
| Latency | Inference time at P50, P95, P99 | P99 latency exceeds 200ms |
| Data dependency health | Composite health of upstream Data Products the model consumes | Any upstream Data Product health drops below 0.70 |
| Freshness | Time since last successful retraining or evaluation | Model not retrained in 90 days despite drift signals |
| Fairness | Performance parity across protected demographic groups | Accuracy gap > 5% between any two demographic segments |
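The table above can be sketched as a single health snapshot with one check per dimension. This is a minimal illustration, not a real platform API: the field names and the `MLHealthSnapshot` class are invented here, and the thresholds are the example values from the table.

```python
from dataclasses import dataclass

@dataclass
class MLHealthSnapshot:
    """Hypothetical record of the six ML health dimensions (illustrative only)."""
    auc_7d: float               # accuracy: rolling 7-day AUC against ground truth
    max_psi: float              # drift: worst PSI across top-10 features
    p99_latency_ms: float       # latency: P99 inference time
    min_upstream_health: float  # data dependency: weakest upstream Data Product
    days_since_retrain: int     # freshness
    fairness_gap: float         # max accuracy gap between demographic segments

    def alerts(self) -> list[str]:
        # Each tuple pairs a failing condition with its alert message,
        # using the example thresholds from the table above.
        checks = [
            (self.auc_7d < 0.80, "accuracy: AUC below 0.80 on 7-day window"),
            (self.max_psi > 0.25, "drift: PSI above 0.25 on a top-10 feature"),
            (self.p99_latency_ms > 200, "latency: P99 above 200ms"),
            (self.min_upstream_health < 0.70, "dependency: upstream health below 0.70"),
            (self.days_since_retrain > 90, "freshness: not retrained in 90 days"),
            (self.fairness_gap > 0.05, "fairness: >5% accuracy gap between segments"),
        ]
        return [message for failed, message in checks if failed]

snapshot = MLHealthSnapshot(0.78, 0.12, 140, 0.97, 30, 0.02)
print(snapshot.alerts())  # only the accuracy dimension fires
```

The point of the sketch is that health is a composite: a model can pass five dimensions and still be unhealthy on the sixth.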
The critical innovation is dimension 4: data dependency health. An ML Product explicitly declares which Data Products it consumes as training data and feature inputs. If the upstream customer_transactions Data Product drops to 0.45 health because of a completeness issue, the downstream churn model is automatically flagged — not because the model itself changed, but because the foundation it stands on shifted.
This is the connection that the artifact model misses entirely. A model can be "healthy" by every internal metric (low drift, good latency, no errors) and still be producing wrong predictions because the data it consumes is broken. The ML Product model makes this dependency explicit and monitorable.
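Mechanically, the propagation is a walk over the dependency graph: when an upstream product degrades, everything downstream of it gets flagged. A minimal sketch, assuming a simple adjacency-list representation (the product names follow the article's examples, and the graph itself is invented):

```python
from collections import deque

# edges: upstream product -> products that consume it (hypothetical graph)
DEPENDENCIES = {
    "customer_transactions": ["churn_model"],
    "churn_model": ["retention_dashboard"],
}

def flag_downstream(degraded: str, graph: dict[str, list[str]]) -> set[str]:
    """Breadth-first walk collecting every product affected by `degraded`."""
    flagged: set[str] = set()
    queue = deque(graph.get(degraded, []))
    while queue:
        product = queue.popleft()
        if product not in flagged:
            flagged.add(product)
            queue.extend(graph.get(product, []))
    return flagged

# A completeness issue in customer_transactions flags the churn model
# and the dashboard that consumes the churn model's predictions.
print(flag_downstream("customer_transactions", DEPENDENCIES))
```

Note that the flag reaches the dashboard two hops away, which is exactly what a per-model monitoring setup misses.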
Drift Detection: The Early Warning System
Statistical drift detection is the ML Product's most important automated capability. Two tests run continuously on every published ML Product:
Population Stability Index (PSI) measures how much the distribution of input features has shifted since training. A PSI below 0.10 indicates no significant change. Between 0.10 and 0.25, the model should be monitored closely. Above 0.25, the model is operating outside its training distribution and predictions are unreliable.
Kolmogorov-Smirnov (KS) test compares the cumulative distribution functions of individual features between training and production data. It catches subtle shifts that PSI might miss — a feature whose mean is unchanged but whose variance has doubled, for example.
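Both tests are a few lines of NumPy. The sketch below uses quantile bins from the training sample for PSI (the standard formulation, PSI = Σ (p − q) · ln(p/q) over bins) and a direct empirical-CDF comparison for the KS statistic; the synthetic data mimics the article's login-frequency feature before and after a behavior shift:

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a training sample (`expected`)
    and a live sample (`actual`), using training-quantile bins."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside training range
    p = np.histogram(expected, edges)[0] / len(expected)
    q = np.histogram(actual, edges)[0] / len(actual)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)  # avoid log(0)
    return float(np.sum((p - q) * np.log(p / q)))

def ks_statistic(a: np.ndarray, b: np.ndarray) -> float:
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
train = rng.normal(14.3, 4.0, 10_000)  # login_frequency_30d at training time
live = rng.normal(9.0, 4.0, 10_000)    # the same feature after the behavior shift

print(f"PSI: {psi(train, live):.2f}")        # well above the 0.25 threshold
print(f"KS:  {ks_statistic(train, live):.2f}")
```

In production these two tests run per feature on a rolling window; the shift here is deliberately large so both tests fire unambiguously.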
When drift is detected, the ML Product does not wait for a human. It follows a configured response ladder:
- PSI 0.10-0.20: Log the drift, update health score, notify owner
- PSI 0.20-0.25: Trigger automated evaluation against recent ground truth
- PSI > 0.25: Trigger retraining pipeline with the most recent training data
- Retraining fails or new model underperforms: Escalate to owner with full diagnostic
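The ladder itself reduces to a threshold dispatch. A minimal sketch, where the action names are illustrative placeholders for calls into notification, evaluation, and retraining services:

```python
def drift_response(psi_value: float) -> str:
    """Map a PSI reading to the configured response rung (illustrative names)."""
    if psi_value > 0.25:
        return "trigger_retraining"          # operating outside training distribution
    if psi_value > 0.20:
        return "evaluate_against_ground_truth"
    if psi_value > 0.10:
        return "log_and_notify_owner"
    return "no_action"                       # within normal variation

for value in (0.05, 0.15, 0.22, 0.40):
    print(f"PSI {value:.2f} -> {drift_response(value)}")
```

The escalation case (retraining fails or the new model underperforms) sits outside this function: it is the failure path of the retraining pipeline, not a PSI threshold.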
The fintech churn model from our opening story would have triggered the second rung of this ladder within two weeks of the pricing tier launch. The feature distributions for login frequency, transaction count, and plan type would have shifted measurably. The automated evaluation would have shown AUC degradation. The retraining pipeline would have incorporated the new behavioral patterns. Six weeks of misdirected outreach would have been six days.
Canary Deployment: Trust, but Verify
When an ML Product is retrained — whether triggered by drift detection or scheduled — the new model does not simply replace the old one. It enters a canary deployment pipeline:
Phase 1 (5% traffic): The new model serves 5% of predictions. Both models' outputs are logged and compared. If the new model's accuracy on the 5% sample is within threshold, proceed.
Phase 2 (25% traffic): Traffic increases to 25%. Statistical tests compare the new model's performance against the incumbent across all fairness dimensions. If any demographic segment shows degradation > 2%, roll back.
Phase 3 (100% traffic): Full rollout. The old model is retained for 72 hours as a rollback target. If any health dimension drops below threshold within the rollback window, automatic reversion.
The canary pipeline is not optional. It is part of the ML Product lifecycle. A model that cannot be safely deployed through the canary pipeline cannot be published.
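The three phases can be modeled as a gated state machine: each phase admits more traffic only if its gate passes, and a failed gate means rollback. A sketch under invented metric values, with the gate thresholds taken from the phases above:

```python
# Each phase: (traffic %, gate description, gate over candidate/incumbent metrics).
PHASES = [
    (5, "accuracy within threshold",
     lambda cand, inc: inc["auc"] - cand["auc"] <= 0.02),
    (25, "no segment degrades > 2%",
     lambda cand, inc: max(inc["segment_acc"][s] - cand["segment_acc"][s]
                           for s in inc["segment_acc"]) <= 0.02),
    (100, "full rollout, 72h rollback window",
     lambda cand, inc: True),  # post-rollout reversion handled by health monitoring
]

def run_canary(candidate: dict, incumbent: dict) -> str:
    """Walk the phases; stop and roll back at the first failed gate."""
    for traffic, gate_name, gate in PHASES:
        if not gate(candidate, incumbent):
            return f"rolled back at {traffic}% traffic ({gate_name})"
    return "promoted to 100% traffic"

incumbent = {"auc": 0.87, "segment_acc": {"A": 0.84, "B": 0.82}}
candidate = {"auc": 0.88, "segment_acc": {"A": 0.85, "B": 0.83}}
print(run_canary(candidate, incumbent))  # promoted to 100% traffic
```

A real pipeline would gate on statistical tests over logged prediction pairs rather than point metrics, but the control flow is the same: promotion is the exception that must be earned, not the default.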
Decision Traces: Every Prediction Has a Receipt
Every prediction made by an ML Product is accompanied by a decision trace — a structured record of what data was used, what features contributed most, and what the model considered.
Prediction: Customer #48291 — 78% churn probability (HIGH)
Top contributing features:
1. login_frequency_30d: 2 (vs training avg: 14.3) — contribution: +0.31
2. support_tickets_7d: 3 (vs training avg: 0.4) — contribution: +0.22
3. plan_downgrade_signal: true — contribution: +0.18
Data sources:
- customer_activity (Data Product, health: 0.97, freshness: 12 min)
- support_interactions (Data Product, health: 0.94, freshness: 3 hours)
Model version: v2.3.1 (trained 2026-02-28, AUC: 0.87)

Decision traces serve three purposes. First, explainability — when a business user asks "why does the model think this customer is churning?", the trace provides the answer in business terms, not feature vector indices. Second, debugging — when a prediction is wrong, the trace shows exactly which features drove the decision, enabling root cause analysis. Third, auditability — in regulated industries, every automated decision must be traceable to its inputs and reasoning.
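As a data structure, the trace above is just a nested record that serializes cleanly to JSON. The field names below are assumptions, and the contribution values would come from an attribution method such as SHAP in a real system; this sketch only shows the shape of the record:

```python
import json

# Hypothetical decision-trace schema, populated with the example values
# from the trace shown above.
trace = {
    "prediction": {"entity": "customer_48291", "churn_probability": 0.78, "band": "HIGH"},
    "top_features": [
        {"name": "login_frequency_30d", "value": 2, "training_avg": 14.3, "contribution": 0.31},
        {"name": "support_tickets_7d", "value": 3, "training_avg": 0.4, "contribution": 0.22},
        {"name": "plan_downgrade_signal", "value": True, "contribution": 0.18},
    ],
    "data_sources": [
        {"product": "customer_activity", "health": 0.97, "freshness_min": 12},
        {"product": "support_interactions", "health": 0.94, "freshness_min": 180},
    ],
    "model": {"version": "v2.3.1", "trained": "2026-02-28", "auc": 0.87},
}

print(json.dumps(trace, indent=2))  # the record an auditor or debugger would retrieve
```

Because the trace embeds the health and freshness of its data sources at prediction time, a wrong prediction can be traced not just to its features but to the state of the Data Products behind them.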
Industry Guardrails
The six health dimensions are universal, but their thresholds and enforcement mechanisms are industry-specific.
Healthcare requires explainability for any model that influences clinical decisions. The ML Product's fairness dimension must be evaluated across age, sex, ethnicity, and socioeconomic status. A model that shows > 3% accuracy gap across demographic segments cannot be published. Decision traces must be retained for 7 years to comply with audit requirements.
Finance requires fairness testing against protected classes under ECOA and fair lending regulations. The ML Product's drift detection must include adverse action reason codes — when a loan is denied, the applicant must receive the specific factors that contributed to the decision. Retraining pipelines must include regulatory holdout validation against a compliance-approved test set.
Retail requires A/B testing as a first-class deployment strategy. The ML Product's canary deployment integrates with the company's experimentation platform, ensuring that model changes are measured against business KPIs (conversion rate, average order value), not just statistical metrics.
Model as Artifact vs. Model as Product
| Dimension | Model as Artifact | Model as Product |
|---|---|---|
| Identity | A file in S3 with a timestamp | A registered product with name, owner, version, domain |
| Health | Unknown until someone checks | 6 dimensions, computed continuously, propagated to consumers |
| Drift detection | Manual, if it happens at all | Automated PSI + KS tests on every feature, configurable response |
| Deployment | Replace the endpoint, hope for the best | Canary pipeline: 5% → 25% → 100% with automatic rollback |
| Dependencies | Implicit — the model "uses" some tables | Explicit — declares upstream Data Products, health propagates |
| Explainability | "It's a gradient boosted tree" | Decision trace on every prediction with feature contributions |
| Lifecycle | "Production" forever | DRAFT → PUBLISHED → DEPRECATED → RETIRED with consumer contracts |
| Retraining | Quarterly, maybe | Triggered by drift, gated by evaluation, deployed through canary |
| Consumer contract | None — downstream systems query the endpoint | Subscriptions with SLAs, deprecation notices, migration windows |
The Compound Effect
ML Products do not exist in isolation. They depend on Data Products (Part 1) and are consumed by AI Products (Part 3) and BI Products (Part 4). When you connect these dependencies through a product graph, something powerful emerges: the entire analytics stack becomes self-aware.
A schema change in a source table propagates health signals through Data Products to ML Products to dashboards. A drift event in an ML Product triggers evaluation, retraining, and downstream notification — all automatically. The platform does not wait for the fintech retention team to spend six weeks calling happy customers. It acts when the world changes.
What Comes Next
ML Products give models the ability to know when they are wrong. But in the modern analytics stack, models are not consumed directly by humans — they are consumed by AI agents. In Part 3, we explore AI Products — what happens when you wrap an autonomous agent with the same health, ownership, and lifecycle discipline. Agents that are composed like microservices, evaluated before deployment, and shared across teams through a marketplace.
This is Part 2 of the Product Intelligence Series. Previous: Data Products as First-Class Citizens. Next: AI Products: Composing Agents Like Microservices.