Data Products as First-Class Citizens: Why Your Data Needs a Product Manager
March 2026 · 10 min read
This is Part 1 of the Product Intelligence Series — a 10-part deep dive into treating every data, ML, AI, and BI asset as a living product with health, ownership, and lifecycle management.
The Graveyard of Tables Nobody Owns
Sarah is a data engineer at a mid-market SaaS company. She receives a Slack message at 2:47 PM on a Thursday: "The ARR dashboard looks wrong. Revenue dropped 40% overnight. Is the data broken?"
She opens the catalog. The dashboard queries analytics.monthly_revenue. She traces that table to a dbt model that joins three upstream sources. One of them, raw.stripe_invoices, was last updated eleven days ago. The Fivetran connector silently failed after a Stripe API change. Nobody noticed because nobody owns this table.
Sarah checks the catalog for an owner. There is none. She checks for documentation. There is a one-line description from 2024: "Stripe invoice data." She checks for an SLA. There is no SLA. She checks for consumers. She has no idea who else depends on this table.
This is not an edge case. A 2025 Atlan survey found that 42% of data tables in the average enterprise have no documented owner, 67% have no freshness SLA, and 38% have not been queried in 90 days but continue to be refreshed daily, consuming compute and storage for data nobody uses.
Sarah's company does not have a data quality problem. It has a data product management problem.
Data as a Byproduct vs. Data as a Product
The distinction is not semantic. It is structural.
When data assets are treated as byproducts — side effects of pipelines, artifacts of ETL jobs, residue of operational systems — they inherit none of the disciplines that make software products reliable. No owner. No SLA. No health monitoring. No consumer feedback loop. No deprecation process. No lifecycle.
When data assets are treated as products — intentional outputs with a purpose, an audience, and an owner — they inherit all of those disciplines. And the operational difference is night and day.
| Dimension | Data as Byproduct | Data as Product |
|---|---|---|
| Ownership | "The pipeline team, I think?" | Revenue Data Product owned by @maria.chen |
| SLA | None. Freshness is whatever the cron job delivers | Freshness SLA of 1 hour, with alerting at 45 minutes |
| Health | Unknown until someone complains | 99.2% across 6 dimensions, checked every 5 minutes |
| Documentation | A Confluence page from 2023 | Auto-generated schema docs + business context + lineage |
| Consumers | Unknown | 47 registered consumers across 3 teams |
| Change management | "I renamed a column, hope nothing breaks" | Schema change triggers impact analysis → consumer notification |
| Retirement | Tables accumulate forever | 0 queries for 90 days → deprecation notice → archive |
Netflix articulated this principle years ago: every data asset that crosses a team boundary must have a purpose, an audience, and an owner. Spotify, Airbnb, and LinkedIn followed. The data mesh movement formalized it. But most enterprises still operate in the byproduct world — not because the concept is hard, but because the tooling to enforce it did not exist.
Until now.
The Product Kernel: Health as a First-Class Property
At the core of the data product model is a shared abstraction we call the Product Kernel — a base entity (ProductEntity) that every data product inherits from. The kernel provides three things that raw tables and views do not have: identity, health, and lifecycle.
Identity means the product has a unique identifier, a human-readable name, an owner (person or team), a description, a domain classification, and a set of tags. Identity is what makes a product discoverable and accountable.
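As a concrete (and purely illustrative) sketch, the kernel could be a small base class carrying identity, health, and lifecycle as fields. The names below are assumptions for illustration, not the actual API of any platform:

```python
from dataclasses import dataclass, field
from enum import Enum


class LifecycleState(Enum):
    DRAFT = "draft"
    PUBLISHED = "published"
    DEPRECATED = "deprecated"
    RETIRED = "retired"


@dataclass
class ProductEntity:
    """Base kernel every data product inherits: identity, health, lifecycle."""
    # Identity: what makes the product discoverable and accountable
    product_id: str
    name: str
    owner: str                      # person or team handle, e.g. "@maria.chen"
    description: str
    domain: str
    tags: list[str] = field(default_factory=list)
    # Health: composite score in [0, 1], recomputed continuously
    health_score: float = 1.0
    # Lifecycle: enforced state machine, every product starts in DRAFT
    state: LifecycleState = LifecycleState.DRAFT


product = ProductEntity(
    product_id="dp-revenue-001",
    name="monthly_revenue",
    owner="@maria.chen",
    description="Monthly recognized revenue by customer",
    domain="finance",
    tags=["revenue", "tier-1"],
)
print(product.state.name)  # DRAFT
```

Tables and views lack exactly these fields; inheriting them is what turns a raw asset into something the platform can monitor and hold accountable.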
Health is computed continuously across six dimensions:
| Dimension | What It Measures | Example Failure |
|---|---|---|
| Freshness | Time since last successful update vs. defined SLA | Pipeline delayed 3 hours beyond 1-hour SLA |
| Completeness | Percentage of non-null values in required columns | customer_email column 23% null after source migration |
| Accuracy | Statistical validation against known invariants | Revenue column contains negative values |
| Consistency | Cross-source agreement on shared entities | Customer count differs by 12% between CRM and billing |
| Timeliness | End-to-end latency from source event to product availability | Event occurred at 10:00, product reflects it at 14:00 |
| Schema stability | Frequency and severity of schema changes | 3 breaking column renames in the past 30 days |
Each dimension produces a score between 0 and 1. The composite health score is a weighted average, with weights configurable per product. A revenue table might weight accuracy at 0.3 and freshness at 0.25. A clickstream table might weight volume (a sub-dimension of completeness) at 0.4.
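The weighted average is simple to state in code. This is a minimal sketch, with illustrative weights matching the revenue-table example above (the scores shown are hypothetical):

```python
def composite_health(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each in [0, 1].

    Weights are normalized over the dimensions actually supplied, so a
    product can omit a dimension without re-tuning the others.
    """
    total_weight = sum(weights.get(dim, 0.0) for dim in scores)
    if total_weight == 0:
        raise ValueError("no weights defined for supplied dimensions")
    return sum(s * weights.get(dim, 0.0) for dim, s in scores.items()) / total_weight


# Illustrative weights for a revenue product: accuracy 0.30, freshness 0.25
revenue_weights = {
    "freshness": 0.25, "completeness": 0.15, "accuracy": 0.30,
    "consistency": 0.10, "timeliness": 0.10, "schema_stability": 0.10,
}
scores = {
    "freshness": 0.12,        # pipeline stalled: freshness has collapsed
    "completeness": 1.0, "accuracy": 1.0,
    "consistency": 0.95, "timeliness": 0.40, "schema_stability": 1.0,
}
print(round(composite_health(scores, revenue_weights), 3))  # 0.715
```

Note how a single failed dimension (freshness at 0.12) drags the composite well below a typical alert threshold even while five of six dimensions look healthy.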
The critical insight: health is not a dashboard metric. It is an operational signal. When a product's health drops below its configured threshold, downstream consumers are automatically notified. Dependent ML models are flagged. BI dashboards display staleness warnings. The platform does not wait for Sarah's 2:47 PM Slack message.
Lifecycle: From Draft to Retirement
Every data product moves through a defined lifecycle. This is not bureaucracy — it is the mechanism that prevents the graveyard of tables nobody owns.
DRAFT — A new data product has been registered but is not yet ready for consumption. The owner is assigned. Schema is being stabilized. Quality checks are being configured. Consumers cannot subscribe yet.
PUBLISHED — The product has passed its quality gates and is available for consumption. Health is monitored continuously across all six dimensions. Consumers can subscribe and will receive notifications on changes. The product has an SLA.
DEPRECATED — The product has been marked for sunset. Consumers receive a deprecation notice with a timeline and migration path. New subscriptions are blocked. Existing consumers have a defined migration window (default: 90 days).
RETIRED — The product is no longer available. Data is archived according to retention policy. Compute resources are reclaimed. The product remains in the catalog for lineage history but serves no live queries.
The lifecycle is enforced by the platform, not by process documents. A product cannot move from DRAFT to PUBLISHED without passing health gate validation. A product cannot be RETIRED while it has active consumers who have not acknowledged the deprecation notice.
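Platform enforcement means the transition rules live in code, not in a wiki. A minimal sketch of such a state machine, with the two gates described above (all names are illustrative):

```python
from enum import Enum, auto


class State(Enum):
    DRAFT = auto()
    PUBLISHED = auto()
    DEPRECATED = auto()
    RETIRED = auto()


# Legal transitions: DRAFT -> PUBLISHED -> DEPRECATED -> RETIRED
ALLOWED = {
    State.DRAFT: {State.PUBLISHED},
    State.PUBLISHED: {State.DEPRECATED},
    State.DEPRECATED: {State.RETIRED},
    State.RETIRED: set(),
}


def transition(current: State, target: State, *, health_gate_passed: bool,
               unacknowledged_consumers: int) -> State:
    """Apply a lifecycle transition, rejecting anything the rules forbid."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    if current is State.DRAFT and target is State.PUBLISHED and not health_gate_passed:
        raise ValueError("cannot publish: health gate validation failed")
    if target is State.RETIRED and unacknowledged_consumers > 0:
        raise ValueError(f"cannot retire: {unacknowledged_consumers} consumers "
                         "have not acknowledged the deprecation notice")
    return target


state = transition(State.DRAFT, State.PUBLISHED,
                   health_gate_passed=True, unacknowledged_consumers=0)
print(state.name)  # PUBLISHED
```

The same function rejects a DRAFT-to-RETIRED jump outright, and blocks DEPRECATED-to-RETIRED while any subscriber has not acknowledged the sunset notice.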
Auto-Discovery: Products That Register Themselves
The most common objection to data products is the overhead: "We have 4,000 tables. We cannot manually register each one as a product."
Correct. That is why auto-discovery exists.
Agents crawl your existing infrastructure — dbt project manifests, Airflow DAG definitions, Trino catalog metadata, Spark job outputs — and register candidate products automatically. The agent extracts schema, infers freshness patterns from historical update timestamps, identifies potential owners from git blame on transformation code, and suggests quality rules based on column data types and value distributions.
The result is not a finished product. It is a DRAFT with pre-populated metadata that a human reviews and promotes. The agent does 80% of the registration work; the human does the remaining 20%: validation, ownership assignment, and SLA definition. A team of three can productize 4,000 tables in two weeks instead of six months.
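To make the crawl concrete, here is a sketch of the dbt path only, run against a deliberately simplified manifest shape. A real dbt `manifest.json` nests far more metadata, and owner inference would come from git blame on the model files rather than from the manifest itself; everything here is an assumption for illustration:

```python
import json


def discover_candidates(manifest_json: str) -> list[dict]:
    """Turn dbt models found in a manifest into DRAFT product candidates.

    Assumes a simplified manifest with a top-level "nodes" mapping where
    each node carries "resource_type" and "name". Non-model resources
    (seeds, tests, snapshots) are skipped.
    """
    manifest = json.loads(manifest_json)
    candidates = []
    for node_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue
        candidates.append({
            "product_id": node_id,
            "name": node["name"],
            "state": "DRAFT",        # a human reviews and promotes to PUBLISHED
            "owner": None,           # to be inferred from git blame, then confirmed
            "suggested_sla": None,   # to be inferred from update-timestamp history
        })
    return candidates


sample = json.dumps({"nodes": {
    "model.analytics.monthly_revenue": {"resource_type": "model",
                                        "name": "monthly_revenue"},
    "seed.analytics.country_codes": {"resource_type": "seed",
                                     "name": "country_codes"},
}})
print([c["name"] for c in discover_candidates(sample)])  # ['monthly_revenue']
```

The key design point survives the simplification: discovery produces reviewable DRAFTs with empty slots where human judgment is required, never auto-published products.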
Consumer Subscriptions: The Contract Between Producer and Consumer
When a team subscribes to a data product, they are entering a contract. The producer commits to maintaining the product's health, SLA, and schema stability. The consumer commits to using the product's published interface (not raw table access) and acknowledging deprecation notices.
Subscriptions unlock three capabilities that raw table access does not provide:
Change notifications. When the producer changes the schema, adds a column, modifies a quality rule, or adjusts the SLA, every subscriber receives a notification with the change details and potential impact. No more "I renamed a column, hope nothing breaks."
Impact analysis. Before making any change, the producer sees the blast radius: 47 consumers, including 3 ML models and 2 executive dashboards. This is the information that prevents casual breaking changes.
Health propagation. When the product's health degrades, subscribers are notified proactively. A downstream ML model can automatically switch to a fallback data source. A BI dashboard can display a staleness warning instead of confidently showing wrong numbers.
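All three capabilities rest on the same primitive: a registry mapping each product to its subscribers, with events fanned out on change. A minimal sketch, using in-process callbacks as a stand-in for real channels like Slack, email, or webhooks (names are illustrative):

```python
from collections import defaultdict
from typing import Callable

# product_id -> subscriber callbacks; in a real platform each entry would be
# a delivery channel (Slack, email, webhook) plus consumer metadata
_subscribers: dict[str, list[Callable[[str, dict], None]]] = defaultdict(list)


def subscribe(product_id: str, callback: Callable[[str, dict], None]) -> None:
    _subscribers[product_id].append(callback)


def blast_radius(product_id: str) -> int:
    """Impact analysis: how many consumers a change would touch."""
    return len(_subscribers[product_id])


def publish_event(product_id: str, event: dict) -> None:
    """Fan a schema-change or health event out to every subscriber."""
    for callback in _subscribers[product_id]:
        callback(product_id, event)


received = []
subscribe("dp-revenue-001", lambda pid, ev: received.append(ev))
print(blast_radius("dp-revenue-001"))  # 1
publish_event("dp-revenue-001",
              {"type": "health_degraded", "score": 0.12,
               "detail": "freshness SLA breached"})
print(received[0]["type"])  # health_degraded
```

`blast_radius` is the producer-facing view of the same registry: before a breaking change ships, the producer queries it instead of hoping nothing depends on the column.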
Industry Patterns
The product kernel is universal, but the health weights and quality rules are domain-specific. (In the examples below, only the dominant weights are shown; the remainder is distributed across the other dimensions.)
SaaS companies productize ARR metrics, cohort tables, and product usage events. Freshness is critical — a 2-hour delay in product usage data means the growth team is making decisions on stale signals. Health weights: freshness 0.30, completeness 0.25, accuracy 0.20.
Healthcare organizations productize patient records, clinical trial data, and claims tables. Accuracy and completeness dominate — a missing diagnosis code or an incorrect medication dosage has patient safety implications. Health weights: accuracy 0.35, completeness 0.30, consistency 0.20.
Financial institutions productize transaction data, risk metrics, and regulatory reports. Consistency and timeliness are paramount — a 15-minute delay in transaction data can mean missed fraud detection. Health weights: timeliness 0.30, consistency 0.25, accuracy 0.25.
The Transformation
Return to Sarah's Thursday afternoon. In a world where data assets are products:
The raw.stripe_invoices table is a published Data Product owned by the payments team. Its freshness SLA is 1 hour. When the Fivetran connector failed eleven days ago, the health score dropped from 0.98 to 0.12 within 65 minutes. The payments team received an alert. The 47 downstream consumers — including the ARR dashboard — received a degradation notice. The dashboard displayed a yellow banner: "Revenue data stale since March 5. Last known good: $4.2M ARR." The CEO's dashboard did not show wrong numbers. It showed honest numbers with a clear caveat.
Sarah does not receive a panicked Slack message. The system handled it.
That is the difference between data as a byproduct and data as a product. Not better pipelines. Not better monitoring. A fundamentally different relationship between data producers and data consumers — mediated by health, ownership, and lifecycle.
What Comes Next
Data Products are the foundation. But data is only one type of asset in the modern analytics stack. In Part 2 of this series, we explore ML Products — what happens when you wrap a machine learning model with the same health, ownership, and lifecycle discipline. Models that know when they are wrong, trigger their own retraining, and declare their dependencies on upstream Data Products.
This is Part 1 of the Product Intelligence Series. Next: ML Products: From Model Artifacts to Living Products That Know When They're Wrong.