Data Products as First-Class Citizens: Why Your Data Needs a Product Manager
March 2026 · 10 min read
This is Part 1 of the Product Intelligence Series — a 10-part deep dive into treating every data, ML, AI, and BI asset as a living product with health, ownership, and lifecycle management.
The Graveyard of Tables Nobody Owns
Sarah is a data engineer at a mid-market SaaS company. She receives a Slack message at 2:47 PM on a Thursday: "The ARR dashboard looks wrong. Revenue dropped 40% overnight. Is the data broken?"
She opens the catalog. The dashboard queries analytics.monthly_revenue. She traces that table to a dbt model that joins three upstream sources. One of them, raw.stripe_invoices, was last updated eleven days ago. The Fivetran connector silently failed after a Stripe API change. Nobody noticed because nobody owns this table.
Sarah checks the catalog for an owner. There is none. She checks for documentation. There is a one-line description from 2024: "Stripe invoice data." She checks for an SLA. There is no SLA. She checks for consumers. She has no idea who else depends on this table.
This is not an edge case. A 2025 Atlan survey found that 42% of data tables in the average enterprise have no documented owner, 67% have no freshness SLA, and 38% have not been queried in 90 days but continue to be refreshed daily, consuming compute and storage for data nobody uses.
Sarah's company does not have a data quality problem. It has a data product management problem.
Data as a Byproduct vs. Data as a Product
The distinction is not semantic. It is structural.
When data assets are treated as byproducts — side effects of pipelines, artifacts of ETL jobs, residue of operational systems — they inherit none of the disciplines that make software products reliable. No owner. No SLA. No health monitoring. No consumer feedback loop. No deprecation process. No lifecycle.
When data assets are treated as products — intentional outputs with a purpose, an audience, and an owner — they inherit all of those disciplines. And the operational difference is night and day.
| Dimension | Data as Byproduct | Data as Product |
|---|---|---|
| Ownership | "The pipeline team, I think?" | Revenue Data Product owned by @maria.chen |
| SLA | None. Freshness is whatever the cron job delivers | Freshness SLA of 1 hour, with alerting at 45 minutes |
| Health | Unknown until someone complains | 99.2% across 6 dimensions, checked every 5 minutes |
| Documentation | A Confluence page from 2023 | Auto-generated schema docs + business context + lineage |
| Consumers | Unknown | 47 registered consumers across 3 teams |
| Change management | "I renamed a column, hope nothing breaks" | Schema change triggers impact analysis → consumer notification |
| Retirement | Tables accumulate forever | 0 queries for 90 days → deprecation notice → archive |
Netflix articulated this principle years ago: every data asset that crosses a team boundary must have a purpose, an audience, and an owner. Spotify, Airbnb, and LinkedIn followed. The data mesh movement formalized it. But most enterprises still operate in the byproduct world — not because the concept is hard, but because the tooling to enforce it did not exist.
Until now.
The Product Kernel: Health as a First-Class Property
At the core of the data product model is a shared abstraction we call the Product Kernel — a base entity (ProductEntity) that every data product inherits from. The kernel provides three things that raw tables and views do not have: identity, health, and lifecycle.
Identity means the product has a unique identifier, a human-readable name, an owner (person or team), a description, a domain classification, and a set of tags. Identity is what makes a product discoverable and accountable.
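As a concrete (and purely illustrative) sketch, the kernel could be a small base class carrying identity, health, and lifecycle as fields. The names below are assumptions for illustration, not the actual API of any platform:

```python
from dataclasses import dataclass, field
from enum import Enum


class LifecycleState(Enum):
    DRAFT = "draft"
    PUBLISHED = "published"
    DEPRECATED = "deprecated"
    RETIRED = "retired"


@dataclass
class ProductEntity:
    """Base kernel every data product inherits: identity, health, lifecycle."""
    # Identity: what makes the product discoverable and accountable
    product_id: str
    name: str
    owner: str                      # person or team handle, e.g. "@maria.chen"
    description: str
    domain: str
    tags: list[str] = field(default_factory=list)
    # Health: composite score in [0, 1], recomputed continuously
    health_score: float = 1.0
    # Lifecycle: enforced state machine, every product starts in DRAFT
    state: LifecycleState = LifecycleState.DRAFT


product = ProductEntity(
    product_id="dp-revenue-001",
    name="monthly_revenue",
    owner="@maria.chen",
    description="Monthly recognized revenue by customer",
    domain="finance",
    tags=["revenue", "tier-1"],
)
print(product.state.name)  # DRAFT
```

Tables and views lack exactly these fields; inheriting them is what turns a raw asset into something the platform can monitor and hold accountable.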
Health is computed continuously across six dimensions:
| Dimension | What It Measures | Example Failure |
|---|---|---|
| Freshness | Time since last successful update vs. defined SLA | Pipeline delayed 3 hours beyond 1-hour SLA |
| Completeness | Percentage of non-null values in required columns | customer_email column 23% null after source migration |
| Accuracy | Statistical validation against known invariants | Revenue column contains negative values |
| Consistency | Cross-source agreement on shared entities | Customer count differs by 12% between CRM and billing |
| Timeliness | End-to-end latency from source event to product availability | Event occurred at 10:00, product reflects it at 14:00 |
| Schema stability | Frequency and severity of schema changes | 3 breaking column renames in the past 30 days |
Each dimension produces a score between 0 and 1. The composite health score is a weighted average, with weights configurable per product. A revenue table might weight accuracy at 0.3 and freshness at 0.25. A clickstream table might weight volume (a sub-dimension of completeness) at 0.4.
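The weighted average is simple to state in code. This is a minimal sketch, with illustrative weights matching the revenue-table example above (the scores shown are hypothetical):

```python
def composite_health(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores, each in [0, 1].

    Weights are normalized over the dimensions actually supplied, so a
    product can omit a dimension without re-tuning the others.
    """
    total_weight = sum(weights.get(dim, 0.0) for dim in scores)
    if total_weight == 0:
        raise ValueError("no weights defined for supplied dimensions")
    return sum(s * weights.get(dim, 0.0) for dim, s in scores.items()) / total_weight


# Illustrative weights for a revenue product: accuracy 0.30, freshness 0.25
revenue_weights = {
    "freshness": 0.25, "completeness": 0.15, "accuracy": 0.30,
    "consistency": 0.10, "timeliness": 0.10, "schema_stability": 0.10,
}
scores = {
    "freshness": 0.12,        # pipeline stalled: freshness has collapsed
    "completeness": 1.0, "accuracy": 1.0,
    "consistency": 0.95, "timeliness": 0.40, "schema_stability": 1.0,
}
print(round(composite_health(scores, revenue_weights), 3))  # 0.715
```

Note how a single failed dimension (freshness at 0.12) drags the composite well below a typical alert threshold even while five of six dimensions look healthy.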
The critical insight: health is not a dashboard metric. It is an operational signal. When a product's health drops below its configured threshold, downstream consumers are automatically notified. Dependent ML models are flagged. BI dashboards display staleness warnings. The platform does not wait for Sarah's 2:47 PM Slack message.
Lifecycle: From Draft to Retirement
Every data product moves through a defined lifecycle. This is not bureaucracy — it is the mechanism that prevents the graveyard of tables nobody owns.
DRAFT — A new data product has been registered but is not yet ready for consumption. The owner is assigned. Schema is being stabilized. Quality checks are being configured. Consumers cannot subscribe yet.
PUBLISHED — The product has passed its quality gates and is available for consumption. Health is monitored continuously across all six dimensions. Consumers can subscribe and will receive notifications on changes. The product has an SLA.
DEPRECATED — The product has been marked for sunset. Consumers receive a deprecation notice with a timeline and migration path. New subscriptions are blocked. Existing consumers have a defined migration window (default: 90 days).
RETIRED — The product is no longer available. Data is archived according to retention policy. Compute resources are reclaimed. The product remains in the catalog for lineage history but serves no live queries.
The lifecycle is enforced by the platform, not by process documents. A product cannot move from DRAFT to PUBLISHED without passing health gate validation. A product cannot be RETIRED while it has active consumers who have not acknowledged the deprecation notice.
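Platform enforcement means the transition rules live in code, not in a wiki. A minimal sketch of such a state machine, with the two gates described above (all names are illustrative):

```python
from enum import Enum, auto


class State(Enum):
    DRAFT = auto()
    PUBLISHED = auto()
    DEPRECATED = auto()
    RETIRED = auto()


# Legal transitions: DRAFT -> PUBLISHED -> DEPRECATED -> RETIRED
ALLOWED = {
    State.DRAFT: {State.PUBLISHED},
    State.PUBLISHED: {State.DEPRECATED},
    State.DEPRECATED: {State.RETIRED},
    State.RETIRED: set(),
}


def transition(current: State, target: State, *, health_gate_passed: bool,
               unacknowledged_consumers: int) -> State:
    """Apply a lifecycle transition, rejecting anything the rules forbid."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    if current is State.DRAFT and target is State.PUBLISHED and not health_gate_passed:
        raise ValueError("cannot publish: health gate validation failed")
    if target is State.RETIRED and unacknowledged_consumers > 0:
        raise ValueError(f"cannot retire: {unacknowledged_consumers} consumers "
                         "have not acknowledged the deprecation notice")
    return target


state = transition(State.DRAFT, State.PUBLISHED,
                   health_gate_passed=True, unacknowledged_consumers=0)
print(state.name)  # PUBLISHED
```

The same function rejects a DRAFT-to-RETIRED jump outright, and blocks DEPRECATED-to-RETIRED while any subscriber has not acknowledged the sunset notice.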
Auto-Discovery: Products That Register Themselves
The most common objection to data products is the overhead: "We have 4,000 tables. We cannot manually register each one as a product."
Correct. That is why auto-discovery exists.
Agents crawl your existing infrastructure — dbt project manifests, Airflow DAG definitions, Trino catalog metadata, Spark job outputs — and register candidate products automatically. The agent extracts schema, infers freshness patterns from historical update timestamps, identifies potential owners from git blame on transformation code, and suggests quality rules based on column data types and value distributions.
The result is not a finished product. It is a DRAFT with pre-populated metadata that a human reviews and promotes. The agent does 80% of the registration work; the human does the remaining 20%: validation, ownership assignment, and SLA definition. A team of three can productize 4,000 tables in two weeks instead of six months.
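To make the crawl concrete, here is a sketch of the dbt path only, run against a deliberately simplified manifest shape. A real dbt `manifest.json` nests far more metadata, and owner inference would come from git blame on the model files rather than from the manifest itself; everything here is an assumption for illustration:

```python
import json


def discover_candidates(manifest_json: str) -> list[dict]:
    """Turn dbt models found in a manifest into DRAFT product candidates.

    Assumes a simplified manifest with a top-level "nodes" mapping where
    each node carries "resource_type" and "name". Non-model resources
    (seeds, tests, snapshots) are skipped.
    """
    manifest = json.loads(manifest_json)
    candidates = []
    for node_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue
        candidates.append({
            "product_id": node_id,
            "name": node["name"],
            "state": "DRAFT",        # a human reviews and promotes to PUBLISHED
            "owner": None,           # to be inferred from git blame, then confirmed
            "suggested_sla": None,   # to be inferred from update-timestamp history
        })
    return candidates


sample = json.dumps({"nodes": {
    "model.analytics.monthly_revenue": {"resource_type": "model",
                                        "name": "monthly_revenue"},
    "seed.analytics.country_codes": {"resource_type": "seed",
                                     "name": "country_codes"},
}})
print([c["name"] for c in discover_candidates(sample)])  # ['monthly_revenue']
```

The key design point survives the simplification: discovery produces reviewable DRAFTs with empty slots where human judgment is required, never auto-published products.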
Consumer Subscriptions: The Contract Between Producer and Consumer
When a team subscribes to a data product, they are entering a contract. The producer commits to maintaining the product's health, SLA, and schema stability. The consumer commits to using the product's published interface (not raw table access) and acknowledging deprecation notices.
Subscriptions unlock three capabilities that raw table access does not provide:
Change notifications. When the producer changes the schema, adds a column, modifies a quality rule, or adjusts the SLA, every subscriber receives a notification with the change details and potential impact. No more "I renamed a column, hope nothing breaks."
Impact analysis. Before making any change, the producer sees the blast radius: 47 consumers, including 3 ML models and 2 executive dashboards. This is the information that prevents casual breaking changes.
Health propagation. When the product's health degrades, subscribers are notified proactively. A downstream ML model can automatically switch to a fallback data source. A BI dashboard can display a staleness warning instead of confidently showing wrong numbers.
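All three capabilities rest on the same primitive: a registry mapping each product to its subscribers, with events fanned out on change. A minimal sketch, using in-process callbacks as a stand-in for real channels like Slack, email, or webhooks (names are illustrative):

```python
from collections import defaultdict
from typing import Callable

# product_id -> subscriber callbacks; in a real platform each entry would be
# a delivery channel (Slack, email, webhook) plus consumer metadata
_subscribers: dict[str, list[Callable[[str, dict], None]]] = defaultdict(list)


def subscribe(product_id: str, callback: Callable[[str, dict], None]) -> None:
    _subscribers[product_id].append(callback)


def blast_radius(product_id: str) -> int:
    """Impact analysis: how many consumers a change would touch."""
    return len(_subscribers[product_id])


def publish_event(product_id: str, event: dict) -> None:
    """Fan a schema-change or health event out to every subscriber."""
    for callback in _subscribers[product_id]:
        callback(product_id, event)


received = []
subscribe("dp-revenue-001", lambda pid, ev: received.append(ev))
print(blast_radius("dp-revenue-001"))  # 1
publish_event("dp-revenue-001",
              {"type": "health_degraded", "score": 0.12,
               "detail": "freshness SLA breached"})
print(received[0]["type"])  # health_degraded
```

`blast_radius` is the producer-facing view of the same registry: before a breaking change ships, the producer queries it instead of hoping nothing depends on the column.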
Industry Patterns
The product kernel is universal, but the health weights and quality rules are domain-specific. (In the examples below, only the dominant weights are shown; the remainder is distributed across the other dimensions.)
SaaS companies productize ARR metrics, cohort tables, and product usage events. Freshness is critical — a 2-hour delay in product usage data means the growth team is making decisions on stale signals. Health weights: freshness 0.30, completeness 0.25, accuracy 0.20.
Healthcare organizations productize patient records, clinical trial data, and claims tables. Accuracy and completeness dominate — a missing diagnosis code or an incorrect medication dosage has patient safety implications. Health weights: accuracy 0.35, completeness 0.30, consistency 0.20.
Financial institutions productize transaction data, risk metrics, and regulatory reports. Consistency and timeliness are paramount — a 15-minute delay in transaction data can mean missed fraud detection. Health weights: timeliness 0.30, consistency 0.25, accuracy 0.25.
The Transformation
Return to Sarah's Thursday afternoon. In a world where data assets are products:
The raw.stripe_invoices table is a published Data Product owned by the payments team. Its freshness SLA is 1 hour. When the Fivetran connector failed eleven days ago, the health score dropped from 0.98 to 0.12 within 65 minutes. The payments team received an alert. The 47 downstream consumers — including the ARR dashboard — received a degradation notice. The dashboard displayed a yellow banner: "Revenue data stale since March 5. Last known good: $4.2M ARR." The CEO's dashboard did not show wrong numbers. It showed honest numbers with a clear caveat.
Sarah does not receive a panicked Slack message. The system handled it.
That is the difference between data as a byproduct and data as a product. Not better pipelines. Not better monitoring. A fundamentally different relationship between data producers and data consumers — mediated by health, ownership, and lifecycle.
What Comes Next
Data Products are the foundation. But data is only one type of asset in the modern analytics stack. In Part 2 of this series, we explore ML Products — what happens when you wrap a machine learning model with the same health, ownership, and lifecycle discipline. Models that know when they are wrong, trigger their own retraining, and declare their dependencies on upstream Data Products.
This is Part 1 of the Product Intelligence Series. Next: ML Products: From Model Artifacts to Living Products That Know When They're Wrong.