AI Products: Composing Agents Like Microservices
March 2026 · 11 min read
This is Part 3 of the Product Intelligence Series — a 10-part deep dive into treating every data, ML, AI, and BI asset as a living product with health, ownership, and lifecycle management.
Three Weeks, Three Engineers, One Prompt
A VP of Customer Success walks into a meeting and says: "I need an agent that analyzes customer churn patterns, identifies the top drivers, and recommends retention strategies with estimated ROI for each."
Today, this request triggers a project. A prompt engineer spends a week designing the system prompt, few-shot examples, and tool descriptions. A data engineer spends a week building the data pipeline — connecting to the customer database, the billing system, the product analytics warehouse, the support ticket system. A platform engineer spends a week on deployment — containerizing the agent, setting up API endpoints, configuring rate limiting, connecting monitoring.
Three weeks later, the agent works. For this tenant. With this data schema. Using these specific tools. If the billing system changes its API, the agent breaks. If another team wants a similar agent for a different vertical, they start from scratch.
This is the state of enterprise AI agents in 2026: artisanal, fragile, unrepeatable. Every agent is a bespoke creation. There is no reuse. There is no marketplace. There is no quality standard. There is no way to know if the agent is working well — or if it ever was.
The Artisan Problem
The root issue is not technical complexity. It is the absence of product discipline around AI agents.
Consider the parallel with microservices circa 2012. Before container orchestration, every service was a snowflake — deployed differently, monitored differently, scaled differently. Kubernetes did not make microservices simpler. It made them repeatable. A service has a container image, a health check, a resource specification, and a deployment strategy. These abstractions let teams build, deploy, and operate thousands of services with consistent quality.
AI agents need the same treatment. Not simpler models or better prompts — but repeatable abstractions for building, evaluating, deploying, and governing agents with consistent quality.
That is what an AI Product provides.
The AI Product: An Agent with Product Discipline
An AI Product wraps a MarketplaceAgent — an agent configuration that includes its prompt, tools, guardrails, and evaluation criteria — with the same product kernel introduced in Part 1: identity, health, lifecycle, ownership, and consumer contracts.
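To make the shape of this wrapping concrete, here is a minimal sketch in Python. All names (`MarketplaceAgent`, `AIProduct`, the field names) are illustrative assumptions, not a real platform API; the point is that the agent configuration and the product kernel are distinct layers.

```python
from dataclasses import dataclass, field
from enum import Enum

class Lifecycle(Enum):
    # Lifecycle stages from the product kernel
    DRAFT = "draft"
    PUBLISHED = "published"
    DEPRECATED = "deprecated"
    RETIRED = "retired"

@dataclass
class MarketplaceAgent:
    """The agent configuration: prompt, tools, guardrails, eval criteria."""
    prompt: str
    tools: list[str]
    guardrails: list[str]
    eval_suite: str

@dataclass
class AIProduct:
    """The product kernel wrapped around the agent configuration."""
    id: str                           # identity
    owner: str                        # ownership
    agent: MarketplaceAgent           # the wrapped agent
    lifecycle: Lifecycle = Lifecycle.DRAFT
    consumers: list[str] = field(default_factory=list)   # consumer contracts
    health: dict[str, float] = field(default_factory=dict)  # health metrics
```

A new product starts in `DRAFT` and cannot move to `PUBLISHED` until it passes the evaluation gate described later in this article.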
Six Dimensions of AI Health
| Dimension | What It Measures | Threshold Example |
|---|---|---|
| Evaluation score | Automated eval against a curated test suite (groundedness, relevance, completeness) | Composite score must exceed 0.75 (default threshold) |
| Hallucination rate | Percentage of responses containing claims not supported by retrieved context | Must remain below 10% on rolling 24-hour window |
| Latency | End-to-end response time including all tool calls and reasoning steps | P95 under 8 seconds for interactive use cases |
| Tool success rate | Percentage of tool invocations that return valid results | Must exceed 95% — failed tools indicate broken integrations |
| Guardrail pass rate | Percentage of responses that pass all configured safety and governance checks | Must exceed 90% — violations indicate prompt injection or scope creep |
| User satisfaction | Aggregated thumbs-up/thumbs-down feedback from end users | Must exceed 70% positive over rolling 7-day window |
The critical dimension is hallucination rate. Unlike Data Products (where health means the data exists and is fresh) or ML Products (where health means the model's statistical properties are stable), AI Products face a uniquely dangerous failure mode: they can produce responses that are fluent, confident, and completely fabricated. A hallucinating agent does not crash. It does not return an error. It returns a well-structured answer that happens to be wrong. The only way to detect this is continuous evaluation that cross-checks each response against the context the agent actually retrieved.
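The six dimensions and their example thresholds reduce to a simple check: each metric is compared against a floor or a ceiling, and any miss marks the product unhealthy. This is a minimal sketch; the metric names and threshold table are taken from the dimensions above, but the function itself is illustrative.

```python
# (threshold, kind): "min" means the metric must stay at or above the
# threshold, "max" means it must stay at or below it.
THRESHOLDS = {
    "eval_score":         (0.75, "min"),
    "hallucination_rate": (0.10, "max"),
    "latency_p95_s":      (8.0,  "max"),
    "tool_success_rate":  (0.95, "min"),
    "guardrail_pass_rate":(0.90, "min"),
    "user_satisfaction":  (0.70, "min"),
}

def failing_dimensions(metrics: dict[str, float]) -> list[str]:
    """Return the health dimensions that violate their thresholds."""
    failures = []
    for name, (limit, kind) in THRESHOLDS.items():
        value = metrics[name]
        ok = value >= limit if kind == "min" else value <= limit
        if not ok:
            failures.append(name)
    return failures
```

A product is healthy only when this list is empty; a fluent but hallucinating agent shows up here as a single failing dimension even while every other metric looks excellent.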
Natural Language Composition: Building Agents in Sentences
The three-week, three-engineer process for building the churn analysis agent becomes a single interaction:
"Build me an agent that analyzes customer churn patterns using our billing and product analytics data, identifies the top 5 drivers, and recommends retention strategies with estimated ROI."
The platform processes this request through four stages:
1. Capability extraction. The natural language description is parsed into a structured capability specification: data sources needed (billing, product analytics), analysis type (churn pattern analysis, driver ranking), output format (recommendations with ROI estimates), domain (customer success).
2. Template matching. The platform's agent template library is searched for templates that match the capability specification. A "customer analytics" template provides the base prompt structure, recommended tool set, and evaluation criteria. Templates encode best practices — they know that churn analysis requires cohort segmentation, that ROI estimates need confidence intervals, that recommendations should be actionable and specific.
3. Tool configuration. Based on the data sources identified in step 1, the platform configures the agent's tool set: a SQL query tool connected to the billing database, a product analytics API tool, a statistical analysis tool for driver identification, a calculation tool for ROI estimation. Each tool comes with its own guardrails — the SQL tool cannot execute DDL, the analytics API tool respects row-level security.
4. Guardrail application. The platform applies the tenant's governance policies: PII redaction rules, data access controls, output format requirements, cost limits per interaction. Industry-specific guardrails are layered on top — healthcare tenants get HIPAA compliance checks, financial tenants get fair lending constraints.
The output is a preview — the agent configuration, its tool set, its guardrails, and a set of sample interactions showing how it would respond to typical questions. The VP reviews, adjusts if needed, and publishes. Total time: minutes, not weeks.
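The four stages can be sketched as a pipeline. Everything here is a deliberately naive stand-in: a real platform would use an LLM to parse the request in stage 1, and the template library, tool naming, and policy format are all hypothetical.

```python
def extract_capabilities(request: str) -> dict:
    """Stage 1: capability extraction. Keyword matching stands in for
    an LLM-based parser."""
    sources = [s for s in ("billing", "product analytics") if s in request]
    return {"sources": sources, "domain": "customer success"}

# Stage 2's template library, reduced to a single illustrative entry.
TEMPLATES = {
    "customer success": {
        "base_prompt": "You are a customer analytics agent...",
        "eval_suite": "customer-analytics-v1",
    },
}

def compose_agent(request: str, tenant_policies: list[str]) -> dict:
    spec = extract_capabilities(request)            # 1. capability extraction
    template = TEMPLATES[spec["domain"]]            # 2. template matching
    tools = [f"sql:{s}" for s in spec["sources"]]   # 3. tool configuration
    return {                                        # 4. guardrail application
        **template,
        "tools": tools,
        "guardrails": list(tenant_policies),
    }
```

The returned configuration is the preview the VP reviews before publishing; nothing is deployed until it also clears the evaluation gate.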
Evaluation-Gated Deployment
An AI Product cannot be published without passing its evaluation gate. This is the mechanism that prevents "it looks good in a demo" from becoming "it's in production serving executives."
The evaluation gate is a three-condition check:
Condition 1: Evaluation score ≥ threshold. The agent runs against a curated test suite — a set of questions with known-good answers, covering the agent's intended capabilities. Automated evaluators score each response for groundedness (is it supported by data?), relevance (does it answer the question?), and completeness (does it cover all aspects?). The composite score must exceed the configured threshold (default: 0.75).
Condition 2: Hallucination rate ≤ 0.10. Across the test suite, no more than 10% of responses may contain unsupported claims. This is measured by a dedicated hallucination detector that cross-references every factual claim in the agent's response against the data it retrieved.
Condition 3: Guardrail pass rate ≥ 0.90. Across the test suite, at least 90% of responses must pass all configured guardrails — no PII leakage, no unauthorized data access, no policy violations. The 10% tolerance accounts for edge cases in guardrail rules, not for genuine safety failures.
All three conditions must be met simultaneously. An agent that scores 0.92 on evaluation but has a 15% hallucination rate cannot be published. An agent with 0% hallucination but a 0.70 evaluation score cannot be published. The gate is conjunctive, not disjunctive.
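The conjunctive gate is short enough to state directly in code. The thresholds come from the three conditions above; the function name is an assumption.

```python
def passes_gate(eval_score: float,
                hallucination_rate: float,
                guardrail_pass_rate: float,
                eval_threshold: float = 0.75) -> bool:
    """Evaluation gate: all three conditions must hold simultaneously."""
    return (eval_score >= eval_threshold            # Condition 1
            and hallucination_rate <= 0.10          # Condition 2
            and guardrail_pass_rate >= 0.90)        # Condition 3
```

Note that there is no weighting or averaging across conditions: a brilliant eval score cannot buy back a high hallucination rate, which is exactly what "conjunctive, not disjunctive" means.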
Cross-Tenant Sharing: The Agent Marketplace
Once an AI Product is published and proven within one tenant, it can be shared across tenants through the agent marketplace. This transforms the three-week build into a five-minute install.
The marketplace operates on three principles:
Verified publishers. Not every tenant can publish to the marketplace. A verified publisher program ensures that shared agents meet quality, security, and documentation standards. Publishers commit to maintaining their agents — updating for API changes, responding to consumer feedback, and honoring deprecation timelines.
Approval workflow. When a tenant installs a marketplace agent, it goes through an approval process: the tenant admin reviews the agent's tool requirements, data access patterns, and governance policies. The agent is deployed with the installing tenant's data connections and guardrails, not the publisher's. Cross-tenant sharing means sharing the configuration, not the data.
Usage metering. Every marketplace agent tracks its usage — invocations, token consumption, tool calls, compute time. This enables chargeback models (the installing tenant pays for compute) and quality feedback (the publisher sees aggregated satisfaction scores across all tenants).
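A metering record per installing tenant is enough to support both chargeback and publisher-side quality feedback. This is a minimal in-memory sketch; a real implementation would persist these counters and add compute time, and the class and field names are illustrative.

```python
from collections import defaultdict

class UsageMeter:
    """Accumulates per-tenant usage for a single marketplace agent."""

    def __init__(self):
        self.totals = defaultdict(
            lambda: {"invocations": 0, "tokens": 0, "tool_calls": 0}
        )

    def record(self, tenant: str, tokens: int, tool_calls: int) -> None:
        t = self.totals[tenant]
        t["invocations"] += 1
        t["tokens"] += tokens
        t["tool_calls"] += tool_calls
```

The installing tenant is billed from its own row; the publisher only ever sees aggregates across rows, never another tenant's data.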
The marketplace is not an app store. It is more like npm for agents — a registry of reusable, composable, versioned agent configurations that teams can browse, fork, customize, and publish back.
Hand-Crafted Agent vs. AI Product
| Dimension | Hand-Crafted Agent | AI Product |
|---|---|---|
| Build time | 2-4 weeks with a cross-functional team | Minutes with natural language composition |
| Quality assurance | "It worked in the demo" | Evaluation gate: score ≥ 0.75, hallucination ≤ 10%, guardrails ≥ 90% |
| Deployment | Manual containerization and endpoint setup | Automated deployment with health monitoring from day one |
| Monitoring | Log files, maybe | 6-dimension health score, continuous evaluation, automated alerts |
| Versioning | Git commits on prompt files | Semantic versioning with changelog, consumer notification, rollback |
| Reusability | Copy-paste and modify | Fork from marketplace, customize tools and guardrails, publish variant |
| Governance | "The prompt says 'don't share PII'" | Automated guardrails: PII redaction, row-level security, cost limits, audit trail |
| Dependencies | Implicit — the agent "uses" some data sources | Explicit — declares Data and ML Products consumed, health propagates |
| Lifecycle | "Production" until someone deletes it | DRAFT → PUBLISHED → DEPRECATED → RETIRED with consumer contracts |
| Cost visibility | Unknown until the bill arrives | Per-interaction metering: tokens, tool calls, compute, with budget alerts |
The Composition Pattern
The real power of AI Products emerges when they compose. A complex analytics workflow is not a single monolithic agent — it is a composition of specialized AI Products:
The data exploration agent retrieves schema information and sample data. The statistical analysis agent runs distribution tests and identifies patterns. The visualization agent generates charts and dashboards. The narrative agent synthesizes findings into business-readable insights.
Each of these is an independent AI Product with its own health monitoring, evaluation gate, and consumer contract. They communicate through a structured protocol — the output of one becomes the input of the next. If the statistical analysis agent's health degrades (evaluation score drops, hallucination rate increases), the composed workflow automatically degrades gracefully — surfacing raw data instead of potentially wrong analysis.
This is the microservices pattern applied to AI: small, focused, independently deployable, independently monitorable units of intelligence that compose into complex capabilities.
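The health-gated handoff between composed agents can be sketched as a small pipeline runner. This is an illustrative simplification, assuming each stage exposes a health score and that "degrading gracefully" means passing the stage's input through untouched rather than emitting a possibly wrong analysis.

```python
from typing import Any, Callable

Stage = tuple[str, Callable[[Any], Any]]

def run_pipeline(stages: list[Stage],
                 payload: Any,
                 health: dict[str, float],
                 min_health: float = 0.75) -> Any:
    """Run composed agents in order, skipping any unhealthy stage."""
    for name, fn in stages:
        if health.get(name, 0.0) >= min_health:
            payload = fn(payload)
        # Unhealthy stage: surface the raw upstream output downstream
        # instead of a potentially wrong transformation.
    return payload
```

With a degraded statistical analysis agent, for example, the workflow still delivers the explored raw data to the narrative stage rather than failing outright or shipping fabricated findings.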
What Comes Next
AI Products bring agent discipline to the analytics stack. But agents do not exist to serve other agents — they exist to serve business users. And the primary interface between business users and data is still the dashboard. In Part 4, we explore BI Products — dashboards that know when they are wrong, metrics that mean the same thing everywhere, and the death of the 500-dashboard graveyard.
This is Part 3 of the Product Intelligence Series. Previous: ML Products: From Model Artifacts to Living Products. Next: BI Products: Dashboards That Know When They're Wrong.