
Proactive Intelligence: Detect Before They Ask

March 2026 · 10 min read


5:47 AM: The Incident Nobody Had to Handle

It is 6 AM. Before any human logs in, the platform has already done the following:

At 5:12 AM, the health monitor detected that the customer events pipeline had not delivered new data in 97 minutes -- breaching the 60-minute freshness SLO. Within 30 seconds, the system traced the pipeline's dependency graph and identified the blast radius: 4 downstream ML models consuming the customer events feature, 2 executive dashboards refreshed from those models, and 1 real-time recommendation engine.

By 5:14 AM, the pattern library matched the failure signature -- source connector timeout on the Shopify integration -- with 91% confidence. This pattern had occurred 43 times before, and restarting the connector resolved it 39 of those times.

At 5:15 AM, the auto-remediation policy approved the restart. The connector restarted. Data began flowing again at 5:19 AM. The pipeline caught up on the backlog by 5:41 AM. All downstream models received fresh data by 5:47 AM.

When the VP of Data opens their laptop at 8 AM, they find a single notification: "Freshness SLO violation detected and resolved at 5:47 AM. Root cause: Shopify connector timeout (auto-restarted). No impact to today's reports. Full decision trace available."

No pages. No war rooms. No four-engineer investigation. The platform detected, diagnosed, remediated, and reported -- all before the first cup of coffee.


The Reactive Trap

Most data teams operate in reactive mode. An alert fires. An engineer investigates. They dig through logs, check dashboards, query metadata, correlate timestamps, and eventually identify the root cause. They apply a fix. They write a postmortem. They go back to what they were doing before the interruption.

This model has three fatal problems:

First, alerts arrive after damage is done. By the time a freshness SLO violation triggers, stale data has already been served to models, dashboards, and decision-makers. The alert tells you the barn door is open. The horse left 90 minutes ago.

Second, investigation is expensive. Every incident requires a skilled engineer to context-switch, load the relevant systems into working memory, and trace through a complex dependency graph. Even experienced engineers take 30 to 60 minutes for routine incidents. Novel incidents can consume entire days.

Third, the same incidents keep recurring. Without systematic learning from past resolutions, every occurrence is treated as a fresh investigation. The connector timeout that has happened 43 times gets investigated 43 times.

Proactive intelligence inverts this model. Instead of alert-then-investigate-then-fix, it continuously monitors, predicts, and prevents.


Tiered Health Monitoring

[Diagram: Proactive Intelligence, 3-tier detection — "Detect, analyze, and heal before anyone asks." Tier 1, Continuous Monitors (detection in seconds): health checks every 60s across 6 dimensions per product; drift detection every 5 min via statistical tests on distributions; real-time SLA monitoring of freshness, latency, and availability. Tier 2, Intelligent Alerting (response in minutes): ML-based anomaly scoring reduces false positives by 85%; graph-traversal impact analysis answers who is affected downstream; context-aware priority routing reaches the right person at the right urgency. Tier 3, Self-Healing (resolution automated): auto-rollback to the last healthy version when drift exceeds threshold; pipeline retry with exponential backoff on transient failure; persona briefings so executives see only what matters. Reactive: a user reports "numbers look wrong" three days later. Proactive: auto-healed 47 minutes after root cause.]

Not all data assets are equally important. Monitoring everything at the highest frequency is wasteful and generates alert fatigue. Monitoring everything at the lowest frequency means critical failures go undetected for hours.

The platform implements tiered monitoring based on asset criticality:

| Tier | Check Frequency | Asset Examples | Rationale |
|------|-----------------|----------------|-----------|
| CRITICAL | Every 5 minutes | Revenue pipeline, customer-facing APIs, fraud detection model | Downtime measured in minutes costs real money |
| HIGH | Every 15 minutes | Core feature pipelines, training data feeds, executive dashboards | SLA breaches affect business decisions |
| MEDIUM | Every 1 hour | Development pipelines, staging environments, internal reports | Important but not time-sensitive |
| LOW | Every 24 hours | Archive pipelines, historical backfills, documentation builds | Failures can wait until business hours |

Criticality is not assigned manually. The platform infers it from the context graph: assets consumed by customer-facing services or executive dashboards are automatically classified as CRITICAL. Assets with no downstream consumers are classified as LOW. Teams can override these classifications, but the defaults are informed by actual usage patterns rather than someone's guess about what matters.


The SLO Violation Pipeline

When a health check detects a violation, the platform does not just fire an alert. It runs a structured investigation pipeline:

Step 1: Detection. The monitoring agent identifies that a metric has breached its SLO threshold. This is the easy part -- every monitoring tool can do this.

Step 2: Root cause analysis via graph traversal. The platform walks the context graph upstream from the failing asset. What sources feed this pipeline? What transformations touch this data? What connectors are involved? Are any of them in a degraded state? This is the step that takes a human engineer 30 minutes and takes the platform 3 seconds, because the dependency graph is already indexed and queryable.

Step 3: Impact analysis via downstream traversal. The platform walks the graph downstream from the failing asset. What models consume this data? What dashboards render those models' outputs? What business processes depend on those dashboards? The result is a complete blast radius assessment: "This freshness violation affects 4 models, 2 dashboards, and 1 recommendation engine."

Step 4: Pattern matching. The failure signature -- the combination of asset type, failure mode, upstream state, and environmental context -- is compared against the pattern library built from decision traces. If a high-confidence match exists, the platform knows what worked before.

Step 5: Remediation or escalation. If the matched pattern has a 90%+ success rate and the remediation action falls within the auto-approval policy, the platform executes the fix. If not, it surfaces the pattern, the evidence, and a recommended action to the on-call engineer -- who now starts their investigation at step 4 instead of step 1.
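The five steps above can be sketched as a single handler. The `Node` and `Pattern` structures are stand-ins (a real graph needs transitive traversal; this uses one hop for brevity), and `auto_approved` plays the role of the auto-approval policy:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str = "pipeline"
    degraded: bool = False
    failure_mode: str = ""
    upstream: list = field(default_factory=list)
    downstream: list = field(default_factory=list)

@dataclass
class Pattern:
    action: str
    success_rate: float

def handle_violation(asset, patterns, auto_approved, threshold=0.90):
    """Steps 2-5 for a breached asset (step 1, detection, already happened)."""
    # Step 2: root cause -- walk upstream, collect degraded dependencies.
    suspects = [a for a in asset.upstream if a.degraded]
    # Step 3: impact -- walk downstream for the blast radius.
    blast_radius = [d.name for d in asset.downstream]
    # Step 4: match the failure signature against past decision traces.
    signature = (asset.kind, asset.failure_mode, tuple(s.name for s in suspects))
    match = patterns.get(signature)
    # Step 5: auto-remediate only with a 90%+ success rate AND policy approval.
    if match and match.success_rate >= threshold and match.action in auto_approved:
        return {"status": "auto_remediated", "action": match.action,
                "impact": blast_radius}
    return {"status": "escalated",
            "suspects": [s.name for s in suspects],
            "impact": blast_radius,
            "suggestion": match.action if match else None}
```

Note that the escalation path still carries the suspects, blast radius, and suggested action, which is what lets the on-call engineer start at step 4 instead of step 1.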


Beyond SLOs: Anomaly Detection on Value Distributions

SLO monitoring catches binary failures -- is the data fresh or stale? Is the pipeline running or broken? But some of the most damaging data quality issues are not binary. They are statistical.

The platform continuously profiles value distributions for monitored columns and detects drift using the Population Stability Index (PSI):

  • PSI below 0.1: No significant drift. Business as usual.
  • PSI between 0.1 and 0.2: Minor drift. Log it, but no alert.
  • PSI between 0.2 and 0.5: Medium drift. Alert the data owner. Something has changed -- a new data source, a schema migration, a business process change -- and the downstream consumers need to know.
  • PSI above 0.5: High drift. The distribution has fundamentally changed. Halt dependent pipelines and models until a human confirms the change is intentional.
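A minimal PSI implementation against the thresholds above. The equal-width bucketing from the baseline sample and the epsilon floor for empty buckets are common implementation choices, not something the platform specifies:

```python
import math

def psi(expected, actual, buckets=10):
    """Population Stability Index between a baseline sample and a new sample.
    Buckets are equal-width, derived from the baseline's min/max."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / buckets for i in range(1, buckets)]

    def shares(values):
        counts = [0] * buckets
        for v in values:
            counts[sum(v > e for e in edges)] += 1   # bucket by threshold count
        # Epsilon floor avoids log(0) / division by zero for empty buckets.
        return [max(c / len(values), 1e-4) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))

def drift_action(score):
    """Map a PSI score onto the four-band policy described above."""
    if score < 0.1:
        return "none"
    if score < 0.2:
        return "log"
    if score <= 0.5:
        return "alert_owner"
    return "halt_pipelines"
```
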

This catches the insidious failures that binary monitoring misses entirely. A column called order_amount that previously ranged from $10 to $500 now has values exceeding $50,000. The pipeline is running. The data is fresh. The schema is unchanged. But the business meaning of the data has shifted -- perhaps a currency conversion was removed, or a new B2B channel with larger order sizes was integrated. Without distribution monitoring, a revenue forecasting model trains on this shifted data and produces wildly inaccurate projections.


Data Debt Scanner

Not all data problems are urgent. Some are slow-burning liabilities that compound over time -- data debt.

The platform runs a background scanner that identifies data debt across four dimensions:

  • Unused assets: Data products that have not been queried in 90 days. These consume storage, complicate governance, and create a false sense of coverage. After 90 days of inactivity, assets are flagged for review. After 180 days, they are candidates for archival.
  • Undocumented assets: Tables and columns with no descriptions, no data owners, and no classification tags. These are governance blind spots -- the guardrails cannot protect data that is not classified.
  • Orphaned pipelines: Pipelines that produce data consumed by no downstream asset. They run, they consume compute, and their output goes nowhere.
  • Schema divergence: Tables that have drifted from their declared schema contracts. A column documented as NOT NULL that now contains 15% nulls. A column documented as INTEGER that now contains string values due to an upstream change.

Data debt is not an emergency. But left unaddressed, it compounds. The platform surfaces it through a debt score -- a single metric that tracks the overall health of the data estate and trends over time.
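The scoring model behind the debt score is not spelled out; one plausible sketch is a weighted sum over findings per asset. All field names and weights here are assumptions for illustration:

```python
from datetime import datetime, timedelta

# Illustrative weights per debt dimension -- not the platform's real model.
WEIGHTS = {"unused": 1.0, "undocumented": 2.0,
           "orphaned": 3.0, "schema_divergence": 5.0}

def scan_asset(asset, now):
    """Return debt findings for one asset (dict field names are assumptions)."""
    findings = []
    if asset["last_queried"] < now - timedelta(days=90):
        findings.append("unused")                 # 90-day inactivity flag
    if not asset.get("description") or not asset.get("owner"):
        findings.append("undocumented")           # governance blind spot
    if not asset["consumers"]:
        findings.append("orphaned")               # output goes nowhere
    if asset.get("contract_violations", 0) > 0:
        findings.append("schema_divergence")      # drifted from declared schema
    return findings

def debt_score(assets, now=None):
    """Aggregate weighted findings into one estate-level debt score."""
    now = now or datetime.now()
    return sum(WEIGHTS[f] for a in assets for f in scan_asset(a, now))
```

Tracking this number over time is what turns slow-burning debt into a visible trend rather than a quarterly-audit surprise.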


Persona-Specific Briefings

Different stakeholders need different views of the same intelligence. The platform generates role-aware briefings that present the right level of detail to the right audience:

Leadership briefing: "3 incidents detected overnight. All auto-resolved. Revenue pipeline uptime: 99.97%. Data debt score improved 4 points this week. One action item: the marketing attribution model shows medium drift (PSI 0.31) -- the data team is investigating."

Data Engineering briefing: "Shopify connector timed out at 5:12 AM (auto-restarted, pattern #47). The orders pipeline took 23% longer than usual -- likely due to the 40% volume increase from the weekend sale. No failed jobs. 2 orphaned pipelines flagged for review."

MLOps briefing: "All models healthy. Customer churn model AUC stable at 0.87. Revenue forecast model shows input feature drift on avg_order_value (PSI 0.28) -- recommend monitoring for 48 hours before triggering retrain. Fraud model inference latency P99 at 42ms (within 50ms SLA)."

Same data. Same incidents. Same platform. Three different perspectives, each tailored to what that stakeholder actually needs to know and act on.
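One way to sketch that role-aware rendering: a single incident record with per-role renderers that select different fields. Everything here (field names, wording, roles) is illustrative:

```python
# One incident record rendered three ways; fields and phrasing are made up.
incident = {
    "asset": "shopify_connector", "detected": "5:12 AM", "resolved": "5:47 AM",
    "pattern_id": 47, "action": "auto-restarted", "business_impact": "none",
    "model_impact": "all models received fresh data by 5:47 AM",
}

RENDERERS = {
    "leadership": lambda i: (
        f"1 incident overnight, auto-resolved. "
        f"Business impact: {i['business_impact']}."),
    "engineering": lambda i: (
        f"{i['asset']} failed at {i['detected']}, "
        f"{i['action']} (pattern #{i['pattern_id']})."),
    "mlops": lambda i: (
        f"Upstream incident resolved by {i['resolved']}: {i['model_impact']}."),
}

def briefing(role, incidents):
    """Render the same incident stream at a role-appropriate level of detail."""
    return [RENDERERS[role](i) for i in incidents]
```
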


The Self-Healing Loop

Proactive intelligence is not a one-shot detection system. It is a continuous loop:

  1. Monitor -- Health checks run at tier-appropriate frequencies
  2. Detect -- SLO violations and statistical anomalies are identified
  3. Analyze -- Graph traversal determines root cause and blast radius
  4. Match -- Pattern library surfaces past resolutions
  5. Remediate -- Auto-execute or escalate based on confidence and policy
  6. Learn -- The outcome is recorded as a decision trace, updating the pattern library
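Step 6 is what closes the loop. A minimal sketch of how a decision-trace outcome might update the pattern library (the library structure is an assumption; the 43/39 history replays the connector example from the introduction):

```python
def record_outcome(patterns, signature, action, succeeded):
    """Step 6 ('Learn'): fold one decision-trace outcome into the library."""
    p = patterns.setdefault(
        signature, {"action": action, "attempts": 0, "successes": 0})
    p["attempts"] += 1
    p["successes"] += int(succeeded)
    p["success_rate"] = p["successes"] / p["attempts"]
    return p

# Replay the intro's history: 43 connector timeouts, 39 fixed by a restart.
patterns = {}
sig = ("connector", "timeout", "shopify")
for ok in [True] * 39 + [False] * 4:
    record_outcome(patterns, sig, "restart_connector", ok)
```

After this replay the pattern's success rate is 39/43, just above the 90% bar the auto-approval policy uses.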

Each iteration makes the next one faster. The first time a connector times out, the investigation takes an hour. The tenth time, it takes 7 minutes (auto-remediated). The hundredth time, the platform has already learned that this connector tends to time out during high-traffic periods and pre-emptively raises its timeout threshold before the weekend.


Reactive vs. Proactive: The Numbers

| Incident Type | Reactive Response | Proactive Response |
|---------------|-------------------|--------------------|
| Pipeline freshness breach | Alert at T+60m, investigate 30m, fix 15m = 105m total | Detect at T+5m, pattern match 3s, auto-fix 4m = 9m total |
| Schema drift | User reports wrong dashboard data, investigate 2h | Detect at ingestion, halt pipeline, alert owner in 2m |
| Model input drift | Model accuracy degrades over days, noticed in weekly review | PSI threshold breached, alert in 1h, retrain triggered |
| Connector failure | On-call page, engineer wakes up, investigates, restarts = 45m | Auto-restart in 4m, engineer gets morning summary |
| Data debt accumulation | Quarterly audit reveals 200 orphaned tables | Continuous scanner flags orphans weekly, debt score trends visible |

The difference is not incremental. Proactive intelligence does not make incident response 20% faster. It eliminates entire categories of incidents that never need human intervention at all.


Previously in this series, we explored Governance & Guardrails -- the trust framework that keeps agents operating safely. Proactive intelligence is what happens when that trusted framework can act autonomously. Next, we examine Natural Language Data Engineering -- how conversational interfaces transform data engineering from a specialized skill into a conversation anyone can have.


MATIH is building the unified data and AI platform where the platform itself is the first responder -- detecting, diagnosing, and resolving issues before they impact the business. Learn more about our architecture or try the platform.