Model Debugging
The Model Debugging module provides tools for identifying, diagnosing, and resolving model performance issues. It supports error analysis across data slices, counterfactual explanations, prediction-confidence analysis, and feature-sensitivity testing, helping ML engineers understand and fix model failures.
Debugging Capabilities
| Capability | Description | Use Case |
|---|---|---|
| Error Analysis | Identify data slices with high error rates | Find underperforming segments |
| Counterfactual Explanations | Minimal feature changes to flip prediction | Understand decision boundaries |
| Confidence Analysis | Distribution of prediction confidence scores | Detect uncertain predictions |
| Feature Sensitivity | Impact of small perturbations on predictions | Test model robustness |
| Slice Performance | Metrics across user-defined data slices | Evaluate fairness and coverage |
| Adversarial Testing | Test resilience to adversarial inputs | Security and robustness |
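Each capability is exercised through the REST endpoints described in the sections below. As an illustration, the payload for the main diagnostics endpoint can be assembled client-side; the helper name here is hypothetical, not part of the API:

```python
import json

# Illustrative helper (the function name is hypothetical): builds the JSON
# payload for POST /api/v1/governance/debug as documented below.
def build_debug_payload(model_id, query, target_column, diagnostics, slicing_columns):
    return {
        "model_id": model_id,
        "dataset": {"source": "sql", "query": query},
        "target_column": target_column,
        "diagnostics": diagnostics,
        "slicing_columns": slicing_columns,
    }

payload = build_debug_payload(
    "model-xyz789",
    "SELECT * FROM ml_features.customer_churn",
    "churned",
    ["error_analysis", "confidence_analysis"],
    ["contract_type", "region"],
)
# Send as the POST body with Content-Type: application/json.
body = json.dumps(payload)
```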
Run Diagnostics
POST /api/v1/governance/debug

```json
{
  "model_id": "model-xyz789",
  "dataset": {
    "source": "sql",
    "query": "SELECT * FROM ml_features.customer_churn"
  },
  "target_column": "churned",
  "diagnostics": ["error_analysis", "confidence_analysis", "slice_performance"],
  "slicing_columns": ["contract_type", "tenure_bucket", "region"]
}
```

Response
```json
{
  "model_id": "model-xyz789",
  "diagnostics": {
    "error_analysis": {
      "overall_error_rate": 0.06,
      "worst_slices": [
        {
          "slice": "contract_type = 'two-year' AND tenure_bucket = '0-6mo'",
          "error_rate": 0.23,
          "sample_count": 45,
          "dominant_error": "false_negative"
        }
      ]
    },
    "confidence_analysis": {
      "mean_confidence": 0.84,
      "low_confidence_count": 320,
      "low_confidence_threshold": 0.6,
      "confidence_distribution": {
        "0.0-0.2": 15,
        "0.2-0.4": 48,
        "0.4-0.6": 257,
        "0.6-0.8": 1230,
        "0.8-1.0": 3450
      }
    }
  }
}
```

Counterfactual Explanations
Generate minimal changes that would flip a prediction:
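Programmatically, such a query can be issued with any HTTP client. A minimal sketch using Python's standard library (the base URL is a deployment-specific assumption; the payload fields follow the example request in this section):

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080"  # hypothetical deployment address

payload = {
    "model_id": "model-xyz789",
    "instance": {
        "tenure": 3,
        "monthly_charges": 89.50,
        "contract_type": "month-to-month",
    },
    "desired_outcome": 0,  # the class the counterfactual should reach
    "max_changes": 2,      # cap on how many features may be altered
}
req = urllib.request.Request(
    f"{BASE_URL}/api/v1/governance/debug/counterfactual",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# Uncomment to send the request against a live deployment:
# with urllib.request.urlopen(req) as resp:
#     counterfactuals = json.load(resp)["counterfactuals"]
```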
POST /api/v1/governance/debug/counterfactual

```json
{
  "model_id": "model-xyz789",
  "instance": {
    "tenure": 3,
    "monthly_charges": 89.50,
    "contract_type": "month-to-month"
  },
  "desired_outcome": 0,
  "max_changes": 2
}
```

Response
```json
{
  "original_prediction": 1,
  "original_probability": 0.87,
  "counterfactuals": [
    {
      "changes": {"contract_type": "one-year"},
      "new_prediction": 0,
      "new_probability": 0.38,
      "explanation": "Switching to a one-year contract reduces churn probability from 87% to 38%"
    },
    {
      "changes": {"tenure": 12, "contract_type": "one-year"},
      "new_prediction": 0,
      "new_probability": 0.15,
      "explanation": "Increasing tenure to 12 months and switching to yearly contract reduces churn to 15%"
    }
  ]
}
```

Slice Performance
Compute model metrics across data slices to identify underperforming segments:
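Once slice metrics come back, underperforming segments can be flagged by comparing each slice against the overall metric. A sketch that works over the response shape shown in this section (the 0.05 margin and 30-sample floor are illustrative choices):

```python
# Flag slices whose metric falls a margin below the overall value,
# skipping slices that are too small to measure reliably.
def underperforming_slices(slices, overall, metric="f1_score",
                           margin=0.05, min_samples=30):
    flagged = []
    for s in slices:
        value = s["metrics"].get(metric)
        if value is None or s["sample_count"] < min_samples:
            continue  # unmeasured or too-small slice
        if value < overall - margin:
            flagged.append((s["slice"], value))
    return sorted(flagged, key=lambda item: item[1])  # worst first

slices = [
    {"slice": "contract_type = 'month-to-month'", "sample_count": 3000,
     "metrics": {"accuracy": 0.91, "f1_score": 0.88}},
    {"slice": "contract_type = 'two-year'", "sample_count": 1200,
     "metrics": {"accuracy": 0.97, "f1_score": 0.72}},
]
print(underperforming_slices(slices, overall=0.85))
# one slice flagged: two-year contracts, whose F1 of 0.72 trails the overall 0.85
```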
POST /api/v1/governance/debug/slices

```json
{
  "model_id": "model-xyz789",
  "dataset": {
    "source": "sql",
    "query": "SELECT * FROM ml_features.customer_churn"
  },
  "slicing_columns": ["contract_type", "region"],
  "metrics": ["accuracy", "f1_score", "precision", "recall"]
}
```

Response
```json
{
  "slices": [
    {
      "slice": "contract_type = 'month-to-month'",
      "sample_count": 3000,
      "metrics": {"accuracy": 0.91, "f1_score": 0.88}
    },
    {
      "slice": "contract_type = 'two-year'",
      "sample_count": 1200,
      "metrics": {"accuracy": 0.97, "f1_score": 0.72}
    }
  ]
}
```

Feature Sensitivity
Test how small perturbations to each feature affect predictions:
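Conceptually, the sweep nudges each numeric feature up and down by the perturbation range and records the worst-case prediction shift. The service's exact algorithm is not documented here; this sketch illustrates the idea with a toy scoring function standing in for the deployed model:

```python
# Sketch of a perturbation sweep: vary each numeric feature by
# ±perturbation_range and report the largest resulting prediction shift.
def feature_sensitivity(predict, instance, perturbation_range=0.1):
    base = predict(instance)
    deltas = {}
    for name, value in instance.items():
        if not isinstance(value, (int, float)):
            continue  # only numeric features are perturbed in this sketch
        shifts = []
        for direction in (-1, 1):
            perturbed = dict(instance)
            perturbed[name] = value * (1 + direction * perturbation_range)
            shifts.append(abs(predict(perturbed) - base))
        deltas[name] = max(shifts)  # worst-case shift for this feature
    return deltas

# Toy scoring function (illustrative only): churn risk rises with
# monthly charges and falls with tenure, clamped to [0, 1].
def predict(x):
    score = 0.5 + 0.004 * x["monthly_charges"] - 0.01 * x["tenure"]
    return max(0.0, min(1.0, score))

print(feature_sensitivity(predict, {"tenure": 12, "monthly_charges": 65.00}))
```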
POST /api/v1/governance/debug/sensitivity

```json
{
  "model_id": "model-xyz789",
  "instance": {"tenure": 12, "monthly_charges": 65.00},
  "perturbation_range": 0.1
}
```

Configuration
| Environment Variable | Default | Description |
|---|---|---|
| DEBUG_MAX_SAMPLES | 10000 | Maximum number of samples used per analysis |
| DEBUG_COUNTERFACTUAL_LIMIT | 5 | Maximum counterfactuals generated per request |
| DEBUG_CONFIDENCE_THRESHOLD | 0.6 | Threshold below which a prediction is counted as low-confidence |
| DEBUG_SLICE_MIN_SAMPLES | 30 | Minimum samples required for a slice to be reported |
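These variables are presumably resolved at service start-up, with environment values overriding the documented defaults. A sketch of such a loader (illustrative only, not the service's actual configuration code):

```python
import os

# Documented defaults for the Model Debugging module.
DEFAULTS = {
    "DEBUG_MAX_SAMPLES": 10000,
    "DEBUG_COUNTERFACTUAL_LIMIT": 5,
    "DEBUG_CONFIDENCE_THRESHOLD": 0.6,
    "DEBUG_SLICE_MIN_SAMPLES": 30,
}

def load_debug_config(env=os.environ):
    """Resolve each setting from the environment, falling back to its default."""
    config = {}
    for key, default in DEFAULTS.items():
        raw = env.get(key)
        # Coerce the override to the default's type (int or float).
        config[key] = type(default)(raw) if raw is not None else default
    return config

# Overrides come from the environment, e.g. DEBUG_MAX_SAMPLES=50000:
print(load_debug_config({"DEBUG_MAX_SAMPLES": "50000"}))
```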