Performance Monitoring
The Performance Monitoring module tracks model accuracy, latency, throughput, and resource utilization for deployed models. It provides real-time dashboards, historical trend analysis, and alerting when any metric crosses its configured threshold. The implementation is in src/monitoring/performance_monitoring_service.py.
Monitored Metrics
Prediction Quality
| Metric | Type | Description |
|---|---|---|
| Accuracy | Classification | Percentage of correct predictions |
| F1 Score | Classification | Harmonic mean of precision and recall |
| AUC-ROC | Classification | Area under the ROC curve |
| RMSE | Regression | Root mean squared error |
| MAE | Regression | Mean absolute error |
| MAPE | Regression | Mean absolute percentage error |
Serving Performance
| Metric | Type | Description |
|---|---|---|
| Latency (p50, p95, p99) | Histogram | End-to-end prediction latency |
| Throughput | Gauge | Requests per second |
| Error rate | Gauge | Fraction of failed predictions (derived from error and request counters) |
| Queue depth | Gauge | Pending requests in serving queue |
Resource Utilization
| Metric | Type | Description |
|---|---|---|
| CPU utilization | Gauge | Serving pod CPU usage |
| Memory utilization | Gauge | Serving pod memory usage |
| GPU utilization | Gauge | GPU compute utilization (if applicable) |
| Model cache hit rate | Gauge | In-memory model cache effectiveness |
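The latency percentiles (p50, p95, p99) are computed from per-request samples. As a minimal illustration of nearest-rank percentiles (a hypothetical helper; the service's actual aggregation presumably uses histogram buckets, per the metric type above):

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile (pct in 0-100) over a list of samples."""
    if not samples:
        raise ValueError("percentile of empty sample set")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

# Ten example request latencies in milliseconds (made-up values):
latencies_ms = [5.1, 6.3, 7.8, 8.2, 9.0, 11.4, 14.9, 22.5, 30.2, 48.1]
p50 = percentile(latencies_ms, 50)   # 9.0
p95 = percentile(latencies_ms, 95)   # 48.1
```

Nearest-rank is the simplest definition; histogram-based estimates trade exactness for bounded memory at high request volumes.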
Get Performance Summary
GET /api/v1/monitoring/performance/:model_id

Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| window | string | no | Time window (1h, 6h, 24h, 7d, 30d) |
| granularity | string | no | Metric granularity (minute, hour, day) |
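A request URL for this endpoint can be assembled as follows (the base URL is an assumption; only the path and query parameters come from this doc):

```python
from urllib.parse import urlencode

BASE_URL = "https://api.example.com"  # hypothetical host, not specified in this doc

def performance_summary_url(model_id: str, window: str = "24h",
                            granularity: str = "hour") -> str:
    """Build the GET URL for the performance summary endpoint."""
    query = urlencode({"window": window, "granularity": granularity})
    return f"{BASE_URL}/api/v1/monitoring/performance/{model_id}?{query}"

print(performance_summary_url("model-xyz789"))
# → https://api.example.com/api/v1/monitoring/performance/model-xyz789?window=24h&granularity=hour
```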
Response
{
  "model_id": "model-xyz789",
  "window": "24h",
  "quality_metrics": {
    "accuracy": {"current": 0.91, "baseline": 0.912, "trend": "stable"},
    "f1_score": {"current": 0.89, "baseline": 0.895, "trend": "slight_decline"}
  },
  "serving_metrics": {
    "latency_p50_ms": 8.2,
    "latency_p95_ms": 22.5,
    "latency_p99_ms": 48.1,
    "throughput_rps": 145,
    "error_rate": 0.001
  },
  "resource_metrics": {
    "cpu_utilization": 0.45,
    "memory_utilization": 0.62,
    "gpu_utilization": 0.0
  }
}

Baseline Comparison
Performance is compared against baselines established at deployment time:
| Metric | Baseline Source | Alert Threshold |
|---|---|---|
| Accuracy | Test set evaluation at deployment | 5% relative degradation |
| Latency p95 | First 24 hours in production | 50% increase |
| Error rate | First 24 hours in production | Above 1% |
| Throughput | Expected based on traffic forecast | 20% below forecast |
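The 5% relative-degradation rule for accuracy can be sketched as follows (the function name is illustrative, not the module's actual API):

```python
def accuracy_degraded(current: float, baseline: float,
                      max_relative_drop: float = 0.05) -> bool:
    """True when accuracy has fallen more than max_relative_drop below baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) / baseline > max_relative_drop

# 0.91 against a 0.912 baseline is only ~0.2% relative drop -> no alert
print(accuracy_degraded(0.91, 0.912))   # False
# 0.85 against 0.912 is ~6.8% relative drop -> alert
print(accuracy_degraded(0.85, 0.912))   # True
```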
Performance Trends
The module tracks metric trends over time to detect gradual degradation:
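One way the reported slope and projected breach date could be derived is a least-squares fit over recent daily snapshots; a rough sketch with illustrative helper names (not the module's actual implementation):

```python
def linear_slope(values: list[float]) -> float:
    """Least-squares slope of values against their indices (change per day)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return cov / var

def days_until_breach(current: float, threshold: float, slope: float):
    """Days until a declining metric crosses threshold; None if not declining."""
    if slope >= 0 or current <= threshold:
        return None
    return (current - threshold) / -slope

# Seven made-up daily accuracy snapshots showing a gradual decline:
daily_accuracy = [0.912, 0.910, 0.909, 0.907, 0.906, 0.904, 0.902]
slope = linear_slope(daily_accuracy)   # ≈ -0.0016 per day
eta = days_until_breach(0.902, 0.866, slope)
```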
{
  "trends": {
    "accuracy": {
      "7_day_trend": "declining",
      "slope": -0.002,
      "projected_baseline_breach": "2025-03-22T00:00:00Z",
      "confidence": 0.78
    }
  }
}

Alerting Rules
{
  "model_id": "model-xyz789",
  "alert_rules": [
    {
      "metric": "accuracy",
      "condition": "below",
      "threshold": 0.85,
      "window": "1h",
      "severity": "critical"
    },
    {
      "metric": "latency_p95",
      "condition": "above",
      "threshold": 100,
      "window": "15m",
      "severity": "warning"
    },
    {
      "metric": "error_rate",
      "condition": "above",
      "threshold": 0.05,
      "window": "5m",
      "severity": "critical"
    }
  ]
}

Prometheus Integration
All metrics are exported in Prometheus format for use in Grafana dashboards:
| Prometheus Metric | Labels | Type |
|---|---|---|
| ml_model_accuracy | model_id, tenant_id | Gauge |
| ml_model_latency_seconds | model_id, quantile | Summary |
| ml_model_predictions_total | model_id, status | Counter |
| ml_model_errors_total | model_id, error_type | Counter |
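As a rough illustration of what one exported sample looks like in the Prometheus text exposition format (a hand-rolled sketch; the service presumably uses a standard Prometheus client library, and the tenant label value here is made up):

```python
def prometheus_line(name: str, labels: dict[str, str], value: float) -> str:
    """Render one sample in Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

line = prometheus_line("ml_model_accuracy",
                       {"model_id": "model-xyz789", "tenant_id": "tenant-1"},
                       0.91)
print(line)
# → ml_model_accuracy{model_id="model-xyz789",tenant_id="tenant-1"} 0.91
```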
Configuration
| Environment Variable | Default | Description |
|---|---|---|
| PERF_MONITORING_INTERVAL | 60 | Metric aggregation interval in seconds |
| PERF_ACCURACY_THRESHOLD | 0.05 | Accuracy degradation alert threshold |
| PERF_LATENCY_P95_MAX_MS | 100 | Maximum acceptable p95 latency in milliseconds |
| PERF_ERROR_RATE_MAX | 0.01 | Maximum acceptable error rate |
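These settings could be read with stdlib environment lookups; a minimal sketch using the documented defaults (the dictionary keys and function name are illustrative, not the service's real config loader):

```python
import os

def load_perf_config(env=os.environ) -> dict:
    """Read monitoring settings, falling back to the documented defaults."""
    return {
        "interval_s": int(env.get("PERF_MONITORING_INTERVAL", "60")),
        "accuracy_threshold": float(env.get("PERF_ACCURACY_THRESHOLD", "0.05")),
        "latency_p95_max_ms": float(env.get("PERF_LATENCY_P95_MAX_MS", "100")),
        "error_rate_max": float(env.get("PERF_ERROR_RATE_MAX", "0.01")),
    }

print(load_perf_config({}))  # all defaults when no variables are set
```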