# Monitoring Overview
The ML Monitoring module continuously tracks model performance, detects data drift, and triggers automated retraining for models deployed in production. By detecting degradation early and initiating corrective action, it keeps model quality high over time.
## Monitoring Architecture

```
Production Traffic --> Prediction Logger --> Monitoring Service
                                                    |
                          +-----------------+-----------------+
                          |                 |                 |
                   Drift Detection  Performance Monitor  Retraining Trigger
                          |                 |                 |
                    Alert System        Dashboard        ML Pipeline
```

## Key Components
| Component | Location | Purpose |
|---|---|---|
| Drift Detection Service | `src/monitoring/drift_detection_service.py` | Multi-dimensional drift analysis |
| Performance Monitoring | `src/monitoring/performance_monitoring_service.py` | Model accuracy and latency tracking |
| Model Monitoring Service | `src/monitoring/model_monitoring_service.py` | Overall monitoring orchestration |
| Retraining Triggers | `src/monitoring/retraining_trigger_service.py` | Automated retraining logic |
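How these components fit together can be sketched as a small orchestration loop. This is a minimal illustration, not the actual `model_monitoring_service.py` implementation: the class, method names, and result shapes below are hypothetical stand-ins for the services listed in the table.

```python
class ModelMonitoringService:
    """Illustrative orchestrator: runs drift and performance checks,
    and fires the retraining trigger when either check fails."""

    def __init__(self, drift_detector, performance_monitor, retraining_trigger):
        self.drift_detector = drift_detector
        self.performance_monitor = performance_monitor
        self.retraining_trigger = retraining_trigger

    def run_check(self, model_id: str) -> dict:
        # Each sub-service returns a small status dict (hypothetical shape).
        drift = self.drift_detector.check(model_id)
        perf = self.performance_monitor.check(model_id)
        if drift["drifted"] or perf["degraded"]:
            # Hand off to the retraining trigger (see Retraining Triggers below).
            self.retraining_trigger.fire(model_id)
        return {"drift": drift, "performance": perf}
```

In this sketch, the orchestrator only routes results; the thresholds that decide `drifted` and `degraded` live inside the individual services.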
## Monitoring Dimensions
| Dimension | Metrics | Frequency |
|---|---|---|
| Data drift | Feature distribution shifts | Hourly |
| Concept drift | Prediction quality degradation | Hourly |
| Performance | Accuracy, F1, latency, throughput | Real-time |
| Resource | CPU, memory, GPU utilization | Real-time |
| Traffic | Request volume, error rates | Real-time |
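A common way to measure the "feature distribution shifts" in the data-drift row is a two-sample Kolmogorov-Smirnov statistic between a training-time reference sample and recent production traffic. The sketch below is illustrative only; the function names and the fixed `threshold` are assumptions, and the real service presumably uses proper significance testing rather than a raw statistic cutoff.

```python
import bisect
import random

def ks_statistic(reference, live):
    """Two-sample KS statistic: the maximum gap between the two
    empirical CDFs, evaluated over the union of observed values."""
    ref, liv = sorted(reference), sorted(live)
    n, m = len(ref), len(liv)
    d = 0.0
    for v in set(ref) | set(liv):
        cdf_ref = bisect.bisect_right(ref, v) / n
        cdf_live = bisect.bisect_right(liv, v) / m
        d = max(d, abs(cdf_ref - cdf_live))
    return d

def detect_feature_drift(reference, live, threshold=0.1):
    """Flag drift when the KS statistic exceeds an illustrative threshold."""
    d = ks_statistic(reference, live)
    return {"statistic": d, "drifted": d > threshold}

# Simulated traffic: a mean shift in one feature should register as drift.
random.seed(7)
reference = [random.gauss(0.0, 1.0) for _ in range(2000)]
shifted = [random.gauss(1.0, 1.0) for _ in range(2000)]
print(detect_feature_drift(reference, shifted))
```

Concept drift, by contrast, requires (possibly delayed) ground-truth labels, which is why it is tracked separately from feature-level drift.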
## Alert Severity Levels

| Severity | Response | Example |
|---|---|---|
| info | Log only | Minor feature drift below threshold |
| warning | Notify team | Moderate drift detected, monitoring closely |
| critical | Page on-call | Significant accuracy degradation |
| emergency | Auto-remediate | Model serving errors above 5% |
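The severity ladder above lends itself to an ordered enum with a response lookup. The 5% emergency cutoff comes from the table; the lower thresholds in `classify_error_rate` are made-up examples, as are the response identifiers.

```python
from enum import IntEnum

class Severity(IntEnum):
    INFO = 0
    WARNING = 1
    CRITICAL = 2
    EMERGENCY = 3

# Hypothetical response actions mirroring the severity table.
RESPONSES = {
    Severity.INFO: "log",
    Severity.WARNING: "notify_team",
    Severity.CRITICAL: "page_on_call",
    Severity.EMERGENCY: "auto_remediate",
}

def classify_error_rate(error_rate: float) -> Severity:
    """Map a serving error rate to a severity. Only the 5% emergency
    cutoff is documented; the other thresholds are illustrative."""
    if error_rate >= 0.05:
        return Severity.EMERGENCY
    if error_rate >= 0.02:
        return Severity.CRITICAL
    if error_rate >= 0.01:
        return Severity.WARNING
    return Severity.INFO
```

Using `IntEnum` keeps severities comparable (`Severity.CRITICAL > Severity.WARNING`), which is convenient for cooldown and escalation logic.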
## API Endpoints

| Endpoint | Method | Purpose |
|---|---|---|
| `/api/v1/monitoring/drift/:model_id` | GET | Get drift status for a model |
| `/api/v1/monitoring/performance/:model_id` | GET | Get performance metrics |
| `/api/v1/monitoring/alerts` | GET | List active monitoring alerts |
| `/api/v1/monitoring/config/:model_id` | PUT | Configure monitoring thresholds |
| `/api/v1/monitoring/retrain/:model_id` | POST | Trigger manual retraining |
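A small client sketch for two of these endpoints, using only the standard library. The base URL and the threshold payload shape are assumptions; the endpoint paths and HTTP methods come from the table.

```python
import json
from urllib import request

BASE_URL = "http://localhost:8080"  # assumed host for the monitoring service

def drift_status_request(model_id: str) -> request.Request:
    """Build the GET request for a model's drift status."""
    return request.Request(
        f"{BASE_URL}/api/v1/monitoring/drift/{model_id}", method="GET"
    )

def configure_thresholds_request(model_id: str, thresholds: dict) -> request.Request:
    """Build the PUT request that updates monitoring thresholds.
    The JSON body shape is a guess; consult the service schema."""
    body = json.dumps(thresholds).encode("utf-8")
    return request.Request(
        f"{BASE_URL}/api/v1/monitoring/config/{model_id}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# To execute against a live service: urllib.request.urlopen(req)
```

Separating request construction from sending keeps the URL and payload logic testable without a running server.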
## Configuration

| Environment Variable | Default | Description |
|---|---|---|
| `MONITORING_ENABLED` | true | Enable monitoring module |
| `MONITORING_INTERVAL_SECONDS` | 3600 | Default monitoring check interval |
| `MONITORING_RETENTION_DAYS` | 90 | Monitoring data retention |
| `MONITORING_ALERT_COOLDOWN` | 3600 | Minimum seconds between duplicate alerts |
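Reading these variables might look like the following sketch; the `MonitoringConfig` class is illustrative (the module's actual config loading may differ), but the variable names and defaults match the table.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class MonitoringConfig:
    enabled: bool
    interval_seconds: int
    retention_days: int
    alert_cooldown: int

def load_config(env=os.environ) -> MonitoringConfig:
    """Read the monitoring environment variables, falling back to the
    documented defaults when a variable is unset."""
    return MonitoringConfig(
        enabled=env.get("MONITORING_ENABLED", "true").lower() == "true",
        interval_seconds=int(env.get("MONITORING_INTERVAL_SECONDS", "3600")),
        retention_days=int(env.get("MONITORING_RETENTION_DAYS", "90")),
        alert_cooldown=int(env.get("MONITORING_ALERT_COOLDOWN", "3600")),
    )
```

Passing `env` as a parameter (defaulting to `os.environ`) makes the loader easy to exercise with a plain dict in tests.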
## Detailed Sections
| Section | Content |
|---|---|
| Drift Detection | Feature drift, concept drift, statistical tests |
| Performance Monitoring | Accuracy tracking, latency, throughput |
| Retraining Triggers | Automated and rule-based retraining |