MATIH Platform is in active MVP development. Documentation reflects current implementation status.
13. ML Service & MLOps
Monitoring Overview

The ML Monitoring module provides continuous tracking of model performance, data drift detection, and automated retraining triggers for models deployed in production. It ensures that model quality remains high over time by detecting degradation early and initiating corrective actions.


Monitoring Architecture

Production Traffic --> Prediction Logger --> Monitoring Service
                                                |
                              +-----------------+-----------------+
                              |                 |                 |
                       Drift Detection   Performance Monitor   Retraining Trigger
                              |                 |                 |
                         Alert System      Dashboard         ML Pipeline

Key Components

| Component | Location | Purpose |
| --- | --- | --- |
| Drift Detection Service | `src/monitoring/drift_detection_service.py` | Multi-dimensional drift analysis |
| Performance Monitoring | `src/monitoring/performance_monitoring_service.py` | Model accuracy and latency tracking |
| Model Monitoring Service | `src/monitoring/model_monitoring_service.py` | Overall monitoring orchestration |
| Retraining Triggers | `src/monitoring/retraining_trigger_service.py` | Automated retraining logic |
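As an illustration of the kind of check the drift detection service performs, the sketch below computes the population stability index (PSI) between a reference (training-time) sample and a production sample of one feature. This is a hypothetical implementation, not the code in `src/monitoring/drift_detection_service.py`; the 0.1 / 0.25 thresholds mentioned in the docstring are conventional rules of thumb, not MATIH defaults.

```python
import math

def population_stability_index(expected, actual, bins=10):
    """Score distribution shift between two samples via PSI.

    Conventional interpretation (rule of thumb, not a MATIH setting):
    PSI < 0.1 is stable, 0.1-0.25 is moderate drift, > 0.25 is
    significant drift.
    """
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = min(int((x - lo) / width), bins - 1)  # clamp max value
            counts[idx] += 1
        n = len(sample)
        # Floor each fraction so empty bins do not produce log(0).
        return [max(c / n, 1e-6) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A check like this would run per feature on each hourly batch and feed the alert system when the score crosses a configured threshold.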

Monitoring Dimensions

| Dimension | Metrics | Frequency |
| --- | --- | --- |
| Data drift | Feature distribution shifts | Hourly |
| Concept drift | Prediction quality degradation | Hourly |
| Performance | Accuracy, F1, latency, throughput | Real-time |
| Resource | CPU, memory, GPU utilization | Real-time |
| Traffic | Request volume, error rates | Real-time |
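The real-time performance dimension can be tracked with a rolling window over recent predictions. A minimal sketch, assuming a fixed-size window (the actual `performance_monitoring_service.py` may differ):

```python
from collections import deque

class PerformanceWindow:
    """Rolling window of prediction outcomes and latencies."""

    def __init__(self, maxlen=1000):
        # deque(maxlen=...) evicts the oldest entry automatically.
        self.outcomes = deque(maxlen=maxlen)      # 1 = correct, 0 = incorrect
        self.latencies_ms = deque(maxlen=maxlen)

    def record(self, correct, latency_ms):
        self.outcomes.append(1 if correct else 0)
        self.latencies_ms.append(latency_ms)

    def accuracy(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def p95_latency_ms(self):
        if not self.latencies_ms:
            return None
        ordered = sorted(self.latencies_ms)
        return ordered[min(int(0.95 * len(ordered)), len(ordered) - 1)]
```

Accuracy here assumes ground-truth labels eventually arrive for logged predictions; until they do, only latency and throughput are observable in real time.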

Alert Severity Levels

| Severity | Response | Example |
| --- | --- | --- |
| info | Log only | Minor feature drift below threshold |
| warning | Notify team | Moderate drift detected, monitoring closely |
| critical | Page on-call | Significant accuracy degradation |
| emergency | Auto-remediate | Model serving errors above 5% |
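The `MONITORING_ALERT_COOLDOWN` setting (see Configuration) suppresses duplicate alerts for the same model and condition. A hypothetical deduplication helper, not the actual alert-system code:

```python
import time

class AlertDeduplicator:
    """Suppress repeat alerts within a cooldown window.

    Keyed by (model_id, alert_type) so that, e.g., a drift warning on one
    model does not silence a latency alert on another.
    """

    def __init__(self, cooldown_seconds=3600):
        self.cooldown = cooldown_seconds
        self._last_sent = {}  # (model_id, alert_type) -> last-sent timestamp

    def should_send(self, model_id, alert_type, now=None):
        now = time.time() if now is None else now
        key = (model_id, alert_type)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False  # still inside the cooldown; drop the duplicate
        self._last_sent[key] = now
        return True
```

Note that suppressed alerts do not reset the window: an alert firing continuously is re-sent once per cooldown period rather than silenced forever.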

API Endpoints

| Endpoint | Method | Purpose |
| --- | --- | --- |
| `/api/v1/monitoring/drift/:model_id` | GET | Get drift status for a model |
| `/api/v1/monitoring/performance/:model_id` | GET | Get performance metrics |
| `/api/v1/monitoring/alerts` | GET | List active monitoring alerts |
| `/api/v1/monitoring/config/:model_id` | PUT | Configure monitoring thresholds |
| `/api/v1/monitoring/retrain/:model_id` | POST | Trigger manual retraining |
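The endpoints above can be wrapped in a thin client. The sketch below only builds the method and URL for each action (the action names and base URL are illustrative, not part of the API):

```python
# Hypothetical action names mapped to the documented method/path pairs.
ENDPOINTS = {
    "drift":       ("GET",  "/api/v1/monitoring/drift/{model_id}"),
    "performance": ("GET",  "/api/v1/monitoring/performance/{model_id}"),
    "alerts":      ("GET",  "/api/v1/monitoring/alerts"),
    "configure":   ("PUT",  "/api/v1/monitoring/config/{model_id}"),
    "retrain":     ("POST", "/api/v1/monitoring/retrain/{model_id}"),
}

def build_request(action, base_url, model_id=None):
    """Return the (method, url) pair for a monitoring API call."""
    method, template = ENDPOINTS[action]
    path = (template.format(model_id=model_id)
            if "{model_id}" in template else template)
    return method, base_url.rstrip("/") + path
```

For example, `build_request("retrain", "http://localhost:8000", "fraud-v2")` yields a POST to `/api/v1/monitoring/retrain/fraud-v2`; the host and model ID here are placeholders.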

Configuration

| Environment Variable | Default | Description |
| --- | --- | --- |
| `MONITORING_ENABLED` | `true` | Enable monitoring module |
| `MONITORING_INTERVAL_SECONDS` | `3600` | Default monitoring check interval |
| `MONITORING_RETENTION_DAYS` | `90` | Monitoring data retention |
| `MONITORING_ALERT_COOLDOWN` | `3600` | Minimum seconds between duplicate alerts |
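A loader for these settings might look like the following sketch, which falls back to the documented defaults (the returned dict keys are illustrative, not the service's actual config schema):

```python
import os

def load_monitoring_config(env=None):
    """Read monitoring settings from the environment with documented defaults."""
    env = os.environ if env is None else env
    return {
        "enabled": env.get("MONITORING_ENABLED", "true").lower() == "true",
        "interval_seconds": int(env.get("MONITORING_INTERVAL_SECONDS", "3600")),
        "retention_days": int(env.get("MONITORING_RETENTION_DAYS", "90")),
        "alert_cooldown": int(env.get("MONITORING_ALERT_COOLDOWN", "3600")),
    }
```

Passing a plain dict as `env` makes the loader easy to unit-test without touching the process environment.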

Detailed Sections

| Section | Content |
| --- | --- |
| Drift Detection | Feature drift, concept drift, statistical tests |
| Performance Monitoring | Accuracy tracking, latency, throughput |
| Retraining Triggers | Automated and rule-based retraining |