Model Serving
The Model Serving integration enables deploying trained models for real-time and batch inference through the AI Service. Models are served via Ray Serve or NVIDIA Triton Inference Server in the ML Service, with the AI Service providing the API gateway, request routing, and monitoring overlay.
Serving Architecture
```
Client Request --> AI Service --> ML Service (Ray Serve / Triton) --> Prediction
                                       |
                              Model Registry (version resolution)
                                       |
                              Feature Store (feature retrieval)
```
Deployment Models
| Deployment Type | Description | Use Case |
|---|---|---|
| Real-time | Low-latency single predictions | User-facing applications |
| Batch | High-throughput bulk predictions | Scheduled scoring jobs |
| Streaming | Continuous predictions on event streams | Real-time alerting |
Deploy Model
Deploys a model from the registry to a serving endpoint:
POST /api/v1/ml/serving/deploy

```json
{
  "model_id": "model-xyz789",
  "model_version": "v2",
  "serving_config": {
    "min_replicas": 1,
    "max_replicas": 4,
    "target_latency_ms": 50,
    "resources": {
      "cpu": 2,
      "memory_gb": 4
    }
  },
  "traffic_config": {
    "strategy": "canary",
    "canary_percentage": 10
  }
}
```
Get Prediction
Runs inference against a deployed model:
POST /api/v1/ml/serving/:model_id/predict

```json
{
  "features": {
    "tenure": 24,
    "monthly_charges": 79.50,
    "total_charges": 1908.00,
    "contract_type": "month-to-month"
  }
}
```
Response

```json
{
  "prediction": 1,
  "probability": 0.82,
  "model_id": "model-xyz789",
  "model_version": "v2",
  "latency_ms": 12,
  "features_used": ["tenure", "monthly_charges", "total_charges", "contract_type"]
}
```
Traffic Management
The serving layer supports advanced traffic management for safe rollouts:
| Strategy | Description |
|---|---|
| Direct | 100% traffic to the specified version |
| Canary | Gradual percentage shift to new version |
| Shadow | Mirror traffic to new version without serving responses |
| A/B Test | Split traffic by user segment for comparison |
| Blue/Green | Instant switch between two deployed versions |
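As an illustration of the canary strategy, the per-request split can be sketched as a weighted random choice. This is not the gateway's actual routing code; the function and parameter names are invented for the example:

```python
import random

# Hypothetical sketch of a canary split decision. The real routing is
# performed inside the AI Service; only the percentage math is shown here.
def pick_version(stable: str, canary: str, canary_percentage: float,
                 rng=random.random) -> str:
    """Route roughly canary_percentage% of requests to the canary version."""
    return canary if rng() * 100 < canary_percentage else stable

# The extremes behave like the Direct strategy:
assert pick_version("v1", "v2", 0) == "v1"    # canary disabled
assert pick_version("v1", "v2", 100) == "v2"  # full cutover
```

Raising `canary_percentage` in steps (10, 25, 50, 100) is the usual rollout path; Shadow differs only in that the canary's response is recorded but never returned to the caller.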
Autoscaling
Serving endpoints autoscale based on request volume and latency:
```json
{
  "autoscaling": {
    "min_replicas": 1,
    "max_replicas": 10,
    "target_requests_per_second": 100,
    "scale_up_threshold": 0.8,
    "scale_down_threshold": 0.3,
    "cooldown_seconds": 60
  }
}
```
Health and Monitoring
| Endpoint | Purpose |
|---|---|
| GET /api/v1/ml/serving/:model_id/health | Model serving health check |
| GET /api/v1/ml/serving/:model_id/metrics | Latency, throughput, error rate |
| GET /api/v1/ml/serving/deployments | List all active deployments |
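A monitoring client could derive an error-rate check from the metrics endpoint. The payload field names below are assumptions modeled on the counters this service exposes; the actual response schema is not documented here:

```python
# Hypothetical client-side check built on the metrics endpoint. The field
# names are assumptions, not a documented response schema.
def error_rate(metrics: dict) -> float:
    """Fraction of failed predictions; 0.0 when there is no traffic."""
    total = metrics.get("ml_serving_requests_total", 0)
    errors = metrics.get("ml_serving_errors_total", 0)
    return errors / total if total else 0.0

sample = {"ml_serving_requests_total": 1000, "ml_serving_errors_total": 5}
print(error_rate(sample))  # 0.005
```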
Key Metrics
| Metric | Type | Description |
|---|---|---|
| ml_serving_latency_ms | Histogram | End-to-end prediction latency |
| ml_serving_requests_total | Counter | Total prediction requests |
| ml_serving_errors_total | Counter | Failed prediction requests |
| ml_serving_replicas | Gauge | Active serving replicas |
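The ml_serving_replicas gauge moves with the autoscaler described earlier. As a rough illustration only (the actual controller lives in the ML Service and also honors the cooldown), a threshold-based replica decision over the autoscaling config fields could look like:

```python
# Illustrative threshold-based scaler using the autoscaling config fields
# shown earlier; a sketch, not the service's actual control loop.
def desired_replicas(current: int, observed_rps: float, cfg: dict) -> int:
    """Scale on utilization = observed RPS / provisioned capacity."""
    capacity = current * cfg["target_requests_per_second"]
    utilization = observed_rps / capacity
    if utilization > cfg["scale_up_threshold"]:
        return min(current + 1, cfg["max_replicas"])
    if utilization < cfg["scale_down_threshold"]:
        return max(current - 1, cfg["min_replicas"])
    return current

cfg = {"min_replicas": 1, "max_replicas": 10,
       "target_requests_per_second": 100,
       "scale_up_threshold": 0.8, "scale_down_threshold": 0.3}

print(desired_replicas(2, 190, cfg))  # 95% utilization -> scale up to 3
```

The cooldown_seconds setting would rate-limit how often this decision is re-evaluated, preventing replica flapping around the thresholds.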
Configuration
| Environment Variable | Default | Description |
|---|---|---|
| ML_SERVING_BACKEND | ray_serve | Serving backend (ray_serve, triton) |
| ML_SERVING_DEFAULT_TIMEOUT | 30 | Default prediction timeout in seconds |
| ML_SERVING_MAX_BATCH_SIZE | 64 | Maximum batch size for batched inference |
| ML_SERVING_CACHE_ENABLED | true | Cache predictions for identical inputs |
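When ML_SERVING_CACHE_ENABLED is on, identical inputs can reuse a prior prediction. The key scheme below is an assumption for illustration (the service's actual cache implementation is internal): the key must be stable under feature ordering so that equivalent requests collide.

```python
import hashlib
import json

# Hypothetical cache key for the identical-input prediction cache; the
# real key scheme behind ML_SERVING_CACHE_ENABLED is not documented here.
def cache_key(model_id: str, model_version: str, features: dict) -> str:
    """Same model + same features -> same key, regardless of dict order."""
    canonical = json.dumps(features, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{model_id}:{model_version}:{digest}"

k1 = cache_key("model-xyz789", "v2", {"tenure": 24, "monthly_charges": 79.5})
k2 = cache_key("model-xyz789", "v2", {"monthly_charges": 79.5, "tenure": 24})
print(k1 == k2)  # feature ordering does not change the key
```

Including the model version in the key ensures a canary or blue/green cutover never serves predictions cached from the previous version.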