Model Serving
The Model Serving integration enables deploying trained models for real-time and batch inference through the AI Service. Models are served via Ray Serve or NVIDIA Triton Inference Server in the ML Service, with the AI Service providing the API gateway, request routing, and monitoring overlay.
Serving Architecture
```
Client Request --> AI Service --> ML Service (Ray Serve / Triton) --> Prediction
                                       |
                              Model Registry (version resolution)
                                       |
                              Feature Store (feature retrieval)
```
Deployment Models
| Deployment Type | Description | Use Case |
|---|---|---|
| Real-time | Low-latency single predictions | User-facing applications |
| Batch | High-throughput bulk predictions | Scheduled scoring jobs |
| Streaming | Continuous predictions on event streams | Real-time alerting |
Deploy Model
Deploys a model from the registry to a serving endpoint:
POST /api/v1/ml/serving/deploy

```json
{
  "model_id": "model-xyz789",
  "model_version": "v2",
  "serving_config": {
    "min_replicas": 1,
    "max_replicas": 4,
    "target_latency_ms": 50,
    "resources": {
      "cpu": 2,
      "memory_gb": 4
    }
  },
  "traffic_config": {
    "strategy": "canary",
    "canary_percentage": 10
  }
}
```
Get Prediction
Runs inference against a deployed model:
POST /api/v1/ml/serving/:model_id/predict

```json
{
  "features": {
    "tenure": 24,
    "monthly_charges": 79.50,
    "total_charges": 1908.00,
    "contract_type": "month-to-month"
  }
}
```
Response

```json
{
  "prediction": 1,
  "probability": 0.82,
  "model_id": "model-xyz789",
  "model_version": "v2",
  "latency_ms": 12,
  "features_used": ["tenure", "monthly_charges", "total_charges", "contract_type"]
}
```
Traffic Management
The serving layer supports advanced traffic management for safe rollouts:
| Strategy | Description |
|---|---|
| Direct | 100% traffic to the specified version |
| Canary | Gradual percentage shift to new version |
| Shadow | Mirror traffic to new version without serving responses |
| A/B Test | Split traffic by user segment for comparison |
| Blue/Green | Instant switch between two deployed versions |
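As an illustration of the canary strategy, the per-request split can be sketched as a weighted random choice. This is not the gateway's actual routing code; the function and parameter names are invented for the example:

```python
import random

# Hypothetical sketch of a canary split decision. The real routing is
# performed inside the AI Service; only the percentage math is shown here.
def pick_version(stable: str, canary: str, canary_percentage: float,
                 rng=random.random) -> str:
    """Route roughly canary_percentage% of requests to the canary version."""
    return canary if rng() * 100 < canary_percentage else stable

# The extremes behave like the Direct strategy:
assert pick_version("v1", "v2", 0) == "v1"    # canary disabled
assert pick_version("v1", "v2", 100) == "v2"  # full cutover
```

Raising `canary_percentage` in steps (10, 25, 50, 100) is the usual rollout path; Shadow differs only in that the canary's response is recorded but never returned to the caller.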
Autoscaling
Serving endpoints autoscale based on request volume and latency:
```json
{
  "autoscaling": {
    "min_replicas": 1,
    "max_replicas": 10,
    "target_requests_per_second": 100,
    "scale_up_threshold": 0.8,
    "scale_down_threshold": 0.3,
    "cooldown_seconds": 60
  }
}
```
Health and Monitoring
| Endpoint | Purpose |
|---|---|
| GET /api/v1/ml/serving/:model_id/health | Model serving health check |
| GET /api/v1/ml/serving/:model_id/metrics | Latency, throughput, error rate |
| GET /api/v1/ml/serving/deployments | List all active deployments |
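A monitoring client could derive an error-rate check from the metrics endpoint. The payload field names below are assumptions modeled on the counters this service exposes; the actual response schema is not documented here:

```python
# Hypothetical client-side check built on the metrics endpoint. The field
# names are assumptions, not a documented response schema.
def error_rate(metrics: dict) -> float:
    """Fraction of failed predictions; 0.0 when there is no traffic."""
    total = metrics.get("ml_serving_requests_total", 0)
    errors = metrics.get("ml_serving_errors_total", 0)
    return errors / total if total else 0.0

sample = {"ml_serving_requests_total": 1000, "ml_serving_errors_total": 5}
print(error_rate(sample))  # 0.005
```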
Key Metrics
| Metric | Type | Description |
|---|---|---|
| ml_serving_latency_ms | Histogram | End-to-end prediction latency |
| ml_serving_requests_total | Counter | Total prediction requests |
| ml_serving_errors_total | Counter | Failed prediction requests |
| ml_serving_replicas | Gauge | Active serving replicas |
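The ml_serving_replicas gauge moves with the autoscaler described earlier. As a rough illustration only (the actual controller lives in the ML Service and also honors the cooldown), a threshold-based replica decision over the autoscaling config fields could look like:

```python
# Illustrative threshold-based scaler using the autoscaling config fields
# shown earlier; a sketch, not the service's actual control loop.
def desired_replicas(current: int, observed_rps: float, cfg: dict) -> int:
    """Scale on utilization = observed RPS / provisioned capacity."""
    capacity = current * cfg["target_requests_per_second"]
    utilization = observed_rps / capacity
    if utilization > cfg["scale_up_threshold"]:
        return min(current + 1, cfg["max_replicas"])
    if utilization < cfg["scale_down_threshold"]:
        return max(current - 1, cfg["min_replicas"])
    return current

cfg = {"min_replicas": 1, "max_replicas": 10,
       "target_requests_per_second": 100,
       "scale_up_threshold": 0.8, "scale_down_threshold": 0.3}

print(desired_replicas(2, 190, cfg))  # 95% utilization -> scale up to 3
```

The cooldown_seconds setting would rate-limit how often this decision is re-evaluated, preventing replica flapping around the thresholds.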
Configuration
| Environment Variable | Default | Description |
|---|---|---|
| ML_SERVING_BACKEND | ray_serve | Serving backend (ray_serve, triton) |
| ML_SERVING_DEFAULT_TIMEOUT | 30 | Default prediction timeout in seconds |
| ML_SERVING_MAX_BATCH_SIZE | 64 | Maximum batch size for batched inference |
| ML_SERVING_CACHE_ENABLED | true | Cache predictions for identical inputs |
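When ML_SERVING_CACHE_ENABLED is on, identical inputs can reuse a prior prediction. The key scheme below is an assumption for illustration (the service's actual cache implementation is internal): the key must be stable under feature ordering so that equivalent requests collide.

```python
import hashlib
import json

# Hypothetical cache key for the identical-input prediction cache; the
# real key scheme behind ML_SERVING_CACHE_ENABLED is not documented here.
def cache_key(model_id: str, model_version: str, features: dict) -> str:
    """Same model + same features -> same key, regardless of dict order."""
    canonical = json.dumps(features, sort_keys=True)
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{model_id}:{model_version}:{digest}"

k1 = cache_key("model-xyz789", "v2", {"tenure": 24, "monthly_charges": 79.5})
k2 = cache_key("model-xyz789", "v2", {"monthly_charges": 79.5, "tenure": 24})
print(k1 == k2)  # feature ordering does not change the key
```

Including the model version in the key ensures a canary or blue/green cutover never serves predictions cached from the previous version.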