MATIH Platform is in active MVP development. Documentation reflects current implementation status.

Model Serving

The Model Serving integration enables deploying trained models for real-time and batch inference through the AI Service. Models are served via Ray Serve or NVIDIA Triton Inference Server in the ML Service, with the AI Service providing the API gateway, request routing, and monitoring overlay.


Serving Architecture

Client Request --> AI Service --> ML Service (Ray Serve / Triton) --> Prediction
                      |
                Model Registry (version resolution)
                      |
                Feature Store (feature retrieval)

Deployment Models

Deployment Type   Description                                Use Case
Real-time         Low-latency single predictions             User-facing applications
Batch             High-throughput bulk predictions           Scheduled scoring jobs
Streaming         Continuous predictions on event streams    Real-time alerting
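As a sketch, each deployment type could carry different default serving parameters. The mapping and field names below are illustrative assumptions, not part of the documented serving_config:

```python
# Illustrative defaults per deployment type (keys and values are
# assumptions for the sketch, not documented platform fields).
DEPLOYMENT_DEFAULTS = {
    "real-time": {"min_replicas": 2, "target_latency_ms": 50},
    "batch":     {"min_replicas": 1, "max_batch_size": 64},
    "streaming": {"min_replicas": 2, "target_latency_ms": 200},
}

def serving_defaults(deployment_type: str) -> dict:
    """Return a copy of the default serving config for a deployment type."""
    if deployment_type not in DEPLOYMENT_DEFAULTS:
        raise ValueError(f"unknown deployment type: {deployment_type}")
    return dict(DEPLOYMENT_DEFAULTS[deployment_type])
```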

Deploy Model

Deploys a model from the registry to a serving endpoint:

POST /api/v1/ml/serving/deploy
{
  "model_id": "model-xyz789",
  "model_version": "v2",
  "serving_config": {
    "min_replicas": 1,
    "max_replicas": 4,
    "target_latency_ms": 50,
    "resources": {
      "cpu": 2,
      "memory_gb": 4
    }
  },
  "traffic_config": {
    "strategy": "canary",
    "canary_percentage": 10
  }
}
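A client-side sketch of assembling this payload. The `build_deploy_request` helper and its validation are hypothetical; only the JSON shape comes from the endpoint above:

```python
def build_deploy_request(model_id: str, model_version: str,
                         canary_percentage: int = 10) -> dict:
    """Build a canary deploy payload matching the documented shape.

    Helper name and validation are illustrative, not part of the API.
    """
    if not 0 < canary_percentage <= 100:
        raise ValueError("canary_percentage must be in (0, 100]")
    return {
        "model_id": model_id,
        "model_version": model_version,
        "serving_config": {
            "min_replicas": 1,
            "max_replicas": 4,
            "target_latency_ms": 50,
            "resources": {"cpu": 2, "memory_gb": 4},
        },
        "traffic_config": {
            "strategy": "canary",
            "canary_percentage": canary_percentage,
        },
    }

payload = build_deploy_request("model-xyz789", "v2")
```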

Get Prediction

Runs inference against a deployed model:

POST /api/v1/ml/serving/:model_id/predict
{
  "features": {
    "tenure": 24,
    "monthly_charges": 79.50,
    "total_charges": 1908.00,
    "contract_type": "month-to-month"
  }
}

Response

{
  "prediction": 1,
  "probability": 0.82,
  "model_id": "model-xyz789",
  "model_version": "v2",
  "latency_ms": 12,
  "features_used": ["tenure", "monthly_charges", "total_charges", "contract_type"]
}
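A minimal sketch of validating a response of the shape above on the client side. The `parse_prediction` helper is an assumption for illustration, not part of the platform:

```python
def parse_prediction(resp: dict) -> tuple[int, float]:
    """Extract (prediction, probability), checking the fields shown above."""
    pred, prob = resp["prediction"], resp["probability"]
    if not 0.0 <= prob <= 1.0:
        raise ValueError(f"probability out of range: {prob}")
    return pred, prob

# Example response from the documentation above:
example = {
    "prediction": 1,
    "probability": 0.82,
    "model_id": "model-xyz789",
    "model_version": "v2",
    "latency_ms": 12,
}
result = parse_prediction(example)
```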

Traffic Management

The serving layer supports advanced traffic management for safe rollouts:

Strategy     Description
Direct       100% traffic to the specified version
Canary       Gradual percentage shift to new version
Shadow       Mirror traffic to new version without serving responses
A/B Test     Split traffic by user segment for comparison
Blue/Green   Instant switch between two deployed versions
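The canary strategy amounts to percentage-based request routing. A sketch of that decision (this is not the platform's router, just the underlying idea):

```python
import random

def route_version(canary_percentage: int, rng: random.Random) -> str:
    """Route one request to 'canary' or 'stable' under a percentage split."""
    return "canary" if rng.uniform(0, 100) < canary_percentage else "stable"

# With a 10% canary, roughly one request in ten hits the new version:
rng = random.Random(0)
hits = sum(route_version(10, rng) == "canary" for _ in range(10_000))
```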

Autoscaling

Serving endpoints autoscale based on request volume and latency:

{
  "autoscaling": {
    "min_replicas": 1,
    "max_replicas": 10,
    "target_requests_per_second": 100,
    "scale_up_threshold": 0.8,
    "scale_down_threshold": 0.3,
    "cooldown_seconds": 60
  }
}
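The scaling decision implied by this config can be sketched as follows: compare observed load per replica against the target, then step the replica count within the configured bounds. The `desired_replicas` helper is illustrative, not the platform's autoscaler:

```python
def desired_replicas(current: int, rps: float, cfg: dict) -> int:
    """One autoscaling step using the thresholds from the config above."""
    utilization = rps / (current * cfg["target_requests_per_second"])
    if utilization > cfg["scale_up_threshold"]:
        return min(current + 1, cfg["max_replicas"])
    if utilization < cfg["scale_down_threshold"]:
        return max(current - 1, cfg["min_replicas"])
    return current

cfg = {
    "min_replicas": 1,
    "max_replicas": 10,
    "target_requests_per_second": 100,
    "scale_up_threshold": 0.8,
    "scale_down_threshold": 0.3,
}
```

For example, 2 replicas handling 200 req/s sit at 1.0 utilization, above the 0.8 threshold, so the sketch scales to 3; 4 replicas at 100 req/s sit at 0.25, below 0.3, so it scales down to 3.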

Health and Monitoring

Endpoint                                    Purpose
GET /api/v1/ml/serving/:model_id/health     Model serving health check
GET /api/v1/ml/serving/:model_id/metrics    Latency, throughput, error rate
GET /api/v1/ml/serving/deployments          List all active deployments
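A client would substitute the `:model_id` placeholder when building these paths; a trivial sketch (helper names are assumptions):

```python
BASE = "/api/v1/ml/serving"  # base path from the table above

def health_url(model_id: str) -> str:
    """Path for a model's serving health check."""
    return f"{BASE}/{model_id}/health"

def metrics_url(model_id: str) -> str:
    """Path for a model's latency/throughput/error-rate metrics."""
    return f"{BASE}/{model_id}/metrics"
```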

Key Metrics

Metric                       Type       Description
ml_serving_latency_ms        Histogram  End-to-end prediction latency
ml_serving_requests_total    Counter    Total prediction requests
ml_serving_errors_total      Counter    Failed prediction requests
ml_serving_replicas          Gauge      Active serving replicas
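The two counters combine into the error rate surfaced by the metrics endpoint. A sketch of that derivation (the helper is illustrative):

```python
def error_rate(errors_total: int, requests_total: int) -> float:
    """Error rate from ml_serving_errors_total / ml_serving_requests_total.

    Returns 0.0 for an idle endpoint to avoid division by zero.
    """
    return errors_total / requests_total if requests_total else 0.0
```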

Configuration

Environment Variable          Default    Description
ML_SERVING_BACKEND            ray_serve  Serving backend (ray_serve, triton)
ML_SERVING_DEFAULT_TIMEOUT    30         Default prediction timeout in seconds
ML_SERVING_MAX_BATCH_SIZE     64         Maximum batch size for batched inference
ML_SERVING_CACHE_ENABLED      true       Cache predictions for identical inputs
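A sketch of resolving these variables against their documented defaults; the `serving_config` helper and its typed result are assumptions, only the variable names and defaults come from the table above:

```python
import os

# Defaults exactly as listed in the table above.
DEFAULTS = {
    "ML_SERVING_BACKEND": "ray_serve",
    "ML_SERVING_DEFAULT_TIMEOUT": "30",
    "ML_SERVING_MAX_BATCH_SIZE": "64",
    "ML_SERVING_CACHE_ENABLED": "true",
}

def serving_config(env=os.environ) -> dict:
    """Resolve serving settings from the environment with typed values."""
    get = lambda key: env.get(key, DEFAULTS[key])
    return {
        "backend": get("ML_SERVING_BACKEND"),
        "timeout_s": int(get("ML_SERVING_DEFAULT_TIMEOUT")),
        "max_batch_size": int(get("ML_SERVING_MAX_BATCH_SIZE")),
        "cache_enabled": get("ML_SERVING_CACHE_ENABLED").lower() == "true",
    }
```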