Inference Overview
The Inference module provides the complete model serving stack for the MATIH ML Service, supporting real-time predictions, batch inference pipelines, ensemble routing, shadow deployments, and inference optimization. It integrates with Ray Serve and NVIDIA Triton to deliver low-latency, high-throughput model predictions at scale.
Inference Architecture
Client Request --> API Gateway --> ML Service (Inference Router)
                                          |
                  +-----------------------+-----------------------+
                  |                       |                       |
             Ray Serve             Triton Server           Batch Pipeline
         (Python models)        (Optimized models)          (Spark/Ray)
                  |                       |                       |
             Model Cache           TensorRT/ONNX            Object Store
Serving Backends
| Backend | Use Case | Models | Optimization |
|---|---|---|---|
| Ray Serve | General-purpose Python models | XGBoost, scikit-learn, PyTorch | Dynamic batching |
| Triton | High-performance optimized models | ONNX, TensorRT, TensorFlow SavedModel | GPU acceleration |
| Batch Pipeline | Scheduled bulk inference | Any model format | Distributed compute via Ray/Spark |
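The routing decision in the table above can be sketched as a simple dispatch on model format and request type. The backend names mirror the table; the function name, format strings, and sets are illustrative assumptions, not the service's actual API.

```python
# Hypothetical backend-selection sketch; only the backend split itself
# (Ray Serve vs. Triton vs. batch) comes from the table above.
RAY_SERVE = "ray_serve"
TRITON = "triton"
BATCH = "batch_pipeline"

# Assumed mapping of model formats to backends, per the table.
TRITON_FORMATS = {"onnx", "tensorrt", "tf_savedmodel"}
PYTHON_FORMATS = {"xgboost", "sklearn", "pytorch"}

def select_backend(model_format: str, batch_job: bool = False) -> str:
    """Pick a serving backend from the model format and request type."""
    if batch_job:
        return BATCH                      # scheduled bulk inference
    if model_format in TRITON_FORMATS:
        return TRITON                     # GPU-accelerated optimized path
    if model_format in PYTHON_FORMATS:
        return RAY_SERVE                  # general-purpose Python path
    raise ValueError(f"unsupported model format: {model_format}")
```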
Inference Modes
| Mode | Latency | Throughput | Trigger |
|---|---|---|---|
| Real-time | Milliseconds | Single request | API call |
| Mini-batch | Sub-second | 10-100 requests | Adaptive batching |
| Batch | Minutes to hours | Millions of records | Schedule or API |
| Streaming | Sub-second | Continuous | Kafka event |
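The mini-batch mode's adaptive batching can be illustrated with a toy batcher that flushes when the batch fills or a deadline passes. This is a sketch only; in practice the batching presumably happens inside Ray Serve's dynamic batching or Triton, and all names here are assumptions.

```python
import time
from typing import Any, Callable, List

class MiniBatcher:
    """Toy adaptive batcher: flush when the batch is full or when the
    oldest queued request has waited past the deadline. Illustrative
    only; not the service's actual batching implementation."""

    def __init__(self, handler: Callable[[List[Any]], List[Any]],
                 max_batch: int = 100, max_wait_s: float = 0.05):
        self.handler = handler            # runs inference on a full batch
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self._pending: List[Any] = []
        self._first_arrival = 0.0

    def submit(self, request: Any) -> List[Any]:
        """Queue a request; return results when a flush happens, else []."""
        if not self._pending:
            self._first_arrival = time.monotonic()
        self._pending.append(request)
        full = len(self._pending) >= self.max_batch
        expired = time.monotonic() - self._first_arrival >= self.max_wait_s
        if full or expired:
            batch, self._pending = self._pending, []
            return self.handler(batch)
        return []
```

Tuning `max_batch` against `max_wait_s` is the throughput/latency trade-off the modes table describes: larger batches amortize model overhead, a shorter deadline bounds tail latency.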
Key Components
| Component | Location | Purpose |
|---|---|---|
| Batch Inference | src/inference/batch_inference_pipeline.py | Distributed batch scoring |
| Batch Processor | src/inference/batch_processor.py | Batch job orchestration |
| Ensemble Router | src/inference/ensemble_router.py | Multi-model routing and aggregation |
| Shadow Deployment | src/inference/shadow_deployment.py | Traffic mirroring for validation |
| Inference Optimizer | src/inference/inference_optimizer.py | Model optimization and quantization |
| Model Cache | src/inference/model_cache.py | In-memory model caching |
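The model cache component can be sketched as an LRU cache keyed by model ID. This is a minimal illustration; `src/inference/model_cache.py` presumably adds model loading, size-aware eviction, metrics, and thread safety, and the method names here are assumptions.

```python
from collections import OrderedDict
from typing import Callable

class ModelCache:
    """Minimal in-memory LRU model cache sketch (illustrative only)."""

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self._models: "OrderedDict[str, object]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, model_id: str, loader: Callable[[str], object]) -> object:
        """Return a cached model, loading and caching it on a miss."""
        if model_id in self._models:
            self._models.move_to_end(model_id)   # mark as recently used
            self.hits += 1
            return self._models[model_id]
        self.misses += 1
        model = loader(model_id)                 # load on miss
        self._models[model_id] = model
        if len(self._models) > self.capacity:
            self._models.popitem(last=False)     # evict least recently used
        return model
```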
API Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| /api/v1/inference/predict | POST | Single prediction |
| /api/v1/inference/batch | POST | Submit batch inference job |
| /api/v1/inference/batch/:job_id | GET | Get batch job status |
| /api/v1/inference/deployments | GET | List active deployments |
| /api/v1/inference/deploy | POST | Deploy a model |
| /api/v1/inference/undeploy | POST | Remove a deployment |
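A request to the predict endpoint might look like the following. The field names (`model_id`, `features`, `options`) are assumptions for illustration; the actual schema is defined by the ML Service.

```python
import json

# Hypothetical body for POST /api/v1/inference/predict; field names
# are illustrative, not the service's documented schema.
predict_request = {
    "model_id": "churn-xgb",                       # assumed identifier
    "features": {"tenure_months": 12, "plan": "pro"},
    "options": {"return_probabilities": True},
}

body = json.dumps(predict_request)   # serialized JSON request body
parsed = json.loads(body)            # round-trips cleanly
```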
Metrics
| Metric | Type | Description |
|---|---|---|
| ml_inference_latency_ms | Histogram | End-to-end prediction latency |
| ml_inference_throughput_rps | Gauge | Requests per second |
| ml_inference_errors_total | Counter | Failed predictions |
| ml_inference_batch_duration_s | Histogram | Batch job duration |
| ml_inference_model_cache_hits | Counter | Model cache hits |
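To show what the latency histogram captures, here is a bare-bones sketch of Prometheus-style cumulative bucket semantics for `ml_inference_latency_ms`. A real deployment would use a metrics client library; the class, bucket bounds, and methods here are illustrative assumptions.

```python
from bisect import bisect_left

class LatencyHistogram:
    """Sketch of a Prometheus-style histogram (illustrative only)."""

    def __init__(self, buckets=(5, 10, 25, 50, 100, 250, 500)):
        self.buckets = list(buckets)                  # `le` bounds in ms
        self.counts = [0] * (len(self.buckets) + 1)   # last slot is +Inf
        self.sum_ms = 0.0

    def observe(self, latency_ms: float) -> None:
        """Record one prediction's end-to-end latency."""
        self.counts[bisect_left(self.buckets, latency_ms)] += 1
        self.sum_ms += latency_ms

    def cumulative(self):
        """Cumulative per-bucket counts, as Prometheus exposes them."""
        total, out = 0, []
        for c in self.counts:
            total += c
            out.append(total)
        return out
```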
Detailed Sections
| Section | Content |
|---|---|
| Batch Inference | Distributed batch processing pipeline |
| Triton Serving | NVIDIA Triton Inference Server integration |
| Ensemble Routing | Multi-model routing and aggregation |
| Shadow Deployment | Traffic mirroring for safe rollouts |
| Inference Optimization | Quantization, pruning, and compilation |
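The shadow deployment pattern listed above can be sketched as serving from the primary model while mirroring each request to a shadow model whose output is recorded but never returned. This version mirrors in-line for simplicity; the real module would do it off the hot path, and all names here are illustrative.

```python
from typing import Any, Callable

def predict_with_shadow(primary: Callable, shadow: Callable,
                        record: Callable, features: Any) -> Any:
    """Serve from the primary model; mirror to the shadow for offline
    comparison. Function and argument names are assumptions, not the
    service's actual API."""
    result = primary(features)                       # caller sees only this
    try:
        record(features, shadow(features), result)   # logged for comparison
    except Exception:
        pass  # a failing shadow must never affect the live response
    return result
```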