Inference Overview
The Inference module provides the complete model serving stack for the MATIH ML Service, supporting real-time predictions, batch inference pipelines, ensemble routing, shadow deployments, and inference optimization. It integrates with Ray Serve and NVIDIA Triton to deliver low-latency, high-throughput model predictions at scale.
Inference Architecture
Client Request --> API Gateway --> ML Service (Inference Router)
                                          |
                  +-----------------------+-----------------------+
                  |                       |                       |
             Ray Serve             Triton Server           Batch Pipeline
         (Python models)        (Optimized models)          (Spark/Ray)
                  |                       |                       |
             Model Cache           TensorRT/ONNX            Object Store
Serving Backends
| Backend | Use Case | Models | Optimization |
|---|---|---|---|
| Ray Serve | General-purpose Python models | XGBoost, scikit-learn, PyTorch | Dynamic batching |
| Triton | High-performance optimized models | ONNX, TensorRT, TensorFlow SavedModel | GPU acceleration |
| Batch Pipeline | Scheduled bulk inference | Any model format | Distributed compute via Ray/Spark |
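The routing decision in the table above can be sketched as a simple dispatch on model format and request type. The backend names mirror the table; the function name, format strings, and sets are illustrative assumptions, not the service's actual API.

```python
# Hypothetical backend-selection sketch; only the backend split itself
# (Ray Serve vs. Triton vs. batch) comes from the table above.
RAY_SERVE = "ray_serve"
TRITON = "triton"
BATCH = "batch_pipeline"

# Assumed mapping of model formats to backends, per the table.
TRITON_FORMATS = {"onnx", "tensorrt", "tf_savedmodel"}
PYTHON_FORMATS = {"xgboost", "sklearn", "pytorch"}

def select_backend(model_format: str, batch_job: bool = False) -> str:
    """Pick a serving backend from the model format and request type."""
    if batch_job:
        return BATCH                      # scheduled bulk inference
    if model_format in TRITON_FORMATS:
        return TRITON                     # GPU-accelerated optimized path
    if model_format in PYTHON_FORMATS:
        return RAY_SERVE                  # general-purpose Python path
    raise ValueError(f"unsupported model format: {model_format}")
```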
Inference Modes
| Mode | Latency | Throughput | Trigger |
|---|---|---|---|
| Real-time | Milliseconds | Single request | API call |
| Mini-batch | Sub-second | 10-100 requests | Adaptive batching |
| Batch | Minutes to hours | Millions of records | Schedule or API |
| Streaming | Sub-second | Continuous | Kafka event |
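The mini-batch mode's adaptive batching can be illustrated with a toy batcher that flushes when the batch fills or a deadline passes. This is a sketch only; in practice the batching presumably happens inside Ray Serve's dynamic batching or Triton, and all names here are assumptions.

```python
import time
from typing import Any, Callable, List

class MiniBatcher:
    """Toy adaptive batcher: flush when the batch is full or when the
    oldest queued request has waited past the deadline. Illustrative
    only; not the service's actual batching implementation."""

    def __init__(self, handler: Callable[[List[Any]], List[Any]],
                 max_batch: int = 100, max_wait_s: float = 0.05):
        self.handler = handler            # runs inference on a full batch
        self.max_batch = max_batch
        self.max_wait_s = max_wait_s
        self._pending: List[Any] = []
        self._first_arrival = 0.0

    def submit(self, request: Any) -> List[Any]:
        """Queue a request; return results when a flush happens, else []."""
        if not self._pending:
            self._first_arrival = time.monotonic()
        self._pending.append(request)
        full = len(self._pending) >= self.max_batch
        expired = time.monotonic() - self._first_arrival >= self.max_wait_s
        if full or expired:
            batch, self._pending = self._pending, []
            return self.handler(batch)
        return []
```

Tuning `max_batch` against `max_wait_s` is the throughput/latency trade-off the modes table describes: larger batches amortize model overhead, a shorter deadline bounds tail latency.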
Key Components
| Component | Location | Purpose |
|---|---|---|
| Batch Inference | src/inference/batch_inference_pipeline.py | Distributed batch scoring |
| Batch Processor | src/inference/batch_processor.py | Batch job orchestration |
| Ensemble Router | src/inference/ensemble_router.py | Multi-model routing and aggregation |
| Shadow Deployment | src/inference/shadow_deployment.py | Traffic mirroring for validation |
| Inference Optimizer | src/inference/inference_optimizer.py | Model optimization and quantization |
| Model Cache | src/inference/model_cache.py | In-memory model caching |
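The model cache component can be sketched as an LRU cache keyed by model ID. This is a minimal illustration; `src/inference/model_cache.py` presumably adds model loading, size-aware eviction, metrics, and thread safety, and the method names here are assumptions.

```python
from collections import OrderedDict
from typing import Callable

class ModelCache:
    """Minimal in-memory LRU model cache sketch (illustrative only)."""

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self._models: "OrderedDict[str, object]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, model_id: str, loader: Callable[[str], object]) -> object:
        """Return a cached model, loading and caching it on a miss."""
        if model_id in self._models:
            self._models.move_to_end(model_id)   # mark as recently used
            self.hits += 1
            return self._models[model_id]
        self.misses += 1
        model = loader(model_id)                 # load on miss
        self._models[model_id] = model
        if len(self._models) > self.capacity:
            self._models.popitem(last=False)     # evict least recently used
        return model
```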
API Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| /api/v1/inference/predict | POST | Single prediction |
| /api/v1/inference/batch | POST | Submit batch inference job |
| /api/v1/inference/batch/:job_id | GET | Get batch job status |
| /api/v1/inference/deployments | GET | List active deployments |
| /api/v1/inference/deploy | POST | Deploy a model |
| /api/v1/inference/undeploy | POST | Remove a deployment |
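A request to the predict endpoint might look like the following. The field names (`model_id`, `features`, `options`) are assumptions for illustration; the actual schema is defined by the ML Service.

```python
import json

# Hypothetical body for POST /api/v1/inference/predict; field names
# are illustrative, not the service's documented schema.
predict_request = {
    "model_id": "churn-xgb",                       # assumed identifier
    "features": {"tenure_months": 12, "plan": "pro"},
    "options": {"return_probabilities": True},
}

body = json.dumps(predict_request)   # serialized JSON request body
parsed = json.loads(body)            # round-trips cleanly
```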
Metrics
| Metric | Type | Description |
|---|---|---|
| ml_inference_latency_ms | Histogram | End-to-end prediction latency |
| ml_inference_throughput_rps | Gauge | Requests per second |
| ml_inference_errors_total | Counter | Failed predictions |
| ml_inference_batch_duration_s | Histogram | Batch job duration |
| ml_inference_model_cache_hits | Counter | Model cache hits |
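To show what the latency histogram captures, here is a bare-bones sketch of Prometheus-style cumulative bucket semantics for `ml_inference_latency_ms`. A real deployment would use a metrics client library; the class, bucket bounds, and methods here are illustrative assumptions.

```python
from bisect import bisect_left

class LatencyHistogram:
    """Sketch of a Prometheus-style histogram (illustrative only)."""

    def __init__(self, buckets=(5, 10, 25, 50, 100, 250, 500)):
        self.buckets = list(buckets)                  # `le` bounds in ms
        self.counts = [0] * (len(self.buckets) + 1)   # last slot is +Inf
        self.sum_ms = 0.0

    def observe(self, latency_ms: float) -> None:
        """Record one prediction's end-to-end latency."""
        self.counts[bisect_left(self.buckets, latency_ms)] += 1
        self.sum_ms += latency_ms

    def cumulative(self):
        """Cumulative per-bucket counts, as Prometheus exposes them."""
        total, out = 0, []
        for c in self.counts:
            total += c
            out.append(total)
        return out
```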
Detailed Sections
| Section | Content |
|---|---|
| Batch Inference | Distributed batch processing pipeline |
| Triton Serving | NVIDIA Triton Inference Server integration |
| Ensemble Routing | Multi-model routing and aggregation |
| Shadow Deployment | Traffic mirroring for safe rollouts |
| Inference Optimization | Quantization, pruning, and compilation |
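The shadow deployment pattern listed above can be sketched as serving from the primary model while mirroring each request to a shadow model whose output is recorded but never returned. This version mirrors in-line for simplicity; the real module would do it off the hot path, and all names here are illustrative.

```python
from typing import Any, Callable

def predict_with_shadow(primary: Callable, shadow: Callable,
                        record: Callable, features: Any) -> Any:
    """Serve from the primary model; mirror to the shadow for offline
    comparison. Function and argument names are assumptions, not the
    service's actual API."""
    result = primary(features)                       # caller sees only this
    try:
        record(features, shadow(features), result)   # logged for comparison
    except Exception:
        pass  # a failing shadow must never affect the live response
    return result
```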