MATIH Platform is in active MVP development. Documentation reflects current implementation status.
13. ML Service & MLOps
Inference & Serving
Inference Overview


The Inference module provides the complete model serving stack for the MATIH ML Service, supporting real-time predictions, batch inference pipelines, ensemble routing, shadow deployments, and inference optimization. It integrates with Ray Serve and NVIDIA Triton to deliver low-latency, high-throughput model predictions at scale.


Inference Architecture

Client Request --> API Gateway --> ML Service (Inference Router)
                                        |
                    +-------------------+-------------------+
                    |                   |                   |
              Ray Serve           Triton Server        Batch Pipeline
           (Python models)       (Optimized models)    (Spark/Ray)
                    |                   |                   |
              Model Cache          TensorRT/ONNX       Object Store

Serving Backends

| Backend | Use Case | Models | Optimization |
| --- | --- | --- | --- |
| Ray Serve | General-purpose Python models | XGBoost, scikit-learn, PyTorch | Dynamic batching |
| Triton | High-performance optimized models | ONNX, TensorRT, TensorFlow SavedModel | GPU acceleration |
| Batch Pipeline | Scheduled bulk inference | Any model format | Distributed compute via Ray/Spark |
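The backend choice above can be sketched as a simple dispatch on model format. This is an illustrative sketch only; the function and format names are assumptions, not the actual routing logic of the MATIH Inference Router.

```python
# Hypothetical sketch: pick a serving backend by model format.
# The format-to-backend mapping mirrors the table above; the names are illustrative.

TRITON_FORMATS = {"onnx", "tensorrt", "tf_savedmodel"}
RAY_SERVE_FORMATS = {"xgboost", "sklearn", "pytorch"}

def choose_backend(model_format: str, batch: bool = False) -> str:
    """Return the serving backend for a given model format."""
    fmt = model_format.lower()
    if batch:
        return "batch_pipeline"   # any format, distributed via Ray/Spark
    if fmt in TRITON_FORMATS:
        return "triton"           # GPU-accelerated, optimized serving
    if fmt in RAY_SERVE_FORMATS:
        return "ray_serve"        # general-purpose Python serving
    raise ValueError(f"Unsupported model format: {model_format}")
```
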

Inference Modes

| Mode | Latency | Throughput | Trigger |
| --- | --- | --- | --- |
| Real-time | Milliseconds | Single request | API call |
| Mini-batch | Sub-second | 10-100 requests | Adaptive batching |
| Batch | Minutes to hours | Millions of records | Schedule or API |
| Streaming | Sub-second | Continuous | Kafka event |
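The mini-batch mode relies on adaptive batching: requests are accumulated until the batch is full or a small wait budget expires, whichever comes first. A minimal sketch of that collection loop (the function name and parameters are illustrative, not the service's actual implementation):

```python
import time
from queue import Queue, Empty

def collect_mini_batch(requests: Queue, max_size: int = 32,
                       max_wait_s: float = 0.05) -> list:
    """Drain up to max_size requests, waiting at most max_wait_s total.

    A batch is released when it is full OR when the wait budget expires,
    which trades a small latency penalty for higher throughput.
    """
    batch = []
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # wait budget exhausted: flush a partial batch
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break  # no more requests arrived in time
    return batch
```
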

Key Components

| Component | Location | Purpose |
| --- | --- | --- |
| Batch Inference | src/inference/batch_inference_pipeline.py | Distributed batch scoring |
| Batch Processor | src/inference/batch_processor.py | Batch job orchestration |
| Ensemble Router | src/inference/ensemble_router.py | Multi-model routing and aggregation |
| Shadow Deployment | src/inference/shadow_deployment.py | Traffic mirroring for validation |
| Inference Optimizer | src/inference/inference_optimizer.py | Model optimization and quantization |
| Model Cache | src/inference/model_cache.py | In-memory model caching |
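In-memory model caching keeps recently used models resident so repeated predictions avoid a reload from the registry. A minimal LRU-style sketch of the idea (illustrative only; the actual `model_cache.py` implementation, capacity, and eviction policy may differ):

```python
from collections import OrderedDict

class ModelCache:
    """Minimal LRU cache sketch for loaded models."""

    def __init__(self, capacity: int = 4):
        self.capacity = capacity
        self._models = OrderedDict()
        self.hits = 0    # would feed a cache-hit counter metric
        self.misses = 0

    def get(self, model_id: str, loader):
        """Return a cached model, loading and caching it on a miss."""
        if model_id in self._models:
            self._models.move_to_end(model_id)  # mark as most recently used
            self.hits += 1
            return self._models[model_id]
        self.misses += 1
        model = loader(model_id)                # e.g. fetch from the model registry
        self._models[model_id] = model
        if len(self._models) > self.capacity:
            self._models.popitem(last=False)    # evict least recently used
        return model
```
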

API Endpoints

| Endpoint | Method | Purpose |
| --- | --- | --- |
| /api/v1/inference/predict | POST | Single prediction |
| /api/v1/inference/batch | POST | Submit batch inference job |
| /api/v1/inference/batch/:job_id | GET | Get batch job status |
| /api/v1/inference/deployments | GET | List active deployments |
| /api/v1/inference/deploy | POST | Deploy a model |
| /api/v1/inference/undeploy | POST | Remove a deployment |
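A real-time prediction is a single POST to `/api/v1/inference/predict`. The request body below is a plausible shape, shown for illustration only; the field names (`model`, `features`, `version`) are assumptions, so consult the ML Service API reference for the authoritative schema.

```python
import json
from typing import Optional

def build_predict_request(model_name: str, features: dict,
                          version: Optional[str] = None) -> str:
    """Build a JSON body for POST /api/v1/inference/predict.

    NOTE: the payload schema here is an assumption for illustration.
    """
    payload = {"model": model_name, "features": features}
    if version is not None:
        payload["version"] = version
    return json.dumps(payload)
```

The resulting string would be sent as the request body with `Content-Type: application/json` by any HTTP client.
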

Metrics

| Metric | Type | Description |
| --- | --- | --- |
| ml_inference_latency_ms | Histogram | End-to-end prediction latency |
| ml_inference_throughput_rps | Gauge | Requests per second |
| ml_inference_errors_total | Counter | Failed predictions |
| ml_inference_batch_duration_s | Histogram | Batch job duration |
| ml_inference_model_cache_hits | Counter | Model cache hits (hit rate is derived at query time) |
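A histogram metric such as `ml_inference_latency_ms` counts observations into cumulative buckets, one per upper bound. The sketch below mirrors that mechanic; the bucket boundaries are hypothetical, since the actual buckets configured for the service are not documented here.

```python
# Hypothetical bucket boundaries (ms) for a latency histogram.
LATENCY_BUCKETS_MS = [5, 10, 25, 50, 100, 250, 500, 1000]

def observe_latency(histogram: dict, latency_ms: float) -> None:
    """Record one observation into cumulative histogram buckets.

    As in a Prometheus histogram, every bucket whose upper bound is
    >= the observed value is incremented, plus the +Inf bucket.
    """
    for bound in LATENCY_BUCKETS_MS:
        if latency_ms <= bound:
            histogram[bound] = histogram.get(bound, 0) + 1
    histogram["+Inf"] = histogram.get("+Inf", 0) + 1
```

Percentiles (e.g. p99 latency) are then estimated from these cumulative counts at query time.
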

Detailed Sections

| Section | Content |
| --- | --- |
| Batch Inference | Distributed batch processing pipeline |
| Triton Serving | NVIDIA Triton Inference Server integration |
| Ensemble Routing | Multi-model routing and aggregation |
| Shadow Deployment | Traffic mirroring for safe rollouts |
| Inference Optimization | Quantization, pruning, and compilation |