MATIH Platform is in active MVP development. Documentation reflects current implementation status.
13. ML Service & MLOps
Inference & Serving
Triton Serving

The ML Service integrates with NVIDIA Triton Inference Server for high-performance model serving with GPU acceleration, dynamic batching, and optimized model formats. Triton is used for latency-sensitive production workloads where models have been converted to optimized formats like ONNX, TensorRT, or TensorFlow SavedModel.


Triton Architecture

ML Service API --> Triton Client --> Triton Inference Server
                                         |
                          +--------------+--------------+
                          |              |              |
                      ONNX Runtime   TensorRT    TF SavedModel
                          |              |              |
                       CPU/GPU        GPU only       CPU/GPU

Supported Model Formats

| Format | Backend | Hardware | Optimization Level |
|---|---|---|---|
| ONNX | ONNX Runtime | CPU + GPU | Medium (quantization available) |
| TensorRT | TensorRT | GPU only | High (layer fusion, precision) |
| TensorFlow SavedModel | TensorFlow | CPU + GPU | Medium |
| PyTorch TorchScript | LibTorch | CPU + GPU | Medium |
| Python Backend | Custom Python | CPU + GPU | None (for custom logic) |

Model Repository Structure

Triton models are organized in a repository on shared storage:

model-repository/
  churn-predictor/
    config.pbtxt
    1/
      model.onnx
    2/
      model.onnx
  revenue-forecast/
    config.pbtxt
    1/
      model.plan
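This layout can be created programmatically when a new model version is promoted. A minimal sketch using only the standard library (the helper name `add_model_version` is illustrative, not part of the ML Service):

```python
import tempfile
from pathlib import Path

def add_model_version(repo: Path, model: str, version: int, artifact: str) -> Path:
    """Create the Triton directory layout for one model version.

    Triton expects <repo>/<model>/<version>/<artifact>, with config.pbtxt
    sitting alongside the numbered version directories.
    """
    version_dir = repo / model / str(version)
    version_dir.mkdir(parents=True, exist_ok=True)
    artifact_path = version_dir / artifact
    artifact_path.touch()  # a real deployment copies the exported model file here
    return artifact_path

repo = Path(tempfile.mkdtemp()) / "model-repository"
add_model_version(repo, "churn-predictor", 1, "model.onnx")
add_model_version(repo, "churn-predictor", 2, "model.onnx")
(repo / "churn-predictor" / "config.pbtxt").touch()
```

Triton polls this directory (or reloads on request, depending on its model-control mode), so adding a new numbered subdirectory is enough to stage a new version.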

Model Configuration

Each Triton model requires a config.pbtxt specifying inputs, outputs, and serving parameters:

name: "churn-predictor"
platform: "onnxruntime_onnx"
max_batch_size: 64
input [
  {
    name: "features"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]
output [
  {
    name: "prediction"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 100
}
instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
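Configs like the one above can also be rendered programmatically when a model is registered. A minimal sketch for a single-input, single-output model (`render_triton_config` is a hypothetical helper, not part of the ML Service API):

```python
def render_triton_config(name: str, platform: str, max_batch_size: int,
                         input_dims: list, output_dims: list) -> str:
    """Render a minimal config.pbtxt body with the fields Triton requires."""
    return f'''name: "{name}"
platform: "{platform}"
max_batch_size: {max_batch_size}
input [
  {{ name: "features" data_type: TYPE_FP32 dims: {input_dims} }}
]
output [
  {{ name: "prediction" data_type: TYPE_FP32 dims: {output_dims} }}
]'''

config = render_triton_config("churn-predictor", "onnxruntime_onnx", 64, [4], [1])
```

The rendered string would then be written to `config.pbtxt` in the model's repository directory before the version subdirectory is populated.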

Triton Client Integration

The ML Service communicates with Triton via gRPC:

import numpy as np
import tritonclient.grpc as grpcclient

class TritonInferenceService:
    def __init__(self, url: str = "localhost:8001"):
        self.client = grpcclient.InferenceServerClient(url=url)

    def predict(self, model_name: str, features: np.ndarray) -> np.ndarray:
        # Declare the input tensor named in config.pbtxt and attach the data.
        inputs = [grpcclient.InferInput("features", features.shape, "FP32")]
        inputs[0].set_data_from_numpy(features.astype(np.float32))
        # Blocking gRPC call; tritonclient.grpc.aio provides an async client
        # for use inside async request handlers.
        result = self.client.infer(model_name, inputs)
        return result.as_numpy("prediction")

Dynamic Batching

Triton automatically batches incoming requests for improved throughput:

| Setting | Description | Default |
|---|---|---|
| preferred_batch_size | Optimal batch sizes | [16, 32, 64] |
| max_queue_delay_microseconds | Max wait time for batching | 100 |
| preserve_ordering | Maintain request order | true |
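The effect of `preferred_batch_size` can be illustrated with a toy scheduler. Triton's real scheduler also weighs queue delay and per-model constraints, but a greedy grouping over the preferred sizes conveys the idea (`plan_batches` is a hypothetical sketch, not Triton code):

```python
def plan_batches(queue_len: int, preferred=(16, 32, 64)) -> list:
    """Greedily group pending requests into preferred batch sizes,
    largest first, matching the dynamic_batching config above."""
    batches = []
    for size in sorted(preferred, reverse=True):
        while queue_len >= size:
            batches.append(size)
            queue_len -= size
    if queue_len:
        batches.append(queue_len)  # remainder runs as a smaller batch
    return batches

# 100 queued requests -> one batch of 64, one of 32, remainder of 4
batch_plan = plan_batches(100)
```

`max_queue_delay_microseconds` bounds how long a request may wait for such a batch to fill, so the worst-case added latency is the queue delay itself (100 µs with the default above).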

Model Versioning

Triton supports serving multiple model versions side by side, with a configurable version policy:

| Policy | Description |
|---|---|
| Latest | Serve only the latest version |
| All | Serve all available versions |
| Specific | Serve explicitly listed versions |
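The policy is set per model in `config.pbtxt`. Sketches of the three options (only one would appear in a given config; the version numbers are illustrative):

```
version_policy: { latest: { num_versions: 1 } }
version_policy: { all: { } }
version_policy: { specific: { versions: [1, 3] } }
```

When no policy is specified, Triton defaults to serving only the latest version.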

Health Monitoring

| Endpoint | Purpose |
|---|---|
| /v2/health/live | Triton server liveness |
| /v2/health/ready | Triton server readiness |
| /v2/models/:model/ready | Specific model readiness |
| /v2/models/:model/versions/:ver/stats | Version-level statistics |
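These endpoints live on Triton's HTTP port. A small sketch that builds the probe URLs (the base URL matches the `TRITON_HTTP_URL` default below; the model name and helper are illustrative):

```python
def triton_health_urls(base: str = "http://localhost:8000",
                       model: str = "churn-predictor") -> dict:
    """Build the health/readiness URLs exposed by Triton's HTTP endpoint."""
    return {
        "live": f"{base}/v2/health/live",
        "ready": f"{base}/v2/health/ready",
        "model_ready": f"{base}/v2/models/{model}/ready",
    }

urls = triton_health_urls()
# A Kubernetes liveness probe would GET urls["live"]; a readiness
# probe would GET urls["ready"] and expect HTTP 200.
```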

Configuration

| Environment Variable | Default | Description |
|---|---|---|
| TRITON_URL | localhost:8001 | Triton gRPC endpoint |
| TRITON_HTTP_URL | localhost:8000 | Triton HTTP endpoint |
| TRITON_MODEL_REPOSITORY | /models | Model repository path |
| TRITON_MAX_BATCH_SIZE | 64 | Maximum batch size |
| TRITON_GPU_ENABLED | false | Enable GPU backends |
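These variables can be collected into a single settings object at startup. A sketch assuming the defaults from the table (`TritonSettings` is illustrative, not an existing class in the service):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class TritonSettings:
    """Triton connection settings, read once from the environment."""
    url: str
    http_url: str
    model_repository: str
    max_batch_size: int
    gpu_enabled: bool

    @classmethod
    def from_env(cls) -> "TritonSettings":
        return cls(
            url=os.getenv("TRITON_URL", "localhost:8001"),
            http_url=os.getenv("TRITON_HTTP_URL", "localhost:8000"),
            model_repository=os.getenv("TRITON_MODEL_REPOSITORY", "/models"),
            max_batch_size=int(os.getenv("TRITON_MAX_BATCH_SIZE", "64")),
            gpu_enabled=os.getenv("TRITON_GPU_ENABLED", "false").lower() == "true",
        )

settings = TritonSettings.from_env()
```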