Triton Serving
The ML Service integrates with NVIDIA Triton Inference Server for high-performance model serving with GPU acceleration, dynamic batching, and optimized model formats. Triton is used for latency-sensitive production workloads where models have been converted to optimized formats like ONNX, TensorRT, or TensorFlow SavedModel.
Triton Architecture
```
ML Service API --> Triton Client --> Triton Inference Server
                                               |
                                +--------------+--------------+
                                |              |              |
                          ONNX Runtime     TensorRT     TF SavedModel
                                |              |              |
                             CPU/GPU        GPU only       CPU/GPU
```

Supported Model Formats
| Format | Backend | Hardware | Optimization Level |
|---|---|---|---|
| ONNX | ONNX Runtime | CPU + GPU | Medium (quantization available) |
| TensorRT | TensorRT | GPU only | High (layer fusion, precision) |
| TensorFlow SavedModel | TensorFlow | CPU + GPU | Medium |
| PyTorch TorchScript | LibTorch | CPU + GPU | Medium |
| Python Backend | Custom Python | CPU + GPU | None (for custom logic) |
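Each format in the table above is selected in config.pbtxt via its backend's platform string. As a minimal sketch (the mapping below follows Triton's standard backend names; `PLATFORM_BY_FORMAT` and `platform_for` are illustrative helpers, not part of the ML Service API — note the Python backend is chosen with `backend: "python"` rather than a platform string):

```python
# Platform strings as they appear in a Triton config.pbtxt for each
# model format; keys here are illustrative internal format labels.
PLATFORM_BY_FORMAT = {
    "onnx": "onnxruntime_onnx",
    "tensorrt": "tensorrt_plan",
    "tensorflow_savedmodel": "tensorflow_savedmodel",
    "pytorch_torchscript": "pytorch_libtorch",
}


def platform_for(model_format: str) -> str:
    """Return the config.pbtxt platform string for a model format."""
    try:
        return PLATFORM_BY_FORMAT[model_format]
    except KeyError:
        raise ValueError(f"unsupported model format: {model_format}")
```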
Model Repository Structure
Triton models are organized in a repository on shared storage:
```
model-repository/
  churn-predictor/
    config.pbtxt
    1/
      model.onnx
    2/
      model.onnx
  revenue-forecast/
    config.pbtxt
    1/
      model.plan
```

Model Configuration
Each Triton model requires a config.pbtxt specifying inputs, outputs, and serving parameters:
```
name: "churn-predictor"
platform: "onnxruntime_onnx"
max_batch_size: 64

input [
  {
    name: "features"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]

output [
  {
    name: "prediction"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 100
}

instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```

Triton Client Integration
The ML Service communicates with Triton via gRPC:
```python
import numpy as np
import tritonclient.grpc as grpcclient


class TritonInferenceService:
    def __init__(self, url: str = "localhost:8001"):
        self.client = grpcclient.InferenceServerClient(url=url)

    def predict(self, model_name: str, features: np.ndarray) -> np.ndarray:
        # Describe the input tensor declared in the model's config.pbtxt.
        inputs = [grpcclient.InferInput("features", list(features.shape), "FP32")]
        inputs[0].set_data_from_numpy(features)
        # infer() is a blocking gRPC call; for async request handling,
        # use the tritonclient.grpc.aio client instead.
        result = self.client.infer(model_name, inputs)
        return result.as_numpy("prediction")
```

Dynamic Batching
Triton automatically batches incoming requests for improved throughput:
| Setting | Description | Default |
|---|---|---|
| preferred_batch_size | Optimal batch sizes to form | [16, 32, 64] |
| max_queue_delay_microseconds | Max wait time for batching | 100 |
| preserve_ordering | Maintain request order | true |
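These settings map one-to-one onto the `dynamic_batching` block of config.pbtxt. A sketch with all three spelled out (`preserve_ordering` is a standard Triton configuration field):

```
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 100
  preserve_ordering: true
}
```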
Model Versioning
Triton supports multiple model versions with configurable version policy:
| Policy | Description |
|---|---|
| Latest | Serve only the latest version |
| All | Serve all available versions |
| Specific | Serve explicitly listed versions |
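Each policy corresponds to a `version_policy` stanza in config.pbtxt. A sketch of the three variants (the version numbers in the last stanza are illustrative):

```
# Serve only the newest version (the default policy):
version_policy: { latest { num_versions: 1 } }

# Serve every version present in the repository:
version_policy: { all { } }

# Serve an explicit set of versions:
version_policy: { specific { versions: [ 1, 2 ] } }
```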
Health Monitoring
| Endpoint | Purpose |
|---|---|
| /v2/health/live | Triton server liveness |
| /v2/health/ready | Triton server readiness |
| /v2/models/:model/ready | Specific model readiness |
| /v2/models/:model/versions/:ver/stats | Version-level statistics |
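A readiness probe against these endpoints can be sketched with the standard library alone; `model_ready_path` and `is_model_ready` below are hypothetical helpers, assuming the HTTP endpoint from the Configuration section (e.g. `localhost:8000`) and that a 200 response signals readiness:

```python
from typing import Optional
from urllib.error import URLError
from urllib.request import urlopen


def model_ready_path(model: str, version: Optional[str] = None) -> str:
    """Build the v2 readiness path for a model, optionally a specific version."""
    path = f"/v2/models/{model}"
    if version is not None:
        path += f"/versions/{version}"
    return path + "/ready"


def is_model_ready(http_url: str, model: str) -> bool:
    """Probe Triton's HTTP endpoint; HTTP 200 means the model is ready."""
    try:
        with urlopen(f"http://{http_url}{model_ready_path(model)}", timeout=2.0) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False
```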
Configuration
| Environment Variable | Default | Description |
|---|---|---|
| TRITON_URL | localhost:8001 | Triton gRPC endpoint |
| TRITON_HTTP_URL | localhost:8000 | Triton HTTP endpoint |
| TRITON_MODEL_REPOSITORY | /models | Model repository path |
| TRITON_MAX_BATCH_SIZE | 64 | Maximum batch size |
| TRITON_GPU_ENABLED | false | Enable GPU backends |
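A minimal sketch of loading these variables with their defaults; `TritonSettings` and `load_triton_settings` are illustrative names, not part of the ML Service codebase:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TritonSettings:
    url: str
    http_url: str
    model_repository: str
    max_batch_size: int
    gpu_enabled: bool


def load_triton_settings(env: Optional[Mapping[str, str]] = None) -> TritonSettings:
    """Read the Triton environment variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return TritonSettings(
        url=env.get("TRITON_URL", "localhost:8001"),
        http_url=env.get("TRITON_HTTP_URL", "localhost:8000"),
        model_repository=env.get("TRITON_MODEL_REPOSITORY", "/models"),
        max_batch_size=int(env.get("TRITON_MAX_BATCH_SIZE", "64")),
        gpu_enabled=env.get("TRITON_GPU_ENABLED", "false").lower() == "true",
    )
```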