Triton Serving
The ML Service integrates with NVIDIA Triton Inference Server for high-performance model serving with GPU acceleration, dynamic batching, and optimized model formats. Triton is used for latency-sensitive production workloads where models have been converted to optimized formats like ONNX, TensorRT, or TensorFlow SavedModel.
Triton Architecture
```
ML Service API --> Triton Client --> Triton Inference Server
                                               |
                                +--------------+--------------+
                                |              |              |
                          ONNX Runtime     TensorRT     TF SavedModel
                                |              |              |
                             CPU/GPU        GPU only       CPU/GPU
```

Supported Model Formats
| Format | Backend | Hardware | Optimization Level |
|---|---|---|---|
| ONNX | ONNX Runtime | CPU + GPU | Medium (quantization available) |
| TensorRT | TensorRT | GPU only | High (layer fusion, precision) |
| TensorFlow SavedModel | TensorFlow | CPU + GPU | Medium |
| PyTorch TorchScript | LibTorch | CPU + GPU | Medium |
| Python Backend | Custom Python | CPU + GPU | None (for custom logic) |
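Each format in the table above is selected in config.pbtxt via its backend's platform string. As a minimal sketch (the mapping below follows Triton's standard backend names; `PLATFORM_BY_FORMAT` and `platform_for` are illustrative helpers, not part of the ML Service API — note the Python backend is chosen with `backend: "python"` rather than a platform string):

```python
# Platform strings as they appear in a Triton config.pbtxt for each
# model format; keys here are illustrative internal format labels.
PLATFORM_BY_FORMAT = {
    "onnx": "onnxruntime_onnx",
    "tensorrt": "tensorrt_plan",
    "tensorflow_savedmodel": "tensorflow_savedmodel",
    "pytorch_torchscript": "pytorch_libtorch",
}


def platform_for(model_format: str) -> str:
    """Return the config.pbtxt platform string for a model format."""
    try:
        return PLATFORM_BY_FORMAT[model_format]
    except KeyError:
        raise ValueError(f"unsupported model format: {model_format}")
```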
Model Repository Structure
Triton models are organized in a repository on shared storage:
```
model-repository/
  churn-predictor/
    config.pbtxt
    1/
      model.onnx
    2/
      model.onnx
  revenue-forecast/
    config.pbtxt
    1/
      model.plan
```

Model Configuration
Each Triton model requires a config.pbtxt specifying inputs, outputs, and serving parameters:
```
name: "churn-predictor"
platform: "onnxruntime_onnx"
max_batch_size: 64

input [
  {
    name: "features"
    data_type: TYPE_FP32
    dims: [ 4 ]
  }
]

output [
  {
    name: "prediction"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]

dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 100
}

instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
```

Triton Client Integration
The ML Service communicates with Triton via gRPC:
```python
import numpy as np
import tritonclient.grpc as grpcclient


class TritonInferenceService:
    def __init__(self, url: str = "localhost:8001"):
        self.client = grpcclient.InferenceServerClient(url=url)

    def predict(self, model_name: str, features: np.ndarray) -> np.ndarray:
        # Describe the input tensor declared in the model's config.pbtxt.
        inputs = [grpcclient.InferInput("features", list(features.shape), "FP32")]
        inputs[0].set_data_from_numpy(features)
        # infer() is a blocking gRPC call; for async request handling,
        # use the tritonclient.grpc.aio client instead.
        result = self.client.infer(model_name, inputs)
        return result.as_numpy("prediction")
```

Dynamic Batching
Triton automatically batches incoming requests for improved throughput:
| Setting | Description | Default |
|---|---|---|
| preferred_batch_size | Optimal batch sizes to form | [16, 32, 64] |
| max_queue_delay_microseconds | Max wait time for batching | 100 |
| preserve_ordering | Maintain request order | true |
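These settings map one-to-one onto the `dynamic_batching` block of config.pbtxt. A sketch with all three spelled out (`preserve_ordering` is a standard Triton configuration field):

```
dynamic_batching {
  preferred_batch_size: [ 16, 32, 64 ]
  max_queue_delay_microseconds: 100
  preserve_ordering: true
}
```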
Model Versioning
Triton supports multiple model versions with configurable version policy:
| Policy | Description |
|---|---|
| Latest | Serve only the latest version |
| All | Serve all available versions |
| Specific | Serve explicitly listed versions |
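Each policy corresponds to a `version_policy` stanza in config.pbtxt. A sketch of the three variants (the version numbers in the last stanza are illustrative):

```
# Serve only the newest version (the default policy):
version_policy: { latest { num_versions: 1 } }

# Serve every version present in the repository:
version_policy: { all { } }

# Serve an explicit set of versions:
version_policy: { specific { versions: [ 1, 2 ] } }
```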
Health Monitoring
| Endpoint | Purpose |
|---|---|
| /v2/health/live | Triton server liveness |
| /v2/health/ready | Triton server readiness |
| /v2/models/:model/ready | Specific model readiness |
| /v2/models/:model/versions/:ver/stats | Version-level statistics |
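A readiness probe against these endpoints can be sketched with the standard library alone; `model_ready_path` and `is_model_ready` below are hypothetical helpers, assuming the HTTP endpoint from the Configuration section (e.g. `localhost:8000`) and that a 200 response signals readiness:

```python
from typing import Optional
from urllib.error import URLError
from urllib.request import urlopen


def model_ready_path(model: str, version: Optional[str] = None) -> str:
    """Build the v2 readiness path for a model, optionally a specific version."""
    path = f"/v2/models/{model}"
    if version is not None:
        path += f"/versions/{version}"
    return path + "/ready"


def is_model_ready(http_url: str, model: str) -> bool:
    """Probe Triton's HTTP endpoint; HTTP 200 means the model is ready."""
    try:
        with urlopen(f"http://{http_url}{model_ready_path(model)}", timeout=2.0) as resp:
            return resp.status == 200
    except (URLError, OSError):
        return False
```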
Configuration
| Environment Variable | Default | Description |
|---|---|---|
| TRITON_URL | localhost:8001 | Triton gRPC endpoint |
| TRITON_HTTP_URL | localhost:8000 | Triton HTTP endpoint |
| TRITON_MODEL_REPOSITORY | /models | Model repository path |
| TRITON_MAX_BATCH_SIZE | 64 | Maximum batch size |
| TRITON_GPU_ENABLED | false | Enable GPU backends |
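A minimal sketch of loading these variables with their defaults; `TritonSettings` and `load_triton_settings` are illustrative names, not part of the ML Service codebase:

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TritonSettings:
    url: str
    http_url: str
    model_repository: str
    max_batch_size: int
    gpu_enabled: bool


def load_triton_settings(env: Optional[Mapping[str, str]] = None) -> TritonSettings:
    """Read the Triton environment variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return TritonSettings(
        url=env.get("TRITON_URL", "localhost:8001"),
        http_url=env.get("TRITON_HTTP_URL", "localhost:8000"),
        model_repository=env.get("TRITON_MODEL_REPOSITORY", "/models"),
        max_batch_size=int(env.get("TRITON_MAX_BATCH_SIZE", "64")),
        gpu_enabled=env.get("TRITON_GPU_ENABLED", "false").lower() == "true",
    )
```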