Triton Inference Server
NVIDIA Triton Inference Server provides multi-framework model serving, with support for TensorFlow, PyTorch, and ONNX models as well as custom backends.
Architecture
Triton loads models from a shared model repository (S3/MinIO) and serves them via HTTP/gRPC:

```
+------------------+      +------------------+
| Triton Server    |<-----| Model Repository |
|  HTTP:    8000   |      |  (S3 / MinIO)    |
|  gRPC:    8001   |      +------------------+
|  Metrics: 8002   |
+------------------+
```
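As a quick illustration of the HTTP endpoint above, the sketch below sends one inference request with the `tritonclient` Python package (`pip install tritonclient[http]`). The model name `resnet50` and the tensor names `INPUT__0`/`OUTPUT__0` are placeholders and must match the deployed model's `config.pbtxt`.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to Triton's HTTP endpoint (port 8000 in the diagram above).
client = httpclient.InferenceServerClient(url="localhost:8000")

# Make sure the server and the (placeholder) model are ready.
assert client.is_server_ready()
assert client.is_model_ready("resnet50")

# Build a single FP32 NCHW input tensor; names must match config.pbtxt.
image = np.random.rand(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("INPUT__0", list(image.shape), "FP32")
infer_input.set_data_from_numpy(image)
requested_output = httpclient.InferRequestedOutput("OUTPUT__0")

# Run inference and read the result back as a NumPy array.
result = client.infer(model_name="resnet50", inputs=[infer_input], outputs=[requested_output])
print(result.as_numpy("OUTPUT__0").shape)
```

The same request can be sent over gRPC on port 8001 with `tritonclient.grpc`, which exposes an equivalent client API.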
Supported Backends
| Backend | Model Format | Use Case |
|---|---|---|
| TensorRT | TensorRT plan/engine (built from ONNX/TF) | Low-latency inference |
| PyTorch | TorchScript | PyTorch models |
| TensorFlow | SavedModel | TensorFlow models |
| ONNX Runtime | ONNX | Cross-framework models |
| Python | Custom scripts | Preprocessing/postprocessing |
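For the Python backend in the table above, a "model" is a `model.py` implementing the `TritonPythonModel` interface. The sketch below is a hypothetical postprocessing model that applies a softmax to incoming logits; the tensor names `INPUT0`/`OUTPUT0` are placeholders that would need to match the model's `config.pbtxt`.

```python
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Read the input tensor declared in config.pbtxt (placeholder name).
            logits = pb_utils.get_input_tensor_by_name(request, "INPUT0").as_numpy()

            # Numerically stable softmax over the last axis.
            shifted = logits - logits.max(axis=-1, keepdims=True)
            probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)

            # Wrap the result in an output tensor and response.
            out = pb_utils.Tensor("OUTPUT0", probs.astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```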
GPU Configuration
Assign GPUs to the Triton container through Kubernetes resource limits:

```yaml
resources:
  limits:
    nvidia.com/gpu: 1
    cpu: "8"
    memory: "16Gi"
```