# vLLM
vLLM provides high-throughput, low-latency LLM inference for self-hosted models, offering an OpenAI-compatible API endpoint.
## Configuration

```yaml
# From ai-service providers
providers:
  vllm:
    enabled: false # Disabled by default
    baseUrl: "http://vllm:8000"
    defaultModel: "default"
```

## GPU Requirements
vLLM requires GPU nodes for efficient model inference:

```yaml
resources:
  requests:
    cpu: "4"
    memory: "16Gi"
  limits:
    nvidia.com/gpu: 1 # A100 or H100 recommended
    cpu: "8"
    memory: "32Gi"
nodeSelector:
  nvidia.com/gpu.present: "true"
```

## Key Features
| Feature | Description |
|---|---|
| PagedAttention | Efficient GPU memory management |
| Continuous Batching | Dynamic request batching for throughput |
| OpenAI API | Drop-in replacement for OpenAI endpoints |
| Tensor Parallelism | Multi-GPU model sharding |
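Because vLLM exposes an OpenAI-compatible API, the endpoint configured above can be exercised with any OpenAI client or with plain HTTP. A minimal sketch using only the Python standard library — the base URL and model name are taken from the configuration above, and `/v1/chat/completions` is vLLM's standard OpenAI-compatible route:

```python
# Sketch: building an OpenAI-format chat request against the configured
# vLLM endpoint. Assumes the service is reachable at providers.vllm.baseUrl
# and that the configured defaultModel ("default") is loaded.
import json
import urllib.request

BASE_URL = "http://vllm:8000"  # providers.vllm.baseUrl

def build_chat_request(prompt: str, model: str = "default") -> urllib.request.Request:
    """Build a POST request to /v1/chat/completions in the OpenAI wire format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Hello!")
# Sending the request requires a running vLLM server:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```

For multi-GPU sharding (the Tensor Parallelism feature above), the vLLM server accepts a `--tensor-parallel-size` flag at launch; no client-side change is needed.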