# vLLM
vLLM provides high-throughput, low-latency LLM inference for self-hosted models, offering an OpenAI-compatible API endpoint.
## Configuration

```yaml
# From ai-service providers
providers:
  vllm:
    enabled: false # Disabled by default
    baseUrl: "http://vllm:8000"
    defaultModel: "default"
```

## GPU Requirements
vLLM requires GPU nodes for efficient model inference:

```yaml
resources:
  requests:
    cpu: "4"
    memory: "16Gi"
  limits:
    nvidia.com/gpu: 1 # A100 or H100 recommended
    cpu: "8"
    memory: "32Gi"
nodeSelector:
  nvidia.com/gpu.present: "true"
```

## Key Features
| Feature | Description |
|---|---|
| PagedAttention | Efficient GPU memory management |
| Continuous Batching | Dynamic request batching for throughput |
| OpenAI API | Drop-in replacement for OpenAI endpoints |
| Tensor Parallelism | Multi-GPU model sharding |
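Because vLLM exposes an OpenAI-compatible API, the endpoint configured above can be exercised with any OpenAI client or with plain HTTP. A minimal sketch using only the Python standard library — the base URL and model name are taken from the configuration above, and `/v1/chat/completions` is vLLM's standard OpenAI-compatible route:

```python
# Sketch: building an OpenAI-format chat request against the configured
# vLLM endpoint. Assumes the service is reachable at providers.vllm.baseUrl
# and that the configured defaultModel ("default") is loaded.
import json
import urllib.request

BASE_URL = "http://vllm:8000"  # providers.vllm.baseUrl

def build_chat_request(prompt: str, model: str = "default") -> urllib.request.Request:
    """Build a POST request to /v1/chat/completions in the OpenAI wire format."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Hello!")
# Sending the request requires a running vLLM server:
# with urllib.request.urlopen(req) as resp:
#     reply = json.load(resp)["choices"][0]["message"]["content"]
```

For multi-GPU sharding (the Tensor Parallelism feature above), the vLLM server accepts a `--tensor-parallel-size` flag at launch; no client-side change is needed.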