MATIH Platform is in active MVP development. Documentation reflects current implementation status.
vLLM
vLLM provides high-throughput, low-latency inference for self-hosted LLMs, exposing an OpenAI-compatible API endpoint.


Configuration

# From ai-service providers
providers:
  vllm:
    enabled: false  # Disabled by default
    baseUrl: "http://vllm:8000"
    defaultModel: "default"
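The provider settings above map onto vLLM's OpenAI-compatible HTTP API. A minimal sketch of constructing such a request (the `/v1/chat/completions` path follows vLLM's standard OpenAI-style routes; nothing here is actually sent over the network):

```python
import json

# baseUrl and defaultModel taken from the provider config above.
BASE_URL = "http://vllm:8000"
MODEL = "default"

# vLLM serves OpenAI-style routes under /v1.
endpoint = f"{BASE_URL}/v1/chat/completions"

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64,
}

print(endpoint)  # http://vllm:8000/v1/chat/completions
print(json.dumps(payload, indent=2))
```

Because the endpoint is OpenAI-compatible, any OpenAI client library can also be pointed at `http://vllm:8000/v1` once the server is running.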

GPU Requirements

vLLM requires GPU nodes for efficient model inference:

resources:
  requests:
    cpu: "4"
    memory: "16Gi"
  limits:
    nvidia.com/gpu: 1  # A100 or H100 recommended; for extended resources the request defaults to the limit
    cpu: "8"
    memory: "32Gi"

nodeSelector:
  nvidia.com/gpu.present: "true"
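GPU node pools are commonly tainted so that only GPU workloads schedule onto them. If that is the case in your cluster, the vLLM pod also needs a matching toleration; the taint key below is an assumption and varies by cluster setup:

```yaml
# Illustrative only: adjust the key/effect to match your cluster's GPU taint.
tolerations:
  - key: "nvidia.com/gpu"
    operator: "Exists"
    effect: "NoSchedule"
```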

Key Features

| Feature | Description |
| --- | --- |
| PagedAttention | Efficient GPU memory management |
| Continuous Batching | Dynamic request batching for throughput |
| OpenAI API | Drop-in replacement for OpenAI endpoints |
| Tensor Parallelism | Multi-GPU model sharding |
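The PagedAttention idea from the table can be illustrated with a toy sketch: the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping logical positions to physical blocks, so GPU memory is allocated on demand rather than reserved for the maximum sequence length. This is a simplification for intuition, not vLLM's actual implementation (the block size of 16 matches vLLM's default, but the classes here are hypothetical):

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class BlockAllocator:
    """Hands out physical KV-cache block IDs from a free pool."""
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def alloc(self) -> int:
        return self.free.pop()

class Sequence:
    """Tracks one request's logical-to-physical block mapping."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # logical block index -> physical block ID
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new physical block is needed only at each block boundary,
        # so memory grows with actual sequence length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=64)
seq = Sequence(allocator)
for _ in range(40):  # 40 tokens span ceil(40 / 16) = 3 blocks
    seq.append_token()
print(len(seq.block_table))  # 3
```

Continuous batching builds on this: because each sequence only holds the blocks it actually uses, finished requests free their blocks immediately and new requests can join the running batch between decoding steps.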