Running LLMs on Kubernetes with vLLM — GPU Setup and Autoscaling
Deploy LLMs on Kubernetes with vLLM: GPU node setup, NVIDIA device plugin, KEDA autoscaling, multi-model serving, and spot GPU cost reduction strategies.
Running large language models on Kubernetes has become a core infrastructure requirement for AI-native companies. vLLM, one of the highest-throughput open-source LLM serving engines, has become a de facto standard for production LLM deployment, offering PagedAttention for memory efficiency and continuous batching for throughput.
This guide covers the complete setup: GPU node configuration in Kubernetes, deploying vLLM as a Kubernetes workload, autoscaling with KEDA, multi-model serving strategies, and cost optimization with spot GPU instances.
Why vLLM for Production LLM Serving
Before infrastructure details, why vLLM specifically:
PagedAttention — vLLM manages KV cache memory in fixed-size blocks, the way an OS manages virtual memory pages, eliminating the waste of reserving large contiguous attention caches that may never fill. This can increase effective throughput by 2-4x compared to naive static allocation.
Continuous batching — vLLM dynamically batches requests at the iteration level rather than the request level. New requests join in-progress batches, dramatically improving GPU utilization during variable load.
Tensor parallelism — vLLM can shard a single model across multiple GPUs within a node, enabling models too large for a single GPU.
OpenAI-compatible API — the vLLM server exposes an API compatible with OpenAI’s chat/completions endpoint. Applications built against OpenAI’s API require zero code changes to switch to a self-hosted vLLM deployment.
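To illustrate, here is a minimal Python sketch that builds such a request. The Service URL is an assumption matching the deployment later in this guide; substitute your own cluster DNS name.

```python
import json

# Assumed in-cluster Service DNS name for the vLLM deployment.
VLLM_BASE_URL = "http://vllm-llama3-8b.llm-serving.svc:8000"

def chat_completion_request(model: str, user_message: str) -> tuple[str, bytes]:
    """Build the URL and JSON body for an OpenAI-style chat completion call."""
    url = f"{VLLM_BASE_URL}/v1/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 256,
    }).encode()
    return url, body
```

Existing OpenAI SDK clients need only their base URL switched to `http://<service>:8000/v1`; the rest of the application code stays the same.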
Step 1: GPU Node Pool Setup
GPU nodes require specific configuration before Kubernetes can schedule GPU workloads on them.
EKS GPU Node Group (eksctl):
# eksctl cluster config for GPU node group
nodeGroups:
  - name: gpu-nodes
    instanceType: g5.2xlarge    # NVIDIA A10G, 24GB VRAM; 8 vCPU / 32 GiB RAM fits the vLLM pod below
    desiredCapacity: 2
    minSize: 0
    maxSize: 10
    amiFamily: AmazonLinux2     # eksctl selects the GPU-optimized AMI (NVIDIA drivers preinstalled)
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
    labels:
      node-role.kubernetes.io/gpu: "true"
      nvidia.com/gpu.present: "true"
The taint nvidia.com/gpu=true:NoSchedule is critical — it prevents non-GPU workloads from being scheduled on expensive GPU nodes.
GKE GPU Node Pool:
gcloud container node-pools create gpu-pool \
  --cluster my-cluster \
  --region us-central1 \
  --machine-type n1-standard-8 \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --num-nodes 2 \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 10 \
  --node-taints nvidia.com/gpu=present:NoSchedule
Note that on GKE, A100s are only available with a2-* machine types; n1 machines pair with T4, V100, P100, or P4 accelerators.
Step 2: NVIDIA Device Plugin
The NVIDIA Device Plugin is required for Kubernetes to recognize and allocate GPU resources. Without it, the scheduler sees no nvidia.com/gpu resources on the nodes, and pods requesting them stay Pending.
# Install NVIDIA Device Plugin as a DaemonSet
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
Or via Helm (recommended for production, gives more configuration control):
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
--namespace kube-system \
--set failOnInitError=false
Verify GPU discovery:
# GPUs should appear as allocatable resources
kubectl describe nodes | grep -A 10 "Capacity:"
# Look for: nvidia.com/gpu: 1
# Test GPU allocation with a simple pod
# (kubectl run's --limits flag was removed in kubectl 1.24, so apply a manifest)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.0.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
kubectl logs gpu-test   # should print the nvidia-smi device table
Step 3: Deploying vLLM on Kubernetes
A complete vLLM Deployment spec for serving Llama 3.1 8B:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b
  namespace: llm-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b
  template:
    metadata:
      labels:
        app: vllm-llama3-8b
    spec:
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # pin a specific tag in production
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --model
            - meta-llama/Meta-Llama-3.1-8B-Instruct
            - --port
            - "8000"
            - --tensor-parallel-size
            - "1"       # Use 1 GPU; set to 2+ for larger models
            - --max-model-len
            - "8192"    # Maximum context window
            - --gpu-memory-utilization
            - "0.90"    # Cap total GPU memory use (weights + KV cache) at 90%
          resources:
            limits:
              nvidia.com/gpu: 1   # Request exactly 1 GPU
              cpu: "8"
              memory: "32Gi"
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: "16Gi"
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60   # Models take time to load
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 30      # Allow up to 5 minutes for model loading
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc   # Persistent storage for model weights
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b
  namespace: llm-serving
spec:
  selector:
    app: vllm-llama3-8b
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
Key parameters:
- --gpu-memory-utilization 0.90 — caps vLLM's total VRAM use (weights plus KV cache) at 90%. Higher = more room for concurrent requests but less headroom
- --max-model-len — context window size. Shorter = more concurrent requests fit in the KV cache
- --tensor-parallel-size — number of GPUs to shard the model across. Required for models larger than a single GPU's VRAM
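To see how these flags interact, here is back-of-envelope KV cache math for Llama 3.1 8B on a 24 GB GPU. The architecture numbers (32 layers, 8 KV heads with GQA, head dim 128) and the ~16 GB fp16 weight footprint are approximations, not measured values.

```python
# Rough KV cache sizing: Llama 3.1 8B on a 24 GB GPU at 0.90 utilization.

def kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for the K and V tensors stored at every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_cached_tokens(vram_gb=24, gpu_mem_util=0.90, weights_gb=16):
    budget_gb = vram_gb * gpu_mem_util - weights_gb   # VRAM left for KV cache
    return int(budget_gb * 1e9 / kv_cache_bytes_per_token())

print(kv_cache_bytes_per_token())  # 131072 bytes (~128 KiB) per cached token
print(max_cached_tokens())         # roughly 42k tokens shared across all requests
```

With --max-model-len 8192, that budget supports only about five full-context requests at once, which is why shortening the context window raises concurrency.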
Step 4: Persistent Model Storage with PVCs
Model weights are large — Llama 3.1 8B is ~16GB, Llama 3.1 70B is ~140GB. Downloading them from HuggingFace Hub every time a pod starts is slow (minutes) and increases costs. Use a PVC to cache models.
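A quick sketch of where those sizes come from, assuming fp16/bf16 weights at 2 bytes per parameter; the bandwidth figure is an illustrative assumption.

```python
# Weight-size and download-time arithmetic for fp16/bf16 checkpoints.

def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    # params_billion * 1e9 params * bytes, expressed in GB
    return params_billion * bytes_per_param

def download_minutes(size_gb: float, gbit_per_sec: float = 10.0) -> float:
    return size_gb * 8 / gbit_per_sec / 60

print(weights_gb(8))    # 16.0 GB  -> matches Llama 3.1 8B
print(weights_gb(70))   # 140.0 GB -> matches Llama 3.1 70B
print(round(download_minutes(140), 1))  # ~1.9 min at a full 10 Gbit/s; real Hub downloads take far longer
```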
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: llm-serving
spec:
  accessModes:
    - ReadWriteMany   # init containers write the cache; all replicas read it
  storageClassName: efs-sc   # shared filesystem: EFS on AWS, Filestore on GCP
  resources:
    requests:
      storage: 500Gi
Pre-populate the model cache with an init container or a separate Job that downloads the model once:
initContainers:
  - name: download-model
    image: python:3.11
    command:
      - bash
      - -c
      - |
        pip install --quiet huggingface_hub
        python3 - <<'PY'
        import os
        from huggingface_hub import snapshot_download
        if not os.path.exists('/cache/models--meta-llama--Meta-Llama-3.1-8B-Instruct'):
            snapshot_download('meta-llama/Meta-Llama-3.1-8B-Instruct', cache_dir='/cache')
        PY
    env:
      - name: HUGGING_FACE_HUB_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token
            key: token
    volumeMounts:
      - name: model-cache
        mountPath: /cache
    resources:
      requests:
        cpu: "2"
        memory: "8Gi"
Step 5: Autoscaling with KEDA
Standard HPA doesn’t work well for LLM serving — GPU utilization is a poor scaling metric because LLMs run near 100% GPU utilization even at low request rates (the GPU is constantly doing computation during inference). Scaling on GPU utilization leads to over-provisioning.
Better metrics for LLM autoscaling:
- Queue depth — number of pending requests in the vLLM queue
- Active requests — number of currently processing requests
- Time-to-first-token latency — p95 TTFT above threshold = scale up
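Whichever metric you pick, KEDA hands it to the HPA, which computes replicas as ceil(metric / threshold), clamped to the configured bounds. A small sketch of that arithmetic:

```python
import math

def desired_replicas(metric_value: float, threshold: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """The HPA scaling math KEDA relies on: ceil(value / target), clamped."""
    desired = math.ceil(metric_value / threshold)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(12, 5))   # 12 waiting requests at threshold 5 -> 3 replicas
```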
KEDA (Kubernetes Event-Driven Autoscaling) can scale on custom Prometheus metrics:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: llm-serving
spec:
  scaleTargetRef:
    name: vllm-llama3-8b
  minReplicaCount: 1
  maxReplicaCount: 10
  cooldownPeriod: 300   # 5 min cooldown (avoid thrashing; GPU nodes are slow to provision)
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_num_requests_waiting
        threshold: "5"   # scale so that waiting requests per replica stay around 5
        query: sum(vllm_num_requests_waiting{namespace="llm-serving",pod=~"vllm-llama3-8b-.*"})
vLLM exposes Prometheus metrics at /metrics including:
- vllm_num_requests_waiting — queue depth
- vllm_num_requests_running — active requests
- vllm_gpu_cache_usage_perc — KV cache utilization
- vllm_time_to_first_token_seconds — TTFT histogram
Multi-Model Serving Strategies
When serving multiple models, you have two architectural choices:
Option 1: Dedicated deployment per model — separate Deployment (and GPU) for each model. Simple, complete isolation, but expensive if models are small or low-traffic.
Option 2: Multi-model server — a single vLLM instance (or OpenLLM, Ray Serve) that serves multiple models, sharing GPU memory.
A single vLLM server process serves exactly one base model; vLLM has no built-in multiplexing of unrelated models in one instance. What it can do is serve many LoRA fine-tunes of a shared base model from the same GPU:
# One base model plus multiple LoRA adapters on a single GPU
# (adapter names and paths are illustrative)
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules support-bot=/adapters/support sales-bot=/adapters/sales \
  --gpu-memory-utilization 0.90
For genuinely different base models, either give each its own Deployment (Option 1) or run multiple vLLM pods on one node with partitioned --gpu-memory-utilization values, which only works when the combined weights fit in VRAM.
Ray Serve provides a more flexible multi-model serving platform if you need dynamic model loading/unloading or more complex routing logic.
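With dedicated per-model deployments, the routing layer in front can be as simple as a model-name-to-Service lookup. A hypothetical sketch, with assumed Service DNS names:

```python
# Hypothetical model-to-endpoint map; the Service DNS names are assumptions.
MODEL_ENDPOINTS = {
    "llama3-8b": "http://vllm-llama3-8b.llm-serving.svc:8000",
    "mistral-7b": "http://vllm-mistral-7b.llm-serving.svc:8000",
}

def endpoint_for(model: str) -> str:
    """Resolve a requested model name to its dedicated vLLM Service."""
    try:
        return MODEL_ENDPOINTS[model]
    except KeyError:
        raise ValueError(f"unknown model: {model}") from None
```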
Cost Optimization: Spot GPU Instances
Spot GPU instances cost 60-75% less than on-demand. For batch inference workloads (not latency-sensitive), spot is viable with proper design.
Make LLM workloads spot-tolerant:
- Use Karpenter (EKS) or equivalent for fast GPU node provisioning — Karpenter can provision new GPU nodes in 60-90 seconds, vs Cluster Autoscaler’s 3-5 minutes
- Design for graceful shutdown — handle SIGTERM in the vLLM process to drain in-flight requests before shutdown
- Set short inference timeouts on the client side — allow automatic retry on a different endpoint
- Use multiple spot instance types in your node pool for better availability
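The graceful-shutdown point can be sketched as a small drain guard; this is an illustrative pattern, not vLLM's internal signal handling:

```python
import signal

class GracefulDrain:
    """Flip to draining on SIGTERM so a server stops accepting new requests
    and finishes in-flight ones before the pod's grace period expires."""

    def __init__(self):
        self.draining = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.draining = True

    def accept_new_requests(self) -> bool:
        return not self.draining
```

Pair this with a readiness probe that fails once draining starts, so the Service stops routing traffic to the pod while in-flight requests complete.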
# Karpenter NodeClaim for spot GPU
apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
metadata:
name: gpu-spot
spec:
requirements:
- key: karpenter.k8s.aws/instance-gpu-name
operator: In
values: ["a10g", "t4", "v100"]
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"]
nodeClassRef:
name: gpu-nodeclass
Observability for LLM Serving
Essential metrics to monitor:
# Grafana dashboard metrics for vLLM
- vllm_num_requests_running # Active inference requests
- vllm_num_requests_waiting # Queue depth (leading indicator)
- vllm_gpu_cache_usage_perc # KV cache utilization (>90% = memory pressure)
- vllm_time_to_first_token_seconds # TTFT p50/p95/p99
- vllm_request_success_total # Success rate
- DCGM_FI_DEV_GPU_UTIL # Raw GPU utilization (from the NVIDIA DCGM exporter)
Alert on:
- Queue depth > 10 for >2 minutes (scale up trigger)
- p95 TTFT > 2 seconds (latency degradation)
- GPU cache utilization > 95% (OOM risk)
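If you run the Prometheus Operator, the first alert can be expressed as a PrometheusRule; this is a sketch, with names and thresholds to adjust for your environment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alerts
  namespace: llm-serving
spec:
  groups:
    - name: vllm
      rules:
        - alert: VLLMQueueBacklog
          expr: sum(vllm_num_requests_waiting{namespace="llm-serving"}) > 10
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: vLLM request queue backed up for 2+ minutes
```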
Production Checklist for LLM on Kubernetes
- GPU node pool with appropriate instance type for model size
- NVIDIA Device Plugin deployed and verified
- vLLM Deployment with GPU requests, tolerations, and readiness probe
- Persistent model cache PVC (shared filesystem)
- KEDA ScaledObject using queue depth metric (not GPU utilization)
- Prometheus metrics scraping enabled on vLLM pods
- Grafana dashboard for TTFT, queue depth, cache utilization
- HuggingFace token in a Kubernetes Secret (not in env vars in code)
- Pod Disruption Budget set to maintain minimum replicas
Scale Your AI Infrastructure
Running LLMs in production requires expertise in both ML systems and Kubernetes infrastructure. Getting the memory configuration, autoscaling triggers, and spot instance strategy right makes the difference between a $5k/month and $20k/month GPU bill.
→ AI/ML Infrastructure service at kubernetes.ae — we design and operate GPU-accelerated Kubernetes infrastructure for LLM serving, model training, and MLOps pipelines.
Get Expert Kubernetes Help
Talk to a certified Kubernetes expert. Free 30-minute consultation — actionable findings within days.
Talk to an Expert