Running LLMs on Kubernetes with vLLM — GPU Setup and Autoscaling
Deploy LLMs on Kubernetes with vLLM: GPU node setup, NVIDIA device plugin, KEDA autoscaling, multi-model serving, and spot GPU cost reduction strategies.
Running large language models on Kubernetes has become a core infrastructure requirement for AI-native companies. vLLM, one of the highest-throughput open-source LLM serving engines, has become a de facto standard for production LLM deployment, offering PagedAttention for memory efficiency and continuous batching for throughput.
This guide covers the complete setup: GPU node configuration in Kubernetes, deploying vLLM as a Kubernetes workload, autoscaling with KEDA, multi-model serving strategies, and cost optimization with spot GPU instances.
Why vLLM for Production LLM Serving
Before infrastructure details, why vLLM specifically:
PagedAttention — vLLM manages KV cache memory in fixed-size blocks, the way an OS manages virtual memory pages, eliminating the waste of reserving large contiguous attention caches that may never fill. This can increase effective throughput by 2-4x compared to naive static allocation.
Continuous batching — vLLM dynamically batches requests at the iteration level rather than the request level. New requests join in-progress batches, dramatically improving GPU utilization during variable load.
Tensor parallelism — vLLM can shard a single model across multiple GPUs within a node, enabling models too large for a single GPU.
OpenAI-compatible API — the vLLM server exposes an API compatible with OpenAI’s chat/completions endpoint. Applications built against OpenAI’s API require zero code changes to switch to a self-hosted vLLM deployment.
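To illustrate, here is a minimal Python sketch that builds such a request. The Service URL is an assumption matching the deployment later in this guide; substitute your own cluster DNS name.

```python
import json

# Assumed in-cluster Service DNS name for the vLLM deployment.
VLLM_BASE_URL = "http://vllm-llama3-8b.llm-serving.svc:8000"

def chat_completion_request(model: str, user_message: str) -> tuple[str, bytes]:
    """Build the URL and JSON body for an OpenAI-style chat completion call."""
    url = f"{VLLM_BASE_URL}/v1/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 256,
    }).encode()
    return url, body
```

Existing OpenAI SDK clients need only their base URL switched to `http://<service>:8000/v1`; the rest of the application code stays the same.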
Step 1: GPU Node Pool Setup
GPU nodes require specific configuration before Kubernetes can schedule GPU workloads on them.
EKS GPU Node Group (eksctl):
# eksctl cluster config for GPU node group
nodeGroups:
  - name: gpu-nodes
    instanceType: g5.2xlarge    # NVIDIA A10G, 24GB VRAM; 8 vCPU / 32 GiB RAM fits the vLLM pod below
    desiredCapacity: 2
    minSize: 0
    maxSize: 10
    amiFamily: AmazonLinux2     # eksctl selects the GPU-optimized AMI (NVIDIA drivers preinstalled)
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
    labels:
      node-role.kubernetes.io/gpu: "true"
      nvidia.com/gpu.present: "true"
The taint nvidia.com/gpu=true:NoSchedule is critical — it prevents non-GPU workloads from being scheduled on expensive GPU nodes.
GKE GPU Node Pool:
gcloud container node-pools create gpu-pool \
  --cluster my-cluster \
  --region us-central1 \
  --machine-type n1-standard-8 \
  --accelerator type=nvidia-tesla-t4,count=1 \
  --num-nodes 2 \
  --enable-autoscaling \
  --min-nodes 0 \
  --max-nodes 10 \
  --node-taints nvidia.com/gpu=present:NoSchedule
Note that on GKE, A100s are only available with a2-* machine types; n1 machines pair with T4, V100, P100, or P4 accelerators.
Step 2: NVIDIA Device Plugin
The NVIDIA Device Plugin is required for Kubernetes to recognize and allocate GPU resources. Without it, the scheduler sees no nvidia.com/gpu resources on the nodes, and pods requesting them stay Pending.
# Install NVIDIA Device Plugin as a DaemonSet
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml
Or via Helm (recommended for production, gives more configuration control):
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin \
--namespace kube-system \
--set failOnInitError=false
Verify GPU discovery:
# GPUs should appear as allocatable resources
kubectl describe nodes | grep -A 10 "Capacity:"
# Look for: nvidia.com/gpu: 1
# Test GPU allocation with a simple pod
# (kubectl run's --limits flag was removed in kubectl 1.24, so apply a manifest)
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
  containers:
    - name: cuda
      image: nvidia/cuda:12.0.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
kubectl logs gpu-test   # should print the nvidia-smi device table
Step 3: Deploying vLLM on Kubernetes
A complete vLLM Deployment spec for serving Llama 3.1 8B:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b
  namespace: llm-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-llama3-8b
  template:
    metadata:
      labels:
        app: vllm-llama3-8b
    spec:
      tolerations:
        - key: "nvidia.com/gpu"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest   # pin a specific tag in production
          command:
            - python3
            - -m
            - vllm.entrypoints.openai.api_server
            - --model
            - meta-llama/Meta-Llama-3.1-8B-Instruct
            - --port
            - "8000"
            - --tensor-parallel-size
            - "1"       # Use 1 GPU; set to 2+ for larger models
            - --max-model-len
            - "8192"    # Maximum context window
            - --gpu-memory-utilization
            - "0.90"    # Cap total GPU memory use (weights + KV cache) at 90%
          resources:
            limits:
              nvidia.com/gpu: 1   # Request exactly 1 GPU
              cpu: "8"
              memory: "32Gi"
            requests:
              nvidia.com/gpu: 1
              cpu: "4"
              memory: "16Gi"
          ports:
            - containerPort: 8000
              name: http
          env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token
                  key: token
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 60   # Models take time to load
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 30      # Allow up to 5 minutes for model loading
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc   # Persistent storage for model weights
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-llama3-8b
  namespace: llm-serving
spec:
  selector:
    app: vllm-llama3-8b
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
Key parameters:
- --gpu-memory-utilization 0.90 — caps vLLM's total VRAM use (weights plus KV cache) at 90%. Higher = more room for concurrent requests but less headroom
- --max-model-len — context window size. Shorter = more concurrent requests fit in the KV cache
- --tensor-parallel-size — number of GPUs to shard the model across. Required for models larger than a single GPU's VRAM
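To see how these flags interact, here is back-of-envelope KV cache math for Llama 3.1 8B on a 24 GB GPU. The architecture numbers (32 layers, 8 KV heads with GQA, head dim 128) and the ~16 GB fp16 weight footprint are approximations, not measured values.

```python
# Rough KV cache sizing: Llama 3.1 8B on a 24 GB GPU at 0.90 utilization.

def kv_cache_bytes_per_token(layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2x for the K and V tensors stored at every layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_cached_tokens(vram_gb=24, gpu_mem_util=0.90, weights_gb=16):
    budget_gb = vram_gb * gpu_mem_util - weights_gb   # VRAM left for KV cache
    return int(budget_gb * 1e9 / kv_cache_bytes_per_token())

print(kv_cache_bytes_per_token())  # 131072 bytes (~128 KiB) per cached token
print(max_cached_tokens())         # roughly 42k tokens shared across all requests
```

With --max-model-len 8192, that budget supports only about five full-context requests at once, which is why shortening the context window raises concurrency.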
Step 4: Persistent Model Storage with PVCs
Model weights are large — Llama 3.1 8B is ~16GB, Llama 3.1 70B is ~140GB. Downloading them from HuggingFace Hub every time a pod starts is slow (minutes) and increases costs. Use a PVC to cache models.
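A quick sketch of where those sizes come from, assuming fp16/bf16 weights at 2 bytes per parameter; the bandwidth figure is an illustrative assumption.

```python
# Weight-size and download-time arithmetic for fp16/bf16 checkpoints.

def weights_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    # params_billion * 1e9 params * bytes, expressed in GB
    return params_billion * bytes_per_param

def download_minutes(size_gb: float, gbit_per_sec: float = 10.0) -> float:
    return size_gb * 8 / gbit_per_sec / 60

print(weights_gb(8))    # 16.0 GB  -> matches Llama 3.1 8B
print(weights_gb(70))   # 140.0 GB -> matches Llama 3.1 70B
print(round(download_minutes(140), 1))  # ~1.9 min at a full 10 Gbit/s; real Hub downloads take far longer
```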
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
  namespace: llm-serving
spec:
  accessModes:
    - ReadWriteMany   # init containers write the cache; all replicas read it
  storageClassName: efs-sc   # shared filesystem: EFS on AWS, Filestore on GCP
  resources:
    requests:
      storage: 500Gi
Pre-populate the model cache with an init container or a separate Job that downloads the model once:
initContainers:
  - name: download-model
    image: python:3.11
    command:
      - bash
      - -c
      - |
        pip install --quiet huggingface_hub
        python3 - <<'PY'
        import os
        from huggingface_hub import snapshot_download
        if not os.path.exists('/cache/models--meta-llama--Meta-Llama-3.1-8B-Instruct'):
            snapshot_download('meta-llama/Meta-Llama-3.1-8B-Instruct', cache_dir='/cache')
        PY
    env:
      - name: HUGGING_FACE_HUB_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token
            key: token
    volumeMounts:
      - name: model-cache
        mountPath: /cache
    resources:
      requests:
        cpu: "2"
        memory: "8Gi"
Step 5: Autoscaling with KEDA
Standard HPA doesn’t work well for LLM serving — GPU utilization is a poor scaling metric because LLMs run near 100% GPU utilization even at low request rates (the GPU is constantly doing computation during inference). Scaling on GPU utilization leads to over-provisioning.
Better metrics for LLM autoscaling:
- Queue depth — number of pending requests in the vLLM queue
- Active requests — number of currently processing requests
- Time-to-first-token latency — p95 TTFT above threshold = scale up
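Whichever metric you pick, KEDA hands it to the HPA, which computes replicas as ceil(metric / threshold), clamped to the configured bounds. A small sketch of that arithmetic:

```python
import math

def desired_replicas(metric_value: float, threshold: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """The HPA scaling math KEDA relies on: ceil(value / target), clamped."""
    desired = math.ceil(metric_value / threshold)
    return max(min_replicas, min(max_replicas, desired))

print(desired_replicas(12, 5))   # 12 waiting requests at threshold 5 -> 3 replicas
```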
KEDA (Kubernetes Event-Driven Autoscaling) can scale on custom Prometheus metrics:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
  namespace: llm-serving
spec:
  scaleTargetRef:
    name: vllm-llama3-8b
  minReplicaCount: 1
  maxReplicaCount: 10
  cooldownPeriod: 300   # 5 min cooldown (avoid thrashing; GPU nodes are slow to provision)
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: vllm_num_requests_waiting
        threshold: "5"   # scale so that waiting requests per replica stay around 5
        query: sum(vllm_num_requests_waiting{namespace="llm-serving",pod=~"vllm-llama3-8b-.*"})
vLLM exposes Prometheus metrics at /metrics including:
- vllm_num_requests_waiting — queue depth
- vllm_num_requests_running — active requests
- vllm_gpu_cache_usage_perc — KV cache utilization
- vllm_time_to_first_token_seconds — TTFT histogram
Multi-Model Serving Strategies
When serving multiple models, you have two architectural choices:
Option 1: Dedicated deployment per model — separate Deployment (and GPU) for each model. Simple, complete isolation, but expensive if models are small or low-traffic.
Option 2: Multi-model server — a single vLLM instance (or OpenLLM, Ray Serve) that serves multiple models, sharing GPU memory.
A single vLLM server process serves exactly one base model; vLLM has no built-in multiplexing of unrelated models in one instance. What it can do is serve many LoRA fine-tunes of a shared base model from the same GPU:
# One base model plus multiple LoRA adapters on a single GPU
# (adapter names and paths are illustrative)
vllm serve meta-llama/Meta-Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules support-bot=/adapters/support sales-bot=/adapters/sales \
  --gpu-memory-utilization 0.90
For genuinely different base models, either give each its own Deployment (Option 1) or run multiple vLLM pods on one node with partitioned --gpu-memory-utilization values, which only works when the combined weights fit in VRAM.
Ray Serve provides a more flexible multi-model serving platform if you need dynamic model loading/unloading or more complex routing logic.
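With dedicated per-model deployments, the routing layer in front can be as simple as a model-name-to-Service lookup. A hypothetical sketch, with assumed Service DNS names:

```python
# Hypothetical model-to-endpoint map; the Service DNS names are assumptions.
MODEL_ENDPOINTS = {
    "llama3-8b": "http://vllm-llama3-8b.llm-serving.svc:8000",
    "mistral-7b": "http://vllm-mistral-7b.llm-serving.svc:8000",
}

def endpoint_for(model: str) -> str:
    """Resolve a requested model name to its dedicated vLLM Service."""
    try:
        return MODEL_ENDPOINTS[model]
    except KeyError:
        raise ValueError(f"unknown model: {model}") from None
```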
Cost Optimization: Spot GPU Instances
Spot GPU instances cost 60-75% less than on-demand. For batch inference workloads (not latency-sensitive), spot is viable with proper design.
Make LLM workloads spot-tolerant:
- Use Karpenter (EKS) or equivalent for fast GPU node provisioning — Karpenter can provision new GPU nodes in 60-90 seconds, vs Cluster Autoscaler’s 3-5 minutes
- Design for graceful shutdown — handle SIGTERM in the vLLM process to drain in-flight requests before shutdown
- Set short inference timeouts on the client side — allow automatic retry on a different endpoint
- Use multiple spot instance types in your node pool for better availability
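The graceful-shutdown point can be sketched as a small drain guard; this is an illustrative pattern, not vLLM's internal signal handling:

```python
import signal

class GracefulDrain:
    """Flip to draining on SIGTERM so a server stops accepting new requests
    and finishes in-flight ones before the pod's grace period expires."""

    def __init__(self):
        self.draining = False
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.draining = True

    def accept_new_requests(self) -> bool:
        return not self.draining
```

Pair this with a readiness probe that fails once draining starts, so the Service stops routing traffic to the pod while in-flight requests complete.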
# Karpenter NodeClaim for spot GPU
apiVersion: karpenter.sh/v1beta1
kind: NodeClaim
metadata:
name: gpu-spot
spec:
requirements:
- key: karpenter.k8s.aws/instance-gpu-name
operator: In
values: ["a10g", "t4", "v100"]
- key: karpenter.sh/capacity-type
operator: In
values: ["spot"]
nodeClassRef:
name: gpu-nodeclass
Observability for LLM Serving
Essential metrics to monitor:
# Grafana dashboard metrics for vLLM
- vllm_num_requests_running # Active inference requests
- vllm_num_requests_waiting # Queue depth (leading indicator)
- vllm_gpu_cache_usage_perc # KV cache utilization (>90% = memory pressure)
- vllm_time_to_first_token_seconds # TTFT p50/p95/p99
- vllm_request_success_total # Success rate
- DCGM_FI_DEV_GPU_UTIL # Raw GPU utilization (from the NVIDIA DCGM exporter)
Alert on:
- Queue depth > 10 for >2 minutes (scale up trigger)
- p95 TTFT > 2 seconds (latency degradation)
- GPU cache utilization > 95% (OOM risk)
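If you run the Prometheus Operator, the first alert can be expressed as a PrometheusRule; this is a sketch, with names and thresholds to adjust for your environment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alerts
  namespace: llm-serving
spec:
  groups:
    - name: vllm
      rules:
        - alert: VLLMQueueBacklog
          expr: sum(vllm_num_requests_waiting{namespace="llm-serving"}) > 10
          for: 2m
          labels:
            severity: warning
          annotations:
            summary: vLLM request queue backed up for 2+ minutes
```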
Production Checklist for LLM on Kubernetes
- GPU node pool with appropriate instance type for model size
- NVIDIA Device Plugin deployed and verified
- vLLM Deployment with GPU requests, tolerations, and readiness probe
- Persistent model cache PVC (shared filesystem)
- KEDA ScaledObject using queue depth metric (not GPU utilization)
- Prometheus metrics scraping enabled on vLLM pods
- Grafana dashboard for TTFT, queue depth, cache utilization
- HuggingFace token in a Kubernetes Secret (not in env vars in code)
- Pod Disruption Budget set to maintain minimum replicas
Scale Your AI Infrastructure
Running LLMs in production requires expertise in both ML systems and Kubernetes infrastructure. Getting the memory configuration, autoscaling triggers, and spot instance strategy right makes the difference between a $5k/month and $20k/month GPU bill.
→ AI/ML Infrastructure service at kubernetes.ae — we design and operate GPU-accelerated Kubernetes infrastructure for LLM serving, model training, and MLOps pipelines.
Get Expert Kubernetes Help
Talk to a certified Kubernetes expert. Free 30-minute consultation — actionable findings within days.
Talk to an Expert