# Kubernetes Node Sizing: Right-Size Your Cluster and Cut Costs
Kubernetes node sizing guide: CPU and memory requests vs limits, LimitRange, ResourceQuota, VPA Recommender, and spot node strategy for mixed workloads.
Kubernetes node sizing decisions made at cluster creation tend to persist far longer than intended. The instance type chosen during initial setup, the number of node groups, and the resource request patterns set by developers compound over months. The common result: clusters where requests cover 70% of node capacity while actual utilization sits near 30%.
This guide covers the mechanics of Kubernetes resource requests and limits, how the Vertical Pod Autoscaler recommends right-sized values, and how to design node pools for different workload types.
## CPU and Memory Requests vs Limits
Understanding the difference between requests and limits is fundamental to both scheduling efficiency and cost optimization.
Requests — the amount of CPU/memory guaranteed to the container. The scheduler uses requests to decide which node a pod lands on. If a node doesn’t have enough allocatable resources to satisfy the request, the pod won’t be scheduled there.
Limits — the maximum the container can use. For CPU, exceeding the limit results in throttling (the container continues running but gets less CPU time). For memory, exceeding the limit results in the container being OOMKilled.
```yaml
resources:
  requests:
    cpu: "250m"      # 0.25 vCPU guaranteed
    memory: "256Mi"  # 256 MiB guaranteed
  limits:
    cpu: "1000m"     # 1 vCPU maximum (excess is throttled)
    memory: "512Mi"  # 512 MiB maximum (exceeding = OOMKill)
```
Why this matters for node sizing:
A node with 4 vCPU and 8GB memory can schedule pods whose requests sum to approximately:
- CPU: ~3.8 vCPU (accounting for system pods and kubelet overhead)
- Memory: ~7GB
If pods have high limits but low requests, you can pack many pods onto a node — but if they all spike simultaneously, you’ll hit CPU throttling or OOMKills. This is the over-commitment trade-off.
The common mistake: teams set requests and limits to the same value (either because their tooling suggests it, or because it’s “safe”). This eliminates over-commitment entirely and leads to dramatically under-utilized nodes. Setting limits 2-4x higher than requests is usually appropriate for bursty workloads.
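The over-commitment arithmetic is easy to sketch. Using the example node above (~3.8 vCPU / ~7 GB allocatable) and the request and limit values from the earlier snippet, here is a rough Python calculation. Treat it as a sketch only: real allocatable capacity depends on your kubelet and system reservations.

```python
def pods_per_node(alloc_cpu_m: int, alloc_mem_mi: int,
                  req_cpu_m: int, req_mem_mi: int) -> int:
    """Pods that fit on one node: the scheduler packs by *requests* only,
    so capacity is the tighter of the CPU and memory constraints."""
    return min(alloc_cpu_m // req_cpu_m, alloc_mem_mi // req_mem_mi)

# ~3.8 vCPU and ~7 GB allocatable, from the example above
ALLOC_CPU_M, ALLOC_MEM_MI = 3800, 7168

# Burstable pods: requests 250m / 256Mi (the 1000m / 512Mi limits
# play no role in scheduling, only in throttling and OOMKills)
print(pods_per_node(ALLOC_CPU_M, ALLOC_MEM_MI, 250, 256))   # 15 pods

# Guaranteed pods: requests raised to equal limits, 1000m / 512Mi
print(pods_per_node(ALLOC_CPU_M, ALLOC_MEM_MI, 1000, 512))  # 3 pods
```

Five times fewer pods per node for the same workload: that is the cost of setting requests equal to limits for bursty services.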
Quality of Service classes (determined by request/limit relationship):
| Class | Condition | Behavior |
|---|---|---|
| Guaranteed | requests == limits (CPU and memory) for every container | Evicted last |
| Burstable | Requests or limits set, but not meeting the Guaranteed criteria | Evicted after BestEffort pods |
| BestEffort | No requests or limits on any container | Evicted first |
For critical production workloads, use Guaranteed class. For background/batch workloads, BestEffort or Burstable is acceptable.
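For example, a pod lands in the Guaranteed class only when every container's requests equal its limits for both CPU and memory:

```yaml
# Guaranteed QoS: requests and limits match exactly for both resources
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```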
## LimitRange: Safe Defaults for Every Container
LimitRange sets default resource requests and limits at the namespace level. Without LimitRange, any container deployed without explicit resource specs gets zero requests and no limits — it can consume unlimited resources and the scheduler places it arbitrarily.
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      default:            # applied when no limits are specified
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:     # applied when no requests are specified
        cpu: "100m"
        memory: "128Mi"
      max:                # hard maximum any container may set
        cpu: "4"
        memory: "8Gi"
      min:                # minimum (prevents overly small requests)
        cpu: "10m"
        memory: "16Mi"
```
Deploy LimitRange before deploying any workloads in a new namespace. Retroactively applying LimitRange only affects new pods — existing pods keep their original (possibly zero) requests until they’re restarted.
## ResourceQuota: Namespace-Level Spending Caps
ResourceQuota enforces aggregate limits across all resources in a namespace. It’s your governance layer for multi-team clusters.
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-alpha
spec:
  hard:
    # Compute
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "32"
    limits.memory: 64Gi
    # Pods and objects
    pods: "100"
    services: "20"
    persistentvolumeclaims: "20"
    # Storage
    requests.storage: "200Gi"
    # NodePort services (expensive)
    services.nodeports: "0"  # block NodePort services entirely
```
With this ResourceQuota, if team-alpha tries to deploy a pod that would push their total CPU requests above 8 vCPU, the deployment fails. This forces teams to right-size or request quota increases.
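The rejection shows up as an admission error at deploy time. The message looks roughly like this (exact wording varies by Kubernetes version, and the pod name here is illustrative):

```
Error from server (Forbidden): error when creating "pod.yaml":
pods "batch-worker" is forbidden: exceeded quota: team-quota,
requested: requests.cpu=2, used: requests.cpu=7, limited: requests.cpu=8
```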
Useful patterns:
- Set `services.nodeports: "0"` to prevent teams from creating NodePort services (expensive and a security risk)
- Set storage quotas to prevent runaway PVC creation
- Track quota utilization with `kubectl describe resourcequota -n <namespace>`
## Vertical Pod Autoscaler: Automatic Right-Sizing
Vertical Pod Autoscaler (VPA) watches historical resource usage and recommends or automatically adjusts CPU and memory requests.
VPA operates in four modes:
- Off — the VPA computes and publishes recommendations but never applies them (useful for consuming them via the API)
- Initial — applies recommendations only at pod creation; never touches running pods
- Recreate — evicts and recreates pods to apply new recommendations (disruption risk)
- Auto — currently equivalent to Recreate; intended to switch to restart-free in-place updates once that mechanism is available
Start with Recommender mode (updateMode: Off):
```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # recommendations only, no changes
  resourcePolicy:
    containerPolicies:
      - containerName: my-app
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
        controlledResources: ["cpu", "memory"]
```
After 24-48 hours of traffic, check recommendations:
```bash
kubectl describe vpa my-app-vpa -n production
```
Output includes:
```
Recommendation:
  Container Recommendations:
    Container Name:  my-app
    Lower Bound:
      Cpu:     80m
      Memory:  120Mi
    Target:
      Cpu:     150m
      Memory:  200Mi
    Upper Bound:
      Cpu:     400m
      Memory:  600Mi
```
Use the Target values as your new requests. Apply them to your deployment manifests and merge the change — no need to enable VPA Auto mode initially.
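Applied to the deployment manifest, the Target values above become the new requests. The limits below follow the earlier 2-4x guidance and are an illustrative choice, not part of the VPA output:

```yaml
resources:
  requests:
    cpu: "150m"      # VPA Target
    memory: "200Mi"  # VPA Target
  limits:
    cpu: "450m"      # ~3x request, illustrative headroom for bursts
    memory: "400Mi"  # 2x request; keep tighter if OOMKills are costly
```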
When to use Auto mode: after you’ve validated recommendations in Off mode for several weeks and are confident the recommendations are accurate. Auto mode is particularly useful for batch workloads or development environments where occasional pod restarts are acceptable.
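When you do make that switch, the change is a single field. A sketch, assuming a recent VPA version that supports the `minReplicas` eviction guard:

```yaml
updatePolicy:
  updateMode: "Auto"
  minReplicas: 2  # only attempt evictions when at least 2 replicas are alive
```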
## HPA vs VPA: When to Use Each
Horizontal Pod Autoscaler (HPA) adds or removes pod replicas. VPA adjusts resource requests on existing pods. They solve different problems.
| Scenario | Use HPA | Use VPA |
|---|---|---|
| Stateless web service with variable traffic | Yes | Optionally for initial sizing |
| Java app with unpredictable memory growth | No | Yes |
| Batch job with variable input size | No | Yes |
| StatefulSets (databases) | No (usually) | Yes, with care |
| Microservice needing 99th percentile latency SLO | Yes | Optionally |
The conflict: HPA and VPA can fight each other on the same target. If HPA is scaling up replicas and VPA is simultaneously trying to change resource requests (which requires pod restart), you get unnecessary churn.
Resolution: use HPA for scaling (replicas) and VPA in Off/Initial mode for right-sizing requests. This separates concerns — HPA handles load, VPA informs your resource spec decisions.
Don’t run HPA and VPA in Auto/Recreate/Initial mode on the same deployment. It causes conflicts. Exception: VPA and HPA can coexist if VPA only controls memory and HPA controls CPU-based scaling.
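A sketch of that exception, with HPA scaling replicas on CPU utilization while VPA manages only memory requests (all names illustrative):

```yaml
# VPA: memory only
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa-mem
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
      - containerName: my-app
        controlledResources: ["memory"]  # leave CPU alone
---
# HPA: replica scaling on CPU utilization only
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```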
## Node Pool Strategy for Mixed Workloads
Different workloads have radically different resource profiles. Designing node pools for homogeneity (one size for everything) results in poor bin-packing and higher costs than a heterogeneous pool strategy.
Recommended node pool architecture:
Pool 1: General-purpose on-demand
- Instance type: m6i.xlarge (4 vCPU / 16GB) or equivalent
- Purpose: stateful services, databases, single-replica critical workloads
- Taints: none (accept any workload)
Pool 2: Spot nodes for stateless services
- Instance types: m6i.2xlarge, m5.2xlarge, m5n.2xlarge (multiple for availability)
- Purpose: stateless web services, APIs, workers with multiple replicas
- Taint: `spot=true:NoSchedule`
- Savings: 60-80% vs on-demand
Pool 3: High-memory nodes for data workloads
- Instance type: r6i.2xlarge (8 vCPU / 64GB) or larger
- Purpose: Java applications, in-memory caching, data processing
- Taint: `workload-type=memory-intensive:NoSchedule`
Pool 4: GPU nodes for ML/AI
- Instance type: g5.xlarge (NVIDIA A10G) or equivalent
- Purpose: LLM inference, model training, computer vision
- Taint: `nvidia.com/gpu=true:NoSchedule`
```yaml
# Taint configuration for the spot pool (set on the node group)
taints:
  - key: "spot"
    value: "true"
    effect: "NoSchedule"
```

```yaml
# Toleration for workloads that accept spot (pod spec)
tolerations:
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
```
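Tolerating the taint merely allows pods onto spot nodes; to *prefer* them while keeping on-demand as a fallback, pair the toleration with a soft node affinity. The `node-pool` label below is an assumption, so substitute whatever label your node groups actually carry:

```yaml
# Pod spec fragment: soft preference for spot nodes, with on-demand fallback
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        preference:
          matchExpressions:
            - key: node-pool      # assumed label on your spot node group
              operator: In
              values: ["spot"]
```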
## Cluster Autoscaler Configuration
Cluster Autoscaler (CA) automatically adjusts the number of nodes. Key configuration parameters:
```yaml
# cluster-autoscaler command flags
- --scale-down-delay-after-add=5m          # wait 5 min after scale-up before scaling down
- --scale-down-unneeded-time=10m           # how long a node must be unneeded before scale-down
- --scale-down-utilization-threshold=0.5   # scale down if node utilization < 50%
- --balance-similar-node-groups=true       # balance between similar node groups
- --skip-nodes-with-system-pods=false      # allow scale-down of nodes with system pods
- --expander=least-waste                   # prefer node type that wastes least CPU/memory
```
The least-waste expander is important for cost optimization — it chooses the node group that minimizes wasted (unscheduled) capacity when scaling up, rather than defaulting to the first available group.
## Bin-Packing Optimization
The Kubernetes scheduler’s default behavior is LeastAllocated — it spreads pods across nodes. For cost optimization, MostAllocated (bin-packing) is often better — it fills up nodes before using new ones, allowing CA to scale down under-utilized nodes.
Configure this via a scheduler profile in KubeSchedulerConfiguration (the `v1` API shown below requires Kubernetes 1.25+; clusters on 1.23-1.24 use `v1beta3`):
```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated  # bin-pack instead of spread
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```
Bin-packing increases node utilization and reduces the number of nodes Cluster Autoscaler needs to provision — directly reducing your node bill.
## Node Sizing in Practice
Right-sizing is iterative. The practical workflow:
1. Deploy workloads with reasonable initial requests (VPA recommendations from a staging environment)
2. Run for 2 weeks in production and collect metrics
3. Check VPA recommendations in Off mode
4. Update requests in staging, validate, then promote to production
5. Review node utilization in Kubecost, targeting 60-80% average utilization (below that is waste; above it risks OOM pressure)
6. Adjust node instance types if the optimal requests point to a different instance family
Use our Cluster Sizing Guide tool to calculate optimal node types and counts for your specific workload profile.
For hands-on cluster optimization, our team at kubernetes.ae conducts resource audits and implements right-sizing with guaranteed, measurable savings.
## Get Expert Kubernetes Help
Talk to a certified Kubernetes expert. Free 30-minute consultation — actionable findings within days.
Talk to an Expert