March 12, 2026 · 8 min read

Kubernetes Node Sizing: Right-Size Your Cluster and Cut Costs

Kubernetes node sizing decisions made at cluster creation tend to persist far longer than intended. The instance type chosen during initial setup, the number of node groups, and the resource request patterns set by developers compound over months, often resulting in clusters where 70% of node capacity is reserved by requests while only 30% is actually used.

This guide covers the mechanics of Kubernetes resource requests and limits, how the Vertical Pod Autoscaler recommends right-sized values, and how to design node pools for different workload types.


CPU and Memory Requests vs Limits

Understanding the difference between requests and limits is fundamental to both scheduling efficiency and cost optimization.

Requests — the amount of CPU/memory guaranteed to the container. The scheduler uses requests to decide which node a pod lands on. If a node doesn’t have enough allocatable resources to satisfy the request, the pod won’t be scheduled there.

Limits — the maximum the container can use. For CPU, exceeding the limit results in throttling (the container continues running but gets less CPU time). For memory, exceeding the limit results in the container being OOMKilled.

resources:
  requests:
    cpu: "250m"      # 0.25 vCPU guaranteed
    memory: "256Mi"  # 256MB guaranteed
  limits:
    cpu: "1000m"     # 1 vCPU maximum
    memory: "512Mi"  # 512MB maximum (exceed = OOMKill)

Why this matters for node sizing:

A node with 4 vCPU and 8GB memory can schedule pods whose requests sum to approximately:

  • CPU: ~3.8 vCPU (accounting for system pods and kubelet overhead)
  • Memory: ~7GB

If pods have high limits but low requests, you can pack many pods onto a node — but if they all spike simultaneously, you’ll hit CPU throttling or OOMKills. This is the over-commitment trade-off.
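The packing arithmetic above can be sketched in a few lines. This is a simplified model: the function name and the reserved-capacity defaults are illustrative, and real kubelet/system reservations vary by provider and node size.

```python
# Rough pod-packing math for a node. The reserved defaults (200m CPU,
# 1 GiB memory for system pods and kubelet) are illustrative only.
def pods_per_node(node_cpu_m, node_mem_mi, pod_cpu_m, pod_mem_mi,
                  reserved_cpu_m=200, reserved_mem_mi=1024):
    """How many pods fit by requests; the scarcer resource wins."""
    alloc_cpu = node_cpu_m - reserved_cpu_m
    alloc_mem = node_mem_mi - reserved_mem_mi
    return min(alloc_cpu // pod_cpu_m, alloc_mem // pod_mem_mi)

# The 4 vCPU / 8 GiB node from above, pods requesting 250m / 256Mi:
# CPU is the binding dimension (15 pods), not memory (28 pods).
print(pods_per_node(4000, 8192, 250, 256))
```

Running the numbers this way before picking an instance type shows which resource you will strand: here memory sits half idle once CPU requests are exhausted.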

The common mistake: teams set requests and limits to the same value (either because their tooling suggests it, or because it’s “safe”). This eliminates over-commitment entirely and leads to dramatically under-utilized nodes. Setting limits 2-4x higher than requests is usually appropriate for bursty workloads.

Quality of Service classes (determined by request/limit relationship):

Class        Condition                               Behavior
Guaranteed   requests == limits for all containers   Last to be evicted
Burstable    limits > requests                       Middle priority eviction
BestEffort   No requests or limits set               First to be evicted

For critical production workloads, use Guaranteed class. For background/batch workloads, BestEffort or Burstable is acceptable.
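The class assignment can be mirrored in a short helper. This is a simplified sketch (the function name is ours): it ignores edge cases such as requests defaulting to limits when only limits are set.

```python
def qos_class(containers):
    """Infer the QoS class Kubernetes assigns from each container's
    'requests'/'limits' dicts. Simplified: ignores that requests
    default to limits when only limits are set."""
    if all(not c.get("requests") and not c.get("limits") for c in containers):
        return "BestEffort"
    if all(c.get("requests")
           and c.get("requests") == c.get("limits")
           and set(c["requests"]) == {"cpu", "memory"}
           for c in containers):
        return "Guaranteed"
    return "Burstable"

spec = {"requests": {"cpu": "250m", "memory": "256Mi"},
        "limits":   {"cpu": "250m", "memory": "256Mi"}}
print(qos_class([spec]))  # Guaranteed
```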


LimitRange: Safe Defaults for Every Container

LimitRange sets default resource requests and limits at the namespace level. Without a LimitRange, any container deployed without explicit resource specs gets zero requests and no limits: it can consume unbounded resources, runs as BestEffort, and the scheduler treats it as needing nothing, so it can land on nodes that are already full.

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
  - type: Container
    default:            # Applied when no limits specified
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:     # Applied when no requests specified
      cpu: "100m"
      memory: "128Mi"
    max:                # Hard maximum any container can request
      cpu: "4"
      memory: "8Gi"
    min:                # Minimum (prevents overly small requests)
      cpu: "10m"
      memory: "16Mi"

Deploy LimitRange before deploying any workloads in a new namespace. Retroactively applying LimitRange only affects new pods — existing pods keep their original (possibly zero) requests until they’re restarted.
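The defaulting behavior can be illustrated with a small sketch (simplified, and the function name is ours; real admission also defaults a container's request to its explicit limit when only the limit is set):

```python
def apply_defaults(container, default_limits, default_requests):
    """Mimic LimitRange admission defaulting: fill in each resource's
    limit/request only where the container leaves it unset."""
    return {
        "limits":   {**default_limits,   **container.get("limits", {})},
        "requests": {**default_requests, **container.get("requests", {})},
    }

# A container deployed with no resource spec picks up namespace defaults:
defaulted = apply_defaults({},
                           {"cpu": "500m", "memory": "512Mi"},
                           {"cpu": "100m", "memory": "128Mi"})
```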


ResourceQuota: Namespace-Level Spending Caps

ResourceQuota enforces aggregate limits across all resources in a namespace. It’s your governance layer for multi-team clusters.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-alpha
spec:
  hard:
    # Compute
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "32"
    limits.memory: 64Gi
    # Pods and objects
    pods: "100"
    services: "20"
    persistentvolumeclaims: "20"
    # Storage
    requests.storage: "200Gi"
    # Node port services (expensive)
    services.nodeports: "0"   # Block NodePort services entirely

With this ResourceQuota, if team-alpha tries to deploy a pod that would push their total CPU requests above 8 vCPU, the deployment fails. This forces teams to right-size or request quota increases.
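That admission check amounts to a per-resource comparison, sketched here with plain numeric units (millicores, MiB) and an illustrative function name:

```python
def quota_admits(used, hard, pod_requests):
    """ResourceQuota-style admission check: accept the pod only if every
    tracked resource stays at or below its hard cap."""
    return all(used.get(r, 0) + v <= hard[r]
               for r, v in pod_requests.items() if r in hard)

hard = {"requests.cpu": 8000, "requests.memory": 16384}  # 8 vCPU / 16 GiB
used = {"requests.cpu": 7800, "requests.memory": 8000}
# A 250m pod pushes CPU to 8050m, over the 8000m cap, so it is rejected.
print(quota_admits(used, hard, {"requests.cpu": 250, "requests.memory": 256}))
```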

Useful patterns:

  • Set services.nodeports: "0" to prevent teams from creating NodePort services (expensive and a security risk)
  • Set storage quotas to prevent runaway PVC creation
  • Track quota utilization with kubectl describe resourcequota -n <namespace>

Vertical Pod Autoscaler: Automatic Right-Sizing

Vertical Pod Autoscaler (VPA) watches historical resource usage and recommends or automatically adjusts CPU and memory requests.

VPA operates in four modes:

  • Off — computes and publishes recommendations but never applies them (ideal for observation)
  • Initial — applies recommendations only at pod creation; never touches running pods
  • Recreate — evicts pods so recommendations are applied on restart (disruption risk)
  • Auto — currently equivalent to Recreate; intended to use in-place updates once available

Start in recommendation-only mode (updateMode: "Off"):

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"    # Recommendations only, no changes
  resourcePolicy:
    containerPolicies:
    - containerName: my-app
      minAllowed:
        cpu: 50m
        memory: 64Mi
      maxAllowed:
        cpu: 4
        memory: 8Gi
      controlledResources: ["cpu", "memory"]

After 24-48 hours of traffic, check recommendations:

kubectl describe vpa my-app-vpa -n production

Output includes:

  Recommendation:
    Container Recommendations:
      Container Name:  my-app
      Lower Bound:     cpu: 80m, memory: 120Mi
      Target:          cpu: 150m, memory: 200Mi
      Upper Bound:     cpu: 400m, memory: 600Mi

Use the Target values as your new requests. Apply them to your deployment manifests and merge the change — no need to enable VPA Auto mode initially.
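If you prefer to pull the Target values programmatically rather than reading kubectl describe output, the status object (the shape returned by `kubectl get vpa -o json`) can be parsed with a small helper (the function name is ours):

```python
def vpa_targets(vpa_status):
    """Extract per-container Target recommendations from a VPA
    .status object."""
    recs = (vpa_status.get("recommendation", {})
                      .get("containerRecommendations", []))
    return {r["containerName"]: r["target"] for r in recs}

# Trimmed-down status matching the describe output above:
status = {"recommendation": {"containerRecommendations": [
    {"containerName": "my-app",
     "target": {"cpu": "150m", "memory": "200Mi"}}]}}
print(vpa_targets(status)["my-app"])
```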

When to use Auto mode: after you’ve validated recommendations in Off mode for several weeks and are confident the recommendations are accurate. Auto mode is particularly useful for batch workloads or development environments where occasional pod restarts are acceptable.


HPA vs VPA: When to Use Each

Horizontal Pod Autoscaler (HPA) adds or removes pod replicas. VPA adjusts resource requests on existing pods. They solve different problems.

Scenario                                           Use HPA        Use VPA
Stateless web service with variable traffic        Yes            Optionally for initial sizing
Java app with unpredictable memory growth          No             Yes
Batch job with variable input size                 No             Yes
Stateful sets (databases)                          No (usually)   Yes, with care
Microservice needing 99th percentile latency SLO   Yes            Optionally

The conflict: HPA and VPA can fight each other on the same target. If HPA is scaling up replicas and VPA is simultaneously trying to change resource requests (which requires pod restart), you get unnecessary churn.

Resolution: use HPA for scaling (replicas) and VPA in Off/Initial mode for right-sizing requests. This separates concerns — HPA handles load, VPA informs your resource spec decisions.

Don’t run HPA and VPA in Auto, Recreate, or Initial mode on the same deployment when both act on CPU or memory — they fight over the same signal. Exceptions: an HPA driven by custom or external metrics can coexist with VPA, and a VPA controlling only memory can coexist with CPU-based HPA scaling.


Node Pool Strategy for Mixed Workloads

Different workloads have radically different resource profiles. Designing node pools for homogeneity (one size for everything) results in poor bin-packing and higher costs than a heterogeneous pool strategy.

Recommended node pool architecture:

Pool 1: General-purpose on-demand

  • Instance type: m6i.xlarge (4 vCPU / 16GB) or equivalent
  • Purpose: stateful services, databases, single-replica critical workloads
  • Taints: none (accept any workload)

Pool 2: Spot nodes for stateless services

  • Instance types: m6i.2xlarge, m5.2xlarge, m5n.2xlarge (multiple for availability)
  • Purpose: stateless web services, APIs, workers with multiple replicas
  • Taint: spot=true:NoSchedule
  • Savings: 60-80% vs on-demand

Pool 3: High-memory nodes for data workloads

  • Instance type: r6i.2xlarge (8 vCPU / 64GB) or larger
  • Purpose: Java applications, in-memory caching, data processing
  • Taint: workload-type=memory-intensive:NoSchedule

Pool 4: GPU nodes for ML/AI

  • Instance type: g5.xlarge (NVIDIA A10G) or equivalent
  • Purpose: LLM inference, model training, computer vision
  • Taint: nvidia.com/gpu=true:NoSchedule

# Taint configuration for spot pool
taints:
  - key: "spot"
    value: "true"
    effect: "NoSchedule"

# Toleration for workloads that accept spot
tolerations:
  - key: "spot"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
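The matching rule those two snippets rely on can be sketched as follows (simplified: Equal operator and NoSchedule effect only, ignoring Exists tolerations and tolerationSeconds; the function names are ours):

```python
def tolerates(pod_tolerations, node_taints):
    """Can the pod schedule onto the node? Every NoSchedule taint
    must be matched by some toleration on the pod."""
    def matches(tol, taint):
        return (tol.get("key") == taint["key"]
                and tol.get("operator", "Equal") == "Equal"
                and tol.get("value") == taint["value"]
                and tol.get("effect", taint["effect"]) == taint["effect"])
    return all(any(matches(t, taint) for t in pod_tolerations)
               for taint in node_taints if taint["effect"] == "NoSchedule")

spot = [{"key": "spot", "value": "true", "effect": "NoSchedule"}]
# A pod with the matching toleration lands on the spot pool;
# one without it is repelled.
print(tolerates([{"key": "spot", "operator": "Equal",
                  "value": "true", "effect": "NoSchedule"}], spot))
```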

Cluster Autoscaler Configuration

Cluster Autoscaler (CA) automatically adjusts the number of nodes. Key configuration parameters:

# cluster-autoscaler command flags
- --scale-down-delay-after-add=5m          # Wait 5 min after scale-up before scaling down
- --scale-down-unneeded-time=10m           # How long a node is unneeded before scale-down
- --scale-down-utilization-threshold=0.5   # Scale down if node utilization < 50%
- --balance-similar-node-groups=true       # Balance between similar node groups
- --skip-nodes-with-system-pods=false      # Allow scale-down of nodes with system pods
- --expander=least-waste                   # Prefer node type that wastes least CPU/memory

The least-waste expander is important for cost optimization — it chooses the node group that minimizes wasted (unscheduled) capacity when scaling up, rather than defaulting to the first available group.
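The idea behind least-waste can be sketched in a few lines (a simplification of the real expander, which simulates scheduling the whole pending set; names and weighting are illustrative):

```python
def least_waste(node_types, pending_cpu_m, pending_mem_mi):
    """Pick the node type that leaves the least unused capacity for the
    pending pods, as the mean unused fraction of CPU and memory."""
    fits = [n for n in node_types
            if n["cpu_m"] >= pending_cpu_m and n["mem_mi"] >= pending_mem_mi]
    def wasted(n):
        return ((1 - pending_cpu_m / n["cpu_m"])
                + (1 - pending_mem_mi / n["mem_mi"])) / 2
    return min(fits, key=wasted)["name"]

pools = [{"name": "m6i.xlarge",  "cpu_m": 4000, "mem_mi": 16384},
         {"name": "m6i.2xlarge", "cpu_m": 8000, "mem_mi": 32768}]
# 3.5 vCPU / 8 GiB of pending pods fills the xlarge much better
# than the 2xlarge, so the smaller node wins.
print(least_waste(pools, 3500, 8192))
```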


Bin-Packing Optimization

The Kubernetes scheduler’s default behavior is LeastAllocated — it spreads pods across nodes. For cost optimization, MostAllocated (bin-packing) is often better — it fills up nodes before using new ones, allowing CA to scale down under-utilized nodes.

Configure via KubeSchedulerConfiguration (the v1 API shown below is available from Kubernetes 1.25; releases 1.23-1.24 use v1beta3):

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated   # Bin-pack instead of spread
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1

Bin-packing increases node utilization and reduces the number of nodes Cluster Autoscaler needs to provision — directly reducing your node bill.
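The MostAllocated score is essentially a weighted average of requested-over-allocatable, sketched here (a simplification of the actual NodeResourcesFit plugin; the weights mirror the scoringStrategy resources above):

```python
def most_allocated_score(req_cpu_m, alloc_cpu_m, req_mem_mi, alloc_mem_mi,
                         cpu_weight=1, mem_weight=1):
    """Score 0-100; fuller nodes score higher, so new pods pack onto
    partly-used nodes instead of spreading to empty ones."""
    cpu = req_cpu_m / alloc_cpu_m
    mem = req_mem_mi / alloc_mem_mi
    return 100 * (cpu_weight * cpu + mem_weight * mem) / (cpu_weight + mem_weight)

# A half-full node scores 50 while an empty node scores 0, so the
# half-full node fills up first and the empty one can be scaled down.
print(most_allocated_score(2000, 4000, 4096, 8192))
```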


Node Sizing in Practice

Right-sizing is iterative. The practical workflow:

  1. Deploy workloads with reasonable initial requests (VPA recommendations from a staging environment)
  2. Run for 2 weeks in production and collect metrics
  3. Check VPA recommendations in Off mode
  4. Update requests in staging, validate, then promote to production
  5. Review node utilization in Kubecost — target 60-80% average utilization (below is waste, above risks OOM pressure)
  6. Adjust node instance types if the optimal requests point to a different instance family

Use our Cluster Sizing Guide tool to calculate optimal node types and counts for your specific workload profile.

For hands-on cluster optimization, our team at kubernetes.ae conducts resource audits and implements right-sizing with guaranteed, measurable savings.

Get Expert Kubernetes Help

Talk to a certified Kubernetes expert. Free 30-minute consultation — actionable findings within days.

Talk to an Expert