How to Set Kubernetes Resource Requests and Limits Without Throttling Apps

Balance Kubernetes resource requests and limits to reduce throttling and wasted capacity.

Arthur Azrieli

Kubernetes resource requests and limits often get tuned during an incident, when pods are pending, nodes are packed too tightly, or latency jumps after a rollout. The usual pressure is simple: stop wasting capacity without starving the application. The hard part is that CPU and memory behave very differently, and a safe-looking limit can create throttling long before a node is actually busy.

This guide walks through a practical way to set requests and limits, check for throttling, and adjust workloads without turning every deployment into a resource guessing game.

Understand what Kubernetes actually does with requests and limits

Before tuning numbers, separate scheduling behavior from runtime behavior.

  • CPU request: The amount of CPU Kubernetes uses when scheduling the pod onto a node. It also affects the CPU share the container gets when the node is under CPU pressure.
  • CPU limit: The maximum CPU time the container can use in each scheduler period (100 ms by default). If the process tries to use more, the Linux Completely Fair Scheduler (CFS) quota throttles it until the next period starts.
  • Memory request: The amount of memory Kubernetes uses when scheduling the pod.
  • Memory limit: The maximum memory the container can use. If it exceeds the limit, the kernel out-of-memory (OOM) killer can terminate it, and Kubernetes reports the container as OOMKilled.
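
To make the CPU limit behavior concrete, here is a minimal Python sketch of the quota arithmetic. It assumes the default 100 ms CFS period, and the function names are hypothetical helpers for illustration, not part of any Kubernetes API.

```python
# Sketch: how a CPU limit translates into CFS quota per period.
# Assumes the default 100 ms period; a cluster can be configured differently.

CFS_PERIOD_US = 100_000  # 100 ms expressed in microseconds


def cpu_limit_to_quota_us(limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """Return the CPU time (microseconds) the container may use in each period."""
    return limit_millicores * period_us // 1000


def throttled_time_us(demand_us: int, limit_millicores: int, period_us: int = CFS_PERIOD_US) -> int:
    """Return how much of one period's CPU demand has to wait for a later period."""
    quota = cpu_limit_to_quota_us(limit_millicores, period_us)
    return max(0, demand_us - quota)


if __name__ == "__main__":
    print(cpu_limit_to_quota_us(500))      # 50000 -> 50 ms of CPU time per 100 ms
    print(throttled_time_us(80_000, 500))  # 30000 -> 30 ms of work is delayed
```

A multi-threaded process can spend its whole quota in a fraction of the wall-clock period, which is one reason throttling shows up even when average usage looks low.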

CPU is compressible: the kernel can slow a container down when it hits its quota. Memory is not: if the process needs memory and cannot get it, it may be killed.

That distinction drives most good defaults:

  • Be careful with CPU limits on latency-sensitive services.
  • Set memory limits for most workloads, but leave enough headroom for spikes, garbage collection, caches, and startup behavior.
  • Use requests to describe normal operating needs, not worst-case peaks.
  • Use autoscaling for changing demand instead of setting huge static requests.

Start with observability, not guesses

If the app already runs in Kubernetes, collect usage and throttling data before changing manifests. A few minutes of data can be misleading. Use a window that includes startup, normal traffic, background jobs, and any scheduled bursts.

Start with basic checks:

kubectl top pod -n production

kubectl top pod -n production --containers

kubectl describe pod -n production my-app-abc123

Look for current requests, limits, restarts, OOM kills, and scheduling events:

kubectl get pod -n production my-app-abc123 -o jsonpath='{range .spec.containers[*]}{.name}{" requests="}{.resources.requests}{" limits="}{.resources.limits}{"\n"}{end}'

kubectl get pod -n production my-app-abc123 -o jsonpath='{range .status.containerStatuses[*]}{.name}{" restartCount="}{.restartCount}{" lastState="}{.lastState}{"\n"}{end}'

If you use Prometheus with kubelet and cAdvisor metrics, check CPU usage and throttling. Metric names can vary by Kubernetes version, runtime, and scrape configuration, so confirm them in your own Prometheus first.

CPU usage by container:

sum by (namespace, pod, container) (
  rate(container_cpu_usage_seconds_total{
    namespace="production",
    container!="",
    image!=""
  }[5m])
)

CPU throttling ratio:

sum by (namespace, pod, container) (
  rate(container_cpu_cfs_throttled_periods_total{
    namespace="production",
    container!="",
    image!=""
  }[5m])
)
/
sum by (namespace, pod, container) (
  rate(container_cpu_cfs_periods_total{
    namespace="production",
    container!="",
    image!=""
  }[5m])
)

Memory working set:

max by (namespace, pod, container) (
  container_memory_working_set_bytes{
    namespace="production",
    container!="",
    image!=""
  }
)

OOM kills from kube-state-metrics, if available (the == 1 filter drops zero-valued series that some versions export for other reasons):

kube_pod_container_status_last_terminated_reason{
  namespace="production",
  reason="OOMKilled"
} == 1

Do not tune from average CPU alone. A service can average low CPU while still getting throttled during request bursts, garbage collection, TLS handshakes, JSON serialization, or short background tasks.
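
As a rough illustration of why the average hides bursts, the Python sketch below summarizes a set of CPU samples and computes a throttling ratio from counter deltas. The sample values, the 500m limit, and the helper names are made up for the example.

```python
# Sketch: average CPU can look healthy while short bursts exceed the limit.
# All numbers here are illustrative, not measurements from a real cluster.


def throttling_ratio(throttled_periods_delta: int, total_periods_delta: int) -> float:
    """Fraction of CFS periods in the window in which the container was throttled."""
    if total_periods_delta <= 0:
        return 0.0
    return throttled_periods_delta / total_periods_delta


def summarize_cpu(samples_millicores: list[int], limit_millicores: int) -> dict:
    """Compare the mean with the peak and count samples above the limit."""
    return {
        "mean_m": sum(samples_millicores) / len(samples_millicores),
        "peak_m": max(samples_millicores),
        "samples_over_limit": sum(1 for s in samples_millicores if s > limit_millicores),
    }


if __name__ == "__main__":
    samples = [100] * 8 + [900, 900]  # mostly idle, two short bursts
    print(summarize_cpu(samples, limit_millicores=500))
    print(throttling_ratio(throttled_periods_delta=60, total_periods_delta=300))
```

In this made-up window the mean is 260m, well under a 500m limit, yet two of the ten samples would have hit the quota.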

Set CPU requests and limits without causing avoidable throttling

CPU limits are the most common source of surprise. A container can be throttled even when the node still has idle CPU, because the limit is enforced at the container cgroup level.

A practical CPU process:

  1. Find steady-state CPU usage. Use normal traffic windows, not deployment warmup only.
  2. Find burst behavior. Check short windows such as 1 minute and 5 minutes. Latency-sensitive apps often care about short spikes.
  3. Set the CPU request near sustained need. This gives the scheduler a realistic placement signal.
  4. Avoid CPU limits for latency-sensitive services when your platform policy allows it. Let the workload burst if the node has spare CPU.
  5. If you must set CPU limits, set them high enough for real bursts and test under load. A limit equal to the request is often too tight for web APIs and workers with bursty CPU.
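
The steps above can be sketched in Python as a starting-point calculation. The percentile choices (75th for the request, 99th for bursts) and the 1.5x burst headroom are assumptions for illustration, not recommendations from Kubernetes; pick values that match your own latency targets.

```python
# Sketch: derive a CPU request (and a limit, only if policy requires one)
# from usage samples in millicores. Thresholds are illustrative assumptions.
import math


def percentile(samples: list[int], pct: float) -> int:
    """Nearest-rank percentile for pct in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_cpu(samples_millicores: list[int],
                request_pct: float = 75,
                burst_pct: float = 99,
                burst_headroom: float = 1.5) -> dict:
    """Request near sustained need; optional limit sized above real bursts."""
    request = percentile(samples_millicores, request_pct)
    limit = math.ceil(percentile(samples_millicores, burst_pct) * burst_headroom)
    return {"request_m": request, "limit_if_required_m": limit}


if __name__ == "__main__":
    samples = [200, 220, 240, 250, 250, 260, 270, 300, 600, 700]
    print(suggest_cpu(samples))  # {'request_m': 300, 'limit_if_required_m': 1050}
```

Treat the output as a first proposal to validate under load, not as a final value.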

Example for a service that usually needs around 250 millicores and bursts higher during traffic spikes:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
  namespace: production
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: checkout-api
          image: example.com/checkout-api:1.2.3
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              memory: "768Mi"

This example intentionally omits a CPU limit. That does not mean every workload should omit CPU limits. It means you should avoid adding tight CPU quotas to services that need short bursts to keep latency stable.

CPU limits may still make sense for:

  • Untrusted workloads in shared clusters.
  • Batch jobs that can run slower without user impact.
  • Development namespaces where runaway CPU use can disrupt other teams.
  • Clusters with strict chargeback or tenancy rules.

If your organization requires CPU limits, start with a limit higher than the request and validate it with throttling data:

resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    cpu: "1000m"
    memory: "768Mi"

After rollout, watch throttling and application latency together. Throttling by itself is not always an outage, but throttling plus rising p95 or p99 latency is a clear signal that the CPU limit is too low or the app needs more replicas.

Set memory requests and limits with enough headroom

Memory tuning has a different failure mode. If memory usage exceeds the limit, Kubernetes does not slow the process down. The container can be killed. If this happens during a rollout, new pods may never become ready and the deployment can stall.

A practical memory process:

  1. Measure working set under normal load. Use container memory working set, not only node memory.
  2. Include startup behavior. Some runtimes allocate more memory during boot, dependency loading, just-in-time compilation, or cache warmup.
  3. Include background jobs. A web container that also runs scheduled work may have memory spikes outside peak request traffic.
  4. Set the request to realistic normal usage. This helps the scheduler place the pod safely.
  5. Set the limit above observed spikes. Leave enough room for allocator behavior, garbage collection, and request bursts.
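
A minimal Python sketch of the same process follows. It assumes working-set samples in MiB, uses the median as "normal usage", and applies a 25% headroom above the largest observed spike; those choices and the function name are assumptions for illustration.

```python
# Sketch: derive memory request and limit from working-set samples (MiB).
# The median and the 25% headroom are illustrative assumptions.
import math
from statistics import median


def suggest_memory(steady_mib: list[int], startup_peak_mib: int, headroom: float = 0.25) -> dict:
    """Request near normal usage; limit above the largest spike, startup included."""
    request = math.ceil(median(steady_mib))
    largest_spike = max(max(steady_mib), startup_peak_mib)
    limit = math.ceil(largest_spike * (1 + headroom))
    return {"request_mib": request, "limit_mib": limit}


if __name__ == "__main__":
    steady = [900, 950, 1000, 1020, 1100]
    print(suggest_memory(steady, startup_peak_mib=1200))
    # {'request_mib': 1000, 'limit_mib': 1500}
```

Passing the startup peak separately keeps boot-time allocation from being averaged away by a long steady-state window.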

Example:

resources:
  requests:
    cpu: "300m"
    memory: "1Gi"
  limits:
    memory: "1536Mi"

For memory-heavy workloads, do not copy limits between services just because the images look similar. A Java API, a Node.js API, and a Python worker can have very different memory behavior even when they serve the same product area.

If you see OOM kills, confirm whether the container exceeded its own limit or the node was under pressure:

kubectl describe pod -n production my-app-abc123

kubectl get events -n production \
  --field-selector involvedObject.name=my-app-abc123 \
  --sort-by=.lastTimestamp

Look for output such as:

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
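
Exit codes above 128 conventionally mean the process was ended by a signal, with the signal number equal to the code minus 128. The short Python sketch below decodes that; the helper name is hypothetical, and it assumes a Linux host for the signal numbering.

```python
# Sketch: decode a container exit code into the signal that ended the process.
# Assumes Linux signal numbers (137 - 128 = 9, which is SIGKILL).
import signal


def describe_exit_code(code: int) -> str:
    """Return a short description of what an exit code usually means."""
    if code > 128:
        try:
            return f"terminated by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"terminated by unknown signal {code - 128}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))  # terminated by SIGKILL
    print(describe_exit_code(143))  # terminated by SIGTERM
```

Exit code 137 alone only shows that the process received SIGKILL. The Reason field is what identifies the kill as OOMKilled.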

Common fixes:

  • Raise the memory limit if the app is healthy but underprovisioned.
  • Raise the memory request if the scheduler places too many similar pods on the same node.
  • Split background workers from request-serving containers if one path causes memory spikes.
  • Fix application leaks if memory grows without returning to a stable range.
  • Reduce in-process cache sizes if each replica duplicates too much data.

Use quality of service classes intentionally

Kubernetes assigns a Quality of Service (QoS) class to each pod based on requests and limits. This affects eviction order when a node runs out of resources.

  • Guaranteed: Every container has CPU and memory requests and limits, and each request equals its limit.
  • Burstable: At least one container has a CPU or memory request or limit, but the pod does not meet the Guaranteed rules.
  • BestEffort: No container has any CPU or memory request or limit.

Most production application pods should be Burstable. That lets you set realistic requests while still allowing CPU bursting if you omit CPU limits or set them higher than requests.
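
The classification rules can be sketched in Python as follows. This is a simplified model, not the kubelet's implementation: it looks only at CPU and memory on regular containers, parses only a few quantity suffixes (m, Ki, Mi, Gi), and represents each container as a plain dict, which is an assumption of the example.

```python
# Sketch: simplified QoS classification from per-container requests and limits.
# Not the kubelet's code; handles only cpu/memory and a subset of quantity suffixes.


def parse_quantity(value: str) -> float:
    """Parse quantities such as "250m", "1", "512Mi", or "1Gi" into a number."""
    for suffix, factor in {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}.items():
        if value.endswith(suffix):
            return float(value[:-2]) * factor
    if value.endswith("m"):
        return float(value[:-1]) / 1000
    return float(value)


def qos_class(containers: list[dict]) -> str:
    """Return "Guaranteed", "Burstable", or "BestEffort" for a list of containers."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is set, Kubernetes defaults the request to the limit.
            request = requests.get(resource, limit)
            if limit is None or parse_quantity(request) != parse_quantity(limit):
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    checkout = {"requests": {"cpu": "250m", "memory": "512Mi"}, "limits": {"memory": "768Mi"}}
    print(qos_class([checkout]))  # Burstable
```

Equivalent quantities compare as equal here, so a request of "1" CPU with a limit of "1000m" still counts toward Guaranteed.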

Check a pod QoS class:

kubectl get pod -n production my-app-abc123 -o jsonpath='{.status.qosClass}{"\n"}'

Avoid BestEffort for production services. The scheduler has no useful resource signal, and the pod is a strong eviction candidate under pressure.

Guaranteed can work for critical infrastructure components when you know exact needs and want strict reservation. It can also waste capacity if you set high CPU limits equal to high CPU requests just to reach the class.

Roll out resource changes safely

Changing resource settings can affect scheduling, rollout speed, autoscaling, and node pressure. Treat it like a production change, not a YAML cleanup.

A safe workflow:

  1. Record current settings.
  2. Capture current CPU, memory, restarts, throttling, and latency.
  3. Change one resource dimension at a time when possible. For example, adjust CPU first, then memory.
  4. Roll out to a small set of replicas or one environment first.
  5. Watch scheduling events. Higher requests can make pods pending if nodes do not have allocatable capacity.
  6. Watch throttling, OOM kills, and application-level service indicators after rollout.
  7. Keep the old values ready for rollback.

Patch a deployment quickly during investigation:

kubectl set resources deployment/checkout-api \
  -n production \
  -c checkout-api \
  --requests=cpu=300m,memory=768Mi \
  --limits=memory=1Gi

Then verify the generated pod template:

kubectl get deployment checkout-api -n production -o yaml
kubectl rollout status deployment/checkout-api -n production
kubectl get pods -n production -l app=checkout-api

For repeatable changes, update the source manifest, Helm values, Kustomize patch, or Terraform code instead of leaving a manual patch in the cluster. If your team manages Kubernetes objects through Terraform, keep resource settings close to the deployment definition and review the plan before applying. This fits the same workflow used to deploy Kubernetes resources using Terraform.

Example Helm values pattern:

resources:
  requests:
    cpu: 300m
    memory: 768Mi
  limits:
    memory: 1Gi

Example Kustomize patch:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
spec:
  template:
    spec:
      containers:
        - name: checkout-api
          resources:
            requests:
              cpu: "300m"
              memory: "768Mi"
            limits:
              memory: "1Gi"

If you are provisioning cloud resources from inside Kubernetes, keep the boundary clear. Crossplane can manage cloud infrastructure, while standard Kubernetes controllers manage pods and deployments. The same separation applies when you deploy AWS resources using Crossplane on Kubernetes: infrastructure capacity and pod resource settings should be reviewed together, but they are still different control loops.

Account for autoscaling and cluster capacity

Requests drive scheduling, and many autoscaling setups use resource metrics. Bad requests can make autoscaling noisy or ineffective.

For a Horizontal Pod Autoscaler (HPA), CPU utilization is commonly calculated relative to CPU requests. If the request is too low, utilization looks high and the HPA may scale out too aggressively. If the request is too high, utilization looks low and the HPA may not add replicas soon enough.

Example HPA:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  minReplicas: 3
  maxReplicas: 12
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Before applying an HPA, confirm every target container has a CPU request. A Utilization target is a percentage of the request, so without CPU requests the HPA cannot compute utilization and CPU-based scaling will not work correctly.
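
To show how the request changes what the HPA sees, here is a small Python sketch of the documented scaling formula, desired = ceil(current replicas × current utilization / target utilization). It ignores the HPA tolerance band and stabilization windows, and the usage figures are made up for the example.

```python
# Sketch: the same CPU usage produces different HPA decisions at different requests.
# Ignores tolerance and stabilization behavior; numbers are illustrative.
import math


def cpu_utilization_pct(usage_millicores: float, request_millicores: float) -> float:
    """Utilization as the HPA sees it: usage as a percentage of the request."""
    return 100 * usage_millicores / request_millicores


def desired_replicas(current: int, utilization_pct: float, target_pct: float,
                     min_replicas: int = 3, max_replicas: int = 12) -> int:
    """Apply the scaling formula, then clamp to the configured replica bounds."""
    raw = math.ceil(current * utilization_pct / target_pct)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    usage = 210  # average millicores per pod, traffic held constant
    for request in (150, 250, 500):
        util = cpu_utilization_pct(usage, request)
        print(request, util, desired_replicas(3, util, target_pct=70))
    # 150 140.0 6
    # 250 84.0 4
    # 500 42.0 3
```

With traffic flat at 210m per pod, a 150m request doubles the replica count, while a 500m request keeps the deployment pinned at the minimum.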

Check requests across a namespace:

kubectl get pods -n production -o custom-columns='POD:.metadata.name,CONTAINERS:.spec.containers[*].name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,CPU_LIMIT:.spec.containers[*].resources.limits.cpu,MEM_LIMIT:.spec.containers[*].resources.limits.memory'

Also check whether your requests fit available node capacity:

kubectl describe nodes | grep -A 8 "Allocated resources"
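
The arithmetic behind that check can be sketched in Python. The example counts how many more pods of a given shape fit on one node by requests alone; it assumes CPU in millicores and memory in MiB, uses made-up node figures, and ignores everything else the scheduler considers, such as taints, affinity, and pod count limits.

```python
# Sketch: how many more pods fit on a node when only requests are considered.
# Node and pod figures are illustrative; real scheduling has more constraints.


def max_additional_pods(allocatable: dict, requested: dict, pod_request: dict) -> int:
    """Return the pod count allowed by the tightest resource in pod_request."""
    counts = []
    for resource, per_pod in pod_request.items():
        free = allocatable[resource] - requested[resource]
        counts.append(max(0, free // per_pod))
    return min(counts)


if __name__ == "__main__":
    allocatable = {"cpu_m": 3900, "memory_mib": 15000}
    requested = {"cpu_m": 3000, "memory_mib": 9000}
    print(max_additional_pods(allocatable, requested, {"cpu_m": 300, "memory_mib": 768}))  # 3
    print(max_additional_pods(allocatable, requested, {"cpu_m": 500, "memory_mib": 768}))  # 1
```

Raising the CPU request from 300m to 500m cuts the number of pods that fit from three to one, even though memory still has room for seven.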

Gotchas to watch:

  • Pods pending after raising requests: The cluster may lack allocatable CPU or memory for the new request.
  • HPA scaling after request changes: Changing CPU requests can change reported utilization even if traffic stays flat.
  • Cluster Autoscaler delay: Pods may wait for new nodes if requests no longer fit current nodes.
  • Namespace ResourceQuota failures: Higher requests or limits can exceed quota and block deployments.
  • LimitRange defaults: A namespace may inject default limits you did not set in your manifest.

Inspect namespace policies:

kubectl get resourcequota -n production
kubectl describe resourcequota -n production

kubectl get limitrange -n production
kubectl describe limitrange -n production

For heavier platform workloads such as workflow schedulers, resource settings deserve extra care because web servers, schedulers, and workers often need different profiles. If you are running Airflow on Kubernetes, review each component separately rather than applying one resource block everywhere. The same principle applies when you deploy Apache Airflow on Amazon Elastic Kubernetes Service.

Use a simple decision model

When you need a practical default, use this model:

  • Production web API: Set CPU and memory requests. Set a memory limit. Avoid CPU limits unless policy requires them.
  • Latency-sensitive service: Be extra cautious with CPU limits. Watch throttling and tail latency together.
  • Batch worker: Set CPU and memory requests. CPU limits are usually safer here than on request-serving paths.
  • Memory-heavy service: Set memory request close to normal usage and memory limit above real spikes. Investigate OOM kills quickly.
  • Shared development namespace: Use LimitRange defaults to prevent accidental BestEffort pods and runaway limits.
  • Autoscaled deployment: Make CPU requests realistic before trusting CPU-based HPA behavior.

A good starting manifest for many production services looks like this:

resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    memory: "768Mi"

A stricter shared-cluster version may look like this:

resources:
  requests:
    cpu: "250m"
    memory: "512Mi"
  limits:
    cpu: "1000m"
    memory: "768Mi"

Do not treat either block as universal. Treat it as a starting point, then validate with real usage, throttling, restarts, scheduling events, and application behavior.

Takeaway

Set requests to help Kubernetes schedule pods based on normal resource needs. Set memory limits to prevent one container from consuming too much memory. Be careful with CPU limits because they can throttle applications during short bursts even when the node has spare CPU.

The safest path is measured and repeatable: collect usage data, set realistic requests, avoid tight CPU limits for latency-sensitive services, leave memory headroom, roll changes through your normal delivery process, and verify the result with both Kubernetes metrics and application metrics.
