How to Configure Kubernetes PriorityClasses Without Starving Workloads

Configure Kubernetes scheduling priority while preserving capacity for lower-priority workloads.

Arthur Azrieli

Kubernetes PriorityClasses are easy to create and hard to govern. Once you give some pods a higher scheduling priority, the scheduler can place them first and may evict lower-priority pods to make room. That is useful during real capacity pressure, but it can also turn a normal incident into a cluster-wide starvation problem.

The goal is not to make every important workload “critical.” The goal is to give the scheduler enough signal to protect the right pods while preserving capacity for everything else that still needs to run.

How Kubernetes priority and preemption work

A PriorityClass is a cluster-scoped Kubernetes object that maps a class name to an integer value. Pods reference the class through priorityClassName. Higher values mean higher scheduling priority.

Priority affects two scheduler behaviors:

  • Scheduling order: when multiple pods are waiting, higher-priority pods move ahead in the scheduling queue.
  • Preemption: if a high-priority pod cannot fit, Kubernetes may evict lower-priority pods from a node so the high-priority pod can run.

A minimal PriorityClass looks like this:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: app-critical
value: 100000
globalDefault: false
description: "Critical application pods that should preempt lower-priority work when capacity is constrained."

A workload references it through its pod template, as in this Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      priorityClassName: app-critical
      containers:
        - name: app
          image: example.com/payments-api:1.0.0
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
            limits:
              cpu: "1"
              memory: "1Gi"

User-defined PriorityClass values can be any 32-bit integer up to 1000000000 (one billion), including negative values. Kubernetes reserves higher values for system-critical classes such as system-cluster-critical and system-node-critical. Do not use system-critical classes for application workloads.
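
For reference, the two built-in classes sit above the user range. The sketch below shows only their names and values, with descriptions omitted. It is meant for reading, not for applying.

```yaml
# Built-in classes created by Kubernetes itself (read-only reference, do not apply).
# Both values are above the 1000000000 ceiling for user-defined classes.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: system-cluster-critical
value: 2000000000
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: system-node-critical
value: 2000001000
```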

Design a small priority model before writing YAML

Most starvation problems start with too many priority levels or poorly defined rules. Keep the model small enough that engineers can choose correctly during a deploy review.

A practical starting model is:

  • Platform critical: cluster or platform components required for other workloads to function. Use carefully and keep the list short.
  • Application critical: production services that should keep serving traffic during constrained capacity.
  • Default: normal production workloads with no special preemption rights.
  • Opportunistic: batch jobs, experiments, one-off workers, and workloads that can wait.

Example PriorityClasses:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: platform-critical
value: 1000000
globalDefault: false
description: "Platform services required for cluster or workload operation."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: app-critical
value: 100000
globalDefault: false
description: "Production application pods that may preempt lower-priority workloads."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: workload-default
value: 0
globalDefault: true
description: "Default priority for regular workloads."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: opportunistic
value: -1000
globalDefault: false
description: "Best-effort jobs and non-urgent workloads."

Use wide gaps between values. You probably do not need values like 100, 101, and 102. Small differences create arguments without improving scheduling behavior. Large gaps leave room for future classes if you truly need them.

Be careful with globalDefault: true. Only one PriorityClass can be the global default. Pods created without priorityClassName receive that class at admission time; if no global default exists, they get priority 0. Pods that already exist keep the priority they were admitted with, even if you add or replace the default class later.
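
As an illustration, a pod created without priorityClassName in a cluster where workload-default is the global default would read back roughly like this after admission. The pod name is hypothetical and only the relevant fields are shown.

```yaml
# Fragment of a pod as stored after admission; the name is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: example-pod
spec:
  priorityClassName: workload-default  # filled in from the global default class
  priority: 0                          # copied from the class value at admission
```

The integer in spec.priority is what the scheduler compares, which is why later changes to the class do not reach pods that were already admitted.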

Prevent starvation with quotas and admission rules

PriorityClasses control scheduling preference. They do not control how many high-priority pods a team can create. If every namespace can create unlimited app-critical pods, lower-priority workloads will eventually lose during capacity pressure.

Use ResourceQuota with a PriorityClass scope to cap high-priority resource requests per namespace. This forces teams to reserve critical priority for the workloads that truly need it.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: app-critical-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "8"
    requests.memory: "16Gi"
    pods: "20"
  scopeSelector:
    matchExpressions:
      - scopeName: PriorityClass
        operator: In
        values:
          - app-critical

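In this example, the payments namespace can run up to 20 app-critical pods with a total of 8 requested CPU cores and 16 GiB of requested memory. Pods beyond those limits are rejected at admission instead of silently increasing starvation risk. For a Deployment, the rejection surfaces as a ReplicaSet that cannot create pods, not as Pending pods.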
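
The same mechanism can block a class outright in a namespace. The sketch below assumes the platform-critical class from the earlier examples and sets its pod count to zero in payments, so any pod requesting that class there is rejected. The quota name is hypothetical.

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: deny-platform-critical   # hypothetical name
  namespace: payments
spec:
  hard:
    pods: "0"                    # no pods of this class may be created here
  scopeSelector:
    matchExpressions:
      - scopeName: PriorityClass
        operator: In
        values:
          - platform-critical
```

A quota like this only restricts the namespace it lives in. Namespaces without one can still use the class.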

You can also create a quota for opportunistic work so batch jobs cannot consume the entire namespace:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: opportunistic-quota
  namespace: payments
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    pods: "100"
  scopeSelector:
    matchExpressions:
      - scopeName: PriorityClass
        operator: In
        values:
          - opportunistic

Pair quotas with policy checks. At minimum, require code review for these cases:

  • A workload sets priorityClassName: platform-critical.
  • A new PriorityClass is added.
  • A PriorityClass value is increased.
  • A namespace quota for high-priority workloads is increased.
  • A workload uses high priority without realistic CPU and memory requests.
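
Some of these checks can also be enforced at admission. The following is a minimal sketch using ValidatingAdmissionPolicy, assuming a cluster version where the admissionregistration.k8s.io/v1 API for it is available. The policy name and the namespace label key are hypothetical; adapt them to your own conventions.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: restrict-platform-critical   # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
  validations:
    # Reject pods that request the platform-critical class.
    - expression: "!has(object.spec.priorityClassName) || object.spec.priorityClassName != 'platform-critical'"
      message: "platform-critical is reserved for approved platform namespaces."
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: restrict-platform-critical
spec:
  policyName: restrict-platform-critical
  validationActions: ["Deny"]
  matchResources:
    # Apply the policy only to namespaces that lack the (hypothetical) allow label.
    namespaceSelector:
      matchExpressions:
        - key: platform.example.com/allow-platform-critical
          operator: DoesNotExist
```

With this binding, only namespaces that carry the allow label can create platform-critical pods, which turns labeling a namespace into the reviewable approval step.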

If you manage Kubernetes objects through infrastructure as code, keep PriorityClasses, quotas, and workload manifests in the same review path. For example, if your team provisions Kubernetes resources with Terraform, keep the policy close to the workload definitions instead of treating priority as an afterthought. See this related guide on deploying Kubernetes resources using Terraform.

Use non-preempting PriorityClasses when ordering is enough

Some workloads should be scheduled before others but should not evict running pods. Kubernetes supports this through preemptionPolicy: Never.

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: app-urgent-nonpreempting
value: 50000
globalDefault: false
preemptionPolicy: Never
description: "Urgent pods that should move ahead in the scheduling queue but should not evict lower-priority pods."

This is useful for workloads such as:

  • Time-sensitive jobs that can wait for a slot but should start before lower-priority queued jobs.
  • Deploy validation tasks that should not disrupt running production pods.
  • Internal tools that need faster scheduling but are not important enough to cause evictions.

Non-preempting priority still affects scheduling queue order. It does not create capacity. If the cluster is full, those pods remain pending until resources free up or autoscaling adds capacity.
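
A workload opts in the same way as with any other class. The sketch below is a hypothetical deploy validation Job that uses app-urgent-nonpreempting; the Job name, image, and request sizes are placeholders.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: deploy-validation        # hypothetical name
spec:
  backoffLimit: 1
  template:
    spec:
      # Moves ahead of lower-priority pending pods but never evicts running ones.
      priorityClassName: app-urgent-nonpreempting
      restartPolicy: Never
      containers:
        - name: validate
          image: example.com/deploy-validation:1.0.0   # placeholder image
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
```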

Test preemption behavior before enabling it broadly

Do not introduce high-priority preemption directly into a busy production cluster. Test the behavior in a small namespace or non-production cluster first.

Use this process:

  1. Create your PriorityClasses.
  2. Create lower-priority pods that consume most available resources.
  3. Create a higher-priority pod that cannot currently fit.
  4. Watch which pods Kubernetes preempts.
  5. Check events, pod status, and workload recovery behavior.

Apply the classes:

kubectl apply -f priorityclasses.yaml

Create a namespace for the test:

kubectl create namespace priority-test

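Create lower-priority pods. Adjust the replica count and requests so they consume most of the allocatable capacity on your test nodes; the values below are placeholders: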

apiVersion: apps/v1
kind: Deployment
metadata:
  name: low-priority-workers
  namespace: priority-test
spec:
  replicas: 5
  selector:
    matchLabels:
      app: low-priority-workers
  template:
    metadata:
      labels:
        app: low-priority-workers
    spec:
      priorityClassName: opportunistic
      containers:
        - name: worker
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "500m"
              memory: "256Mi"

Create a higher-priority pod with requests large enough to force a scheduling decision:

apiVersion: v1
kind: Pod
metadata:
  name: critical-test-pod
  namespace: priority-test
spec:
  priorityClassName: app-critical
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: "2"
          memory: "1Gi"

Inspect the result:

kubectl get pods -n priority-test -o wide
kubectl describe pod critical-test-pod -n priority-test
kubectl get events -n priority-test --sort-by=.lastTimestamp

Look for event messages that mention preemption. Also check the lower-priority pods. If they belong to a Deployment, ReplicaSet, Job, or StatefulSet, their controller may create replacements. Those replacements can remain pending if capacity is still constrained.
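
Beyond events, the pod objects themselves carry preemption signals. The fragments below sketch the fields to look for in kubectl get pod -o yaml output. The node name is a placeholder, and the DisruptionTarget condition only appears on Kubernetes versions that record pod disruption conditions.

```yaml
# Preemptor (critical-test-pod) while it waits for its victims to terminate.
status:
  phase: Pending
  nominatedNodeName: "<node-name>"   # node the scheduler has nominated for this pod
---
# Victim (one of the low-priority-workers pods) while it is being removed.
status:
  conditions:
    - type: DisruptionTarget
      status: "True"
      reason: PreemptionByScheduler
```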

If you run data platforms or workflow schedulers on Kubernetes, test this carefully. A system such as Apache Airflow can create many pods or workers depending on how it is deployed. The scheduling behavior of those workers matters during capacity pressure. For a concrete deployment flow, see this guide on how to deploy Apache Airflow on AWS Elastic Kubernetes Service.

Operational gotchas that cause bad outcomes

PriorityClasses are simple objects, but the runtime behavior has sharp edges. Watch for these failure modes.

PodDisruptionBudgets are not a hard shield against preemption

A PodDisruptionBudget, or PDB, limits voluntary disruptions for a workload. Kubernetes tries to choose preemption victims in a way that respects PDBs, but preemption can still violate a PDB if the scheduler cannot find another way to place the higher-priority pod.

Do not assume this will protect lower-priority services:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api

A PDB helps, but it does not replace capacity planning, quotas, or careful priority assignment.

Missing resource requests make priority less predictable

The scheduler makes placement decisions using resource requests, not actual usage. A high-priority pod with missing or unrealistic requests can be scheduled in ways that create node pressure later.

Require CPU and memory requests for workloads that use elevated priority:

resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"

If a team cannot define reasonable requests, they probably should not receive preemption rights yet.
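
One way to make sure containers in a namespace always carry requests is a LimitRange, which fills in defaults for containers that omit them. This is a sketch with placeholder values and a hypothetical name, not a sizing recommendation. Explicit per-workload requests remain preferable for anything with elevated priority.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-requests         # hypothetical name
  namespace: payments
spec:
  limits:
    - type: Container
      defaultRequest:            # applied when a container sets no requests
        cpu: "250m"
        memory: "256Mi"
      default:                   # applied when a container sets no limits
        cpu: "500m"
        memory: "512Mi"
```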

Priority does not fix bad topology constraints

A high-priority pod can still stay pending if its node selector, affinity, taints, tolerations, storage constraints, or topology spread constraints make it impossible to place.

When a high-priority pod stays pending, inspect the scheduler message:

kubectl describe pod <pod-name> -n <namespace>

Common causes include:

  • No nodes match the pod’s nodeSelector.
  • The pod lacks a required toleration for tainted nodes.
  • Persistent volumes are bound to a different zone.
  • Topology spread rules are too strict.
  • Resource requests are larger than any single node can satisfy.
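
A quick way to see this in a test namespace is a high-priority pod whose nodeSelector matches nothing. The pod name and label below are hypothetical. Assuming no node carries that label, the pod should stay Pending and preempt nothing, because evicting pods cannot make a non-matching node eligible.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: unschedulable-critical   # hypothetical name
  namespace: priority-test
spec:
  priorityClassName: app-critical
  nodeSelector:
    example.com/pool: does-not-exist   # hypothetical label that no node carries
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: "100m"
          memory: "64Mi"
```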

Changing a PriorityClass does not update existing pods

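Priority is resolved when the pod is admitted. The value of an existing PriorityClass cannot be edited in place, so changing it means deleting and recreating the class, and pods that already exist keep the priority they were admitted with until they are recreated. Restart the owning workload if you need the new value to apply: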

kubectl rollout restart deployment/payments-api -n payments

Verify the effective priority on a pod:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.priorityClassName}{" "}{.spec.priority}{"\n"}'

Critical classes can hide capacity problems

If high-priority pods are constantly preempting lower-priority pods, the cluster is telling you something useful: capacity, quotas, or workload placement rules are wrong. Do not solve that by raising more priorities.

Check these signals:

  • Events with reason Preempted, or pods that are frequently terminated right after high-priority deploys.
  • Lower-priority Deployments that cannot return to their desired replica count.
  • Namespaces that repeatedly hit high-priority ResourceQuotas.
  • Pending pods with scheduling events that mention insufficient CPU or memory.
  • Autoscaling events that show the cluster cannot add suitable nodes.
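
Two of these signals can be turned into alerts. The rule file below is a sketch that assumes Prometheus scrapes both kube-scheduler and kube-state-metrics. Some managed control planes do not expose scheduler metrics, in which case the first rule will not fire. Alert names, thresholds, and durations are placeholders.

```yaml
# Prometheus rule file sketch; thresholds and durations are placeholders.
groups:
  - name: priority-preemption
    rules:
      - alert: FrequentPreemptionAttempts
        # scheduler_preemption_attempts_total is exposed by kube-scheduler.
        expr: increase(scheduler_preemption_attempts_total[1h]) > 10
        labels:
          severity: warning
        annotations:
          summary: "The scheduler attempted preemption more than 10 times in the last hour."
      - alert: PodsPendingTooLong
        # kube_pod_status_phase comes from kube-state-metrics.
        expr: sum by (namespace) (kube_pod_status_phase{phase="Pending"}) > 0
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Pods in {{ $labels.namespace }} have been Pending for 15 minutes."
```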

If you use Kubernetes to provision cloud resources through control-plane tools, make sure those controllers are classified correctly. For example, Crossplane controllers may be part of your platform path, while the resources they create have separate lifecycle concerns. These articles on deploying AWS resources using Crossplane on Kubernetes and deploying a Kubernetes app with an AWS resource using Crossplane give useful context for teams managing infrastructure through Kubernetes APIs.

A safe rollout checklist

Use this checklist before you merge PriorityClasses into a shared cluster:

  • Create no more than three or four application-facing priority levels.
  • Reserve system-critical PriorityClasses for Kubernetes system components only.
  • Set one clear global default, usually a normal workload class with value 0.
  • Require resource requests for every workload with elevated priority.
  • Add ResourceQuotas scoped by PriorityClass for each namespace that can use high priority.
  • Use preemptionPolicy: Never when queue ordering is enough.
  • Test preemption in a non-production namespace and inspect events.
  • Document who can approve new high-priority workloads.
  • Monitor preemptions, pending pods, and quota failures after rollout.
  • Recreate pods when you need PriorityClass value changes to take effect.

The safest PriorityClass setup is boring: few classes, clear ownership, real quotas, and measured use of preemption. Start with scheduling order, add preemption only where downtime is more expensive than evicting lower-priority work, and treat frequent preemption as a capacity or policy problem to fix.