Kubernetes CPU Throttling: Why Your Pods Lie to You
May 15, 2025
The Problem Nobody Tells You About
You set a CPU limit of 500m on your pod. Your monitoring shows average CPU usage at 18%. Your SLO is burning. Latency spikes every few seconds with no obvious cause. You scale horizontally. Nothing changes.
The culprit is almost certainly CPU throttling, and the reason it's invisible to most teams is that the metric everyone watches (CPU usage) is the wrong metric entirely.
How Kubernetes CPU Limits Actually Work
Kubernetes CPU limits are enforced by the Linux kernel's Completely Fair Scheduler (CFS) bandwidth control mechanism. Understanding this at the kernel level is the only way to reason about throttling correctly.
When you set resources.limits.cpu: 500m, Kubernetes writes two values to the container's cgroup:
cpu.cfs_period_us: the scheduling period, defaulting to 100,000 microseconds (100ms)
cpu.cfs_quota_us: how much CPU time the container is allowed within that period. For 500m, this is 50,000µs (50% of one core per 100ms window)
The kernel runs a hard accounting loop: every 100ms, it resets the quota. The moment a container exhausts its 50ms of CPU time within a given window, the kernel throttles every process in that cgroup until the next period starts. No exceptions. No borrowing from adjacent windows.
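The quota arithmetic is simple enough to sketch. The helper below is illustrative only (the name cfs_quota_us is borrowed from the cgroup file, not from any Kubernetes API) and assumes the default 100ms period unless told otherwise:

```python
def cfs_quota_us(limit_millicores: int, period_us: int = 100_000) -> int:
    """CPU time (in microseconds) a container may use per CFS period."""
    # 1000 millicores == one full core == one full period of CPU time.
    return limit_millicores * period_us // 1000


for limit in (250, 500, 2000):
    quota = cfs_quota_us(limit)
    print(f"{limit}m -> {quota} us of CPU per 100000 us period")
```

A limit above 1000m yields a quota larger than the period, which is how a multi-threaded container is allowed to use more than one core.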
The Throttling Math That Explains Everything
Here is why a pod at 20% average CPU can still be heavily throttled. Consider a Go service that handles HTTP requests with occasional bursts of compute: JSON marshalling, crypto, regex. The request handler takes 8ms of CPU time in a tight burst, then idles for the rest.
If two requests arrive simultaneously, they consume 16ms of CPU in the first few milliseconds of the 100ms window. Now imagine a garbage collection cycle hits at the same moment and adds another 40ms of CPU burst. Total demand: 56ms. The quota is 50ms. Because that work runs in parallel across several cores, the quota can be gone within a few milliseconds of wall-clock time: with about eight busy threads it lasts roughly 6ms, and the kernel throttles the container for the remaining ~94ms of that period. The last 6ms of work has to wait for the next window.
On a wall-clock basis, your p99 latency just jumped by up to 100ms, not because your service is slow, but because the kernel suspended it mid-request. The CPU usage metric can never show more than the quota for that window: 50ms used out of 100ms, a reported 50% of one core. And your average over an hour is still 20% because most windows are idle.
This is the lie: average CPU usage is meaningless as a throttling signal. The kernel doesn't care about averages. It only cares about the current 100ms window.
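To make the window arithmetic concrete, here is a toy model of a single CFS period. It is a sketch, not a kernel simulation: simulate_period is a made-up helper, and it assumes all demand arrives at the start of the period and runs on a fixed number of threads in parallel until the quota is gone.

```python
def simulate_period(demand_ms: float, quota_ms: float = 50.0,
                    period_ms: float = 100.0, threads: int = 1):
    """Return (cpu_used_ms, throttled_wall_ms, carried_over_ms) for one period."""
    # The container can never use more CPU time than its quota in one period.
    cpu_used = min(demand_ms, quota_ms)
    # Wall-clock time needed to burn that CPU time across parallel threads.
    busy_wall = min(cpu_used / threads, period_ms)
    # Throttling only happens if demand exceeds the quota.
    throttled_wall = period_ms - busy_wall if demand_ms > quota_ms else 0.0
    return cpu_used, throttled_wall, demand_ms - cpu_used


for threads in (1, 2, 8):
    used, stalled, carry = simulate_period(56, threads=threads)
    print(f"{threads} thread(s): used {used:.0f}ms, "
          f"throttled {stalled:.2f}ms, {carry:.0f}ms deferred")
```

Under these assumptions the stall depends on parallelism: one thread takes 50ms of wall time to spend the quota and is frozen for the other 50ms, while eight threads spend it in about 6ms and are frozen for about 94ms.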
Diagnosing Throttling with Prometheus
The right metrics live in cAdvisor, which exposes them via the kubelet's metrics endpoint:
# Ratio of throttled periods to total periods, the key signal
sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (pod, container)
/
sum(rate(container_cpu_cfs_periods_total[5m])) by (pod, container)
A value above 0.25 (25%) is a yellow flag. Above 0.5 (50%) is a problem you should act on. Some latency-sensitive services break at 10%.
A second useful query shows how long the container spent throttled, in seconds of throttled time per second of wall-clock time:
rate(container_cpu_cfs_throttled_seconds_total[5m])
Cross-reference this against your p99 latency graphs. If the spikes correlate, throttling is your root cause.
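If Prometheus is not at hand, the same counters can be read from the container's cgroup cpu.stat file, which contains nr_periods and nr_throttled lines. The sketch below works on two snapshots of that file's text; the sample values are invented, the function names are hypothetical, and the file's path varies with cgroup version and container runtime, so reading it is left out.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse the 'key value' lines of a cgroup cpu.stat file into ints."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_ratio(before: dict, after: dict) -> float:
    """Fraction of CFS periods that were throttled between two snapshots."""
    periods = after["nr_periods"] - before["nr_periods"]
    throttled = after["nr_throttled"] - before["nr_throttled"]
    return throttled / periods if periods > 0 else 0.0


def classify(ratio: float) -> str:
    """Apply the 25% and 50% thresholds suggested in the text above."""
    if ratio > 0.5:
        return "act now"
    if ratio > 0.25:
        return "yellow flag"
    return "ok for most services"


# Invented snapshots, taken some minutes apart.
before = parse_cpu_stat("nr_periods 1000\nnr_throttled 100\n")
after = parse_cpu_stat("nr_periods 4000\nnr_throttled 1300\n")
ratio = throttle_ratio(before, after)
print(f"throttled in {ratio:.0%} of periods: {classify(ratio)}")
```

The counters are cumulative, so a single reading only gives the lifetime average; the delta between two snapshots plays the role of rate() in the PromQL query.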
The Four Ways to Fix It
1. Raise the limit (the obvious answer, often wrong)
Increasing cpu: 500m to cpu: 2000m raises the quota to 200ms per period. The burst headroom grows. This works, but you are now sizing the limit for peak bursts rather than averages. If requests rise along with it, you reserve cluster capacity that sits idle most of the time, which is wasteful at scale.
2. Remove the CPU limit entirely (controversial but often correct)
For trusted, well-understood workloads on nodes you control, removing CPU limits converts the pod to Burstable QoS with no upper bound. The container competes fairly with other containers using the CFS weight mechanism (driven by cpu.shares / requests), but it can burst to the full node capacity when cores are idle.
The risk: a runaway process can steal CPU from neighbours. The mitigation: set accurate requests so the scheduler places pods correctly, and monitor per-pod CPU usage actively. Many high-performance clusters at Google, Cloudflare, and others run without CPU limits for exactly this reason.
3. Reduce the CFS period (advanced, surgical)
The 100ms default period is long. Kubernetes 1.12+ allows you to configure --cpu-cfs-quota-period on the kubelet (gated behind the CustomCPUCFSQuotaPeriod feature gate). Reducing it to 5ms or 10ms means shorter burst windows and faster quota resets, so the container is throttled for a shorter wall-clock time per event.
This doesn't give more CPU; it reduces the maximum throttle duration per event. For latency-sensitive APIs, this can cut p99 throttle-induced latency from ~100ms to ~5ms. The trade-off is higher kernel scheduling overhead from more frequent accounting.
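A rough way to see the effect is to compute the longest single stall for a few period lengths. This is a back-of-the-envelope sketch with a hypothetical helper, assuming a 500m limit and a burst that spends the whole quota at the start of the period:

```python
def worst_case_stall_ms(period_ms: float, limit_millicores: int = 500,
                        threads: int = 1) -> float:
    """Longest wall-clock freeze within one period once the quota is spent."""
    quota_ms = period_ms * limit_millicores / 1000
    # Parallel threads burn the quota faster, leaving more of the period frozen.
    busy_wall_ms = quota_ms / threads
    return max(period_ms - busy_wall_ms, 0.0)


for period in (100, 10, 5):
    single = worst_case_stall_ms(period)
    parallel = worst_case_stall_ms(period, threads=8)
    print(f"period {period}ms: stall up to {single:.2f}ms (1 thread), "
          f"{parallel:.2f}ms (8 threads)")
```

Each individual stall shrinks with the period, but so does the per-period quota, so a burst larger than that quota is spread over more periods rather than finishing sooner.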
4. Fix the actual burst (the right answer for most cases)
Profile why the burst exists. Common culprits: synchronous JSON marshalling on the hot path (switch to streaming), naive regex compilation per request (compile once, reuse), GC pressure from excessive allocations (profile heap usage). Eliminating the burst eliminates the throttle without touching limits.
The QoS Class Consequence
One side effect worth knowing: if you remove CPU limits from a Guaranteed pod while keeping requests, it moves to Burstable QoS. If you go further and remove every request and limit, memory included, it becomes BestEffort, the first to be evicted under node pressure. Keep requests accurate and set memory limits. The common pattern for performance-sensitive services:
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    memory: "512Mi"  # keep memory limit, OOM is safer than throttle
    # no cpu limit
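The QoS rules can be written down as a small function. This is a simplified sketch, not the kubelet's code: qos_class is a hypothetical name, only cpu and memory are considered, and quantities are compared as plain strings, so "500m" and "0.5" would wrongly count as different.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort.

    Each container is {"requests": {...}, "limits": {...}} with optional
    "cpu" and "memory" keys, mirroring the resources stanza.
    """
    any_set = False
    guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An unset request defaults to the limit when a limit exists.
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


pattern = {"requests": {"cpu": "500m", "memory": "512Mi"},
           "limits": {"memory": "512Mi"}}
print(qos_class([pattern]))  # the no-CPU-limit pattern shown above
```

Adding a matching cpu limit of "500m" to the same container would make the pod Guaranteed, and a container with no requests or limits at all would make it BestEffort.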
What to Take Away
CPU throttling is a kernel mechanism, not a Kubernetes abstraction. Average CPU usage is not a throttling signal; the throttled period ratio is. The default 100ms CFS period means a single burst can freeze your process for up to 100ms regardless of your average load. The right fix depends on your workload: raise limits if you have headroom budget, remove limits if you trust the workload, reduce the CFS period if latency is the primary concern, or profile and eliminate the burst at the source. Most teams should do the last one first.