Load Average Has Been Lying to You
You’ve stared at that load average: 3.42, 2.89, 2.71 output at 2 AM, wondering whether to panic. Four CPUs. Is 3.42 bad? Is it memory pressure? IO wait? CPU saturation? Is one process chewing everything or are fifty processes politely taking turns?
Load average doesn’t tell you. It never did. It counts the number of runnable or uninterruptible-sleep processes as a rolling average — which means it bundles IO-blocked tasks right alongside CPU-hungry ones, smooths everything over one, five, and fifteen minutes, and hands you a number that could mean roughly anything depending on your CPU count and workload shape.
The kernel has had a better answer since Linux 4.20. It’s called PSI (Pressure Stall Information), and it tells you the actual percentage of time your tasks were stuck waiting. Not a count. A percentage. With three separate windows. For CPU, memory, and IO independently. If your kernel is 4.20+ (it is) and was built with CONFIG_PSI, you already have it, although some distributions ship it disabled until you boot with psi=1. You’re probably just not looking at it.
What PSI Actually Measures
PSI tracks stall time — wall-clock time during which at least one task couldn’t make progress because it was waiting on a resource. Three files in /proc/pressure/:
    cat /proc/pressure/cpu
    cat /proc/pressure/memory
    cat /proc/pressure/io

Each file looks roughly like this (memory, as an example):

    some avg10=0.42 avg60=1.23 avg300=0.88 total=284750
    full avg10=0.00 avg60=0.15 avg300=0.08 total=42318

Two lines, and the distinction matters:

some: at least one task was stalled. The system was partially stuck.
full: all non-idle tasks were stalled simultaneously. Nobody was making progress on that resource.
The full metric is the one that should make you sweat. some at 5% on memory means a few tasks occasionally waited for pages to be faulted in — normal. full at 5% on memory means everything ground to a halt for 5% of the last ten seconds. That’s an OOM situation developing.
The three windows — avg10, avg60, avg300 — are exponentially weighted moving averages over 10 seconds, 60 seconds, and 300 seconds. Sound familiar? Same shape as load average’s 1/5/15, but these are percentages (0–100), they’re per-resource, and the 10-second window actually catches sudden spikes before your monitoring system has a chance to yawn and look away.
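To get a feel for how differently the windows react, here is a simplified floating-point model of an exponentially weighted average. It is a sketch, not the kernel's fixed-point code: the function name `ewma_series` is made up for illustration, and the 2-second update period is an assumption.

```python
import math

def ewma_series(samples, window_s, period_s=2.0):
    """Feed per-period stall percentages through an exponential moving average.

    Simplified float model. period_s=2.0 is an assumed update interval.
    """
    decay = math.exp(-period_s / window_s)
    avg = 0.0
    out = []
    for pct in samples:
        # Old average decays, the newest sample fills the remainder
        avg = avg * decay + pct * (1.0 - decay)
        out.append(avg)
    return out

# Ten seconds of total stall (five 2-second periods at 100%), starting from zero
spike = [100.0] * 5
print(round(ewma_series(spike, 10)[-1], 1))   # 63.2
print(round(ewma_series(spike, 300)[-1], 1))  # 3.3
```

Same ten-second stall, and the short window is already past 60% while the long one has barely moved. That is why avg10 is the one to watch during an incident and avg300 is the one to trend.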
total is a monotonically increasing counter in microseconds — useful for calculating rate of change in your own tooling.
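If you want to roll your own tooling, the format is easy to parse. A minimal sketch follows, with the caveat that `parse_pressure` and `stall_percent` are hypothetical helper names, and that the sample text is the memory example from above rather than a live read of /proc/pressure.

```python
def parse_pressure(text):
    """Parse /proc/pressure/* content into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # total is an integer microsecond counter; the avg fields are percentages
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result

def stall_percent(total_before_us, total_after_us, elapsed_s):
    """Share of the interval spent stalled, from two readings of total."""
    return (total_after_us - total_before_us) * 100.0 / (elapsed_s * 1_000_000)

sample = (
    "some avg10=0.42 avg60=1.23 avg300=0.88 total=284750\n"
    "full avg10=0.00 avg60=0.15 avg300=0.08 total=42318\n"
)
parsed = parse_pressure(sample)
print(parsed["full"]["avg60"])             # 0.15
print(stall_percent(284750, 784750, 5.0))  # 10.0
```

The second call says: total grew by 500,000 microseconds over a 5-second interval, so tasks were stalled for 10% of it. Pick any sampling interval you like and you get your own window instead of the kernel's three.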
Reading the Three Pressure Files
CPU Pressure
    cat /proc/pressure/cpu

    some avg10=0.00 avg60=0.12 avg300=0.08 total=194823

Note: at the system level, CPU pressure has no meaningful full value (if no task runs because the CPU is free, that’s just idle). Older kernels print only the some line; kernels since 5.13 also print a full line, which recent kernels report as zeros system-wide. some CPU pressure means tasks were runnable but waiting for CPU time: scheduler contention, not necessarily full saturation.
Memory Pressure
    cat /proc/pressure/memory

    some avg10=2.14 avg60=0.88 avg300=0.31 total=982341
    full avg10=0.00 avg60=0.00 avg300=0.00 total=8821

Memory some at 2% over 10 seconds: a handful of tasks are waiting for the kernel to reclaim pages or service page faults. Not an emergency, but worth watching. Memory full near zero: nobody’s completely blocked. The moment full starts climbing, your system is heading toward an OOM event: the kernel is desperately swapping and reclaiming while everything waits.
IO Pressure
    cat /proc/pressure/io

    some avg10=15.32 avg60=8.44 avg300=3.12 total=4821033
    full avg10=4.11 avg60=2.31 avg300=0.98 total=892441

IO some at 15% is a yellow flag. Tasks are waiting on disk reads/writes, which is probably fine on a busy DB node but worth correlating with disk throughput. IO full at 4% over 10 seconds means everything froze while waiting on IO. That’s the stat that explains why your SSH session felt like moving through treacle even though CPU load looked fine.
PSI Per-Cgroup: Pressure Inside Containers
Here’s where it gets useful for container workloads. If you’re running cgroups v2 (and if you followed the cgroups v2 deep dive published here on 2026-08-07, you know your systemd-enabled system is already using it), each cgroup exposes its own pressure files.
Find your container’s cgroup path:
    # For a systemd service
    systemctl show my-service.service -p ControlGroup
    # Typical output: ControlGroup=/system.slice/my-service.service

    cat /sys/fs/cgroup/system.slice/my-service.service/memory.pressure

For Docker containers (cgroup v2):

    # Get the container's full ID
    docker inspect --format '{{.Id}}' mycontainer
    # Then (this path assumes Docker's systemd cgroup driver):
    cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.pressure

This is huge. You can now see exactly which container is experiencing memory pressure, independently of what the whole system is doing. When one container is thrashing its memory, you don’t have to infer it from the system-level numbers; you can pinpoint it. No more docker stats showing 90% memory usage on a container and trying to guess whether it’s actually stalled.
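Checking cgroups one at a time gets old quickly. A rough sketch of a "who is stalled right now" scan follows, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup. `full_avg10` and `rank_cgroups` are hypothetical helpers, and ranking by the full avg10 value is just one reasonable choice.

```python
from pathlib import Path

def full_avg10(pressure_text):
    """Return avg10 from the 'full' line, or 0.0 if there is no such line."""
    for line in pressure_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "full":
            for field in parts[1:]:
                if field.startswith("avg10="):
                    return float(field.split("=", 1)[1])
    return 0.0

def rank_cgroups(root="/sys/fs/cgroup", resource="memory", top=5):
    """List cgroups under root with the highest 'full avg10' for a resource."""
    ranked = []
    for path in Path(root).rglob(f"{resource}.pressure"):
        try:
            value = full_avg10(path.read_text())
        except OSError:
            continue  # cgroup vanished mid-scan, or PSI is unavailable there
        ranked.append((value, str(path.parent)))
    ranked.sort(reverse=True)
    return ranked[:top]

if __name__ == "__main__":
    for value, cgroup in rank_cgroups():
        print(f"{value:6.2f}%  {cgroup}")
```

Pass `resource="io"` or `resource="cpu"` to rank by the other pressure files. Reading the whole tree may need root on some systems.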
How oomd and systemd-oomd Use This
The classic OOM killer is reactive — it fires after the kernel has already given up on memory reclaim, picks a process (often the wrong one), and terminates it. By then you’ve had a bad time.
systemd-oomd uses PSI to be proactive. It monitors memory.pressure on the cgroups you opt in (units with ManagedOOMMemoryPressure=kill) and starts killing when pressure crosses configurable thresholds, before the kernel OOM killer triggers. The default policy watches for memory full pressure staying elevated for a time window, then kills a descendant cgroup of the affected unit, preferring the one with the most reclaim activity.
Check if it’s running:
    systemctl status systemd-oomd

The config lives at /etc/systemd/oomd.conf. Key options:

    [OOM]
    SwapUsedLimit=90%
    DefaultMemoryPressureLimit=60%
    DefaultMemoryPressureDurationSec=30s

DefaultMemoryPressureLimit=60% means: if memory full pressure (the 10-second average) in a monitored cgroup stays above 60% for 30 seconds, systemd-oomd acts on it. Adjust to taste. 60% is the shipped default and it is conservative about killing. On a database node you might push this to 80%.
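The "limit plus duration" idea is worth seeing in code, because the duration is what keeps a brief spike from triggering a kill. This is only an illustration of the concept, not systemd-oomd's actual implementation; `sustained_breach` is a made-up name and the readings are invented.

```python
def sustained_breach(samples, limit_pct=60.0, duration_s=30.0):
    """Return the timestamp at which pressure has stayed above limit_pct
    for duration_s seconds without a break, or None if it never does.

    samples: iterable of (timestamp_seconds, pressure_percent), in time order.
    """
    breach_start = None
    for ts, pct in samples:
        if pct > limit_pct:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= duration_s:
                return ts
        else:
            breach_start = None  # pressure dropped, so the clock resets
    return None

# One reading every 10 seconds; the dip at t=20 resets the timer
readings = [(0, 70), (10, 75), (20, 40), (30, 65), (40, 80), (50, 90), (60, 85)]
print(sustained_breach(readings))  # 60
```

The first burst (t=0 to t=10) never counts because pressure dipped at t=20. Only the uninterrupted stretch from t=30 to t=60 crosses the 30-second bar.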
Kubernetes is moving in a similar direction. With kubelet >= 1.33 and --feature-gates=KubeletPSI=true, the kubelet exposes PSI metrics for nodes, pods, and containers, so you can see pressure building before a node goes fully OOM.
systemd-cgtop: Pressure at a Glance
Before you wire up Prometheus, you can use systemd-cgtop to see which cgroups are busy right now:

    systemd-cgtop -p --depth=3

The -p flag sorts by cgroup path (use -c, -m, or -i to sort by CPU, memory, or IO instead). You’ll see tasks, CPU%, memory, and IO input/output per cgroup. It doesn’t print PSI values itself, so use it to spot the suspicious cgroup, then read that cgroup’s memory.pressure file. It’s a solid first triage tool when something is wrong right now and you don’t have time to query Grafana.
Wiring PSI Into Prometheus
For longer-term visibility and alerting, Prometheus node_exporter has shipped a pressure (PSI) collector since v0.18.0, and it is enabled by default on Linux. If you’re already running the Prometheus + Grafana stack (covered in the prometheus-grafana-setup guide), you just need to confirm the collector hasn’t been switched off.
Verify it’s active:
    curl -s http://localhost:9100/metrics | grep node_pressure

You should see metrics like:

    node_pressure_cpu_waiting_seconds_total
    node_pressure_memory_waiting_seconds_total
    node_pressure_memory_stalled_seconds_total
    node_pressure_io_waiting_seconds_total
    node_pressure_io_stalled_seconds_total

These are counters (total stalled seconds), where waiting corresponds to some and stalled corresponds to full. Prometheus rate() turns them into seconds stalled per second, a fraction between 0 and 1; multiply by 100 to get a percentage comparable to the avg values.
Alert Rules That Don’t Suck
The default “alert if memory usage > 90%” rule is famously terrible — it fires on healthy JVM heap behavior and stays silent during actual OOM spirals. PSI-based alerts are better because they measure impact, not just utilization.
    groups:
      - name: psi_pressure
        rules:

          - alert: MemoryPressureElevated
            expr: |
              rate(node_pressure_memory_waiting_seconds_total[2m]) * 100 > 10
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Memory pressure on {{ $labels.instance }}"
              description: >
                Memory 'some' pressure at {{ $value | printf "%.1f" }}% over last 2m.
                Tasks are waiting on memory reclaim.

          - alert: MemoryStalledCritical
            expr: |
              rate(node_pressure_memory_stalled_seconds_total[2m]) * 100 > 5
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Memory stall (full) on {{ $labels.instance }}"
              description: >
                Memory 'full' pressure at {{ $value | printf "%.1f" }}%: ALL tasks stalled.
                OOM event likely imminent. Check systemd-oomd logs.

          - alert: IOPressureHigh
            expr: |
              rate(node_pressure_io_waiting_seconds_total[5m]) * 100 > 20
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "IO pressure on {{ $labels.instance }}"
              description: >
                IO 'some' pressure at {{ $value | printf "%.1f" }}% for 10m.
                Disk throughput may be saturated.

          - alert: IOStalledCritical
            expr: |
              rate(node_pressure_io_stalled_seconds_total[2m]) * 100 > 10
            for: 3m
            labels:
              severity: critical
            annotations:
              summary: "IO full stall on {{ $labels.instance }}"
              description: >
                IO 'full' pressure at {{ $value | printf "%.1f" }}%: complete IO stall.
                Check disk health, IOPS limits, or NFS mount issues.

The thresholds above are sensible starting points, not gospel. A Postgres server under normal load might push IO some to 15% and that’s fine. Tune your baselines by watching avg10 over a few days during normal operation, then set warnings at 2x that value.
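"Watch for a few days, then double it" needs a definition of the baseline. One option, and it is only an assumption here, is to take a high percentile of the avg10 values you collected so a single outlier doesn't set the bar. `percentile` and `warning_threshold` are hypothetical helpers, and the sample readings are invented.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]

def warning_threshold(avg10_samples, pct=90, multiplier=2.0):
    """Return (baseline, threshold) where threshold = multiplier x baseline."""
    baseline = percentile(avg10_samples, pct)
    return baseline, baseline * multiplier

# IO 'some' avg10 readings gathered during normal operation (made-up numbers)
observed = [3.0, 4.5, 5.0, 6.2, 7.5, 8.0, 9.1, 12.0, 14.8, 15.0]
print(warning_threshold(observed))  # (14.8, 29.6)
```

With these readings the warning lands near 30% instead of the generic 20% in the rule above, which is the point: the number should come from your workload, not from a blog post.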
Drop this in your Prometheus rules directory and reload:
    sudo cp psi-alerts.yml /etc/prometheus/rules/
    sudo systemctl reload prometheus
    # Or if using the HTTP API (requires Prometheus to run with --web.enable-lifecycle):
    curl -X POST http://localhost:9090/-/reload

Quick Reference: What the Numbers Mean
| Metric | High Value Means | Action |
|---|---|---|
| CPU some > 20% | Scheduler contention | Check CPU count, nice levels, cgroup CPU limits |
| Memory some > 10% | Active page reclaim | Check RSS, swap usage, transparent hugepages |
| Memory full > 1% | Near-OOM condition | Check oomd logs, kill or limit offending cgroup |
| IO some > 25% | Disk throughput pressure | iostat, check IOPS limits, queue depth |
| IO full > 5% | Complete IO stall | Disk health, NFS issues, block device errors |
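The table above is small enough to carry around as data. A sketch of that idea follows, where `QUICK_REFERENCE` and `triage` are illustrative names and the thresholds are the table's rules of thumb, not universal limits.

```python
# (resource, kind, threshold_pct, meaning, action) rows copied from the table
QUICK_REFERENCE = [
    ("cpu", "some", 20.0, "Scheduler contention",
     "Check CPU count, nice levels, cgroup CPU limits"),
    ("memory", "some", 10.0, "Active page reclaim",
     "Check RSS, swap usage, transparent hugepages"),
    ("memory", "full", 1.0, "Near-OOM condition",
     "Check oomd logs, kill or limit offending cgroup"),
    ("io", "some", 25.0, "Disk throughput pressure",
     "iostat, check IOPS limits, queue depth"),
    ("io", "full", 5.0, "Complete IO stall",
     "Disk health, NFS issues, block device errors"),
]

def triage(readings):
    """readings maps (resource, kind) -> percent; returns one line per breach."""
    findings = []
    for resource, kind, threshold, meaning, action in QUICK_REFERENCE:
        value = readings.get((resource, kind))
        if value is not None and value > threshold:
            findings.append(
                f"{resource} {kind} {value:.2f}% > {threshold:g}%: {meaning}. {action}"
            )
    return findings

# IO values from the example earlier in the article, plus a made-up memory reading
snapshot = {("io", "some"): 15.32, ("io", "full"): 4.11, ("memory", "full"): 3.0}
for finding in triage(snapshot):
    print(finding)
```

Only the memory full reading trips a rule here. The IO numbers that looked alarming in isolation are under the table's thresholds, which is a decent reminder to compare against a baseline before paging anyone.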
The Takeaway
Load average told you something was happening. PSI tells you what was happening, to which resource, and for how long. It’s the difference between a smoke detector that goes off when you open the oven and one that tells you which room is on fire and how long it’s been burning.
Make sure the node_exporter PSI collector is on. Add the alerts above. Check your containers’ memory.pressure files the next time something feels sluggish. And the next time someone says “load average is 4.2, seems fine,” you can pull up the memory full graph at 3%, watch their face change, and quietly go fix the actual problem.
Your 2 AM self will thank you for having real numbers instead of a five-minute average that tells you precisely nothing.