Linux PSI: The Pressure Metrics Load Average Wishes It Were

By SumGuy 9 min read

Load Average Has Been Lying to You

You’ve stared at that load average: 3.42, 2.89, 2.71 output at 2 AM, wondering whether to panic. Four CPUs. Is 3.42 bad? Is it memory pressure? IO wait? CPU saturation? Is one process chewing everything or are fifty processes politely taking turns?

Load average doesn’t tell you. It never did. It counts the number of runnable or uninterruptible-sleep processes as a rolling average — which means it bundles IO-blocked tasks right alongside CPU-hungry ones, smooths everything over one, five, and fifteen minutes, and hands you a number that could mean roughly anything depending on your CPU count and workload shape.

The kernel has had a better answer since Linux 4.20. It’s called PSI (Pressure Stall Information), and it tells you the actual percentage of time your tasks were stuck waiting. Not a count. A percentage. With three separate windows. For CPU, memory, and IO independently. If your kernel is 4.20+ (it is) and was built with CONFIG_PSI, you already have it, though some distributions ship it disabled and need psi=1 on the kernel command line. You’re probably just not looking at it.


What PSI Actually Measures

PSI tracks stall time — wall-clock time during which at least one task couldn’t make progress because it was waiting on a resource. Three files in /proc/pressure/:

cat /proc/pressure/cpu
cat /proc/pressure/memory
cat /proc/pressure/io

Each file looks roughly like this (memory, as an example):

some avg10=0.42 avg60=1.23 avg300=0.88 total=284750
full avg10=0.00 avg60=0.15 avg300=0.08 total=42318

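Two lines, and the distinction matters:

some: the share of time in which at least one task was stalled on the resource.
full: the share of time in which all non-idle tasks were stalled on the resource at the same time.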

The full metric is the one that should make you sweat. some at 5% on memory means a few tasks occasionally waited for pages to be faulted in — normal. full at 5% on memory means everything ground to a halt for 5% of the last ten seconds. That’s an OOM situation developing.

The three windows — avg10, avg60, avg300 — are exponentially weighted moving averages over 10 seconds, 60 seconds, and 300 seconds. Sound familiar? Same shape as load average’s 1/5/15, but these are percentages (0–100), they’re per-resource, and the 10-second window actually catches sudden spikes before your monitoring system has a chance to yawn and look away.
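To work with these files in your own tooling, the key=value layout is easy to parse. Here is a minimal sketch in Python; the function name `parse_psi` and the returned dictionary shape are illustrative choices, not part of any kernel or library API.

```python
def parse_psi(text):
    """Parse the contents of a PSI file into {line_kind: {field: value}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        kind, *fields = line.split()  # kind is "some" or "full"
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # avg* fields are percentages; total is an integer counter
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


# Sample text copied from the memory example above
sample = """some avg10=0.42 avg60=1.23 avg300=0.88 total=284750
full avg10=0.00 avg60=0.15 avg300=0.08 total=42318"""

pressure = parse_psi(sample)
print(pressure["some"]["avg10"])
print(pressure["full"]["total"])
```

On a real host you would feed it the file contents instead, for example `parse_psi(open("/proc/pressure/memory").read())`.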

total is a monotonically increasing counter in microseconds — useful for calculating rate of change in your own tooling.
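If you want a window the kernel doesn’t give you, two readings of total are enough. The sketch below is one way to do the arithmetic; `stall_percent` is a hypothetical helper name, and the readings in the example are made up for illustration.

```python
def stall_percent(total_before_us, total_after_us, elapsed_seconds):
    """Percentage of an interval spent stalled, from two `total` readings.

    total_* are microsecond counters from a PSI file; elapsed_seconds is
    the wall-clock time between the two reads.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    stalled_us = total_after_us - total_before_us
    return stalled_us * 100 / (elapsed_seconds * 1_000_000)


# Hypothetical readings taken 5 seconds apart: 500,000 us of stall time
print(stall_percent(284_750, 784_750, 5))  # 10.0
```

Half a second of stall inside a five-second interval works out to 10%, which is the same kind of number the avg fields report, just over a window you picked yourself.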


Reading the Three Pressure Files

CPU Pressure

cat /proc/pressure/cpu
some avg10=0.00 avg60=0.12 avg300=0.08 total=194823

Note: kernels before 5.13 print no full line for CPU. Newer kernels do, but at the system level CPU full is undefined and reported as zeros for backward compatibility (if no task is running at all, the CPU is simply idle); it only carries meaning inside a cgroup. some CPU pressure means tasks were runnable but waiting for CPU time: scheduler contention, not saturation per se.

Memory Pressure

cat /proc/pressure/memory
some avg10=2.14 avg60=0.88 avg300=0.31 total=982341
full avg10=0.00 avg60=0.00 avg300=0.00 total=8821

Memory some at 2% over 10 seconds: a handful of tasks are waiting for the kernel to reclaim pages or service page faults. Not an emergency, but worth watching. Memory full near zero: nobody’s completely blocked. The moment full starts climbing, your system is heading toward an OOM event — the kernel is desperately swapping and reclaiming while everything waits.

IO Pressure

cat /proc/pressure/io
some avg10=15.32 avg60=8.44 avg300=3.12 total=4821033
full avg10=4.11 avg60=2.31 avg300=0.98 total=892441

IO some at 15% is a yellow flag — tasks are waiting on disk reads/writes, probably fine on a busy DB node but worth correlating with disk throughput. IO full at 4% over 10 seconds means everything froze while waiting on IO. That’s the stat that explains why your SSH session felt like moving through treacle even though CPU load looked fine.


PSI Per-Cgroup: Pressure Inside Containers

Here’s where it gets useful for container workloads. If you’re running cgroups v2 (and if you followed the cgroups v2 deep dive published here on 2026-08-07, you know your systemd-enabled system is already using it), each cgroup exposes its own pressure files.

Find your container’s cgroup path:

# For a systemd service
systemctl show my-service.service -p ControlGroup
# Typical output: ControlGroup=/system.slice/my-service.service
cat /sys/fs/cgroup/system.slice/my-service.service/memory.pressure

For Docker containers (cgroup v2):

# Get the container's full ID (the path below assumes Docker's systemd cgroup driver)
docker inspect --format '{{.Id}}' mycontainer
# Then:
cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.pressure

This is huge. You can now see exactly which container is experiencing memory pressure, independently of what the whole system is doing. A container thrashing its memory shows up in its own memory.pressure file, so you can pinpoint it instead of guessing from the system-level numbers. No more staring at docker stats showing 90% memory usage on a container and trying to guess whether it’s actually stalled.
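If you check these files from a script, building the path is the only fiddly part. A minimal sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup and Docker uses the systemd cgroup driver, as in the commands above; `pressure_file` and `docker_scope` are hypothetical helper names, and the container ID in the example is a placeholder.

```python
from pathlib import PurePosixPath

CGROUP_ROOT = PurePosixPath("/sys/fs/cgroup")  # assumed cgroup v2 mount point


def pressure_file(control_group, resource="memory"):
    """Path of a cgroup's pressure file, given a ControlGroup value
    such as the one printed by `systemctl show -p ControlGroup`."""
    if resource not in ("cpu", "memory", "io"):
        raise ValueError(f"unknown resource: {resource}")
    return CGROUP_ROOT / control_group.lstrip("/") / f"{resource}.pressure"


def docker_scope(container_id):
    """ControlGroup of a Docker container (systemd cgroup driver assumed)."""
    return f"/system.slice/docker-{container_id}.scope"


print(pressure_file("/system.slice/my-service.service"))
print(pressure_file(docker_scope("0123abcd"), "io"))  # placeholder ID
```

With a different cgroup driver or a Kubernetes runtime the scope layout differs, so treat `docker_scope` as a starting point rather than a general rule.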


How oomd and systemd-oomd Use This

The classic OOM killer is reactive — it fires after the kernel has already given up on memory reclaim, picks a process (often the wrong one), and terminates it. By then you’ve had a bad time.

systemd-oomd uses PSI to be proactive. It monitors memory.pressure per-cgroup and kills cgroups when pressure crosses configurable thresholds, before the kernel OOM killer triggers. For cgroups that opt in with ManagedOOMMemoryPressure=kill, it watches for memory full pressure staying above the limit for a time window, then acts on descendant cgroups, starting with the one showing the most reclaim activity.

Check if it’s running:

systemctl status systemd-oomd

The config lives at /etc/systemd/oomd.conf. Key options:

[OOM]
SwapUsedLimit=90%
DefaultMemoryPressureLimit=60%
DefaultMemoryPressureDurationSec=30s

DefaultMemoryPressureLimit=60% means: if memory full pressure (the 10-second average) stays above 60% for 30 seconds in a monitored cgroup, its descendants become candidates for termination. Adjust to taste; 60% is conservative. On a database node you might push this to 80%.
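The "above the limit for a sustained duration" rule is worth seeing in isolation, because it is what keeps short spikes from triggering a kill. The sketch below is a simplified model of that rule, not systemd-oomd’s actual implementation; `exceeded_for` and the sample readings are invented for illustration.

```python
def exceeded_for(samples, limit_percent, duration_seconds):
    """Return True if pressure stayed above the limit for the full duration.

    samples: iterable of (timestamp_seconds, pressure_percent), oldest first.
    """
    above_since = None  # timestamp at which the current streak started
    for timestamp, pressure in samples:
        if pressure > limit_percent:
            if above_since is None:
                above_since = timestamp
            if timestamp - above_since >= duration_seconds:
                return True
        else:
            above_since = None  # a dip below the limit resets the streak
    return False


# Made-up readings, one every 10 seconds
sustained = [(0, 65.0), (10, 70.0), (20, 72.0), (30, 68.0)]
spiky = [(0, 65.0), (10, 50.0), (20, 72.0), (30, 68.0), (40, 61.0)]

print(exceeded_for(sustained, 60.0, 30))  # True
print(exceeded_for(spiky, 60.0, 30))      # False
```

The second series spends more total time above 60% than you might expect to tolerate, but the dip at the 10-second mark resets the clock, so the 30-second condition is never met.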

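Kubernetes is heading in a similar direction. With the KubeletPSI feature gate (kubelet >= 1.33 with --feature-gates=KubeletPSI=true), the kubelet reports PSI metrics for nodes, pods, and containers, so you can see pressure building before a node goes fully OOM. Check the documentation for your version before assuming the kubelet acts on those metrics when making eviction decisions.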


systemd-cgtop: Per-Cgroup Usage at a Glance

Before you wire up Prometheus, systemd-cgtop gives you a live per-cgroup view of resource usage:

systemd-cgtop -p --depth=3

The -p flag orders the output by cgroup path, and --depth=3 limits how far down the tree it descends. You’ll see tasks, CPU%, memory, and IO input/output per cgroup. It doesn’t print PSI itself, so once you’ve spotted a suspicious cgroup, read its cpu.pressure, memory.pressure, and io.pressure files directly. It’s a solid first triage tool when something is wrong right now and you don’t have time to query Grafana.


Wiring PSI Into Prometheus

For longer-term visibility and alerting, Prometheus node_exporter has included a pressure (PSI) collector since v0.18.0, and current releases enable it by default. If you’re already running the Prometheus + Grafana stack (covered in the prometheus-grafana-setup guide), you just need to make sure the collector hasn’t been disabled.

Verify it’s active:

curl -s http://localhost:9100/metrics | grep node_pressure

You should see metrics like:

node_pressure_cpu_waiting_seconds_total
node_pressure_memory_waiting_seconds_total
node_pressure_memory_stalled_seconds_total
node_pressure_io_waiting_seconds_total
node_pressure_io_stalled_seconds_total

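These are counters of total stalled seconds: waiting corresponds to the some line, stalled to the full line. Prometheus rate() turns them into stalled seconds per second (a fraction between 0 and 1); multiply by 100 to get a percentage comparable to the avg values.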

Alert Rules That Don’t Suck

The default “alert if memory usage > 90%” rule is famously terrible — it fires on healthy JVM heap behavior and stays silent during actual OOM spirals. PSI-based alerts are better because they measure impact, not just utilization.

psi-alerts.yml
groups:
  - name: psi_pressure
    rules:
      - alert: MemoryPressureElevated
        expr: |
          rate(node_pressure_memory_waiting_seconds_total[2m]) * 100 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory pressure on {{ $labels.instance }}"
          description: >
            Memory 'some' pressure at {{ $value | printf "%.1f" }}% over last 2m.
            Tasks are waiting on memory reclaim.
      - alert: MemoryStalledCritical
        expr: |
          rate(node_pressure_memory_stalled_seconds_total[2m]) * 100 > 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Memory stall (full) on {{ $labels.instance }}"
          description: >
            Memory 'full' pressure at {{ $value | printf "%.1f" }}%: ALL tasks stalled.
            OOM event likely imminent. Check systemd-oomd logs.
      - alert: IOPressureHigh
        expr: |
          rate(node_pressure_io_waiting_seconds_total[5m]) * 100 > 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "IO pressure on {{ $labels.instance }}"
          description: >
            IO 'some' pressure at {{ $value | printf "%.1f" }}% for 10m.
            Disk throughput may be saturated.
      - alert: IOStalledCritical
        expr: |
          rate(node_pressure_io_stalled_seconds_total[2m]) * 100 > 10
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "IO full stall on {{ $labels.instance }}"
          description: >
            IO 'full' pressure at {{ $value | printf "%.1f" }}%: complete IO stall.
            Check disk health, IOPS limits, or NFS mount issues.

The thresholds above are sensible starting points, not gospel. A Postgres server under normal load might push IO some to 15% and that’s fine — tune your baselines by watching avg10 over a few days during normal operation, then set warnings at 2x that value.
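The "2x your baseline" rule can be turned into a number with a few lines of Python. This is a sketch under one loud assumption: the text doesn’t say which statistic counts as the baseline, so the median is used here as a stand-in that ignores the occasional spike. `warning_threshold` is a hypothetical helper and the readings are made up.

```python
from statistics import median


def warning_threshold(avg10_samples, multiplier=2.0):
    """Suggest a warning threshold from avg10 readings taken during normal load.

    Assumption: the baseline is the median of the samples.
    """
    if not avg10_samples:
        raise ValueError("need at least one sample")
    return median(avg10_samples) * multiplier


# Made-up IO 'some' avg10 readings from a quiet week, including one spike
io_some_samples = [6.0, 7.5, 8.0, 9.0, 15.0]
print(warning_threshold(io_some_samples))  # 16.0
```

If your workload has regular heavy periods (nightly backups, batch jobs), a high percentile may describe "normal" better than the median; the multiplier logic stays the same.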

Drop this in your Prometheus rules directory and reload:

sudo cp psi-alerts.yml /etc/prometheus/rules/
sudo systemctl reload prometheus
# Or via the HTTP API (Prometheus must be started with --web.enable-lifecycle):
curl -X POST http://localhost:9090/-/reload

Quick Reference: What the Numbers Mean

| Metric | High value means | Action |
| --- | --- | --- |
| CPU some > 20% | Scheduler contention | Check CPU count, nice levels, cgroup CPU limits |
| Memory some > 10% | Active page reclaim | Check RSS, swap usage, transparent hugepages |
| Memory full > 1% | Near-OOM condition | Check oomd logs, kill or limit offending cgroup |
| IO some > 25% | Disk throughput pressure | iostat, check IOPS limits, queue depth |
| IO full > 5% | Complete IO stall | Disk health, NFS issues, block device errors |
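For a quick triage script, the quick-reference thresholds can be encoded as data. The sketch below copies the values from the table; the `triage` function, the shape of its input, and the example readings are illustrative assumptions rather than an established tool.

```python
# (resource, line, threshold %, meaning, action), taken from the quick reference
QUICK_REFERENCE = [
    ("cpu", "some", 20.0, "Scheduler contention",
     "Check CPU count, nice levels, cgroup CPU limits"),
    ("memory", "some", 10.0, "Active page reclaim",
     "Check RSS, swap usage, transparent hugepages"),
    ("memory", "full", 1.0, "Near-OOM condition",
     "Check oomd logs, kill or limit offending cgroup"),
    ("io", "some", 25.0, "Disk throughput pressure",
     "Run iostat, check IOPS limits, queue depth"),
    ("io", "full", 5.0, "Complete IO stall",
     "Check disk health, NFS issues, block device errors"),
]


def triage(readings):
    """readings maps (resource, line) to a pressure percentage, e.g. avg10."""
    findings = []
    for resource, line, limit, meaning, action in QUICK_REFERENCE:
        value = readings.get((resource, line))
        if value is not None and value > limit:
            findings.append(
                f"{resource} {line} {value:.1f}% > {limit:.0f}%: {meaning}. {action}"
            )
    return findings


# Hypothetical readings: only memory full crosses its threshold
readings = {
    ("memory", "some"): 2.14,
    ("memory", "full"): 3.0,
    ("io", "some"): 15.32,
    ("io", "full"): 4.11,
}
for finding in triage(readings):
    print(finding)
```

Keeping the thresholds in a list makes it easy to swap in values tuned to your own baselines.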

The Takeaway

Load average told you something was happening. PSI tells you what was happening, to which resource, and for how long. It’s the difference between a smoke detector that goes off when you open the oven and one that tells you which room is on fire and how long it’s been burning.

Enable the node_exporter PSI collector. Add the alerts above. Check your containers’ memory.pressure files the next time something feels sluggish. And the next time someone says “load average is 4.2, seems fine,” you can pull up the memory full graph at 3%, watch their face change, and quietly go fix the actual problem.

Your 2 AM self will thank you for having real numbers instead of a five-minute average that tells you precisely nothing.

