Linux PSI: The Pressure Metrics Load Average Wishes It Were

Load Average Has Been Lying to You

You’ve stared at that load average: 3.42, 2.89, 2.71 output at 2 AM, wondering whether to panic. Four CPUs. Is 3.42 bad? Is it memory pressure? IO wait? CPU saturation? Is one process chewing everything or are fifty processes politely taking turns?

Load average doesn’t tell you. It never did. It counts the number of runnable or uninterruptible-sleep processes as a rolling average, which means it bundles IO-blocked tasks right alongside CPU-hungry ones, smooths everything over one, five, and fifteen minutes, and hands you a number that could mean roughly anything depending on your CPU count and workload shape.

The kernel has had a better answer since Linux 4.20. It’s called PSI, Pressure Stall Information, and it tells you the actual percentage of time your tasks were stuck waiting. Not a count. A percentage. With three separate windows. For CPU, memory, and IO independently. If your kernel is 4.20+ (it is) and was built with CONFIG_PSI, you already have it. A few distributions build it in but leave it off by default, in which case booting with psi=1 on the kernel command line turns it on. You’re probably just not looking at it.

What PSI Actually Measures

PSI tracks stall time: wall-clock time during which at least one task couldn’t make progress because it was waiting on a resource. It exposes three files in /proc/pressure/:

cat /proc/pressure/cpu
cat /proc/pressure/memory
cat /proc/pressure/io

Each file looks roughly like this (memory, as an example):

some avg10=0.42 avg60=1.23 avg300=0.88 total=284750
full avg10=0.00 avg60=0.15 avg300=0.08 total=42318

Two lines, and the distinction matters:

some: at least one task was stalled. The system was partially stuck.
full: every non-idle task was stalled at the same time. Nothing that wanted to run could make progress on that resource.

The full metric is the one that should make you sweat. some at 5% on memory means a few tasks occasionally waited for pages to be faulted in, which is normal. full at 5% on memory (in the avg10 column) means everything ground to a halt for 5% of the last ten seconds. That’s an OOM situation developing.
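If you want these numbers in your own tooling, the format is simple to parse. Here is a minimal Python sketch, assuming the layout shown above; the name parse_psi is made up for illustration and is not part of any library:

```python
def parse_psi(text):
    """Parse the contents of a PSI file into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()
        values = {}
        for field in fields:
            key, _, raw = field.partition("=")
            # avg* values are percentages; total is an integer counter
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


sample = """\
some avg10=0.42 avg60=1.23 avg300=0.88 total=284750
full avg10=0.00 avg60=0.15 avg300=0.08 total=42318
"""
psi = parse_psi(sample)
print(psi["some"]["avg10"], psi["full"]["total"])  # prints: 0.42 42318
```

A file with only a some line simply yields a dict without a "full" key, so callers should use .get("full") rather than assume both lines exist.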

The three windows, avg10, avg60, avg300, are exponentially weighted moving averages over 10 seconds, 60 seconds, and 300 seconds. Sound familiar? Same shape as load average’s 1/5/15, but these are percentages (0 to 100), they’re per-resource, and the 10-second window actually catches sudden spikes before your monitoring system has a chance to yawn and look away.

total is a monotonically increasing counter in microseconds, useful for calculating rate of change in your own tooling.
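To make that concrete: take two readings of total, subtract, and divide by the elapsed time. The sketch below assumes you sampled the file twice, five seconds apart. The second reading is invented for the example, and read_total and stall_percent are hypothetical helper names:

```python
import re


def read_total(text, kind="some"):
    """Return the total= counter (microseconds) from the given line of a PSI file."""
    match = re.search(rf"^{kind} .*\btotal=(\d+)", text, flags=re.MULTILINE)
    if match is None:
        raise ValueError(f"no '{kind}' line found")
    return int(match.group(1))


def stall_percent(total_before_us, total_after_us, interval_s):
    """Percentage of the interval spent stalled, from two total= readings."""
    stalled_s = (total_after_us - total_before_us) / 1_000_000
    return 100.0 * stalled_s / interval_s


# Two snapshots of the same file; the second one is made-up sample data.
before = "some avg10=0.42 avg60=1.23 avg300=0.88 total=284750\n"
after = "some avg10=0.90 avg60=1.25 avg300=0.89 total=534750\n"

pct = stall_percent(read_total(before), read_total(after), interval_s=5.0)
print(f"{pct:.1f}% of the last 5s was spent stalled")  # prints: 5.0% of the last 5s was spent stalled
```

Unlike the avg columns, this gives you an exact figure for whatever window you choose, which is handy when 10 seconds is too coarse or too fine.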

Reading the Three Pressure Files

CPU Pressure

cat /proc/pressure/cpu

some avg10=0.00 avg60=0.12 avg300=0.08 total=194823

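Note: on kernels before 5.13, CPU pressure has no full line at all. Newer kernels print one, but at the system level it is always zero: there’s no meaningful “all tasks stalled on CPU” state for the whole machine, because if no task is running, that’s just idle. CPU full only carries information inside a cgroup. some CPU pressure means tasks were runnable but waiting for CPU time: scheduler contention, not saturation per se.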

Memory Pressure

cat /proc/pressure/memory

some avg10=2.14 avg60=0.88 avg300=0.31 total=982341
full avg10=0.00 avg60=0.00 avg300=0.00 total=8821

Memory some at 2% over 10 seconds: a handful of tasks are waiting for the kernel to reclaim pages or service page faults. Not an emergency, but worth watching. Memory full near zero: nobody’s completely blocked. The moment full starts climbing, your system is heading toward an OOM event: the kernel is desperately swapping and reclaiming while everything waits.

IO Pressure

cat /proc/pressure/io

some avg10=15.32 avg60=8.44 avg300=3.12 total=4821033
full avg10=4.11 avg60=2.31 avg300=0.98 total=892441

IO some at 15% is a yellow flag: tasks are waiting on disk reads and writes, which is probably fine on a busy DB node but worth correlating with disk throughput. IO full at 4% over 10 seconds means everything froze while waiting on IO. That’s the stat that explains why your SSH session felt like moving through treacle even though CPU load looked fine.
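For a one-shot look at all three files, something like the following Python sketch works. It assumes the standard /proc/pressure layout; read_pressure is a hypothetical helper name, and the base argument exists only so the function can be pointed at a different directory:

```python
from pathlib import Path


def read_pressure(base="/proc/pressure"):
    """Return {resource: {kind: avg10}} for whichever PSI files exist under base."""
    snapshot = {}
    for resource in ("cpu", "memory", "io"):
        try:
            text = (Path(base) / resource).read_text()
        except OSError:
            # File missing or unreadable: old kernel, or PSI disabled.
            continue
        readings = {}
        for line in text.splitlines():
            parts = line.split()
            if not parts:
                continue
            fields = dict(p.split("=", 1) for p in parts[1:])
            readings[parts[0]] = float(fields["avg10"])
        snapshot[resource] = readings
    return snapshot


if __name__ == "__main__":
    for resource, readings in read_pressure().items():
        summary = "  ".join(f"{kind}={value:5.2f}%" for kind, value in readings.items())
        print(f"{resource:<6} {summary}")
```

It only reports avg10, on the theory that during an incident the most recent window is the one you care about.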

PSI Per-Cgroup: Pressure Inside Containers

Here’s where it gets useful for container workloads. If you’re running cgroups v2 (and if you followed the cgroups v2 deep dive published here on 2026-08-07, you know your systemd-enabled system is already using it), each cgroup exposes its own pressure files.

Find your container’s cgroup path:

# For a systemd service
systemctl show my-service.service -p ControlGroup
# Typical output: ControlGroup=/system.slice/my-service.service

cat /sys/fs/cgroup/system.slice/my-service.service/memory.pressure

For Docker containers (cgroup v2):

# Get the container's cgroup
docker inspect --format '{{.Id}}' mycontainer
# Then:
cat /sys/fs/cgroup/system.slice/docker-<container-id>.scope/memory.pressure

This is huge. You can now see exactly which container is experiencing memory pressure, whatever the rest of the system is doing. A container thrashing its memory still shows up in the system-wide numbers, but the per-cgroup file lets you pinpoint it. No more watching docker stats show 90% memory usage on a container and guessing whether it’s actually stalled.
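When you don’t yet know which cgroup to look at, you can walk the hierarchy and rank everything by memory pressure. A rough Python sketch, assuming cgroup v2 is mounted at /sys/fs/cgroup; the function name and its parameters are invented for this example:

```python
import os


def rank_cgroups_by_memory_pressure(root="/sys/fs/cgroup", kind="some", top=5):
    """Return [(avg10, cgroup_path)] for the cgroups with the highest memory pressure."""
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "memory.pressure" not in filenames:
            continue
        try:
            with open(os.path.join(dirpath, "memory.pressure")) as f:
                text = f.read()
        except OSError:
            continue  # cgroup vanished mid-walk, or the file is unreadable
        for line in text.splitlines():
            parts = line.split()
            if parts and parts[0] == kind:
                fields = dict(p.split("=", 1) for p in parts[1:])
                rel = os.path.relpath(dirpath, root)
                path = "/" if rel == "." else "/" + rel
                results.append((float(fields["avg10"]), path))
    results.sort(reverse=True)  # highest avg10 first
    return results[:top]


if __name__ == "__main__":
    for avg10, path in rank_cgroups_by_memory_pressure():
        print(f"{avg10:6.2f}%  {path}")
```

Pass kind="full" to rank by complete stalls instead. Parent slices aggregate their children, so expect a slice to appear alongside the service or scope that is actually responsible.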

How oomd and systemd-oomd Use This

The classic OOM killer is reactive: it fires after the kernel has already given up on memory reclaim, picks a process (often the wrong one), and terminates it. By then you’ve had a bad time.

systemd-oomd uses PSI to be proactive. It monitors memory.pressure on the cgroups that opt in (units with ManagedOOMMemoryPressure=kill) and starts killing when pressure crosses a configurable threshold, before the kernel OOM killer triggers. The policy watches for memory full pressure staying above the limit for a set duration, then kills a descendant cgroup of the affected unit, generally the one with the most reclaim activity.

Check if it’s running:

systemctl status systemd-oomd

The config lives at /etc/systemd/oomd.conf. Key options:

[OOM]
SwapUsedLimit=90%
DefaultMemoryPressureLimit=60%
DefaultMemoryPressureDurationSec=30s

DefaultMemoryPressureLimit=60% means: if memory full pressure (the 10-second average) stays above 60% for 30 seconds in a monitored cgroup, oomd treats its descendants as candidates for termination. Adjust to taste; 60% is conservative. On a database node you might push this to 80%.
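The limit-plus-duration rule is easy to reason about with a toy model. The Python below is only an illustration of the idea, not systemd-oomd’s actual implementation; sustained_breach and the sample data are made up for the example:

```python
def sustained_breach(samples, limit_pct=60.0, duration_s=30.0):
    """Return True if pressure stayed above limit_pct for at least duration_s.

    samples is an iterable of (timestamp_seconds, pressure_percent) in time order.
    """
    breach_start = None
    for timestamp, pressure in samples:
        if pressure > limit_pct:
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start >= duration_s:
                return True
        else:
            breach_start = None  # dipped below the limit: start over
    return False


# One sample every 5 seconds. The first series spikes for 20s, the second never recovers.
spike = [(t, 75.0 if 10 <= t < 30 else 5.0) for t in range(0, 60, 5)]
sustained = [(t, 75.0 if t >= 10 else 5.0) for t in range(0, 60, 5)]

print(sustained_breach(spike), sustained_breach(sustained))  # prints: False True
```

The duration requirement is what keeps a short burst of reclaim from getting something killed; only pressure that refuses to go away trips the rule.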

Kubernetes takes a similar approach with eviction signals. The kubelet can surface PSI metrics via the KubeletPSI feature gate: it landed as alpha in 1.33, went beta (on by default) in 1.34, and is stable and locked-on as of 1.36, letting the kubelet feed pressure into eviction decisions before a node goes fully OOM.

systemd-cgtop: Pressure at a Glance

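Before you wire up Prometheus, you can use systemd-cgtop to see live resource usage per cgroup:

systemd-cgtop -p --depth=3

The -p flag sorts the list by cgroup path (use -m or -c to sort by memory or CPU instead). You’ll see task counts, CPU%, memory, and IO read/write per cgroup. It does not print PSI values itself, so pair it with the memory.pressure file of whichever cgroup looks suspicious: cgtop tells you who is big, the pressure file tells you who is stalled. It’s a solid first triage tool when something is wrong right now and you don’t have time to query Grafana.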

Wiring PSI Into Prometheus

For longer-term visibility and alerting, Prometheus node_exporter has shipped a pressure collector since v0.18.0, and it is enabled by default. If you’re already running the Prometheus + Grafana stack (covered in the prometheus-grafana-setup guide), you just need to make sure the collector hasn’t been switched off and that the kernel exposes /proc/pressure.

Verify it’s active:

curl -s http://localhost:9100/metrics | grep node_pressure

You should see metrics like:

node_pressure_cpu_waiting_seconds_total
node_pressure_memory_waiting_seconds_total
node_pressure_memory_stalled_seconds_total
node_pressure_io_waiting_seconds_total
node_pressure_io_stalled_seconds_total

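These are counters of cumulative stall time in seconds: waiting corresponds to the some line, stalled to the full line. Prometheus rate() turns them into seconds stalled per second, a fraction between 0 and 1; multiply by 100 to get a percentage comparable to the avg columns.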

Alert Rules That Don’t Suck

The default “alert if memory usage > 90%” rule is famously terrible: it fires on healthy JVM heap behavior and stays silent during actual OOM spirals. PSI-based alerts are better because they measure impact, not just utilization.

groups:
  - name: psi_pressure
    rules:

      - alert: MemoryPressureElevated
        expr: |
          rate(node_pressure_memory_waiting_seconds_total[2m]) * 100 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory pressure on {{ $labels.instance }}"
          description: >
            Memory 'some' pressure at {{ $value | printf "%.1f" }}% over last 2m.
            Tasks are waiting on memory reclaim.

      - alert: MemoryStalledCritical
        expr: |
          rate(node_pressure_memory_stalled_seconds_total[2m]) * 100 > 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Memory stall (full) on {{ $labels.instance }}"
          description: >
            Memory 'full' pressure at {{ $value | printf "%.1f" }}%: ALL tasks stalled.
            OOM event likely imminent. Check systemd-oomd logs.

      - alert: IOPressureHigh
        expr: |
          rate(node_pressure_io_waiting_seconds_total[5m]) * 100 > 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "IO pressure on {{ $labels.instance }}"
          description: >
            IO 'some' pressure at {{ $value | printf "%.1f" }}% for 10m.
            Disk throughput may be saturated.

      - alert: IOStalledCritical
        expr: |
          rate(node_pressure_io_stalled_seconds_total[2m]) * 100 > 10
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "IO full stall on {{ $labels.instance }}"
          description: >
            IO 'full' pressure at {{ $value | printf "%.1f" }}%: complete IO stall.
            Check disk health, IOPS limits, or NFS mount issues.

The thresholds above are sensible starting points, not gospel. A Postgres server under normal load might push IO some to 15% and that’s fine. Tune your baselines by watching avg10 over a few days during normal operation, then set warnings at 2x that value.
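If you have a few days of avg10 samples lying around, turning them into a threshold is a one-liner. The sketch below is one way to do it in Python. Using the 95th percentile as “the baseline” is an assumption made here, not a rule, and warning_threshold is a hypothetical name:

```python
import statistics


def warning_threshold(avg10_samples, multiplier=2.0):
    """Suggest a warning threshold: multiplier x the 95th percentile of observed avg10."""
    if len(avg10_samples) < 2:
        raise ValueError("need at least two samples to establish a baseline")
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile
    p95 = statistics.quantiles(avg10_samples, n=20)[-1]
    return multiplier * p95


# A perfectly flat baseline of 7.5% suggests warning at 15%.
print(warning_threshold([7.5] * 100))  # prints: 15.0
```

A percentile is used rather than the mean so that ordinary daily peaks count as normal. Swap in the median or the maximum if that matches your workload better.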

Drop the rules file in your Prometheus rules directory and reload:

sudo cp psi-alerts.yml /etc/prometheus/rules/
sudo systemctl reload prometheus
# Or via the HTTP API (requires Prometheus to be started with --web.enable-lifecycle):
curl -X POST http://localhost:9090/-/reload

Quick Reference: What the Numbers Mean

Metric	High Value Means	Action
CPU `some` > 20%	Scheduler contention	Check CPU count, nice levels, cgroup CPU limits
Memory `some` > 10%	Active page reclaim	Check RSS, swap usage, transparent hugepages
Memory `full` > 1%	Near-OOM condition	Check oomd logs, kill or limit offending cgroup
IO `some` > 25%	Disk throughput pressure	iostat, check IOPS limits, queue depth
IO `full` > 5%	Complete IO stall	Disk health, NFS issues, block device errors
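The same table can live in code if you’d rather have a script do the first pass. A small Python sketch: the thresholds and advice are copied from the table, while REFERENCE and triage are names invented for the example:

```python
# Thresholds and actions copied from the quick-reference table.
# Listed most severe first so the output reads in priority order.
REFERENCE = [
    ("memory", "full", 1.0, "Near-OOM condition: check oomd logs, kill or limit offending cgroup"),
    ("io", "full", 5.0, "Complete IO stall: check disk health, NFS issues, block device errors"),
    ("cpu", "some", 20.0, "Scheduler contention: check CPU count, nice levels, cgroup CPU limits"),
    ("memory", "some", 10.0, "Active page reclaim: check RSS, swap usage, transparent hugepages"),
    ("io", "some", 25.0, "Disk throughput pressure: run iostat, check IOPS limits, queue depth"),
]


def triage(readings):
    """readings maps (resource, kind) -> percent; returns one advice line per breach."""
    findings = []
    for resource, kind, limit, advice in REFERENCE:
        value = readings.get((resource, kind))
        if value is not None and value > limit:
            findings.append(f"{resource} {kind} {value:.1f}% > {limit:g}%: {advice}")
    return findings


for line in triage({("memory", "full"): 3.0, ("io", "some"): 15.32}):
    print(line)
```

With those two readings only the memory line is reported, since IO some at 15.32% is still under the 25% mark.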

The Takeaway

Load average told you something was happening. PSI tells you what was happening, to which resource, and for how long. It’s the difference between a smoke detector that goes off when you open the oven and one that tells you which room is on fire and how long it’s been burning.

Confirm the node_exporter pressure collector is running. Add the alerts above. Check your containers’ memory.pressure files the next time something feels sluggish. And the next time someone says “load average is 4.2, seems fine,” you can pull up the memory full graph at 3%, watch their face change, and quietly go fix the actual problem.

Your 2 AM self will thank you for having real numbers instead of a five-minute average that tells you precisely nothing.
