io_uring: Linux's Async I/O Future Is Already Here

Your Syscalls Are the Problem

Every time your app does a read() or write(), you pay a tax. User space asks the kernel to do something, the kernel does it, control returns. That round-trip costs a user-to-kernel mode switch (and a full context switch if the call blocks), plus copying data between kernel and user buffers. If you’re doing it thousands of times per second, those taxes stack up fast.

The classic fix was epoll: watch a pile of file descriptors, get notified when they’re ready, then do your I/O. Better than select() for sure. But you’re still issuing syscalls one at a time, still paying per operation. POSIX AIO tried to solve this properly and mostly succeeded in being confusing and unreliable. For years, high-performance Linux I/O was just “use epoll and cry a little.”

Then in Linux 5.1 (2019), Jens Axboe landed io_uring and quietly changed everything.

What io_uring Actually Is

Two shared-memory ring buffers between your process and the kernel:

SQ (Submission Queue): You write I/O requests here — reads, writes, accept(), sendmsg(), whatever.
CQ (Completion Queue): The kernel writes results here when operations finish.

That’s it. You batch up work in the SQ, call io_uring_enter() once (or not at all if you enable kernel-side polling), and harvest completions from the CQ. The kernel never has to copy your requests — it reads directly from shared memory. Your app never has to block waiting for individual ops to finish.

The syscall overhead collapse is real. Instead of N syscalls for N operations, you might pay one io_uring_enter() for hundreds of ops. At scale, this can be the difference between your event loop burning a double-digit share of its CPU time on syscall overhead and spending very little there. The exact figure depends heavily on the workload.

Traditional epoll flow:
  app → epoll_wait() → fd ready → read() → process → repeat
  [syscall][syscall][syscall] per operation

io_uring flow:
  app → fill SQE batch → io_uring_enter() → kernel drains queue → harvest CQEs
  [1 syscall] per batch of N operations
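
To make the two-ring picture concrete, here is a toy model in Python. It is only a sketch: the `Ring` class and `toy_kernel_drain` function are hypothetical names invented for illustration, the "kernel" is an ordinary function, and nothing here touches real shared memory. What it does mirror is the head/tail counter scheme and the one-call-per-batch flow described above.

```python
class Ring:
    """Fixed-size ring indexed by free-running head/tail counters."""

    def __init__(self, entries):
        # Power-of-two sizes let a counter be mapped to a slot with a mask.
        assert entries > 0 and entries & (entries - 1) == 0
        self.slots = [None] * entries
        self.mask = entries - 1
        self.head = 0  # advanced by the consumer
        self.tail = 0  # advanced by the producer

    def push(self, item):
        if self.tail - self.head == len(self.slots):
            return False  # ring is full
        self.slots[self.tail & self.mask] = item
        self.tail += 1
        return True

    def pop(self):
        if self.head == self.tail:
            return None  # ring is empty
        item = self.slots[self.head & self.mask]
        self.head += 1
        return item


def toy_kernel_drain(sq, cq):
    """Stand-in for the kernel: consume every SQE, post one CQE per request."""
    handled = 0
    while (sqe := sq.pop()) is not None:
        user_data, nbytes = sqe
        cq.push((user_data, nbytes))  # pretend every read returned nbytes
        handled += 1
    return handled


sq, cq = Ring(8), Ring(16)
for i in range(5):
    sq.push((f"req-{i}", 4096))      # the application fills SQEs

enter_calls = 1                      # one simulated io_uring_enter() for the batch
handled = toy_kernel_drain(sq, cq)

completions = []
while (cqe := cq.pop()) is not None:
    completions.append(cqe)          # the application harvests CQEs

print(enter_calls, handled, [c[0] for c in completions])
```

Five requests go in, one simulated enter call is made, and five completions come out. That ratio is the whole point of the design.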

The API: liburing Makes It Survivable

The raw io_uring syscall interface is, charitably, not ergonomic. liburing is the thin wrapper Jens wrote to make it usable. It ships in most distros as liburing-dev / liburing-devel.

Here’s a minimal but working example — read a file asynchronously and print its contents:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <liburing.h>

#define BUFSIZE 4096
#define QUEUE_DEPTH 1

int main(int argc, char *argv[]) {
    if (argc < 2) { fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    struct io_uring ring;
    struct io_uring_sqe *sqe;
    struct io_uring_cqe *cqe;
    char buf[BUFSIZE + 1];  // +1 so a full-size read still leaves room for '\0'
    struct iovec iov = { .iov_base = buf, .iov_len = BUFSIZE };

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    // init ring with depth 1
    int ret = io_uring_queue_init(QUEUE_DEPTH, &ring, 0);
    if (ret < 0) {  // liburing returns -errno and does not set errno
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret)); return 1;
    }

    // grab a submission queue entry
    sqe = io_uring_get_sqe(&ring);
    io_uring_prep_readv(sqe, fd, &iov, 1, 0);
    io_uring_sqe_set_data(sqe, &iov); // attach user data for CQE lookup

    // submit and wait for 1 completion
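    if (io_uring_submit(&ring) < 0) { fprintf(stderr, "io_uring_submit failed\n"); return 1; }
    int wait_ret = io_uring_wait_cqe(&ring, &cqe);
    if (wait_ret < 0) { fprintf(stderr, "io_uring_wait_cqe: %s\n", strerror(-wait_ret)); return 1; }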

    if (cqe->res < 0) {
        fprintf(stderr, "async read failed: %s\n", strerror(-cqe->res));
    } else {
        buf[cqe->res] = '\0';
        printf("%s", buf);
    }

    io_uring_cqe_seen(&ring, cqe); // mark completion consumed
    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}

Compile with:

gcc ring_read.c -o ring_read -luring
./ring_read /etc/hostname

That’s the skeleton. Real usage batches dozens of SQEs before calling submit, and harvests CQEs in a loop. The io_uring_sqe_set_data() / io_uring_cqe_get_data() pair is how you correlate which completion belongs to which request — essential when you’re juggling hundreds in flight.
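
The correlation idea is easier to see without the C plumbing. The following Python sketch is a simulation, not a binding: `submit_batch`, `fake_completions` and `harvest` are hypothetical names, the "result" of each request is just the length of its path, and completions are shuffled on purpose because the kernel is free to finish requests out of order.

```python
import random


def submit_batch(paths):
    """Give each request a token and remember it while the request is in flight."""
    in_flight = {}
    for token, path in enumerate(paths, start=1):
        in_flight[token] = path  # plays the role of io_uring_sqe_set_data()
    return in_flight


def fake_completions(in_flight, seed=0):
    """Stand-in for the kernel: (user_data, res) pairs in shuffled order."""
    cqes = [(token, len(path)) for token, path in in_flight.items()]
    random.Random(seed).shuffle(cqes)
    return cqes


def harvest(in_flight, cqes):
    """Match every completion to its request, like io_uring_cqe_get_data()."""
    results = {}
    for token, res in cqes:
        path = in_flight.pop(token)  # KeyError would mean an unknown completion
        results[path] = res
    return results


in_flight = submit_batch(["/etc/hostname", "/etc/hosts", "/etc/os-release"])
results = harvest(in_flight, fake_completions(in_flight))
print(results, "still in flight:", len(in_flight))
```

Whatever order the completions arrive in, the token carried through the ring is enough to find the right request, and the in-flight table is empty once everything has been harvested.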

The Kernel-Side Polling Mode

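If you’re doing extremely high throughput I/O (think NVMe at millions of IOPS), there’s IORING_SETUP_SQPOLL. A kernel thread polls your SQ continuously, so while that thread is awake user space makes zero syscalls to submit work. Your process just writes SQEs and reads CQEs from shared memory. If the thread sits idle for longer than sq_thread_idle it goes to sleep, and the next submission needs one io_uring_enter() call to wake it; liburing’s io_uring_submit() does that check for you.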

struct io_uring_params params = {0};
params.flags |= IORING_SETUP_SQPOLL;
params.sq_thread_idle = 2000; // ms before thread sleeps

io_uring_queue_init_params(QUEUE_DEPTH, &ring, &params);

This is “forklift to move a couch” territory — great if you’re writing a storage engine, overkill if you’re handling 50 HTTP requests/sec on a home server. The kernel thread burns a whole CPU core when active.

Python: You Can, But Should You?

Python’s asyncio doesn’t use io_uring by default — it still goes through epoll under the hood via the selector event loop. There’s work in progress to change that, but as of CPython 3.13/3.14 you’re still on epoll by default unless you reach for something else.

For pure Python, your two options are:

Option 1: liburing via ctypes — works, cursed, not recommended for production:

import ctypes
import os

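# This is a sketch: real ctypes bindings need the full struct layout.
# Many helpers (the io_uring_prep_* family, for example) are static inline
# functions in liburing.h, so they may not be exported from the shared
# library at all. Use this to understand the shape, not as production code.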
liburing = ctypes.CDLL("liburing.so.2")

# io_uring_queue_init(depth, ring_ptr, flags)
# io_uring_get_sqe(ring_ptr)
# io_uring_prep_read(sqe, fd, buf, len, offset)
# io_uring_submit(ring_ptr)
# io_uring_wait_cqe(ring_ptr, cqe_ptr)
# ... you get the idea

print("This path leads to madness. Use a proper binding.")

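Option 2: liburing Python bindings or Rust extensions. python-liburing exists on PyPI and gives you actual ergonomic access (check the exact package and import names against the release you install, since binding APIs change between versions):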

pip install python-liburing

from python_liburing import io_uring, io_uring_queue_init, io_uring_queue_exit
from python_liburing import io_uring_get_sqe, io_uring_prep_read
from python_liburing import io_uring_submit, io_uring_wait_cqe, io_uring_cqe_seen
import os

ring = io_uring()
io_uring_queue_init(8, ring, 0)

fd = os.open("/etc/hostname", os.O_RDONLY)
buf = bytearray(256)

sqe = io_uring_get_sqe(ring)
io_uring_prep_read(sqe, fd, buf, len(buf), 0)
io_uring_submit(ring)

cqe = io_uring_wait_cqe(ring)
print(bytes(buf[:cqe.res]).decode().strip())
io_uring_cqe_seen(ring, cqe)

io_uring_queue_exit(ring)
os.close(fd)

Honestly though — if you need io_uring performance in Python, you’re probably solving the wrong problem. The GIL, the object overhead, the interpreter itself will eat your gains before io_uring gets a chance to shine. Write the hot path in C or Rust, call it from Python. That’s the real answer.

The Security Saga (Read This Before Enabling It)

Here’s the part everyone glosses over: io_uring has had a rough CVE history. Between 2022 and 2024, it was one of the most exploited kernel subsystems for local privilege escalation. The attack surface is genuinely large — you’re giving user processes deep, shared-memory-level access to kernel I/O machinery.

The distro response:

# Google disabled io_uring for unprivileged users in ChromeOS (2023)
# Meta restricted it in their container environments
# Ubuntu 24.04+:
 kernel.io_uring_disabled = 0  (default, fully open)
 kernel.io_uring_disabled = 1  (hardening option: disabled for unprivileged users)

# RHEL 9 / CentOS Stream 9:
 kernel.io_uring_disabled = 2  (disabled for all processes, root included)

Check your current setting:

sysctl kernel.io_uring_disabled
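# 0 = fully enabled (the upstream kernel default)
# 1 = creating new rings is disabled for unprivileged processes
#     (allowed with CAP_SYS_ADMIN or membership in kernel.io_uring_group)
# 2 = creating new rings is disabled for all processes
# The sysctl was added in Linux 6.6; on older kernels without a backport the key does not exist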
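
If you want the same check inside a deployment script or health check, the sysctl is also readable as a file under /proc/sys. Below is a minimal Python sketch of that, assuming the standard procfs path for the key. The function name `io_uring_policy` and the wording of the messages are made up for illustration.

```python
from pathlib import Path

# Standard procfs location for the kernel.io_uring_disabled sysctl.
SYSCTL_PATH = Path("/proc/sys/kernel/io_uring_disabled")

MEANINGS = {
    0: "fully enabled",
    1: "disabled for unprivileged processes",
    2: "disabled for all processes",
}


def io_uring_policy(path=SYSCTL_PATH):
    """Return (value, meaning); value is None when the sysctl cannot be read."""
    try:
        value = int(Path(path).read_text().strip())
    except OSError:
        # Typically: the kernel predates the sysctl, or /proc is restricted.
        return None, "sysctl not readable on this system"
    return value, MEANINGS.get(value, "unknown value")


if __name__ == "__main__":
    print(io_uring_policy())
```

A missing key is not the same as a value of 0. On kernels that predate the sysctl, io_uring may be available with no system-wide switch at all, which is why the sketch reports that case separately.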

If you’re running containers and haven’t thought about this, go check now. An unprivileged container process with io_uring access is a known attack vector. Most CVEs in this space (CVE-2022-29582, CVE-2023-2598, a handful of others) used io_uring as the mechanism for UAF and heap spray attacks.

For servers doing untrusted user workloads: kernel.io_uring_disabled=1 or 2 is the right call. With 1 you can still grant access to privileged services (CAP_SYS_ADMIN, or the group named in kernel.io_uring_group); with 2 nothing on the box can create a ring. For a single-user dev machine or a storage server you control end-to-end: leaving it enabled is fine, just know what you’re doing.

Setting it:

# Temporary
sudo sysctl -w kernel.io_uring_disabled=1

# Permanent
echo "kernel.io_uring_disabled=1" | sudo tee /etc/sysctl.d/99-io-uring.conf
sudo sysctl -p /etc/sysctl.d/99-io-uring.conf

Benchmarks: The Numbers That Matter

Real-world numbers from Fio (flexible I/O tester) comparing backends on NVMe:

# Random read, 4K blocks, queue depth 32, 60s
fio --name=rand_read --ioengine=io_uring --iodepth=32 --rw=randread \
    --bs=4k --direct=1 --size=4G --numjobs=4 --time_based --runtime=60 --filename=/dev/nvme0n1

# Same test with libaio
fio --name=rand_read --ioengine=libaio --iodepth=32 --rw=randread \
    --bs=4k --direct=1 --size=4G --numjobs=4 --time_based --runtime=60 --filename=/dev/nvme0n1

Typical results on modern NVMe (your mileage varies wildly by hardware; fio has no epoll engine, so the epoll row does not come from the commands above):

Backend              IOPS      Latency (p99)   CPU %
libaio               650K      180µs           22%
epoll                580K      210µs           28%
io_uring             820K      140µs           14%
io_uring (SQPOLL)    1.1M      90µs            9%    (dedicated core)

The CPU savings are often more impressive than the raw IOPS. If you’re running a storage-heavy workload and your app servers are burning CPU on syscall overhead and context switches, io_uring can meaningfully drop that number.
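
One way to see the CPU point is to divide IOPS by CPU percentage for each row. The short Python sketch below does only that arithmetic, using the figures from the table as given. The metric name "IOPS per CPU point" is introduced here for illustration and is not something fio reports.

```python
# Figures copied from the table above: (IOPS, CPU percent).
RESULTS = {
    "libaio": (650_000, 22),
    "epoll": (580_000, 28),
    "io_uring": (820_000, 14),
    "io_uring (SQPOLL)": (1_100_000, 9),
}


def iops_per_cpu_point(iops, cpu_percent):
    """IOPS delivered for each percentage point of CPU consumed."""
    return iops / cpu_percent


baseline = iops_per_cpu_point(*RESULTS["libaio"])
for name, (iops, cpu) in RESULTS.items():
    eff = iops_per_cpu_point(iops, cpu)
    print(f"{name:<20} {eff / 1000:6.1f}K IOPS per CPU %   {eff / baseline:4.2f}x libaio")
```

On these numbers plain io_uring delivers about 1.26x the raw IOPS of libaio but close to twice the work per unit of CPU, which is why the CPU column is the one to watch.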

For network I/O (sockets, accept, sendmsg), the story is similar — Nginx experimental io_uring patches showed 15-20% throughput gains under high connection rates. Not “throws out all your hardware” dramatic, but real.

When to Skip It Entirely

io_uring is not a universal upgrade. Cases where it’s the wrong tool:

Low-volume I/O — If you’re reading configs at startup or writing a handful of log lines, read()/write() is fine. The setup overhead and code complexity of io_uring buys you nothing at low call rates.

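Interpreted languages without native bindings: Python, Ruby, Node.js (mostly). The runtime overhead drowns the gains. Node.js is a partial exception, because libuv, its I/O layer, has io_uring support for file operations. Whether that support is switched on depends on your libuv and Node.js versions (it has been turned off by default in some releases for security reasons, and the UV_USE_IO_URING environment variable controls it). Either way it is the runtime’s decision, not something you wire up yourself.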

Containers with locked-down seccomp profiles — io_uring_setup, io_uring_enter, io_uring_register all need to be in your allowlist. Many default Docker/k8s seccomp profiles block them. You’ll get EPERM and spend an hour wondering why your high-performance code is slower than printf.

# Check if io_uring syscalls are blocked in your container
strace -e trace=io_uring_setup ./your_app 2>&1 | grep -i "operation not permitted"
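
When strace is not available inside the container, you can ask the kernel directly. The Python sketch below issues a raw io_uring_setup call through ctypes and reports the outcome. It rests on two assumptions that you should verify for your platform: the syscall number 425 (the value on x86-64 and aarch64) and a 120-byte struct io_uring_params. The function name `probe_io_uring` is hypothetical.

```python
import ctypes
import errno
import os
import sys

SYS_IO_URING_SETUP = 425   # assumption: syscall number on x86-64 and aarch64
PARAMS_SIZE = 120          # assumption: sizeof(struct io_uring_params)


def probe_io_uring():
    """Try to create a 1-entry ring and describe what the kernel answered."""
    if not sys.platform.startswith("linux"):
        return "unsupported: not Linux"
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    # The params struct must be zeroed; a fresh string buffer already is.
    params = ctypes.create_string_buffer(PARAMS_SIZE)
    fd = libc.syscall(ctypes.c_long(SYS_IO_URING_SETUP), ctypes.c_long(1), params)
    if fd >= 0:
        os.close(fd)  # the ring is not needed, only the answer
        return "available"
    err = ctypes.get_errno()
    if err == errno.ENOSYS:
        return "ENOSYS: kernel too old, or a filter hides the syscall"
    if err == errno.EPERM:
        return "EPERM: blocked by seccomp or by kernel.io_uring_disabled"
    return f"failed: {errno.errorcode.get(err, err)}"


if __name__ == "__main__":
    print(probe_io_uring())
```

An EPERM here does not say which layer refused. Compare it with the kernel.io_uring_disabled value and your runtime's seccomp profile to find out.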

Kernel < 5.10: Features were backfilled heavily between 5.1 and 5.10. Anything older than 5.10 has incomplete support and known bugs. If you’re still on a 5.4 LTS kernel, operations such as splice (added in 5.7) and tee (added in 5.8) just aren’t there. Some newer features need more than 5.10: multi-shot accept only arrived in 5.19.
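
A rough preflight for the version question can be done by parsing the kernel release string. The Python sketch below is a heuristic under stated assumptions: the version table lists the upstream releases in which each feature first appeared to the best of my knowledge (5.1 for io_uring itself, 5.7 for splice, 5.8 for tee, 5.19 for multi-shot accept), and the function names are hypothetical. Distribution kernels backport features, so treat the result as a hint rather than a guarantee.

```python
import platform
import re

# Upstream kernel release in which each feature first appeared (assumed values).
FEATURE_MIN_KERNEL = {
    "io_uring": (5, 1),
    "splice": (5, 7),
    "tee": (5, 8),
    "multishot accept": (5, 19),
}


def parse_release(release):
    """Turn a release string such as '5.15.0-91-generic' into (5, 15)."""
    match = re.match(r"(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2) or 0)


def available_features(release):
    """List the features whose minimum upstream version is met by this release."""
    version = parse_release(release)
    return [name for name, minimum in FEATURE_MIN_KERNEL.items() if version >= minimum]


if __name__ == "__main__":
    print(platform.release(), available_features(platform.release()))
```

For an authoritative answer on a given machine, liburing's probe interface asks the running kernel which opcodes it supports. The version table is only a quick first filter.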

Security-sensitive multi-tenant environments — As covered above, the privilege escalation history is real. If you don’t control who runs code on the box, think hard before enabling it.

The Real Answer for Most Use Cases

If you’re writing a new high-performance server in C or Rust and targeting Linux 5.10+, io_uring should be your default I/O backend. The liburing API is mature, the performance gains are consistent, and the kernel support is stable.

If you’re writing Go, the runtime doesn’t use io_uring (there is a long-running discussion in the Go issue tracker, but nothing has shipped). You’re on epoll. That’s fine: Go’s goroutine scheduler handles I/O concurrency elegantly without you thinking about it.

If you’re writing Rust, tokio (the dominant async runtime) is epoll-based by default. io_uring support lives in tokio-uring, a separate crate with its own runtime and API surface:

# In your Cargo.toml
# tokio-uring = "0.4"
# Note: tokio-uring is separate from tokio and has a different API surface

The ecosystem is moving. Database engines (RocksDB, ScyllaDB), storage systems (SPDK), and web servers have all added io_uring support, and some use it as their preferred Linux I/O path. It’s not experimental anymore; it’s the direction the kernel is heading.

Your 2 AM self debugging an EPERM in a container will appreciate having read this section first.

TL;DR

io_uring batches I/O operations into shared ring buffers, slashing syscall overhead and CPU usage on high-throughput workloads. It beats epoll and libaio on raw IOPS and especially on CPU efficiency. The liburing C API is the sane way in. Python bindings exist but the GIL will eat your gains anyway.

The security track record is genuinely rough — check kernel.io_uring_disabled on your servers and lock it down for unprivileged users if you’re running untrusted workloads. Skip it entirely for low-volume I/O, old kernels, or locked-down container environments.

For everything else: it’s fast, it’s stable, and it’s already in production at Meta and at plenty of storage companies that care about IOPS. The future arrived in kernel 5.1. You just have to turn it on.
