PostgreSQL on ZFS: Tuning, Snapshots, Pitfalls

By SumGuy 11 min read
You Already Have ZFS. Now Put Postgres On It Properly.

If you’re running a home lab with ZFS — and at this point, who isn’t — you’ve probably already got PostgreSQL running on it. The question is whether it’s configured for ZFS or just plopped on top of it like a couch on a moving truck. Technically it works. But your neighbors (and your WAL logs) will have questions.

The good news: Postgres and ZFS are an unusually good match when tuned correctly. Atomic snapshots replace the pain of pg_basebackup. lz4 compression squeezes 2–3x on text-heavy databases. And ZFS checksums catch the silent block corruption that ext4 just quietly ignores until you’re restoring from a backup at 2 AM wondering why your users table has six thousand NULL rows.

The bad news: getting there requires about a dozen settings you won’t find in the Postgres docs, because ZFS doesn’t exist from Postgres’s perspective — it’s just a filesystem. So let’s fix that.


Why Bother in the First Place

Before you tune anything, here’s why the combination is worth it:

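Atomic snapshots. ZFS snapshots are copy-on-write and instantaneous. A zfs snapshot is crash-consistent at the block level, which is all Postgres needs to recover from WAL: no pg_backup_start dance (pg_start_backup before PostgreSQL 15), no long checkpoint stalls on busy databases. The one condition is that data and WAL are captured at the same instant, so if they live on separate datasets, snapshot them in a single command. For home lab and small production workloads, this is transformational.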

Compression. Postgres stores a lot of null bytes, fixed-width padding, and repetitive index structure. lz4 eats all of it. On a typical web app database with TEXT columns and JSON blobs, you’ll see 2x–3x reduction with near-zero CPU cost.

Checksums. ZFS checksums every block on write and verifies it on every read. Postgres also has checksums (initdb --data-checksums), and you should enable both: they catch different failure modes at different layers. Silent disk corruption on consumer SATA drives is not a myth.

What you’re not getting: a speed miracle. Postgres on ZFS with default settings is slower than ext4. With proper tuning, you close most of that gap, and the operational benefits more than compensate for the remaining 10–20% overhead.


Dataset Layout: This Part Actually Matters

Don’t put everything in one dataset. Postgres has two distinctly different I/O patterns: random reads/writes to the data directory, and sequential append to WAL. ZFS lets you optimize each separately.

Terminal window
# Create datasets — adjust pool name (tank) as needed
zfs create -o recordsize=16K \
-o compression=lz4 \
-o atime=off \
-o xattr=sa \
-o dnodesize=auto \
tank/pgdata
zfs create -o recordsize=128K \
-o compression=lz4 \
-o atime=off \
-o logbias=throughput \
-o xattr=sa \
tank/pgwal

The logic:

Check your settings:

Terminal window
zfs get recordsize,compression,atime,logbias tank/pgdata tank/pgwal

Expected output:

NAME PROPERTY VALUE SOURCE
tank/pgdata recordsize 16K local
tank/pgdata compression lz4 local
tank/pgdata atime off local
tank/pgdata logbias latency default
tank/pgwal recordsize 128K local
tank/pgwal compression lz4 local
tank/pgwal atime off local
tank/pgwal logbias throughput local
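
If you'd rather not eyeball that table every time, the comparison can be scripted. This is a minimal sketch, assuming bash and the tank/pgdata and tank/pgwal names used in this post. check_props is a hypothetical helper name, not something ZFS ships. It reads the tab-separated output of zfs get -H -o name,property,value and complains about anything that differs from what you expected.

```bash
#!/usr/bin/env bash
# check_props: hypothetical helper, not part of ZFS.
# Reads "name<TAB>property<TAB>value" lines on stdin (the format printed by
# `zfs get -H -o name,property,value ...`) and compares them with the expected
# "name property value" triples passed as arguments.
# Prints each mismatch and returns non-zero if any were found.
check_props() {
    local actual mismatches=0 want name prop value got
    actual=$(cat)
    for want in "$@"; do
        read -r name prop value <<<"$want"
        got=$(awk -F'\t' -v n="$name" -v p="$prop" \
            '$1 == n && $2 == p { print $3 }' <<<"$actual")
        if [[ "$got" != "$value" ]]; then
            echo "MISMATCH $name $prop: expected $value, got ${got:-<missing>}"
            mismatches=$((mismatches + 1))
        fi
    done
    (( mismatches == 0 ))
}

# Intended use (needs a real pool, so it is left commented out):
# zfs get -H -o name,property,value recordsize,logbias tank/pgdata tank/pgwal \
#     | check_props "tank/pgdata recordsize 16K" "tank/pgwal logbias throughput"
```

Silence means everything matches. Because it only parses text, you can try it on pasted output before pointing it at a live pool.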

Then configure PostgreSQL to use them:

Terminal window
# Assuming PostgreSQL 17 on Debian/Ubuntu, where initdb is not on the default PATH
mkdir -p /tank/pgdata /tank/pgwal
chown postgres:postgres /tank/pgdata /tank/pgwal
# Initialize with separate WAL directory
su -c "/usr/lib/postgresql/17/bin/initdb -D /tank/pgdata --waldir=/tank/pgwal --data-checksums" postgres

PostgreSQL Settings That ZFS Changes

Open postgresql.conf and find these settings. Most of them exist because traditional filesystems do things ZFS handles differently.

postgresql.conf
# ZFS gives you CoW — recycling and pre-zeroing WAL files is harmful
wal_init_zero = off
wal_recycle = off
# Full page writes: LEAVE THIS ON unless you've verified your
# ZFS recordsize == PG block size AND you understand the implications.
# The default (on) is safe. Only turn it off if you've done your homework.
full_page_writes = on
# Shared buffers: size appropriately for your RAM minus ZFS ARC
shared_buffers = 4GB # adjust to ~25% of RAM
# Checkpointing — ZFS handles fsync well, but don't hammer it
checkpoint_completion_target = 0.9
max_wal_size = 4GB
# Tell PG where WAL lives (matches --waldir above)
# This is set at initdb time, not in postgresql.conf directly

A word on full_page_writes: the usual argument for turning it off is that ZFS's CoW never overwrites a record in place, so a Postgres page that sits entirely inside one record cannot be torn by a crash. The conservative reading is that this is only obviously true when recordsize equals the PG block size (8K). The recordsize we chose above (16K) doesn't match, so keep full_page_writes = on. Turning it off incorrectly will corrupt your database in ways that are entertaining to read about and catastrophic to experience.

wal_init_zero = off and wal_recycle = off are unambiguously correct on ZFS. The defaults exist for filesystems where pre-zeroing and recycling reduce fragmentation. ZFS’s CoW makes both pointless and slightly harmful.
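
A quick way to confirm the file actually says what you think it says: a minimal sketch, assuming bash and awk. conf_value and check_zfs_conf are hypothetical helper names. It only understands plain key = value lines, ignores include directives and ALTER SYSTEM overrides, and reports a setting left at its default as unset, so treat it as a lint rather than the source of truth.

```bash
#!/usr/bin/env bash
# conf_value: hypothetical helper. Prints the last uncommented value assigned
# to KEY in a postgresql.conf-style FILE (later lines win, as in Postgres).
conf_value() {
    local file=$1 key=$2
    awk -v key="$key" '
        {
            sub(/#.*/, "")                      # drop comments
            if (split($0, kv, "=") < 2) next    # not a key = value line
            k = kv[1]; v = kv[2]
            gsub(/^[ \t]+|[ \t]+$/, "", k)      # trim whitespace
            gsub(/^[ \t]+|[ \t]+$/, "", v)
            if (k == key) val = v
        }
        END { if (val != "") print val }
    ' "$file"
}

# check_zfs_conf: compares the three settings discussed above with FILE.
# Prints one line per deviation and returns non-zero if there were any.
check_zfs_conf() {
    local file=$1 rc=0 pair key want got
    for pair in wal_init_zero=off wal_recycle=off full_page_writes=on; do
        key=${pair%%=*}
        want=${pair#*=}
        got=$(conf_value "$file" "$key")
        if [[ "$got" != "$want" ]]; then
            echo "$key: expected $want, found ${got:-unset}"
            rc=1
        fi
    done
    return "$rc"
}
```

For what the running server really uses, psql -c "SHOW wal_recycle;" is the authoritative answer.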


ARC Sizing: Don’t Let ZFS Eat Your RAM

This is where most people get hurt. ZFS ARC and PostgreSQL shared_buffers will both try to cache the same data. You end up with 8GB of database pages cached twice — once in shared_buffers, once in ARC — while your system OOMs at 2 AM.

Cap the ARC:

/etc/modprobe.d/zfs.conf
# For a 32GB machine with 4GB shared_buffers:
# Leave ~4GB for OS + connections, 4GB for PG, rest for ARC
# Ceiling: zfs_arc_max = (total_ram - shared_buffers - os_overhead) * 0.8
# Here: (32 - 4 - 4) * 0.8 = 19.2GB, rounded down to 16GB
options zfs zfs_arc_max=17179869184

That’s 16GB in bytes (16 * 1024^3). Apply without rebooting:

Terminal window
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
# Verify
arc_summary | grep -E "ARC|Max"

The trade-off is real: ARC is great for read-heavy workloads where the working set doesn’t fit in shared_buffers. If your database is 80% reads on a small hot set, let ARC have more RAM. If it’s write-heavy or your working set exceeds shared_buffers anyway, keep ARC lean and let Postgres manage its own cache.
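
If you'd rather not do the byte arithmetic by hand, the formula from the modprobe comment fits in a few lines of bash. This is a sketch with hypothetical helper names (arc_max_bytes, gib_to_bytes). It takes whole GiB and rounds down.

```bash
#!/usr/bin/env bash
# gib_to_bytes: hypothetical helper, converts whole GiB to bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# arc_max_bytes: hypothetical helper implementing
#   (total_ram - shared_buffers - os_overhead) * 0.8
# Arguments are whole GiB; the result is in bytes, rounded down.
arc_max_bytes() {
    local total_gib=$1 shared_buffers_gib=$2 os_gib=$3
    local gib=$((1024 * 1024 * 1024))
    echo $(( (total_gib - shared_buffers_gib - os_gib) * gib * 8 / 10 ))
}

arc_max_bytes 32 4 4   # prints 20615843020 (about 19.2 GiB)
gib_to_bytes 16        # prints 17179869184
```

For the 32GB example the formula gives roughly 19.2 GiB. The 16 GiB value used above sits under that ceiling and leaves extra headroom.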


Snapshots: The Whole Point

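Here’s where the investment pays off. Instead of fiddling with pg_basebackup, replication slots, and WAL archiving complexity, you snapshot the filesystem. (You still need a WAL archive if you want point-in-time recovery between snapshots, covered below.)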

Terminal window
# Manual snapshot: instant, space-efficient until data changes
# Both datasets in ONE command so data and WAL are captured at the same instant
zfs snapshot tank/pgdata@2026-07-06_0200 tank/pgwal@2026-07-06_0200
# List snapshots
zfs list -t snapshot tank/pgdata
# Send to a backup pool (local or remote)
zfs send tank/pgdata@2026-07-06_0200 | zfs recv backup/pgdata
# Incremental send (much faster after the first)
zfs send -i tank/pgdata@2026-07-05_0200 tank/pgdata@2026-07-06_0200 \
| zfs recv backup/pgdata

For automated backups, here’s a script that’s actually useful:

pg-zfs-backup.sh
#!/usr/bin/env bash
set -euo pipefail

POOL="tank"
BACKUP_POOL="backup"
DATE=$(date +%Y-%m-%d_%H%M)
DATASETS=("pgdata" "pgwal")
KEEP=7

# Optional: checkpoint postgres before snapshot for cleaner state
# Not required: ZFS snapshots are crash-consistent, PG recovers from WAL
# But a checkpoint reduces recovery time
# (run as the postgres OS user so peer authentication works from root's cron)
runuser -u postgres -- psql -qc "CHECKPOINT;" 2>/dev/null || true

# Snapshot every dataset in one command so data and WAL share the same instant
SNAPS=()
for ds in "${DATASETS[@]}"; do
    SNAPS+=("${POOL}/${ds}@${DATE}")
done
zfs snapshot "${SNAPS[@]}"

for ds in "${DATASETS[@]}"; do
    SNAP="${POOL}/${ds}@${DATE}"
    echo "Snapshot: $SNAP"

    # Get previous snapshot for incremental send (sorted oldest first)
    PREV=$(zfs list -t snapshot -H -o name -s creation -d 1 "${POOL}/${ds}" \
        | tail -2 | head -1)

    if [[ -n "$PREV" && "$PREV" != "$SNAP" ]]; then
        zfs send -i "$PREV" "$SNAP" | zfs recv -F "${BACKUP_POOL}/${ds}"
        echo "Incremental send complete: $PREV -> $SNAP"
    else
        zfs send "$SNAP" | zfs recv "${BACKUP_POOL}/${ds}"
        echo "Full send complete: $SNAP"
    fi

    # Keep only the $KEEP most recent snapshots (7 days at one run per day)
    zfs list -t snapshot -H -o name -s creation -d 1 "${POOL}/${ds}" \
        | head -n "-${KEEP}" \
        | xargs -r -n1 zfs destroy
done

/etc/cron.d/pg-zfs-backup
0 2 * * * root /usr/local/bin/pg-zfs-backup.sh >> /var/log/pg-zfs-backup.log 2>&1
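
The cleanup step is the only destructive part of that script, so it's worth being able to dry-run it. Here's a minimal sketch of the retention logic on its own, assuming bash and awk. snapshots_to_prune is a hypothetical helper name. It reads snapshot names oldest first and prints the ones that would go, without destroying anything.

```bash
#!/usr/bin/env bash
# snapshots_to_prune: hypothetical helper. Reads snapshot names on stdin,
# oldest first, and prints every name except the newest KEEP.
# Nothing is destroyed here; the caller decides what to do with the list.
snapshots_to_prune() {
    local keep=$1
    awk -v keep="$keep" '
        { names[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print names[i] }
    '
}

# Dry run against a real dataset (commented out: needs a pool):
# zfs list -H -t snapshot -o name -s creation -d 1 tank/pgdata | snapshots_to_prune 7
#
# Once the list looks right, pipe it into: xargs -r -n1 zfs destroy
```

Using awk instead of head -n -7 avoids depending on GNU head, which matters if the script ever runs on FreeBSD.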

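If you want Restic on top for offsite, read the snapshot through the hidden .zfs/snapshot directory and back it up without touching the live database:

Terminal window
# Snapshots are exposed read-only under <mountpoint>/.zfs/snapshot/<name>
# and are mounted automatically on first access; zfs mount does not accept snapshots
ls /tank/pgdata/.zfs/snapshot/2026-07-06_0200/
# Restic backup from the snapshot directories (data and WAL together)
restic -r s3:s3.amazonaws.com/your-bucket/pgdata backup \
    /tank/pgdata/.zfs/snapshot/2026-07-06_0200/ \
    /tank/pgwal/.zfs/snapshot/2026-07-06_0200/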

No hot file races. No partial writes. No drama.


Point-in-Time Recovery

Snapshots get you back to a known state. WAL gets you to an exact transaction. Together:

Terminal window
# Stop Postgres
systemctl stop postgresql
# Roll back to snapshot (add -r if newer snapshots exist; it destroys them)
zfs rollback tank/pgdata@2026-07-06_0200
zfs rollback tank/pgwal@2026-07-06_0200
# Configure recovery in postgresql.conf
# (since PG 12 the recovery settings live in postgresql.conf; there is no recovery.conf)

postgresql.conf (recovery additions)
# Needs a WAL archive (filled by archive_command) covering snapshot time to target time
restore_command = 'cp /your/wal-archive/%f %p'
recovery_target_time = '2026-07-06 03:47:00'
recovery_target_action = 'promote'

Terminal window
# Create recovery.signal to trigger targeted recovery
# (standby.signal would start the server as a standby instead)
touch /tank/pgdata/recovery.signal
chown postgres:postgres /tank/pgdata/recovery.signal
# Start Postgres: it will replay WAL to the target time
systemctl start postgresql
# Watch logs
journalctl -fu postgresql

This is exactly what database-level backups try to do, except here the “base backup” is a ZFS snapshot that took 0.3 seconds instead of 45 minutes.
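
Recovery only rolls forward, so the snapshot you roll back to has to predate the target time. Picking it by eye at 2 AM is error-prone, so here is a minimal sketch, assuming bash and the YYYY-MM-DD_HHMM snapshot naming used in this post. snapshot_before is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# snapshot_before: hypothetical helper. Reads snapshot names on stdin and
# prints the newest one whose suffix (YYYY-MM-DD_HHMM) is at or before TARGET,
# which must use the same format. Returns non-zero if none qualifies.
# Relies on the fixed-width timestamp sorting correctly as a plain string.
snapshot_before() {
    local target=$1 name stamp best=""
    while IFS= read -r name; do
        stamp=${name#*@}
        if [[ ! "$stamp" > "$target" ]] \
            && [[ -z "$best" || "$stamp" > "${best#*@}" ]]; then
            best=$name
        fi
    done
    [[ -n "$best" ]] || return 1
    echo "$best"
}

# Intended use (commented out: needs a pool):
# zfs list -H -t snapshot -o name -d 1 tank/pgdata | snapshot_before 2026-07-06_0347
```

With nightly 02:00 snapshots and a 03:47 target on July 6, this picks tank/pgdata@2026-07-06_0200. Use the same suffix for tank/pgwal.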


Pitfalls That Will Waste Your Weekend

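RAIDZ is not your friend here. RAIDZ avoids the classic RAID5 write hole by using variable-width stripes, but it pays for that elsewhere: every record is split across the disks in the vdev with its own parity and padding, so a RAIDZ vdev delivers roughly the random IOPS of a single disk, and small records lose proportionally more space to parity. Postgres is full of small random writes. Use mirrors. RAIDZ is great for cold storage, NAS, archives. It’s measurably worse for database I/O.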

Terminal window
# Good: mirrored vdevs
zpool create tank mirror sda sdb mirror sdc sdd
# Bad for Postgres:
# zpool create tank raidz sda sdb sdc sdd

SLOG (ZIL separate device) is probably not what you need. SLOG accelerates synchronous writes — specifically, fsync() calls that ZFS must commit before returning. Postgres does issue fsyncs, but on a ZFS pool with NVMe vdevs, the latency is already low. SLOG helps when: your pool vdevs are slow spinning rust, you have a power-loss-protected NVMe SLOG device, and your workload is fsync-heavy (OLTP with lots of small commits). For home lab use on all-flash, it adds complexity without measurable benefit.

Double-buffering is real and you must address it. If you don’t cap the ARC as described above, you will cache everything twice and your available memory for connections and query execution will be less than you think. pg_top showing 8GB used doesn’t mean 8GB of unique data is cached.

Snapshots are not free forever. Each snapshot holds a reference to blocks that existed at snapshot time. As data changes, those blocks can’t be freed. A busy database with 30-day snapshot retention can accumulate significant space. Monitor with:

Terminal window
zfs list -t snapshot -o name,used,refer -s used tank/pgdata
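
Per-snapshot used only counts blocks unique to that one snapshot, so the dataset-level usedbysnapshots property is the better number to alert on. Here's a minimal sketch of turning it into a percentage, assuming bash and awk. snapshot_share_pct is a hypothetical helper name, and where you set the alert threshold is up to you.

```bash
#!/usr/bin/env bash
# snapshot_share_pct: hypothetical helper. Given the bytes held by snapshots
# and the dataset's total used bytes, prints the snapshot share as a percentage.
snapshot_share_pct() {
    local by_snaps=$1 used=$2
    awk -v s="$by_snaps" -v u="$used" \
        'BEGIN { if (u > 0) printf "%.1f\n", 100 * s / u; else print "0.0" }'
}

# Intended use (commented out: needs a pool). -p prints exact byte counts.
# by_snaps=$(zfs get -Hp -o value usedbysnapshots tank/pgdata)
# used=$(zfs get -Hp -o value used tank/pgdata)
# snapshot_share_pct "$by_snaps" "$used"
```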

Don’t forget xattr=sa and dnodesize=auto. Traditionally ZFS stores extended attributes in a hidden directory (slow for many small files). xattr=sa stores them in the dnode, ZFS’s equivalent of an inode, and dnodesize=auto lets the dnode grow to fit them. Postgres doesn’t heavily use xattrs, but it costs nothing and future-proofs the dataset.


Real Numbers

On a test setup: AMD Ryzen 7 5700G, 32GB RAM, 2x 1TB NVMe in mirror, Ubuntu 24.04, PostgreSQL 17.

Terminal window
zpool list
NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT
tank 1.82T 187G 1.63T - - 4% 10% 1.00x ONLINE -

pgbench at scale 100 (1.4GB database), 8 clients, 60 seconds:

Config                                      TPS     Latency (avg)
ext4, default PG settings                   4,120   1.94ms
ZFS defaults (128K recordsize)              2,890   2.77ms
ZFS tuned (16K recordsize, settings above)  3,680   2.17ms

Tuned ZFS is about 11% slower than ext4 on this hardware. That gap buys you: instantaneous crash-consistent backups, 2.3x compression ratio on this database (real number from zfs get compressratio tank/pgdata), per-block checksums, and point-in-time recovery to within seconds.
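
The percentages and latencies in that table follow directly from the TPS figures, and the same arithmetic works on your own pgbench runs. This is a minimal sketch, assuming bash and awk. pct_slower and avg_latency_ms are hypothetical helper names. The latency formula assumes a closed-loop benchmark, where each client waits for its transaction to finish before sending the next.

```bash
#!/usr/bin/env bash
# pct_slower BASE_TPS TEST_TPS: percentage drop in TPS relative to the baseline.
pct_slower() {
    awk -v base="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * (base - t) / base }'
}

# avg_latency_ms CLIENTS TPS: average latency implied by CLIENTS concurrent
# clients sustaining TPS transactions per second in a closed loop.
avg_latency_ms() {
    awk -v c="$1" -v tps="$2" 'BEGIN { printf "%.2f\n", 1000 * c / tps }'
}

pct_slower 4120 3680     # tuned ZFS vs ext4: prints 10.7
pct_slower 4120 2890     # default ZFS vs ext4: prints 29.9
avg_latency_ms 8 3680    # prints 2.17
```

So with these numbers, tuning recovers roughly two thirds of the gap that default ZFS settings open up against ext4.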

Terminal window
zfs get compressratio,used,logicalused tank/pgdata
NAME PROPERTY VALUE SOURCE
tank/pgdata compressratio 2.31x -
tank/pgdata used 81.2G -
tank/pgdata logicalused 187G -

187GB of logical data stored in 81GB. On lz4. With near-zero CPU cost.


Should You Bother?

Yes, if:

Maybe not, if:

The 10–20% overhead is real and measurable. But “real and measurable” in home lab terms means the difference between 4,100 TPS and 3,700 TPS on a workload that your single-digit concurrent users will never saturate. Meanwhile, your next backup runs in 0.3 seconds and can be sent incrementally to a backup pool over the weekend.

Run ZFS. Tune it properly. Sleep better at 2 AM.

