Backup Workflow Patterns That Work

By SumGuy 10 min read
Your Backup Strategy Is Probably a Single Cron Job. Fix It.

Here’s the thing: most backup setups I see in the wild are one cron job away from a disaster. You’ve got a script that runs nightly, shoves data into a tarball or Restic repo, and then… silence. No mirror. No off-site copy. No retention policy. Just hope.

Then one morning the disk fails, and you learn that your “backups” are actually just another copy of the thing that’s already broken.

This isn’t about picking between Restic and Kopia—that’s a tool question (and yes, that article exists). This is about patterns. The architecture. The thinking that survives when things actually go wrong.


The Three-Tier Mental Model

Good backup workflows run on three independent tiers, and each tier answers a different disaster:

Tier 1: Local Snapshot — Fast recovery from mistakes and small failures

Tier 2: Replicated Copy — Survival of site-level failures

Tier 3: Off-Site Archive — The insurance policy for actual disasters

This is the spirit of 3-2-1: three copies, two different media/locations, one off-site. But let’s stop pretending that rule is a magic incantation and actually understand why it matters.


Retention Windows: Deeper Than You Think

Retention is where backups go to die, because it’s boring and nobody tests it.

The naive approach: keep daily backups for 30 days, weeklies for a year, done. Fine. That works until:

GFS rotation (Grandfather-Father-Son) handles this better:

Daily backups: keep 7 days (covers mistakes, accidental deletes)
Weekly backups: keep 4-8 weeks (covers longer-term creep, corruption discovery)
Monthly backups: keep 12+ months (handles "wait, when did this break?" questions)

Each promotion is deliberate: a weekly is a daily that gets kept longer, and a monthly is a weekly that gets kept longer. You’re not generating new data; you’re just tagging an existing backup for longer storage.

Why monthlies matter:

For most home labs, this looks like:

Snapshots (hourly): 24 backups (1 day)
Dailies: 7 backups (1 week)
Weeklies: 8 backups (2 months)
Monthlies: 12 backups (1 year)

Do you need this if you’re just self-hosting a Jellyfin server? Probably not. But if you’re running a database with actual users, or storing years of photos, or config that took months to tune—yes, you do.
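The promotion rules above can be sketched as a small shell function. This is an illustration of the bucket logic, not how Restic or any other tool implements it: gfs_keep is a hypothetical helper, it assumes GNU date, and it treats "weekly" as ISO weeks and "monthly" as calendar months.

```shell
# gfs_keep DAILY WEEKLY MONTHLY
# Reads backup dates (YYYY-MM-DD, one per line) on stdin and prints the
# dates a GFS policy would keep. Assumes GNU date for "-d".
gfs_keep() {
  local keep_daily="$1" keep_weekly="$2" keep_monthly="$3" d
  sort -ru | while read -r d; do
    # Columns: date, ISO week bucket, month bucket
    printf '%s %s %s\n' "$d" "$(date -u -d "$d" +%G-W%V)" "${d%-*}"
  done | awk -v kd="$keep_daily" -v kw="$keep_weekly" -v km="$keep_monthly" '
    BEGIN { nd = 0; nw = 0; nm = 0 }
    {
      keep = 0
      # Walking newest to oldest, the first backup seen in each bucket
      # is the one promoted, until that tier has enough entries.
      if (!($1 in days)   && nd < kd) { days[$1] = 1;   nd++; keep = 1 }
      if (!($2 in weeks)  && nw < kw) { weeks[$2] = 1;  nw++; keep = 1 }
      if (!($3 in months) && nm < km) { months[$3] = 1; nm++; keep = 1 }
      if (keep) print $1
    }' | sort
}
```

Feeding it one date per day and calling gfs_keep 7 8 12 shows which backups survive under the home-lab policy above. Note that one backup can fill several tiers at once (the newest daily is usually also the newest weekly and monthly), which is why GFS costs less storage than the raw counts suggest.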


The Real Problem: Deletion Safety

Here’s what breaks retention policies in the real world: you can’t be paranoid enough about what you delete.

Automated cleanup is necessary (you’ll run out of disk), but automation that deletes the wrong thing is worse than no automation at all. So retention needs two steps:

Step 1: Mark for Deletion (Dry-Run Phase). Before anything gets deleted, the script lists every backup that is a candidate for removal and logs that list loudly.

# List old backups that would be removed (but don't touch them yet)
restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --dry-run

Check the output. Verify it’s marking the right backups. Maybe sleep on it. If you use Prometheus or log aggregation, alert on this dry-run so you see it.

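Step 2: Actual Deletion (Only After Confirmation). Once you’re confident, run the same command without --dry-run. With Restic, forget only drops the snapshot records; add --prune (or run restic prune afterwards) to actually free the disk space. But don’t do it immediately: separate the two runs by at least a day. Give yourself time to notice if the dry-run looked wrong.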

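For immutable storage (cloud object-lock, Borg’s append-only mode), deletion is already paranoid by design: object-lock refuses deletes until the retention expires, and an append-only Borg repository does not let the backup client permanently remove data. For plain Restic or filesystem backups, you have to add that paranoia yourself.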
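The two-step routine can be enforced by the script itself. The sketch below is one way to do it, under assumptions: plan_is_ripe and retention_step are hypothetical helpers, the plan file path is made up, GNU stat is available, and the Restic repository is configured through environment variables.

```shell
# plan_is_ripe FILE [MIN_AGE_SECONDS]
# Succeeds only if the saved dry-run output exists, is non-empty, and
# is at least MIN_AGE_SECONDS old (default: one day). Assumes GNU stat.
plan_is_ripe() {
  local plan="$1" min_age="${2:-86400}" now mtime
  [ -s "$plan" ] || return 1
  now=$(date +%s)
  mtime=$(stat -c %Y "$plan")
  [ $(( now - mtime )) -ge "$min_age" ]
}

# retention_step [PLAN_FILE]
# Call from a daily cron job. First call: save the dry-run. A later
# call, once the plan has aged: delete for real and reset.
retention_step() {
  local plan="${1:-/var/tmp/restic-forget.plan}"   # hypothetical path
  local policy="--keep-daily 7 --keep-weekly 4 --keep-monthly 12"
  if plan_is_ripe "$plan"; then
    # $policy is left unquoted on purpose so it splits into flags
    restic forget $policy --prune && rm -f "$plan"
  elif [ ! -e "$plan" ]; then
    restic forget $policy --dry-run > "$plan.tmp" && mv "$plan.tmp" "$plan"
    cat "$plan"   # goes to the cron mail or log for a human to read
  fi
}
```

This only enforces the waiting period. The real forget run re-evaluates the policy, so it may remove slightly more than the saved plan showed, and a human who dislikes the plan still has to disable the job before the delay runs out.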


Push vs. Pull: The Throughput Question

Two architectures for getting backups to Tier 2 (the replica):

Push — Primary server sends backups to the replica

Pull — Replica fetches backups from the primary

For most setups, push is fine. But if you’re paranoid about ransomware or have large data, pull makes the replica less of a liability.


Immutability: Ransomware Insurance

Standard backups can be deleted or encrypted by ransomware. Immutable backups can’t—they just sit there, untouchable.

Object-lock (S3/R2 style):

# Immutable for 30 days; in COMPLIANCE mode not even the admin can delete
# (GOVERNANCE mode can be bypassed by users with the right permission)
retention:
  mode: COMPLIANCE
  days: 30

Borg’s append-only mode:

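# Borg 1.x: typically set as the forced command for the client's SSH key on the backup host
borg serve --append-only --restrict-to-repository /path/to/repo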

Restic + locked filesystems: Mount the backup filesystem read-only except during the backup window, then flip it back to read-only after.

The cost: you can’t free up space until the retention expires. Plan for it.
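The read-only flip for Restic can be wrapped so the filesystem is always locked again, even when the backup fails. A minimal sketch, assuming a Linux mount that supports remounting and a script running as root; with_rw_window is a hypothetical helper name.

```shell
# with_rw_window MOUNTPOINT COMMAND [ARGS...]
# Remounts MOUNTPOINT read-write, runs COMMAND, then remounts it
# read-only again whether or not COMMAND succeeded. Needs root.
with_rw_window() {
  local mnt="$1" status=0
  shift
  mount -o remount,rw "$mnt" || return 1
  # "|| status=$?" keeps a failing command from skipping the re-lock,
  # even when the calling script uses "set -e"
  "$@" || status=$?
  mount -o remount,ro "$mnt" || return 1
  return "$status"
}
```

A call would look like with_rw_window /mnt/backup restic -r /mnt/backup/repo backup /data (both paths are placeholders). This is weaker than object-lock: anything that gains root on the same host can remount the filesystem too, so treat it as a speed bump rather than real immutability.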


Cataloging and Search: The Boring Essential

You can restore individual files faster if you know what’s in each backup without actually restoring.

Restic has built-in search; Borg has borg list. Use them.

restic find filename.pdf
restic ls snapshot-id /path/to/dir

Set up a periodic task that indexes your backups (Restic’s metadata, Borg’s list output) into a searchable format—a JSON file, a simple database, even a text file. When disaster hits and you’re panicking at 2 AM, you want to know exactly which snapshot has the version you need without guessing.
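One way to build that index with plain text files is sketched below. build_catalog and search_catalog are hypothetical helpers; the sketch assumes jq is installed and that the Restic repository and password are supplied through environment variables.

```shell
# build_catalog DIR
# Writes one file listing per snapshot into DIR, skipping snapshots
# that are already indexed. Needs restic and jq.
build_catalog() {
  local dir="$1" id
  mkdir -p "$dir"
  restic snapshots --json | jq -r '.[].short_id' | while read -r id; do
    [ -s "$dir/$id.txt" ] && continue
    # </dev/null stops restic from consuming the list of IDs on stdin
    restic ls "$id" < /dev/null > "$dir/$id.tmp" && mv "$dir/$id.tmp" "$dir/$id.txt"
  done
}

# search_catalog DIR TEXT
# Prints the ID of every indexed snapshot whose listing contains TEXT.
search_catalog() {
  local dir="$1" text="$2" f
  for f in "$dir"/*.txt; do
    [ -e "$f" ] || continue
    if grep -qF -- "$text" "$f"; then
      basename "$f" .txt
    fi
  done
}
```

Because the catalog is just text, search_catalog keeps working when the repository is offline or slow to reach, which is exactly when you are most likely to need it. Store the catalog somewhere other than the backup disk.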


Restore Drills: The Test That Matters

Backups you’ve never restored are not backups. They’re hope with good file organization.

Run a drill every quarter:

  1. Pick a random snapshot (not the newest)
  2. Restore it to a different machine or VM
  3. Verify the data is intact
  4. Check timestamps, permissions, symlinks
  5. Actually run the restored services if you can (database query, app startup)

Document the process. Time it. If it takes 4 hours to restore your Postgres database, you now know your RTO is 4 hours. Plan accordingly.
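The first three steps of the drill, plus the timing, can be scripted. This is a sketch under assumptions: pick_drill_snapshot and restore_drill are hypothetical helpers, shuf (GNU coreutils) and jq are installed, and the repository is configured through environment variables.

```shell
# pick_drill_snapshot
# Reads snapshot IDs on stdin, oldest first, and prints one at random,
# never the newest. Assumes shuf from GNU coreutils.
pick_drill_snapshot() {
  sed '$d' | shuf -n 1
}

# restore_drill TARGET_DIR
# Restores a random older snapshot into TARGET_DIR and reports the
# elapsed time. Needs restic and jq.
restore_drill() {
  local target="$1" id start
  id=$(restic snapshots --json | jq -r 'sort_by(.time) | .[].short_id' | pick_drill_snapshot)
  [ -n "$id" ] || { echo "need at least two snapshots" >&2; return 1; }
  start=$(date +%s)
  # --verify re-reads the restored files and compares them with the repository
  restic restore "$id" --target "$target" --verify || return 1
  echo "snapshot $id restored and verified in $(( $(date +%s) - start ))s"
}
```

The reported time is your measured restore time for that tier. Checking permissions and symlinks and starting the restored services (steps 4 and 5) still need a human or a service-specific check.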

The patterns that matter:


The Underrated Question: Recovery Time

“How long until we’re back online?” isn’t a backup question; it’s a workflow question.

If your off-site backup is in cloud cold storage and takes 6 hours to retrieve, your RTO (Recovery Time Objective) is 6+ hours. That’s not bad—it’s just a fact you need to know.

If you expect to recover in 30 minutes but your Tier 3 archive is on tape in a data center across the country, you’ve got a mismatch.

Map it out:

Tier 1 (Local): Restore time: 5 min (same server, filesystem snapshot)
Tier 2 (Replica): Restore time: 15 min (SSH to replica, query/restore)
Tier 3 (Off-site): Restore time: 2 hours (download from cloud, decompress, restore)

Know these numbers. Build your SLOs around them. If you can’t tolerate 2 hours of downtime, you need a faster Tier 3 recovery—maybe a secondary replica in another region, or more frequent syncs.
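Once the restore times are measured, the comparison against a target is simple arithmetic. A small sketch, where rto_gaps is a hypothetical helper and all times are whole minutes:

```shell
# rto_gaps TARGET_MINUTES TIER=MINUTES...
# Prints one line for every tier whose measured restore time is
# longer than the recovery target.
rto_gaps() {
  local target="$1" entry tier minutes
  shift
  for entry in "$@"; do
    tier="${entry%%=*}"
    minutes="${entry#*=}"
    if [ "$minutes" -gt "$target" ]; then
      echo "$tier: ${minutes} min restore misses the ${target} min target by $(( minutes - target )) min"
    fi
  done
}
```

With the numbers from the map above and a 30-minute target, rto_gaps 30 local=5 replica=15 offsite=120 flags only the off-site tier, 90 minutes over. That is the mismatch to resolve, either by relaxing the target or by making that tier faster.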


Anti-Patterns That Bite

Single-tool reliance: You backed up with Tool X, but Tool X becomes unmaintained or incompatible with your new system. Can you restore it without that tool? (Hint: you should be able to, or at least understand the format well enough to extract manually.)

Encrypted backup, missing key backup: You encrypted the backup with a passphrase. Great. Did you back up the passphrase itself? Ideally on a different system, in a password manager that’s also backed up elsewhere, and written down and locked in a safe. Yes, this sounds paranoid. Yes, you need it.

Untested 5-year-old archives: You have monthlies going back five years. But have you tried to restore one? Format changes, software updates, and drift happen. That snapshot might be corrupt, or in a format your new tools can’t read.

No retention policy: Just “keep everything.” Until your storage runs out and you panic-delete the oldest backups. Have a policy. Write it down. Automate it.


A Sample Workflow

Here’s a pattern that works for a single-server setup with a replica elsewhere:

Local snapshots (on the primary):

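#!/bin/bash
# Run hourly, keep 24 hours. Assumes RESTIC_REPOSITORY and
# RESTIC_PASSWORD_FILE are set in the environment.
set -euo pipefail
restic backup /data --tag local
# Log what is about to be dropped; "|| true" keeps set -e from
# aborting when grep finds nothing to remove.
restic forget --tag local --keep-hourly 24 --dry-run | grep "remove" || true
# forget only drops snapshot records; --prune frees the space
restic forget --tag local --keep-hourly 24 --prune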

Replicate to Tier 2 (pull model):

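# Run every 6 hours from the replica.
# restic has no "sync" subcommand; "copy" (restic 0.14+ syntax) pulls
# snapshots from the source repository into the replica's repository.
# sftp:backup@primary:/srv/restic-repo is a placeholder for the primary's repo.
restic -r /mnt/remote-backup copy --from-repo sftp:backup@primary:/srv/restic-repo
restic -r /mnt/remote-backup check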

Monthly to off-site (push model):

# Run on the 1st of each month
# restic's S3 repository syntax is s3:<endpoint>/<bucket>
restic -r s3:s3.amazonaws.com/my-cloud-backup backup /data \
  --tag=monthly \
  --tag="$(date +%Y-%m)"

Retention (runs at the replica, targets all tiers):

# Cron: daily, after backups complete
daily:
  - mark for deletion
  - dry-run to logs
  - sleep 24 hours
weekly:
  - confirm dry-run looked good
  - actually delete
  - verify disk space decreased

The Patterns Worth Borrowing

  1. Three tiers serve three questions: Local for speed, replicated for resilience, off-site for catastrophe.
  2. GFS rotation is worth the small amount of complexity—it catches slow corruption and audit needs.
  3. Deletion must be paranoid: dry-run first, sleep on it, confirm, then delete.
  4. Immutability is cheap insurance against ransomware.
  5. Restore drills are the test that matters—not the backup, the restore.
  6. Recovery time is a business decision, not a technical one. Know your RTOs.
  7. Document as you go, because 2 AM you—the one actually panicking—is not the same person as right-now you.

A backup workflow that survives is one you’ve tested, one you’ve measured, and one you understand deeply enough to troubleshoot at 3 AM when everything is on fire.

That takes more than a cron job. It takes patterns.

