Your Backup Strategy Is Probably a Single Cron Job. Fix It.
Here’s the thing: most backup setups I see in the wild are one cron job away from a disaster. You’ve got a script that runs nightly, shoves data into a tarball or Restic repo, and then… silence. No mirror. No off-site copy. No retention policy. Just hope.
Then one morning the disk fails, and you learn that your “backups” are actually just another copy of the thing that’s already broken.
This isn’t about picking between Restic and Kopia—that’s a tool question (and yes, that article exists). This is about patterns. The architecture. The thinking that survives when things actually go wrong.
The Three-Tier Mental Model
Good backup workflows run on three independent tiers, and each tier answers a different disaster:
Tier 1: Local Snapshot — Fast recovery from mistakes and small failures
- Same location as your data (same server, same network, maybe same disk if you’re desperate)
- Keeps you from panicking when you delete a file by accident
- Can restore in minutes
- Dies if: the whole server melts, ransomware encrypts both the source and the snapshot
Tier 2: Replicated Copy — Survival of site-level failures
- Different physical location from your primary (different server, different VM host, different data center)
- Still nearby enough to access quickly
- Can shift services to the replica while you fix the primary
- Dies if: both locations go down at once, ransomware attacks both (hence: immutability)
Tier 3: Off-Site Archive — The insurance policy for actual disasters
- Geographically distant (cloud, remote office, that one friend’s server in another state)
- Can be slower—you’re not restoring from it unless something really broke
- Can be cold/archived tier (cheaper, slower to access)
- Dies if: global catastrophe takes everyone out (in which case you have bigger problems)
This is the spirit of 3-2-1: three copies, two different media/locations, one off-site. But let’s stop pretending that rule is a magic incantation and actually understand why it matters.
Retention Windows: Deeper Than You Think
Retention is where backups go to die, because it’s boring and nobody tests it.
The naive approach: keep daily backups for 30 days, weeklies for a year, done. Fine. That works until:
- You discover data corruption on day 45 (all 30 days of dailies are worthless because the corruption crept in slowly)
- You need to audit what changed three months ago and the daily from that week is already gone
- Ransomware locked your files on the 15th but nobody noticed until the 28th—and you’ve already rotated away the unencrypted copies
GFS rotation (Grandfather-Father-Son) handles this better:
    Daily backups:   keep 7 days     (covers mistakes, accidental deletes)
    Weekly backups:  keep 4-8 weeks  (covers longer-term creep, corruption discovery)
    Monthly backups: keep 12+ months (handles "wait, when did this break?" questions)
Each promotion is deliberate: a weekly is a snapshot of a daily, a monthly is a snapshot of a weekly. You’re not generating new data; you’re just tagging one for longer storage.
Why monthlies matter:
- Regulatory/audit trails (some industries need 7 years of data snapshots)
- Slow corruption or logic bugs that take weeks to surface
- License key extraction (“my config file worked fine last March”)
- The “what does my database schema look like if I roll back 6 months?” question
For most home labs, this looks like:
    Snapshots (hourly): 24 backups (1 day)
    Dailies:            7 backups  (1 week)
    Weeklies:           8 backups  (2 months)
    Monthlies:          12 backups (1 year)
Do you need this if you’re just self-hosting a Jellyfin server? Probably not. But if you’re running a database with actual users, or storing years of photos, or config that took months to tune, then yes, you do.
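To make the promotion rule concrete, here is a minimal sketch of GFS selection in plain shell. It is not how Restic or Borg implement retention internally. `gfs_keep` is a hypothetical helper, and it assumes GNU `date` and one snapshot per day, named by its date (YYYY-MM-DD).

```shell
#!/bin/sh
# gfs_keep DAILY WEEKLY MONTHLY
# Reads snapshot dates (YYYY-MM-DD, one per line) on stdin and prints the
# ones a GFS policy would keep, followed by the reason(s).
# Hypothetical helper for illustration; requires GNU date.
gfs_keep() {
    keep_daily=$1 keep_weekly=$2 keep_monthly=$3
    days=0 weeks=0 months=0 last_week="" last_month=""
    # Newest first, so the first date seen in a week or month is its newest.
    sort -ru | while read -r day; do
        keep=""

        # Son: the newest $keep_daily days.
        if [ "$days" -lt "$keep_daily" ]; then keep=" daily"; fi
        days=$((days + 1))

        # Father: the newest snapshot in each of the latest ISO weeks.
        week=$(date -d "$day" +%G-W%V)
        if [ "$week" != "$last_week" ]; then
            if [ "$weeks" -lt "$keep_weekly" ]; then keep="$keep weekly"; fi
            weeks=$((weeks + 1))
            last_week=$week
        fi

        # Grandfather: the newest snapshot in each of the latest months.
        month=${day%-*}
        if [ "$month" != "$last_month" ]; then
            if [ "$months" -lt "$keep_monthly" ]; then keep="$keep monthly"; fi
            months=$((months + 1))
            last_month=$month
        fi

        if [ -n "$keep" ]; then echo "$day$keep"; fi
    done
}
```

Feed it a list of dates, for example `gfs_keep 7 8 12 < snapshot-dates.txt` (a hypothetical file), and every date it does not print is a deletion candidate. Note that one snapshot can carry several reasons at once, which is the "tagging, not copying" idea in practice.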
The Real Problem: Deletion Safety
Here’s what breaks retention policies in the real world: you can’t be paranoid enough about what you delete.
Automated cleanup is necessary (you’ll run out of disk), but automation that deletes the wrong thing is worse than no automation at all. So retention needs two steps:
Step 1: Mark for Deletion (Dry-Run Phase)
Before anything gets deleted, the script lists every backup that is a candidate for removal and logs it loudly.
    # Mark old backups for removal (but don't touch them yet)
    restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12 --dry-run
Check the output. Verify it’s marking the right backups. Maybe sleep on it. If you use Prometheus or log aggregation, alert on this dry-run so you see it.
Step 2: Actual Deletion (Only After Confirmation)
Once you’re confident, run the same command without --dry-run. But don’t do it immediately—separate these by at least a day. Give yourself time to notice if the dry-run looked wrong.
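For immutable storage (cloud object-lock, Borg in append-only mode), deletion is already paranoid by design: object lock refuses deletes until the retention period expires, and an append-only Borg repository records a client’s delete requests without actually removing the data. For Restic or filesystem backups, you have to add that paranoia yourself.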
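One way to add that paranoia yourself is to make the delete step refuse to run until a dry-run plan exists and has aged. The sketch below is one possible shape, not a Restic feature: `FORGET_CMD`, `PLAN_FILE`, `MIN_AGE_MIN`, and both function names are made up for illustration, and it assumes a `find` that supports `-mmin`.

```shell
#!/bin/sh
# Two-phase retention: plan today, delete only after the plan has aged.
# FORGET_CMD, PLAN_FILE and MIN_AGE_MIN are illustrative names, not restic settings.
FORGET_CMD=${FORGET_CMD:-restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 12}
PLAN_FILE=${PLAN_FILE:-/var/tmp/restic-forget.plan}
MIN_AGE_MIN=${MIN_AGE_MIN:-1440}   # 24 hours

# Phase 1: record what would be removed; nothing is deleted.
plan_deletion() {
    $FORGET_CMD --dry-run > "$PLAN_FILE"
}

# Phase 2: refuse to delete unless a plan exists and has had time to be reviewed.
apply_deletion() {
    if [ ! -s "$PLAN_FILE" ]; then
        echo "no dry-run plan at $PLAN_FILE; run plan_deletion first" >&2
        return 1
    fi
    # find prints the file only if it was modified less than MIN_AGE_MIN minutes ago.
    if [ -n "$(find "$PLAN_FILE" -mmin "-$MIN_AGE_MIN")" ]; then
        echo "plan is younger than $MIN_AGE_MIN minutes; not deleting yet" >&2
        return 1
    fi
    # Same command, without --dry-run. The plan is consumed so it cannot be reused.
    $FORGET_CMD && rm -f "$PLAN_FILE"
}
```

Run `plan_deletion` from one scheduled job and `apply_deletion` from another. Keep in mind that `restic forget` only drops snapshots; disk space comes back after `restic prune` (or `forget --prune`).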
Push vs. Pull: The Throughput Question
Two architectures for getting backups to Tier 2 (the replica):
Push — Primary server sends backups to the replica
- Simplest to reason about (source pushes, replica receives)
- Uses primary server resources (CPU, bandwidth)
- If the primary is compromised, the attacker can potentially corrupt the replica too
- Better for: small datasets, low-frequency backups
Pull — Replica fetches backups from the primary
- Isolates the replica (doesn’t trust the primary)
- Can be harder to orchestrate (replica needs to know when to pull, or runs on a schedule)
- Better for: large data, ransomware risk, network isolation
- Slower failure detection (“pull” schedules don’t react instantly to primary outages)
For most setups, push is fine. But if you’re paranoid about ransomware or have large data, pull makes the replica less of a liability.
Immutability: Ransomware Insurance
Standard backups can be deleted or encrypted by ransomware. Immutable backups can’t—they just sit there, untouchable.
Object-lock (S3/R2 style):
    # Immutable for 30 days: not even the admin can delete.
    # (That guarantee is COMPLIANCE mode; GOVERNANCE can be bypassed with a special permission.)
    retention:
      mode: COMPLIANCE
      days: 30
Borg’s append-only mode:
    borg serve --append-only --restrict-to-path /path/to/repo
Restic + locked filesystems: Keep the backup filesystem mounted read-only, remount it read-write only for the backup window, then flip it back to read-only after.
The cost: you can’t free up space until the retention expires. Plan for it.
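For the read-only-filesystem approach, the part that tends to go wrong is forgetting to close the window when the backup fails. Here is a minimal sketch of a wrapper that always remounts read-only afterwards. `with_write_window` and the `MOUNT` override are hypothetical names, and the real thing needs root.

```shell
#!/bin/sh
# with_write_window MOUNTPOINT COMMAND [ARGS...]
# Remounts MOUNTPOINT read-write, runs COMMAND, then always remounts read-only.
# MOUNT is overridable so the wrapper can be exercised without root.
MOUNT=${MOUNT:-mount}

with_write_window() {
    mnt=$1
    shift
    "$MOUNT" -o remount,rw "$mnt" || return 1
    "$@"
    status=$?
    # Close the window even if the backup failed.
    "$MOUNT" -o remount,ro "$mnt" || status=1
    return "$status"
}
```

Usage would look like `with_write_window /mnt/backup restic -r /mnt/backup/repo backup /data` (paths are placeholders). This is weaker than object-lock: an attacker with root on the same machine can remount the filesystem too, so treat it as a speed bump rather than a wall.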
Cataloging and Search: The Boring Essential
You can restore individual files faster if you know what’s in each backup without actually restoring.
Restic has built-in search; Borg has borg list. Use them.
    restic find filename.pdf
    restic ls snapshot-id /path/to/dir
Set up a periodic task that indexes your backups (Restic’s metadata, Borg’s list output) into a searchable format: a JSON file, a simple database, even a text file. When disaster hits and you’re panicking at 2 AM, you want to know exactly which snapshot has the version you need without guessing.
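The "even a text file" version can be very small. The sketch below keeps one `snapshot-id<TAB>path` line per file. `LS_CMD`, `CATALOG`, and the two function names are invented for this example, and the default assumes `restic ls` prints a header line followed by one absolute path per line.

```shell
#!/bin/sh
# Flat-file catalog: one "snapshot-id<TAB>path" line per file.
# LS_CMD and CATALOG are illustrative names; the default assumes `restic ls`
# prints one absolute path per line after a header line.
LS_CMD=${LS_CMD:-restic ls}
CATALOG=${CATALOG:-$HOME/backup-catalog.tsv}

# catalog_add ID...: append the file listing of each snapshot to the catalog.
catalog_add() {
    for id in "$@"; do
        # Keep only lines that look like absolute paths; this drops the header.
        $LS_CMD "$id" | awk -v id="$id" '/^\// { print id "\t" $0 }'
    done >> "$CATALOG"
}

# catalog_find TEXT: print the IDs of snapshots that contain a path matching TEXT.
catalog_find() {
    awk -F '\t' -v text="$1" 'index($2, text) { print $1 }' "$CATALOG" | sort -u
}
```

After each backup, call `catalog_add` with the new snapshot ID. Then `catalog_find filename.pdf` answers the 2 AM question without touching the repository. Adding the same ID twice duplicates its lines, which `catalog_find` tolerates but which wastes space.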
Restore Drills: The Test That Matters
Backups you’ve never restored are not backups. They’re hope with good file organization.
Run a drill every quarter:
- Pick a random snapshot (not the newest)
- Restore it to a different machine or VM
- Verify the data is intact
- Check timestamps, permissions, symlinks
- Actually run the restored services if you can (database query, app startup)
Document the process. Time it. If it takes 4 hours to restore your Postgres database, you now know your RTO is 4 hours. Plan accordingly.
The patterns that matter:
- Restore from each tier (local, replicated, off-site) at different intervals. What if your local snapshot is corrupted? The replicated copy should save you.
- Rotate the backup you restore—don’t always restore the newest. Old snapshots might have lingering corruption that newer ones inherited.
- Catalog the results—write down what you found, what broke, what surprised you. That’s your actual recovery playbook.
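A drill is easier to repeat if the "verify" and "time it" steps are scripted. This is a rough sketch, assuming GNU `sha256sum` is available; `manifest`, `verify_restore`, and `timed` are hypothetical helper names. It compares file contents and names only, so permissions, timestamps, and symlinks still need a separate check.

```shell
#!/bin/sh
# manifest DIR: print a sorted "checksum  path" list for every file under DIR.
manifest() {
    (cd "$1" && find . -type f -exec sha256sum {} + | sort -k 2)
}

# verify_restore EXPECTED_DIR RESTORED_DIR: succeed only if both trees hold
# the same files with the same contents.
verify_restore() {
    expected=$(manifest "$1") || return 1
    restored=$(manifest "$2") || return 1
    if [ "$expected" = "$restored" ]; then
        echo "OK: restored files match"
    else
        echo "MISMATCH between $1 and $2" >&2
        return 1
    fi
}

# timed COMMAND [ARGS...]: run a command and report elapsed seconds on stderr.
timed() {
    start=$(date +%s)
    "$@"
    status=$?
    echo "elapsed: $(( $(date +%s) - start ))s" >&2
    return "$status"
}
```

Wrap your actual restore command in `timed` to get the number for your playbook, then run `verify_restore` against a known-good copy. Live data drifts after the snapshot was taken, so a manifest saved at backup time is a better reference than the current source directory.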
The Underrated Question: Recovery Time
“How long until we’re back online?” isn’t a backup question; it’s a workflow question.
If your off-site backup is in cloud cold storage and takes 6 hours to retrieve, your RTO (Recovery Time Objective) is 6+ hours. That’s not bad—it’s just a fact you need to know.
If you expect to recover in 30 minutes but your Tier 3 archive is on tape in a data center across the country, you’ve got a mismatch.
Map it out:
    Tier 1 (Local):    Restore time: 5 min   (same server, filesystem snapshot)
    Tier 2 (Replica):  Restore time: 15 min  (SSH to replica, query/restore)
    Tier 3 (Off-site): Restore time: 2 hours (download from cloud, decompress, restore)
Know these numbers. Build your SLOs around them. If you can’t tolerate 2 hours of downtime, you need a faster Tier 3 recovery: maybe a secondary replica in another region, or more frequent syncs.
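Once you have measured restore times, checking them against a target is a few lines of shell. A minimal sketch follows; `check_rto` is a hypothetical helper, and the tier names and the 30-minute target are only examples.

```shell
#!/bin/sh
# check_rto TARGET_MINUTES
# Reads "tier measured-minutes" lines on stdin, prints a verdict per tier, and
# exits non-zero if any tier restores slower than the target.
check_rto() {
    awk -v target="$1" '
        {
            verdict = ($2 <= target) ? "ok" : "TOO SLOW"
            printf "%s\t%d min\t%s\n", $1, $2, verdict
            if ($2 > target) slow = 1
        }
        END { exit slow + 0 }
    '
}

# Example figures (5 min, 15 min, 2 hours); replace them with times measured in drills.
printf '%s\n' 'tier1-local 5' 'tier2-replica 15' 'tier3-offsite 120' \
    | check_rto 30 || echo "at least one tier misses the 30-minute target"
```

The non-zero exit status makes it easy to hang an alert off the check, so a mismatch between expectation and reality shows up before the outage does.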
Anti-Patterns That Bite
Single-tool reliance: You backed up with Tool X, but Tool X becomes unmaintained or incompatible with your new system. Can you restore it without that tool? (Hint: you should be able to, or at least understand the format well enough to extract manually.)
Encrypted backup, missing key backup: You encrypted the backup with a passphrase. Great. Did you back up the passphrase itself? Ideally on a different system, in a password manager that’s also backed up elsewhere, and written down and locked in a safe. Yes, this sounds paranoid. Yes, you need it.
Untested 5-year-old archives: You have monthlies going back five years. But have you tried to restore one? Format changes, software updates, and drift happen. That snapshot might be corrupt, or in a format your new tools can’t read.
No retention policy: Just “keep everything.” Until your storage runs out and you panic-delete the oldest backups. Have a policy. Write it down. Automate it.
A Sample Workflow
Here’s a pattern that works for a single-server setup with a replica elsewhere:
Local snapshots (on the primary):
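    #!/bin/bash
    # Run hourly, keep 24 hours.
    # Assumes RESTIC_REPOSITORY and RESTIC_PASSWORD_FILE are set in the environment.
    restic backup /data --tag=local
    # Log what would be removed, then remove it.
    restic forget --keep-hourly 24 --dry-run | grep "remove"
    restic forget --keep-hourly 24
Replicate to Tier 2 (pull model):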
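    # Run every 6 hours from the replica.
    # Pulls new snapshots from the primary's repository into the replica's.
    # The sftp address and password file below are placeholders.
    restic -r /mnt/remote-backup copy \
      --from-repo sftp:backup@primary:/srv/restic --from-password-file /etc/restic/primary.pass
    restic -r /mnt/remote-backup check
Monthly to off-site (push model):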
    # Run on the 1st of each month
    restic -r s3:s3.amazonaws.com/my-cloud-backup backup /data \
      --tag=monthly \
      --tag="$(date +%Y-%m)"
Retention (runs at the replica, targets all tiers):
    # Cron: daily, after backups complete
    daily:
      - mark for deletion
      - dry-run to logs
      - sleep 24 hours

    weekly:
      - confirm dry-run looked good
      - actually delete
      - verify disk space decreased
The Patterns Worth Borrowing
- Three tiers serve three questions: Local for speed, replicated for resilience, off-site for catastrophe.
- GFS rotation is worth the small amount of complexity—it catches slow corruption and audit needs.
- Deletion must be paranoid: dry-run first, sleep on it, confirm, then delete.
- Immutability is cheap insurance against ransomware.
- Restore drills are the test that matters—not the backup, the restore.
- Recovery time is a business decision, not a technical one. Know your RTOs.
- Document as you go, because 2 AM you—the one actually panicking—is not the same person as right-now you.
A backup workflow that survives is one you’ve tested, one you’ve measured, and one you understand deeply enough to troubleshoot at 3 AM when everything is on fire.
That takes more than a cron job. It takes patterns.