Your Backup Cron Failed Silently. You’ll Find Out in Six Months.
Here’s the thing: exit codes don’t email you. Your monitoring stack can’t tell that a job didn’t run; it only sees what you explicitly report to it. That gap is where a lot of backup failures hide.
You’ve got restic or borgmatic set up, right? Running every night at 2 AM via cron. It fails three nights in a row due to network flakiness, but there’s no dashboard screaming about it. You keep sleeping. By the time you notice (usually when you need to restore), six months of “backups” turn out to be nothing but error logs.
This is the dead-man-switch problem. Not “something went wrong” — but “something didn’t happen at all.”
Healthchecks.io solves this. The self-hosted version runs on your own hardware, integrates with everything from restic to systemd timers, and sends you an alert the second a periodic job fails to check in.
What Is a Dead-Man-Switch?
Picture this: you’re piloting a plane. There’s a button in your hand. While you’re conscious and holding it, all is well. The moment you fall asleep (or worse), your grip loosens. The button releases. Alarm sounds.
That’s a dead-man-switch. In monitoring terms: I expect you to ping me every 24 hours. If you don’t, something’s wrong.
It’s the inverse of traditional alerting:
- Alert on presence (traditional): “Fire a dashboard metric when a job succeeds” — but if the job doesn’t run, there’s no metric. Silence.
- Alert on absence (dead-man-switch): “If I don’t hear a ping in the next 26 hours, wake me up” — failure, silence, or absence all trigger.
Cron jobs are the poster child for this problem because they expose no metrics and no Prometheus scrape endpoint, and whatever they print usually lands in a log file or a local mailbox nobody reads. They just… run (or don’t). Your monitoring won’t know the difference.
How Healthchecks.io Works
You create a “check” — essentially a URL with a grace period and a schedule. Your cron job (or systemd timer, or Kubernetes CronJob) POSTs to that URL after it finishes. Healthchecks watches the URL.
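- Expected frequency (the period): “This job should ping me every 24 hours”
- Grace period: “I’ll tolerate it being up to 2 hours late” (so a backup that usually checks in at 23:58 doesn’t alarm when it arrives at 00:02)
- Timeout: not a third timer. In the Healthchecks API, timeout is simply the field name for the period; the alert goes out once period plus grace (here, 26 hours) has passed without a ping.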
If the ping shows up on time? Green. Late? Yellow. Missing? Red. Alert fires.
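The three states can be sketched as a small shell function. This only illustrates the arithmetic of expected frequency plus grace; it is not how Healthchecks is implemented, and the name `check_status` and its arguments are hypothetical.

```shell
#!/bin/bash
# check_status LAST_PING NOW PERIOD GRACE
# All values are in seconds; LAST_PING and NOW are Unix timestamps.
# Prints "up" while the next ping is not yet due, "late" during the grace
# window, and "down" once period + grace has passed without a ping.
check_status() {
    local last_ping=$1 now=$2 period=$3 grace=$4
    local elapsed=$(( now - last_ping ))

    if (( elapsed <= period )); then
        echo "up"
    elif (( elapsed <= period + grace )); then
        echo "late"
    else
        echo "down"
    fi
}

# Daily job (86400 s) with a 2-hour grace (7200 s), last ping at t=0.
check_status 0 3600 86400 7200     # 1 hour after the last ping  -> up
check_status 0 90000 86400 7200    # 25 hours after              -> late
check_status 0 100000 86400 7200   # almost 28 hours after       -> down
```

With a 24-hour period and a 2-hour grace, the "down" branch is reached 26 hours after the last ping.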
You can also use start and fail signals:
- POST /ping/{uuid}: the normal “I’m done” ping
- POST /ping/{uuid}/start: “I’m about to run” (so you know the difference between “never started” and “started and hung”)
- POST /ping/{uuid}/fail: an explicit “abort, something’s broken” (your script detects an error and tells Healthchecks)
The Healthchecks dashboard shows you exactly when each job last pinged, how long it took (when you send start signals), and whether it’s healthy or alarming. It’s not flashy, but it’s useful.
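A minimal sketch of a wrapper that sends all three signals around a command. The function names `hc_ping` and `run_with_ping` are made up for this example, and the URL is the placeholder used elsewhere in this article; swap in your own check's ping URL.

```shell
#!/bin/bash
# Hypothetical placeholder: replace with the ping URL of your own check.
PING_URL="${PING_URL:-https://checks.example.com/ping/abc123def456}"

# Send one signal to Healthchecks. An empty suffix is the plain success ping.
# Ping errors are swallowed so a monitoring outage never fails the job itself.
hc_ping() {
    curl -fsS -m 10 --retry 3 -o /dev/null "${PING_URL}$1" || true
}

# run_with_ping COMMAND [ARGS...]
# Signals start, runs the command, then reports success or failure.
# Returns the command's own exit status.
run_with_ping() {
    hc_ping "/start"
    if "$@"; then
        hc_ping ""
    else
        local status=$?
        hc_ping "/fail"
        return "$status"
    fi
}
```

Usage would look like `run_with_ping restic backup /home/user/important-stuff`. If the job hangs after the start signal, no success ping arrives and the check still goes red once the grace period runs out.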
Why Self-Hosted?
Healthchecks.io has a free SaaS tier. It’s good. But:
- You’re storing ping timestamps (and thus backup execution windows) on someone else’s server.
- Network-dependent — if your internet is down, the ping fails even if your backup succeeded.
- One more third-party dependency.
Self-hosted Healthchecks runs on Docker, uses PostgreSQL (or SQLite for tiny setups), and sends alerts through your channels: email, Slack, ntfy.sh, webhook, Telegram, PagerDuty, whatever. Total control. And it’s dead simple to deploy.
Docker Compose Setup
Here’s a working stack (PostgreSQL + Healthchecks + Caddy reverse proxy):
    version: '3.8'

    services:
      postgres:
        image: postgres:16-alpine
        environment:
          POSTGRES_DB: healthchecks
          POSTGRES_USER: healthchecks
          POSTGRES_PASSWORD: ${DB_PASSWORD}
        volumes:
          - postgres_data:/var/lib/postgresql/data
        networks:
          - healthchecks
        restart: unless-stopped

      healthchecks:
        image: healthchecks/healthchecks:latest
        environment:
          DEBUG: "False"
          ALLOWED_HOSTS: "checks.example.com"
          SECRET_KEY: ${SECRET_KEY}
          DB: postgresql
          DB_HOST: postgres
          DB_USER: healthchecks
          DB_PASSWORD: ${DB_PASSWORD}
          DB_NAME: healthchecks
          EMAIL_HOST: smtp.example.com
          EMAIL_PORT: 587
          EMAIL_HOST_PASSWORD: ${SMTP_PASSWORD}
          EMAIL_USE_TLS: "True"
          SITE_NAME: "Healthchecks"
          SITE_ROOT: "https://checks.example.com"
        ports:
          - "8000:8000"
        depends_on:
          - postgres
        networks:
          - healthchecks
        restart: unless-stopped
        # No volume on /opt/healthchecks: that path holds the application code,
        # and a named volume there would shadow newer code after an image
        # upgrade. All state lives in PostgreSQL.

      caddy:
        image: caddy:latest
        ports:
          - "80:80"
          - "443:443"
        volumes:
          - ./Caddyfile:/etc/caddy/Caddyfile:ro
          - caddy_data:/data
        networks:
          - healthchecks
        restart: unless-stopped

    volumes:
      postgres_data:
      caddy_data:

    networks:
      healthchecks:

Caddyfile for reverse proxy and HTTPS:

    checks.example.com {
        reverse_proxy healthchecks:8000
        encode gzip
    }

Spin it up:

    # Generate secrets (keep these safe)
    export SECRET_KEY=$(openssl rand -base64 32)
    export DB_PASSWORD=$(openssl rand -base64 32)
    export SMTP_PASSWORD="your-smtp-password"

    docker-compose up -d

Visit https://checks.example.com, create an account, log in. You’re done.
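Exported variables disappear with the shell session, so the next restart of the stack would need the same values again. Docker Compose also reads a `.env` file next to the compose file for variable substitution. Here is a sketch that writes the three variables to such a file once; the function names are hypothetical, and it uses `/dev/urandom` with `base64` in place of `openssl`, which is an equivalent swap for this purpose.

```shell
#!/bin/bash
# Print a random 32-byte secret, base64-encoded on a single line.
gen_secret() {
    head -c 32 /dev/urandom | base64 | tr -d '\n'
}

# write_env_file PATH SMTP_PASSWORD
# Writes SECRET_KEY, DB_PASSWORD and SMTP_PASSWORD to PATH.
# Refuses to overwrite an existing file: a regenerated DB_PASSWORD would no
# longer match the password stored in an already-initialized database.
write_env_file() {
    local path=$1 smtp_password=$2
    if [ -e "$path" ]; then
        echo "refusing to overwrite $path" >&2
        return 1
    fi
    (
        umask 077   # the file is created readable by its owner only
        {
            printf 'SECRET_KEY=%s\n' "$(gen_secret)"
            printf 'DB_PASSWORD=%s\n' "$(gen_secret)"
            printf 'SMTP_PASSWORD=%s\n' "$smtp_password"
        } > "$path"
    )
}
```

Run `write_env_file .env "your-smtp-password"` in the directory that holds the compose file, and keep a copy of the result somewhere safe.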
Integrating with Your Backups
Let’s say you’ve got a restic backup script that runs nightly. You create a check in Healthchecks (grab the URL from the dashboard — looks like https://checks.example.com/ping/abc123def456/).
After your backup finishes, curl that URL:
    #!/bin/bash
    # No "set -e" here: the script has to survive a restic failure
    # long enough to report it.
    PING_URL="https://checks.example.com/ping/abc123def456"

    # ... your restic backup commands here ...
    if restic -r s3://bucket/backup backup /home/user/important-stuff; then
        # Ping Healthchecks to say "I finished successfully"
        curl -fsS -m 10 --retry 5 "$PING_URL/" && echo "Backup and ping succeeded"
    else
        # Notify Healthchecks of failure
        curl -fsS -m 10 --retry 5 "$PING_URL/fail"
        exit 1
    fi

In crontab:

    # Run at 2 AM every day
    0 2 * * * /opt/backup.sh >> /var/log/backup.log 2>&1

That’s it. Healthchecks now knows whether your backup ran, whether it succeeded, and exactly when. If the script doesn’t run for 26+ hours, you get an email.
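Healthchecks also stores the body of a POST ping, which makes it possible to attach the tail of the backup log to each signal. A sketch follows, with hypothetical function names; the 10000-byte default is an assumption about the instance's body size limit, which is configurable on self-hosted installs.

```shell
#!/bin/bash
# log_excerpt FILE [MAX_BYTES]
# Prints the last MAX_BYTES bytes of FILE (default 10000, an assumed limit).
log_excerpt() {
    local file=$1 max_bytes=${2:-10000}
    tail -c "$max_bytes" "$file"
}

# ping_with_log URL FILE
# POSTs the log excerpt as the ping body so it is stored with the ping.
ping_with_log() {
    local url=$1 file=$2
    log_excerpt "$file" \
        | curl -fsS -m 10 --retry 5 -o /dev/null --data-binary @- "$url"
}
```

In the backup script, `ping_with_log "https://checks.example.com/ping/abc123def456/fail" /var/log/backup.log` would put the most recent log lines next to the failure in the dashboard.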
Borgmatic and Rclone Examples
If you’re using borgmatic (which abstracts Borg backup):
    hooks:
        after_backup:
            - curl --silent --show-error --max-time 10 https://checks.example.com/ping/abc123def456/
        on_error:
            - curl --silent --show-error --max-time 10 https://checks.example.com/ping/abc123def456/fail

For rclone sync jobs (replicating to cloud):

    #!/bin/bash
    PING_URL="https://checks.example.com/ping/sync-photos-uuid"

    # Ping success only if rclone succeeded; otherwise report the failure
    if rclone sync /local/photos gdrive:/backup/photos --delete-during; then
        curl -fsS -m 10 --retry 5 "$PING_URL/"
    else
        curl -fsS -m 10 --retry 5 "$PING_URL/fail"
    fi

Schedule and Grace Syntax
When you create a check, you define its expected cadence either as a simple period or as a standard cron expression:
- daily (a simple period of 1 day): once per day
- * * * * *: standard cron, every minute
- 0 2 * * *: your backup runs at 2 AM daily
The grace period is how late you’ll tolerate before alarming. Set it generously enough for network jitter and occasional slowness, but tight enough to catch real problems:
- Backup usually takes 30 min? Set grace to 2 hours.
- Sync job takes 5 minutes? Set grace to 15 minutes.
There is no separate timeout stacked on top of the grace period. In the Healthchecks API, timeout is just the name for the expected period of a simple schedule. If your job starts at 2:00 AM and doesn’t send its ping until it finishes at 4:30 AM, the grace period has to cover those two and a half hours, because the alert goes out as soon as the scheduled time plus grace passes without a ping.
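As a worked example of that arithmetic: for a cron-scheduled check, the moment a missed ping becomes an alert is the scheduled time plus the grace period. The helper name `alert_deadline` is hypothetical; it only does clock math.

```shell
#!/bin/bash
# alert_deadline HH:MM GRACE_MINUTES
# Prints the wall-clock time at which a missed ping would turn into an alert.
alert_deadline() {
    local scheduled=$1 grace_minutes=$2
    local hours=${scheduled%%:*} minutes=${scheduled##*:}
    # 10# forces base 10 so "08" and "09" are not rejected as invalid octal.
    local total=$(( (10#$hours * 60 + 10#$minutes + grace_minutes) % 1440 ))
    printf '%02d:%02d\n' $(( total / 60 )) $(( total % 60 ))
}

alert_deadline 02:00 150   # scheduled 2:00 AM, 2.5 h grace -> 04:30
alert_deadline 23:30 45    # wraps past midnight            -> 00:15
```

If a backup that starts at 2:00 AM can legitimately run until 4:30 AM, a grace of 150 minutes is the bare minimum; anything shorter alerts on a healthy run.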
Alert Channels
Once a check goes red, Healthchecks can notify you via:
- Email — the default, relies on your SMTP setup
- Slack — webhook integration, posts to a channel
- Webhook — POST to an arbitrary URL (great for custom integrations)
- ntfy.sh — self-hosted push notifications over WebSocket
- Telegram — via bot token
- Apprise — multi-channel notifier (supports 50+ services)
Set up a Slack channel #monitoring and route all backup alerts there. Bonus: the alert includes a link back to the Healthchecks dashboard with the exact ping history.
Complementary, Not Replacement
Healthchecks is a dead-man-switch, not a full monitoring system. It answers one question: “Did this job run?”
It doesn’t:
- Monitor CPU, memory, or disk space (that’s Prometheus + Grafana).
- Parse logs for errors (that’s ELK or Loki).
- Alert on slow queries or latency spikes (that’s APM).
But it’s perfect at what it does: catching the silent failures that traditional monitoring misses. Use it alongside Prometheus for the complete picture.
The Cron + Alertmanager Integration
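If you’re already running Prometheus + Alertmanager, you can have Healthchecks push its failures into Alertmanager through a webhook.
Healthchecks has no YAML file for integrations; you add a Webhook integration in the web UI. In shorthand (illustrative notation, not a real config file), the settings look like this:

    integrations:
      - name: alertmanager
        webhook_url: http://alertmanager:9093/api/v2/alerts

Note the v2 path: current Alertmanager releases no longer serve /api/v1/alerts, and the v2 endpoint expects a JSON array of alerts, so the webhook’s request body has to be written in that shape. When Healthchecks detects a failure, it POSTs an alert to Alertmanager, which routes it alongside your other alerts. Now your on-call dashboard treats a missed backup the same as a failing API endpoint.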
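To make the payload concrete, here is a sketch of the JSON that Alertmanager's `/api/v2/alerts` endpoint accepts, built and sent from the shell. The function names, the `HealthchecksDown` alert name, and the label set are illustrative choices, not anything Healthchecks or Alertmanager prescribes. The check name is assumed to contain no quotes or backslashes.

```shell
#!/bin/bash
# alert_payload CHECK_NAME
# Prints a JSON array holding one alert with illustrative labels.
alert_payload() {
    local check_name=$1
    printf '[{"labels":{"alertname":"HealthchecksDown","check":"%s","severity":"critical"},"annotations":{"summary":"%s missed its ping"}}]\n' \
        "$check_name" "$check_name"
}

# push_alert ALERTMANAGER_URL CHECK_NAME
# POSTs the payload to Alertmanager's v2 alerts endpoint.
push_alert() {
    alert_payload "$2" \
        | curl -fsS -m 10 -H 'Content-Type: application/json' \
               --data-binary @- "$1/api/v2/alerts"
}
```

Running `push_alert http://alertmanager:9093 nightly-backup` by hand is a quick way to test routing. In the Healthchecks webhook form, the same JSON would go into the request body, with the `$NAME` placeholder standing in for the check name.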
Maintenance Windows and Pausing
Sometimes you need to take a server offline for maintenance. If you shut it down without pausing its checks first, Healthchecks will alarm the moment the grace period expires.
The dashboard has a “Pause” button on each check. Use it:
- Click “Pause” before you shut down for maintenance.
- Do your work.
- Come back and manually resume the check, or let the next ping resume it automatically.
Pro tip: if maintenance is a recurring event, pausing can be scripted through the management API, but for one-off maintenance, the manual button is clearer.
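For scripted maintenance, a sketch of pausing a check through the management API. It assumes an instance recent enough to expose the v3 API and a read-write API key from the project settings; `HC_SITE`, `HC_API_KEY`, and both function names are placeholders for this example.

```shell
#!/bin/bash
# Hypothetical placeholders: your instance URL and a read-write API key.
HC_SITE="${HC_SITE:-https://checks.example.com}"
HC_API_KEY="${HC_API_KEY:-replace-me}"

# pause_url CHECK_UUID
# Prints the management API endpoint that pauses the given check.
pause_url() {
    printf '%s/api/v3/checks/%s/pause\n' "$HC_SITE" "$1"
}

# pause_check CHECK_UUID
# Pauses one check. The empty --data makes curl send a POST with a
# zero-length body. The next ping from the job resumes monitoring.
pause_check() {
    curl -fsS -m 10 -o /dev/null \
        -H "X-Api-Key: $HC_API_KEY" --data "" "$(pause_url "$1")"
}
```

Calling `pause_check` for each affected check at the top of a maintenance script removes the need to remember the button.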
Comparing to Alternatives
Cronitor (SaaS): Feature-rich, beautiful dashboard, but you’re paying per check and your data lives with them.
Custom Prometheus blackbox exporter: You could run blackbox probes against these URLs and scrape the results into Prometheus. Overkill for simple cron monitoring, but flexible if you’re already heavy on Prometheus.
Systemd notify: Built into systemd, but a failed timer unit only shows up in the journal unless you add your own OnFailure= handler. Useful locally, not sufficient for distributed alerting.
Dead simple: logger + log shipping: Pipe cron output to syslog, ship to Loki, alert on missing logs. Works, but requires more infrastructure.
Healthchecks wins on simplicity and purpose-built design. It’s the HTTP ping philosophy: minimal overhead, maximum clarity.
The Crons That Need Watching
Every periodic job that matters deserves a check:
- Backups — restic, borgmatic, rclone sync, duplicati
- Database maintenance — VACUUM, REINDEX, replication tests
- Certificate renewal — certbot, acme.sh
- Health checks — checks that your monitoring itself is working (recursive, I know)
- Replication and sync — Syncthing, rsync, cloud sync
- Snapshot-restore drills — periodic restore tests to prove your backups actually work
- Log rotation and cleanup — logrotate, old cache purging
- DNS updates — dynamic DNS clients, DDNS scripts
Create a Healthchecks check for each. Green dashboard = peace of mind. Red dashboard = you know exactly what’s broken before it bites you.
Your backups are worthless if you don’t know they’re running. Healthchecks makes sure you do.