Healthchecks.io Self-Hosted: Cron Monitoring

Your Backup Cron Failed Silently. You’ll Find Out in Six Months.

Exit codes don’t email you. Your monitoring stack doesn’t know if a job didn’t run. It only sees what you explicitly tell it about. And that’s where a lot of backup failures hide.

You’ve got restic or borgmatic set up, right? Running every night at 2 AM via cron. One night it starts failing due to network flakiness, then fails again the next night and the night after, but there’s no dashboard screaming about it. You keep sleeping. By the time you notice (usually when you need to restore), six months of “backups” are actually just a pile of error logs.

This is the dead-man-switch problem. Not “something went wrong,” but “something didn’t happen at all.”

Healthchecks.io solves this. The self-hosted version runs on your own hardware, integrates with everything from restic to systemd timers, and alerts you as soon as a periodic job misses its check-in window.

What Is a Dead-Man-Switch?

Picture this: you’re piloting a plane. There’s a button in your hand. While you’re conscious and holding it, all is well. The moment you fall asleep (or worse), your grip loosens. The button releases. Alarm sounds.

That’s a dead-man-switch. In monitoring terms: I expect you to ping me every 24 hours. If you don’t, something’s wrong.

It’s the inverse of traditional alerting:

Alert on presence (traditional): “Fire a dashboard metric when a job succeeds.” But if the job doesn’t run, there’s no metric. Silence.
Alert on absence (dead-man-switch): “If I don’t hear a ping in the next 26 hours, wake me up.” Failure, silence, or absence all trigger it.

Cron jobs are the poster child for this problem because nothing watches them: their output goes to local mail or a log file nobody reads, and they expose no metrics and no Prometheus scrape endpoint. They just… run (or don’t). Your monitoring won’t know the difference.

How Healthchecks.io Works

You create a “check”, essentially a URL with a schedule and a grace time. Your cron job (or systemd timer, or Kubernetes CronJob) sends an HTTP request (GET, HEAD, or POST) to that URL after it finishes. Healthchecks records each ping and notices when an expected one is missing.

Period: “This job should ping me every 24 hours” (the API calls this value timeout; cron-scheduled checks use a cron expression instead)
Grace time: “I’ll tolerate it being up to 2 hours late before I alert” (so a backup that usually finishes at 02:30 doesn’t alarm when a slow night pushes it to 03:45)

If the ping shows up on time? Green. Overdue but still inside the grace time? Yellow. Grace time expired? Red. Alert fires.
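The green/yellow/red logic is easy to reason about once you see it as arithmetic on timestamps. Here’s a minimal sketch of that reasoning. It is a simplified model, not Healthchecks’ actual code, and the function name check_status is made up for illustration.

```shell
#!/bin/sh
# Classify a check from its last ping time, the way a dead-man switch reasons.
# Arguments: last_ping and now are Unix timestamps; period and grace are seconds.
check_status() {
  last_ping=$1 now=$2 period=$3 grace=$4
  elapsed=$((now - last_ping))
  if [ "$elapsed" -le "$period" ]; then
    echo up        # pinged within the expected period
  elif [ "$elapsed" -le $((period + grace)) ]; then
    echo late      # overdue, but still inside the grace window
  else
    echo down      # grace expired: this is when the alert would fire
  fi
}
```

With a 24-hour period (86400 seconds) and a 2-hour grace (7200 seconds), a job last heard from 25 hours ago is late, and one last heard from 27 hours ago is down.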

You can also use start and fail signals:

POST /ping/{uuid}: normal “I’m done” ping
POST /ping/{uuid}/start: “I’m about to run” (so you know the difference between “never started” vs “started and hung”)
POST /ping/{uuid}/fail: explicit “abort, something’s broken” (your script detects an error, tells Healthchecks)
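The three signals combine naturally into a small wrapper around any command. This is a sketch under assumptions: the helper names hc_ping and run_with_ping are invented here, and the default PING_URL is the placeholder used throughout this post.

```shell
#!/bin/sh
# PING_URL is a placeholder; point it at the ping URL of your own check.
PING_URL="${PING_URL:-https://checks.example.com/ping/abc123def456}"

# Send one signal; an empty suffix means success. A failed ping should never
# abort the job itself, hence the trailing "|| true".
hc_ping() {
  curl -fsS -m 10 --retry 5 -o /dev/null "$PING_URL$1" || true
}

# Wrap any command: /start before it runs, then success or /fail afterwards.
# The wrapped command's exit code is passed through to the caller.
run_with_ping() {
  hc_ping /start
  "$@"
  rc=$?
  if [ "$rc" -eq 0 ]; then hc_ping ""; else hc_ping /fail; fi
  return "$rc"
}
```

Usage would look like run_with_ping /opt/backup.sh. Because the start ping goes out first, the dashboard can tell “never started” apart from “started and hung”.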

The Healthchecks dashboard shows you exactly when each job last pinged, how long it took (if the job sends start pings), and whether it’s healthy or alarming. It’s not flashy, but it’s useful.

Why Self-Hosted?

Healthchecks.io has a free SaaS tier. It’s good. But:

You’re storing ping timestamps (and thus backup execution windows) on someone else’s server.
Network-dependent: if your internet is down, the ping fails even if your backup succeeded.
One more third-party dependency.

Self-hosted Healthchecks runs on Docker, uses PostgreSQL (or SQLite for tiny setups), and sends alerts through your channels: email, Slack, ntfy.sh, webhook, Telegram, PagerDuty, whatever. Total control. And it’s dead simple to deploy.

Docker Compose Setup

Here’s a working stack (PostgreSQL + Healthchecks + Caddy reverse proxy):

services:
  postgres:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: healthchecks
      POSTGRES_USER: healthchecks
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks:
      - healthchecks
    restart: unless-stopped

  healthchecks:
    image: healthchecks/healthchecks:latest
    environment:
      DEBUG: "False"
      ALLOWED_HOSTS: "checks.example.com"
      SECRET_KEY: ${SECRET_KEY}
      DB: postgres
      DB_HOST: postgres
      DB_USER: healthchecks
      DB_PASSWORD: ${DB_PASSWORD}
      DB_NAME: healthchecks
      EMAIL_HOST: smtp.example.com
      EMAIL_PORT: 587
      EMAIL_HOST_USER: alerts@example.com  # placeholder: your SMTP login
      EMAIL_HOST_PASSWORD: ${SMTP_PASSWORD}
      EMAIL_USE_TLS: "True"
      DEFAULT_FROM_EMAIL: alerts@example.com  # placeholder: the From address on alerts
      SITE_NAME: "Healthchecks"
      SITE_ROOT: "https://checks.example.com"
    ports:
      - "8000:8000"
    depends_on:
      - postgres
    networks:
      - healthchecks
    restart: unless-stopped
    # No volume needed here: state lives in Postgres, and a volume mounted over
    # /opt/healthchecks would keep serving old application code after image upgrades.

  caddy:
    image: caddy:latest
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy_data:/data
    networks:
      - healthchecks
    restart: unless-stopped

volumes:
  postgres_data:
  caddy_data:

networks:
  healthchecks:

Caddyfile for reverse proxy and HTTPS:

checks.example.com {
  reverse_proxy healthchecks:8000
  encode gzip
}

Spin it up:

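# Generate secrets once and store them in .env, which Compose reads automatically.
# Regenerating DB_PASSWORD later would lock you out of the existing Postgres volume.
cat > .env <<EOF
SECRET_KEY=$(openssl rand -hex 32)
DB_PASSWORD=$(openssl rand -hex 32)
SMTP_PASSWORD=your-smtp-password
EOF
chmod 600 .env

docker compose up -d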

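Visit https://checks.example.com and sign up. Healthchecks logs you in through a link it emails you, so SMTP has to work first. Alternatively, create the first account from the command line with the bundled manage.py createsuperuser command. Once you’re in, consider setting REGISTRATION_OPEN to "False" so strangers can’t sign up.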

Integrating with Your Backups

Let’s say you’ve got a restic backup script that runs nightly. Create a check in Healthchecks and grab its ping URL from the dashboard. It looks like https://checks.example.com/ping/abc123def456/, except that the last segment is a full UUID in real life.

After your backup finishes, curl that URL:

#!/bin/bash

# ... your restic backup commands here ...
restic -r s3:s3.amazonaws.com/bucket/backup backup /home/user/important-stuff
status=$?  # capture restic's exit code (NOT curl's) before running anything else

if [ "$status" -eq 0 ]; then
  # Backup worked: ping the success URL
  curl -fsS -m 10 --retry 5 -o /dev/null https://checks.example.com/ping/abc123def456/
else
  # Backup failed: tell Healthchecks explicitly
  curl -fsS -m 10 --retry 5 -o /dev/null https://checks.example.com/ping/abc123def456/fail
  exit "$status"
fi

(Note: don’t use set -e here. You want the script to survive a failed backup long enough to fire the /fail ping; otherwise it dies silently and Healthchecks only catches it later, when the grace time runs out.)

In crontab:

# Run at 2 AM every day
0 2 * * * /opt/backup.sh >> /var/log/backup.log 2>&1

That’s it. Healthchecks now knows whether your backup ran, whether it succeeded, and exactly when. If no success ping arrives within the period plus the grace time (26 hours, with a 24-hour period and a 2-hour grace), you get an email.
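Since the crontab line already writes to /var/log/backup.log, you can attach the tail of that log to the ping. Healthchecks stores the body of a POSTed ping, up to a size limit, and shows it next to the event. A minimal sketch, where the function name ping_with_log is made up for illustration:

```shell
#!/bin/sh
# Send the last lines of a log file as the body of a ping.
# Arguments: ping URL, log file, and optionally how many lines (default 50).
ping_with_log() {
  tail -n "${3:-50}" "$2" |
    curl -fsS -m 10 --retry 5 -o /dev/null --data-binary @- "$1"
}
```

For example, ping_with_log https://checks.example.com/ping/abc123def456/fail /var/log/backup.log sends the failure signal together with the last 50 log lines, so the alert points you at the error without an SSH session.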

Borgmatic and Rclone Examples

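If you’re using borgmatic (which wraps Borg backup), command hooks in its config can do the pinging. The exact hook syntax depends on your borgmatic version, and recent releases also include a built-in Healthchecks hook, so check the documentation for the version you run:

hooks:
  after_backup:
    - curl --silent --show-error --max-time 10 https://checks.example.com/ping/abc123def456/

  on_error:
    - curl --silent --show-error --max-time 10 https://checks.example.com/ping/abc123def456/fail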

For rclone sync jobs (replicating to cloud):

#!/bin/bash
URL=https://checks.example.com/ping/sync-photos-uuid

# Ping success only if rclone exits 0; otherwise report the failure
if rclone sync /local/photos gdrive:/backup/photos --delete-during; then
  curl -fsS -m 10 --retry 5 -o /dev/null "$URL"
else
  curl -fsS -m 10 --retry 5 -o /dev/null "$URL/fail"
fi

Schedule and Grace Syntax

When you create a check, you define its expected cadence either as a simple period or as a cron expression with a time zone:

Simple, period of 1 day: expects a ping at least once every 24 hours
* * * * *: cron expression, every minute
0 2 * * *: cron expression, 2 AM daily (matches the backup cron above)

The grace period is how late you’ll tolerate before alarming. Set it generously enough for network jitter and occasional slowness, but tight enough to catch real problems:

Backup usually takes 30 min? Set grace to 2 hours.
Sync job takes 5 minutes? Set grace to 15 minutes.

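Healthchecks has no separate timeout on top of the grace time (in the API, timeout is just the name for the period of a simple schedule). With a cron schedule, the ping is expected at the scheduled time and the grace time starts counting from there. So if your job starts at 2:00 AM and doesn’t finish until 4:30 AM, the grace time has to cover those two and a half hours plus some slack, or the check goes red while the backup is still running. If the job sends a /start ping, Healthchecks instead expects the success ping within the grace time after the start signal.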
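Schedules and grace times can also be set through the management API instead of the web form. The sketch below assumes the /api/v1/checks/ endpoint and a read-write API key. HC_SITE, HC_API_KEY, and both function names are placeholders invented for this example.

```shell
#!/bin/sh
# Build the JSON body for a cron-scheduled check. Grace is in seconds.
# Values are interpolated as-is, so avoid quotes or backslashes in them.
check_payload() {
  printf '{"name": "%s", "schedule": "%s", "tz": "%s", "grace": %d}\n' \
    "$1" "$2" "$3" "$4"
}

# Create the check. HC_SITE is your instance URL and HC_API_KEY a read-write
# API key from the project settings page (both are placeholders here).
create_check() {
  check_payload "$@" |
    curl -fsS -m 10 -H "X-Api-Key: $HC_API_KEY" \
      -H "Content-Type: application/json" \
      --data-binary @- "$HC_SITE/api/v1/checks/"
}
```

Something like create_check nightly-restic '0 2 * * *' UTC 10800 would describe the 2 AM backup with a three-hour grace, enough to cover a run that finishes at 4:30 AM.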

Alert Channels

Once a check goes red, Healthchecks can notify you via:

Email: the default, relies on your SMTP setup
Slack: webhook integration, posts to a channel
Webhook: POST to an arbitrary URL (great for custom integrations)
ntfy: push notifications over a simple HTTP POST (the hosted ntfy.sh or your own server)
Telegram: via bot token
Apprise: multi-channel notifier (supports 50+ services)

Set up a Slack channel #monitoring and route all backup alerts there. Bonus: the alert includes a link back to the Healthchecks dashboard with the exact ping history.

Complementary, Not Replacement

Healthchecks is a dead-man-switch, not a full monitoring system. It answers one question: “Did this job run?”

It doesn’t:

Monitor CPU, memory, or disk space (that’s Prometheus + Grafana).
Parse logs for errors (that’s ELK or Loki).
Alert on slow queries or latency spikes (that’s APM).

But it’s perfect at what it does: catching the silent failures that traditional monitoring misses. Use it alongside Prometheus for the complete picture.

The Cron + Alertmanager Integration

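If you’re already running Prometheus + Alertmanager, you can have Healthchecks push alerts into Alertmanager’s HTTP API through a webhook integration.

Integrations live in the Healthchecks web UI rather than in a config file. Add a Webhook integration and fill in the “down” event roughly like this (the label names are only an example):

URL: http://alertmanager:9093/api/v2/alerts
Method: POST
Request headers: Content-Type: application/json
Request body: [{"labels": {"alertname": "HealthchecksDown", "check": "$NAME"}}]

When Healthchecks detects a failure, it POSTs the alert to Alertmanager, which routes it alongside your other alerts. Now your on-call dashboard treats a missed backup the same as a failing API endpoint. One caveat: Alertmanager considers a pushed alert resolved once it stops being re-sent (after resolve_timeout, unless the payload sets endsAt), so this gives you a notification rather than a continuously firing alert. Older guides use /api/v1/alerts, which current Alertmanager releases no longer serve.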
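Before wiring anything up, it helps to push a test alert by hand and watch it come out the other end. This sketch assumes Alertmanager’s v2 API path (/api/v2/alerts). The label names, the alertname value, ALERTMANAGER_URL, and the function names are illustrative choices, not anything Healthchecks prescribes.

```shell
#!/bin/sh
# Build a one-element alert list in the shape Alertmanager's v2 API accepts.
# Arguments: check name, severity. Avoid quotes or backslashes in the values.
alert_payload() {
  printf '[{"labels": {"alertname": "HealthchecksDown", "check": "%s", "severity": "%s"}}]\n' \
    "$1" "$2"
}

# Post it. ALERTMANAGER_URL is a placeholder, e.g. http://alertmanager:9093
send_test_alert() {
  alert_payload "$@" |
    curl -fsS -m 10 -H "Content-Type: application/json" \
      --data-binary @- "$ALERTMANAGER_URL/api/v2/alerts"
}
```

If send_test_alert nightly-restic warning reaches your usual notification channel, the routing side works and only the Healthchecks webhook remains to configure.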

Maintenance Windows and Pausing

Sometimes you need to take a server offline for maintenance. If you don’t pause the check first, Healthchecks will alarm the moment the grace period expires. That’s exactly the behavior you want normally, but it’s annoying when you took the box down on purpose.

The dashboard has a “Pause” button on each check. Use it:

Click “Pause” before you shut down for maintenance.
Do your work.
Come back and resume the check manually, or just let the next ping bring it back to life (a paused check resumes as soon as it receives one).

Pro tip: pausing can also be scripted through the management API, but for one-off maintenance, the manual button is clearer.
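If you’d rather pause from a maintenance script than from the browser, a sketch along these lines should do it. It assumes the management API’s pause endpoint and a read-write API key. HC_SITE, HC_API_KEY, and the function name pause_check are placeholders for this example.

```shell
#!/bin/sh
# Pause a check through the management API before planned downtime.
# Argument: the check's UUID. HC_SITE and HC_API_KEY are placeholders.
pause_check() {
  # --data "" turns the request into a POST with an explicit empty body
  curl -fsS -m 10 -H "X-Api-Key: $HC_API_KEY" --data "" \
    -o /dev/null "$HC_SITE/api/v1/checks/$1/pause"
}
```

There’s no matching “unpause” step in the sketch because the first ping after maintenance puts the check back into rotation.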

Comparing to Alternatives

Cronitor (SaaS): Feature-rich, beautiful dashboard, but you’re paying per check and your data lives with them.

Custom Prometheus blackbox exporter: You could run blackbox probes against these URLs and scrape the results into Prometheus. Overkill for simple cron monitoring, but flexible if you’re already heavy on Prometheus.

systemd timers: Failures land in the journal and can trigger an OnFailure= unit, but nothing leaves the host unless you build that part yourself, and a dead host reports nothing. Useful locally, not sufficient for distributed alerting.

DIY logger + log shipping: Pipe cron output to syslog, ship to Loki, alert on missing logs. Works, but requires more infrastructure.

Healthchecks wins on simplicity and purpose-built design. It’s the HTTP ping philosophy: minimal overhead, maximum clarity.

The Crons That Need Watching

Every periodic job that matters deserves a check:

Backups: restic, borgmatic, rclone sync, duplicati
Database maintenance: VACUUM, REINDEX, replication tests
Certificate renewal: certbot, acme.sh
Health checks: checks that your monitoring itself is working (recursive, I know)
Replication and sync: Syncthing, rsync, cloud sync
Snapshot-restore drills: periodic restore tests to prove your backups actually work
Log rotation and cleanup: logrotate, old cache purging
DNS updates: dynamic DNS clients, DDNS scripts

Create a Healthchecks check for each. Green dashboard = peace of mind. Red dashboard = you know exactly what’s broken before it bites you.

Your backups are worthless if you don’t know they’re running. Healthchecks makes sure you do.
