Anubis: Anti-AI-Crawler Proof-of-Work

By SumGuy 9 min read
Your Content Is Being Stolen. Here’s How To Fight Back.

It’s 2026. Your blog posts, your code snippets, your carefully-written tutorials — they’re being vacuumed into training datasets by the dozen. Some polite, with a User-Agent that says ClaudeBot or GPTBot. Most not even trying to hide. You can block them at robots.txt, but that’s like putting a polite sign on a chain-link fence. A smarter thief just walks around it.

What if instead of a sign, you put up a math problem? One that humans solve instantly (with their browser doing the work in the background) but that makes AI scrapers’ cost curve go vertical?

That’s the idea behind Anubis — a proof-of-work gating system that sits in front of your site, challenges bots with computational puzzles, and lets real humans through without breaking a sweat. No CAPTCHA farms, no email gates, just: “Solve this hash, or get out.”

This is not theoretical. It’s a real tool, it works, and for a self-hosted blog in 2026, it’s worth understanding.


Why PoW Gating Matters Now

Three reasons you should care:

1. Your data pays someone else’s bills. Training data is valuable. Every AI scraper is betting that extracting your content costs less than what they’ll make selling the model. Right now they’re winning that bet, because scraping costs them almost nothing. PoW changes that arithmetic.

2. robots.txt is theater. GPTBot respects it. Most others don’t. And even the polite ones can be spoofed with a proper User-Agent header. A robots.txt file is like a velvet rope in a store with no security guard. Anubis is the security guard.

3. It gets better with adoption. As more sites deploy PoW gating, AI companies will need to either pay (through Proof-of-Work CDN providers) or crawl less aggressively. That’s regulation through code, and it scales.

The downside? False positives. We’ll get to that.


How Anubis PoW Works (The 30-Second Version)

When a bot (or human) requests your site:

  1. Your reverse proxy (Caddy/Nginx) intercepts the request
  2. It returns a lightweight PoW challenge — “hash this data until you get a result starting with five zeros”
  3. The client solves it (humans’ browsers do this in the background; bots’ CPUs work overtime)
  4. The solution is verified and the real request goes through
  5. Repeat on interval to prevent session hijacking
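The challenge in steps 2 to 4 can be sketched in a few lines. This is a minimal illustration of the hash-until-leading-zeros idea, not Anubis’s actual implementation: the SHA-256 choice, the `challenge + nonce` layout, and the function names `solve` and `verify` are all assumptions made for the example.

```python
import hashlib
import itertools


def _digest(challenge: str, nonce: int) -> str:
    # Hash the server-issued challenge string together with a candidate nonce.
    return hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()


def solve(challenge: str, zeros: int) -> int:
    """Client side: try nonces until the hex digest starts with `zeros` zeros."""
    for nonce in itertools.count():
        if _digest(challenge, nonce).startswith("0" * zeros):
            return nonce


def verify(challenge: str, nonce: int, zeros: int) -> bool:
    """Server side: a single hash is enough to check the submitted nonce."""
    return _digest(challenge, nonce).startswith("0" * zeros)
```

The asymmetry is the point: `solve` loops thousands or millions of times, while `verify` hashes once.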

The work is tunable. Set it light and you barely notice. Crank it up and a bot’s cost-per-page explodes from ~$0.001 to ~$0.50. At scale, that destroys the ROI on scraping.
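To see why a small difficulty bump matters, here is a rough sketch of the arithmetic. It assumes difficulty counts leading zero bits of the hash (the post does not define the scale, so treat that as a guess), and the hash rate you feed in is whatever you measure on your own hardware. Under that assumption, "five zeros" in hex is 20 bits.

```python
def expected_hashes(difficulty_bits: int) -> int:
    # Each hash has a 1 / 2**bits chance of passing, so the expected
    # number of attempts before a success is 2**bits.
    return 2 ** difficulty_bits


def expected_seconds(difficulty_bits: int, hashes_per_second: float) -> float:
    # Average solve time for a client with the given (measured) hash rate.
    return expected_hashes(difficulty_bits) / hashes_per_second
```

Every extra bit doubles the expected work, so going from 16 to 20 makes each page 16 times more expensive to fetch.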

The beauty: humans don’t see anything. Their browser does the work without blocking the page load.


Deploying Anubis with Caddy

Here’s the practical setup. You’re running a self-hosted blog (or any site) behind Caddy, and you want to gate AI scrapers.

Step 1: Set Up the Anubis Reverse Proxy

Anubis runs as a sidecar service. It sits between your CDN/firewall and your actual web server.

version: '3.9'
services:
  anubis:
    image: anubis:latest
    # or: thorax/anubis:latest from Docker Hub
    container_name: anubis-proxy
    ports:
      - "8080:8080" # HTTP listener
      - "8443:8443" # HTTPS listener (optional, use Caddy's TLS)
    environment:
      # Upstream target (your actual blog)
      UPSTREAM_URL: "http://blog:3000"
      # PoW difficulty (0-30, default: 16)
      # 16 = ~100ms for decent laptop, ~5-10s for bot
      POW_DIFFICULTY: "18"
      # Whitelist bypass (comma-separated)
      # Real browsers get free passes via JWT or session cookie
      WHITELIST_BYPASS_ENABLED: "true"
      WHITELIST_COOKIE_NAME: "anubis-pass"
      WHITELIST_TTL_SECONDS: "3600" # 1 hour free pass after solving once
      # Bot detection rules (exact match on User-Agent)
      BOT_RULES: |
        {
          "always_challenge": ["GPTBot", "ClaudeBot", "PerplexityBot"],
          "never_challenge": ["Googlebot", "Bingbot", "Slurp"],
          "always_block": ["MJ12Bot", "DotBot"]
        }
      # Logging
      LOG_LEVEL: "info"
      LOG_FILE: "/var/log/anubis/access.log"
    volumes:
      - anubis_logs:/var/log/anubis
    restart: unless-stopped
    networks:
      - sumguy
  blog:
    image: sumguy-astro:latest
    container_name: sumguy-blog
    ports:
      - "3000:3000"
    environment:
      NODE_ENV: "production"
    restart: unless-stopped
    networks:
      - sumguy
volumes:
  anubis_logs:
networks:
  sumguy:
    driver: bridge

Spin it up:

docker compose up -d

Anubis is now listening on localhost:8080. It forwards requests to your blog at blog:3000 once the PoW challenge is solved.

Step 2: Route Traffic Through Anubis with Caddy

Your Caddy config points to the Anubis proxy instead of the blog directly:

sumguy.com {
    # Point to Anubis, not directly to the blog
    reverse_proxy localhost:8080 {
        # Caddy passes User-Agent through unchanged and sets
        # X-Forwarded-For itself, so bot detection needs no header_up lines
        # Long timeout for slow upstream responses
        transport http {
            response_header_timeout 30s
        }
    }
    # Optionally, log raw User-Agents for tuning
    log {
        output file /var/log/caddy/access.log {
            roll_size 100MiB
            roll_keep 5
        }
        format json
    }
    # Security headers (unchanged)
    header X-Content-Type-Options nosniff
    header X-Frame-Options DENY
    header Referrer-Policy no-referrer
    header Permissions-Policy "geolocation=(), microphone=(), camera=()"
}

Reload Caddy:

caddy reload

Now traffic flows: Browser/Bot → Caddy → Anubis → Blog


Tuning Bot Rules & False Positives

The GPTBot Problem

GPTBot (OpenAI’s crawler) is smart. It respects robots.txt, uses honest User-Agents, and comes with good intentions. But OpenAI’s terms say they’ll scrape anyway if you don’t explicitly opt out (via robots.txt or x-robots-tag).

You have three choices:

  1. Let GPTBot through (no PoW) — they’ll train on your content, you get attribution in the LLM’s training data. Some people call that marketing.

  2. Challenge GPTBot (medium PoW) — make it expensive but not impossible. They’ll sample less aggressively but still crawl.

  3. Block GPTBot entirely (highest PoW) — they give up. No training, no attribution.

Here’s a moderate config:

BOT_RULES: |
  {
    "always_challenge": {
      "GPTBot": 20,
      "ClaudeBot": 20,
      "PerplexityBot": 18
    },
    "never_challenge": ["Googlebot", "Bingbot", "Slurp", "Yandex"],
    "always_block": ["MJ12Bot", "DotBot", "SemrushBot"]
  }

The numbers are difficulty levels. "GPTBot": 20 means GPTBot gets a PoW challenge with difficulty 20 (harder than the default). "never_challenge" lets search engines index normally (they’re indexing, not scraping for training).
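Here is one way to read those rules, sketched in Python. The matching strategy (case-insensitive substring), the precedence order (block, then allow, then challenge), and the fallback to the default difficulty of 16 for unlisted agents are assumptions for illustration; the `classify` function is hypothetical and not part of Anubis.

```python
from typing import Optional, Tuple

DEFAULT_DIFFICULTY = 16  # the default difficulty quoted earlier in the post

RULES = {
    "always_challenge": {"GPTBot": 20, "ClaudeBot": 20, "PerplexityBot": 18},
    "never_challenge": ["Googlebot", "Bingbot", "Slurp", "Yandex"],
    "always_block": ["MJ12Bot", "DotBot", "SemrushBot"],
}


def classify(user_agent: str, rules: dict = RULES) -> Tuple[str, Optional[int]]:
    """Return (action, difficulty) for a User-Agent string."""
    ua = user_agent.lower()
    # Assumed precedence: block beats allow, allow beats challenge.
    if any(token.lower() in ua for token in rules["always_block"]):
        return ("block", None)
    if any(token.lower() in ua for token in rules["never_challenge"]):
        return ("allow", None)
    for token, difficulty in rules["always_challenge"].items():
        if token.lower() in ua:
            return ("challenge", difficulty)
    # Anything unlisted (including real browsers) gets the default challenge.
    return ("challenge", DEFAULT_DIFFICULTY)
```

Keep in mind that a User-Agent is just a header. Anything in "never_challenge" is an open door for a scraper willing to lie about who it is.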

False Positives: When Legitimate Tools Get Challenged

Here’s the annoying part: legitimate tools that aren’t browsers will hit PoW walls.

Common false positives:

Solutions:

1. Whitelist by User-Agent (surgical):

BOT_RULES: |
  {
    "never_challenge": [
      "Googlebot",
      "Slurp",
      "bingbot",
      "Feedly",
      "Inoreader",
      "Slack",
      "facebookexternalhit",
      "Twitterbot"
    ]
  }

2. Whitelist by IP (for your own tools):

WHITELIST_IPS: "10.0.0.5, 192.168.1.100" # Uptime Kuma, your status page

3. Use session cookies (best UX):

Once a human solves the PoW, they get a cookie valid for 1 hour. On refresh, no challenge. Bots don’t preserve cookies across sessions, so they re-solve every time (expensive).

WHITELIST_COOKIE_NAME: "anubis-pass"
WHITELIST_TTL_SECONDS: "3600"
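For a feel of what a "free pass" cookie involves, here is a minimal sketch of a signed, expiring token. It is an illustration only: the `expires.signature` format, the HMAC-SHA256 signing, and the names `SECRET`, `issue_pass`, and `is_valid_pass` are all hypothetical, and the post only says the real thing is a JWT or session cookie.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"replace-with-a-random-secret"  # hypothetical server-side key
TTL_SECONDS = 3600  # matches WHITELIST_TTL_SECONDS above


def _sign(expires: str) -> str:
    return hmac.new(SECRET, expires.encode(), hashlib.sha256).hexdigest()


def issue_pass(now: Optional[float] = None) -> str:
    """Build the cookie value handed out after a solved challenge."""
    current = time.time() if now is None else now
    expires = str(int(current) + TTL_SECONDS)
    return f"{expires}.{_sign(expires)}"


def is_valid_pass(cookie: str, now: Optional[float] = None) -> bool:
    """Accept the cookie only if the signature matches and it has not expired."""
    try:
        expires, signature = cookie.split(".", 1)
        expires_at = int(expires)
    except ValueError:
        return False
    # Constant-time comparison so the signature cannot be guessed byte by byte.
    if not hmac.compare_digest(signature.encode(), _sign(expires).encode()):
        return False
    current = time.time() if now is None else now
    return current < expires_at
```

Because the expiry is covered by the signature, a client cannot extend its own pass by editing the timestamp.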

The tradeoff: Tighter rules = fewer bots, but more false positives. Looser rules = happier readers and tools, but an easier path for bots.


Monitoring & Observability

Keep logs. You’ll want to know which bots are hitting you hardest and whether your PoW is actually slowing them down.

Check Anubis logs:

docker logs anubis-proxy | grep -E "bot|challenge|solved|failed"

Sample output (hypothetical):

2026-11-27T10:15:22Z INFO request=GET:/blog/anubis-post user_agent=GPTBot difficulty=20 solved=true latency_ms=4203
2026-11-27T10:15:45Z INFO request=GET:/blog/anubis-post user_agent=Mozilla/5.0 difficulty=0 solved=true latency_ms=12
2026-11-27T10:16:03Z WARN request=GET:/ user_agent=DotBot challenge_failed=true ip=203.0.113.45 attempts=3

Read the story: GPTBot took 4+ seconds (the PoW), the real browser took 12ms (difficulty=0, so it already held a valid pass cookie and skipped the challenge), DotBot failed 3 times and gave up. Working as designed.
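If you want more than grep, a few lines of Python can roll those entries up per User-Agent. This sketch assumes the hypothetical `key=value` log format shown above, including User-Agents without spaces; real User-Agent strings contain spaces and would need a proper parser. The function names are made up for the example.

```python
from collections import defaultdict


def parse_line(line: str) -> dict:
    """Split a log line into its key=value fields, ignoring bare tokens."""
    fields = {}
    for token in line.split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields


def summarize(lines) -> dict:
    """Count requests, failed challenges, and solve latencies per User-Agent."""
    stats = defaultdict(lambda: {"requests": 0, "failed": 0, "latency_ms": []})
    for line in lines:
        fields = parse_line(line)
        agent = fields.get("user_agent")
        if agent is None:
            continue
        entry = stats[agent]
        entry["requests"] += 1
        if fields.get("challenge_failed") == "true":
            entry["failed"] += 1
        if "latency_ms" in fields:
            entry["latency_ms"].append(int(fields["latency_ms"]))
    return dict(stats)
```

Feed it the output of `docker logs anubis-proxy` line by line and watch whether latency for the challenged bots actually rises when you raise their difficulty.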

Set up metrics export (Prometheus optional, but useful):

PROMETHEUS_ENABLED: "true"
PROMETHEUS_PORT: "9090"

Then scrape localhost:9090/metrics from your Prometheus instance.


Edge Cases & Gotchas

1. CDN Caching Breaks PoW Challenges

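If you’re using Cloudflare or another CDN, it might cache PoW challenge responses and serve one visitor’s challenge to another. Don’t let it. Disable caching on what Anubis returns:

# Caddyfile
sumguy.com {
    reverse_proxy localhost:8080 {
        # Disable caching for PoW responses
        # (header_down rewrites response headers coming back from Anubis;
        # as written, this applies to every proxied response)
        header_down Cache-Control "no-store, no-cache, must-revalidate"
    }
}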

2. Mobile Users on Slow Networks

PoW can take longer on older phones or 3G. Set difficulty conservatively (16-18). Test on a throttled connection.

# Chrome DevTools → Network tab → Slow 3G
# Verify page load still feels responsive (<2s before content visible)

3. Legitimate Scrapers (Wayback Machine, Archive.org)

Wayback Machine’s crawler is well-intentioned. But it’s a scraper. You have to pick: let them preserve your site for posterity, or block them.

never_challenge: ["archive.org_bot", "ia_archiver"]
# OR
always_challenge: ["archive.org_bot"] # Medium PoW, they'll sample less

4. China & Great Firewall

If you have readers in mainland China, PoW adds latency. High difficulty (>22) might make the site unusable over GFW. Keep it at 16-18 if you expect international traffic.


The Decision: Is Anubis Right for You?

Use Anubis if:

Skip Anubis if:

The Honest Take

Anubis doesn’t stop scraping. It doesn’t kill the problem. What it does is raise the cost high enough that bots become pickier about which sites to scrape. If a bot has to pay ~$0.50/page for your content or can get someone else’s for ~$0.001, they’ll pick someone else. That’s the goal.

In 2026, as more sites deploy PoW gating, this becomes an arms race. Smarter bots will optimize their PoW solvers. You’ll crank up difficulty. It’ll get weird. But right now, today, Anubis gives you leverage where you had none before.

Deploy it. Monitor it. Tune it. Your 2 AM self will appreciate knowing your content stays yours a little bit longer.


