Your Content Is Being Stolen. Here’s How To Fight Back.
It’s 2026. Your blog posts, your code snippets, your carefully-written tutorials — they’re being vacuumed into training datasets by the dozen. Some polite, with a User-Agent that says ClaudeBot or GPTBot. Most not even trying to hide. You can block them at robots.txt, but that’s like putting a polite sign on a chain-link fence. A smarter thief just walks around it.
What if instead of a sign, you put up a math problem? One that humans solve instantly (with their browser doing the work in the background) but that makes AI scrapers’ cost curve go vertical?
That’s the idea behind Anubis — a proof-of-work gating system that sits in front of your site, challenges bots with computational puzzles, and lets real humans through without breaking a sweat. No CAPTCHA farms, no email gates, just: “Solve this hash, or get out.”
This is not theoretical. It’s a real tool, it works, and for a self-hosted blog in 2026, it’s worth understanding.
Why PoW Gating Matters Now
Three reasons you should care:
1. Your data pays someone else’s bills. Training data is valuable. Every AI scraper is betting that extracting your content costs less than what they’ll make selling the model. Right now, they’re winning that bet, because extraction costs them almost nothing. PoW changes that arithmetic.
2. robots.txt is theater. GPTBot respects it. Most others don’t. And a User-Agent header proves nothing: any scraper can send whatever name it likes, or none at all. A robots.txt file is like a velvet rope in a store with no security guard. Anubis is the security guard.
3. It gets better with adoption. As more sites deploy PoW gating, AI companies will need to either pay (through Proof-of-Work CDN providers) or crawl less aggressively. That’s regulation through code, and it scales.
The downside? False positives. We’ll get to that.
How Anubis PoW Works (The 30-Second Version)
When a bot (or human) requests your site:
- Your reverse proxy (Caddy/Nginx) intercepts the request
- It returns a lightweight PoW challenge — “hash this data until you get a result starting with five zeros”
- The client solves it (humans’ browsers do this in the background; bots’ CPUs work overtime)
- The solution is verified and the real request goes through
- The challenge repeats when the pass expires, so a captured session can’t be reused indefinitely
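The challenge-and-verify steps above can be sketched in a few lines of Python. This is a minimal illustration, not Anubis's actual protocol: it assumes SHA-256, a `challenge:nonce` string format, and a difficulty counted in leading zero bits (so difficulty 20 corresponds to "five zeros" in hex). The function names are hypothetical.

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count how many zero bits the digest starts with."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: a single hash checks the client's answer."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force nonces until one satisfies the difficulty."""
    for nonce in itertools.count():
        if verify(challenge, nonce, difficulty):
            return nonce


if __name__ == "__main__":
    # Assumed challenge string; a real gate would issue a random one per visitor.
    nonce = solve("example-challenge", 12)
    print("nonce:", nonce, "valid:", verify("example-challenge", nonce, 12))
```

The asymmetry is the point: solving takes many hashes on average, while verifying takes exactly one.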
The work is tunable. Set it light and you barely notice. Crank it up and a bot’s cost-per-page explodes from ~$0.001 to ~$0.50. At scale, that destroys the ROI on scraping.
The beauty: humans barely notice. Their browser does the work in the background, and at light settings the wait is a fraction of a second.
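To see why the work is tunable, it helps to look at how the expected effort scales. The sketch below assumes difficulty is measured in leading zero bits, so each extra bit doubles the average number of hashes. The hash rate is a made-up figure for illustration, not a benchmark.

```python
def expected_attempts(difficulty_bits: int) -> int:
    """Average number of hashes needed to hit `difficulty_bits` leading zero bits."""
    return 2 ** difficulty_bits


def expected_seconds(difficulty_bits: int, hashes_per_second: float) -> float:
    """Average solve time for a client with the given hash rate."""
    return expected_attempts(difficulty_bits) / hashes_per_second


if __name__ == "__main__":
    ASSUMED_HASH_RATE = 500_000  # hypothetical hashes/second for one client
    for bits in (16, 18, 20, 22):
        attempts = expected_attempts(bits)
        seconds = expected_seconds(bits, ASSUMED_HASH_RATE)
        print(f"difficulty {bits}: {attempts:>9,} hashes, ~{seconds:.2f}s")
```

A human and a bot pay roughly the same price per challenge. The difference is volume: a reader solves once per session, while a scraper that doesn't keep its pass pays again on every visit.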
Deploying Anubis with Caddy
Here’s the practical setup. You’re running a self-hosted blog (or any site) behind Caddy, and you want to gate AI scrapers.
Step 1: Set Up the Anubis Reverse Proxy
Anubis runs as a sidecar service. It sits between your CDN/firewall and your actual web server.
version: '3.9'

services:
  anubis:
    image: anubis:latest  # or: thorax/anubis:latest from Docker Hub
    container_name: anubis-proxy
    ports:
      - "8080:8080"  # HTTP listener
      - "8443:8443"  # HTTPS listener (optional, use Caddy's TLS)
    environment:
      # Upstream target (your actual blog)
      UPSTREAM_URL: "http://blog:3000"

      # PoW difficulty (0-30, default: 16)
      # 16 = ~100ms for decent laptop, ~5-10s for bot
      POW_DIFFICULTY: "18"

      # Whitelist bypass (comma-separated)
      # Real browsers get free passes via JWT or session cookie
      WHITELIST_BYPASS_ENABLED: "true"
      WHITELIST_COOKIE_NAME: "anubis-pass"
      WHITELIST_TTL_SECONDS: "3600"  # 1 hour free pass after solving once

      # Bot detection rules (exact match on User-Agent)
      BOT_RULES: |
        {
          "always_challenge": ["GPTBot", "ClaudeBot", "PerplexityBot"],
          "never_challenge": ["Googlebot", "Bingbot", "Slurp"],
          "always_block": ["MJ12Bot", "DotBot"]
        }

      # Logging
      LOG_LEVEL: "info"
      LOG_FILE: "/var/log/anubis/access.log"
    volumes:
      - anubis_logs:/var/log/anubis
    restart: unless-stopped
    networks:
      - sumguy

  blog:
    image: sumguy-astro:latest
    container_name: sumguy-blog
    ports:
      # Bound to loopback so nobody can reach the blog without going through Anubis
      - "127.0.0.1:3000:3000"
    environment:
      NODE_ENV: "production"
    restart: unless-stopped
    networks:
      - sumguy

volumes:
  anubis_logs:

networks:
  sumguy:
    driver: bridge
Spin it up:
docker compose up -d
Anubis is now running on localhost:8080. It forwards requests to your blog at blog:3000 once the PoW challenge is solved.
Step 2: Route Traffic Through Anubis with Caddy
Your Caddy config points to the Anubis proxy instead of the blog directly:
sumguy.com {
    # Point to Anubis, not directly to the blog
    reverse_proxy localhost:8080 {
        # Caddy forwards User-Agent unchanged and sets X-Forwarded-For
        # itself, so bot detection needs no header_up lines here

        # Long timeout for PoW solving on slow connections
        transport http {
            response_header_timeout 30s
        }
    }

    # Optionally, log raw User-Agents for tuning
    log {
        output file /var/log/caddy/access.log {
            roll_size 100MiB
            roll_keep 5
        }
        format json
    }

    # Security headers (unchanged)
    header X-Content-Type-Options nosniff
    header X-Frame-Options DENY
    header Referrer-Policy no-referrer
    header Permissions-Policy "geolocation=(), microphone=(), camera=()"
}
Reload Caddy:
caddy reload
Now traffic flows: Browser/Bot → Caddy → Anubis → Blog
Tuning Bot Rules & False Positives
The GPTBot Problem
GPTBot (OpenAI’s crawler) is well-behaved: it respects robots.txt and sends an honest User-Agent. But it’s opt-out. If you don’t explicitly say no (via robots.txt or x-robots-tag), OpenAI will crawl your content.
You have three choices:
- Let GPTBot through (no PoW): they’ll train on your content, and you get whatever exposure comes from being in the training data. Some people call that marketing.
- Challenge GPTBot (medium PoW): make it expensive but not impossible. They’ll sample less aggressively but still crawl.
- Block GPTBot entirely (always_block, or a difficulty so high they give up): no training, no exposure.
Here’s a moderate config:
BOT_RULES: |
  {
    "always_challenge": {
      "GPTBot": 20,
      "ClaudeBot": 20,
      "PerplexityBot": 18
    },
    "never_challenge": ["Googlebot", "Bingbot", "Slurp", "Yandex"],
    "always_block": ["MJ12Bot", "DotBot", "SemrushBot"]
  }
The numbers are difficulty levels. "GPTBot": 20 means GPTBot gets a PoW challenge with difficulty 20 (harder than the default). "never_challenge" lets search engines index normally (they’re indexing, not scraping for training).
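How a rule set like this turns into a per-request decision can be sketched as follows. The matching strategy (case-insensitive substring), the precedence order (block, then allow, then challenge), and the fallback difficulty are all assumptions made for illustration; check how your Anubis build actually evaluates rules.

```python
from dataclasses import dataclass
from typing import Optional

# Same shape as the moderate config above.
RULES = {
    "always_challenge": {"GPTBot": 20, "ClaudeBot": 20, "PerplexityBot": 18},
    "never_challenge": ["Googlebot", "Bingbot", "Slurp", "Yandex"],
    "always_block": ["MJ12Bot", "DotBot", "SemrushBot"],
}


@dataclass(frozen=True)
class Decision:
    action: str  # "block", "allow", or "challenge"
    difficulty: Optional[int] = None


def decide(user_agent: str, rules: dict = RULES, default_difficulty: int = 16) -> Decision:
    """Pick an action for a request based on its User-Agent string."""
    ua = user_agent.lower()
    # Assumed precedence: block beats allow, allow beats challenge.
    if any(name.lower() in ua for name in rules["always_block"]):
        return Decision("block")
    if any(name.lower() in ua for name in rules["never_challenge"]):
        return Decision("allow")
    for name, difficulty in rules["always_challenge"].items():
        if name.lower() in ua:
            return Decision("challenge", difficulty)
    # Everyone else gets the default challenge.
    return Decision("challenge", default_difficulty)


if __name__ == "__main__":
    for ua in ("Mozilla/5.0 (compatible; GPTBot/1.1)", "Googlebot/2.1", "DotBot/1.2"):
        print(ua, "->", decide(ua))
```

Keep in mind that a User-Agent is self-reported. Anything on the never_challenge list is also a free pass for any scraper willing to claim that name.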
False Positives: When Legitimate Clients Get Challenged
Here’s the annoying part: legitimate tools that aren’t browsers will hit PoW walls.
Common false positives:
- Feed readers (Feedly, Inoreader) — they’re checking for RSS updates, not training data, but they look like bots
- Monitoring tools (Uptime Kuma, Pingdom): they ping your site but can’t solve the PoW, so their checks will fail
- Slack link previews — Slack’s crawler extracts the OG image and title before your user sees it
- Email clients — some email apps pre-fetch links to show rich previews
Solutions:
1. Whitelist by User-Agent (surgical):
BOT_RULES: |
  {
    "never_challenge": [
      "Googlebot", "Slurp", "bingbot",
      "Feedly", "Inoreader",
      "Slack", "facebookexternalhit", "Twitterbot"
    ]
  }
2. Whitelist by IP (for your own tools):
WHITELIST_IPS: "10.0.0.5, 192.168.1.100"  # Uptime Kuma, your status page
3. Use session cookies (best UX):
Once a human solves the PoW, they get a cookie valid for 1 hour. On refresh, no challenge. Bots don’t preserve cookies across sessions, so they re-solve every time (expensive).
WHITELIST_COOKIE_NAME: "anubis-pass"
WHITELIST_TTL_SECONDS: "3600"
The tradeoff: Tighter rules = fewer bots, but more false positives for legitimate tools. Looser rules = happier readers, but more bots slip through.
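The one-hour pass can be pictured as a signed expiry timestamp stored in the cookie. The sketch below is one plausible way to build it with an HMAC, not a description of Anubis internals; the secret, the token layout, and the function names are hypothetical.

```python
import hashlib
import hmac
import time
from typing import Optional

SECRET = b"replace-with-a-random-secret"  # hypothetical server-side key
TTL_SECONDS = 3600  # mirrors WHITELIST_TTL_SECONDS


def _sign(expires_text: str) -> str:
    return hmac.new(SECRET, expires_text.encode(), hashlib.sha256).hexdigest()


def issue_pass(now: Optional[float] = None) -> str:
    """Create a cookie value after a successful PoW solve."""
    issued_at = time.time() if now is None else now
    expires_text = str(int(issued_at + TTL_SECONDS))
    return f"{expires_text}.{_sign(expires_text)}"


def is_valid_pass(token: str, now: Optional[float] = None) -> bool:
    """Accept the cookie only if the signature matches and it has not expired."""
    expires_text, _, signature = token.partition(".")
    expected = _sign(expires_text)
    # Constant-time comparison; bytes avoid errors on non-ASCII input.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return False
    current = time.time() if now is None else now
    return int(expires_text) > current


if __name__ == "__main__":
    cookie = issue_pass()
    print("fresh cookie valid:", is_valid_pass(cookie))
```

Because the expiry is covered by the signature, a client can't extend its own pass by editing the timestamp.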
Monitoring & Observability
Keep logs. You’ll want to know which bots are hitting you hardest and whether your PoW is actually slowing them down.
Check Anubis logs:
docker logs anubis-proxy 2>&1 | grep -E "bot|challenge|solved|failed"
Sample output (hypothetical):
2026-11-27T10:15:22Z INFO request=GET:/blog/anubis-post user_agent=GPTBot difficulty=20 solved=true latency_ms=4203
2026-11-27T10:15:45Z INFO request=GET:/blog/anubis-post user_agent=Mozilla/5.0 difficulty=0 solved=true latency_ms=12
2026-11-27T10:16:03Z WARN request=GET:/ user_agent=DotBot challenge_failed=true ip=203.0.113.45 attempts=3
Read the story: GPTBot took 4+ seconds (the PoW), the real browser took 12ms (difficulty 0, because it already held a valid pass), and DotBot failed 3 times and gave up. Working as designed.
Set up metrics export (Prometheus optional, but useful):
PROMETHEUS_ENABLED: "true"
PROMETHEUS_PORT: "9090"
Then scrape localhost:9090/metrics from your Prometheus instance.
Edge Cases & Gotchas
1. CDN Caching Breaks PoW Challenges
If you’re using Cloudflare or another CDN, it might cache PoW responses. Don’t let it. Disable caching on Anubis endpoints:
# Caddyfile
sumguy.com {
    reverse_proxy localhost:8080 {
        # Disable caching for PoW responses.
        # header_down rewrites response headers coming back from Anubis;
        # as written, it applies to every proxied response.
        header_down Cache-Control "no-store, no-cache, must-revalidate"
    }
}
2. Mobile Users on Slow Networks
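PoW is CPU-bound, so it takes longer on older phones, and a slow network such as 3G adds to the wait. Set difficulty conservatively (16-18). Test with both a throttled connection and a throttled CPU.
# Chrome DevTools → Network tab → Slow 3G
# Chrome DevTools → Performance tab → CPU throttling
# Verify page load still feels responsive (<2s before content visible)
3. Legitimate Scrapers (Wayback Machine, Archive.org)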
Wayback Machine’s crawler is well-intentioned. But it’s a scraper. You have to pick: let them preserve your site for posterity, or block them.
never_challenge: ["archive.org_bot", "ia_archiver"]
# OR
always_challenge: ["archive.org_bot"]  # Medium PoW, they'll sample less
4. China & Great Firewall
If you have readers in mainland China, the challenge round trip stacks on top of already slow connections through the GFW. High difficulty (>22) might make the site unusable there. Keep it at 16-18 if you expect international traffic.
The Decision: Is Anubis Right for You?
Use Anubis if:
- Your content is evergreen and valuable — tutorials, code, research, opinions that AI companies want to scrape
- You’re okay with slight latency — PoW adds 100-500ms on first visit per session
- You run your own infrastructure — Anubis is self-hosted (no third-party dependency)
- You can tune bot rules — false positives need ongoing tweaking
Skip Anubis if:
- You want a 100% open site: even a speed bump turns away feed readers, archivers, and other clients you might welcome
- Your audience is mostly mobile — PoW hits mobile harder
- You have zero DevOps bandwidth — it’s one more service to monitor
- You’re on a shared host — you can’t install custom reverse proxies
The Honest Take
Anubis doesn’t stop scraping. It doesn’t kill the problem. What it does is raise the cost high enough that bots become pickier about which sites to scrape. If a bot can get your content for $0.50/page or someone else’s for free, they’ll pick someone else. That’s the goal.
In 2026, as more sites deploy PoW gating, this becomes an arms race. Smarter bots will optimize their PoW solvers. You’ll crank up difficulty. It’ll get weird. But right now, today, Anubis gives you leverage where you had none before.
Deploy it. Monitor it. Tune it. Your 2 AM self will appreciate knowing your content stays yours a little bit longer.
Links & Further Reading
- Anubis GitHub: https://github.com/thorax/anubis (YMMV — check actively maintained forks)
- Proof of Work explainer: https://en.wikipedia.org/wiki/Proof_of_work (Bitcoin uses the same math)
- robots.txt & AI: Add the following to robots.txt for selective blocking (GPT respects this):
User-agent: *
Disallow: /

User-agent: ChatGPT-User
Allow: /
- Caddy reverse proxy docs: https://caddyserver.com/docs/caddyfile/directives/reverse_proxy
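If you want to check what a compliant crawler would conclude from the robots.txt rules in the list above (disallow everyone, allow ChatGPT-User), Python's standard-library parser can evaluate them offline. The example.com URL is a placeholder, and this only tells you what a crawler that honors robots.txt would do.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: ChatGPT-User
Allow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    url = "https://example.com/blog/post"  # placeholder URL
    for agent in ("GPTBot", "ChatGPT-User", "Googlebot"):
        print(agent, "allowed:", parser.can_fetch(agent, url))
```

Note that with these rules the wildcard entry also shuts out search engine crawlers, so add explicit Allow groups for any you want to keep.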