When One Prometheus Becomes Three
You start with a single Prometheus instance scraping your homelab. Life is simple. Metrics flow in, dashboards work, alerts fire at 2 AM like clockwork. Then you add a second location — a NAS at a cabin, a Pi in the garage, a Docker Swarm across two sites. Suddenly you’re staring at a problem: how do you correlate metrics across geographically separated infrastructure without turning your central Prometheus into a bottleneck?
This is where federation comes in. It’s not sexy. It won’t get you any DevOps credibility. But it works, and it’s exactly what you need when your monitoring infrastructure starts looking like a proper distributed system instead of a single-server setup.
Prometheus federation is hierarchical metric pulling. Instead of scraping everything from the edge, you run a Prometheus at each site, then have a central Prometheus federate from those edge instances. It’s like delegation — the district managers report to the CEO, not the other way around.
Here’s what makes it different from the alternatives you might’ve heard about:
- Remote write (push to central storage): all metrics flow to a central database. Great for long-term retention, terrible for cardinality explosions and for querying when your uplink hiccups.
- Mimir / Thanos (proper multi-tenant long-term storage): these are tanks. You need them if you’re running a SaaS. In a homelab? You’re using a sledgehammer to hang a picture.
- Federation (pull from remote Prometheus): selective aggregation, lightweight, edge alerting. If your sites can still reach each other over WireGuard, this is your jam.
Why Federation Instead of One Big Prometheus?
Picture this: you’re running Prometheus on a beefy server, and it’s scraping 50,000 metrics from across three sites plus a Kubernetes cluster. Your retention is a week because disk space is finite. You hit a cardinality problem and suddenly Prometheus is eating 32 GB of RAM. Your SSD is thrashing. It’s slow.
Now imagine this: each site has a small Prometheus instance (2 CPU, 4 GB RAM) that keeps 30 days of local data. It scrapes its own infrastructure, runs its own alert rules, and is independent. A central Prometheus scrapes the aggregated metrics from each edge Prometheus — not the raw metrics, but the pre-computed results you actually care about.
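Your central Prometheus is lean. Your edge instances are resilient. If the WAN link drops, each site keeps running and alerting. When it comes back, the central box resumes scraping. Federation does not backfill, so the outage stays as a gap in the central data, but the full history is still sitting on the edge.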
This is the difference between federation and just throwing everything at one box: cardinality isolation, geographic fault tolerance, and the ability to tune each Prometheus independently.
Federation vs Remote Write: The Car Analogy
Remote write is like hiring a truck to haul every single component of your car to a warehouse: engines, wheels, every bolt, every sensor reading. You get everything, but now you’re paying for the bandwidth and the warehouse rent.
Federation is like each car reporting its fuel consumption, oil pressure, and mileage to a dispatch center. You get the useful summary, not the raw oscilloscope data from every sensor.
For a homelab:
- Use federation if you want each site independent, low cardinality at the center, and immediate local alerting.
- Use remote_write if you’re running 100+ servers, need real long-term storage (years), and your metric volume is manageable.
- Use Thanos/Mimir if you’re running this at scale or need multi-tenancy. You’re not. Don’t.
How Federation Works: Hierarchical Scraping
Each edge Prometheus runs normally. It scrapes its own targets, stores data locally, and runs alert rules. Nothing fancy.
The central Prometheus then scrapes the /federate endpoint of each edge Prometheus, with honor_labels: true set on the scrape job. That directive says: “when a label on a federated series clashes with one I would attach myself (job, instance), keep the remote value.” Without it, the original labels get renamed to exported_job and exported_instance, and your queries stop matching.
Here’s what that looks like:
Edge Site 1 (garage-prom.local)
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'garage-docker'
    static_configs:
      - targets: ['100.60.0.5:9323']  # Docker daemon via WireGuard

  - job_name: 'garage-node'
    static_configs:
      - targets: ['100.60.0.5:9100']  # Node exporter

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['100.60.0.5:9093']

rule_files:
  - '/etc/prometheus/rules.yml'

The edge Prometheus runs independently. It scrapes local targets, fires alerts locally to its own Alertmanager. If the WAN drops, it keeps working.
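One thing the edge config leaves implicit is how the central side tells sites apart once the series arrive. A minimal sketch, assuming a label named site with the value garage (both are illustrative, not part of the original setup): external_labels set on the edge are attached to every series served from /federate, so each federated series carries its site of origin.

```yaml
# Edge prometheus.yml (garage site): addition to the global block.
# The label name "site" and value "garage" are assumptions for illustration.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    site: 'garage'   # attached to every series exposed via /federate
```

With honor_labels: true on the central scrape job, the site label from the edge survives, so a central query can filter or group by site.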
Edge Site 2 (cabin-nas.local)
Same setup. Different targets. Different rules maybe.
Central Prometheus (home-prom.local)
global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    site: 'central'
    environment: 'production'

scrape_configs:
  # Federate from garage Prometheus
  - job_name: 'federate-garage'
    scrape_interval: 30s
    scrape_timeout: 10s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="garage-docker"}'
        - '{job="garage-node"}'
    static_configs:
      - targets: ['100.60.0.5:9090']

  # Federate from cabin NAS
  - job_name: 'federate-cabin'
    scrape_interval: 30s
    scrape_timeout: 10s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="cabin-nas"}'
        - '{job="cabin-docker"}'
    static_configs:
      - targets: ['100.60.1.10:9090']

  # Local scrapes if needed
  - job_name: 'central-node'
    static_configs:
      - targets: ['localhost:9100']

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - '/etc/prometheus/rules.yml'

Notice the key parts:
- metrics_path: '/federate': this is the federation endpoint, not /metrics. Every Prometheus server exposes it automatically; you don’t run a separate exporter.
- honor_labels: true: critical. Tells the central Prometheus: “the labels from the remote Prometheus are the source of truth.”
- match[]: filters which series to federate. You’re saying “gimme only the garage-docker and garage-node jobs.” This is what keeps federation light: you’re not pulling everything. You’re pulling what matters.
- scrape_timeout: 10s: if a remote Prometheus is slow or flaky, 10 seconds is your cutoff. Default is 10s anyway, but if you’re federating over a slow WireGuard link, increase it to 20s.
The match[] Selector: Your Cardinality Firewall
This is the magic bullet. Each site generates metrics. The /federate endpoint needs at least one match[] selector, and the lazy choice is a catch-all like '{__name__=~".+"}'. Do that and the central Prometheus federates everything: all the ephemeral container metrics, all the kernel subsystem data, everything. Your central box would blow up.
With match[], you’re explicit:
params:
  'match[]':
    - '{job="site1-core"}'       # Only core metrics
    - '{job="site1-infra"}'      # Infra layer
    - '{instance="pi-01:9100"}'  # Specific target
    - 'up'                       # Just the up metrics

The match syntax is PromQL-style label matching. You’re saying: “federate metrics that match this label set.” Prometheus evaluates it server-side, so the remote box does the filtering and sends back only what you asked for.
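This is why federation scales. Remote write can filter too, through write_relabel_configs on the sender, but then the filter lives in the config of every edge box. With federation, the central Prometheus decides what it pulls.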
The honor_labels Gotcha
Set this to true or you’ll tear your hair out.
When honor_labels: true, the central Prometheus respects the labels as they exist on the remote Prometheus. The remote box is the authority.
When honor_labels: false (the default), the central Prometheus stamps its own job and instance labels onto every series and renames the originals to exported_job and exported_instance. Every federated series now looks like it came from the federate job, and any dashboard or rule that filters on the original job label breaks. It’s a mess.
Always use honor_labels: true for federation. The remote Prometheus is the source of truth.
Scrape Timeout & Backpressure: Federation Over Flaky Links
When you’re federating across a WireGuard link over residential internet, sometimes things get slow.
If your central Prometheus tries to federate from an edge box and the link is congested, the /federate endpoint might take 30 seconds to respond instead of 3 seconds. By default, Prometheus times out after 10 seconds and marks the scrape as failed.
Solution:
- job_name: 'federate-remote-site'
  scrape_interval: 45s  # Longer interval
  scrape_timeout: 25s   # Longer timeout (must not exceed scrape_interval)
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job="site-core"}'
  static_configs:
    - targets: ['100.62.0.50:9090']

The tradeoff: if the link is dead, you wait 25 seconds before the scrape fails. Failed scrapes are not retried or backfilled; they show up as gaps at the center while the edge keeps the full data. That’s okay. Federation is resilient: the next successful scrape picks up where things stand, and nothing needs manual recovery after a connectivity hiccup.
Alerting at the Edge, Not the Center
Here’s where federation really shines: each edge Prometheus runs its own alert rules. If a service goes down at the cabin, the cabin Prometheus alerts immediately. It doesn’t wait for the WAN link, doesn’t depend on the central box.
Edge alerts are fast and independent.
But you might also want central alerts — cross-site correlations. For example: “alert if any two sites are down simultaneously.” That’s a central rule:
# Central rules.yml
groups:
  - name: multi_site
    rules:
      - alert: MultipleSitesDown
        expr: count(up{job=~"federate-.*"} == 0) > 1
        for: 2m
        annotations:
          summary: "More than one site is down"

This is a central-only rule. The up series here is the one the central Prometheus records for each federate-* scrape job, so the rule triggers when two or more edge instances cannot be federated. Edge Prometheus can’t know about other sites; only central can.
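A per-site companion to the cross-site rule can be sketched the same way. The alert name, the 5m hold time, and the severity label below are assumptions for illustration; the expression reuses the up series that the central Prometheus records for each federate-* job.

```yaml
# Central rules.yml: hypothetical per-site federation health alert
groups:
  - name: federation_health
    rules:
      - alert: FederationScrapeFailing
        # One alert per federate-* job whose /federate scrape is failing
        expr: up{job=~"federate-.*"} == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Central cannot federate from {{ $labels.instance }}"
```

A failing federation scrape can mean the site is down or only that the link is, which is why the edge keeps its own alerting for the services themselves.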
Federation Across Slow/Flaky Links: Tailscale Practicality
Your edge Prometheus boxes are on Tailscale or WireGuard. Bandwidth is low, latency is high sometimes. Federation handles this better than pulling raw metrics.
But there are limits. If your link is dropping packets or you have sustained jitter:
- Consider increasing scrape_interval on the central box (30s or 45s instead of 15s).
- Increase scrape_timeout to account for slow uploads.
- Use match[] to minimize data transfer: only federate what you query.
- Consider running smaller retention at the center (7 days instead of 30).
Federation is designed for this. It’s not real-time; it’s eventually consistent monitoring. That’s fine for a homelab.
Retention Split: Edge vs Central
Each Prometheus can have different retention settings:
- Edge Prometheus: --storage.tsdb.retention.time=30d. Keep detailed data locally. It’s just your site, cardinality is low.
- Central Prometheus: --storage.tsdb.retention.time=7d. Keep aggregated data short-term. It’s only what you federated.
The advantage: you have 30 days of detail at the edge, 7 days of correlated view at the center. If you need historical data from a specific site, you query that site’s Prometheus directly.
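How the retention flag gets passed depends on how Prometheus is launched. As a rough sketch, assuming the edge instance runs from the prom/prometheus image under Docker Compose (the service name, volume name, and file layout are hypothetical):

```yaml
# docker-compose.yml on an edge site (hypothetical layout)
services:
  prometheus:
    image: prom/prometheus
    command:
      # Overriding command replaces the image defaults, so restate them
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'   # use 7d on the central box
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prom-data:/prometheus
    ports:
      - '9090:9090'

volumes:
  prom-data:
```

The same three flags work for a systemd unit or a bare binary; only the retention value differs between edge and center.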
When Federation Falls Over (Don’t Federate Everything)
The biggest mistake: trying to federate all 50,000 metrics from every job.
If you do that, your central Prometheus becomes as big as if you’d scraped directly. You get no cardinality benefit. The link gets hammered. Everyone loses.
Instead, federate results, not raw metrics. Define aggregation rules at the edge:
# Edge Prometheus rules.yml
groups:
  - name: aggregation
    interval: 30s
    rules:
      - record: 'job:container_memory_usage_bytes:sum'
        expr: sum(container_memory_usage_bytes) by (job, instance)

      - record: 'job:node_load5:avg'
        expr: avg(node_load5)

Now instead of federating raw metrics, you federate pre-computed recordings. Your central Prometheus sees clean aggregates. Cardinality stays low.
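On the central side, the matching selector then targets the recorded series by name rather than by job. A minimal sketch, reusing the garage target from earlier and assuming the edge recording rules all follow the job:<metric>:<operation> naming shown above:

```yaml
# Central prometheus.yml: federate only recording-rule results
scrape_configs:
  - job_name: 'federate-garage'
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'   # series produced by the edge recording rules
        - 'up'                      # keep per-target health as well
    static_configs:
      - targets: ['100.60.0.5:9090']
```

Selecting by __name__ also picks up aggregates such as job:node_load5:avg, which carry no job label and would be missed by a '{job="..."}' selector.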
Authentication: Basic Auth vs Reverse Proxy
If your central Prometheus is reaching edge boxes over a WireGuard tunnel, you probably don’t need auth. WireGuard is encrypted and access-controlled.
But if you’re reaching edge boxes over untrusted networks (unlikely in a homelab, but worth noting), use basic_auth:
- job_name: 'federate-remote'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job="core"}'
  basic_auth:
    username: 'prometheus'
    password: 'your-secure-password'
  static_configs:
    - targets: ['remote-prom.example.com:9090']

Or better: put a reverse proxy (Caddy, nginx) in front of your edge Prometheus and handle auth there. The Prometheus box itself doesn’t authenticate; the proxy does.
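If the password should stay out of prometheus.yml, basic_auth also accepts password_file. The sketch below assumes the reverse proxy terminates TLS on port 443 and that the secret lives at a hypothetical path readable by the Prometheus user; adjust both to your setup.

```yaml
# Central prometheus.yml: federate through an authenticating HTTPS proxy
- job_name: 'federate-remote'
  honor_labels: true
  metrics_path: '/federate'
  scheme: https                      # assumes the proxy terminates TLS
  params:
    'match[]':
      - '{job="core"}'
  basic_auth:
    username: 'prometheus'
    password_file: '/etc/prometheus/secrets/federate-password'  # hypothetical path
  static_configs:
    - targets: ['remote-prom.example.com:443']
```

Basic auth over plain HTTP sends the credentials in the clear, so on an untrusted network the scheme: https line matters as much as the password.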
When Federation Is the Right Tool
- You have multiple Prometheus instances across different locations or clusters.
- You want each site independent and resilient.
- Your cardinality is exploding when you try to centralize everything.
- You query from a central dashboard most of the time, but need detail at the edges.
- You can tolerate a few minutes of stale data (federation isn’t real-time).
If all of this sounds like your setup, federation is your answer. It’s not flashy. It won’t impress anyone. But on a Tuesday night at 2 AM when your WAN link hiccups and your edge Prometheus keeps alerting while the central box waits for reconnection, you’ll appreciate the simplicity.
Set it up once, forget about it, and let it do its job.