Prometheus Federation for Multi-Site Home Labs

By SumGuy 10 min read

When One Prometheus Becomes Three

You start with a single Prometheus instance scraping your homelab. Life is simple. Metrics flow in, dashboards work, alerts fire at 2 AM like clockwork. Then you add a second location — a NAS at a cabin, a Pi in the garage, a Docker Swarm across two sites. Suddenly you’re staring at a problem: how do you correlate metrics across geographically separated infrastructure without turning your central Prometheus into a bottleneck?

This is where federation comes in. It’s not sexy. It won’t get you any DevOps credibility. But it works, and it’s exactly what you need when your monitoring infrastructure starts looking like a proper distributed system instead of a single-server setup.

Prometheus federation is hierarchical metric pulling. Instead of scraping everything from the edge, you run a Prometheus at each site, then have a central Prometheus federate from those edge instances. It’s like delegation — the district managers report to the CEO, not the other way around.

The sections below cover what sets it apart from the alternatives you might have heard about.

Why Federation Instead of One Big Prometheus?

Picture this: you’re running Prometheus on a beefy server, and it’s scraping 50,000 metrics from across three sites plus a Kubernetes cluster. Your retention is a week because disk space is finite. You hit a cardinality problem and suddenly Prometheus is eating 32 GB of RAM. Your SSD is thrashing. It’s slow.

Now imagine this: each site has a small Prometheus instance (2 CPU, 4 GB RAM) that keeps 30 days of local data. It scrapes its own infrastructure, runs its own alert rules, and is independent. A central Prometheus scrapes the aggregated metrics from each edge Prometheus — not the raw metrics, but the pre-computed results you actually care about.

Your central Prometheus is lean. Your edge instances are resilient. If the WAN link drops, each site keeps running and alerting. When it comes back, the central box resumes federating; the gap in its own data is not backfilled, but the full history is still on the edge.

This is the difference between federation and just throwing everything at one box: cardinality isolation, geographic fault tolerance, and the ability to tune each Prometheus independently.

Federation vs Remote Write: The Car Analogy

Remote write is like hiring a truck to haul every single component of your car to a warehouse: engines, wheels, every bolt, every sensor reading. You get everything, but now you’re paying for the bandwidth and the warehouse rent.

Federation is like each car reporting its fuel consumption, oil pressure, and mileage to a dispatch center. You get the useful summary, not the raw oscilloscope data from every sensor.

For a homelab, that summary is usually all the central box needs.

How Federation Works: Hierarchical Scraping

Each edge Prometheus runs normally. It scrapes its own targets, stores data locally, and runs alert rules. Nothing fancy.

The central Prometheus then scrapes the /federate endpoint of each edge Prometheus, with honor_labels: true set on the job. That directive says: “when a label on a federated series collides with one I would attach, such as job or instance, keep the edge’s value.” Without it, the edge’s labels get renamed out of the way and your queries stop matching.

Here’s what that looks like:

Edge Site 1 (garage-prom.local)

global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'garage-docker'
    static_configs:
      - targets: ['100.60.0.5:9323'] # Docker daemon via WireGuard
  - job_name: 'garage-node'
    static_configs:
      - targets: ['100.60.0.5:9100'] # Node exporter
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['100.60.0.5:9093']
rule_files:
  - '/etc/prometheus/rules.yml'

The edge Prometheus runs independently. It scrapes local targets, fires alerts locally to its own Alertmanager. If the WAN drops, it keeps working.
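
One addition worth considering at each edge is an external_labels entry that names the site. Prometheus attaches external labels to the series it serves for federation, so the central box can tell garage data from cabin data without relying on job names. A minimal sketch, where the label name site and the value garage are assumptions and not part of the original config:

```yaml
# Edge prometheus.yml (garage): fragment to merge into the existing global block.
# The label name "site" and the value "garage" are illustrative assumptions.
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    site: 'garage'  # use a different value at each site, e.g. 'cabin'
```

Federated series then arrive at the center carrying site="garage", which makes queries such as sum by (site) (...) possible.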

Edge Site 2 (cabin-nas.local)

Same setup. Different targets. Different rules maybe.

Central Prometheus (home-prom.local)

global:
  scrape_interval: 30s
  evaluation_interval: 30s
  external_labels:
    site: 'central'
    environment: 'production'
scrape_configs:
  # Federate from garage Prometheus
  - job_name: 'federate-garage'
    scrape_interval: 30s
    scrape_timeout: 10s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="garage-docker"}'
        - '{job="garage-node"}'
    static_configs:
      - targets: ['100.60.0.5:9090']
  # Federate from cabin NAS
  - job_name: 'federate-cabin'
    scrape_interval: 30s
    scrape_timeout: 10s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="cabin-nas"}'
        - '{job="cabin-docker"}'
    static_configs:
      - targets: ['100.60.1.10:9090']
  # Local scrapes if needed
  - job_name: 'central-node'
    static_configs:
      - targets: ['localhost:9100']
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
rule_files:
  - '/etc/prometheus/rules.yml'

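Notice the key parts:

metrics_path: '/federate' points the scrape at the edge’s federation endpoint instead of the default /metrics.

honor_labels: true keeps the job and instance labels that the edge Prometheus assigned.

params with match[] selects which series the edge hands over.

scrape_interval and scrape_timeout are set per job, so each link can be tuned separately.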

The match[] Selector: Your Cardinality Firewall

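This is the magic bullet. Each site generates metrics. The /federate endpoint only returns series that match at least one match[] selector, so the selectors you write decide what crosses the WAN. A catch-all such as '{__name__=~".+"}' would federate everything: all the ephemeral container metrics, all the kernel subsystem data, everything. Your central box would blow up.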

With match[], you’re explicit:

params:
  'match[]':
    - '{job="site1-core"}' # Only core metrics
    - '{job="site1-infra"}' # Infra layer
    - '{instance="pi-01:9100"}' # Specific target
    - 'up' # Just the up metrics

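Each match[] value is a PromQL instant vector selector, and a series is federated if it matches any of them. The edge Prometheus evaluates the selectors, so the remote box does the filtering and sends back only what you asked for.

This is a big part of why federation scales. remote_write can also be filtered, using write_relabel_configs, but it still ships every sample of every series that passes the filter, while a federation scrape takes only the latest value of each selected series.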

The honor_labels Gotcha

Set this to true or you’ll tear your hair out.

With honor_labels: true, the central Prometheus keeps the labels as they exist on the remote Prometheus. A series that was job="garage-node" at the edge is still job="garage-node" at the center.

With honor_labels: false (the default), the central Prometheus overwrites job and instance with the federation job name and the edge’s address, and moves the originals to exported_job and exported_instance. Every federated series then looks like it came from one target, and dashboards and rules written against the edge’s labels stop matching. It’s a mess.

Always use honor_labels: true for federation. The remote Prometheus is the source of truth.

Scrape Timeouts on Slow Links

When you’re federating across a WireGuard link over residential internet, sometimes things get slow.

If your central Prometheus tries to federate from an edge box and the link is congested, the /federate endpoint might take 30 seconds to respond instead of 3 seconds. By default, Prometheus times out after 10 seconds and marks the scrape as failed.

Solution:

- job_name: 'federate-remote-site'
  scrape_interval: 45s # Longer interval
  scrape_timeout: 25s # Longer timeout
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job="site-core"}'
  static_configs:
    - targets: ['100.62.0.50:9090']

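The tradeoff: if the link is dead, each scrape waits the full 25 seconds before failing. Note that scrape_timeout cannot be longer than scrape_interval, so raise both together. A failed scrape is not retried or queued; it leaves a gap in the central data, and the next successful scrape picks up the current values. That’s okay. Federation is meant to ride out connectivity hiccups.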

Alerting at the Edge, Not the Center

Here’s where federation really shines: each edge Prometheus runs its own alert rules. If a service goes down at the cabin, the cabin Prometheus alerts immediately. It doesn’t wait for the WAN link, doesn’t depend on the central box.

Edge alerts are fast and independent.
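
As an illustration of a rule that lives entirely at the edge, here is a minimal sketch of a local "target down" alert. The group name, alert name, severity label, and the 5-minute hold are assumptions to adapt, not settings taken from the configs above:

```yaml
# Edge rules.yml (sketch): group and alert names are illustrative assumptions.
groups:
  - name: edge_local
    rules:
      - alert: LocalTargetDown
        # up is 0 when this Prometheus fails to scrape one of its own targets
        expr: up == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"
```

Because the rule is evaluated and routed on site, it keeps working while the WAN link is down.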

But you might also want central alerts — cross-site correlations. For example: “alert if any two sites are down simultaneously.” That’s a central rule:

# Central rules.yml
groups:
  - name: multi_site
    rules:
      - alert: MultipleSitesDown
        expr: count(up{job=~"federate-.*"} == 0) > 1
        for: 2m
        annotations:
          summary: "More than one site is down"

This central-only rule fires when the central Prometheus has failed to scrape two or more edge instances for two minutes. An edge Prometheus can’t know about other sites; only the central one can.

Federation Over a VPN Link

Your edge Prometheus boxes are on Tailscale or WireGuard. Bandwidth is low, and latency is sometimes high. Federation handles this better than scraping every raw target across the link.

But there are limits. If your link is dropping packets or has sustained jitter, federation scrapes will time out and the central view will show gaps.

Federation tolerates this. It’s not real-time; the central view is a best-effort, slightly delayed copy of what the edges hold. That’s fine for a homelab.
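
To notice when a link has moved from "occasionally slow" to "unreliable", the central Prometheus can watch its own federation scrapes. A sketch, assuming the federate-* job naming used earlier; the alert names, windows, and thresholds are assumptions to tune for your links:

```yaml
# Central rules.yml (sketch): names and thresholds are assumptions to tune.
groups:
  - name: federation_health
    rules:
      - alert: FederationLinkFlaky
        # Fraction of successful federation scrapes over the last 15 minutes
        expr: avg_over_time(up{job=~"federate-.*"}[15m]) < 0.8
        for: 10m
        annotations:
          summary: "Federation scrapes from {{ $labels.instance }} are failing intermittently"
      - alert: FederationScrapeSlow
        # scrape_duration_seconds is recorded by Prometheus for every scrape
        expr: scrape_duration_seconds{job=~"federate-.*"} > 20
        for: 10m
        annotations:
          summary: "Federation scrape of {{ $labels.instance }} is close to its timeout"
```

The 20-second threshold assumes a scrape_timeout of around 25 seconds; set it a little below whatever timeout the job actually uses.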

Retention Split: Edge vs Central

Each Prometheus can have its own retention setting, so the edges can keep 30 days of detail while the center keeps 7 days of the correlated view. If you need historical data from a specific site, you query that site’s Prometheus directly.
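
Retention is a command-line flag rather than a prometheus.yml setting. A sketch of the edge side using Docker Compose, which is an assumption about how Prometheus is deployed here; the service name, volume name, and file paths are illustrative:

```yaml
# docker-compose.yml on an edge box (sketch; service and volume names are assumptions)
services:
  prometheus:
    image: prom/prometheus
    command:
      # Overriding command replaces the image defaults, so restate the basics
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'  # edge: 30 days of detail
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prom_data:/prometheus
    ports:
      - '9090:9090'
volumes:
  prom_data:
```

On the central box the same file would use --storage.tsdb.retention.time=7d.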

When Federation Falls Over (Don’t Federate Everything)

The biggest mistake: trying to federate all 50,000 metrics from every job.

If you do that, your central Prometheus becomes as big as if you’d scraped directly. You get no cardinality benefit. The link gets hammered. Everyone loses.

Instead, federate results, not raw metrics. Define recording rules at the edge:

# Edge Prometheus rules.yml
groups:
  - name: aggregation
    interval: 30s
    rules:
      - record: 'job:container_memory_usage_bytes:sum'
        expr: sum(container_memory_usage_bytes) by (job, instance)
      - record: 'job:node_load5:avg'
        # Keep the job label so job-based match[] selectors still find this series
        expr: avg by (job) (node_load5)

Now instead of federating raw metrics, you federate pre-computed series. Your central Prometheus sees clean aggregates. Cardinality stays low.
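
On the central side, match[] can then target the recorded series by name. Rules named with a job: prefix can be selected with a regex on __name__. A sketch reusing the garage target from earlier; the job name federate-garage-aggregates is an assumption:

```yaml
# Central prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: 'federate-garage-aggregates'  # hypothetical job name
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{__name__=~"job:.*"}'  # only series produced by job:-prefixed recording rules
        - 'up'                    # per-target up from the edge, so target outages stay visible
    static_configs:
      - targets: ['100.60.0.5:9090']
```

Raw series such as container_memory_usage_bytes never leave the edge with this selector; only the aggregates and the up series do.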

Authentication: Basic Auth vs Reverse Proxy

If your central Prometheus is reaching edge boxes over a WireGuard tunnel, you probably don’t need auth. WireGuard is encrypted and access-controlled.

But if you’re reaching edge boxes over untrusted networks (unlikely in a homelab, but worth noting), use basic_auth, and serve the endpoint over HTTPS so the password isn’t sent in clear text:

- job_name: 'federate-remote'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job="core"}'
  basic_auth:
    username: 'prometheus'
    password: 'your-secure-password'
  static_configs:
    - targets: ['remote-prom.example.com:9090']

Or better: put a reverse proxy (Caddy, nginx) in front of your edge Prometheus and handle auth there. The Prometheus box itself doesn’t authenticate; the proxy does.
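
From the central side, a proxied edge looks like any other HTTPS target. A sketch assuming the proxy terminates TLS on port 443 and enforces basic auth; the hostname comes from the example above, while the port and the password file path are assumptions:

```yaml
# Central prometheus.yml fragment (sketch)
scrape_configs:
  - job_name: 'federate-remote'
    scheme: https               # TLS is terminated by the reverse proxy at the edge
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="core"}'
    basic_auth:
      username: 'prometheus'
      # Read the password from a file so it stays out of prometheus.yml
      password_file: '/etc/prometheus/federate-password'  # hypothetical path
    static_configs:
      - targets: ['remote-prom.example.com:443']
```

Keeping the password in a separate file also means prometheus.yml can be committed to a repo without leaking credentials.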

When Federation Is the Right Tool

You have multiple Prometheus instances across different locations or clusters.

You want each site independent and resilient.

Your cardinality is exploding when you try to centralize everything.

You query from a central dashboard most of the time, but need detail at the edges.

You can tolerate a few minutes of stale data (federation isn’t real-time).

If all of this sounds like your setup, federation is your answer. It’s not flashy. It won’t impress anyone. But on a Tuesday night at 2 AM when your WAN link hiccups and your edge Prometheus keeps alerting while the central box waits for reconnection, you’ll appreciate the simplicity.

Set it up once, forget about it, and let it do its job.

