Dragonfly: P2P Container Image Distribution at Scale

By SumGuy 10 min read
Your Registry Is Not a CDN (Stop Pretending It Is)

You’ve got a shiny K8s cluster. Rolling deployment kicks off. Kubernetes signals all 200 nodes to pull the new ml-inference:v3.14 image. It’s 4GB. Every single node hits your registry — or your Cloudflare-cached mirror, or your Harbor instance — simultaneously, at full speed.

Congratulations. You’ve just invented a thundering herd attack against your own infrastructure.

Your registry buckles. Nodes time out. The deployment stalls. Someone gets paged at 2 AM (maybe you). And the worst part? Every one of those nodes was pulling the same bytes from the same upstream. They were sitting right next to each other, network-wise, but they all went the long way around.

Here’s the fix: Dragonfly. CNCF graduated project, P2P overlay for container image and file distribution, and the reason Alibaba doesn’t have a senior engineer crying into their keyboard every deploy day.


What Dragonfly Actually Is

Dragonfly is a peer-to-peer file distribution system purpose-built for container images and large file transfers. It sits between your nodes and your registry and turns every completed download into a seed peer for the next node in line.

Think of it like BitTorrent, except instead of pirating Linux ISOs, you’re distributing your production container images — and it’s architected to handle the chaos of a K8s rolling deploy without falling over.

The project was open-sourced by Alibaba (they were running it at a scale that makes most people’s “big” feel very small) and is now a CNCF graduated project. That “graduated” status matters — it’s not just a promising experiment anymore. It’s been hardened, audited, and adopted by enough organizations that the rough edges are mostly gone.

The Four Pieces That Matter

Manager — The control plane. Stores configuration, manages seed peer groups, exposes a UI and REST API. Think of it as the coordinator that knows where everything lives.

Scheduler — The brains of the P2P operation. When a peer wants a file, the scheduler figures out who has what pieces and builds the optimal download plan. It’s doing the BitTorrent “who has chunk X” math, but for container layers.

dfdaemon (dfget) — Runs on every node. This is your local agent: intercepts image pulls, talks to the scheduler, downloads from peers and seeds what it has. It also acts as a transparent proxy for containerd or docker via a mirror config — no application changes needed.

Seed Peers — Dedicated nodes that always maintain a warm cache. They pull from the origin registry once, then the rest of the cluster pulls from them (and from each other). Like having a local mirror, except the mirror is distributed across your entire fleet.
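To make the scheduler's job concrete, here is a toy sketch in Python of the "who has piece X" decision. It is not Dragonfly's actual scheduling algorithm: the `plan_download` function, the peer names, and the least-loaded tie-break are all assumptions made for illustration.

```python
from collections import defaultdict

ORIGIN = "origin-registry"  # hypothetical label for the upstream registry


def plan_download(piece_count, peer_pieces):
    """Pick a source for each piece: the least-busy peer that holds it,
    or the origin registry when no peer has it yet.

    peer_pieces maps a peer name to the set of piece indexes it holds.
    """
    load = defaultdict(int)  # pieces already assigned to each peer
    plan = {}
    for piece in range(piece_count):
        holders = sorted(p for p, pieces in peer_pieces.items() if piece in pieces)
        if holders:
            # Spread requests so one peer does not serve everything.
            source = min(holders, key=lambda p: load[p])
            load[source] += 1
        else:
            # Nobody in the cluster has this piece: go back to the registry.
            source = ORIGIN
        plan[piece] = source
    return plan


if __name__ == "__main__":
    peers = {
        "seed-peer-0": {0, 1, 2, 3},  # hypothetical seed peer
        "node-17": {0, 1},            # hypothetical node mid-download
    }
    for piece, source in plan_download(5, peers).items():
        print(f"piece {piece} <- {source}")
```

The point of the sketch is the fallback order: peers first, registry only for pieces nobody has. A real scheduler would also weigh network topology and peer health, which this toy version ignores.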


Why This Crushes Large Deployments

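Let’s do the math on 200 nodes pulling a 4GB image without Dragonfly: 200 × 4GB is 800GB that the registry has to serve, all requested at the same moment.

With Dragonfly and seed peers: the seed peers pull the 4GB from the origin registry once, and the other nodes fetch their pieces from the seed peers and from each other. Registry load stops scaling with node count.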

The analogy that fits here: it’s like hiring a forklift to move boxes across a warehouse vs. having every worker walk to the loading dock individually. Technically they’d all get their boxes eventually, but your warehouse operations have ground to a halt and the forklift driver is wondering why no one planned this better.
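For a rough feel of the numbers, here is a back-of-the-envelope sketch in Python. It is an idealized model, not a measurement: it assumes that with seed peers only the seed peers ever touch the registry (once each) and that everything else moves peer to peer. The function name and the choice of two seed peers are assumptions for illustration.

```python
def registry_egress_gb(nodes, image_gb, seed_peers=0):
    """Data the origin registry must serve for one rollout, in GB.

    Without seed peers every node pulls the full image from the registry.
    With seed peers, this assumes only the seed peers hit the registry
    (once each) and every other byte moves peer to peer.
    """
    if seed_peers > 0:
        return seed_peers * image_gb
    return nodes * image_gb


if __name__ == "__main__":
    direct = registry_egress_gb(nodes=200, image_gb=4)
    p2p = registry_egress_gb(nodes=200, image_gb=4, seed_peers=2)
    print(f"direct pulls: {direct} GB from the registry")
    print(f"with 2 seed peers: {p2p} GB from the registry")
    print(f"reduction: {direct // p2p}x")
```

In practice some nodes will still fall back to the registry, so treat the second figure as a best case rather than a promise.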


Setting Up Dragonfly with Helm

The fastest path to Dragonfly in a K8s cluster is the official Helm chart. You’ll need cert-manager installed first (Dragonfly uses webhooks).

Terminal window
# Add the Dragonfly Helm repo
helm repo add dragonfly https://dragonflyoss.github.io/helm-charts
helm repo update

Here’s a minimal but production-leaning values.yaml. Adjust resource requests for your cluster size.

values.yaml
manager:
  replicas: 1
  resources:
    requests:
      cpu: "250m"
      memory: "256Mi"
    limits:
      cpu: "1"
      memory: "1Gi"
scheduler:
  replicas: 2 # HA: at least 2 in prod
  resources:
    requests:
      cpu: "250m"
      memory: "512Mi"
seedPeer:
  enable: true
  replicas: 2 # One per failure domain ideally
  resources:
    requests:
      cpu: "500m"
      memory: "1Gi"
    limits:
      cpu: "2"
      memory: "4Gi"
dfdaemon:
  enable: true
  config:
    proxy:
      defaultFilter: "Expires&Signature&ns"
      security:
        insecure: true # set false if your registry has valid TLS
      registryMirror:
        url: https://index.docker.io # your registry here
        insecure: false

Deploy it:

Terminal window
kubectl create namespace dragonfly-system
helm install dragonfly dragonfly/dragonfly \
  --namespace dragonfly-system \
  --values values.yaml \
  --wait

Check the pods come up:

Terminal window
kubectl get pods -n dragonfly-system

You should see manager, scheduler, seed-peer, and dfdaemon pods. The dfdaemon runs as a DaemonSet — one per node, which is exactly what you want.

Wire containerd to Use the Proxy

This is the step most tutorials gloss over. Dragonfly’s dfdaemon runs as a local proxy (default port 65001). You need to tell containerd to use it as a registry mirror.

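On each node (or via your node provisioning tool), create or edit the containerd mirror config. This only takes effect if the CRI registry config_path in containerd's config.toml points at /etc/containerd/certs.d: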

/etc/containerd/certs.d/docker.io/hosts.toml
server = "https://index.docker.io"

[host."http://127.0.0.1:65001"]
  capabilities = ["pull", "resolve"]
  skip_verify = true

Restart containerd, and from that point on every docker.io pull goes through dfdaemon first. The proxy handles the P2P magic transparently — your Kubernetes workloads have no idea Dragonfly exists.

For Harbor, ECR, GCR — same pattern, different server value and a different directory name under certs.d/.
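If you manage more than one registry, generating these files is less error-prone than hand-editing them. The following is a minimal sketch in Python, assuming the same dfdaemon proxy address as above. The `hosts_toml` helper is hypothetical, it emits only the `server` line and the `capabilities` entry, and `harbor.example.com` is a placeholder hostname.

```python
from pathlib import PurePosixPath

CERTS_D = PurePosixPath("/etc/containerd/certs.d")
DFDAEMON_PROXY = "http://127.0.0.1:65001"  # dfdaemon proxy address used in this post


def hosts_toml(registry_host, server_url, proxy=DFDAEMON_PROXY):
    """Return (path, contents) of a containerd hosts.toml that sends
    pulls for registry_host through the local dfdaemon proxy."""
    path = CERTS_D / registry_host / "hosts.toml"
    contents = (
        f'server = "{server_url}"\n'
        "\n"
        f'[host."{proxy}"]\n'
        '  capabilities = ["pull", "resolve"]\n'
    )
    return str(path), contents


if __name__ == "__main__":
    registries = {
        "docker.io": "https://index.docker.io",
        "gcr.io": "https://gcr.io",
        "harbor.example.com": "https://harbor.example.com",  # placeholder host
    }
    for host, server in registries.items():
        path, contents = hosts_toml(host, server)
        print(f"# {path}")
        print(contents)
```

The script only prints the files; writing them to disk and restarting containerd is left to your provisioning tool. The `registryMirror.url` in values.yaml would presumably need to point at the same registry.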


Dragonfly vs. the Competition

Three alternatives worth knowing about:

Kraken (Uber, ~archived)

Uber open-sourced Kraken back when they needed to distribute images across massive fleets. It works, the architecture is solid, but the project is essentially in maintenance mode now. Last meaningful activity was a while ago, issues pile up without responses, and if something breaks you’re largely on your own.

If you’re evaluating now, Dragonfly is the safer bet — active community, CNCF backing, regular releases.

Spegel (K8s-native P2P mirror)

Spegel is the newer kid on the block and takes a different approach: it’s purely K8s-native, uses containerd’s mirror configuration, and nodes discover each other via the K8s API. No separate scheduler process, no seed peers — just nodes sharing directly.

For smaller clusters (say, 10-50 nodes) Spegel is honestly simpler to operate. Less infrastructure overhead, easier mental model. For large clusters or heterogeneous environments (bare metal + cloud + edge), Dragonfly’s more sophisticated scheduling wins out.

If you’re running a 3-node home lab: use Spegel. Or honestly, just stop fussing — your problem isn’t image distribution bandwidth.

Harbor’s Preheat Integration

Harbor (the registry) has a built-in preheat feature that can push images to Dragonfly seed peers proactively — before any node asks for them. So instead of the first rolling-deploy node being slow while the seed peer warms up, Harbor fires off the preheat as soon as a new image is pushed.

This combo (Harbor + Dragonfly preheat) is the setup Alibaba and ByteDance run in production. It means by the time your deployment kicks off, the image is already warm in the P2P network. Deploy times look almost magical.

The Harbor preheat setup requires configuring a distribution instance in Harbor’s admin UI pointing at your Dragonfly manager. Not complicated, but it’s an extra config step.


Real Numbers

Alibaba’s published benchmarks (take with appropriate skepticism since they’re the vendor, but the scale is real): deploying a 1GB image to 2000 nodes took ~20 minutes without P2P distribution, ~2 minutes with Dragonfly. That’s a 10x improvement.

ByteDance (yes, TikTok’s parent) has talked publicly about distributing images to tens of thousands of nodes using Dragonfly. At that scale, the alternative isn’t “slightly slower” — it’s “your registry melts and someone has a very bad day.”

For a more realistic 50-200 node cluster doing daily redeploys with images in the 1-5GB range, you’re looking at:

The crossover point where Dragonfly stops being overkill and starts being necessary is roughly: 20+ nodes + images > 500MB + frequent deploys. Below that, your CDN-backed registry mirror is probably fine.
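That rule of thumb can be written down as a small predicate, sketched here in Python. The 20-node and 500MB thresholds come from the paragraph above; what counts as "frequent" is not defined there, so the `frequent_threshold` default of five deploys per week is purely an assumption you should tune.

```python
def dragonfly_worth_it(nodes, image_mb, deploys_per_week, frequent_threshold=5):
    """Rough crossover check: 20+ nodes, images over 500MB, frequent deploys.

    frequent_threshold is an assumed cutoff, not a figure from any benchmark.
    """
    return (
        nodes >= 20
        and image_mb > 500
        and deploys_per_week >= frequent_threshold
    )


if __name__ == "__main__":
    print(dragonfly_worth_it(nodes=200, image_mb=4096, deploys_per_week=7))
    print(dragonfly_worth_it(nodes=3, image_mb=4096, deploys_per_week=7))
    print(dragonfly_worth_it(nodes=50, image_mb=200, deploys_per_week=1))
```

It is a conversation starter, not a capacity plan: a slow WAN link to the registry can justify Dragonfly well below these numbers.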


When NOT to Use Dragonfly

Let’s be honest about when this is the wrong tool:

Small home lab (1-10 nodes): You don’t have a thundering herd problem. You have one machine pulling an image that’s cached locally after the first pull. Dragonfly adds operational complexity with zero actual benefit. Use Spegel if you want P2P, or just let containerd’s local image cache do its job.

Infrequent, small images: If you deploy once a week and your images are 200MB, the overhead of running Dragonfly’s manager, scheduler, and seed peers costs more in compute than you’d ever save in bandwidth.

Registry is already local: If your registry is in the same datacenter as your nodes on a fat internal network, pull times are already low. Dragonfly helps most when there’s a WAN hop or a shared upstream connection being hammered.

You want simple: Dragonfly’s architecture is not simple. Manager, scheduler, seed peers, dfdaemon, cert-manager dependency, mirror config on every node — there’s real operational overhead here. If your team isn’t comfortable debugging distributed systems, Spegel’s “just K8s” approach is more maintainable.


The Bottom Line

If you’re running a K8s cluster with 50+ nodes, doing regular deploys of multi-gigabyte images, and you’ve noticed that your registry starts sweating every time you push a new release — Dragonfly is the fix.

The setup is maybe 2-3 hours the first time. The payoff is immediate: your registry stops being a bottleneck, deploys get faster, and that one engineer who keeps getting paged at 2 AM for deployment timeouts can finally sleep through the night.

Pair it with Harbor’s preheat feature if you want to get fancy and have seed peers warmed up before deployments even kick off. At that point you’re doing container image distribution the way the big shops do it — not because you’re trying to impress anyone, but because it actually works better.

Your 2 AM self will appreciate it.


Quick Reference

Terminal window
# Check Dragonfly pod health
kubectl get pods -n dragonfly-system
# View scheduler logs (useful for debugging slow pulls)
kubectl logs -n dragonfly-system -l component=scheduler --tail=100
# Check dfdaemon proxy is reachable on a node
kubectl exec -n dragonfly-system ds/dragonfly-dfdaemon -- \
curl -s http://127.0.0.1:65001/healthy
# Force a test pull through Dragonfly (on a node with mirror configured)
crictl pull docker.io/library/alpine:latest
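If curl is not available where you are debugging, a plain TCP check is a reasonable first step. This is a minimal sketch in Python meant to be run on the node itself. It only shows that something is listening on the proxy port used in this post (65001), not that dfdaemon is healthy. The `port_open` helper is a hypothetical name.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 65001 is the dfdaemon proxy port used in this post.
    status = "reachable" if port_open("127.0.0.1", 65001) else "not reachable"
    print(f"dfdaemon proxy on 127.0.0.1:65001 is {status}")
```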

Dragonfly docs live at d7y.io — the architecture docs in particular are worth reading before your first production rollout.

