
cert-manager: ACME at Scale

By SumGuy 12 min read

Your First cert-manager Ingress Works Fine. It’s the Fifth One That Burns You.

You set up cert-manager in your Kubernetes cluster because manual TLS certificates are a 2 AM nightmare. You add an Ingress with one annotation — cert-manager.io/cluster-issuer: letsencrypt-prod — and boom, Let’s Encrypt ACME flow spins up automatically. Certificate appears. Site loads over HTTPS. You feel like a genius.

Then you add four more services. You deploy a wildcard certificate. Rate limits reject your requests. Your Ingress sits pending for hours. The ACME challenge fails silently. Your logs are screaming about webhook timeouts and DNS propagation delays. Suddenly cert-manager doesn’t feel so automatic anymore.

Here’s the thing: cert-manager is solid — really solid. But it’s a state machine for certificate lifecycle, and state machines have edge cases. This guide walks you through the full picture: how cert-manager works at scale, where the footguns hide, and how to debug when it inevitably breaks.


What cert-manager Actually Does

cert-manager is a Kubernetes operator that watches Certificate and Ingress resources, requests certificates from the issuers you configure (ACME CAs such as Let’s Encrypt, ZeroSSL, and Buypass, or a private CA such as Vault), and stores the resulting certificate in a Secret. You don’t manually openssl req, you don’t renew certs every 90 days, you don’t track expiry dates. cert-manager does all that.

But here’s what trips people up: cert-manager doesn’t just get a certificate. It watches the certificate’s lifecycle, monitors expiry, and re-runs ACME validation before renewal. That means it’s making API calls to your DNS provider, running HTTP challenges across the internet, polling ACME servers, and storing secrets in etcd. At scale, each of those operations becomes a lever someone pulls wrong.

The resource hierarchy is this:

Issuer/ClusterIssuer (defines how to get certs: Let's Encrypt ACME, Vault, etc.)
Certificate (what you want: domain, validity period, renewal threshold)
CertificateRequest (cert-manager's internal object, talks to the Issuer)
Order (ACME-specific: represents the order with Let's Encrypt)
Challenge (the actual HTTP-01 or DNS-01 validation)
Secret (the final private key + cert bundle, stored in etcd)

When you add an Ingress with a cert-manager annotation, the controller automatically creates a Certificate resource, which creates a CertificateRequest, which creates an Order, which creates Challenges. If any step fails — DNS doesn’t propagate, webhook times out, rate limit hits — you’re stuck waiting for the next reconciliation loop.


Issuer vs ClusterIssuer: Scope Matters

An Issuer is namespaced. A ClusterIssuer is cluster-wide. Pick the wrong one and your ACME credentials live in the wrong place.

Most setups use a single ClusterIssuer for Let’s Encrypt. That issuer holds ACME account credentials (email, private key) in a Secret. If you use a namespaced Issuer, you’re duplicating that Secret across namespaces, which is messy. For production, one ClusterIssuer per CA is the pattern.

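cluster-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@sumguy.com  # placeholder contact address for the ACME account
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      # No selector: catches every name that no other solver claims
      - http01:
          ingress:
            class: nginx
      # Selected for names in the sumguy.com zone, wildcards included
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-token
              key: api-token
        selector:
          dnsZones:
            - sumguy.com

This single issuer supports both HTTP-01 and DNS-01. cert-manager picks a solver for each DNS name by evaluating the optional selector on every solver (matchLabels, dnsNames, dnsZones). A solver without a selector matches everything, the most specific match wins, and ties go to the solver listed first. Wildcard names can only be validated with DNS-01, so make sure a DNS-01 solver is the one that matches them.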

One more thing: if you’re paranoid, add a second ClusterIssuer with letsencrypt-staging. Hit staging first to debug. Rate limits are far higher on staging, and the certificates it issues are not trusted by browsers, so it is only good for testing. When your Certificate works with staging, flip the annotation and retry.
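A staging issuer could look like the sketch below. It mirrors the production issuer and only swaps the ACME directory URL for Let’s Encrypt’s staging endpoint. The Secret name letsencrypt-staging-key and the email address are placeholders, not names from the original setup.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    # Let's Encrypt staging directory: untrusted test certs, generous rate limits
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: admin@sumguy.com            # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-staging-key    # placeholder Secret name
    solvers:
      - http01:
          ingress:
            class: nginx
```

Keep the staging account key in its own Secret, as here, so the staging and production ACME accounts never share credentials.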


HTTP-01 vs DNS-01: Choose Your Complexity

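HTTP-01 is simpler. cert-manager creates a temporary HTTP route, Let’s Encrypt hits http://<your-domain>/.well-known/acme-challenge/<token>, reads the token, and marks the challenge complete. It’s straightforward, but it only works if port 80 on that hostname is reachable from the public internet, and it cannot validate wildcard names.

DNS-01 is slower but more flexible. cert-manager (directly for built-in providers, or through a webhook for the rest) adds a TXT record at _acme-challenge.<your-domain>, Let’s Encrypt queries DNS for that record, and the challenge passes. Then cert-manager deletes the record. It works for wildcard certificates and for services that are not reachable from the internet.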

The trade-off: DNS-01 requires credentials to your DNS provider (Cloudflare API token, AWS Route53 key, Google CloudDNS service account). cert-manager stores those in a Secret. If someone compromises that Secret, they can modify your DNS. Use strong RBAC.
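One way to narrow who can read the DNS token is a Role scoped to that single Secret. This is only a sketch: the Role name is hypothetical, and it assumes the token lives in the cert-manager namespace under the name cloudflare-token, as in the issuer above.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dns-token-reader                  # hypothetical name
  namespace: cert-manager
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["cloudflare-token"]   # only this Secret, nothing else
    verbs: ["get"]
```

Bind it only to the people or workloads that genuinely need the token, and avoid broad get or list grants on Secrets in that namespace. On the provider side, scope the API token to DNS edits on the one zone rather than the whole account.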

Here’s a Certificate that uses DNS-01 for a wildcard:

certificate.yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: sumguy-wildcard
  namespace: default
spec:
  secretName: sumguy-wildcard-tls
  commonName: "*.sumguy.com"
  dnsNames:
    - "*.sumguy.com"
    - "sumguy.com"
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

Nothing on the Certificate selects the challenge type; the Certificate spec has no dnsPolicy field. cert-manager matches each entry in dnsNames against the solver selectors in the ClusterIssuer. A wildcard name has to land on a DNS-01 solver, so if the only solver that matches it is HTTP-01, the challenge fails.


Wildcard Gotchas: Your Cert Doesn’t Cover What You Think

A wildcard certificate for *.sumguy.com covers exactly one label below the domain: api.sumguy.com, www.sumguy.com, and so on.

But NOT the apex sumguy.com, and not deeper names such as a.b.sumguy.com.

So always include both the wildcard and the apex in dnsNames. And remember: Let’s Encrypt’s rate limit for wildcard issuance is the same as for regular domains (50 certificates per week per registered domain), and reissuing the same set of names also runs into the duplicate-certificate limit (5 per week). If you’re reissuing wildcards constantly, you’re burning your quota.


Rate Limits: The Wall You Didn’t See Coming

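Let’s Encrypt has rate limits, including 50 certificates per registered domain per week and 5 per week for the same exact set of names.

If you hit these, Let’s Encrypt rejects the request with a 429 error. cert-manager retries on an exponential backoff. Your Certificate can sit pending for days.

Causes: reissuing the same names over and over (for example, deleting and recreating Certificates or their Secrets while debugging), or deploying a flood of new Certificates for one domain at once.

The fix: use staging (letsencrypt-staging) until your setup is stable. Then flip to prod once. Set renewBefore to at least 30 days (cert-manager default is fine). If you need to test wildcard issuance, issue to a test domain first.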
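As a sketch of that advice, here is a test Certificate pinned to the staging issuer with the renewal window spelled out. It assumes a ClusterIssuer named letsencrypt-staging already exists; the Certificate and Secret names are made up for the example.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-staging-test        # hypothetical name
  namespace: default
spec:
  secretName: example-staging-tls   # hypothetical Secret name
  dnsNames:
    - example.sumguy.com
  duration: 2160h     # 90 days, matching what Let's Encrypt issues
  renewBefore: 720h   # start renewing 30 days before expiry
  issuerRef:
    name: letsencrypt-staging
    kind: ClusterIssuer
```

Once this goes Ready against staging, changing issuerRef.name to letsencrypt-prod is the only edit needed for the real certificate.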


DNS Provider Integration: The Webhook Dance

cert-manager ships with built-in support for a handful of DNS providers (Cloudflare, Route53, Google CloudDNS, Azure DNS, and a few others). For unsupported providers, you use a webhook: a separately deployed service that cert-manager calls to add/remove DNS records.

Here’s Cloudflare DNS-01 with secrets:

cloudflare-dns-issuer.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: token
---
apiVersion: v1
kind: Secret
metadata:
  name: cloudflare-api-token
  namespace: cert-manager
type: Opaque
stringData:
  token: "your-api-token-here"

The webhook pattern (for providers without a built-in solver; Route53 and Google CloudDNS are built in and don’t need one):

# Repo, chart, and release names depend on the webhook you pick; these are placeholders
helm repo add <webhook-repo> <webhook-repo-url>
helm install <release-name> <webhook-repo>/<webhook-chart> \
  --namespace cert-manager

This deploys a pod that serves the webhook. cert-manager calls it when it needs to add/delete DNS records. The webhook holds the provider credentials and talks to the DNS provider’s API.

The catch: webhooks add latency. If your webhook pod is slow or times out, the ACME challenge fails. Use horizontal pod autoscaling and give the pod enough memory.
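On the issuer side, a webhook solver is wired in through the dns01.webhook stanza. The sketch below shows the shape only: groupName and solverName are defined by whichever webhook you install, and everything under config is passed to that webhook as-is, so all the values here are placeholders.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-webhook-dns01       # hypothetical name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-key
    solvers:
      - dns01:
          webhook:
            groupName: acme.example.com   # placeholder; set by the webhook install
            solverName: my-dns-provider   # placeholder; set by the webhook
            config:                       # free-form, schema is webhook-specific
              apiKeySecretRef:            # hypothetical keys for illustration
                name: my-dns-credentials
                key: api-key
```

If the groupName here doesn’t match what the webhook registered, challenges never get presented, so this is the first thing to check when a webhook-backed Challenge stays pending.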


Ingress Integration: The Annotation Path

The simplest way to get TLS is to annotate an Ingress:

ingress-example.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  tls:
    - hosts:
        - example.sumguy.com
      secretName: example-tls
  rules:
    - host: example.sumguy.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 80

cert-manager watches the Ingress, sees the annotation, and auto-creates a Certificate. The cert ends up in the Secret example-tls. The Ingress controller (nginx, traefik, etc.) mounts that Secret and terminates TLS.

Important: Make sure your Ingress controller can read Secrets in that namespace. RBAC should allow it. If you use cert-manager in one namespace and your Ingress in another, the Secret ends up in the Ingress namespace, not cert-manager’s.
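For reference, the Certificate that the annotation generates is roughly equivalent to the one below. This is a sketch of what to expect, not a dump from a real cluster; it assumes the Ingress lives in the default namespace, since the manifest above doesn’t set one.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-tls          # named after the Ingress's tls secretName
  namespace: default         # assumed: same namespace as the Ingress
spec:
  secretName: example-tls
  dnsNames:
    - example.sumguy.com     # taken from spec.tls[].hosts
  issuerRef:
    name: letsencrypt-prod   # taken from the cluster-issuer annotation
    kind: ClusterIssuer
```

Knowing the generated name matters for debugging: it is the Certificate you describe when the Ingress never gets its Secret.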


Gateway API: The Modern Path (2026+)

If you’re running Kubernetes 1.30+, consider Gateway API instead of Ingress. cert-manager can issue certificates for Gateway resources once its Gateway API support is enabled, and it can solve HTTP-01 challenges through an HTTPRoute.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
spec:
  parentRefs:
    - name: my-gateway
      kind: Gateway
  hostnames:
    - example.sumguy.com
  rules:
    - backendRefs:
        - name: my-service
          port: 80

Attach TLS via the Gateway, and put the cert-manager annotation there:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-gateway
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  gatewayClassName: istio
  listeners:
    - name: https
      port: 443
      protocol: HTTPS
      hostname: example.sumguy.com
      tls:
        mode: Terminate
        certificateRefs:
          - name: example-tls

cert-manager watches the annotated Gateway (not the HTTPRoute) and creates the Certificate from each listener’s hostname and certificateRefs. Gateway API is cleaner than Ingress annotations for complex setups.
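For HTTP-01 behind a Gateway, the issuer needs a solver that creates a temporary HTTPRoute instead of an Ingress. A minimal sketch, assuming Gateway API support is enabled in your cert-manager install and that my-gateway lives in the default namespace with a plain HTTP listener on port 80; the issuer and Secret names are hypothetical.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-gateway             # hypothetical name
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-gateway-key     # hypothetical Secret name
    solvers:
      - http01:
          gatewayHTTPRoute:
            parentRefs:
              - name: my-gateway
                namespace: default      # assumed namespace of the Gateway
                kind: Gateway
```

Without a port 80 listener on the Gateway, Let’s Encrypt has nothing to connect to and the Challenge stays pending.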


Monitoring: Know When It’s About to Break

Use Prometheus. cert-manager exports certmanager_certificate_expiration_timestamp_seconds — alert on this.

prometheus-rule.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager-alerts
spec:
  groups:
    - name: cert-manager
      interval: 30s
      rules:
        - alert: CertificateExpiringSoon
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          for: 1h
          annotations:
            summary: "Certificate {{ $labels.name }} expires in less than 7 days"
        # Fires when a Certificate is stuck not-Ready, e.g. a renewal that keeps failing
        - alert: CertificateNotReady
          expr: |
            certmanager_certificate_ready_status{condition="False"} == 1
          for: 1h
          annotations:
            summary: "Certificate {{ $labels.name }} has not been Ready for over an hour"
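These alerts only fire if Prometheus is actually scraping cert-manager. If cert-manager was installed from its Helm chart and the Prometheus Operator is running, a values fragment along these lines should turn on the ServiceMonitor; treat it as a sketch and check it against your chart version.

```yaml
# values.yaml fragment for the cert-manager Helm chart
prometheus:
  enabled: true
  servicemonitor:
    enabled: true    # requires the Prometheus Operator CRDs
```

After the upgrade, confirm that certmanager_certificate_expiration_timestamp_seconds returns series in Prometheus before trusting the alerts.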

Also monitor cert-manager itself:

kubectl logs -n cert-manager deployment/cert-manager -f
kubectl logs -n cert-manager deployment/cert-manager-webhook -f

Watch for webhook timeouts, DNS propagation errors, and rateLimited responses from the ACME server.


Troubleshooting: The Debug Ladder

When a Certificate is stuck pending, follow this ladder:

# 1. Describe the Certificate
kubectl describe certificate <cert-name>
# Look at Status.Conditions. Is it "Ready"? What's the message?
# 2. Describe the Order
kubectl describe order <order-name>
# Orders are created by cert-manager. Check Status.State: is it "pending", "processing", "valid", or "invalid"?
# 3. Describe the Challenge
kubectl describe challenge <challenge-name>
# Challenges are per-solver. Look for errors in Status.Reason.
# 4. Check the Issuer
kubectl describe clusterissuer letsencrypt-prod
# Is the issuer configured correctly? Can cert-manager reach ACME servers?
# 5. Webhook logs (if using DNS-01)
kubectl logs -n cert-manager deployment/cert-manager-webhook-<provider>
# Is the webhook healthy? Can it reach your DNS provider?
# 6. Full cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager --tail=500
# Search for the Certificate name. Look for the full reconciliation flow.

Example workflow: Your wildcard certificate is stuck.

$ kubectl describe certificate sumguy-wildcard
Status:
  Conditions:
    - Type: Ready
      Status: False
      Reason: InvalidRequest
      Message: "Invalid request [urn:ietf:params:acme:error:rateLimited]: Error creating new order..."

$ kubectl describe order sumguy-wildcard-abc123
Status:
  State: invalid
  Reason: "urn:ietf:params:acme:error:rateLimited"

# You hit the rate limit. Check when you can retry:
$ kubectl logs -n cert-manager deployment/cert-manager | grep sumguy-wildcard | tail -20
# Look for the next reconciliation time.

The fix: wait. Or use staging. Or reissue to a test domain. Don’t restart cert-manager — that triggers new ACME orders and makes it worse.


Private CA with Vault: When Let’s Encrypt Isn’t Enough

Not all setups use Let’s Encrypt. If you’re running internal services, Vault-backed cert issuance is cleaner. cert-manager supports Vault Issuer:

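apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: vault-pki
spec:
  vault:
    server: https://vault.internal:8200
    path: pki/sign/my-role
    # caBundle takes the base64-encoded PEM bundle, not raw PEM text
    caBundle: <base64-encoded PEM CA bundle>
    auth:
      kubernetes:
        mountPath: /v1/auth/kubernetes
        role: cert-manager
        # A token source is also required: serviceAccountRef (newer releases)
        # or secretRef. The ServiceAccount name below is a placeholder.
        serviceAccountRef:
          name: cert-manager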

cert-manager authenticates to Vault with a Kubernetes ServiceAccount token (Vault’s Kubernetes auth method), requests a certificate, and stores it in a Secret. No ACME, no rate limits, just signed certs on-demand.
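A Certificate issued from that Vault issuer looks the same as an ACME one apart from the issuerRef. The sketch below assumes the Vault role behind pki/sign/my-role allows the hostname and the requested TTL; the hostname, Certificate name, and Secret name are hypothetical.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-api                    # hypothetical name
  namespace: default
spec:
  secretName: internal-api-tls          # hypothetical Secret name
  commonName: api.internal.sumguy.com   # hypothetical internal hostname
  dnsNames:
    - api.internal.sumguy.com
  duration: 24h      # short-lived; must fit within the Vault role's max TTL
  renewBefore: 8h
  issuerRef:
    name: vault-pki
    kind: ClusterIssuer
```

Short durations are practical here precisely because there are no rate limits; the only ceiling is what the Vault role permits.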


CrashLoopBackOff: When cert-manager Spirals

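Sometimes the cert-manager controller crashes and restarts in a loop. The usual suspects are memory limits that are too low (OOM kills) and timeouts when too many Certificates reconcile at once.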

To debug:

kubectl logs -n cert-manager deployment/cert-manager --previous
# Check the last log before the crash.
kubectl get events -n cert-manager --sort-by='.lastTimestamp'
# Kubernetes events often have hints.
kubectl describe pod -n cert-manager <pod-name>
# Check resource limits and restarts.

If you’re deploying a flood of Certificates at once, cert-manager might OOM or hit timeout. Stagger the rollout. Use renewBefore: 720h to avoid unnecessary renewals.
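If the controller is being OOM-killed, giving it more headroom and capping how many challenges run at once is a reasonable first move. The fragment below assumes a Helm install; the numbers are illustrative rather than tuned recommendations, so check the keys against your chart version.

```yaml
# values.yaml fragment for the cert-manager Helm chart (illustrative numbers)
resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    memory: 512Mi
extraArgs:
  - --max-concurrent-challenges=20   # lower than the default of 60
```

A lower challenge cap slows a bulk rollout, but it also spreads the load on the ACME server and your DNS provider.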


One More Thing: CAA Records

CAA (Certification Authority Authorization) records tell public CAs who is allowed to issue for your domain. They are optional: with no CAA record at all, any CA may issue. But once a CAA record exists, every public ACME CA (Let’s Encrypt included) checks it before issuing and refuses if it isn’t listed. A private Vault CA doesn’t check CAA.

# Add this to your DNS:
example.sumguy.com. CAA 0 issue "letsencrypt.org"
example.sumguy.com. CAA 0 issue "sectigo.com"   # ZeroSSL issues through Sectigo

If a CAA record exists and doesn’t list the CA you’re issuing from, the order fails with a CAA error. Fix the record, wait for DNS propagation, and retry.


The Full Picture

cert-manager is a surprisingly deep rabbit hole. On the surface, it’s “add an annotation, get a cert, done.” But scale it to 20 services, add DNS-01 validation, hit a rate limit, and suddenly you’re neck-deep in ACME order states, webhook logs, and exponential backoff math.

The pattern that works:

  1. Start with one ClusterIssuer pointing to letsencrypt-staging.
  2. Use HTTP-01 for simple domains, DNS-01 for wildcards.
  3. Add both the wildcard and the apex to dnsNames.
  4. Monitor with Prometheus. Alert on expiry.
  5. Test with staging before moving to prod.
  6. Don’t restart cert-manager unless necessary.
  7. When stuck, describe the resource hierarchy: Certificate → Order → Challenge.

Your 2 AM self will appreciate it when a certificate nearly expired, cert-manager silently renewed it, and the alerts let you know it’s all fine.

