cert-manager: ACME at Scale

Your First cert-manager Ingress Works Fine. It’s the Fifth One That Burns You.

You set up cert-manager in your Kubernetes cluster because manual TLS certificates are a 2 AM nightmare. You add an Ingress with one annotation, cert-manager.io/cluster-issuer: letsencrypt-prod, and boom, the Let’s Encrypt ACME flow spins up automatically. Certificate appears. Site loads over HTTPS. You feel like a genius.

Then you add four more services. You deploy a wildcard certificate. Rate limits reject your requests. Your Ingress sits pending for hours. The ACME challenge fails silently. Your logs are screaming about webhook timeouts and DNS propagation delays. Suddenly cert-manager doesn’t feel so automatic anymore.

cert-manager is solid, really solid. But it’s a state machine for certificate lifecycle, and state machines have edge cases. This guide walks you through the full picture: how cert-manager works at scale, where the footguns hide, and how to debug when it inevitably breaks.

What cert-manager Actually Does

cert-manager is a Kubernetes operator that watches Certificate and Ingress resources, requests certificates from issuers (ACME CAs such as Let’s Encrypt, ZeroSSL, and Buypass, or a private CA such as Vault), and stores the resulting certificate in a Secret. You don’t run openssl req by hand, you don’t renew certs every 90 days, you don’t track expiry dates. cert-manager does all that.

But here’s what trips people up: cert-manager doesn’t just get a certificate. It watches the certificate’s lifecycle, monitors expiry, and re-runs ACME validation before renewal. That means it’s making API calls to your DNS provider, running HTTP challenges across the internet, polling ACME servers, and storing secrets in etcd. At scale, each of those operations becomes a lever someone pulls wrong.

The resource hierarchy is this:

Issuer/ClusterIssuer (defines how to get certs: Let's Encrypt ACME, Vault, etc.)
  ↓
Certificate (what you want: domain, validity period, renewal threshold)
  ↓
CertificateRequest (cert-manager's internal object, talks to the Issuer)
  ↓
Order (ACME-specific: represents the order with Let's Encrypt)
  ↓
Challenge (the actual HTTP-01 or DNS-01 validation)
  ↓
Secret (the final private key + cert bundle, stored in etcd)

When you add an Ingress with a cert-manager annotation, the controller automatically creates a Certificate resource, which creates a CertificateRequest, which creates an Order, which creates Challenges. If any step fails (DNS doesn’t propagate, a webhook times out, a rate limit hits), you’re stuck waiting for the next reconciliation loop.
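To see that chain on a live cluster, you can list every layer in one go. This is a minimal sketch: cert_chain is a hypothetical helper name, and it assumes kubectl is pointed at a cluster that has the cert-manager CRDs installed.

```shell
# List each layer of the issuance chain in one namespace.
# cert_chain is an illustrative helper, not part of cert-manager.
cert_chain() {
  local ns
  ns="${1:-default}"
  kubectl get certificate,certificaterequest,order,challenge -n "$ns" -o wide
}
```

Run it as cert_chain my-namespace. The last layer that exists, or the first one reporting a failure, is usually where the flow stalled.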

Issuer vs ClusterIssuer: Scope Matters

An Issuer is namespaced. A ClusterIssuer is cluster-wide. Pick the wrong one and your ACME credentials live in the wrong place.

Most setups use a single ClusterIssuer for Let’s Encrypt. That issuer holds ACME account credentials (email, private key) in a Secret. If you use a namespaced Issuer, you’re duplicating that Secret across namespaces, which is messy. For production, one ClusterIssuer per CA is the pattern.

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com  # placeholder address
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      - http01:
          ingress:
            class: nginx
      - dns01:
          cloudflare:
            email: admin@example.com  # placeholder address
            apiTokenSecretRef:
              name: cloudflare-token
              key: api-token

This single issuer supports both HTTP-01 (for simple domains) and DNS-01 (for wildcards). cert-manager picks the solver based on each solver’s selector block (which can match on dnsNames, dnsZones, or matchLabels). With no selectors, as here, the solver listed first wins for ordinary names, while wildcard names always go to the DNS-01 solver because Let’s Encrypt only offers DNS-01 for them.

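One more thing: if you’re paranoid, add a second ClusterIssuer named letsencrypt-staging that points at the Let’s Encrypt staging endpoint (https://acme-staging-v02.api.letsencrypt.org/directory). Hit staging first to debug. Its rate limits are far higher than production’s, and its certificates are not trusted by browsers, so it’s for testing only. When your Certificate works with staging, flip the annotation and retry.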
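A staging issuer is the production one with a different server URL and account key Secret. The sketch below prints such a manifest; staging_issuer_manifest, the letsencrypt-staging-key Secret name, and the default email are assumptions made for illustration, and the solver simply mirrors the nginx HTTP-01 solver used above.

```shell
# Print a ClusterIssuer manifest for the Let's Encrypt staging endpoint.
# The helper name, Secret name, and default email are illustrative.
staging_issuer_manifest() {
  local email
  email="${1:-admin@example.com}"
  cat <<EOF
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: ${email}
    privateKeySecretRef:
      name: letsencrypt-staging-key
    solvers:
      - http01:
          ingress:
            class: nginx
EOF
}
```

Apply it with staging_issuer_manifest you@your-domain | kubectl apply -f -, then point the Ingress annotation at letsencrypt-staging while you debug.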

HTTP-01 vs DNS-01: Choose Your Complexity

HTTP-01 is simpler. cert-manager creates a temporary HTTP route, Let’s Encrypt hits http://<your-domain>/.well-known/acme-challenge/<token>, reads the token, and marks the challenge complete. It’s straightforward, but:

Requires a public IP and open port 80 (or port-forward in your firewall)
Doesn’t work for wildcard certificates
Fails if your Ingress controller is slow to propagate routes

DNS-01 is slower but more flexible. cert-manager (directly for built-in providers, or via a webhook for the rest) adds a TXT record at _acme-challenge.<your-domain>, Let’s Encrypt queries DNS for that record, and the challenge passes. Then cert-manager deletes the record. It works for:

Wildcard certificates (*.sumguy.com)
Domains on private networks (Let’s Encrypt only has to query your public DNS zone; it never needs to reach the service itself)
Environments where port 80 is blocked or behind a NAT

The trade-off: DNS-01 requires credentials to your DNS provider (Cloudflare API token, AWS Route53 key, Google CloudDNS service account). cert-manager stores those in a Secret. If someone compromises that Secret, they can modify your DNS. Use strong RBAC.

Here’s a Certificate that uses DNS-01 for a wildcard:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: sumguy-wildcard
  namespace: default
spec:
  secretName: sumguy-wildcard-tls
  commonName: "*.sumguy.com"
  dnsNames:
    - "*.sumguy.com"
    - "sumguy.com"
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

There’s no solver field on the Certificate; cert-manager picks the solver from the ClusterIssuer based on the issuer’s selector blocks and the cert’s dnsNames. Because this cert has a wildcard, the DNS-01 solver is the only one that can satisfy it.

Wildcard Gotchas: Your Cert Doesn’t Cover What You Think

A wildcard certificate for *.sumguy.com covers:

api.sumguy.com ✓
blog.sumguy.com ✓

But NOT:

sumguy.com ✗ (apex domain), you need a separate SAN
deeply.nested.sumguy.com ✗, wildcards only cover one level
sub.sub.sumguy.com ✗, same reason, two levels down

So always include both the wildcard and the apex in dnsNames. And remember: Let’s Encrypt’s rate limit for wildcard issuance is the same as regular domains (50 new certificates per week per registered domain), with a separate Duplicate Certificate limit of 5 per week for the exact same set of names. If you’re reissuing the same wildcard constantly, that duplicate limit is what bites you.
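The one-level rule is easy to get wrong when you’re eyeballing a list of hostnames, so here is a small sketch of it. cert_name_covers is a hypothetical helper, it assumes lowercase input, and it only models the matching rule described above, not a full certificate validator.

```shell
# Return 0 if a certificate name ($1) covers a hostname ($2).
# A leading "*." stands in for exactly one non-empty DNS label.
cert_name_covers() {
  local pattern host suffix label
  pattern="$1"
  host="$2"
  case "$pattern" in
    '*.'*)
      suffix="${pattern#\*}"              # "*.sumguy.com" -> ".sumguy.com"
      case "$host" in
        *"$suffix") label="${host%"$suffix"}" ;;
        *) return 1 ;;                    # apex or unrelated domain
      esac
      case "$label" in
        ''|*.*) return 1 ;;               # empty, or more than one level
        *) return 0 ;;
      esac
      ;;
    *)
      [ "$host" = "$pattern" ]            # plain names must match exactly
      ;;
  esac
}
```

For example, cert_name_covers '*.sumguy.com' api.sumguy.com succeeds, while the same pattern against sumguy.com or sub.sub.sumguy.com fails, which is why the apex needs its own dnsNames entry.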

Rate Limits: The Wall You Didn’t See Coming

Let’s Encrypt has rate limits:

50 new certificates per week per registered domain (renewals don’t count against this)
5 duplicate certificates per week (same exact set of names: this is a separate limit, not part of the 50)
5 failed validations per account, per hostname, per hour (bad challenges burn this fast)

If you hit these, Let’s Encrypt rejects the request with a 429 error. cert-manager retries on an exponential backoff. Your Certificate sits pending for days.

Causes:

Restarting cert-manager repeatedly (each reconciliation triggers a new ACME order)
Deploying the same Certificate across multiple namespaces by accident
Reissuing more often than you need to: an overly large renewBefore, or Certificates and Secrets that get deleted and recreated on every deploy (by default cert-manager renews a 90-day cert about 30 days before expiry, which is fine)

The fix: use staging (letsencrypt-staging) until your setup is stable. Then flip to prod once. Set renewBefore to at least 30 days (cert-manager default is fine). If you need to test wildcard issuance, issue to a test domain first.
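The renewal arithmetic is simple enough to sanity-check in a shell. A small sketch, assuming a 90-day (2160h) Let’s Encrypt certificate; renewal_day is a hypothetical helper name.

```shell
# Day (counted from issuance) on which renewal starts,
# given the certificate duration and renewBefore, both in hours.
renewal_day() {
  echo $(( ($1 - $2) / 24 ))
}

renewal_day 2160 720   # 90-day cert, renewBefore: 720h -> prints 60
```

So with renewBefore: 720h a 90-day certificate is renewed around day 60, which leaves 30 days of retries before anything expires.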

DNS Provider Integration: The Webhook Dance

cert-manager ships with built-in support for a handful of DNS providers (Cloudflare, Route53, Google CloudDNS, Azure DNS, DigitalOcean, Akamai, ACME-DNS, RFC2136). For everything else (Hetzner, OVH, deSEC, and friends) you use a webhook, a separate pod that cert-manager calls to add/remove DNS records.

Here’s Cloudflare DNS-01 with secrets:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com  # placeholder address
    privateKeySecretRef:
      name: letsencrypt-key
    solvers:
      - dns01:
          cloudflare:
            email: admin@example.com  # placeholder address
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: token
---
apiVersion: v1
kind: Secret
metadata:
  name: cloudflare-api-token
  namespace: cert-manager
type: Opaque
stringData:
  token: "your-api-token-here"

The webhook pattern (for a provider without built-in support) looks like this. The repository URL, chart, and release names are placeholders; take the real values from the webhook project’s README:

helm repo add <webhook-repo> <webhook-repo-url>
helm install <release-name> <webhook-repo>/<webhook-chart> \
  --namespace cert-manager

(Route53, Google CloudDNS, Azure DNS, and DigitalOcean are built in; you don’t need a webhook for those, just configure the solver directly.) This deploys a pod that exposes the webhook API group. cert-manager calls it when it needs to add/delete DNS records, and the webhook talks to the provider’s API.

The catch: webhooks add latency. If your webhook pod is slow or times out, the ACME challenge fails. Use horizontal pod autoscaling and give the pod enough memory.

Ingress Integration: The Annotation Path

The simplest way to get TLS is to annotate an Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  tls:
    - hosts:
        - example.sumguy.com
      secretName: example-tls
  rules:
    - host: example.sumguy.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 80

cert-manager watches the Ingress, sees the annotation, and auto-creates a Certificate. The cert ends up in the Secret example-tls. The Ingress controller (nginx, traefik, etc.) reads that Secret and terminates TLS.

Important: Make sure your Ingress controller can read Secrets in that namespace. RBAC should allow it. If you use cert-manager in one namespace and your Ingress in another, the Secret ends up in the Ingress namespace, not cert-manager’s.

Gateway API: The Modern Path (2026+)

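If you’re running Kubernetes 1.30+, consider Gateway API instead of Ingress. cert-manager supports Gateway API resources natively now, although the integration may need to be enabled in cert-manager’s configuration depending on your version.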

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
spec:
  parentRefs:
    - name: my-gateway
      kind: Gateway
  hostnames:
    - example.sumguy.com
  rules:
    - backendRefs:
        - name: my-service
          port: 80

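Attach TLS via the Gateway, and put the cert-manager annotation there:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-gateway
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  gatewayClassName: istio
  listeners:
    - name: https
      port: 443
      protocol: HTTPS
      hostname: example.sumguy.com
      tls:
        mode: Terminate
        certificateRefs:
          - name: example-tls

cert-manager watches the Gateway, not the HTTPRoute: when a Gateway carries the cert-manager.io/cluster-issuer (or cert-manager.io/issuer) annotation, it creates a Certificate for each HTTPS listener that has a hostname, tls.mode: Terminate, and a certificateRefs entry, and writes it to the referenced Secret (example-tls here). Gateway API is cleaner than Ingress annotations for complex setups.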

Monitoring: Know When It’s About to Break

Use Prometheus. cert-manager exports certmanager_certificate_expiration_timestamp_seconds; alert on it.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager-alerts
spec:
  groups:
    - name: cert-manager
      interval: 30s
      rules:
        - alert: CertificateExpiringSoon
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          for: 1h
          annotations:
            summary: "Certificate {{ $labels.name }} expires in less than 7 days"
        - alert: CertificateNotReady
          expr: |
            certmanager_certificate_ready_status{condition="True"} == 0
          for: 1h
          annotations:
            summary: "Certificate {{ $labels.name }} is not in a Ready state"

Also monitor cert-manager itself:

kubectl logs -n cert-manager deployment/cert-manager -f
kubectl logs -n cert-manager deployment/cert-manager-webhook -f

Watch for:

context deadline exceeded (webhook timeout)
rate limited (Let’s Encrypt rate limit hit)
dns: server misbehaving (DNS provider unhappy)
certificate secret already exists (Secret collision)
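Rather than scrolling through the log stream, you can filter it for those patterns. A minimal sketch: scan_cm_logs is a hypothetical helper, and the patterns are just the strings from the list above (plus the rateLimited spelling ACME uses), so adjust them to what your own logs actually contain.

```shell
# Keep only log lines that match the failure patterns listed above.
# Reads log text on stdin; the pattern list is illustrative, not exhaustive.
scan_cm_logs() {
  grep -iE 'context deadline exceeded|rate ?limited|server misbehaving|secret already exists'
}
```

Typical use: kubectl logs -n cert-manager deployment/cert-manager --tail=500 | scan_cm_logs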

Troubleshooting: The Debug Ladder

When a Certificate is stuck pending, follow this ladder:

# 1. Describe the Certificate
kubectl describe certificate <cert-name>
# Look at Status.Conditions. Is it "Ready"? What's the message?

# 2. Describe the Order
kubectl describe order <order-name>
# Orders are created by cert-manager. Check Status.State: is it "pending", "processing", "valid", or "invalid"?

# 3. Describe the Challenge
kubectl describe challenge <challenge-name>
# Challenges are per-solver. Look for errors in Status.Reason.

# 4. Check the Issuer
kubectl describe clusterissuer letsencrypt-prod
# Is the issuer configured correctly? Can cert-manager reach ACME servers?

# 5. Webhook logs (if using DNS-01)
kubectl logs -n cert-manager deployment/cert-manager-webhook-<provider>
# Is the webhook healthy? Can it reach your DNS provider?

# 6. Full cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager --tail=500
# Search for the Certificate name. Look for the full reconciliation flow.

Example workflow: Your wildcard certificate is stuck.

$ kubectl describe certificate sumguy-wildcard
Status:
  Conditions:
    - Type: Ready
      Status: False
      Reason: InvalidRequest
      Message: "Invalid request [urn:ietf:params:acme:error:rateLimited]: Error creating new order..."

$ kubectl describe order sumguy-wildcard-abc123
Status:
  State: invalid
  Reason: "urn:ietf:params:acme:error:rateLimited"

# You hit the rate limit. Check when you can retry:
$ kubectl logs -n cert-manager deployment/cert-manager | grep sumguy-wildcard | tail -20
# Look for the next reconciliation time.

The fix: wait. Or use staging. Or reissue to a test domain. Don’t restart cert-manager; that triggers new ACME orders and makes it worse.

Private CA with Vault: When Let’s Encrypt Isn’t Enough

Not all setups use Let’s Encrypt. If you’re running internal services, Vault-backed cert issuance is cleaner. cert-manager supports Vault Issuer:

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: vault-pki
spec:
  vault:
    server: https://vault.internal:8200
    path: pki/sign/my-role
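    auth:
      kubernetes:
        mountPath: /v1/auth/kubernetes
        role: cert-manager
        # One of serviceAccountRef or secretRef is required; this name is a placeholder.
        serviceAccountRef:
          name: vault-issuer
    # caBundle takes the base64-encoded PEM bundle, not raw PEM.
    caBundle: <base64-encoded PEM CA bundle>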

cert-manager authenticates to Vault with a Kubernetes ServiceAccount token (Vault’s Kubernetes auth method), requests a certificate, and stores it in a Secret. No ACME, no rate limits, just signed certs on demand.

CrashLoopBackOff: When cert-manager Spirals

Sometimes cert-manager controller crashes and restarts in a loop. Common causes:

Webhook timeout: cert-manager can’t reach the webhook pod. Fix: scale the webhook, check network policy.
Secret collision: two Certificates pointing to the same Secret name. Fix: use unique Secret names.
Missing or mismatched CRDs: the cert-manager CRDs aren’t installed, or don’t match the controller version after an upgrade. Fix: check kubectl api-resources | grep cert-manager.
OOM: cert-manager is running out of memory under load. Fix: increase resource requests.

To debug:

kubectl logs -n cert-manager deployment/cert-manager --previous
# Check the last log before the crash.

kubectl get events -n cert-manager --sort-by='.lastTimestamp'
# Kubernetes events often have hints.

kubectl describe pod -n cert-manager <pod-name>
# Check resource limits and restarts.

If you’re deploying a flood of Certificates at once, cert-manager might OOM or hit timeout. Stagger the rollout. Use renewBefore: 720h to avoid unnecessary renewals.

One More Thing: CAA Records

CAA (Certification Authority Authorization) DNS records are optional, but if your domain publishes any, every public CA (Let’s Encrypt, ZeroSSL, Buypass) is required to check them and will refuse to issue unless it is listed. With no CAA records at all, any CA may issue.

# Add this to your DNS, one record per CA you use:
example.sumguy.com  CAA  0 issue "letsencrypt.org"
example.sumguy.com  CAA  0 issue "sectigo.com"   # ZeroSSL issues under Sectigo's identifier; confirm in your CA's docs

If your existing CAA records don’t list the CA you’re issuing from, the order fails with a CAA error. Add the record, wait for DNS propagation, and retry.
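You can check what a domain currently publishes with dig +short CAA <domain>. The sketch below evaluates that output for a non-wildcard name; caa_allows is a hypothetical helper, and it deliberately ignores issuewild and the walk up to parent domains that a real CA performs.

```shell
# Decide whether a CA identifier ($1) may issue, given CAA records on stdin
# in "dig +short CAA" format, e.g.: 0 issue "letsencrypt.org"
caa_allows() {
  local ca flags tag value restricted
  ca="$1"
  restricted=no
  while read -r flags tag value; do
    [ "$tag" = "issue" ] || continue
    restricted=yes                  # at least one issue record exists
    value="${value#\"}"             # strip surrounding quotes
    value="${value%\"}"
    value="${value%%;*}"            # drop optional parameters
    value="${value%% *}"            # drop trailing whitespace
    [ "$value" = "$ca" ] && return 0
  done
  # No issue records at all means CAA places no restriction.
  [ "$restricted" = no ]
}
```

For example, dig +short CAA sumguy.com | caa_allows letsencrypt.org succeeds when Let’s Encrypt is listed or when no issue records exist, and fails when only other CAs are listed.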

The Full Picture

cert-manager is a surprisingly deep rabbit hole. On the surface, it’s “add an annotation, get a cert, done.” But scale it to 20 services, add DNS-01 validation, hit a rate limit, and suddenly you’re neck-deep in ACME order states, webhook logs, and exponential backoff math.

The pattern that works:

Start with one ClusterIssuer pointing to letsencrypt-staging.
Use HTTP-01 for simple domains, DNS-01 for wildcards.
Add both the wildcard and the apex to dnsNames.
Monitor with Prometheus. Alert on expiry.
Test with staging before moving to prod.
Don’t restart cert-manager unless necessary.
When stuck, describe the resource hierarchy: Certificate → CertificateRequest → Order → Challenge.

Your 2 AM self will appreciate it when a certificate comes close to expiry, cert-manager quietly renews it, and the alerts confirm that everything is fine.
