Your First cert-manager Ingress Works Fine. It’s the Fifth One That Burns You.
You set up cert-manager in your Kubernetes cluster because manual TLS certificates are a 2 AM nightmare. You add an Ingress with one annotation, cert-manager.io/cluster-issuer: letsencrypt-prod, and boom, the Let’s Encrypt ACME flow spins up automatically. Certificate appears. Site loads over HTTPS. You feel like a genius.
Then you add four more services. You deploy a wildcard certificate. Rate limits reject your requests. Your Ingress sits pending for hours. The ACME challenge fails silently. Your logs are screaming about webhook timeouts and DNS propagation delays. Suddenly cert-manager doesn’t feel so automatic anymore.
Here’s the thing: cert-manager is solid — really solid. But it’s a state machine for certificate lifecycle, and state machines have edge cases. This guide walks you through the full picture: how cert-manager works at scale, where the footguns hide, and how to debug when it inevitably breaks.
What cert-manager Actually Does
cert-manager is a Kubernetes operator that watches Certificate and Ingress resources, requests certificates from issuers (ACME CAs such as Let’s Encrypt, ZeroSSL, or Buypass, or a private CA such as Vault), and stores the resulting certificate in a Secret. You don’t run openssl req by hand, you don’t renew certs every 90 days, you don’t track expiry dates. cert-manager does all that.
But here’s what trips people up: cert-manager doesn’t just get a certificate. It watches the certificate’s lifecycle, monitors expiry, and re-runs ACME validation before renewal. That means it’s making API calls to your DNS provider, running HTTP challenges across the internet, polling ACME servers, and storing secrets in etcd. At scale, each of those operations becomes a lever someone pulls wrong.
The resource hierarchy is this:
Issuer/ClusterIssuer (defines how to get certs: Let's Encrypt ACME, Vault, etc.)
  ↓
Certificate (what you want: domain, validity period, renewal threshold)
  ↓
CertificateRequest (cert-manager's internal object, talks to the Issuer)
  ↓
Order (ACME-specific: represents the order with Let's Encrypt)
  ↓
Challenge (the actual HTTP-01 or DNS-01 validation)
  ↓
Secret (the final private key + cert bundle, stored in etcd)

When you add an Ingress with a cert-manager annotation, the controller automatically creates a Certificate resource, which creates a CertificateRequest, which creates an Order, which creates Challenges. If any step fails (DNS doesn’t propagate, webhook times out, rate limit hits), you’re stuck waiting for the next reconciliation loop.
Issuer vs ClusterIssuer: Scope Matters
An Issuer is namespaced. A ClusterIssuer is cluster-wide. Pick the wrong one and your ACME credentials live in the wrong place.
Most setups use a single ClusterIssuer for Let’s Encrypt. That issuer holds ACME account credentials (email, private key) in a Secret. If you use a namespaced Issuer, you’re duplicating that Secret across namespaces, which is messy. For production, one ClusterIssuer per CA is the pattern.
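apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-prod-key
    solvers:
      # Default solver: HTTP-01 through the nginx Ingress class
      - http01:
          ingress:
            class: nginx
      # DNS-01 via Cloudflare for every name under sumguy.com (wildcards included)
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-token
              key: api-token
        selector:
          dnsZones:
            - sumguy.com

This single issuer supports both HTTP-01 (for simple domains) and DNS-01 (for wildcards). cert-manager picks a solver for each DNS name using the solvers’ selector fields (dnsNames, dnsZones, or matchLabels): the most specific match wins, and a solver with no selector acts as the default. Here, names under sumguy.com go through DNS-01 and everything else falls back to HTTP-01. There is no field on the Certificate itself that chooses the challenge type.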
One more thing: if you’re paranoid, add a second ClusterIssuer named letsencrypt-staging. Hit staging first to debug. Staging rate limits are far higher than production, and staging certificates aren’t trusted by browsers, so it’s strictly for testing. When your Certificate works with staging, flip the annotation and retry.
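A staging issuer is nearly a copy of the production one with a different ACME directory URL. The sketch below assumes the same nginx HTTP-01 solver as the production example; the Secret name letsencrypt-staging-key is a made-up placeholder.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    # Let's Encrypt staging directory: untrusted certs, relaxed rate limits
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      # Placeholder name; keep it distinct from the production account key
      name: letsencrypt-staging-key
    solvers:
      - http01:
          ingress:
            class: nginx
```

Keeping the staging and production account keys in separate Secrets means the two ACME accounts never get mixed up.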
HTTP-01 vs DNS-01: Choose Your Complexity
HTTP-01 is simpler. cert-manager creates a temporary HTTP route, Let’s Encrypt hits http://<your-domain>/.well-known/acme-challenge/<token>, reads the token, and marks the challenge complete. It’s straightforward, but:
- Requires port 80 to be reachable from the public internet (directly or through a firewall port-forward)
- Doesn’t work for wildcard certificates
- Fails if your Ingress controller is slow to propagate routes
DNS-01 is slower but more flexible. cert-manager adds a TXT record to your DNS (directly for built-in providers, through a webhook for the rest), Let’s Encrypt queries DNS for that record, and the challenge passes. Then cert-manager deletes the record. It works for:
- Wildcard certificates (*.sumguy.com)
- Domains on private networks (cert-manager only needs to reach your DNS provider’s API; Let’s Encrypt never needs to reach your service)
- Environments where port 80 is blocked or behind a NAT
The trade-off: DNS-01 requires credentials to your DNS provider (Cloudflare API token, AWS Route53 key, Google CloudDNS service account). cert-manager stores those in a Secret. If someone compromises that Secret, they can modify your DNS. Use strong RBAC.
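As a rough illustration of "strong RBAC", the sketch below grants read access to just the one token Secret instead of all Secrets in the namespace. It assumes the Secret is named cloudflare-token and lives in the cert-manager namespace; the Role name and the platform-admins group are hypothetical.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dns-token-reader            # hypothetical name
  namespace: cert-manager
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["cloudflare-token"]   # only this Secret, nothing else
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: dns-token-reader
  namespace: cert-manager
subjects:
  - kind: Group
    name: platform-admins           # hypothetical group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: dns-token-reader
  apiGroup: rbac.authorization.k8s.io
```

Kubernetes RBAC is additive, so this only helps if nobody else holds a broader grant on Secrets in that namespace. It also pays to scope the DNS API token itself to the single zone it needs.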
Here’s a Certificate that uses DNS-01 for a wildcard:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: sumguy-wildcard
  namespace: default
spec:
  secretName: sumguy-wildcard-tls
  commonName: "*.sumguy.com"
  dnsNames:
    - "*.sumguy.com"
    - "sumguy.com"
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

Nothing on the Certificate selects the challenge type. cert-manager matches each DNS name against the solvers defined in the ClusterIssuer, and wildcard names can only be validated with DNS-01, so the issuer must offer a DNS-01 solver that applies to this zone. If only an HTTP-01 solver matches, the wildcard order fails.
Wildcard Gotchas: Your Cert Doesn’t Cover What You Think
A wildcard certificate for *.sumguy.com covers:
- api.sumguy.com ✓
- blog.sumguy.com ✓
But NOT:
- sumguy.com ✗ (apex domain): you need a separate SAN
- deeply.nested.sumguy.com ✗: wildcards only cover one level
- sub.sub.sumguy.com ✗: same reason, two levels down
So always include both the wildcard and the apex in dnsNames. And remember: wildcard certificates fall under the same Let’s Encrypt rate limits as any other certificate (50 certificates per week per registered domain), and reissuing the exact same set of names is capped at 5 per week. If you’re reissuing wildcards constantly, you’re burning your quota.
Rate Limits: The Wall You Didn’t See Coming
Let’s Encrypt has rate limits:
- 50 certificates per week per registered domain (all SANs, all variants count)
- 5 duplicate certificates per week (the exact same set of hostnames, even with a different key)
- 5 failed validations per account per hostname per hour
If you hit these, Let’s Encrypt rejects the request with a 429 error. cert-manager retries with exponential backoff, so your Certificate can sit pending for days.
Causes:
- Restarting cert-manager repeatedly (each reconciliation triggers a new ACME order)
- Deploying the same Certificate across multiple namespaces by accident
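- Setting renewBefore too aggressively. By default cert-manager renews once two-thirds of the certificate’s lifetime has passed, which is about 30 days before expiry for a 90-day Let’s Encrypt cert (the same as renewBefore: 720h). A much larger value, or constantly deleting and redeploying Certificates, means issuing far more often than you need to.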
The fix: use staging (letsencrypt-staging) until your setup is stable. Then flip to prod once. Set renewBefore to at least 30 days (cert-manager default is fine). If you need to test wildcard issuance, issue to a test domain first.
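For reference, this is roughly what a test Certificate looks like with the renewal threshold spelled out and the staging issuer selected. The Certificate and Secret names are placeholders, and it assumes a ClusterIssuer called letsencrypt-staging already exists.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: example-staging-test        # placeholder name
  namespace: default
spec:
  secretName: example-staging-test-tls
  # Renew 30 days before expiry (matches the default for 90-day certs)
  renewBefore: 720h
  dnsNames:
    - example.sumguy.com
  issuerRef:
    name: letsencrypt-staging       # switch to letsencrypt-prod once this works
    kind: ClusterIssuer
```

Switching issuerRef.name is the only change needed to move to production, so the risky experimentation all happens against staging quotas.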
DNS Provider Integration: The Webhook Dance
cert-manager ships with built-in support for a handful of DNS providers (including Cloudflare, Route53, Google CloudDNS, and Azure DNS). For unsupported providers, you use a webhook: a separately deployed service that cert-manager calls to add and remove DNS records.
Here’s Cloudflare DNS-01 with secrets:
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-dns01
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: token
---
apiVersion: v1
kind: Secret
metadata:
  name: cloudflare-api-token
  namespace: cert-manager
type: Opaque
stringData:
  token: "your-api-token-here"

The webhook pattern (for providers without built-in support):
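helm repo add jetstack https://charts.jetstack.io
# Chart name and values are illustrative; confirm them against the webhook project you actually use
helm install cert-manager-webhook-route53 jetstack/cert-manager-webhook-aws \
  --namespace cert-manager \
  --set aws.secretAccessKey=YOUR_KEY \
  --set aws.accessKeyId=YOUR_ID

This deploys a pod that exposes a webhook endpoint. cert-manager calls it when it needs to add/delete DNS records. The webhook holds the DNS provider credentials (AWS keys in this example) and talks to the provider’s API. Treat the chart name and values above as placeholders: Route53 is one of the built-in providers, so in practice this pattern is for providers cert-manager doesn’t support natively, and each of those has its own chart.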
The catch: webhooks add latency. If your webhook pod is slow or times out, the ACME challenge fails. Use horizontal pod autoscaling and give the pod enough memory.
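On the issuer side, a webhook solver is referenced by group name and solver name. Here is a minimal sketch: groupName, solverName, every Secret name, and everything under config are placeholders, because each webhook project defines its own values and its own free-form config schema.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-webhook-dns01     # hypothetical name
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-webhook-key   # hypothetical Secret name
    solvers:
      - dns01:
          webhook:
            groupName: acme.example.com     # placeholder; set by the webhook deployment
            solverName: my-dns-provider     # placeholder; defined by the webhook project
            config:
              # Free-form, provider-specific settings (hypothetical fields)
              apiKeySecretRef:
                name: my-dns-provider-credentials
                key: api-key
```

If the Challenge reports that the solver can’t be found, a mismatch between groupName here and the value the webhook was installed with is a likely suspect.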
Ingress Integration: The Annotation Path
The simplest way to get TLS is to annotate an Ingress:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-service
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  tls:
    - hosts:
        - example.sumguy.com
      secretName: example-tls
  rules:
    - host: example.sumguy.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 80

cert-manager watches the Ingress, sees the annotation, and auto-creates a Certificate. The cert ends up in the Secret example-tls. The Ingress controller (nginx, traefik, etc.) mounts that Secret and terminates TLS.
Important: Make sure your Ingress controller can read Secrets in that namespace. RBAC should allow it. If you use cert-manager in one namespace and your Ingress in another, the Secret ends up in the Ingress namespace, not cert-manager’s.
Gateway API: The Modern Path (2026+)
If you’re running Kubernetes 1.30+, consider Gateway API instead of Ingress. cert-manager can issue certificates for Gateway resources once its Gateway API support is enabled.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
spec:
  parentRefs:
    - name: my-gateway
      kind: Gateway
  hostnames:
    - example.sumguy.com
  rules:
    - backendRefs:
        - name: my-service
          port: 80

Attach TLS via the Gateway, and put the cert-manager annotation there:

apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-gateway
  annotations:
    # Without this annotation cert-manager ignores the Gateway
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  gatewayClassName: istio
  listeners:
    - name: https
      port: 443
      protocol: HTTPS
      hostname: example.sumguy.com
      tls:
        mode: Terminate
        certificateRefs:
          - name: example-tls

cert-manager watches the annotated Gateway (not the HTTPRoute) and creates a Certificate for each HTTPS listener that has a hostname, mode: Terminate, and a certificateRefs entry. The issued cert lands in the referenced Secret, example-tls. Gateway API is cleaner than Ingress annotations for complex setups.
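If HTTP-01 challenges should also be served through the Gateway rather than an Ingress, the ACME issuer needs a gatewayHTTPRoute solver. This is a sketch under assumptions the original doesn’t state: that my-gateway lives in the default namespace and has a plain HTTP listener on port 80. The issuer and Secret names are hypothetical.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging-gateway    # hypothetical name
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    privateKeySecretRef:
      name: letsencrypt-staging-gateway-key   # hypothetical Secret name
    solvers:
      - http01:
          gatewayHTTPRoute:
            # cert-manager creates a temporary HTTPRoute attached to this Gateway
            parentRefs:
              - name: my-gateway
                namespace: default     # assumed namespace
                kind: Gateway
```

The Gateway has to accept routes on port 80 for the challenge hostname, otherwise the temporary HTTPRoute never attaches and the Challenge stays pending.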
Monitoring: Know When It’s About to Break
Use Prometheus. cert-manager exports certmanager_certificate_expiration_timestamp_seconds, so alert on it.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager-alerts
spec:
  groups:
    - name: cert-manager
      interval: 30s
      rules:
        # Fewer than 7 days of validity left
        - alert: CertificateExpiringSoon
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          for: 1h
          annotations:
            summary: "Certificate {{ $labels.name }} expires in less than 7 days"
        # Scheduled renewal time passed more than 7 days ago and nothing happened
        - alert: CertificateNeverRenewed
          expr: |
            (time() - certmanager_certificate_renewal_timestamp_seconds) > 604800
          for: 1h
          annotations:
            summary: "Certificate {{ $labels.name }} is 7+ days past its scheduled renewal"

Also monitor cert-manager itself:
kubectl logs -n cert-manager deployment/cert-manager -f
kubectl logs -n cert-manager deployment/cert-manager-webhook -f

Watch for:
- context deadline exceeded (webhook timeout)
- rate limited (Let’s Encrypt rate limit hit)
- dns: server misbehaving (DNS provider unhappy)
- certificate secret already exists (Secret collision)
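Expiry alerts fire late in the game. An alert on the Ready condition catches a stuck issuance much earlier. The sketch below uses cert-manager’s certmanager_certificate_ready_status metric; the rule name, the 15-minute window, and the severity label are arbitrary choices, not recommendations from the original.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cert-manager-readiness-alerts   # hypothetical name
spec:
  groups:
    - name: cert-manager-readiness
      rules:
        - alert: CertificateNotReady
          # One series per Certificate and condition; value 1 marks the current condition
          expr: |
            certmanager_certificate_ready_status{condition="False"} == 1
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Certificate {{ $labels.namespace }}/{{ $labels.name }} has not been Ready for 15 minutes"
```

A brand-new Certificate is briefly not Ready while the ACME order runs, so the for: window should be longer than a normal issuance takes in your cluster.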
Troubleshooting: The Debug Ladder
When a Certificate is stuck pending, follow this ladder:
# 1. Describe the Certificate
kubectl describe certificate <cert-name>
# Look at Status.Conditions. Is it "Ready"? What's the message?

# 2. Describe the Order
kubectl describe order <order-name>
# Orders are created by cert-manager. Check Status.State: is it "pending", "processing", "valid", or "invalid"?

# 3. Describe the Challenge
kubectl describe challenge <challenge-name>
# Challenges are per-solver. Look for errors in Status.Reason.

# 4. Check the Issuer
kubectl describe clusterissuer letsencrypt-prod
# Is the issuer configured correctly? Can cert-manager reach ACME servers?

# 5. Webhook logs (if using DNS-01)
kubectl logs -n cert-manager deployment/cert-manager-webhook-<provider>
# Is the webhook healthy? Can it reach your DNS provider?

# 6. Full cert-manager logs
kubectl logs -n cert-manager deployment/cert-manager --tail=500
# Search for the Certificate name. Look for the full reconciliation flow.

Example workflow: Your wildcard certificate is stuck.
$ kubectl describe certificate sumguy-wildcard
Status:
  Conditions:
    - Type: Ready
      Status: False
      Reason: InvalidRequest
      Message: "Invalid request [urn:ietf:params:acme:error:rateLimited]: Error creating new order..."

$ kubectl describe order sumguy-wildcard-abc123
Status:
  State: invalid
  Reason: "urn:ietf:params:acme:error:rateLimited"

# You hit the rate limit. Check when you can retry:
$ kubectl logs -n cert-manager deployment/cert-manager | grep sumguy-wildcard | tail -20
# Look for the next reconciliation time.

The fix: wait. Or use staging. Or reissue to a test domain. Don’t restart cert-manager: that triggers new ACME orders and makes it worse.
Private CA with Vault: When Let’s Encrypt Isn’t Enough
Not all setups use Let’s Encrypt. If you’re running internal services, Vault-backed cert issuance is cleaner. cert-manager supports Vault Issuer:
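apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: vault-pki
spec:
  vault:
    server: https://vault.internal:8200
    path: pki/sign/my-role
    # Base64-encoded PEM bundle used to verify Vault's TLS certificate
    caBundle: <base64-encoded PEM CA bundle>
    auth:
      kubernetes:
        mountPath: /v1/auth/kubernetes
        role: cert-manager
        # ServiceAccount whose token is presented to Vault (placeholder name)
        serviceAccountRef:
          name: cert-manager-vault

cert-manager authenticates to Vault with a Kubernetes ServiceAccount token, requests a certificate, and stores it in a Secret. No ACME, no rate limits, just signed certs on-demand. Note that the kubernetes auth block needs either a serviceAccountRef or a secretRef to say which token to present, and caBundle takes base64-encoded PEM rather than a raw PEM block.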
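Requesting a certificate from the Vault issuer looks the same as the ACME case, only with a different issuerRef. In this sketch the Certificate name, Secret name, hostname, and lifetimes are all made up, and the duration has to fit within whatever maximum TTL the Vault role allows.

```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: internal-api                # hypothetical name
  namespace: default
spec:
  secretName: internal-api-tls
  duration: 720h                    # 30 days; must not exceed the Vault role's max TTL
  renewBefore: 240h                 # renew 10 days before expiry
  dnsNames:
    - api.internal.sumguy.com       # hypothetical internal hostname
  issuerRef:
    name: vault-pki
    kind: ClusterIssuer
```

With no public rate limits in play, short lifetimes are cheap here, which is one of the main attractions of a private CA.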
CrashLoopBackOff: When cert-manager Spirals
Sometimes cert-manager controller crashes and restarts in a loop. Common causes:
- Webhook timeout: cert-manager can’t reach the webhook pod. Fix: scale the webhook, check network policy.
- Secret collision: two Certificates pointing to the same Secret name. Fix: use unique Secret names.
- Malformed CRD: your Certificate YAML is invalid. Fix: check kubectl api-resources | grep certificate.
- OOM: cert-manager is running out of memory under load. Fix: increase resource requests.
To debug:
kubectl logs -n cert-manager deployment/cert-manager --previous
# Check the last log before the crash.

kubectl get events -n cert-manager --sort-by='.lastTimestamp'
# Kubernetes events often have hints.

kubectl describe pod -n cert-manager <pod-name>
# Check resource limits and restarts.

If you’re deploying a flood of Certificates at once, cert-manager might OOM or hit timeout. Stagger the rollout. Use renewBefore: 720h to avoid unnecessary renewals.
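If the describe output points at OOM kills, resource requests can be raised through the Helm chart values. This is a sketch that assumes cert-manager was installed from the official Helm chart; the numbers are placeholders to tune against your own usage, not recommendations.

```yaml
# values.yaml fragment for the cert-manager Helm chart (all numbers are placeholders)
resources:              # controller
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    memory: 512Mi
webhook:
  resources:
    requests:
      cpu: 20m
      memory: 64Mi
cainjector:
  resources:
    requests:
      cpu: 20m
      memory: 128Mi
```

Controller memory tends to grow with the number of Certificates and Secrets it watches, so it’s worth checking real usage after a large rollout rather than guessing up front.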
One More Thing: CAA Records
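CAA (Certification Authority Authorization) DNS records declare which CAs may issue for your domain. Every public CA, Let’s Encrypt included, checks CAA before issuing. With no CAA record at all, any CA may issue; once a record exists, only the CAs it lists may. So if you publish CAA records and later add a second CA (ZeroSSL, Buypass), you have to authorize that CA too.

# Add this to your DNS:
example.sumguy.com CAA 0 issue "letsencrypt.org"
example.sumguy.com CAA 0 issue "zerossl.com"

If a CAA record exists and doesn’t authorize the CA you’re using, issuance is refused with a CAA error that’s easy to miss in the logs. Confirm each CA’s exact identifier in its documentation (it doesn’t always match the brand name), add the CAA record, wait for DNS propagation, and retry.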
The Full Picture
cert-manager is a surprisingly deep rabbit hole. On the surface, it’s “add an annotation, get a cert, done.” But scale it to 20 services, add DNS-01 validation, hit a rate limit, and suddenly you’re neck-deep in ACME order states, webhook logs, and exponential backoff math.
The pattern that works:
- Start with one ClusterIssuer pointing to letsencrypt-staging.
- Use HTTP-01 for simple domains, DNS-01 for wildcards.
- Add both the wildcard and the apex to dnsNames.
- Monitor with Prometheus. Alert on expiry.
- Test with staging before moving to prod.
- Don’t restart cert-manager unless necessary.
- When stuck, describe the resource hierarchy: Certificate → Order → Challenge.
Your 2 AM self will appreciate it when a certificate nearly expired, cert-manager silently renewed it, and the alerts let you know it’s all fine.