Velero: K8s Backup and DR

By SumGuy 11 min read

Backups Are ETL for Your Future Regret

You’re three months into running a stateful workload on Kubernetes. The database tier looks bulletproof. The PVCs are mounted. Monitoring is configured. Everything is fine. Then, at 2 AM, someone typos a delete in a YAML and suddenly your production cluster is gone—not crashed, not degraded, just gone. The kubectl command fired. The resources evaporated.

That’s when you realize your backup strategy is “pray the cloud provider keeps redundancy.”

Here’s the thing: Velero exists for exactly this moment. It’s a Kubernetes-native backup and disaster recovery tool that captures your entire cluster state—every resource definition, custom resource, namespace, and persistent volume—and lets you replay it on a fresh cluster in minutes. No manual kubectl apply, no guessing which resources to restore, no 3 AM spreadsheet archaeology.

Originally built by Heptio (acquired by VMware, now Broadcom), Velero is open source, a Cloud Native Computing Foundation Sandbox project, and battle-tested by organizations running anything from one-node home labs to sprawling multi-cluster estates. We’re going to cover what it actually backs up, how to set it up, the storage tradeoffs, and the real runbook for “the cluster is burning, what now?”


What Velero Actually Does

Most backup solutions for Kubernetes focus on the data layer: “Save my databases, save my volumes.” That’s half the story. Velero takes a different approach—it backs up cluster state (all resources via the Kubernetes API), persistent volume data (via CSI snapshots or Restic node agents), and optionally hooks into your applications for quiescing (flushing buffers, freezing transactions).

The Backup Scope

When you create a Velero backup, it:

  1. Snapshots the Kubernetes API: walks through all resources (Deployments, StatefulSets, Services, Ingress, ConfigMaps, Secrets, CRDs, everything) and serializes them to JSON/YAML
  2. Captures PV data: either via CSI snapshots (if your storage driver supports them) or by mounting volumes on a node agent and uploading data to object storage
  3. Stores the backup and its metadata in S3-compatible storage: MinIO, AWS S3, Backblaze B2, Garage, whatever you’ve got
  4. Optionally runs pre/post-backup hooks: custom scripts (flush the database, quiesce the app) before and after the backup
  5. Creates a Backup custom resource in-cluster: you can kubectl describe backup my-backup -n velero and see exactly what was captured

The result: a complete, reproducible snapshot of your cluster that you can restore to any Kubernetes cluster running the same version (or newer, usually).
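The optional hooks in step 4 can also be declared per pod with annotations rather than in the Backup spec. Here is a minimal sketch of the fsfreeze pattern from the Velero docs. The function name and its arguments are made up for illustration, and it assumes the named container ships /sbin/fsfreeze and is allowed to run it:

```shell
# Hypothetical helper: attach pre/post backup hooks to one pod via annotations.
# Assumes the container has /sbin/fsfreeze and enough privileges to use it.
annotate_backup_hooks() {
  ns="$1"; pod="$2"; container="$3"; mount="$4"
  kubectl -n "$ns" annotate "pod/$pod" --overwrite \
    "pre.hook.backup.velero.io/container=$container" \
    "pre.hook.backup.velero.io/command=[\"/sbin/fsfreeze\", \"--freeze\", \"$mount\"]" \
    "post.hook.backup.velero.io/container=$container" \
    "post.hook.backup.velero.io/command=[\"/sbin/fsfreeze\", \"--unfreeze\", \"$mount\"]"
}
```

You would call it as annotate_backup_hooks database postgres-0 fsfreeze /var/lib/postgresql/data (all four values are placeholders). Annotations set on a live pod disappear when the pod is recreated, so for anything permanent they belong in the pod template of the Deployment or StatefulSet.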


Why This Beats Manual Backup Strategies

I’ve watched teams try to DIY this:

Velero splits the difference: it’s simple enough to run on a single-node cluster with a MinIO bucket, but sophisticated enough to handle namespaced backups, cross-cluster restores, and complex restoration workflows. And it’s free.


Installation: Helm or Velero CLI

You’ll need:

Option 1: Helm

Terminal window
helm repo add vmware-tanzu https://vmware-tanzu.github.io/helm-charts
helm repo update
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --values values-velero.yaml

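Where values-velero.yaml points Velero at your S3 backend (the access keys themselves live in a Secret, not in this file):

values-velero.yaml
configuration:
  # Chart v5 and later expect lists here; older charts used a single map
  backupStorageLocation:
    - name: default
      provider: aws
      bucket: "velero-backups"
      config:
        s3Url: "https://minio.your-domain.com" # MinIO endpoint
        s3ForcePathStyle: "true" # path-style URLs, needed for MinIO
        region: "us-east-1" # dummy for MinIO, required by the AWS SDK
  volumeSnapshotLocation:
    - name: default
      provider: aws
      config:
        region: "us-east-1"
# The AWS plugin is loaded through an init container
initContainers:
  - name: velero-plugin-for-aws
    image: velero/velero-plugin-for-aws:v1.9.0
    volumeMounts:
      - mountPath: /target
        name: plugins
schedules:
  daily-backup:
    schedule: "0 2 * * *"
    template:
      ttl: "720h" # 30 days
      includedNamespaces: ["*"]
image:
  repository: velero/velero
  tag: "v1.13.1" # Pin the version
# Access keys are not valid storage-location config keys; they come from this Secret
credentials:
  useSecret: true
  existingSecret: velero-credentials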

Then create a Secret with your S3 credentials. The chart reads a single key named cloud, and its value has to be an AWS-style credentials file:

Terminal window
# ./credentials contains:
#   [default]
#   aws_access_key_id=YOUR_KEY
#   aws_secret_access_key=YOUR_SECRET
kubectl create secret generic velero-credentials \
  --from-file=cloud=./credentials \
  -n velero

Option 2: Velero CLI

Terminal window
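velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero-backups \
  --secret-file ./credentials \
  --backup-location-config region=us-east-1,s3ForcePathStyle=true,s3Url=https://minio.your-domain.com \
  --use-volume-snapshots=false \
  --use-node-agent # deploys the DaemonSet used for file-system (Restic/Kopia) backups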

Either way, Velero will spin up a Deployment and, if file-system backup is enabled, a node-agent DaemonSet (named restic before Velero 1.10).


The Backup CRD: Making Your First Backup

Once Velero is running, backups are defined as Kubernetes resources. Here’s a full-cluster backup:

backup.yaml
apiVersion: velero.io/v1
kind: Backup
metadata:
  name: daily-snapshot
  namespace: velero
  # Metadata labels for filtering (labels live here, not under spec)
  labels:
    backup-type: "full-cluster"
    frequency: "daily"
spec:
  includedNamespaces: ["*"]
  excludedNamespaces: ["velero"]
  # Exclude system namespaces if you want
  # excludedNamespaces: ["kube-system", "kube-public", "velero"]
  # Include all resources; add "secrets" to excludedResources if you don't trust S3 encryption
  includedResources: ["*"]
  excludedResources: ["events", "events.events.k8s.io"]
  # Time-to-live before auto-deletion
  ttl: "720h"
  # Run hooks before/after backup
  hooks:
    resources:
      - name: my-db-quiesce
        includedNamespaces: ["database"]
        pre:
          - exec:
              container: postgres
              command: ["/bin/sh", "-c", "pg_basebackup -D /tmp/backup"]
  # Include PV data:
  # true = file-system backup (Restic/Kopia) of every pod volume
  # false = rely on CSI snapshots (requires CSI driver + VolumeSnapshotClass)
  # The Backup spec has no per-volume include/exclude fields; to skip a volume,
  # annotate its pod: backup.velero.io/backup-volumes-excludes=cache-volume
  defaultVolumesToFsBackup: true

Apply it:

Terminal window
kubectl apply -f backup.yaml
# Watch it progress
kubectl logs -n velero deployment/velero -f
# See the backup status
velero backup describe daily-snapshot
velero backup logs daily-snapshot

Schedule: Automating the Pain Away

Rather than manually creating backups, use a Schedule:

backup-schedule.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-full-backup
  namespace: velero
  labels:
    schedule: "daily"
spec:
  schedule: "0 2 * * *" # 2 AM daily, cron syntax
  template:
    includedNamespaces: ["*"]
    excludedNamespaces: ["velero", "kube-system"]
    # Namespace-scoped backup (backup only one namespace)
    # includedNamespaces: ["production"]
    includedResources: ["*"]
    excludedResources: ["events"]
    ttl: "720h" # Delete backups older than 30 days
    defaultVolumesToFsBackup: true # replaces the deprecated defaultVolumesToRestic

Velero will create Backup resources on the schedule and delete each one automatically once its TTL expires.
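If you would rather not keep a YAML file around, the same schedule can be sketched with the velero CLI. The wrapper function name is made up for illustration, and the flag values simply mirror the manifest above:

```shell
# Hypothetical wrapper: create the daily schedule from the CLI instead of YAML.
create_daily_schedule() {
  velero schedule create daily-full-backup \
    --schedule "0 2 * * *" \
    --exclude-namespaces velero,kube-system \
    --exclude-resources events \
    --ttl 720h \
    --default-volumes-to-fs-backup
}
```

Run create_daily_schedule once, then check it with velero schedule get. Keeping the manifest in Git is still the better option if you want the schedule to be reviewable and reproducible.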


Restore: Bringing It Back

When disaster strikes, restores are equally straightforward:

restore.yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: restore-prod-db
  namespace: velero
spec:
  backupName: daily-snapshot # Which backup to restore
  # Restore only one namespace (common for partial recovery)
  includedNamespaces: ["database"]
  # includedNamespaces: ["*"] # restore everything
  # Map namespace during restore (restore 'prod' backup into 'prod-restored')
  namespaceMapping:
    prod: prod-restored
  # Include/exclude resources
  includedResources: ["*"]
  excludedResources: ["events"]
  # Restore hooks (run after restore, e.g., schema migration)
  hooks:
    resources:
      - name: migrate-after-restore
        includedNamespaces: ["database"]
        postHooks:
          - exec:
              container: postgres
              command: ["/bin/sh", "-c", "psql -f /migrations/post-restore.sql"]
  # Restore persistent volumes from the backup (set to false to skip them)
  restorePVs: true
  # What to do with resources that already exist in the target cluster:
  # 'update' patches them to match the backup, 'none' leaves them untouched
  existingResourcePolicy: update

Restore via CLI:

Terminal window
velero restore create \
  --from-backup daily-snapshot \
  --include-namespaces "database" \
  my-restore-job
# Watch it
velero restore describe my-restore-job
velero restore logs my-restore-job

Restic vs. CSI Snapshots: The Data Layer Tradeoff

This is the gotcha that trips people up.

Restic (Universal, Slower)

CSI Snapshots (Fast, Storage-Specific)

Recommendation: For production, start with Restic. It’s slower but universal—you can restore to any cluster with any storage. Once you’re comfortable, migrate to CSI if your storage driver supports it.
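If you go the file-system backup route but don't want every volume uploaded, Velero also supports opting in individual pod volumes with an annotation. A minimal sketch, with a made-up function name and placeholder arguments:

```shell
# Hypothetical helper: opt specific pod volumes into file-system backup.
# $3 is a comma-separated list of volume names from the pod spec.
annotate_volume_for_fs_backup() {
  ns="$1"; pod="$2"; volumes="$3"
  kubectl -n "$ns" annotate "pod/$pod" --overwrite \
    "backup.velero.io/backup-volumes=$volumes"
}
```

For example, annotate_volume_for_fs_backup database postgres-0 data would mark only the data volume of that pod. As with any pod annotation, it is lost when the pod is recreated unless it also lives in the workload's pod template.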


Cross-Cluster Restore: The Real Win

Here’s why Velero matters. You’re running a single cluster. A node dies. The cluster controller gets corrupted. You lose quorum. You need a new cluster.

With Velero:

  1. Spin up a new Kubernetes cluster (same version or newer)
  2. Install Velero with the same S3 backend
  3. List the backups: velero backup get
  4. Restore: velero restore create --from-backup daily-snapshot my-recovery
  5. Watch resources come back: kubectl get all -A

Your entire cluster is back, resources and all. That’s the entire point.
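One precaution worth considering when a brand-new cluster points at your only copy of the backups: flip the backup storage location to read-only for the duration of the restore, so nothing gets created or deleted in the bucket by accident. A sketch, assuming the location is named default and Velero runs in the velero namespace (the function name is hypothetical):

```shell
# Hypothetical helper: switch a BackupStorageLocation between ReadOnly and ReadWrite.
set_bsl_access_mode() {
  mode="$1"; bsl="${2:-default}"
  case "$mode" in
    ReadOnly|ReadWrite) ;;
    *) echo "mode must be ReadOnly or ReadWrite" >&2; return 1 ;;
  esac
  kubectl -n velero patch backupstoragelocation "$bsl" \
    --type merge \
    --patch "{\"spec\":{\"accessMode\":\"$mode\"}}"
}
```

Call set_bsl_access_mode ReadOnly before the restore and set_bsl_access_mode ReadWrite once you are happy with the result, otherwise scheduled backups on the new cluster will have nowhere to write.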


Encryption, Security, and Gotchas

Encryption

CRDs First, Then Resources

Velero restores in this order: Custom Resource Definitions, then cluster-scoped resources, then namespaced resources. If you have a CRD that other resources depend on, you need that CRD to exist first or the restore will fail silently. Velero handles this, but be aware.

Plugins for AWS/GCP/Azure

Velero has plugins for cloud-provider snapshots:

Install them with --plugins on the CLI or as initContainers in your Helm values.

Secrets and ConfigMaps

Velero backs up Secrets and ConfigMaps as-is. If you don’t want Secret data in object storage, exclude them:

excludedResources: ["secrets"]

You’ll need to restore those manually.

Namespace Deletion

Velero does NOT delete existing namespaces during restore. If you’re restoring a namespace that already exists, resources will merge (or conflict, depending on existingResourcePolicy). If you want a clean slate, delete the namespace first.
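The clean-slate approach can be sketched as a two-step helper. This is destructive by design, the function name is made up, and it assumes the backup you name actually contains the namespace you are about to delete:

```shell
# Hypothetical helper: wipe a namespace, then restore it from a backup.
# DESTRUCTIVE: everything currently in the namespace is deleted first.
clean_restore_namespace() {
  ns="$1"; backup="$2"
  # Stop here if the delete fails, so we never restore on top of leftovers
  kubectl delete namespace "$ns" --ignore-not-found --wait=true || return 1
  velero restore create "restore-$ns-$(date +%Y%m%d%H%M%S)" \
    --from-backup "$backup" \
    --include-namespaces "$ns" \
    --wait
}
```

A call looks like clean_restore_namespace database daily-snapshot. It is worth running velero backup describe on the backup first to confirm it completed without errors before deleting anything.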


Real DR Runbook: Cluster is Burning

Your cluster just caught fire. Here’s the actual sequence:

Step 1: Spin Up a Fresh Cluster

Terminal window
# On your local machine, provision new infrastructure
# (Terraform, Ansible, whatever you use)
# Wait for cluster to be ready and kubeconfig accessible

Step 2: Install Velero

Terminal window
helm install velero vmware-tanzu/velero \
  --namespace velero \
  --create-namespace \
  --values values-velero.yaml # Same S3 backend as production

Step 3: List Available Backups

Terminal window
velero backup get
# Output:
# NAME STATUS ERRORS WARNINGS CREATED
# daily-snapshot Completed 0 0 2026-06-24 02:00:00
# daily-snapshot-2 Completed 0 0 2026-06-23 02:00:00

Step 4: Restore

Terminal window
velero restore create \
  --from-backup daily-snapshot \
  my-cluster-recovery
# Watch the logs
velero restore logs my-cluster-recovery
# Verify workloads are back
kubectl get pods -A
kubectl get svc -A

Step 5: Validate

Terminal window
# Check if your app is running
kubectl logs -n production deployment/my-app
# Connect to services
kubectl port-forward -n production svc/my-db 5432:5432
# Run a sanity query
psql -h localhost -U dbuser -d mydb -c "SELECT COUNT(*) FROM users;"

Step 6: Update DNS/Ingress

If your old cluster had a public IP, update DNS or the Ingress controller endpoint to point to the new cluster.

Step 7: Monitor

Terminal window
# Check Velero status for any warnings
velero restore describe my-cluster-recovery

The whole process: 15–30 minutes, depending on your cluster size and backup upload speed.
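If you want the runbook to block until the restore has actually finished, rather than eyeballing the logs, a small polling loop does the job. This is a sketch with a made-up function name; it assumes Velero lives in the velero namespace and treats any failed phase as a hard stop:

```shell
# Hypothetical helper: poll a Restore until it reaches a terminal phase.
# Usage: wait_for_restore NAME [MAX_TRIES] [DELAY_SECONDS]
wait_for_restore() {
  name="$1"; tries="${2:-60}"; delay="${3:-10}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    phase="$(kubectl -n velero get restores.velero.io "$name" -o jsonpath='{.status.phase}')"
    case "$phase" in
      Completed)
        echo "restore $name completed"; return 0 ;;
      PartiallyFailed|Failed|FailedValidation)
        echo "restore $name ended in phase $phase" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for restore $name" >&2
  return 2
}
```

In the runbook above that would be wait_for_restore my-cluster-recovery right after Step 4. A PartiallyFailed restore may still be usable, so check velero restore describe before deciding to start over.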


Alternatives: Stash, Kasten K10, TrilioVault

For self-hosters and small teams, Velero + MinIO (or S3) is unbeatable. Free, auditable, simple, and proven.


Monitoring and Alerts

Velero exposes Prometheus metrics:

# In-cluster, the Velero Prometheus metrics are served on port 8085 at /metrics
# Useful metrics:
# velero_backup_total
# velero_backup_failure_total
# velero_backup_duration_seconds
# velero_restore_total
# velero_restic_backup_errors
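Before wiring up Prometheus, you can sanity-check the endpoint by hand. The sketch below sums one counter across all of its label sets from the plain-text metrics output. The function name is made up, and it assumes each sample line ends with the value (no trailing timestamp):

```shell
# Hypothetical helper: sum one metric across label sets.
# Reads Prometheus text-format output on stdin, prints the total as an integer.
sum_metric() {
  metric="$1"
  awk -v m="$metric" '
    # Match the bare metric name or the name followed by a label block
    $1 == m || index($1, m "{") == 1 { total += $NF }
    END { printf "%d\n", total }
  '
}
```

With kubectl -n velero port-forward deploy/velero 8085:8085 running in another terminal (assuming the Deployment is called velero), curl -s http://localhost:8085/metrics | sum_metric velero_backup_failure_total prints the total number of failed backups the server has counted.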

Set up Prometheus scraping and alert on:


The Bottom Line

Velero is boring infrastructure—exactly what you want. It does one job: capture cluster state and PV data, store it durably, and let you replay it on a fresh cluster. No surprises, no proprietary formats, no vendor lock-in.

Your cluster will fail. Drives will die. Bugs will slip through. But with Velero running and a Schedule cranking out daily snapshots, you’re not debugging at 2 AM; you’re spinning up a new cluster and restoring from yesterday.

That peace of mind is worth the 20 minutes of setup.

