Chapter 20: Backup and Disaster Recovery
Self-managing clusters (Chapter 6) bound the blast radius of failure — one cluster’s loss doesn’t affect the fleet. But the lost cluster’s workloads still need recovery.
Most organizations treat backup as a checkbox: “we have Velero configured.” The harder question is: have you restored from it?
The decisions:
- What actually needs backup? (CRD specs can be re-derived. Persistent volume data can’t. Knowing the difference saves storage cost and recovery time.)
- Can the derivation pipeline be your recovery mechanism? (If the specs survive, the pipeline re-derives everything. But “everything” has caveats.)
- How do you know your backups work? (If you haven’t restored from a backup, you don’t have a backup. You have a file in S3 that you hope works.)
20.1 What Needs Backup
Not everything in a cluster needs backup. The derivation pipeline (Chapter 8) can re-derive most of the cluster’s resources from CRD specs. Focus backups on what can’t be re-derived.
CRD specs. The source of truth for every service, job, and model. If stored in git (GitOps workflow), git is the backup. If applied directly via kubectl apply, they need cluster-level backup. The platform should encourage — or enforce — git-based spec management so that CRD specs are never the single point of failure.
Persistent volumes. Databases, caches with persistence, file stores. PV data can’t be re-derived. Cloud-native snapshots (EBS snapshots, GCP persistent disk snapshots) are the most reliable mechanism. The reference implementation supports LatticeClusterBackup CRDs that schedule Velero backups with configurable retention.
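The text names LatticeClusterBackup but not its schema. A sketch of what such a resource might look like, modeled on Velero’s Schedule API; the API group and every field name here are assumptions, not the reference implementation’s actual contract:

```yaml
apiVersion: lattice.example.com/v1alpha1   # group/version is an assumption
kind: LatticeClusterBackup
metadata:
  name: nightly-pv-backup
spec:
  schedule: "0 2 * * *"        # cron syntax: daily at 02:00
  ttl: 720h                    # retain each backup for 30 days
  includedNamespaces:
    - commerce                 # back up only namespaces with PV data
  snapshotVolumes: true        # use cloud-native snapshots (EBS, GCP PD)
```

The controller behind such a CRD would translate each entry into a Velero Schedule, so retention and scope stay declarative alongside the rest of the platform’s specs.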
CAPI resources. On self-managing clusters, CAPI resources define the cluster’s infrastructure. If they’re lost, the cluster can’t scale, upgrade, or replace nodes. These are critical — and they exist only on the cluster itself after the pivot (Chapter 6).
etcd. CAPI resources, CRD specs, and all Kubernetes state live in etcd. etcd snapshots are the foundation of cluster-level backup. But “restore etcd snapshot” is not as simple as loading a file.
With 3 etcd members: losing 1 member is not a restore event — quorum (2 of 3) is intact, etcd continues operating, and you replace the failed member by adding a new one. Losing 2 members loses quorum — etcd is read-only (or unavailable, depending on configuration). Now you need the snapshot. An etcd restore from snapshot requires: stopping the surviving member, restoring the snapshot to a new single-member cluster, then adding members back to rebuild the 3-member topology. The restored cluster has new member IDs — kubeadm and the CAPI KubeadmControlPlane controller handle the membership reconfiguration, but it’s not instant.
The dangerous case: all 3 members lost (disk corruption, all control plane nodes gone). Restore from snapshot to a fresh single-member etcd, bootstrap a new control plane around it, then scale back to 3. This is the procedure that should be tested quarterly (Section 20.9) because it’s the one nobody practices until the disaster forces it.
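The all-members-lost procedure, sketched as a runbook. Hostnames, certificate paths, and the kubeadm PKI layout are placeholder assumptions; etcdutl is the restore tool in etcd v3.5+, while older clusters use etcdctl snapshot restore:

```shell
# 1. Routinely: take snapshots from a healthy member (scheduled).
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-snap.db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# 2. Disaster: all members lost. Restore to a fresh single-member cluster
#    on the first replacement control plane node. New member IDs are
#    generated here -- the old topology is gone.
etcdutl snapshot restore /backup/etcd-snap.db \
  --name cp-1 \
  --initial-cluster cp-1=https://10.0.0.10:2380 \
  --initial-advertise-peer-urls https://10.0.0.10:2380 \
  --data-dir /var/lib/etcd

# 3. Start etcd and the API server against the restored data dir, then
#    scale the control plane back to 3; kubeadm and the
#    KubeadmControlPlane controller handle member addition from there.
```

This is the sequence the quarterly test should exercise end to end, not just the snapshot-save half.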
Secrets backend data. External secrets are stored in Vault or cloud secret managers. These systems have their own backup mechanisms — the platform doesn’t back them up. But Kubernetes Secrets (ESO-synced copies) are ephemeral — ESO regenerates them from the backend on the next sync.
What doesn’t need backup: Derived output (Deployments, Services, NetworkPolicies — re-derived from CRD specs). Metrics and logs (stored externally in the metrics backend). Cached state (verification cache, DCGM baselines — regenerated). These are consequences of the CRD specs and the derivation pipeline. The specs are the source of truth; the derived output is disposable.
20.2 The Derivation Pipeline as Recovery
This is the key insight: if the CRD specs survive, the derivation pipeline re-derives the entire cluster’s workload infrastructure.
Walk through a disaster recovery scenario:
- t=0. Cluster us-east-production is destroyed — etcd quorum lost, control plane unrecoverable.
- t=5m. The operator provisions a new cluster from the same LatticeCluster CRD (Chapter 5). Infrastructure provisioning begins.
- t=20m. The new cluster is Ready — control plane, workers, CNI, mesh, platform operator all bootstrapped.
- t=21m. The operator restores CRD specs from git: kubectl apply -f services/commerce/. 50 LatticeService, LatticeJob, and LatticeModel CRDs are created.
- t=22m. The derivation pipeline begins processing. Each CRD is derived: image verification, Cedar authorization, workload compilation, layered application, mesh readiness wait. The pipeline processes specs in parallel (one reconciliation per CRD).
- t=30m. Most services are Ready. ESO has synced secrets from the backend. Mesh policies are in place. Observability scrape targets are active.
- t=35m. The operator restores persistent volume data from snapshots. Databases come online with data from the last snapshot.
- t=40m. DNS records are updated to point to the new cluster’s ingress. Traffic resumes.
Total recovery time: ~40 minutes. Of that, 20 minutes is cluster provisioning (infrastructure), 10 minutes is derivation (the pipeline), and 10 minutes is PV restoration (data). The derivation pipeline handled 50 services automatically — the operator didn’t manually re-create Deployments, NetworkPolicies, ExternalSecrets, or scrape targets.
These timings are best-case estimates from the reference implementation’s test environment. Production recovery depends on variables the platform doesn’t control: cloud provider API response times (EC2 launch can take 5-10 minutes on a bad day), volume snapshot sizes (a 500GB restore takes longer than a 10GB restore), DNS propagation (Section 20.5), and secret backend availability (if Vault is also down, ESO can’t sync). Use these numbers as a planning baseline, not a guarantee. Your quarterly restore tests (Section 20.9) will produce the timings that matter — the ones from your actual infrastructure.
What the pipeline CAN’T recover:
- Persistent volume data older than the last snapshot (RPO gap).
- In-flight requests at the time of failure (lost).
- DNS propagation delay (clients using cached DNS may hit the old endpoint for TTL duration).
- Cross-cluster bilateral agreements with other clusters (need the other clusters to re-match against the new cluster’s service specs).
20.3 RPO and RTO
Recovery Point Objective — how much data can you lose. Determined by backup frequency:
- CRD specs in git: zero loss (git is always current).
- PV snapshots hourly: up to 1 hour of data loss.
- PV snapshots daily: up to 24 hours of data loss.
Recovery Time Objective — how quickly you recover:
- Cluster provisioning: 15-20 minutes.
- Derivation pipeline: 10-15 minutes for 200 services. Derivation parallelizes well — each CRD reconciles independently, so 4x services doesn’t mean 4x time. The walkthrough above re-derives 50 services in ~9 minutes; 200 services takes 10-15 minutes, not 36 minutes, because the pipeline processes specs concurrently.
- PV restoration: depends on snapshot size and restore speed.
- Total: 30-50 minutes for a full cluster recovery under favorable conditions. The range widens with larger volumes, slower providers, or degraded external dependencies.
For self-managing clusters, RPO and RTO are per-cluster. One cluster’s 40-minute RTO doesn’t affect other clusters. This is the architectural payoff of self-management and bounded blast radius.
20.4 Rebuild vs Restore
Two recovery modes exist. The chapter has described both without explicitly naming the choice.
Mode A: Rebuild. Provision a new cluster. Re-apply CRD specs from git. Restore PV snapshots. The cluster has a new identity — new IPs, new certificates, new mesh identities. This is the walkthrough in Section 20.2.
Mode B: Restore etcd. Restore the etcd snapshot to the same (or replacement) control plane nodes. The cluster retains its identity — same CAPI resources, same certificates, same mesh identities. This is faster for control-plane-only failures but riskier if the etcd snapshot contains corrupted state.
When to rebuild:
- Git is the source of truth and is current.
- Infrastructure is reproducible (Chapter 5’s declarative provisioning).
- You want clean state — no risk of restoring corrupted resources.
- The failure is total (all control plane nodes gone, no point restoring).
When to restore etcd:
- Git is incomplete (specs were applied directly, not committed).
- You need exact historical state (CAPI resources, in-progress reconciliations).
- Recovery must preserve cluster identity (external systems depend on specific IPs, certificates, or DNS names).
- The failure is partial (one etcd member lost, control plane degraded but recoverable).
The platform should support both modes. The default recommendation: rebuild when you can, restore when you must. Rebuild produces clean state from known-good inputs. Restore recovers unknown state from a point-in-time snapshot — faster, but you inherit whatever was in etcd at snapshot time, including any bugs or drift.
20.5 What Breaks Outside the Cluster
Recovery is not just a cluster-internal event. External systems that depend on the cluster’s identity will break.
mTLS certificates. A rebuilt cluster has a new mesh CA intermediate. SPIFFE IDs change. Cross-cluster bilateral agreements that referenced the old cluster’s identities must be re-established through the parent’s PeerRouteSync (Chapter 15). An etcd restore preserves the old certificates — but they may be close to expiry.
IP addresses. New nodes have new IPs. External systems with IP allowlists (firewalls, third-party API providers, database connection limits) need updating. Cloud load balancers may get new IPs. The platform should use DNS-based service discovery rather than IP-based wherever possible to minimize this.
DNS propagation. “DNS records are updated, traffic resumes” is the theory. In practice, DNS is often the longest tail risk. Client-side DNS caching, ISP resolver TTLs, and CDN edge caches mean some traffic hits the old endpoint for minutes to hours. Set TTLs low (60-300 seconds) on production records before a disaster. During recovery, expect partial traffic during the TTL window — not an instant cutover.
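Auditing TTLs ahead of the disaster is scriptable. A minimal sketch that parses the TTL field (field 2) out of dig-format answer lines and flags records above a threshold; the record shown is illustrative, and in practice the input would come from dig +noall +answer for each production name:

```shell
# Flag DNS records whose TTL exceeds a maximum.
# Input: dig-style answer lines ("name TTL class type data") on stdin.
check_ttl() {
  awk -v max="${1:-300}" '{
    if ($2 + 0 > max) { printf "WARN %s TTL=%s exceeds %s\n", $1, $2, max; bad = 1 }
    else              { printf "OK   %s TTL=%s\n", $1, $2 }
  } END { exit bad }'
}

# A compliant record passes:
echo "api.example.com. 60 IN A 203.0.113.7" | check_ttl 300
# prints: OK   api.example.com. TTL=60
```

Running this against every public record in the weekly checks turns “set TTLs low before a disaster” from advice into an enforced invariant.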
OAuth redirect URIs, webhook endpoints, third-party integrations. Any external system that was configured with the cluster’s specific URLs or certificates needs updating. The platform can’t automate this — it’s an organizational inventory problem. The DR runbook should list every external dependency and the person responsible for updating each one.
PV restoration caveats. “Restore PV snapshots” involves: volume reattachment (the new node must be in the same availability zone as the snapshot), storage class compatibility (the new cluster must support the same storage class), and filesystem consistency (the snapshot captures the disk at a point in time, but the application may need to replay write-ahead logs or run recovery procedures). Snapshot restore is necessary but not always sufficient — applications with their own persistence layer (databases, message brokers) may require additional recovery steps beyond what the platform provides.
20.6 Partial Failures
Not every disaster is a total cluster loss. Real incidents are often partial: one availability zone down, a network partition, a degraded control plane, partial data corruption.
When not to fail over. If the control plane is degraded but workloads are running, failover may cause more disruption than repair. Existing pods continue serving traffic even if the API server is slow or partially unavailable. Failing over means provisioning a new cluster, migrating traffic, and accepting the DNS propagation delay — all of which may take longer than repairing the degraded control plane.
When to repair in place. A single failed etcd member, a crashed controller, a node that won’t drain — these are repair operations, not recovery operations. The self-managing cluster (Chapter 6) handles most node-level failures automatically through MachineHealthCheck. Control plane issues require manual intervention but don’t require a new cluster.
The decision framework: fail over when the cluster can’t self-heal and repair time exceeds RTO. Repair in place when the cluster is degraded but functional and repair time is shorter than failover. The platform should provide enough diagnostic information (control plane health, etcd member status, node readiness) for the operator to make this call quickly.
20.7 Backup Integrity
Backups that can’t be read are not backups.
Corruption detection. etcd snapshots should be verified after creation — restore to a temporary etcd instance and confirm the data is readable. PV snapshots should be periodically mounted and checksummed. This is part of the monthly restore test (Section 20.9).
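A minimal corruption check using only coreutils: write a checksum manifest when the backup is taken, verify it before every restore test. This catches bit rot and truncation, not semantic corruption inside a snapshot (that still requires the restore-and-read check above); the helper names are illustrative:

```shell
# At backup time: record a checksum for every artifact in the backup dir,
# excluding the manifest itself.
record_manifest() {
  ( cd "$1" && find . -type f ! -name MANIFEST.sha256 \
      -exec sha256sum {} + > MANIFEST.sha256 )
}

# Before restoring: verify every artifact against the manifest.
# Non-zero exit means a file is missing or its bytes have changed.
verify_manifest() {
  ( cd "$1" && sha256sum --check --quiet MANIFEST.sha256 )
}
```

A verify_manifest failure in the monthly test means the backup is broken, not merely the test.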
Credential loss. If the encryption keys for etcd-at-rest are lost, the etcd snapshot is unreadable. If the S3 bucket credentials are revoked, PV snapshots are inaccessible. The DR plan must include: where are the encryption keys stored, who has access, and what happens if the key management system is also down.
Compromised backups. If an attacker gains write access to the backup storage, they could modify CRD specs in the backup to include malicious images or policy changes. Backup storage should be append-only (S3 Object Lock, GCS retention policies) with separate credentials from the cluster. The restore procedure should verify spec integrity — compare restored specs against git before applying.
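With AWS as the example, append-only storage looks like S3 Object Lock in compliance mode. A hedged sketch; the bucket name is a placeholder, and the GCS equivalent is a bucket retention policy:

```shell
# Object Lock must be enabled when the bucket is created; it cannot be
# added to an existing bucket.
aws s3api create-bucket \
  --bucket lattice-backups-example \
  --object-lock-enabled-for-bucket

# Default retention: every object is immutable for 30 days. COMPLIANCE
# mode has no override, even for the credentials that wrote the object.
aws s3api put-object-lock-configuration \
  --bucket lattice-backups-example \
  --object-lock-configuration \
  '{"ObjectLockEnabled": "Enabled", "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}}}'
```

The credentials that write backups into this bucket should be distinct from every credential the cluster holds, so compromising the cluster does not compromise its history.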
Systems should be designed so recovery is routine, not exceptional. If the only time you test recovery is during a real disaster, every recovery is an improvisation. Automated monthly restore tests, quarterly full-cluster recovery, and annual DR exercises (Section 20.9) make recovery a practiced procedure — not a panic response.
20.8 What Goes Wrong in Practice
Scenario: the specs in git are stale. A cluster is destroyed. The operator provisions a new cluster and restores CRD specs from git. But 15 of 50 services have specs in git that are 3 months old — developers applied changes with kubectl and never committed. The derivation pipeline produces services that match the git specs, not the services that were actually running.
The checkout spec in git says replicas: 2. The running checkout had replicas: 5 (scaled up after a traffic incident and never committed). The restored checkout runs with 2 replicas under production traffic. It’s overwhelmed.
The lesson. Git as backup only works if git is the source of truth. The platform should detect spec drift: compare the CRD spec on the cluster with the spec in git (by hash), and report divergence in the CRD status. SpecDriftDetected: True, reason: LatticeService commerce/checkout differs from git revision abc123. This makes drift visible before a disaster forces the discovery.
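The hash comparison needs nothing more than canonical JSON. Assume the git copy and the cluster copy of a spec are both available as JSON files (the cluster side via something like kubectl get -o json); the helper names are illustrative, and a real implementation would first strip server-populated fields (status, metadata.resourceVersion) so they don’t register as drift:

```shell
# Canonicalize a JSON spec (sorted keys, fixed whitespace) and hash it,
# so formatting differences between git and the cluster don't count.
spec_hash() {
  python3 -c 'import json, sys
print(json.dumps(json.load(open(sys.argv[1])), sort_keys=True, separators=(",", ":")))' "$1" \
    | sha256sum | cut -d' ' -f1
}

# Compare the spec in git against the spec on the cluster.
detect_drift() {
  if [ "$(spec_hash "$1")" = "$(spec_hash "$2")" ]; then
    echo "InSync"
  else
    echo "SpecDriftDetected"
  fi
}
```

Run periodically per CRD, the result maps directly onto the status condition described above.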
Alternatively — and this is the stronger recommendation — enforce git as the only write path: reject direct kubectl apply for LatticeService CRDs and require all changes through a GitOps pipeline or a CI-gated apply. This eliminates drift. The friction is real (a developer can’t make an emergency change without committing first), but the alternative is discovering during a disaster that your backup is 3 months stale. If git is your backup strategy, git must be your only write path. Most organizations that try “detect and warn” eventually fail — the warnings are ignored until the disaster forces the discovery.
Scenario: bilateral agreements fail during recovery. The operator restores 50 CRD specs simultaneously. Checkout depends on payments (bilateral agreement). The derivation pipeline processes checkout first. Payments hasn’t been created yet — the bilateral match fails. Checkout deploys but traffic to payments is denied.
Two minutes later, payments is created. The mesh-member controller matches the bilateral agreement. Policies are generated. Traffic flows.
During those two minutes, checkout was running without its payments connection. If checkout has a retry mechanism (which most HTTP clients do), it recovers automatically. If checkout crash-loops on the first failed request (a poorly written health check that requires payments connectivity), it’s in CrashLoopBackOff when payments arrives.
The lesson. Restoration order matters for services with tight dependencies. The platform can mitigate this by restoring all specs in a batch (the mesh-member controller reconciles bilateral agreements as each service becomes Ready), but the two-minute window is real. Services should tolerate dependency unavailability at startup — the same tolerance they need for rolling deployments and network partitions.
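The startup tolerance the lesson asks for can be as simple as bounded retry instead of hard failure on the first probe. A generic sketch; the timeout and the probe command are placeholders:

```shell
# Retry a readiness probe until it succeeds or a deadline passes, instead
# of crashing on the first failure while a dependency is still restoring.
wait_for() {
  local timeout="$1"; shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "dependency not ready after ${timeout}s" >&2
      return 1
    fi
    sleep 1
  done
}

# e.g. gate checkout's startup on payments being reachable:
#   wait_for 120 curl -sf https://payments.internal/healthz
```

The same wrapper covers rolling deployments and network partitions, which is the point: recovery should not need a dependency tolerance the service lacks in normal operation.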
20.9 Testing Recovery
If you haven’t restored from a backup, you don’t have a backup.
Monthly: Restore CRD specs to a test cluster. Verify the derivation pipeline produces correct output. Check that all services reach Ready. This validates: specs are complete, Cedar policies are present, secret backends are reachable, mesh policies compile.
Quarterly: Full cluster recovery. Provision a new cluster, restore CRD specs, restore PV snapshots, verify workloads run and serve traffic. This validates: the entire recovery path including infrastructure provisioning, data restoration, and DNS cutover.
Annually: Destroy a non-production cluster and recover it as a full disaster recovery exercise. This validates: team procedures, communication, and decision-making under pressure — not just technical mechanisms.
Automate the tests. Monthly and quarterly tests should be automated. A restore test that requires a human to schedule and run won’t happen consistently. Include restore validation in the E2E test suite (Chapter 21).
Treat restore test failures as incidents. If a monthly test fails — CRD specs are incomplete, a secret backend is unreachable, a mesh policy doesn’t compile — this is a production issue. The backup is broken. Fix it with the same urgency as a service outage, because the next real disaster will discover the same failure.
Exercises
20.1. [M10] A cluster is destroyed. CRD specs are in git. PV snapshots are in S3. The operator provisions a new cluster and restores. But the new cluster has different node IPs, different service endpoints, different mesh identities. What works automatically? What requires manual intervention? What is permanently lost?
20.2. [H30] Section 20.2 walks through a 40-minute recovery. During minutes 22-30, the derivation pipeline is processing 50 services. Some services depend on others (checkout depends on payments for bilateral agreements). Does the order of CRD restoration matter? What happens if checkout is derived before payments — does the bilateral agreement fail? Design the restoration ordering strategy.
20.3. [R] The derivation pipeline re-derives everything from CRD specs. But LatticePackage resources (Helm charts) aren’t derived by the pipeline — they’re installed by the package controller. How should the platform back up and restore packaged workloads? Is helm list + chart versions sufficient, or does the platform need to back up the full Helm release state?
20.4. [M10] RPO for PV data is determined by snapshot frequency. A team runs a database with 10 minutes of committed transactions per snapshot interval. A failure loses 10 minutes of data. Who decides the snapshot frequency — the platform team or the application team? Should the platform allow per-PV snapshot schedules?
20.5. [H30] The annual DR exercise destroys a non-production cluster. The team discovers: the CRD specs in git are 3 months stale (a developer applied changes with kubectl and never committed to git). 15 services have specs that differ from what’s in git. The restore produces the wrong services. Design the mitigation: how does the platform detect spec drift between the cluster and git? Should the platform enforce git as the source of truth?