Capstone Exercises

These exercises integrate concepts across multiple chapters. Each requires knowledge from an entire Part of the book. They are designed for study groups, courses, or readers who want to test their understanding beyond individual chapter exercises.

They are intentionally open-ended. Unlike the per-chapter exercises, they have no answer keys: the value is in the design process, not in a single correct answer.

Part I Capstone: Design a Platform CRD [H30]

Design a PlatformService CRD for an organization that runs a monolithic Java application being broken into microservices. The CRD must support:

Services with 1-50 replicas, each needing 256Mi-4Gi memory
PostgreSQL and Redis dependencies (managed by the platform, not the developer)
Three environments: dev (no HA, small resources), staging (HA, medium), production (HA, large, strict security)
A migration path from the existing Helm chart (200+ values) to the new CRD

Deliverables: (a) the CRD spec schema, (b) a sample spec for the checkout service in production, (c) the status conditions the developer would see, (d) the escape hatch mechanism for services that need raw Kubernetes fields.

Evaluate your design against Chapter 4’s principles: minimal surface, structural constraints, status as documentation.
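One way to make the "structural constraints" principle concrete is to write the bounds from the requirements as validation logic before translating them into a CRD schema. The following is a minimal Python sketch, not the book's schema: the field names (replicas, memory, environment) and the validate_spec helper are assumptions made for illustration. Only the numeric bounds and the three environment names come from the exercise.

```python
import re

# Bounds taken from the exercise statement.
MIN_REPLICAS, MAX_REPLICAS = 1, 50
MIN_MEMORY_MI, MAX_MEMORY_MI = 256, 4 * 1024  # 256Mi .. 4Gi
ENVIRONMENTS = {"dev", "staging", "production"}

_MEMORY_RE = re.compile(r"^(\d+)(Mi|Gi)$")


def memory_to_mi(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '2Gi' to mebibytes.

    Only the Mi and Gi suffixes are handled; this is a simplification,
    not a full Kubernetes quantity parser.
    """
    match = _MEMORY_RE.match(quantity)
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    value, unit = int(match.group(1)), match.group(2)
    return value * 1024 if unit == "Gi" else value


def validate_spec(spec: dict) -> list[str]:
    """Return a list of human-readable violations (empty means valid).

    The keys 'replicas', 'memory' and 'environment' are hypothetical.
    """
    errors = []
    replicas = spec.get("replicas")
    if not isinstance(replicas, int) or not MIN_REPLICAS <= replicas <= MAX_REPLICAS:
        errors.append(f"replicas must be an integer in [{MIN_REPLICAS}, {MAX_REPLICAS}]")
    try:
        memory_mi = memory_to_mi(str(spec.get("memory", "")))
        if not MIN_MEMORY_MI <= memory_mi <= MAX_MEMORY_MI:
            errors.append("memory must be between 256Mi and 4Gi")
    except ValueError as exc:
        errors.append(str(exc))
    if spec.get("environment") not in ENVIRONMENTS:
        errors.append(f"environment must be one of {sorted(ENVIRONMENTS)}")
    return errors
```

Each check here corresponds to something a schema could enforce at admission time (a minimum and maximum, a pattern, an enum). Rules that cannot be expressed that way are candidates for status conditions instead.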

Part II Capstone: Disaster Recovery Plan [H30]

A self-managing cluster running 30 services loses its control plane (all 3 etcd members corrupted). Write the recovery plan:

What is recovered from etcd snapshots vs re-derived from CRD specs in git?
What is the expected RTO? Walk through each phase with time estimates.
What cross-cluster bilateral agreements break during recovery, and how are they re-established?
What data is permanently lost?

Assume: CRD specs are in git (current), PV snapshots are hourly, the cluster was self-managing (post-pivot), and 5 services had cross-cluster dependencies through a parent cluster.

Reference: Chapters 5, 6, 7, 15, 20.
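The RTO question reduces to bookkeeping once the phases are named: phases that run one after another add up, and the hourly PV snapshots bound the data-loss window. The Python sketch below shows only that bookkeeping. The phase names and every duration are placeholder assumptions to be replaced with your own estimates; only the hourly snapshot interval comes from the exercise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    minutes: int  # your own estimate; the values below are placeholders


def total_rto_minutes(phases: list[Phase]) -> int:
    """RTO for phases executed strictly one after another."""
    return sum(phase.minutes for phase in phases)


def worst_case_pv_loss_minutes(snapshot_interval_minutes: int) -> int:
    """Upper bound on lost PV writes.

    The worst case is a failure that lands just before the next snapshot,
    so the bound equals the snapshot interval.
    """
    return snapshot_interval_minutes


# Hypothetical phases and durations, for illustration only.
PLAN = [
    Phase("provision replacement control plane", 20),
    Phase("restore or re-derive cluster state", 15),
    Phase("re-apply CRD specs from git", 10),
    Phase("restore PVs from snapshots", 30),
    Phase("re-establish cross-cluster agreements", 10),
]

if __name__ == "__main__":
    for phase in PLAN:
        print(f"{phase.minutes:>4} min  {phase.name}")
    print(f"{total_rto_minutes(PLAN):>4} min  total (sequential)")
    print(f"PV data loss is at most {worst_case_pv_loss_minutes(60)} minutes")
```

If your plan runs some phases in parallel, the sum overstates the RTO; replace it with the length of the longest dependency chain.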

Part III Capstone: Trace a Secret End-to-End [H30]

A developer adds this line to their LatticeService spec:

variables:
  DATABASE_URL: "postgres://app:${resources.orders-db.password}@db:5432/orders"

Trace every step from spec to running pod:

Which Cedar gate fires? What does the authorization request look like?
Which secret routing path does the pipeline choose? Why?
What ExternalSecret is produced? What ClusterSecretStore does it reference?
How does ESO sync the secret? What Kubernetes Secret is created?
How does the Deployment reference the secret? Environment variable or volume mount?
When the secret rotates in Vault, what happens? Does the pod restart?

Reference: Chapters 8, 9, 13.
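Before any gate can fire, something has to notice that the value contains a secret reference. As a starting point for the trace, the Python sketch below extracts ${resources.<name>.<field>} placeholders from a variable value with a regular expression. The placeholder grammar is an assumption inferred from the single example in this exercise, and the function names are hypothetical; the platform's actual parser may accept other forms.

```python
import re
from typing import NamedTuple


class ResourceRef(NamedTuple):
    resource: str  # e.g. "orders-db"
    field: str     # e.g. "password"


# Assumed grammar: ${resources.<resource>.<field>}, where both names may
# contain letters, digits, underscores and hyphens.
_REF_RE = re.compile(r"\$\{resources\.([A-Za-z0-9_-]+)\.([A-Za-z0-9_-]+)\}")


def find_resource_refs(value: str) -> list[ResourceRef]:
    """Return every resource reference embedded in a variable value, in order."""
    return [ResourceRef(resource, field) for resource, field in _REF_RE.findall(value)]


def is_mixed_content(value: str) -> bool:
    """True when a reference is surrounded by literal text.

    A value that is exactly one reference returns False; a value such as a
    connection string with an embedded reference returns True.
    """
    return bool(_REF_RE.search(value)) and _REF_RE.fullmatch(value) is None


if __name__ == "__main__":
    url = "postgres://app:${resources.orders-db.password}@db:5432/orders"
    print(find_resource_refs(url))
    print(is_mixed_content(url))
```

The DATABASE_URL in this exercise mixes literal text with a reference. Whether that distinction influences the routing path is part of what the trace should establish.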

Part IV Capstone: Attack Surface Analysis [R]

A compromised container in the checkout service has shell access. The attacker can execute arbitrary commands inside the container. Analyze what they can and cannot do under the platform’s security model:

What services can checkout reach? (Bilateral agreements, Chapter 13)
What secrets can the attacker read from the container’s environment? (Chapter 9)
Can the attacker execute a binary not in the original image? (Tetragon, Chapter 14)
Can the attacker establish an outbound connection to an attacker-controlled server? (External service egress, Chapter 13)
If the attacker modifies the CiliumNetworkPolicy to allow more egress, what happens? (Reconciliation, Chapter 12)
What is the actual blast radius? Is default-deny’s containment meaningful, or does the attacker already have everything they need through checkout’s declared dependencies?

Reference: Chapters 9, 12, 13, 14.

Part V Capstone: Mesh Failure Analysis [H30]

A workload cluster loses its ztunnel pods on 3 of 10 nodes simultaneously (a node kernel update gone wrong). Walk through which services are affected, what traffic fails (cross-node vs same-node, STRICT vs PERMISSIVE), how the compliance controller detects the failure, and the recovery procedure. Then suppose a cross-cluster bilateral agreement exists between this cluster and a sibling: does cross-cluster traffic to the affected nodes fail differently from intra-cluster traffic?

Reference: Chapters 12, 14, 15.
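A quick count helps size the failure before reasoning about modes. The Python sketch below enumerates every ordered (source node, destination node) pair in a 10-node cluster and counts how many touch one of the 3 failed nodes. It rests on a simplifying assumption that is mine, not the book's: a pair counts as affected whenever either endpoint's node has no ztunnel. It says nothing about how STRICT and PERMISSIVE differ, which remains the substance of the exercise.

```python
from itertools import product


def affected_pairs(total_nodes: int, failed_nodes: int) -> dict[str, tuple[int, int]]:
    """Count node pairs that touch a failed node.

    Returns {"same_node": (affected, total), "cross_node": (affected, total)}.
    Which particular nodes fail does not change the counts, so the first
    `failed_nodes` node indices are used.
    """
    failed = set(range(failed_nodes))
    counts = {"same_node": [0, 0], "cross_node": [0, 0]}  # [affected, total]
    for src, dst in product(range(total_nodes), repeat=2):
        kind = "same_node" if src == dst else "cross_node"
        counts[kind][1] += 1
        if src in failed or dst in failed:
            counts[kind][0] += 1
    return {kind: (affected, total) for kind, (affected, total) in counts.items()}


if __name__ == "__main__":
    for kind, (affected, total) in affected_pairs(10, 3).items():
        print(f"{kind}: {affected} of {total} node pairs touch a failed node")
```

Under this assumption, losing 30% of the nodes touches more than half of the cross-node pairs (48 of 90), which is why the services affected are not limited to those with pods on the three failed nodes.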

Part VI Capstone: Design a Training Job [H30]

An ML team wants to run a distributed training job with these requirements:

4 worker nodes, each with 2 A100 GPUs (8 GPUs total)
The model has 30B parameters in FP16 (60 GB of weights, which fit on 2 GPUs with tensor parallelism)
Training data is in S3 (s3://datasets/imagenet-2024/)
Checkpoints should be saved every 2 hours to S3
The job should take ~12 hours

Write: (a) the LatticeJob spec, (b) the Volcano resources the pipeline would derive, (c) the Cedar policies needed (image deployment, S3 egress), (d) what happens if one GPU develops ECC errors at hour 6.

Reference: Chapters 13, 14, 16, 17.
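The sizing figures in the requirements can be checked with a few lines of arithmetic before writing the LatticeJob spec. The Python sketch below reproduces the 60 GB figure, the total GPU count, and the checkpoint schedule, and bounds the work lost to a failure at hour 6. The helper names are hypothetical. The 60 GB covers model weights only; gradients and optimizer state need additional memory during training.

```python
PARAMS = 30e9               # 30B parameters (from the exercise)
BYTES_PER_PARAM = 2         # FP16 stores each parameter in 2 bytes
NODES, GPUS_PER_NODE = 4, 2
JOB_HOURS, CHECKPOINT_EVERY_HOURS = 12, 2


def weight_gigabytes(params: float, bytes_per_param: int) -> float:
    """Size of the model weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


def checkpoint_hours(job_hours: int, every: int) -> list[int]:
    """Hours (since job start) at which a checkpoint is scheduled."""
    return list(range(every, job_hours + 1, every))


def lost_hours_range(failure_hour: int, every: int) -> tuple[int, int]:
    """(best, worst) hours of work lost for a failure at `failure_hour`.

    Best case: a checkpoint scheduled at the failure hour completed.
    Worst case: that checkpoint had not been written yet, so the job
    falls back to the previous one.
    """
    best = failure_hour % every
    worst = best if best else every
    return best, worst


if __name__ == "__main__":
    print("weights (GB):", weight_gigabytes(PARAMS, BYTES_PER_PARAM))
    print("total GPUs:", NODES * GPUS_PER_NODE)
    print("checkpoints at hours:", checkpoint_hours(JOB_HOURS, CHECKPOINT_EVERY_HOURS))
    print("hours lost at hour 6 (best, worst):", lost_hours_range(6, CHECKPOINT_EVERY_HOURS))
```

A failure at hour 6 coincides with a scheduled checkpoint, so the answer to (d) depends on whether that checkpoint finished uploading: between 0 and 2 hours of training are lost.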

Part VII Capstone: Build the Monitoring Stack [H30]

A new cluster is provisioned and bootstrapped. List every observability resource the platform creates automatically (no developer action):

What DaemonSets are deployed? (VMAgent, DCGM exporter, Fluent Bit, ztunnel)
What scrape targets exist before any service is deployed?
When the first LatticeService is deployed, what observability resources are derived?
A service has a latency spike at 3 AM. Walk through what data is available to the on-call engineer without any application-level instrumentation.

Reference: Chapters 5, 15, 17, 19.

Part VIII Capstone: CRD Migration Plan [H30]

The platform has been running for 18 months. The platform team proposes upgrading the LatticeService CRD from v1alpha1 to v1beta1: removing 3 fields, renaming 2, and adding schema validation that rejects specs using the old field names. Of the 200 services, 180 will auto-migrate; the other 20 use the removed fields through escape hatches. Write the migration plan: timeline, tooling, communication, fallback, and what happens to the 20 holdout services.

Reference: Chapters 4, 11, 23.
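The "tooling" deliverable usually starts with a dry-run classifier that separates specs that can be rewritten mechanically from those that need a human decision. The Python sketch below is one possible shape for it. Every field name in it is hypothetical, since the exercise does not say which fields are renamed or removed, and it only looks at top-level keys.

```python
import copy

# Hypothetical field names, for illustration only.
RENAMED = {"replicaCount": "replicas", "envVars": "variables"}
REMOVED = {"rawPodSpec", "hostNetwork", "legacyIngress"}


def migrate(spec: dict) -> tuple[dict, list[str]]:
    """Return (spec, blockers) for a v1alpha1-style spec.

    If the spec uses any removed field, it is returned unchanged together
    with the sorted list of offending fields. Otherwise the renamed fields
    are rewritten and the blocker list is empty.
    """
    blockers = sorted(REMOVED.intersection(spec))
    if blockers:
        # Do not guess at intent: leave the spec alone and report why.
        return copy.deepcopy(spec), blockers

    migrated = {}
    for key, value in spec.items():
        new_key = RENAMED.get(key, key)
        if new_key != key and new_key in spec:
            raise ValueError(f"spec sets both {key!r} and {new_key!r}")
        migrated[new_key] = copy.deepcopy(value)
    return migrated, []


if __name__ == "__main__":
    print(migrate({"replicaCount": 3, "image": "checkout:1.4"}))
    print(migrate({"replicaCount": 3, "hostNetwork": True}))
```

Running a classifier like this over all 200 specs before announcing a timeline turns the 180/20 split from an estimate into a list of named services and named blockers, which is what the communication plan needs.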

Full Book Capstone: The Platform Pitch [R]

You’re the tech lead proposing this platform architecture to your VP of Engineering. They ask:

“What’s the ROI? We have 6 engineers and 200 services on Helm charts that work.”
“What’s the migration path? We can’t rewrite everything.”
“What if Istio/Cilium/Cedar gets abandoned? How coupled are we?”
“Our compliance team needs SOC 2 and NIST 800-53. Does this help or hurt?”
“What’s the blast radius if YOUR platform has a bug?”

Write the 2-page memo that answers these questions. Reference specific chapters for each claim. The VP has read none of the book — your memo must stand alone.

Reference: All chapters, especially 1, 2, 11, 12, 14, 20, 21, 23.