
Chapter 8: The Derivation Pipeline

Without a derivation pipeline, deploying a service looks like this: the developer writes a Deployment manifest, then a Service, then a NetworkPolicy (if they remember), then an ExternalSecret for their database password (if they know how), then a ServiceMonitor (if they know it exists), then a PodDisruptionBudget (if they’ve heard of one). Each resource is a separate YAML file, independently maintained, independently correct — or not.

With a derivation pipeline, the developer writes a 22-line spec. The platform produces all of those resources — correctly, consistently, with security and observability guaranteed. The gap between those two experiences is the pipeline.

Building a pipeline forces five design decisions that every platform builder faces:

  • When do you authorize? (Before producing resources, or after?)
  • What authorization questions do you ask? (How many gates, and what triggers them?)
  • Complete output or partial? (All resources or just the ones that passed?)
  • What order do you apply? (All at once, or in dependency layers?)
  • Shared logic or separate compilers? (One compiler per workload type, or a shared core?)

These decisions are not Lattice-specific. Any platform that derives infrastructure from intent must answer them. This chapter explains the reasoning behind each, with the reference implementation as one valid set of answers.

8.1 The Decision: When to Authorize?

The first question is: when does the platform decide whether a service is allowed to deploy?

Option 1: At runtime. The platform produces Kubernetes resources. A policy engine (OPA Gatekeeper, Kyverno) evaluates them at admission time. If the Deployment references an unsigned image, Gatekeeper rejects it.

The problem is architectural, not just cosmetic. If the image gate passes but the secret gate fails at runtime, the Deployment already exists — the service is running without its secret. A partially-authorized service is live in the cluster. The platform produced a Deployment but Gatekeeper blocked its ExternalSecret. The developer sees a Gatekeeper denial on a resource they didn’t write, and meanwhile their service is running in a broken state.

Option 2: At derivation time. Before producing any resources, the platform evaluates authorization gates against the CRD spec. Is this service permitted to deploy this image? To access this secret? To call this external endpoint? If any gate denies, the pipeline stops. No resources are produced. Partially-authorized states are impossible — the Deployment without its ExternalSecret, the service without its network policies — these states can never exist.

The reference implementation uses Option 2. Authorization happens after the spec is validated (admission webhook) and before any Kubernetes resources are generated. The denial appears on the CRD’s status — the resource the developer wrote — with a clear message about which gate denied and why.

This is a fundamental design decision. It determines whether partially-authorized services can exist in the cluster. With runtime authorization (Option 1), they can — for the window between resource creation and policy evaluation. With derivation-time authorization (Option 2), they can’t.
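The ordering can be sketched in a few lines. This is a minimal illustration, not the reference implementation's API: the gate functions, spec shape, and status fields are all hypothetical.

```python
# Hypothetical sketch of derivation-time authorization (Option 2).
# Gate functions, spec shape, and status fields are illustrative only.

def evaluate_gates(spec, gates):
    """Run every gate against the spec; collect all denials."""
    return [g.__name__ for g in gates if not g(spec)]

def derive(spec, gates, compile_resources):
    """All-or-nothing: no resources are produced unless every gate permits."""
    denials = evaluate_gates(spec, gates)
    if denials:
        # The pipeline stops here: zero resources created, denial reported
        # on the status of the CRD the developer actually wrote.
        return {"phase": "Failed", "denied": denials, "resources": []}
    return {"phase": "Ready", "denied": [], "resources": compile_resources(spec)}

# Example gates keyed off spec features (hypothetical policy logic).
def deploy_image(spec):
    return all(img.startswith("registry.internal/") for img in spec["images"])

def access_secret(spec):
    allowed = {"orders-db", "stripe"}
    return set(spec.get("secrets", [])) <= allowed

spec = {"images": ["registry.internal/checkout@sha256:abc"], "secrets": ["orders-db"]}
result = derive(spec, [deploy_image, access_secret],
                lambda s: ["Deployment", "Service", "ExternalSecret"])
```

The point of the shape is the early return: a denial prevents `compile_resources` from ever running, so the partially-authorized states described above cannot be constructed.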

8.2 The Decision: What Authorization Questions to Ask


A platform that authorizes at derivation time needs to decide what questions to ask. Too few gates and the platform is permissive. Too many and every spec change is a policy evaluation that slows derivation.

The reference implementation asks six questions, each triggered only when the spec uses the relevant feature:

  1. Can this service deploy this image? (DeployImage) — Fires for every container. Checks signature verification and Cedar policy. The first gate because it’s cheapest and most important.
  2. Can this service use a tag reference? (AllowTagReference) — Fires only for tag-based image references. Default deny — digests required.
  3. Can this service access this secret? (AccessSecret) — Fires for each type: secret resource. Default deny.
  4. Can this service use privileged capabilities? (OverrideSecurity) — Fires only when the spec requests capabilities, privilege escalation, or AppArmor overrides. Most services never trigger this.
  5. Can this service call this external host? (AccessExternalEndpoint) — Fires for type: external-service resources.
  6. Can this service mount this shared volume? (AccessVolume) — Fires for shared volumes. Requires both Cedar authorization AND owner consent.

Each gate is a focused question. Each defaults to deny. The developer’s spec determines which gates fire — a service with no secrets never triggers AccessSecret. This is the opt-in security model from Part I applied at the authorization layer: the developer doesn’t configure security, but the features they use trigger the appropriate checks.

The design principle: the gate fires when the spec uses the feature, not when the feature is configured. The developer doesn’t enable AccessSecret. They declare a secret dependency, and the gate fires as a consequence.

If you’re designing your own gates, the process is: look at what your spec can express. If the spec can declare secrets, you need a gate for secret access. If it can reference images, you need a gate for image deployment. If it can request capabilities, you need a gate for security overrides. The gates mirror the spec’s feature surface. A feature without a gate is a feature without authorization. The six gates above are the reference implementation’s answer to its spec’s features — your spec may need different or additional gates.
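Under that principle, gate triggering can be sketched as a pure function of the spec. The gate names follow the chapter; the spec fields are hypothetical.

```python
# Illustrative sketch: which gates fire is a function of the spec's contents.
# Gate names follow the chapter; the spec field names are hypothetical.

def triggered_gates(spec):
    gates = ["DeployImage"]                      # fires for every container
    for ref in spec.get("images", []):
        if "@sha256:" not in ref:                # tag reference, not a digest
            gates.append("AllowTagReference")
            break
    if spec.get("secrets"):
        gates.append("AccessSecret")
    security = spec.get("security", {})
    if security.get("capabilities") or security.get("allowPrivilegeEscalation"):
        gates.append("OverrideSecurity")
    if spec.get("externalServices"):
        gates.append("AccessExternalEndpoint")
    if spec.get("sharedVolumes"):
        gates.append("AccessVolume")
    return gates

# A service with no secrets never triggers AccessSecret:
plain = {"images": ["checkout@sha256:abc123"]}
assert triggered_gates(plain) == ["DeployImage"]
```

Nothing in the spec enables a gate; declaring the feature is what makes the gate fire.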

8.3 The Decision: Complete Output or Partial?


When the pipeline produces resources, should it produce all of them or allow partial output?

The argument for partial output. If the image gate passes but the secret gate fails, produce the Deployment (the image is fine) but skip the ExternalSecret (the secret is denied). The service is “partially deployed” — it runs but without its secret.

The argument against. A Deployment without its ExternalSecret is a broken service. A Deployment without its LatticeMeshMember has no network policies — it’s unsecured. A Deployment without its VMServiceScrape is unobservable. Partial output produces services that are deployed but incorrect — and incorrectness in security is worse than non-deployment.

A third option: produce what’s authorized, skip what’s not, report both. The Deployment is created (the image passed). The ExternalSecret is skipped (the secret was denied). The status says: “Deployment created, but ExternalSecret for payments/stripe-key denied — service running without Stripe integration.” This is transparent and partially functional.

The problem: a Deployment without its LatticeMeshMember has no network policies. The service is running and reachable by any pod in the cluster — it’s unsecured. Transparency about the gap doesn’t close the gap. A status message saying “network policies were not generated” doesn’t make the service secure. Partial output with disclosure is still partial output with security gaps.

The reference implementation uses all-or-nothing. Every gate must pass. Every resource is produced. Or nothing is produced and the status reports why. The developer sees a clean failure, not a partially-functional service with disclosed gaps.

This has a real cost — and the cost is sharper than it sounds. Consider: a service is down in production. The fix is a one-line image bump. But Vault is having a blip, and the AccessSecret gate can’t reach it. The all-or-nothing pipeline blocks the emergency fix because a gate that’s unrelated to the change is failing. The SRE on call has a working fix and a platform that won’t let them deploy it.

This is the tension between purity and recovery. The reference implementation’s answer: the pipeline is all-or-nothing, but the pipeline isn’t the only path. In an emergency, the SRE can kubectl edit the Deployment directly — the image changes immediately, the service recovers, and the reconciliation loop will overwrite the edit on the next successful derivation cycle. This is deliberate: direct edits are escape hatches for emergencies, and the reconciliation loop self-heals the drift once the underlying issue (Vault blip) resolves. The platform should log and alert on direct edits so the team knows the escape hatch was used.

A gap worth acknowledging: without a ValidatingAdmissionPolicy (or equivalent webhook) rejecting resources created outside the derivation pipeline, kubectl apply can create Deployments, NetworkPolicies, or any other resource directly — bypassing Cedar gates, bilateral agreements, and image verification entirely. The reconciliation loop will overwrite resources it owns, but it won’t delete resources it never created. Closing this gap — rejecting resources that lack the pipeline’s ownership labels — is on the roadmap but not yet implemented in the reference architecture.

The broader safeguards against a fleet-blocking policy bug: Cedar policy validation at admission (catch syntax errors before they’re applied), dry-run evaluation against the current fleet before activating new policies, and fast rollback — delete the broken policy CRD and every service re-derives within 60 seconds.

8.4 The Decision: What Order to Apply?

Kubernetes resources have dependencies. A Deployment references ConfigMaps and Secrets — those must exist before the Deployment is created, or the pods crash-loop. A KEDA ScaledObject references a Deployment — the Deployment must exist first.

Option 1: Apply everything simultaneously. Fast. Let Kubernetes sort out the ordering through eventual consistency. The Deployment crashes because the Secret doesn’t exist yet; the kubelet retries; the Secret is created a few seconds later; the retry succeeds.

Option 2: Apply in dependency layers. Slower. But no crash-loops, no retry noise, no “CreateContainerConfigError” events cluttering the pod’s event log.

The reference implementation uses three layers:

```mermaid
graph TD
    Spec[LatticeService Spec] --> Auth[Authorization Gates<br/>Image, Secret, Security, Egress]
    Auth --> L1[Layer 1: Infrastructure<br/>ServiceAccount, ConfigMaps, Secrets,<br/>PVCs, ExternalSecrets, MeshMember,<br/>TracingPolicyNamespaced, PDB, Service]
    L1 --> Wait1[Wait: ESO sync<br/>2s poll, 120s timeout]
    Wait1 --> L2[Layer 2: Workload<br/>Deployment]
    L2 --> L3[Layer 3: Scaling<br/>ScaledObject]
    L3 --> Wait2[Wait: MeshMember Ready<br/>Network policies in place]
    Wait2 --> Ready[Phase: Ready]
    Auth -->|any gate denies| Failed[Phase: Failed<br/>Zero resources created]
```

Layer 1: Infrastructure. ServiceAccount, ConfigMaps, Secrets, PVCs, ExternalSecrets, LatticeMeshMember, TracingPolicyNamespaced, PodDisruptionBudget, Service. Everything the Deployment’s pods need to start, plus the resources that don’t depend on the Deployment existing.

Between Layer 1 and Layer 2, the pipeline waits for ExternalSecrets to sync. ESO must fetch the secret from the backend (Vault, AWS) and create the Kubernetes Secret. This takes seconds if the backend is healthy, up to 120 seconds if it’s slow. The pipeline checks on each reconciliation cycle (requeue after 2 seconds) — this is event-driven through the controller’s requeue mechanism, not a busy-wait loop. The controller yields between checks, so it’s not consuming CPU while waiting.
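A sketch of that wait, assuming a simplified reconcile signature (the real controller's API will differ): each call checks once and returns a requeue interval instead of blocking.

```python
# Sketch of the requeue-based wait between Layer 1 and Layer 2.
# The function signature is a simplified stand-in for a controller reconcile.

ESO_POLL = 2       # seconds between checks (requeue interval)
ESO_TIMEOUT = 120  # give up after this many seconds

def reconcile_secrets(secret_synced, waited_seconds):
    """Return (done, requeue_after_seconds, status).

    The controller yields between calls; no CPU is burned while waiting."""
    if secret_synced:
        return True, 0, "SecretsReady"
    if waited_seconds >= ESO_TIMEOUT:
        return True, 0, "Failed: ExternalSecret sync timed out after 120s"
    return False, ESO_POLL, "WaitingForExternalSecret"
```

Each reconciliation checks once and asks to be re-invoked; the blocking loop lives in the controller runtime, not in the pipeline.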

Layer 2: Workload. The Deployment. Created after infrastructure is in place.

Layer 3: Scaling. The KEDA ScaledObject. Created after the Deployment exists.

After Layer 3, the pipeline waits for the LatticeMeshMember to reach Ready — confirming that network policies are generated. The service doesn’t transition to Ready until its security infrastructure is in place. Security is a precondition, not an eventual property.

The trade-off: layered application adds 30-60 seconds to deployment for services with Vault-backed secrets and bilateral agreements. A simple service with no secrets deploys in seconds. The alternative (apply simultaneously) is faster but produces crash-loops and race conditions that are harder to debug than a 30-second wait.
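The layering itself can be sketched as a fixed ordering over resource kinds. The kinds follow the chapter; the apply and wait hooks are stand-ins.

```python
# Sketch: apply resources in dependency layers rather than all at once.
# Resource kind names follow the chapter; apply/wait are caller-supplied hooks.

LAYERS = [
    # Layer 1: everything the pods need before the Deployment exists.
    ["ServiceAccount", "ConfigMap", "Secret", "PVC", "ExternalSecret",
     "LatticeMeshMember", "TracingPolicyNamespaced", "PodDisruptionBudget",
     "Service"],
    # Layer 2: the workload itself.
    ["Deployment"],
    # Layer 3: scaling, which references the Deployment.
    ["ScaledObject"],
]

def apply_in_layers(resources, apply, wait_after_layer):
    """Apply in layer order so a resource never precedes its dependencies."""
    applied = []
    for i, layer in enumerate(LAYERS, start=1):
        for kind in layer:
            if kind in resources:
                apply(kind)
                applied.append(kind)
        wait_after_layer(i)   # e.g. wait for ESO sync after Layer 1
    return applied
```

A derived resource set flows through in order: the ExternalSecret always lands before the Deployment, and the ScaledObject always lands last.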

8.5 The Decision: Shared Logic or Separate Compilers


A platform with multiple workload types (services, batch jobs, inference models) faces a design choice: does each type have its own compiler, or do they share a core?

Separate compilers. Each CRD controller has its own derivation logic. Simple to understand — each controller is self-contained. Easier to evolve independently — changing the job compiler doesn’t risk breaking the service compiler. But secret resolution logic, authorization gates, environment compilation, and mesh member generation are duplicated across all controllers. A bug fix in secret resolution requires fixing it in three places. A new authorization gate must be added to three controllers — and if one is missed, that workload type has a security gap.

Shared core with type-specific wrappers. A shared WorkloadCompiler handles everything common: containers, secrets, authorization, environment variables, file mounts, mesh integration. Each CRD controller wraps the shared output in its type-specific resources: LatticeService wraps in a Deployment + PDB + ScaledObject. LatticeJob wraps in a Volcano VCJob. LatticeModel wraps in disaggregated Deployments + routing config.

The reference implementation uses the shared core. A bug fix in the shared compiler fixes it for all workload types. A new authorization gate protects all types. The shared layer is the enforcement point — where the platform’s security guarantees are implemented once and applied universally.

The design principle: security and networking belong in the shared layer. Workload-specific lifecycle belongs in the wrapper. PodDisruptionBudgets are service-specific (jobs don’t need them). Gang scheduling is job-specific (services don’t need it). Model routing is model-specific. But image verification, Cedar authorization, secret resolution, and bilateral network agreements are universal.
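A sketch of that boundary, with hypothetical function names: the shared core compiles what is universal, and each wrapper adds only its type's lifecycle resources.

```python
# Illustrative sketch of the shared-core design. Function names and
# resource shapes are hypothetical, not the reference implementation's API.

def compile_workload(spec):
    """Shared core: the universal parts, enforced once for every type."""
    return {
        "podTemplate": {"containers": spec["containers"]},
        "meshMember": {"peers": spec.get("dependencies", [])},
        "secrets": spec.get("secrets", []),
    }

def wrap_service(core, spec):
    # Service-specific lifecycle: long-running workload, availability protection.
    return ["Deployment", "Service", "PodDisruptionBudget"] + (
        ["ScaledObject"] if spec.get("autoscaling") else [])

def wrap_job(core, spec):
    # Job-specific lifecycle: gang-scheduled batch run; no PDB, no Service.
    return ["VCJob"]

spec = {"containers": ["checkout"], "dependencies": ["payments"],
        "secrets": ["stripe"]}
core = compile_workload(spec)
```

A security fix in `compile_workload` reaches every wrapper; a lifecycle change in one wrapper cannot break another.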

8.6 Putting It Together: A Service Through the Pipeline


Walk through the checkout service spec from the opening through all five decisions.

The spec. 22 lines: 1 container, 2 secrets, 1 service dependency, 3 replicas.

Authorization (Section 8.1-8.2). Image verification: checkout@sha256:abc123 is checked against TrustPolicy. Cedar DeployImage permits. AccessSecret for orders-db and stripe — Cedar permits both. No security overrides, no external endpoints, no shared volumes — those gates don’t fire. All gates pass.

Derivation (Section 8.5). The shared WorkloadCompiler processes: template rendering resolves ${resources.orders-db.password} and ${resources.stripe.api-key}. Environment compilation produces ConfigMaps (non-sensitive vars) and Secrets (sensitive vars). The DATABASE_URL with mixed content — a connection string where the password is a secret but the host and port are static — routes through an ESO-templated ExternalSecret that interpolates the secret value into the larger string (Chapter 9 covers the five routing paths: pure secret, mixed-content, file mount, image pull credentials, and bulk extraction). The LatticeMeshMember is generated with payments: outbound.
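The mixed-content routing decision can be sketched as a classification over template values. The ${resources.*} syntax follows the chapter; the routing labels are illustrative.

```python
# Sketch: classify a template value by how much of it is secret material.
# The ${resources.<name>.<key>} syntax follows the chapter; labels are
# illustrative names for the routing paths, not platform API.
import re

REF = re.compile(r"\$\{resources\.([\w-]+)\.([\w-]+)\}")

def classify(value):
    refs = REF.findall(value)
    if not refs:
        return "static"        # no secret material: plain env var / ConfigMap
    if REF.fullmatch(value):
        return "pure-secret"   # whole value is one secret: plain ExternalSecret
    return "mixed-content"     # secret embedded in a larger string: an
                               # ESO-templated ExternalSecret interpolates it
```

The DATABASE_URL case lands in the third branch: the host and port are static, but the interpolated password makes the whole value sensitive, so the entire string is routed through the templated ExternalSecret.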

Wrapping. The service compiler wraps the shared output: Deployment (with the compiled pod template), Service (with ports), ServiceAccount, PodDisruptionBudget (maxUnavailable: 1 for 3 replicas), VMServiceScrape, TracingPolicyNamespaced.

Application (Section 8.4). Layer 1: ConfigMaps, Secrets, ExternalSecrets, LatticeMeshMember, TracingPolicyNamespaced, PDB, ServiceAccount, Service. Wait for ESO to sync. Layer 2: Deployment. Layer 3: no ScaledObject (no autoscaling in the spec). Wait for LatticeMeshMember Ready.

Timing. Authorization: ~50ms (Cedar evaluation is fast). Derivation: ~200ms (template rendering, environment compilation, mesh member generation). Layer 1 apply: ~2s (create infrastructure resources). ESO sync: ~8s (Vault responds, Secret is created). Layer 2 apply: ~1s (create Deployment). Mesh readiness: ~5s (mesh-member controller reconciles, generates CiliumNetworkPolicy and AuthorizationPolicy). Total: ~16 seconds from spec apply to Ready. For a service with a slow secret backend (Vault under load), the ESO sync can take 30-60 seconds — the dominant latency.

Result. 13 resources (Deployment, Service, ServiceAccount, ConfigMap, ExternalSecret ×2, CiliumNetworkPolicy ×2, AuthorizationPolicy ×2, VMServiceScrape, PodDisruptionBudget, TracingPolicyNamespaced). The status reports phase: Ready, compiledResources: 13. The checkout service is running with network policies, observability, availability protection, and runtime enforcement — none of which the developer configured.

Two failure scenarios, both important:

Authorization denial. The developer references a secret they’re not authorized for. The pipeline stops at the AccessSecret gate. Status: phase: Failed, conditions: [PolicyAuthorized: False, reason: AccessSecret denied for secret payments/stripe-webhook-secret]. Zero resources created. The developer sees the exact denial and knows what to do.

Update failure. A running service is updated with a spec that fails authorization. The pipeline fails the new derivation. The existing resources continue running — the previous deployment is not torn down. Status shows the new generation failed; the running resources reflect the previous generation. A typo in a spec update doesn’t take down production.
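That keep-the-previous-generation behavior can be sketched as follows; the status fields are illustrative, not the CRD's actual schema.

```python
# Sketch: a failed re-derivation must not tear down the running generation.
# Field names are illustrative, not the CRD's actual status schema.

def reconcile_update(state, new_spec, derive):
    """state holds the last successfully derived generation's resources."""
    result = derive(new_spec)
    if result["phase"] == "Failed":
        # Keep serving from the previous generation; report the failure.
        state["status"] = {"phase": "Failed",
                           "observedGeneration": new_spec["generation"],
                           "servingGeneration": state["generation"]}
        return state
    state.update(generation=new_spec["generation"],
                 resources=result["resources"],
                 status={"phase": "Ready",
                         "servingGeneration": new_spec["generation"]})
    return state
```

The status deliberately carries two generation numbers: the one that was observed (and failed) and the one that is actually serving traffic.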

The theoretical failure modes (authorization denial, update failure) are clean. Production failures are messier.

Scenario: thundering herd after a policy change. A Cedar policy update touches a gate that every service triggers — DeployImage, for example. Every service is re-derived on the next reconciliation cycle. With 500 services on 60-second intervals, the controller attempts 500 derivations in rapid succession: Cedar evaluation, image verification, API server writes for every service. This can overwhelm the API server and spike controller memory. The mitigation is jittered reconciliation — spread requeues over the reconciliation window rather than requeuing all at once — and rate limiting on API server writes. The controller should treat fleet-wide re-derivation as a rollout, not a stampede.
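Jittered requeue is a small change in the reconcile path; the constants here are illustrative.

```python
# Sketch of jittered reconciliation: spread requeues across the window so a
# fleet-wide policy change becomes a rollout, not a stampede.
# The interval and jitter fraction are illustrative constants.
import random

BASE_INTERVAL = 60.0   # seconds between reconciliations
JITTER = 0.5           # +/- 50% of the interval

def next_requeue(rng=random):
    """Uniformly jittered interval in [30s, 90s] instead of exactly 60s."""
    return BASE_INTERVAL * (1 + JITTER * (2 * rng.random() - 1))

# 500 services requeued with jitter land spread across the window,
# not all at t+60s.
times = sorted(next_requeue() for _ in range(500))
```

Combined with write rate limiting, this turns 500 simultaneous derivations into a stream the API server can absorb.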

Scenario: wrong label selector. The mesh-member controller has a bug: it generates a CiliumNetworkPolicy with the wrong pod label selector. The policy doesn’t match checkout’s pods. Traffic from checkout to payments is denied by default-deny — not because the bilateral agreement is missing, but because the generated policy doesn’t target the right pods.

The developer sees: connection timeout when checkout calls payments. The CRD status says phase: Ready — the pipeline succeeded. The LatticeMeshMember is Ready — the mesh-member controller reconciled. Everything looks correct from the platform’s perspective. The bug is in the content of the generated policy, not in its existence.

How the developer debugs: Check ztunnel logs — “RBAC denied, no matching AuthorizationPolicy.” Check Hubble — “egress from checkout to payments dropped by CiliumNetworkPolicy default-deny.” This tells them a policy should exist but isn’t matching. They inspect the CiliumNetworkPolicy (or run platform debug connectivity checkout payments) and see the label selector is wrong.

How the platform team discovers it: Either the developer files a ticket, or the randomized bilateral testing (Chapter 21) catches it. The randomized tests generate random topologies and verify enforcement — a wrong label selector would produce a “connection denied but should be allowed” failure. This is why randomized testing exists: it catches content bugs that per-service status checks miss.

The lesson for platform builders: The pipeline’s status reports whether resources exist. It doesn’t verify that the resources are correct. A CiliumNetworkPolicy with wrong labels exists (status: Ready) but doesn’t work (traffic: denied). The platform needs testing — not just status reporting — to catch this class of bug. Chapter 21 covers the testing strategy.
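The gap between existence and correctness is easy to see in miniature; the selector and label shapes below are simplified stand-ins for the real resources.

```python
# Sketch: status says the policy exists; only a content check verifies it
# targets the right pods. Shapes are simplified stand-ins.

def selector_matches(policy_selector, pod_labels):
    """A selector matches when every required label is present with that value."""
    return all(pod_labels.get(k) == v for k, v in policy_selector.items())

pod_labels = {"app": "checkout", "team": "payments"}

correct = {"app": "checkout"}
buggy = {"app": "check0ut"}   # the wrong-label-selector bug from the scenario

# Both policies "exist" (status: Ready); only the content check tells them apart.
assert selector_matches(correct, pod_labels)
assert not selector_matches(buggy, pod_labels)
```

A status condition reports that the first and the second policy equally "exist"; only a test that evaluates the selector against real pod labels distinguishes them.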

8.7 Beyond the Pipeline: How Specs Arrive

The derivation pipeline is a Kubernetes controller — it reconciles CRDs on the API server. But the chapter has been silent on how those CRDs arrive. The pipeline doesn’t care. It reconciles whatever appears. The platform team, however, should care deeply about the delivery mechanism — because the delivery mechanism determines which properties the deployment process has.

The properties that matter:

  • Auditability: every change has an author, a timestamp, and a diff. You can answer “who changed this and when?”
  • Reviewability: changes can be reviewed before they take effect. A second pair of eyes sees the spec change before it reaches the cluster.
  • Rollback: a bad change can be reversed to a known-good state. The reversal is a single action, not manual reconstruction.
  • Disaster recovery: if the cluster is lost, the specs survive. The platform can re-derive everything from the surviving specs.
  • Speed: time from “developer makes a change” to “change is live on the cluster.”

Different delivery mechanisms provide different subsets:

Direct apply (kubectl apply, API call from CI). Speed: immediate. Auditability: only if the CI system logs the action. Reviewability: only if a CI gate enforces it. Rollback: manual (apply the previous spec). DR: specs exist only on the cluster unless separately backed up.

GitOps (Flux, ArgoCD, or similar). A controller watches a git repo and applies changes to the cluster. Auditability: every change is a git commit. Reviewability: pull request review before merge. Rollback: git revert. DR: specs are in git, surviving cluster loss. Speed: slower (commit → sync delay, typically 30-60 seconds).

CI-driven apply. CI builds the image, signs it (Chapter 14), and applies the spec — either directly or by committing to a GitOps repo. CI is upstream of the platform: it produces the input (the spec with a newly-built image digest), and the platform produces the output. Auditability and reviewability depend on the CI system’s configuration, not on a structural guarantee.

The derivation pipeline is delivery-agnostic. It reconciles CRDs regardless of origin. This is deliberate — the platform doesn’t enforce how specs arrive. The platform team decides which properties matter for their organization and chooses (or builds) the delivery mechanism that provides them. All five properties can be achieved without GitOps — a CI pipeline with audit logging, PR-gated deploys, versioned spec storage, and a rollback command provides the same properties through different mechanisms.

Rollback in the derivation model. Regardless of delivery mechanism, rollback is a spec change. Apply the previous spec (however you apply specs), and the derivation pipeline re-derives from it — the old Deployment, old policies, old secrets are produced. This is the compilation model’s advantage: rollback is one input change, not manual reversal of 13 individual resources.

Exercises

8.1. [M10] The pipeline applies resources in three layers. The ExternalSecret in Layer 1 takes 3 minutes to sync (slow secret backend). The pipeline polls for 120 seconds and times out. Should the pipeline fail, or create the Deployment without the secret? Design the timeout behavior and the status message.

8.2. [H30] Design the interface between the shared compiler and the CRD-specific wrapper. What does the shared compiler return? What does the wrapper add? How do you handle a feature relevant to services (PDB) but not jobs? Where is the boundary?

8.3. [R] The all-or-nothing rule means a Cedar policy bug can block the fleet. An adversary argues: “the blast radius of a policy bug is worse than the blast radius of a partial deployment.” Is the all-or-nothing rule worth this risk? What safeguards make it safe?

8.4. [H30] The pipeline delegates network policy generation to a separate mesh-member controller (via the LatticeMeshMember CR). Why this design instead of generating policies directly? What are the advantages and disadvantages?

8.5. [M10] The developer adds a second container (logging sidecar). How many additional resources does the pipeline produce? Which existing resources change?

8.6. [R] The pipeline re-derives every 60 seconds even if the spec hasn’t changed. Is this wasteful? Could you short-circuit on unchanged generation? What would you lose — and what would you gain?

8.7. [H30] You’re building a platform from scratch. You choose derivation-time authorization (Option 2 from Section 8.1). A team objects: “OPA Gatekeeper already validates our Deployments at admission. Adding derivation-time gates is redundant.” Construct the argument that derivation-time authorization is not redundant with admission-time validation. Identify the specific class of problems each catches that the other misses.