Chapter 13: Authorization and Network Policy
A developer declares a dependency on the payments service. From that single declaration, three things must happen: Cedar must authorize the dependency, L4 network policy must permit the traffic, and L7 identity-based authorization must permit the request. These are three separate mechanisms at three separate layers — but they’re all triggered by the same line in the spec.
The decisions — forbid-overrides-permit evaluation, bilateral network agreements, layered denial debugging — are universal to any platform that takes authorization seriously. The specific tools shown here (Cedar, Istio AuthorizationPolicy, Cilium NetworkPolicy) are the reference implementation’s choices. The decisions in this chapter:
- How do you write authorization policies that don’t accidentally block the fleet? (Forbid-overrides-permit, policy lifecycle, blast radius management.)
- Should network policies be unilateral or bilateral? (Does one side decide, or must both sides consent?)
- How do you debug a denial when three layers might be responsible? (The developer sees “connection timeout” — which layer denied it?)
13.1 Cedar Authorization at Derivation Time
Chapter 8 described where Cedar gates fire in the pipeline — after admission, before resource generation. This section covers how to design and manage the policies that those gates evaluate.
The six gates (first described in Chapter 8, Section 8.2):
- DeployImage — can this service deploy this image?
- AllowTagReference — can this service use a tag instead of a digest?
- AccessSecret — can this service access this secret?
- OverrideSecurity — can this service use privileged capabilities?
- AccessExternalEndpoint — can this service call this external host?
- AccessVolume — can this service mount this shared volume?
Each gate is a focused question. Each defaults to deny. Each fires only when the spec uses the feature.
Forbid overrides permit. Cedar’s evaluation: if any matching policy says forbid, deny — regardless of permits. If no forbid and at least one permit, allow. If nothing matches, deny (default-deny). This prevents policy conflicts from resolving in favor of access.
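The evaluation order can be sketched in a few lines of Python. This is a simplified model for illustration, not Cedar's actual implementation: real Cedar policies match on principal/action/resource hierarchies rather than arbitrary predicates.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Policy:
    effect: str                      # "permit" or "forbid"
    matches: Callable[[dict], bool]  # does this policy apply to the request?

def evaluate(policies: list[Policy], request: dict) -> str:
    """Forbid-overrides-permit with default-deny (a simplified model of Cedar)."""
    matching = [p for p in policies if p.matches(request)]
    if any(p.effect == "forbid" for p in matching):
        return "deny"   # any matching forbid wins, regardless of permits
    if any(p.effect == "permit" for p in matching):
        return "allow"  # no forbid, at least one permit
    return "deny"       # nothing matched: default-deny

policies = [
    # Team lead: commerce services may access commerce secrets
    Policy("permit", lambda r: r["namespace"] == "commerce"),
    # Security team: nobody touches the root CA key
    Policy("forbid", lambda r: r["resource"] == "pki/root-ca-key"),
]

evaluate(policies, {"namespace": "commerce", "resource": "db-password"})      # "allow"
evaluate(policies, {"namespace": "commerce", "resource": "pki/root-ca-key"})  # "deny"
```

Note that the forbid branch is checked first: a request matching both policies is denied, which is the property that prevents policy conflicts from resolving in favor of access.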
Concrete example:
```cedar
// Team lead permits commerce services to access team secrets
permit(
  principal in Namespace::"commerce",
  action == Action::"AccessSecret",
  resource in SecretStore::"commerce"
);

// Security team forbids all access to the root CA key
forbid(
  principal,
  action == Action::"AccessSecret",
  resource == Secret::"pki/root-ca-key"
);
```

Commerce services can access their team’s secrets. Nobody can access the root CA key — the forbid is absolute.
Contextual attributes. Cedar goes beyond role-based access. Gates receive the full request context — the principal (which service), the action (which gate), the resource (which secret, image, or endpoint), and arbitrary context attributes. A policy could forbid AccessSecret when context.environment == "production" and resource.classification == "pii" unless the principal’s namespace has a PII-approved label. This is attribute-based access control (ABAC) — and it’s what makes Cedar more expressive than Kubernetes RBAC for platform-level authorization.
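The PII example from the paragraph above could be written roughly as follows. The entity and attribute names (context.environment, resource.classification, the labels set) are hypothetical, not the reference implementation's actual schema:

```cedar
// Illustrative ABAC policy; entity and attribute names are hypothetical.
// Forbid PII secret access in production unless the namespace is PII-approved.
forbid(
  principal,
  action == Action::"AccessSecret",
  resource
) when {
  context.environment == "production" &&
  resource.classification == "pii" &&
  !principal.namespace.labels.contains("pii-approved")
};
```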
Policy lifecycle. Policies are CRDs (CedarPolicy), validated by an admission webhook (syntax check), and distributed to workload clusters through the gRPC stream (Chapter 7, SyncDistributedResourcesCommand). Workload clusters evaluate locally — no callback to the parent.
The trade-off. Cedar policies are powerful and dangerous. A broad forbid blocks the fleet. A missing permit blocks a team. A syntax error in a policy admitted without validation breaks evaluation for every service that triggers the gate.
The safeguards: webhook validation at admission (catch syntax errors before they’re applied), dry-run evaluation against the current fleet before activation, and fast rollback — delete the bad CedarPolicy CRD and every service re-derives with the corrected set within 60 seconds.
Choosing an engine. The reference implementation uses Cedar because its forbid-overrides-permit semantics align with default-deny. OPA/Rego is the Kubernetes ecosystem’s dominant alternative — more powerful, steeper learning curve, no built-in forbid-overrides-permit. Kyverno writes policies as YAML CRDs — simpler, less expressive. A flat file or custom function works for trivial cases. The architecture (derivation-time gates, default-deny, focused questions) is engine-agnostic. The examples use Cedar.
13.2 Bilateral Network Agreements
Chapter 8 introduced the LatticeMeshMember as part of the derivation pipeline, and Chapter 11 showed how it bridges non-derived workloads into the mesh. This section covers the bilateral agreement model in depth — the most significant network security decision the platform makes.
The model. Traffic flows only when both sides declare:
- Caller declares outbound:

  ```yaml
  payments:
    type: service
    direction: outbound
  ```

- Callee declares inbound:

  ```yaml
  checkout:
    type: service
    direction: inbound
  ```
Both declarations must exist. If only one side declares, no policy is generated. Default-deny blocks the traffic.
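The matching step can be sketched as a pure function over all service specs. This is a deliberately simplified view: the real mesh-member controller also matches across namespaces and against LatticeMeshMembers from packages.

```python
def matched_edges(specs: dict) -> list:
    """Return (caller, callee) pairs where both sides declared.

    specs: {service_name: {dep_name: direction}}, a simplified view
    of each service's dependency declarations.
    """
    edges = []
    for caller, deps in specs.items():
        for callee, direction in deps.items():
            if direction != "outbound":
                continue
            # Bilateral: the callee must declare the caller as inbound.
            if specs.get(callee, {}).get(caller) == "inbound":
                edges.append((caller, callee))
    return edges

specs = {
    "checkout": {"payments": "outbound"},
    "payments": {"checkout": "inbound"},
    "fraud-detection": {"payments": "outbound"},  # no matching inbound
}
matched_edges(specs)  # [("checkout", "payments")]; fraud-detection's edge is unmatched
```

Only the checkout-to-payments edge produces policies; fraud-detection's outbound declaration matches nothing, so its traffic falls through to default-deny.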
Why bilateral over unilateral. Unilateral policies (the standard model) only control who gets in — the callee writes an AuthorizationPolicy. The caller’s egress is unconstrained. A compromised service can probe every service in the mesh. With bilateral agreements, the caller’s egress is constrained to its declared dependencies. A compromised checkout can only reach the services it declared — not every service in the mesh.
The dependency graph is auditable: read a service’s spec and you know exactly what it calls and what calls it. With unilateral policies, answering “what can service X reach?” requires scanning every AuthorizationPolicy in the cluster for X’s identity.
Cross-service compilation. This is the capability no template-based approach can replicate (Chapter 2). The mesh-member controller sees all services’ specs simultaneously. It matches outbound/inbound declarations across service boundaries, across namespaces, and between derived services and LatticeMeshMembers from packages.
What happens when the match is incomplete. Checkout declares payments: outbound but payments hasn’t declared checkout: inbound. No policies are generated. Checkout deploys successfully — but traffic to payments is denied by default-deny. The status reports: “outbound dependency payments declared but no matching inbound declaration found on service payments.” This is informational, not a derivation failure — the service can still run, it just can’t reach that dependency yet.
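A status condition for the unmatched case might look like the following sketch. The condition type, reason, and field layout are illustrative, not the exact CRD schema:

```yaml
# Illustrative status condition for an unmatched outbound declaration
# (condition type and field names are hypothetical)
status:
  conditions:
    - type: MeshDependenciesMatched
      status: "False"
      reason: NoMatchingInboundDeclaration
      message: >-
        outbound dependency payments declared but no matching
        inbound declaration found on service payments
```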
Coordination between teams. Bilateral agreements require coordination: team A declares outbound, team B declares inbound. In practice: team A deploys, traffic is denied, team A asks team B to add the inbound declaration. The platform makes this visible through the status condition. The alternative — team A opens a ticket and waits for team B to update a firewall rule — is the same coordination with more manual steps.
The coordination cost is not uniform. A leaf service with two callers manages two inbound declarations. A shared utility — logging, authentication, rate limiting — may have dozens or hundreds of callers. Each new caller requires the utility team to add an inbound declaration. At high fan-in, the bilateral model becomes a bottleneck: the utility team’s PR queue fills with one-line inbound additions. The platform should support wildcard or group-based inbound declarations — inbound: namespace-group: commerce-* or inbound: all-authenticated — to reduce per-caller coordination for shared infrastructure services. The trade-off is visibility: a wildcard declaration is easier to manage but harder to audit than an explicit list of callers.
13.3 What the Platform Derives
From a matched bilateral agreement, the mesh-member controller produces three resources:
CiliumNetworkPolicy on the caller. Egress rule allowing TCP to the callee’s pod selector on the callee’s service port.
CiliumNetworkPolicy on the callee. Ingress rule allowing TCP from the caller’s pod selector.
Istio AuthorizationPolicy on the callee. Permitting the caller’s SPIFFE identity: spiffe://cluster.local/ns/{namespace}/sa/{service-account}.
Three resources, two layers, both sides. Default-deny handles everything else.
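For a checkout-to-payments pair, the caller-side egress policy and the callee-side L7 policy might look like the following sketch. The apiVersion, kind, and rule structure are the real Cilium and Istio schemas, but the names, labels, namespace, and port are assumptions:

```yaml
# Caller-side L4 egress (labels, namespace, and port are assumptions)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: checkout-egress-payments
  namespace: commerce
spec:
  endpointSelector:
    matchLabels:
      app: checkout
  egress:
    - toEndpoints:
        - matchLabels:
            app: payments
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
---
# Callee-side L7 authorization pinned to the caller's SPIFFE identity
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: payments-allow-checkout
  namespace: commerce
spec:
  selector:
    matchLabels:
      app: payments
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/commerce/sa/checkout
```

The ingress-side CiliumNetworkPolicy (not shown) mirrors the egress rule from the callee's perspective.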
The default-deny baselines:
```yaml
# L4: deny all traffic (both directions)
apiVersion: cilium.io/v2
kind: CiliumClusterwideNetworkPolicy
metadata:
  name: default-deny
spec:
  endpointSelector: {}
  ingress: []
  egress: []
---
# L7: deny all requests
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: default-deny
  namespace: istio-system
spec: {}
```

Per-service policies open specific paths. Everything else is denied.
A critical nuance for ambient mode: AuthorizationPolicies are evaluated at the waypoint proxy. Namespaces without a waypoint proxy deployed do not evaluate L7 policies at all — traffic bypasses L7 authorization entirely. The L4 (Cilium) layer is the actual universal backstop. L7 default-deny only applies to namespaces with waypoints. The platform must ensure every namespace with services gets a waypoint — Exercise 15.1 explores what happens when this invariant breaks.
External services. type: external-service resources don’t have a bilateral counterpart — the external API can’t declare inbound. For these, the platform generates an egress-only CiliumNetworkPolicy (FQDN-based for domain names) after Cedar’s AccessExternalEndpoint gate approves the egress. L7 AuthorizationPolicy isn’t applicable for external traffic that exits the mesh.
Route-level constraints. The service spec’s resource declarations can include HTTP method and path constraints: routes: [{methods: [GET], paths: [/api/v1/*]}]. These produce more granular AuthorizationPolicies — checkout can GET /api/v1/* on payments but not POST /admin/*.
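Derived from that declaration, the route-constrained L7 policy might look like this sketch. The to.operation structure is Istio's actual AuthorizationPolicy schema; the names and labels are assumptions:

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: payments-allow-checkout-routes
  namespace: commerce
spec:
  selector:
    matchLabels:
      app: payments
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/commerce/sa/checkout
      to:
        - operation:
            methods: ["GET"]
            paths: ["/api/v1/*"]
```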
13.4 Debugging Denials
Default-deny with bilateral agreements means more denials to debug. The platform must make debugging fast.
“Why can’t my service reach X?”
Step 1: Check the CRD status. Is there a condition about the dependency? “Outbound dependency payments declared but no matching inbound declaration found” tells you immediately.
Step 2: Check L7 denials. kubectl logs -n istio-system -l app=ztunnel | grep RBAC shows the exact denial — source identity, destination, matched policy.
Step 3: Check L4 denials. Cilium Hubble shows packet drops with the policy that denied them.
Common issues:
- Both sides declared but traffic is blocked. Check that both services are enrolled in the mesh (ambient mode labels). Check that waypoint proxies are deployed.
- Traffic works intermittently. Race condition between policy derivation and pod startup. The bilateral agreement is being compiled on one service but the mesh-member controller hasn’t reconciled the other yet. Both LatticeMeshMembers must reach Ready.
The trade-off of debugging complexity. Default-deny creates a debugging category that doesn’t exist in default-allow: “my service can’t reach X because a policy is missing.” In default-allow, everything works until it shouldn’t. In default-deny, nothing works until it should. The debugging cost of default-deny is real and ongoing. The security cost of default-allow is hidden until an incident.
13.5 What Goes Wrong in Practice
Scenario: a Cedar policy blocks the fleet. A security engineer creates a new Cedar policy intended to forbid access to a specific infrastructure secret:
```cedar
forbid(
  principal,
  action == Action::"AccessSecret",
  resource == Secret::"infrastructure/root-ca"
);
```

But they make a typo — they write resource without the constraint, creating a policy that forbids ALL secret access for ALL services:

```cedar
forbid(
  principal,
  action == Action::"AccessSecret",
  resource
);
```

The policy passes the webhook (it’s syntactically valid Cedar). It’s distributed to all clusters via SyncDistributedResourcesCommand. On the next reconciliation cycle (within 60 seconds), every service that declares secrets fails derivation: AccessSecret denied by policy infrastructure-secrets-lockdown.
Blast radius: Every service with secrets across the entire fleet. No new deployments. Existing services continue running (the all-or-nothing rule doesn’t tear down running services on re-derivation failure). But any service that’s re-derived — spec change, reconciliation trigger — fails.
Discovery: The on-call engineer sees a flood of PolicyAuthorized: False conditions across the fleet. The common policy name (infrastructure-secrets-lockdown) appears in every denial. The timing correlates with the policy creation.
Recovery: Delete the bad CedarPolicy CRD. Within 60 seconds, every cluster re-syncs the policy set (minus the deleted policy). Services re-derive successfully on the next reconciliation.
Total impact time: 60 seconds (policy distribution) + 60 seconds (first reconciliation failures detected) + 5 minutes (engineer diagnoses and deletes) + 60 seconds (re-sync) = ~8 minutes. During this time, no new services deploy, but existing services continue running.
The lesson: Cedar policy validation at admission must go beyond syntax. The webhook should detect overly broad policies — a forbid with an unconstrained resource field matches everything. The platform should require a dry-run evaluation before activation: “if this policy were active, which services would fail?” And because a bad policy distributes to every cluster within one reconciliation cycle, the platform should support staged rollout — activate the policy in a single non-production cluster first, observe for violations, then promote fleet-wide. Without staged rollout, the blast radius of a policy bug is the entire fleet by default. The safeguards (validation, dry-run, staged rollout, fast rollback) must be robust.
Exercises
13.1. [M10] A developer writes a Cedar policy: permit(principal in Namespace::"commerce", action == Action::"AccessSecret", resource). This permits every service in the commerce namespace to access every secret in the cluster. What is the security impact? How should the platform prevent this — reject the policy at admission, warn but allow, or accept it?
13.2. [H30] Checkout declares payments: outbound. Payments declares checkout: inbound. Both deploy. Traffic flows. Now a third service, fraud-detection, needs to call payments. Fraud-detection adds payments: outbound to its spec. Payments needs to add fraud-detection: inbound. But the payments team is in a different timezone and the change is urgent. Design the workflow: how does the fraud-detection team request the inbound declaration? Can the platform provide a temporary bypass? What are the security implications of a bypass?
13.3. [R] The bilateral model constrains caller egress to declared dependencies. But a compromised service could modify its own CRD spec to add new outbound dependencies. What prevents this? RBAC on the CRD? A Cedar gate on dependency declarations? If the attacker has write access to the namespace, can they bypass bilateral containment?
13.4. [H30] Design a Cedar policy set for an organization with 3 teams (commerce, data, ml). Each team has its own secrets. The security team wants global forbid policies for sensitive infrastructure secrets (CA keys, cloud credentials). Team leads want to manage their team’s permits without the platform team’s involvement. Structure the policies to prevent cross-team access while allowing within-team self-service. Show 3-4 concrete Cedar policy statements.
13.5. [M10] The mesh-member controller generates both CiliumNetworkPolicy (L4) and AuthorizationPolicy (L7) from the same bilateral declarations. If the controller has a bug that generates wrong labels in the CiliumNetworkPolicy, both L4 and L7 may be affected. Is this a violation of layer independence (Chapter 12, Section 12.2)? What would it take to make L4 and L7 generation truly independent?
13.6. [R] External services (type: external-service) bypass the bilateral model — there’s no callee to declare inbound. This means egress to external endpoints is controlled by Cedar AccessExternalEndpoint alone (no network-level bilateral check). Is this a gap? Could an attacker declare external-service dependencies to exfiltrate data to an attacker-controlled endpoint? What prevents this?