
Chapter 7: Multi-Cluster Communication

Chapters 5 and 6 produced self-managing clusters — compute pools that own their own lifecycle and operate independently. But independent doesn’t mean isolated. The parent needs visibility into its children’s health. Policy updates need to propagate. Operators need to kubectl get pods on a workload cluster that’s behind a firewall. The pivot protocol from Chapter 6 needs a transport to carry CAPI resources from parent to child.

The question: how do independent clusters communicate without reintroducing the dependencies that self-management eliminated?

The answer: the workload cluster connects outbound to the parent. The parent never connects inbound to the workload cluster. Everything — health reporting, API proxying, pivot commands, policy distribution — flows over a single bidirectional gRPC stream that the child initiated.

Count the inbound connections a typical management cluster makes to its workload clusters. Prometheus scraping metrics. ArgoCD syncing manifests. The CAPI controller checking machine health. A health check endpoint. An API server endpoint for fleet management. Each of these requires the workload cluster to expose a port, configure authentication, manage TLS certificates, and maintain firewall rules.

Each inbound port is attack surface. Each firewall rule is operational burden. In enterprise environments, opening inbound ports requires change requests, security reviews, and coordination across network teams. Each new workload cluster is a set of firewall tickets. The rule matrix grows with the fleet size and the number of management tools.

Beyond the operational burden, inbound connectivity creates a dependency: the parent must be able to reach the child. If the network between them goes down, management functions stop. A workload cluster behind NAT, in a private subnet, or at an edge location with intermittent connectivity requires dedicated networking infrastructure (VPN, bastion hosts, NAT traversal) just so the parent can push to it.

The outbound-only model eliminates all of this. The workload cluster makes one outbound TCP connection. No inbound ports. No firewall rules allowing traffic to the cluster. No NAT traversal. No VPN.

```mermaid
graph BT
    subgraph Parent Cluster
        Cell[Cell - gRPC Server<br/>Accepts mTLS connections<br/>Routes API proxy<br/>Sends pivot commands<br/>Distributes Cedar policies]
    end
    subgraph Workload Cluster A
        AgentA[Agent]
    end
    subgraph Workload Cluster B
        AgentB[Agent]
    end
    AgentA -->|outbound gRPC<br/>mTLS| Cell
    AgentB -->|outbound gRPC<br/>mTLS| Cell
```

The communication model has two components:

The agent runs on the workload cluster. It establishes an outbound gRPC connection to the parent, authenticates with mTLS (certificates issued during provisioning), and maintains a persistent bidirectional stream. If the stream drops, the agent reconnects with exponential backoff. The agent is a component of the platform operator installed during bootstrapping (Chapter 5).
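
The reconnect behavior can be sketched as capped exponential backoff with jitter. This is an illustrative sketch, not the reference implementation; the base delay, cap, and full-jitter strategy are assumptions.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Delay in seconds before reconnect attempt `attempt` (0-indexed).

    Exponential growth bounded by `cap`, with full jitter so a fleet of
    agents reconnecting after a parent outage doesn't stampede at once.
    The agent retries indefinitely; the cap keeps the delay bounded.
    """
    exp = min(cap, base * (2 ** attempt))
    return random.uniform(0, exp)
```

The jitter matters at fleet scale: without it, every agent that lost its stream at the same moment would retry on the same schedule.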

The cell runs on the parent cluster. It accepts inbound gRPC connections from agents, verifies their mTLS certificates (extracting the cluster identity from the certificate’s CN field), and multiplexes all management communication over the established streams. One cell serves all connected agents.

The connection is outbound from the workload cluster’s perspective — the agent initiates the TCP connection. Once established, the gRPC stream is bidirectional: both agent and cell can send messages at any time. This means the parent can push commands to the child (pivot, policy updates, API requests) without the child exposing any inbound ports.

Why gRPC: Bidirectional streaming over HTTP/2 with multiplexing, efficient Protobuf serialization, and mature mTLS support. The reference implementation uses tonic (Rust gRPC) with configurable message sizes: the gRPC default of 4 MiB is raised to 16 MiB, with a 256 MiB maximum for large pivot payloads. WebSockets, QUIC, or custom protocols could serve the same role; the choice is gRPC because it provides typed message definitions, streaming semantics, and a strong ecosystem.

Multiplexing and priority. The single stream carries heartbeats, API proxy requests, pivot batches, and policy updates simultaneously. A large pivot payload (several MiB of CAPI resources) could delay a heartbeat if sent as one message. The protocol handles this through batching (pivot payloads are split into MoveObjectBatch messages with configurable batch size) and HTTP/2’s native stream multiplexing (heartbeats and API requests flow on separate HTTP/2 streams within the same connection, so they aren’t blocked by bulk transfers).

Before the agent can connect, the workload cluster must exist and the agent must be installed. This creates a chicken-and-egg problem: the agent establishes the outbound stream, but how does the parent communicate with the workload cluster during provisioning, before the agent is running?

The answer is the bootstrap webhook. The parent cluster exposes a webhook endpoint (configured via parentConfig.bootstrapPort in the LatticeCluster CRD). During node provisioning, kubeadm’s postKubeadmCommands calls this webhook to register the new node with the parent.

To be direct about the asymmetry: the parent cluster accepts inbound connections on two ports — the gRPC cell endpoint and the bootstrap webhook. Workload clusters accept inbound connections on zero ports. The “outbound-only” label describes the workload cluster’s posture, not the parent’s. The parent is the single point that accepts connections from the fleet, authenticated by mTLS (gRPC) and bootstrap tokens (webhook). This is a smaller, better-defended attack surface than the traditional model where every workload cluster exposes multiple inbound ports.

The bootstrap webhook deserves additional hardening because it’s the only port exposed before mTLS is established. The parent knows the intended IP ranges of its children (from the LatticeCluster provider config in Chapter 5), so the webhook can be restricted via cloud security groups or network policies to accept connections only from those ranges. The bootstrap token has a short TTL and is single-use. After the agent establishes the persistent gRPC stream, the bootstrap webhook is no longer needed for that cluster.

Once the control plane is up and the platform operator is installed (Chapter 5’s bootstrapping sequence), the agent component starts and establishes the persistent gRPC stream. From this point forward, all communication flows over the agent-cell stream. The bootstrap webhook is only used during initial provisioning.

The protocol is defined in Protobuf with one RPC: StreamMessages(stream AgentMessage) returns (stream CellCommand). Everything is multiplexed over this single bidirectional stream. Rather than listing every message type, let’s walk through the three main flows and why each message exists.

When the agent connects, it sends AgentReady — a handshake containing the agent version, Kubernetes version, cluster state, and a protocol version for capability negotiation. The cell verifies the mTLS certificate, extracts the cluster name from the CN field, and registers the connection.

After the handshake, the agent sends Heartbeat every ~30 seconds. The heartbeat carries the agent’s state (Provisioning, Pivoting, Ready, Degraded, Failed), cluster health (node counts, resource allocation per worker pool), and hashes of the cluster spec and status. The hashes enable delta sync — if the parent’s view of the spec doesn’t match the hash, it knows the cluster has been modified and can request a full state sync.
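
The delta-sync hashes only work if both sides serialize the spec identically. A minimal sketch, assuming canonical JSON and SHA-256 (the chapter does not specify the actual hash construction):

```python
import hashlib
import json

def spec_hash(spec: dict) -> str:
    """Deterministic hash of a cluster spec for heartbeat delta sync.

    Sorting keys and fixing separators makes serialization canonical,
    so semantically identical specs always produce the same hash on
    both agent and cell.
    """
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()
```

The cell compares the heartbeat's hash against the hash of its own copy; any mismatch triggers a full state sync rather than shipping the whole spec in every heartbeat.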

The cell uses missed heartbeats to detect disconnection. The agent uses the stream’s health to detect parent unreachability. Neither side panics on a single missed heartbeat — network jitter happens. Multiple consecutive misses trigger a state transition.
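
The "multiple consecutive misses" rule can be made concrete with a threshold check. The 30-second interval comes from the text; the three-miss threshold is an illustrative choice (exercise 7.1 explores the trade-off).

```python
class HeartbeatMonitor:
    """Threshold-based disconnect detection: one missed heartbeat is
    tolerated as jitter; `miss_threshold` consecutive misses flips the
    status to disconnected."""

    def __init__(self, interval: float = 30.0, miss_threshold: int = 3):
        self.interval = interval
        self.miss_threshold = miss_threshold
        self.last_seen = None

    def record(self, now: float) -> None:
        self.last_seen = now

    def status(self, now: float) -> str:
        if self.last_seen is None:
            return "unknown"
        missed = int((now - self.last_seen) // self.interval)
        return "disconnected" if missed >= self.miss_threshold else "connected"
```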

The pivot flow (Chapter 6, Section 6.3) uses MoveObjectBatch messages to transfer CAPI resources in topological order. The agent responds to each batch with MoveObjectAck containing UID mappings. MoveComplete signals the end of the transfer and carries distributable resources — Cedar policies, SecretProvider CRDs, TrustPolicy CRDs, OIDC providers, image pull credentials.
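
The batching of an ordered pivot payload can be sketched as follows. The `batch_index` and `total_batches` field names mirror those mentioned later in the chapter; the exact message layout is an assumption.

```python
def make_batches(objects: list, batch_size: int) -> list[dict]:
    """Split an ordered list of serialized CAPI objects into
    MoveObjectBatch-shaped dicts, preserving topological order.

    The agent can detect a complete transfer by checking that it has
    received `total_batches` batches with contiguous indices.
    """
    total = (len(objects) + batch_size - 1) // batch_size  # ceiling division
    return [
        {"batch_index": i,
         "total_batches": total,
         "objects": objects[i * batch_size:(i + 1) * batch_size]}
        for i in range(total)
    ]
```

Because order is preserved across batches, resources that depend on earlier ones (a Machine on its Cluster, say) always arrive after their dependencies.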

The same SyncDistributedResourcesCommand is used after the pivot for ongoing policy distribution. When the platform team creates a new Cedar policy on the parent, it’s pushed to all connected children through this message. This is how platform-wide authorization, secret configuration, and image trust policies reach self-managing clusters.

The protocol also supports unpivot — the reverse of the pivot. When a self-managing cluster is deleted (the LatticeCluster CRD is removed), the agent sends ClusterDeleting with all CAPI objects back to the parent. The parent imports them using the same UID-remapping process, unpauses them, and resumes management. The parent can then tear down the infrastructure through CAPI. Self-management is reversible — the cluster can be “un-pivoted” back to parent management for deletion or migration.

KubernetesRequest and KubernetesResponse implement the API proxy (Section 7.6). The request carries the HTTP verb, path, query, body, and critically the source user and groups — so the child cluster can evaluate authorization against the original caller’s identity, not the tunnel’s identity.

ExecRequest, ExecData, ExecResize, and ExecCancel multiplex interactive sessions (exec, attach, port-forward) over the stream. Each session gets a unique request ID. Stream IDs within a session follow the Kubernetes exec protocol conventions: 0 for stdin, 1 for stdout, 2 for stderr. An operator can kubectl exec -it into a pod on a child cluster through the parent — the terminal session is tunneled over the gRPC stream in real time.

Minimal surface. The protocol carries control plane traffic — coordination, management, and interactive operations. It does not carry workload traffic, bulk telemetry, or log streams. Those have dedicated channels.

Idempotent and replay-protected. Pivot batches include move IDs and batch indices. API requests include request IDs. Replaying a message is a no-op if the operation already completed. For policy distribution (SyncDistributedResourcesCommand), messages carry monotonically increasing sequence numbers — the agent rejects any message with a sequence number less than or equal to the last accepted, preventing an attacker who captures a message from reverting to an older policy set.
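
The sequence-number check for policy distribution is simple to state precisely. A sketch, with illustrative names:

```python
class PolicySyncGuard:
    """Reject replayed or stale SyncDistributedResourcesCommand
    messages by enforcing strictly increasing sequence numbers."""

    def __init__(self):
        self.last_accepted = -1

    def accept(self, seq: int) -> bool:
        if seq <= self.last_accepted:
            return False  # replay or rollback attempt: ignore silently
        self.last_accepted = seq
        return True
```

A captured message replayed later fails the `<=` check, so an attacker cannot revert a cluster to an older, more permissive policy set by resending old traffic.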

Unidirectional dependency. The agent operates without the cell. If the stream drops, the cluster continues running, scaling, and self-healing. Only coordination capabilities degrade.

The gRPC stream uses mTLS — both sides authenticate. This is the most critical communication channel in the platform, and hand-waving its security would be irresponsible.

The CA. The platform has its own internal certificate authority, separate from cert-manager. This CA is generated by the platform operator during its own installation — an ECDSA P-256 key pair stored as a Kubernetes Secret in the platform’s namespace on the management cluster. The CA exists before any workload cluster does, because it’s needed to sign agent certificates during provisioning. It is not the same CA as cert-manager (which handles application-level TLS) or the mesh CA (which handles SPIFFE identity). It exists solely for the agent-cell mTLS channel. If this CA is compromised, every agent certificate is compromised and an attacker can impersonate any workload cluster. The CA secret should be treated as the most sensitive credential on the management cluster.

Certificate issuance. During cluster provisioning, the parent generates a TLS certificate for the workload cluster’s agent, signed by the platform CA. The certificate’s CN field contains the cluster name — this is how the cell identifies which cluster is connecting. The certificate and the CA’s public key are embedded in the workload cluster’s bootstrap configuration, so the agent can authenticate the parent and the parent can authenticate the agent from the first connection.

Certificate rotation. Agent certificates are short-lived (configurable, typically 24-72 hours). The agent rotates its certificate by requesting a new one from the parent through the existing stream before the current certificate expires. If the stream is down when rotation is needed, the agent uses its current (still-valid) certificate to reconnect and request a new one.
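
The rotation trigger can be expressed as a remaining-lifetime check. The half-lifetime threshold here is an illustrative convention (the chapter only says rotation happens before expiry):

```python
def should_rotate(not_after: float, now: float, lifetime: float,
                  fraction: float = 0.5) -> bool:
    """True when less than `fraction` of the certificate lifetime
    remains. Rotating well before expiry leaves headroom: if the
    stream is down at the threshold, the still-valid certificate can
    be used to reconnect and request a replacement.
    """
    return (not_after - now) < lifetime * fraction
```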

CA trust. The agent trusts the parent’s CA certificate, which is embedded during provisioning. The cell trusts the platform’s agent-signing CA. Both sides verify the full certificate chain on every connection.

What happens when certificates expire. If the agent’s certificate expires before it can rotate (extended disconnection beyond the certificate lifetime), the agent can’t reconnect — the mTLS handshake fails. Recovery requires manual intervention: issuing a new certificate through an out-of-band mechanism (SSH, cloud console) and restarting the agent.

To be precise about the “survives total isolation” claim from Chapter 6: self-managing clusters survive isolation indefinitely for workload operations (pods keep running, CAPI controllers keep reconciling, scaling and node replacement continue). But management connectivity (policy updates, cross-cluster bilateral agreements, fleet visibility) is lost for good once a disconnection outlasts the certificate TTL (24-72 hours): after the agent’s certificate expires, it cannot reconnect without the out-of-band re-issuance described above. The cluster is alive but disconnected. Monitor for agent certificate age approaching the TTL threshold.

The most immediately useful capability: accessing a workload cluster’s Kubernetes API through the parent, without the workload cluster exposing its API server.

An operator on the parent runs kubectl get pods targeting a workload cluster. The parent routes the request through the gRPC stream:

  1. The parent’s API proxy receives the request, identifies the target cluster from the path (/clusters/backend/api/v1/namespaces/default/pods).
  2. The cell serializes the request as a KubernetesRequest message — verb, path, query, body, content type, source user and groups.
  3. The agent receives the message, executes the request against the local API server, and returns a KubernetesResponse — status code, body, content type.
  4. The parent returns the response to the operator.

Watch requests work too. The agent streams watch events back over the gRPC tunnel for the lifetime of the watch, with streaming: true on each event and stream_end: true when the watch terminates.

Exec and port-forward sessions are multiplexed over the same stream. kubectl exec -it on a pod in a workload cluster tunnels stdin/stdout/stderr through the gRPC stream with per-session stream IDs.

Authorization is preserved through the tunnel. The KubernetesRequest carries the source user and groups. The parent authorizes the request before proxying — reaching the parent doesn’t automatically grant access to every child. The child can also evaluate authorization locally.

To make this concrete, here’s what happens when an operator runs kubectl get pods -n commerce targeting the backend workload cluster through the parent:

  1. t=0ms. The operator’s kubectl sends GET /clusters/backend/api/v1/namespaces/commerce/pods to the parent’s API proxy.
  2. t=1ms. The proxy identifies the target cluster (backend) from the path, strips the cluster prefix, and looks up the gRPC stream for backend’s agent.
  3. t=2ms. The cell sends a KubernetesRequest message over the stream: verb: GET, path: /api/v1/namespaces/commerce/pods, source_user: "evan", source_groups: ["platform-admins"].
  4. t=30ms. The agent receives the message, executes the request against the local API server (which evaluates RBAC against the source user/groups), and gets the pod list.
  5. t=35ms. The agent sends a KubernetesResponse: status_code: 200, body: <pod list JSON>, content_type: application/json.
  6. t=65ms. The parent returns the response to the operator’s kubectl. Total latency: ~65ms, roughly 60ms of which is transit over the gRPC stream (about 30ms in each direction).

The operator sees the pod list as if they’d connected to the workload cluster directly. The workload cluster has zero inbound ports. The operator didn’t need a VPN, a bastion host, or a kubeconfig for the workload cluster — they authenticated to the parent and the proxy preserved their identity through the tunnel.

This is where the outbound-only model proves its value.

Stream drops, agent reconnects. Normal transient failure. The agent reconnects with exponential backoff. During the gap: the workload cluster operates normally (self-managing), the parent loses visibility (no heartbeats, no API proxy, no policy push). No workloads are affected.

Parent is down for an extended period. The agent retries indefinitely. The workload cluster operates normally. It has the last-synced policies, the last-synced secret providers, the last-synced trust policies. When the parent recovers, the agent reconnects, sends its current state, and receives any queued updates.

Parent is permanently deleted. The workload cluster continues operating. It can’t reconnect because there’s nothing to reconnect to. An operator can reconfigure the agent to point at a new parent if coordination is needed. The cluster’s self-management is unaffected.

Network partition. From the agent’s perspective, identical to “parent is down.” The workload cluster doesn’t distinguish between “parent is unreachable” and “parent doesn’t exist.” It operates with the last-synced configuration and retries the connection.

In every failure mode, the workload cluster’s operational capabilities — running workloads, scaling, upgrading, self-healing — are unaffected. Only coordination capabilities — visibility, API proxy, policy distribution — degrade. Communication is useful, not required. This is the architectural payoff of combining self-management (Chapter 6) with outbound-only communication.

Peer-to-peer. Outbound-only defines a parent-child relationship. Two sibling workload clusters can’t talk directly through this model. They can relay through a shared parent, or they can establish direct connections for specific use cases (cross-cluster mesh, Chapter 15). The agent-cell protocol is for hierarchical management, not arbitrary cluster-to-cluster communication.

Latency. The gRPC tunnel adds a hop. For API proxying and policy distribution, this is negligible. For latency-sensitive operations (real-time cross-cluster failover, synchronous data replication), the tunnel may not be appropriate.

Bandwidth. The tunnel is designed for control plane traffic, not data plane. Log streaming, bulk metrics export, and artifact distribution should use dedicated channels. The tunnel carries kilobytes of coordination messages, not gigabytes of telemetry.

Single parent. In the current model, each workload cluster connects to one parent. If the parent is permanently lost, the agent must be reconfigured. A mesh of connections (multiple parents, peer connections) would add resilience but also complexity.

The reader now has the full cluster lifecycle:

  • Chapter 5: Clusters are declared in a spec, provisioned through a lifecycle state machine, and bootstrapped with the platform’s prerequisites.
  • Chapter 6: The pivot transfers infrastructure ownership to the workload cluster. Self-managing clusters scale, upgrade, and self-heal independently. The trade-off is complexity.
  • Chapter 7: Self-managing clusters communicate through outbound-only gRPC streams. The protocol carries health, pivot commands, API proxy, policy distribution, and exec sessions. Communication is useful, not required.

Part III shifts from infrastructure to workloads. The clusters are provisioned, self-managing, and connected. Now: how does the platform turn a LatticeService spec into a Deployment, network policies, external secrets, scrape targets, and disruption budgets? Part III covers the derivation pipeline (Chapter 8), secret resolution and its five routing paths (Chapter 9), autoscaling and resource governance (Chapter 10), and what to do when something doesn’t fit the pipeline (Chapter 11).

7.1. [M10] The agent sends heartbeats every 30 seconds. The cell uses missed heartbeats to detect disconnection. How many missed heartbeats should trigger a “disconnected” status? What’s the trade-off between detecting disconnection quickly (1 missed heartbeat = 30s) and avoiding false positives from network jitter?

7.2. [H30] The API proxy tunnels Kubernetes API requests through the gRPC stream. Design the authorization model: who is allowed to proxy requests to which clusters? Consider: should the parent’s RBAC determine access, the child’s RBAC, or both? What happens when the parent’s RBAC grants access but the child’s denies it? What information should the audit log record?

7.3. [R] The chapter claims “communication is useful, not required.” Test this claim against a platform that distributes Cedar policies through the tunnel (Section 7.4, SyncDistributedResourcesCommand). If a new policy is created that denies a previously-permitted secret access, and the workload cluster is disconnected when the policy is created, the workload cluster continues operating with the old (more permissive) policy. Is this acceptable? How long can the disconnection last before it becomes a security problem? How should the platform handle policy staleness?

7.4. [H30] Section 7.6 notes the “single parent” limitation. Design a multi-parent model where each workload cluster connects to two parents for redundancy. What changes in the protocol? How do you handle conflicting policy updates from different parents? How does the pivot protocol work with two parents? Is the complexity worth the resilience gain?

7.5. [R] The outbound-only model eliminates inbound ports on workload clusters. An adversary argues: “The parent cluster now accepts inbound connections from every workload cluster. You’ve moved the attack surface from many clusters to one.” Evaluate this argument. Is the parent’s inbound surface larger or smaller than the fleet’s aggregate inbound surface in the traditional model? What does the parent need to defend, and how?

7.6. [M10] The pivot sends CAPI resources as JSON manifests in MoveObjectBatch messages. The platform’s configured max gRPC message size is 16 MiB. A cluster with 100 machines and their associated resources might produce a total payload of 5-10 MiB. What happens if a cluster has 500 machines? How does the batching strategy (batch_index, total_batches) handle this? What is the failure mode if a single batch exceeds the message size limit?