
Chapter 16: Batch and Gang Scheduling

A distributed training job needs 8 GPUs. The Kubernetes scheduler places 7 pods. The 8th can’t be placed. Seven pods sit at a synchronization barrier, burning ~$25/hour in GPU time (at ~$3.50/GPU/hour on-demand for H100-class hardware), doing nothing.

The decisions — atomic group placement, shared security guarantees for batch workloads, when batch deserves its own CRD — are universal. The reference implementation chose Volcano; the architecture is scheduler-agnostic. The three decisions:

  • How do you schedule groups of pods atomically? (The default scheduler can’t — it commits per-pod. You need a supplemental scheduler with speculative placement.)
  • How much of the platform’s security model applies to batch jobs? (Image verification, Cedar authorization, bilateral agreements — all of it, through the shared WorkloadCompiler.)
  • When does a workload pattern deserve its own CRD? (Batch has different lifecycle, different scaling, different output resources, and multiple teams need it — all four signals from Chapter 11. The case is unambiguous.)

16.1 Why the Default Scheduler Fails for Batch


The kube-scheduler pulls a pod off the queue, finds a node, binds it. Next pod. It has no concept of “these 8 pods are a unit.” By the time it realizes pod 8 won’t fit, pods 1-7 are bound and consuming resources.

There’s no rollback. The scheduler committed eagerly.

The starvation variant is nastier. Job A takes 4 GPUs. Job B needs 4 but only 2 are placed. Those 2 pods hold resources at a barrier while higher-priority jobs keep arriving. Job B’s remaining pods may never get scheduled. You’re paying for 2 idle GPUs indefinitely, and the scheduler has no mechanism to detect or correct this — it doesn’t know the pods are related.

Services tolerate partial placement. A Deployment with 3 replicas can serve traffic with 2 — degraded but functional. Batch workloads can’t. Distributed training, MPI jobs, multi-GPU inference pipelines — they need all participants or they need to wait. Partial placement is worse than no placement because it holds resources hostage.

The reference implementation uses Volcano for gang scheduling. Volcano runs alongside the default scheduler — set schedulerName: volcano on batch workloads; everything else keeps using kube-scheduler.

The key insight is Volcano’s two-phase commit. Instead of binding pods as it finds nodes (the kube-scheduler approach), Volcano tentatively places all pods in memory first. Once it confirms that minMember pods can be simultaneously placed, it commits the batch and sends all bind requests at once. If it can’t meet the gang constraint, it drops the tentative placements — no pods created, no resources consumed.

This separation of decision from commitment is what makes gang scheduling work. The default scheduler’s eager binding means a failed gang check leaves orphaned pods holding resources. Volcano’s speculative placement means a failed check is just discarded memory.
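The contrast can be sketched in a few lines of Python. This is a toy model, not Volcano's actual code: the node names, GPU counts, and the `eager_bind`/`gang_bind` helpers are all illustrative assumptions.

```python
def eager_bind(nodes, pod_count, gpus_per_pod):
    """kube-scheduler style: commit each pod as soon as a node fits it.
    Returns the number of pods bound -- possibly a partial gang."""
    bound = 0
    for _ in range(pod_count):
        for node, free in nodes.items():
            if free >= gpus_per_pod:
                nodes[node] -= gpus_per_pod  # committed immediately; no rollback
                bound += 1
                break
        else:
            break  # this pod didn't fit; earlier pods stay bound, holding GPUs
    return bound

def gang_bind(nodes, pod_count, gpus_per_pod, min_member):
    """Volcano style: place all pods tentatively, commit only if the gang fits."""
    tentative = dict(nodes)           # speculative copy; real state untouched
    placed = eager_bind(tentative, pod_count, gpus_per_pod)
    if placed < min_member:
        return 0                      # discard tentative placements: nothing bound
    nodes.update(tentative)           # commit the whole batch at once
    return placed

# 4 nodes with 2 free GPUs each = 8 GPUs total; a 9-pod gang cannot fit.
cluster = {"n1": 2, "n2": 2, "n3": 2, "n4": 2}
print(eager_bind(dict(cluster), 9, 1))    # 8: partial gang stranded at the barrier
print(gang_bind(dict(cluster), 9, 1, 9))  # 0: gang check failed, nothing bound
```

The difference is exactly the one in the text: a failed gang check in the speculative model is discarded memory, while the eager model has already mutated cluster state.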

The scheduling cycle runs five actions: Enqueue gates jobs by aggregate resource availability (don’t create pods for jobs that obviously can’t fit). Allocate runs the gang logic — sort by priority and dominant resource fairness, tentatively place, commit or discard. Preempt handles priority within queues. Reclaim rebalances across queues using weights. Backfill fills remaining capacity.
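These actions and the plugins behind them are wired up in the volcano-scheduler ConfigMap. A typical configuration looks roughly like the following; the exact tier layout and plugin set vary by deployment, so treat this as a representative sketch rather than the reference implementation's config:

```yaml
# volcano-scheduler.conf (illustrative)
actions: "enqueue, allocate, preempt, reclaim, backfill"
tiers:
- plugins:
  - name: priority      # job ordering by priority class
  - name: gang          # the minMember all-or-nothing check
  - name: conformance
- plugins:
  - name: drf           # dominant resource fairness
  - name: predicates    # per-node feasibility filters
  - name: proportion    # queue weights for fair share
  - name: nodeorder
  - name: binpack       # optional: pack nodes to reduce GPU fragmentation
```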

Alternatives. Kueue does admission control and quotas; as of Kueue 0.6+ (2024), it also supports gang scheduling via JobSet integration. YuniKorn supports gang scheduling but replaces the default scheduler entirely. Kubernetes has been adding native gang scheduling support through the CoScheduling scheduler plugin and the PodGroup API in the scheduler-plugins project — this covers the basic placement constraint but lacks Volcano’s queue system, DRF, and reclaim. The architecture is scheduler-agnostic; the examples use Volcano.

A batch job is declared with LatticeJob:

apiVersion: lattice.dev/v1alpha1
kind: LatticeJob
metadata:
  name: training-run-42
  namespace: ml
spec:
  queue: ml-training
  minAvailable: 8
  tasks:
    workers:
      replicas: 8
      workload:
        containers:
          trainer:
            image: registry.example.com/trainer@sha256:def456
            variables:
              S3_BUCKET: "${resources.training-data.bucket}"
            resources:
              requests:
                cpu: "4"
                memory: 32Gi
                nvidia.com/gpu: "1"
              limits:
                cpu: "4"
                memory: 32Gi
                nvidia.com/gpu: "1"
            security:
              runAsUser: 65534
              apparmorProfile: Unconfined
      restartPolicy: OnFailure
  defaults:
    entryRuntime:
      imagePullSecrets:
        - default
  resources:
    training-data:
      type: secret
      params:
        keys: [bucket, access-key, secret-key]
    s3-endpoint:
      type: external-service
      direction: outbound
      params:
        endpoints:
          s3: https://s3.us-east-1.amazonaws.com

The developer declared: 8 workers with 1 GPU each, gang scheduled (minAvailable: 8), in the ml-training queue, with S3 credentials for the training data.

The pipeline derives: a Volcano VCJob with minMember: 8, a PodGroup referencing the queue, ExternalSecrets for the S3 credentials, ConfigMaps and Secrets for environment variables, a LatticeMeshMember for network policies (batch jobs have network dependencies too — they need to reach the S3 endpoint), and a TracingPolicyNamespaced for Tetragon enforcement.
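The central derived resource is the Volcano VCJob. A sketch of what it might look like for the LatticeJob above; field values beyond those stated in the chapter (and the trimmed-down container spec) are illustrative, not the pipeline's exact output:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job                    # Volcano's VCJob kind
metadata:
  name: training-run-42
  namespace: ml
spec:
  schedulerName: volcano     # opt in to Volcano; everything else keeps kube-scheduler
  queue: ml-training
  minAvailable: 8            # the gang constraint: all 8 workers or none
  tasks:
  - name: workers
    replicas: 8
    template:
      spec:
        restartPolicy: OnFailure
        containers:
        - name: trainer
          image: registry.example.com/trainer@sha256:def456
          resources:
            requests:
              nvidia.com/gpu: "1"
            limits:
              nvidia.com/gpu: "1"
```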

The LatticeJob uses the same WorkloadCompiler as LatticeService (Chapter 8, Section 8.4). The shared compiler handles:

  • Image verification. The trainer image is verified against TrustPolicy and authorized by Cedar DeployImage — identical to services.
  • Cedar authorization. AccessSecret for the S3 credentials. AccessExternalEndpoint if the job calls external APIs. OverrideSecurity if the job needs capabilities or privilege.
  • Secret resolution. The ${resources.training-data.bucket} reference goes through the same five routing paths as service secrets (Chapter 9); the bucket name likely resolves via Path 1 (pure secret variable).
  • Environment and file compilation. ConfigMaps and Secrets are generated per-container, identical to services.
  • Mesh member. A LatticeMeshMember is generated for network policies. Batch jobs can declare outbound dependencies (S3, other services) and the bilateral agreement model applies.

What the LatticeJob compiler adds on top:

  • Volcano VCJob instead of Deployment.
  • PodGroup with minMember for gang scheduling.
  • Queue assignment for fair-share scheduling.
  • restartPolicy: OnFailure or Never instead of Always.
  • No PDB (jobs complete; they don’t need disruption protection).
  • No KEDA ScaledObject (batch jobs have fixed replica counts).
  • Completion tracking in the status (succeeded/failed/running).

A bug fix in the shared compiler’s secret resolution fixes it for services AND jobs simultaneously. A new Cedar gate in the shared compiler protects both. The shared layer is the enforcement point.
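The layering can be sketched as class composition. This is a hypothetical illustration of the structure, not the actual WorkloadCompiler API; the method names and returned resource strings are stand-ins:

```python
class WorkloadCompiler:
    """Shared layer: every gate added here applies to services AND jobs."""
    def verify_images(self, spec):      # TrustPolicy + Cedar DeployImage
        pass
    def authorize(self, spec):          # AccessSecret, AccessExternalEndpoint, ...
        pass
    def resolve_secrets(self, spec):    # the five routing paths
        return ["ExternalSecret"]
    def compile_env(self, spec):        # per-container ConfigMaps / Secrets
        return ["ConfigMap"]
    def mesh_member(self, spec):        # network policies, bilateral agreements
        return ["LatticeMeshMember"]

    def compile(self, spec):
        self.verify_images(spec)
        self.authorize(spec)
        return (self.resolve_secrets(spec)
                + self.compile_env(spec)
                + self.mesh_member(spec))

class JobCompiler(WorkloadCompiler):
    """Batch layer: only the workload-shape differences live here."""
    def compile(self, spec):
        # All shared enforcement runs unchanged; batch adds VCJob + PodGroup
        # (and omits Deployment, PDB, and ScaledObject entirely).
        return super().compile(spec) + ["VCJob", "PodGroup"]
```

Because `JobCompiler.compile` delegates to the shared layer rather than reimplementing it, a fix or a new Cedar gate in `WorkloadCompiler` reaches both workload types with no batch-side change.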

LatticeJob supports scheduled execution:

spec:
  schedule: "0 2 * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 1

The pipeline derives a Kubernetes CronJob that wraps the VCJob creation. The concurrencyPolicy prevents overlapping runs. History limits control how many completed job records are retained.

Cron jobs get the same security model as one-shot jobs — image verification, Cedar authorization, bilateral agreements. The schedule doesn’t exempt the job from policy evaluation.

GPU resources are expensive and shared. Volcano queues provide fair-share scheduling across teams.

Each team gets a queue with a weight. A queue with weight 2 gets twice the resources of a queue with weight 1 when both have waiting jobs. The LatticeJob’s queue field references the team’s queue. The pipeline validates that the job’s namespace is authorized to use the referenced queue.
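Concretely, the queues are Volcano Queue objects. The queue names below are illustrative; with both queues saturated, the weight-2 queue is entitled to 2/3 of contended capacity and the weight-1 queue to 1/3:

```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ml-training          # referenced by the LatticeJob's queue field
spec:
  weight: 2
  reclaimable: true          # idle share may be borrowed, then reclaimed
---
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: analytics-batch
spec:
  weight: 1
  reclaimable: true
```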

Preemption allows high-priority jobs to evict lower-priority ones. The platform defines priority classes. Teams need to understand that a low-priority job may be preempted — the training framework should support checkpointing.

The trade-off. Queue management adds operational complexity. Weights must be configured and maintained. Preemption policies must be tested. Teams disagree about priorities. But without queue management, GPU scheduling is first-come-first-served — which means the team that submits the most jobs monopolizes the GPUs.

Gang scheduling gives you atomicity. It doesn’t give you packing.

Volcano’s enqueue phase checks aggregate cluster resources — “are there 8 free GPUs total?” It doesn’t check per-node topology. If those 8 GPUs are scattered one-per-node and each pod needs 2 on the same node, the enqueue check passes but the allocate phase rejects every node.

The pattern is common: in a cluster with 4 GPUs per node, naive scheduling leaves most nodes with 3 free GPUs — hundreds free in aggregate, but few nodes with all 4 available for a job that needs 4 on the same node. Bin-packing (consolidating workloads onto fewer nodes to keep others fully free) combined with gang scheduling addresses both problems — atomicity and fragmentation.

Gang scheduling solves atomicity. Packing policy solves fragmentation. The platform needs both.
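The gap between the two checks is easy to demonstrate. A minimal sketch, assuming a simple model where each gang pod needs all its GPUs on one node; the helper names and node layout are illustrative:

```python
def enqueue_check(nodes, pods, gpus_per_pod):
    """Aggregate check (enqueue-style): enough free GPUs cluster-wide?"""
    return sum(nodes.values()) >= pods * gpus_per_pod

def allocate_check(nodes, pods, gpus_per_pod):
    """Per-node check (allocate-style): can every pod fit on a single node?"""
    free = sorted(nodes.values(), reverse=True)
    for _ in range(pods):
        if not free or free[0] < gpus_per_pod:
            return False
        free[0] -= gpus_per_pod       # place on the emptiest node
        free.sort(reverse=True)
    return True

scattered = {f"n{i}": 1 for i in range(8)}  # 8 free GPUs, one per node
print(enqueue_check(scattered, 4, 2))       # True: 8 >= 8 in aggregate
print(allocate_check(scattered, 4, 2))      # False: no node has 2 together

packed = {"n1": 4, "n2": 4}                 # the same 8 GPUs, bin-packed
print(allocate_check(packed, 4, 2))         # True: fragmentation eliminated
```

The same eight GPUs pass or fail depending only on topology, which is exactly why packing policy has to accompany gang scheduling.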

Scenario: preemption kills a 5-hour job. Team A submits a low-priority training job that needs 4 GPUs. The job runs for 5 hours. Team B submits a high-priority job that needs 4 GPUs. Volcano preempts team A’s job — 4 pods are killed. Team A had no checkpointing. Five hours of compute ($70 at $3.50/GPU/hour) is lost.

What team A sees: their pods are terminated with reason Preempted. The job fails. The status says Failed, reason: task workers preempted by higher-priority job ml/team-b-urgent.

The lesson: The platform provides the scheduling mechanism (Volcano, queues, priorities). The platform does NOT protect against data loss from preemption — that’s the training framework’s responsibility (checkpointing). The platform should: (1) surface preemption risk at submission time (“this job is low-priority; higher-priority jobs may preempt it”), (2) warn when a long-running low-priority job has no checkpointing configured, (3) log the cost of preempted compute in the job’s status.
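Point (3) is simple arithmetic the platform can do on the developer's behalf. A hypothetical sketch of the status record it could emit; the field names and the $/GPU-hour rate are assumptions (the rate matches the figure used in this chapter):

```python
GPU_HOURLY_RATE = 3.50  # assumed on-demand H100-class rate from this chapter

def preemption_record(gpus, hours_run, checkpointed):
    """What the platform could log in the job status when a gang is preempted."""
    lost_hours = 0.0 if checkpointed else hours_run
    return {
        "reason": "Preempted",
        "gpuHoursLost": gpus * lost_hours,
        "costLost": round(gpus * lost_hours * GPU_HOURLY_RATE, 2),
        "warning": None if checkpointed
                   else "no checkpointing configured; all progress lost",
    }

print(preemption_record(gpus=4, hours_run=5, checkpointed=False))
# costLost: 70.0 -- the $70 figure from the scenario above
```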

16.1. [M10] A job needs 8 GPUs. The cluster has 8 free across 8 nodes (1 per node). Each pod needs 1 GPU. Does Volcano’s gang scheduling place the job? What if each pod needs 2 GPUs? Where does the failure occur — enqueue or allocate?

16.2. [H30] Two teams share a GPU cluster. Team A has queue weight 3, team B has weight 1. Team A submits a job needing 16 GPUs. Team B submits a job needing 4 GPUs. The cluster has 16 GPUs free. How does Volcano allocate? What if team A’s job is low-priority and team B’s is high-priority? Design the interaction between queue weights and priority classes.

16.3. [R] The shared WorkloadCompiler ensures batch jobs get the same security as services — image verification, Cedar authorization, bilateral agreements. But batch jobs are often run by ML engineers who may not understand the platform’s dependency model. They submit a training job and it fails because they didn’t declare the S3 endpoint as an external-service dependency. Is the security model too strict for batch workloads? Should batch jobs have a different (more permissive) security posture? What’s the risk?

16.4. [H30] A training job runs for 6 hours and fails because a GPU produced incorrect results (ghost loss — Chapter 17, Section 17.1). The job has no checkpointing. 6 hours of compute wasted. Design the platform’s role: should it monitor training loss curves? Should it detect GPU anomalies and preemptively migrate the job? Where is the boundary between platform responsibility and application responsibility?

16.5. [M10] Volcano’s preemption evicts lower-priority tasks to make room for higher-priority ones. A low-priority job has been running for 5 hours on 4 GPUs. A high-priority job arrives needing 4 GPUs. Volcano preempts the low-priority job. 5 hours of compute are lost (no checkpointing). Is this the correct behavior? How should the platform surface preemption risk to the developer before they submit?

16.6. [R] The chapter says “Kubernetes scheduling is per-pod, batch needs per-group.” Is this a fundamental limitation of Kubernetes, or a limitation of the kube-scheduler specifically? Could you build gang semantics into the kube-scheduler without replacing it? What are the architectural constraints that make this hard? (Hint: the scheduler’s per-pod decision loop and lack of transaction semantics.)