Chapter 17: GPU Infrastructure

A training job runs for 6 hours on 8 GPUs. The loss curve diverges at hour 4 — but nobody notices until hour 6, when the job completes with a useless model. The cause: a double-bit ECC error on one GPU at hour 4 that produced incorrect computation without crashing the pod.

GPUs are not fast CPUs. They fail in ways CPUs don’t, and a platform that treats them as “nodes with an extra resource type” will miss these failures.

The decisions — detecting silent failures, acting on degraded hardware, monitoring GPU-specific metrics — apply to any platform running GPU workloads. The reference implementation uses NVIDIA DCGM; the failure modes and monitoring principles are hardware-universal:

  • How do you detect failures that don’t crash the pod? (Ghost losses — incorrect computation without errors. DCGM metrics, anomaly detection, or application-level monitoring?)
  • What action does the platform take on degraded hardware? (Cordon immediately? Alert and let the user decide? The answer depends on the cost of false positives vs. false negatives.)
  • How do you monitor hardware that has no CPU analog? (15 DCGM metrics per GPU, most of which infrastructure teams have never seen.)

Hard loss. The GPU device disappears from the system. The kernel log shows an XID 79 error (GPU has fallen off the bus) and the CUDA runtime returns cudaErrorDevicesUnavailable. The pod crashes. Kubernetes detects the pod failure and — for jobs — may reschedule. This is the “easy” failure: visible, immediate, and handled by standard Kubernetes mechanisms (MachineHealthCheck replaces the node, or the job’s restart policy retries the pod).

Ghost loss. The GPU produces incorrect results without crashing. CUDA operations complete, but the output is wrong. Double-bit ECC errors that don’t trigger a device reset. XID errors in specific categories. Driver instability that corrupts computation. The pod continues running. The training loss diverges. Nobody knows until someone inspects the model or the loss curve — which could be hours or days later.

Ghost losses are the most expensive failure mode in GPU infrastructure. Hard losses waste the compute between the last checkpoint and the crash. Ghost losses waste the compute between the ghost loss and the detection — potentially the entire training run.

Thermal throttling. GPUs throttle clock speed when they overheat. Performance degrades without errors. A training job takes 8 hours instead of 6. No alert fires because no threshold is exceeded — the GPU is working, just slower. Thermal throttling is a warning sign: cooling is inadequate, or the GPU is under sustained load that the thermal design can’t handle.

The NVIDIA Data Center GPU Manager exposes ~15 metrics per GPU that have no CPU analog. The platform must collect these through the same metrics infrastructure as service metrics.

Key metrics:

  • DCGM_FI_DEV_GPU_TEMP — Temperature. Alert threshold: manufacturer’s rated max (83-90°C typically).
  • DCGM_FI_DEV_POWER_USAGE — Power draw. Anomalous spikes or drops indicate hardware issues.
  • DCGM_FI_DEV_SM_CLOCK — Streaming multiprocessor clock speed. Drops indicate throttling.
  • DCGM_FI_DEV_ECC_SBE_VOL — Single-bit ECC errors (volatile). High rate indicates memory degradation.
  • DCGM_FI_DEV_ECC_DBE_VOL — Double-bit ECC errors. Whether these trigger a device reset (hard loss) or silently corrupt computation (ghost loss) depends on the error location and GPU firmware behavior. Any non-zero count warrants investigation — it may be a hard failure signal or the precursor to silent data corruption.
  • DCGM_FI_DEV_XID_ERRORS — XID error codes. Different codes indicate different failure types. XID 48 (double-bit ECC) typically indicates a hard loss. XID 63 (ECC page retirement) is degradation.
  • DCGM_FI_DEV_GPU_UTIL — Utilization. For capacity planning and autoscaling.
  • DCGM_FI_DEV_MEM_COPY_UTIL — Memory bandwidth utilization. Saturation indicates memory-bound workloads.
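Threshold alerting over these metrics can be sketched in a few lines. The metric names below match the DCGM exporter; the thresholds and the `evaluate` helper are illustrative assumptions, not the platform's actual alerting rules:

```python
# Illustrative threshold checks over a single DCGM sample. Metric names are
# the DCGM exporter's; thresholds are assumptions for illustration.
DCGM_THRESHOLDS = {
    "DCGM_FI_DEV_GPU_TEMP": lambda v: v > 85,    # above typical rated max (degrees C)
    "DCGM_FI_DEV_ECC_DBE_VOL": lambda v: v > 0,  # any double-bit ECC error
    "DCGM_FI_DEV_XID_ERRORS": lambda v: v != 0,  # a nonzero XID code was reported
}

def evaluate(sample):
    """Return the metrics in `sample` that breach their threshold."""
    return [name for name, breached in DCGM_THRESHOLDS.items()
            if name in sample and breached(sample[name])]

evaluate({"DCGM_FI_DEV_GPU_TEMP": 88, "DCGM_FI_DEV_ECC_DBE_VOL": 0})
# → ["DCGM_FI_DEV_GPU_TEMP"]
```

In practice these rules would live in the metrics backend as alerting rules; the point is that each check is a simple per-sample predicate.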

The DCGM exporter runs as a DaemonSet on GPU nodes and exposes a /metrics endpoint. The platform generates a scrape target for it during cluster bootstrapping (Chapter 5) — the same pattern as service observability, applied to hardware.

The trade-off. DCGM metrics add monitoring overhead per GPU node. At 15 metrics per GPU and 8 GPUs per node, that's 120 metric series per node; for a 100-node GPU cluster, 12,000 series scraped every 30 seconds. This is manageable for VictoriaMetrics/Prometheus but not free, and the storage cost scales with GPU count.

Threshold-based alerting catches gross failures. “Alert if temperature > 85°C” is clear and correct. But threshold alerting misses subtle degradation: a GPU that normally runs at 72°C and starts running at 78°C isn’t over threshold, but the trend is significant.

Statistical baselines. Track mean and standard deviation per metric per GPU over a rolling window (7-14 days). Alert when a metric deviates by more than N standard deviations from its baseline. This catches: gradual temperature increase (cooling degradation), increasing ECC error rate (memory wear), and clock speed drops (early throttling).
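The rolling-baseline check can be sketched with the standard library alone. The window contents and the 3-sigma default below are illustrative assumptions:

```python
import statistics

def zscore_anomaly(history, current, n_sigma=3.0):
    """Flag `current` when it deviates more than n_sigma standard deviations
    from the rolling-window baseline (7-14 days of samples in practice)."""
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    if stdev == 0:
        return current != mean          # flat baseline: any change is anomalous
    return abs(current - mean) / stdev > n_sigma

# A GPU that baselines around 72 degrees C drifting to 78 is flagged,
# even though 78 is well under the 85-degree hard threshold.
baseline = [72.0, 71.5, 72.3, 72.1, 71.8, 72.0, 72.2]
zscore_anomaly(baseline, 78.0)   # → True
zscore_anomaly(baseline, 72.1)   # → False
```

The same function works for ECC error rates and SM clock speeds; only the window and the metric change.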

Autoencoder models. Train a model on normal GPU behavior. Feed it current metrics. High reconstruction error indicates anomalous behavior. GRU (Gated Recurrent Unit) autoencoders are one possible approach for time-series GPU metrics — they can capture temporal patterns (a spike followed by a drop is different from a steady increase) that statistical methods miss.

Practical considerations. Building baselines requires a history of normal operation — the platform should collect metrics for 2-4 weeks before enabling anomaly detection. False positives are expensive: a GPU node taken offline unnecessarily is an expensive node not running jobs. Anomaly detection supplements threshold alerting; it doesn't replace it.

Health-based scheduling. A GPU showing anomalous metrics should be flagged before it fails during a training run. The platform can cordon nodes with unhealthy GPUs (preventing new pods from scheduling) while existing workloads finish. This is proactive — fix the problem before it causes a 6-hour training failure.
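The cordon-vs-alert decision can be sketched as a small policy function. The XID codes come from the failure modes above, but the exact rules here are assumptions for illustration, not the platform's actual policy:

```python
def health_action(xid_code, dbe_count, anomaly_flagged):
    """Map hardware signals to a platform action. Hard failure signals cordon
    the node; soft anomalies alert and leave the decision to the user.
    Illustrative policy, not the platform's real rules."""
    HARD_XIDS = {48, 79}        # double-bit ECC device error, fallen off the bus
    if xid_code in HARD_XIDS:
        return "cordon"         # block new scheduling; running pods finish
    if dbe_count > 0 or anomaly_flagged:
        return "alert"          # surface the signal; the user decides
    return "none"

health_action(79, 0, False)    # → "cordon"
health_action(None, 2, False)  # → "alert"  (never auto-kill the running pod)
```

Note that no branch kills a running pod — consistent with the "alert, don't auto-kill" principle later in the chapter.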

Walk through what happens when an ML engineer submits a GPU training job:

apiVersion: lattice.dev/v1alpha1
kind: LatticeJob
metadata:
  name: llm-finetune
  namespace: ml
spec:
  queue: ml-training
  minAvailable: 4
  tasks:
    workers:
      replicas: 4
      workload:
        containers:
          trainer:
            image: registry.example.com/finetune@sha256:abc123
            variables:
              MODEL: "Qwen/Qwen3-8B"
              DATA_PATH: "${resources.training-data.s3-path}"
              WANDB_KEY: "${resources.wandb.api-key}"
            resources:
              requests:
                cpu: "8"
                memory: 64Gi
                nvidia.com/gpu: "2"
              limits:
                cpu: "8"
                memory: 64Gi
                nvidia.com/gpu: "2"
        resources:
          training-data:
            type: secret
            params:
              keys: [s3-path, access-key, secret-key]
          wandb:
            type: secret
            params:
              keys: [api-key]
          s3-endpoint:
            type: external-service
            direction: outbound
            params:
              endpoints:
                s3: https://s3.us-east-1.amazonaws.com

t=0. The engineer applies the spec. Admission webhook validates. Status: phase: Pending.

t=1s. The derivation pipeline begins. Image verification: the finetune image is signed and authorized. Cedar gates: AccessSecret permits access to training-data and wandb secrets. AccessExternalEndpoint permits egress to S3. Quota check: 4 workers × 2 GPUs × 8 CPU fits within the team’s budget.

t=3s. The shared WorkloadCompiler produces: ConfigMaps for environment variables, ExternalSecrets for S3 credentials and Weights & Biases API key, a LatticeMeshMember with the S3 egress rule. The LatticeJob compiler wraps in a Volcano VCJob with minMember: 4, referencing the ml-training queue.

t=5s. Layered application: infrastructure resources first (ExternalSecrets, ConfigMaps, LatticeMeshMember), then the VCJob. The pipeline waits for ESO to sync the S3 credentials.

t=15s. Secrets synced. VCJob created. Volcano evaluates: are there 4 nodes with 2 available GPUs each? Yes — the cluster has 8 GPU nodes, each with 4 GPUs.

t=20s. Volcano commits the gang: all 4 worker pods are placed simultaneously. No partial placement. Each pod gets 2 GPUs, 8 CPU, 64Gi memory.

t=30s. Pods start. The trainer container initializes PyTorch, discovers its peers through environment variables, and begins distributed training. DCGM metrics are being collected from each GPU.

The ML engineer wrote a job spec with containers, secrets, and GPU requirements. The platform derived: gang scheduling, secret resolution, network policy for S3 egress, Tetragon binary enforcement, and metrics collection. If a GPU on one of the worker nodes starts producing ECC errors during training, the DCGM metrics will show it — and if the anomaly detection baseline (Section 17.3) flags it, the platform can alert before the training loss diverges.

Setting gpu: true on a LatticeCluster spec installs the following during bootstrapping (Chapter 5):

  • NVIDIA GPU Operator. Manages GPU drivers, device plugins, and the CUDA toolkit.
  • Device plugins. Advertise nvidia.com/gpu resources to the scheduler.
  • DCGM exporter. Exposes GPU metrics for the metrics infrastructure.
  • Node Feature Discovery (NFD). Labels GPU nodes with hardware characteristics (GPU model, CUDA compute capability) for scheduling decisions.

GPU infrastructure is opt-in per cluster. Not every cluster needs GPUs. The gpu: true flag tells the bootstrapping sequence to install the GPU stack — the same pattern as services: true for the mesh or monitoring.enabled: true for VictoriaMetrics.

The LatticeService and LatticeJob specs support GPU resource requests through the standard Kubernetes mechanism:

resources:
  requests:
    nvidia.com/gpu: "1"
  limits:
    nvidia.com/gpu: "1"

For more advanced GPU scheduling, the type: gpu resource declaration provides additional parameters:

workload:
  resources:
    compute-gpu:
      type: gpu
      params:
        count: 4
        memory: 80Gi

The memory parameter serves as a node selector hint — the pipeline generates a node affinity rule targeting nodes with GPUs of at least the specified memory (e.g., A100-80GB vs A100-40GB). If the cluster has only one GPU class, the parameter is validated but doesn’t affect scheduling.
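A sketch of how the pipeline might translate the memory parameter into a node affinity rule. The nvidia.com/gpu.memory label (in MiB, published by GPU feature discovery) and the Gt operator are real Kubernetes/NFD conventions, but this exact mapping is an assumption about the pipeline, not its actual implementation:

```python
def gpu_memory_affinity(min_memory_gi):
    """Generate a required node-affinity clause targeting nodes whose GPUs
    have at least `min_memory_gi` GiB of memory. Uses the GPU feature
    discovery label nvidia.com/gpu.memory (MiB). Sketch only."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{
                    "matchExpressions": [{
                        "key": "nvidia.com/gpu.memory",
                        "operator": "Gt",                       # strict greater-than
                        "values": [str(min_memory_gi * 1024 - 1)],
                    }]
                }]
            }
        }
    }
```

Because Gt is a strict comparison, the sketch subtracts one from the MiB value so that a node with exactly 80Gi of GPU memory still matches an 80Gi request.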

The pipeline validates GPU requests against the cluster’s GPU capacity and the team’s GPU quota (Chapter 10). A job requesting 16 GPUs on a cluster with 8 available fails at derivation time with a clear status message — not at scheduling time with a Pending pod.
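The derivation-time check can be sketched as follows; the function name and parameters are hypothetical, not the pipeline's real API:

```python
def validate_gpu_request(replicas, gpus_per_replica, cluster_gpus_available, team_quota):
    """Return a clear status message if the request can never be satisfied,
    else None. Hypothetical helper illustrating fail-at-derivation."""
    total = replicas * gpus_per_replica
    if total > cluster_gpus_available:
        return f"requested {total} GPUs, cluster has {cluster_gpus_available} available"
    if total > team_quota:
        return f"requested {total} GPUs, team quota is {team_quota}"
    return None

validate_gpu_request(8, 2, 8, 32)
# → "requested 16 GPUs, cluster has 8 available"
```

The returned string would land in the LatticeJob's status, where the engineer sees it immediately — instead of discovering a Pending pod later.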

MIG (Multi-Instance GPU) awareness. NVIDIA A100 and H100 GPUs support MIG — partitioning a single physical GPU into multiple isolated instances. A 7-partition A100 can serve 7 independent inference workloads. If the cluster uses MIG, the derivation pipeline must be aware: the resource name changes from nvidia.com/gpu to a MIG-specific resource (e.g., nvidia.com/mig-1g.5gb), and the pipeline must set NVIDIA_VISIBLE_DEVICES correctly so the container only sees its allotted slice. MIG partitioning is a cluster-level configuration managed by the platform team — the developer requests a GPU resource, and the pipeline maps it to the appropriate MIG profile based on cluster configuration.
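The resource-name mapping might look like this sketch — the MIG profile string follows NVIDIA's naming (e.g. 1g.5gb), but the helper itself is a hypothetical illustration of the pipeline's job, not its real code:

```python
def map_gpu_request(count, cluster_mig_profile=None):
    """Map a developer's GPU request to the scheduler resource name.
    Hypothetical helper: on a MIG-partitioned cluster the platform team's
    configured profile determines the resource name; otherwise the request
    maps to whole GPUs."""
    if cluster_mig_profile:                     # e.g. "1g.5gb"
        return {f"nvidia.com/mig-{cluster_mig_profile}": count}
    return {"nvidia.com/gpu": count}            # whole-GPU request

map_gpu_request(1, "1g.5gb")   # → {"nvidia.com/mig-1g.5gb": 1}
map_gpu_request(2)             # → {"nvidia.com/gpu": 2}
```

The developer-facing spec stays the same either way; only the derived resource name changes with cluster configuration.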

Scenario: ghost loss during a 12-hour training run. A training job on 8 GPUs runs for 12 hours. At hour 3, a GPU on node-gpu-07 develops intermittent double-bit ECC errors — not enough to crash the device (no XID 48), but enough to corrupt occasional matrix multiplications. The DCGM metric DCGM_FI_DEV_ECC_DBE_VOL increments from 0 to 4 over 2 hours.

The training framework doesn’t detect the error — PyTorch sees valid CUDA return codes. The training loss plateaus at hour 5 instead of continuing to decrease. The engineer inspects the loss curve at hour 12, sees the plateau, and discards the run. Twelve hours × 8 GPUs × ~$3.50/hour (H100-class on-demand) = ~$336 wasted.

What the platform could have caught: The DCGM metrics show the ECC error count increasing. If anomaly detection (Section 17.3) is active, the deviation from the baseline (0 errors → 4 errors) triggers an alert at hour 3. The platform cordons node-gpu-07 and alerts: “GPU on node-gpu-07 has elevated double-bit ECC errors. Running job ml/training-run may be producing incorrect results.”

What the platform can’t catch: Whether the ECC errors actually affected the training. The correlation (ECC errors started at hour 3, loss plateaued at hour 5) is suggestive but not proof. The platform surfaces the hardware signal. The ML engineer makes the judgment call — checkpoint and restart, or continue and hope.

The lesson: GPU monitoring is not optional for expensive training runs. The cost of detecting a ghost loss 9 hours too late ($336) dwarfs the cost of running DCGM + anomaly detection ($50/month for the infrastructure). But the platform must be careful with automated actions — cordoning a node based on a false positive wastes the GPUs too. Alert, don’t auto-kill.

17.1. [M10] A training job has been running for 4 hours. DCGM reports a sudden increase in ECC_DBE_VOL (double-bit ECC errors) on one of the job’s GPUs. The GPU hasn’t crashed (no XID 48). What should the platform do — kill the pod (loses 4 hours), cordon the node (prevents new scheduling but the current job continues with a potentially bad GPU), or alert and let the user decide?

17.2. [H30] Design the anomaly detection pipeline for a 100-node GPU cluster (800 GPUs). What is the data volume? Where does the anomaly detection run — on the GPU nodes, on a central controller, or in the metrics backend? What is the latency from anomaly to alert? How do you handle the cold-start problem (no baseline data for new GPUs)?

17.3. [R] Ghost losses produce incorrect results without crashing. The platform monitors GPU metrics (ECC errors, XID codes, temperature). But not all ghost losses produce detectable GPU-level anomalies — some are caused by driver bugs that don’t affect hardware metrics. The only way to detect these is by monitoring the training loss curve (application-level metric). Should the platform monitor training loss, or is that the application’s responsibility? Where is the boundary?

17.4. [M10] A cluster has 50 GPU nodes with 4 GPUs each (200 total). DCGM metrics show that 12 GPUs across 8 nodes have elevated ECC error rates. The platform cordons those 8 nodes. Now 42 nodes are schedulable (168 GPUs). A job needs 170 GPUs. It can’t be scheduled. Was cordoning the right call? Design the policy: when to cordon (certain failure) vs. when to warn (possible degradation).

17.5. [H30] Multi-Instance GPU (MIG) partitions a single GPU into smaller instances. An A100 can be split into up to 7 MIG instances. This is useful for inference (small models that don’t need a full GPU) but wasteful for training (partial GPU bandwidth). Should the platform support MIG as a resource type? How does it interact with the quota system (is 1 MIG instance 1/7 of a GPU for quota purposes)? How does it interact with gang scheduling (can 7 MIG pods from the same job land on the same GPU)?