GPU and TPU Execution

Modern automatic differentiation systems are built around accelerator hardware. GPUs and TPUs provide enormous throughput for tensor operations, making large-scale differentiable computation practical. Reverse-mode AD, deep learning, differentiable simulation, and large tensor algebra all depend heavily on these architectures.

Accelerators are not merely faster CPUs. They expose fundamentally different execution models:

  • massive parallelism,
  • hierarchical memory systems,
  • SIMD or SIMT execution,
  • high-throughput tensor units,
  • asynchronous scheduling,
  • and distributed communication fabrics.

Automatic differentiation systems must therefore adapt their computational graphs, memory layouts, and execution schedules to accelerator constraints.

GPU Architecture

A GPU contains thousands of lightweight execution lanes optimized for throughput rather than low-latency sequential execution.

Core ideas include:

  • Massive parallelism: execute many threads simultaneously
  • SIMT execution: single instruction, multiple threads
  • High memory bandwidth: feed tensor operations
  • Warp scheduling: hide latency
  • Specialized tensor units: accelerate matrix operations

GPUs are especially efficient for dense linear algebra: matrix multiplication (C = AB), convolution, attention, and tensor reductions.

These operations dominate many differentiable workloads.
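
Notably, the backward pass of a dense matrix product is itself a pair of dense matrix products, so reverse mode stays on the same fast path as the forward pass. A minimal JAX sketch (the shapes here are arbitrary placeholders):

```python
import jax
import jax.numpy as jnp

def f(A, B):
    # Scalar loss over a dense matmul, so jax.grad applies directly.
    return jnp.sum(A @ B)

key = jax.random.PRNGKey(0)
A = jax.random.normal(key, (512, 256))
B = jax.random.normal(key, (256, 128))

# Reverse mode: dL/dA = G @ B^T and dL/dB = A^T @ G are themselves dense
# matmuls, so the backward pass also runs on the matrix hardware.
dA, dB = jax.grad(f, argnums=(0, 1))(A, B)
print(dA.shape, dB.shape)  # (512, 256) (256, 128)
```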

SIMT Execution

GPUs typically use SIMT execution:

Single Instruction, Multiple Threads.

Threads are grouped into warps.

A warp executes one instruction stream across many threads simultaneously.

Example:

  • Thread 0: x_0
  • Thread 1: x_1
  • Thread 2: x_2

This works well when all threads follow the same control flow.

Warp Divergence

Branching harms GPU efficiency.

Example:

if x_i > 0.

Some threads may take the true branch while others take the false branch.

The GPU serializes divergent paths.

This reduces utilization.

Differentiable programs with irregular control flow therefore perform poorly on SIMD/SIMT hardware.
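
One standard mitigation is to predicate rather than branch: compute both sides and select per element, so every lane executes the same instruction stream. A small JAX sketch (the 2.0 and 0.5 scale factors are arbitrary placeholders):

```python
import jax
import jax.numpy as jnp

def predicated(x):
    # Instead of `if x_i > 0: ... else: ...` per element, jnp.where evaluates
    # both branches for every lane and selects, keeping the warp convergent.
    return jnp.where(x > 0, x * 2.0, x * 0.5)

x = jnp.array([-1.0, 0.5, 2.0, -3.0])
y = predicated(x)
# The gradient is also branch-free: a select over the two branch derivatives.
g = jax.grad(lambda x: predicated(x).sum())(x)
```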

Tensor Operations

Automatic differentiation heavily relies on tensor primitives.

Core operations include:

  • Matrix multiplication: neural layers
  • Convolution: CNNs
  • Attention: transformers
  • Reduction: gradient accumulation
  • Broadcast: tensor alignment
  • Elementwise ops: activation functions

Accelerator hardware is specifically optimized for these patterns.

Tensor Cores

Modern GPUs include tensor cores.

Tensor cores accelerate small matrix multiply-accumulate operations:

D = AB + C.

These units use mixed precision arithmetic.

Example:

  • float16 inputs accumulate in float32
  • bfloat16 inputs accumulate in float32
  • TF32 inputs accumulate in float32

Automatic differentiation systems must account for these precision differences during forward and backward execution.
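
As a concrete illustration, recent JAX versions let a program request exactly this pattern: low-precision inputs with a float32 accumulator, via the preferred_element_type argument (on older versions the same knob lives on lax.dot_general). A hedged sketch:

```python
import jax.numpy as jnp

a = jnp.ones((128, 128), dtype=jnp.bfloat16)
b = jnp.ones((128, 128), dtype=jnp.bfloat16)

# bfloat16 inputs, float32 accumulation: the pattern tensor cores implement.
# (Assumes a JAX version where jnp.matmul exposes preferred_element_type.)
c = jnp.matmul(a, b, preferred_element_type=jnp.float32)
print(c.dtype)  # float32
```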

TPU Architecture

TPUs are specialized tensor processors designed primarily for machine learning workloads.

They emphasize:

  • matrix throughput,
  • systolic array execution,
  • large tensor operations,
  • high-bandwidth interconnects.

A TPU core behaves differently from a GPU.

Instead of fine-grained thread scheduling, TPUs emphasize large fused tensor programs.

Systolic Arrays

TPUs use systolic arrays for matrix multiplication.

Data flows rhythmically through a grid of compute units.

This structure is extremely efficient for:

C = AB.

Matrix multiplications dominate transformer and neural network workloads, making systolic execution highly effective.

Accelerator-Friendly AD

Not all AD programs map efficiently onto accelerators.

Efficient accelerator execution favors:

  • Large dense tensors: high utilization
  • Regular memory access: better bandwidth
  • Fused kernels: less memory traffic
  • Static shapes: better compilation
  • Large batch operations: more parallel work

Poor patterns include:

  • Tiny kernels: launch overhead
  • Dynamic shapes: compiler difficulty
  • Sparse irregular graphs: poor locality
  • Branch-heavy code: warp divergence
  • Frequent synchronization: reduced throughput

Memory Hierarchy

Accelerator performance depends heavily on memory movement.

Typical hierarchy:

  • Registers: fastest, small capacity
  • Shared/local memory: very fast, small capacity
  • L2 cache: fast, moderate capacity
  • Global device memory: slower, large capacity
  • Host memory: much slower, very large capacity

Automatic differentiation systems must minimize expensive memory traffic.

Bandwidth vs Compute

Modern accelerators are often memory-bandwidth bound rather than compute bound.

A simple operation:

y = x + b

may require more memory movement than arithmetic work.

Backward propagation intensifies this because it repeatedly loads:

  • activations,
  • gradients,
  • parameters,
  • optimizer state.

Thus efficient AD systems prioritize:

  • fusion,
  • locality,
  • recomputation,
  • tensor reuse.
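
A rough back-of-the-envelope estimate for the y = x + b example above makes the imbalance concrete (assuming float32 data and counting only the reads and writes of x and y):

```python
# Arithmetic intensity of y = x + b for n float32 elements.
n = 10_000_000
bytes_moved = 2 * n * 4          # read x, write y (the broadcast b is negligible)
flops = n                        # one add per element
intensity = flops / bytes_moved  # ~0.125 FLOPs per byte
print(intensity)
# At roughly 0.1 FLOPs per byte, a device with terabytes-per-second of
# bandwidth but tens of teraFLOP/s of compute is starved for data:
# the operation is memory bound, not compute bound.
```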

Kernel Launch Overhead

Each GPU kernel launch has overhead.

Very small operations may spend more time launching kernels than computing.

Example sequence:

y = ReLU(Wx + b).

Naively:

  1. launch matrix multiply,
  2. launch bias addition,
  3. launch ReLU.

A fused kernel combines them into one execution unit.

This reduces:

  • launch overhead,
  • memory traffic,
  • intermediate allocations.
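
A minimal JAX sketch of the ReLU(Wx + b) example above: wrapping the layer in jit lets XLA fuse the bias add and ReLU into the matmul epilogue (exactly which operators get fused is a compiler decision, and the shapes here are placeholders):

```python
import jax
import jax.numpy as jnp

def layer(W, x, b):
    return jax.nn.relu(W @ x + b)

# Under jit, the bias add and ReLU can be applied to the matmul output
# without separate kernel launches or a materialized intermediate for W @ x.
layer_fused = jax.jit(layer)

W = jnp.ones((256, 128))
x = jnp.ones((128,))
b = jnp.zeros((256,))
y = layer_fused(W, x, b)
```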

Kernel Fusion

Fusion is central to accelerator optimization.

Suppose:

z = sin(x + y).

Without fusion:

  1. compute x + y,
  2. store intermediate,
  3. reload intermediate,
  4. compute sine.

Fused execution computes directly without materializing intermediates.

Benefits include:

  • Lower memory traffic: higher throughput
  • Fewer allocations: lower overhead
  • Better cache locality: improved utilization

Backward passes benefit similarly.
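
For instance, in a JAX-style sketch the gradient of the fused expression can itself be compiled as one fused program: the backward pass needs cos(x + y), which the compiler can derive from x and y directly rather than reloading a stored intermediate (a sketch, not a guarantee of what any particular compiler emits):

```python
import jax
import jax.numpy as jnp

def f(x, y):
    return jnp.sum(jnp.sin(x + y))

# Forward and backward are each compiled as fused programs under jit.
grad_f = jax.jit(jax.grad(f, argnums=(0, 1)))

x = jnp.linspace(0.0, 1.0, 1024)
y = jnp.linspace(1.0, 2.0, 1024)
dx, dy = grad_f(x, y)
```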

Asynchronous Execution

Accelerators typically execute asynchronously relative to the host CPU.

The CPU queues kernels:

  1. launch kernel,
  2. continue execution,
  3. synchronize later.

This overlaps:

  • compute,
  • communication,
  • memory transfer.

AD runtimes therefore schedule entire execution streams asynchronously.
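
JAX's asynchronous dispatch is one concrete instance of this pattern: the host enqueues device work and only blocks when a result is explicitly requested.

```python
import jax.numpy as jnp

x = jnp.ones((4096, 4096))

# Returns almost immediately: the kernel is queued on the device while the
# Python host thread keeps running.
y = jnp.matmul(x, x)

# ... the host can enqueue more work or prepare the next batch here ...

# Synchronize only when the value is actually needed.
y.block_until_ready()
```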

Streams and Queues

GPU systems use streams for concurrent execution.

Independent operations may execute simultaneously if resources allow.

Example:

  • Stream 1: matrix multiply
  • Stream 2: data transfer
  • Stream 3: reduction

The runtime must preserve dependency correctness while maximizing overlap.

Backward Pass Scheduling

Reverse mode naturally creates dependency chains.

The runtime schedules backward kernels according to graph topology.

Independent gradient computations may execute concurrently.

Where dependency paths merge, synchronization is required.

Efficient scheduling minimizes idle accelerator time.

Memory Explosion on Accelerators

Accelerators have limited memory compared to CPUs.

Typical GPU memory:

  • Consumer GPU: 8–24 GB
  • Data center GPU: 40–192 GB
  • TPU device: similar order of magnitude

Large models easily exceed these limits.

Activation storage during reverse mode is often the dominant memory consumer.

Activation Recomputation

To reduce memory usage, systems recompute activations instead of storing them.

This is especially valuable on accelerators because:

  • arithmetic throughput is high,
  • memory bandwidth is precious.

Recomputation trades extra FLOPs for reduced memory traffic.

Modern accelerators often favor this trade.
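
A minimal sketch using jax.checkpoint (also known as jax.remat); the block function here is a hypothetical stand-in for any expensive layer whose activations would otherwise be kept alive until the backward pass:

```python
import jax
import jax.numpy as jnp

def block(W, x):
    # Stand-in for an expensive sub-network.
    for _ in range(4):
        x = jnp.tanh(W @ x)
    return x

def loss(W, x):
    # jax.checkpoint tells reverse mode not to store block's intermediates;
    # they are recomputed during the backward pass instead.
    y = jax.checkpoint(block)(W, x)
    return jnp.sum(y ** 2)

W = jnp.eye(512) * 0.1
x = jnp.ones((512,))
g = jax.grad(loss)(W, x)
```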

Mixed Precision Execution

Accelerators are optimized for low-precision arithmetic.

Common training precisions include:

  • float16: high throughput
  • bfloat16: wider exponent range
  • TF32: tensor-core optimized
  • float32: accumulation and stability

Mixed precision improves throughput dramatically.

But it also introduces:

  • overflow risk,
  • underflow risk,
  • gradient quantization,
  • reproducibility issues.

Loss Scaling

Low-precision gradients may underflow.

Loss scaling multiplies the loss:

L' = αL.

Backward propagation computes:

∇L' = α∇L.

Gradients remain representable during propagation.

The gradients are later divided by α.

Dynamic loss scaling adjusts α automatically, growing it while gradients remain finite and shrinking it when overflow is detected.
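
A minimal static-scaling sketch in JAX (the loss function, shapes, and value of α are placeholders, and the dynamic adjustment just described is omitted):

```python
import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Placeholder loss; assume the forward pass runs largely in low precision.
    x, y = batch
    return jnp.mean((x @ params - y) ** 2)

def scaled_grads(params, batch, alpha=2.0 ** 12):
    # Scale the loss so small gradients stay representable during propagation...
    scaled_loss = lambda p: alpha * loss_fn(p, batch)
    g = jax.grad(scaled_loss)(params)
    # ...then unscale before the optimizer update.
    return jax.tree_util.tree_map(lambda t: t / alpha, g)

params = jnp.ones((8,))
batch = (jnp.ones((32, 8)), jnp.zeros((32,)))
grads = scaled_grads(params, batch)
```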

TPU Compilation

TPUs generally favor statically compiled graphs.

The compiler performs:

  • graph fusion,
  • memory planning,
  • tensor layout optimization,
  • operation scheduling.

Dynamic graphs are harder to optimize.

This motivates staged or traced execution systems.

XLA and Graph Lowering

Systems like XLA lower high-level tensor programs into optimized accelerator kernels.

The compiler may:

  • Fuse operators: reduce memory traffic
  • Reorder computations: improve locality
  • Tile matrices: match the hardware
  • Vectorize operations: increase throughput
  • Plan memory reuse: reduce allocations

AD systems increasingly integrate tightly with compiler infrastructure.
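
With JAX, for example, the lowered and compiled programs can be inspected directly, which makes the compiler's fusion and layout decisions visible (the exact inspection methods vary somewhat across JAX versions):

```python
import jax
import jax.numpy as jnp

def f(x, y):
    return jnp.sin(x + y) * 2.0

x = jnp.ones((1024,))
y = jnp.ones((1024,))

lowered = jax.jit(f).lower(x, y)
print(lowered.as_text())            # high-level (StableHLO) program
print(lowered.compile().as_text())  # backend-optimized HLO, fusions visible
```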

Sparse Operations

Sparse operations remain challenging on accelerators.

Dense tensor hardware assumes:

  • regular memory access,
  • predictable compute patterns,
  • high arithmetic intensity.

Sparse workloads violate these assumptions.

Examples:

  • graph neural networks,
  • sparse attention,
  • routing systems,
  • mixture-of-experts.

Sparse AD often suffers from:

  • load imbalance,
  • poor cache utilization,
  • synchronization overhead.

Communication on Accelerators

Large models require multiple accelerators.

Communication becomes critical.

Operations include:

  • All-reduce: gradient aggregation
  • Broadcast: parameter distribution
  • All-gather: tensor assembly
  • Reduce-scatter: sharded optimization

Communication latency can dominate training time at scale.
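
A minimal data-parallel sketch in JAX: each device computes gradients on its own shard, and a pmean collective performs the all-reduce. Real systems layer this over NCCL or TPU interconnect collectives, and newer JAX code often expresses it with jit and sharding rather than pmap; the loss and shapes here are placeholders.

```python
from functools import partial
import jax
import jax.numpy as jnp

def loss_fn(params, x):
    return jnp.mean((x @ params) ** 2)

@partial(jax.pmap, axis_name='devices')
def grad_step(params, x):
    g = jax.grad(loss_fn)(params, x)
    # All-reduce: average gradients across devices before the parameter update.
    return jax.lax.pmean(g, axis_name='devices')

n_dev = jax.device_count()
params = jnp.ones((n_dev, 8))      # replicated parameters, one copy per device
x = jnp.ones((n_dev, 32, 8))       # one data shard per device
grads = grad_step(params, x)       # identical on every device after pmean
```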

Overlapping Communication and Compute

Modern systems overlap:

  • gradient communication,
  • backward computation,
  • optimizer updates.

Example:

While the gradients of layer k are being communicated, layer k-1 continues its backward computation.

This improves utilization.

Accelerator Numerical Semantics

Accelerators sometimes alter numerical behavior.

Examples:

  • Tensor cores: reduced-precision multiplies
  • Fused kernels: different rounding
  • Parallel reductions: non-associative sums
  • Fast math intrinsics: approximate transcendental functions
  • Flush-to-zero: aggressive underflow handling

Two mathematically identical programs may therefore produce different gradients across devices.
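
Non-associativity is easy to demonstrate: summing the same float32 values in a different order gives a different answer, which is exactly what changing a parallel reduction tree does.

```python
import jax.numpy as jnp

x = jnp.array([1.0e8, 1.0, -1.0e8], dtype=jnp.float32)

# Left-to-right: the 1.0 is absorbed into 1e8 and lost, so the total is 0.0.
print((x[0] + x[1]) + x[2])

# A different association keeps it: the total is 1.0.
print((x[0] + x[2]) + x[1])
```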

Dynamic Shapes

Dynamic tensor shapes complicate accelerator optimization.

Static shapes enable:

  • preallocation,
  • fusion,
  • scheduling,
  • layout optimization.

Dynamic workloads require:

  • recompilation,
  • padding,
  • shape polymorphism,
  • runtime dispatch.

This creates tension between flexibility and performance.
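
A small JAX illustration of the recompilation cost: a jitted function is traced and compiled once per distinct input shape, which is why padding variable-length inputs to fixed-size buckets is a common workaround.

```python
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    # The Python body runs once per new input shape, i.e. once per
    # compilation; the print makes each retrace visible.
    print("tracing for shape", x.shape)
    return jnp.sum(x * 2.0)

f(jnp.ones((128,)))   # traces and compiles for shape (128,)
f(jnp.ones((128,)))   # reuses the cached executable, no print
f(jnp.ones((256,)))   # new shape -> retrace and recompile

# Common workaround: pad variable-length inputs to a fixed bucket size
# (plus a mask) so a single compiled program is reused.
```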

Accelerator Utilization

A major systems goal is maximizing utilization.

Low utilization wastes expensive hardware.

Common causes include:

  • Small batch sizes: idle compute units
  • Host-device synchronization: stalls
  • Memory bottlenecks: compute starvation
  • Load imbalance: idle devices
  • Tiny kernels: launch overhead

AD runtimes therefore aggressively optimize execution graphs.

Automatic Differentiation as Accelerator Programming

Modern AD systems increasingly resemble accelerator compilers.

They must manage:

  • graph transformation,
  • scheduling,
  • fusion,
  • communication,
  • memory planning,
  • precision policy,
  • distributed execution.

Differentiation is no longer just symbolic calculus or local chain rules. It is deeply tied to hardware architecture.

Core Idea

GPU and TPU execution fundamentally shape modern automatic differentiation systems. Reverse-mode differentiation must adapt to massively parallel hardware, hierarchical memory systems, mixed precision arithmetic, and distributed communication fabrics. Efficient differentiable computation therefore depends as much on compiler and accelerator architecture as on calculus itself.