Modern automatic differentiation systems are built around accelerator hardware. GPUs and TPUs provide enormous throughput for tensor operations, making large-scale differentiable computation practical. Reverse-mode AD, deep learning, differentiable simulation, and large tensor algebra all depend heavily on these architectures.
Accelerators are not merely faster CPUs. They expose fundamentally different execution models:
- massive parallelism,
- hierarchical memory systems,
- SIMD or SIMT execution,
- high-throughput tensor units,
- asynchronous scheduling,
- and distributed communication fabrics.
Automatic differentiation systems must therefore adapt their computational graphs, memory layouts, and execution schedules to accelerator constraints.
GPU Architecture
A GPU contains thousands of lightweight execution lanes optimized for throughput rather than low-latency sequential execution.
Core ideas include:
| Feature | Purpose |
|---|---|
| Massive parallelism | Execute many threads simultaneously |
| SIMT execution | Single instruction, multiple threads |
| High memory bandwidth | Feed tensor operations |
| Warp scheduling | Hide latency |
| Specialized tensor units | Accelerate matrix operations |
GPUs are especially efficient for dense linear algebra:
- convolution,
- attention,
- tensor reductions.
These operations dominate many differentiable workloads.
SIMT Execution
GPUs typically use SIMT execution:
Single Instruction, Multiple Threads.
Threads are grouped into warps.
A warp executes one instruction stream across many threads simultaneously.
Example: each lane applies the same instruction to its own data element.
| Warp lane | Data |
|---|---|
| Thread 0 | element 0 |
| Thread 1 | element 1 |
| Thread 2 | element 2 |
| … | … |
This works well when all threads follow the same control flow.
Warp Divergence
Branching harms GPU efficiency.
Consider a data-dependent branch: some threads of a warp take the true branch while others take the false branch.
The GPU serializes the two divergent paths, masking off part of the warp for each.
This reduces utilization.
Differentiable programs with irregular control flow therefore tend to perform poorly on SIMD/SIMT hardware. A common mitigation, branchless selection, is sketched below.
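Branchless selection computes both sides of the branch for every element and then selects with a predicate, so all lanes stay on one instruction stream. A minimal JAX-style sketch (leaky ReLU chosen purely for illustration):

```python
import jax.numpy as jnp

def leaky_relu(x, alpha=0.01):
    # Both x and alpha * x are computed for every element, then
    # jnp.where selects per element. There is no data-dependent
    # branch, so SIMT lanes never diverge.
    return jnp.where(x > 0, x, alpha * x)
```

The trade is extra arithmetic in exchange for uniform control flow, which throughput-oriented hardware usually prefers.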
Tensor Operations
Automatic differentiation heavily relies on tensor primitives.
Core operations include:
| Operation | Importance |
|---|---|
| Matrix multiplication | Neural layers |
| Convolution | CNNs |
| Attention | Transformers |
| Reduction | Gradient accumulation |
| Broadcast | Tensor alignment |
| Elementwise ops | Activation functions |
Accelerator hardware is specifically optimized for these patterns.
Tensor Cores
Modern GPUs include tensor cores.
Tensor cores accelerate small matrix multiply-accumulate operations of the form D = A · B + C.
These units use mixed precision arithmetic.
Example:
| Input precision | Accumulator precision |
|---|---|
| float16 | float32 |
| bfloat16 | float32 |
| TF32 | float32 |
Automatic differentiation systems must account for these precision differences during forward and backward execution.
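In JAX, for example, one way to request this pattern is a low-precision matmul with a float32 accumulator. A minimal sketch (assuming a recent JAX version):

```python
import jax.numpy as jnp
from jax import lax

a = jnp.ones((128, 128), dtype=jnp.bfloat16)
b = jnp.ones((128, 128), dtype=jnp.bfloat16)

# bfloat16 inputs with float32 accumulation: the multiply-accumulate
# pattern that tensor cores implement in hardware.
c = lax.dot(a, b, preferred_element_type=jnp.float32)
print(c.dtype)  # float32
```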
TPU Architecture
TPUs are specialized tensor processors designed primarily for machine learning workloads.
They emphasize:
- matrix throughput,
- systolic array execution,
- large tensor operations,
- high-bandwidth interconnects.
A TPU core behaves differently from a GPU.
Instead of fine-grained thread scheduling, TPUs emphasize large fused tensor programs.
Systolic Arrays
TPUs use systolic arrays for matrix multiplication.
Data flows rhythmically through a grid of compute units.
This structure is extremely efficient for dense matrix multiplication.
Matrix multiplications dominate transformer and neural network workloads, making systolic execution highly effective.
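A toy, time-stepped simulation can make the data flow concrete. The sketch below (plain Python, illustrative only) models an output-stationary array in which each processing element accumulates one output and operands arrive skewed by one cycle per row and column:

```python
import numpy as np

def systolic_matmul(A, B):
    # PE (i, j) accumulates C[i, j]; at cycle t it sees A[i, s] and
    # B[s, j] with s = t - i - j, i.e. operands arrive skewed.
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for t in range(n + m + k - 2):          # total pipeline cycles
        for i in range(n):
            for j in range(m):
                s = t - i - j
                if 0 <= s < k:
                    C[i, j] += A[i, s] * B[s, j]
    return C

A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)
assert np.allclose(systolic_matmul(A, B), A @ B)
```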
Accelerator-Friendly AD
Not all AD programs map efficiently onto accelerators.
Efficient accelerator execution favors:
| Preferred pattern | Reason |
|---|---|
| Large dense tensors | High utilization |
| Regular memory access | Better bandwidth |
| Fused kernels | Less memory traffic |
| Static shapes | Better compilation |
| Large batch operations | More parallel work |
Poor patterns include:
| Problematic pattern | Consequence |
|---|---|
| Tiny kernels | Launch overhead |
| Dynamic shapes | Compiler difficulty |
| Sparse irregular graphs | Poor locality |
| Branch-heavy code | Warp divergence |
| Frequent synchronization | Reduced throughput |
Memory Hierarchy
Accelerator performance depends heavily on memory movement.
Typical hierarchy:
| Memory | Speed | Capacity |
|---|---|---|
| Registers | Fastest | Small |
| Shared/local memory | Very fast | Small |
| L2 cache | Fast | Moderate |
| Global device memory | Slower | Large |
| Host memory | Much slower | Very large |
Automatic differentiation systems must minimize expensive memory traffic.
Bandwidth vs Compute
Modern accelerators are often memory-bandwidth bound rather than compute bound.
A simple elementwise operation such as y = x + 1 may require more memory movement than arithmetic work: each element is read and written once but participates in only a single addition.
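A rough arithmetic-intensity estimate makes this concrete (hypothetical sizes, float32 data):

```python
n = 1_000_000
bytes_moved = 4 * n * 2     # read x once, write y once, 4 bytes each
flops = n                   # one addition per element
print(flops / bytes_moved)  # 0.125 FLOPs per byte: firmly bandwidth bound
```

Modern accelerators sustain on the order of tens of FLOPs per byte of memory bandwidth, so such an operation leaves the compute units mostly idle.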
Backward propagation intensifies this because it repeatedly loads:
- activations,
- gradients,
- parameters,
- optimizer state.
Thus efficient AD systems prioritize:
- fusion,
- locality,
- recomputation,
- tensor reuse.
Kernel Launch Overhead
Each GPU kernel launch has overhead.
Very small operations may spend more time launching kernels than computing.
Consider a linear layer. A naive implementation launches three kernels:
- a matrix multiply,
- a bias addition,
- a ReLU activation.
A fused kernel combines them into one execution unit.
This reduces:
- launch overhead,
- memory traffic,
- intermediate allocations.
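A minimal JAX-style sketch: wrapping the three steps in one jitted function lets the compiler emit them as a single compiled program (fusion is a compiler decision, not a guarantee):

```python
import jax
import jax.numpy as jnp

@jax.jit  # one compiled executable instead of three separate launches
def linear_relu(x, w, b):
    return jnp.maximum(x @ w + b, 0.0)  # matmul + bias + ReLU
```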
Kernel Fusion
Fusion is central to accelerator optimization.
Suppose we evaluate y = sin(f(x)) for some elementwise f.
Without fusion:
- compute f(x),
- store the intermediate,
- reload the intermediate,
- compute the sine.
Fused execution computes sin(f(x)) directly without materializing the intermediate.
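For concreteness, take f(x) = x² (an assumption for illustration). Under jax.jit, XLA can fuse the square and the sine into one kernel:

```python
import jax
import jax.numpy as jnp

@jax.jit
def g(x):
    # The compiler may fuse these elementwise ops, so x * x is never
    # written to device memory as a separate tensor.
    return jnp.sin(x * x)
```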
Benefits include:
| Benefit | Effect |
|---|---|
| Lower memory traffic | Higher throughput |
| Fewer allocations | Lower overhead |
| Better cache locality | Improved utilization |
Backward passes benefit similarly.
Asynchronous Execution
Accelerators typically execute asynchronously relative to the host CPU.
The CPU queues kernels:
- launch kernel,
- continue execution,
- synchronize later.
This overlaps:
- compute,
- communication,
- memory transfer.
AD runtimes therefore schedule entire execution streams asynchronously.
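JAX, for instance, dispatches asynchronously: a jitted call returns as soon as the work is enqueued, and block_until_ready() waits for the device. A small timing sketch (sizes are arbitrary):

```python
import time
import jax
import jax.numpy as jnp

x = jnp.ones((4096, 4096))
f = jax.jit(lambda x: x @ x)
f(x).block_until_ready()        # warm up: compile once

t0 = time.perf_counter()
y = f(x)                        # returns quickly: work is only enqueued
t1 = time.perf_counter()
y.block_until_ready()           # actually wait for the device
t2 = time.perf_counter()
print(f"enqueue {t1 - t0:.6f}s, compute {t2 - t1:.6f}s")
```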
Streams and Queues
GPU systems use streams for concurrent execution.
Independent operations may execute simultaneously if resources allow.
Example:
| Stream | Operation |
|---|---|
| Stream 1 | Matrix multiply |
| Stream 2 | Data transfer |
| Stream 3 | Reduction |
The runtime must preserve dependency correctness while maximizing overlap.
Backward Pass Scheduling
Reverse mode naturally creates dependency chains.
The runtime schedules backward kernels according to graph topology.
Independent gradient computations may execute concurrently.
Merged dependencies require synchronization.
Efficient scheduling minimizes idle accelerator time.
Memory Explosion on Accelerators
Accelerators have limited memory compared to CPUs.
Typical GPU memory:
| Device | Approximate memory |
|---|---|
| Consumer GPU | 8–24 GB |
| Data center GPU | 40–192 GB |
| TPU device | Similar order |
Large models easily exceed these limits.
Activation storage during reverse mode is often the dominant memory consumer.
Activation Recomputation
To reduce memory usage, systems recompute activations instead of storing them.
This is especially valuable on accelerators because:
- arithmetic throughput is high,
- memory bandwidth is precious.
Recomputation trades extra FLOPs for reduced memory traffic.
Modern accelerators often favor this trade.
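In JAX this trade is exposed as jax.checkpoint (also known as jax.remat): intermediates inside the wrapped function are recomputed during the backward pass instead of stored. A minimal sketch:

```python
import jax
import jax.numpy as jnp

def block(w, x):
    for _ in range(4):
        x = jnp.tanh(x @ w)   # four layers of intermediates
    return x

block_remat = jax.checkpoint(block)   # recompute instead of storing

w = jnp.ones((16, 16)) * 0.1
x = jnp.ones((8, 16))
g = jax.grad(lambda w: block_remat(w, x).sum())(w)
```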
Mixed Precision Execution
Accelerators are optimized for low precision.
Common training precisions include:
| Format | Usage |
|---|---|
| float16 | High throughput |
| bfloat16 | Wider exponent range |
| TF32 | Tensor-core optimized |
| float32 | Accumulation and stability |
Mixed precision improves throughput dramatically.
But it also introduces:
- overflow risk,
- underflow risk,
- gradient quantization,
- reproducibility issues.
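A common policy (sketched below; names are illustrative) keeps float32 "master" parameters, casts to bfloat16 for the forward compute, and returns to float32 for accumulation:

```python
import jax.numpy as jnp

def forward(w_f32, x):
    w = w_f32.astype(jnp.bfloat16)    # low-precision compute copy
    h = x.astype(jnp.bfloat16) @ w    # fast, tensor-core friendly
    return h.astype(jnp.float32)      # accumulate/read out in float32
```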
Loss Scaling
Low precision gradients may underflow.
Loss scaling multiplies the loss by a large constant s before differentiation: L_scaled = s · L.
Backward propagation then computes the scaled gradients s · ∂L/∂θ.
Gradients remain representable during propagation.
Afterwards the optimizer divides by s to recover the true gradients.
Dynamic scaling adjusts s automatically to avoid overflow.
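A minimal JAX sketch of static loss scaling (the scale 2¹⁵ is a typical but arbitrary choice):

```python
import jax
import jax.numpy as jnp

def loss_fn(w, x, y):
    return jnp.mean((x @ w - y) ** 2)

def scaled_grad(w, x, y, scale=2.0 ** 15):
    # Differentiate scale * loss so small gradients stay representable
    # in low precision, then divide the scale back out.
    g = jax.grad(lambda w: scale * loss_fn(w, x, y))(w)
    return g / scale
```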
TPU Compilation
TPUs often prefer static compiled graphs.
The compiler performs:
- graph fusion,
- memory planning,
- tensor layout optimization,
- operation scheduling.
Dynamic graphs are harder to optimize.
This motivates staged or traced execution systems.
XLA and Graph Lowering
Systems like XLA lower high-level tensor programs into optimized accelerator kernels.
The compiler may:
| Optimization | Effect |
|---|---|
| Fuse operators | Reduce memory traffic |
| Reorder computations | Improve locality |
| Tile matrices | Match hardware |
| Vectorize operations | Increase throughput |
| Plan memory reuse | Reduce allocations |
AD systems increasingly integrate tightly with compiler infrastructure.
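Recent JAX versions let you inspect the program handed to XLA, which makes this lowering step tangible (output format varies by version):

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x * x) + 1.0

# Print the first part of the lowered (StableHLO) program.
print(jax.jit(f).lower(jnp.ones((8,))).as_text()[:400])
```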
Sparse Operations
Sparse operations remain challenging on accelerators.
Dense tensor hardware assumes:
- regular memory access,
- predictable compute patterns,
- high arithmetic intensity.
Sparse workloads violate these assumptions.
Examples:
- graph neural networks,
- sparse attention,
- routing systems,
- mixture-of-experts.
Sparse AD often suffers from:
- load imbalance,
- poor cache utilization,
- synchronization overhead.
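A typical sparse primitive is a scatter-add over irregular indices, as in message passing for graph neural networks. A small JAX sketch (the graph structure is made up for illustration):

```python
import jax.numpy as jnp
from jax import ops

messages = jnp.ones((5, 3))            # one message per edge
dst = jnp.array([0, 0, 1, 2, 2])       # destination node of each edge

# Scatter-add: irregular, data-dependent indexing that dense
# tensor hardware handles far less efficiently than matmul.
node_sums = ops.segment_sum(messages, dst, num_segments=3)
```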
Communication on Accelerators
Large models require multiple accelerators.
Communication becomes critical.
Operations include:
| Collective | Purpose |
|---|---|
| All-reduce | Gradient aggregation |
| Broadcast | Parameter distribution |
| All-gather | Tensor assembly |
| Reduce-scatter | Sharded optimization |
Communication latency can dominate training time at scale.
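A data-parallel sketch in JAX: each device computes gradients on its own shard, then lax.pmean performs the all-reduce (this runs on however many local devices are available):

```python
import jax
import jax.numpy as jnp
from jax import lax

def loss(w, x):
    return jnp.sum((x @ w) ** 2)

def grad_step(w, x):
    g = jax.grad(loss)(w, x)
    return lax.pmean(g, axis_name="devices")  # all-reduce average

n = jax.local_device_count()
w = jnp.ones((n, 4, 2)) * 0.1   # replicated parameters, one copy per device
x = jnp.ones((n, 8, 4))         # one data shard per device
g = jax.pmap(grad_step, axis_name="devices")(w, x)
```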
Overlapping Communication and Compute
Modern systems overlap:
- gradient communication,
- backward computation,
- optimizer updates.
Example:
While layer k's gradients are being communicated, layer k−1 continues its backward computation.
This improves utilization.
Accelerator Numerical Semantics
Accelerators sometimes alter numerical behavior.
Examples:
| Mechanism | Effect |
|---|---|
| Tensor cores | Reduced precision multiply |
| Fused kernels | Different rounding |
| Parallel reductions | Non-associative sums |
| Fast math intrinsics | Approximate transcendental functions |
| Flush-to-zero | Aggressive underflow handling |
Two mathematically identical programs may therefore produce different gradients across devices.
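Non-associativity alone is easy to demonstrate: summing the same float32 data in two different orders, as a parallel reduction would, gives slightly different results:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

s_flat = x.sum()                              # sequential order
s_tree = x.reshape(1000, 1000).sum(0).sum()   # blocked/tree-like order
print(s_flat - s_tree)                        # typically nonzero
```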
Dynamic Shapes
Dynamic tensor shapes complicate accelerator optimization.
Static shapes enable:
- preallocation,
- fusion,
- scheduling,
- layout optimization.
Dynamic workloads require:
- recompilation,
- padding,
- shape polymorphism,
- runtime dispatch.
This creates tension between flexibility and performance.
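Padding is the workhorse workaround: pad every input to one static length and mask out the padding, so the compiler sees a single shape. A minimal JAX sketch (max_len is an assumed bucket size):

```python
import jax
import jax.numpy as jnp

@jax.jit
def masked_mean(x, mask):
    return jnp.sum(x * mask) / jnp.sum(mask)

def pad_to(x, max_len):
    # Pad to one static length so jit compiles a single program.
    mask = (jnp.arange(max_len) < x.shape[0]).astype(x.dtype)
    return jnp.pad(x, (0, max_len - x.shape[0])), mask

x, m = pad_to(jnp.arange(5.0), 16)   # compile once for length 16
print(masked_mean(x, m))             # 2.0, unaffected by padding
```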
Accelerator Utilization
A major systems goal is maximizing utilization.
Low utilization wastes expensive hardware.
Common causes include:
| Cause | Effect |
|---|---|
| Small batch sizes | Idle compute units |
| Host-device synchronization | Stalls |
| Memory bottlenecks | Compute starvation |
| Load imbalance | Idle devices |
| Tiny kernels | Launch overhead |
AD runtimes therefore aggressively optimize execution graphs.
Automatic Differentiation as Accelerator Programming
Modern AD systems increasingly resemble accelerator compilers.
They must manage:
- graph transformation,
- scheduling,
- fusion,
- communication,
- memory planning,
- precision policy,
- distributed execution.
Differentiation is no longer just symbolic calculus or local chain rules. It is deeply tied to hardware architecture.
Core Idea
GPU and TPU execution fundamentally shape modern automatic differentiation systems. Reverse-mode differentiation must adapt to massively parallel hardware, hierarchical memory systems, mixed precision arithmetic, and distributed communication fabrics. Efficient differentiable computation therefore depends as much on compiler and accelerator architecture as on calculus itself.