GPU and TPU Execution

Modern automatic differentiation systems are built around accelerator hardware. GPUs and TPUs provide enormous throughput for tensor operations, making large-scale differentiable computation practical. Reverse-mode AD, deep learning, differentiable simulation, and large tensor algebra all depend heavily on these architectures.

Accelerators are not merely faster CPUs. They expose fundamentally different execution models:

  • massive parallelism,
  • hierarchical memory systems,
  • SIMD or SIMT execution,
  • high-throughput tensor units,
  • asynchronous scheduling,
  • and distributed communication fabrics.

Automatic differentiation systems must therefore adapt their computational graphs, memory layouts, and execution schedules to accelerator constraints.

GPU Architecture

A GPU contains thousands of lightweight execution lanes optimized for throughput rather than low-latency sequential execution.

Core ideas include:

  • Massive parallelism: execute many threads simultaneously
  • SIMT execution: single instruction, multiple threads
  • High memory bandwidth: feed tensor operations
  • Warp scheduling: hide latency
  • Specialized tensor units: accelerate matrix operations

GPUs are especially efficient for dense linear algebra: matrix multiplication (C = AB), convolution, attention, and tensor reductions.

These operations dominate many differentiable workloads.
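
Notably, the backward pass of a dense matrix product is itself a pair of dense matrix products, so reverse mode stays on the same fast path as the forward pass. A minimal JAX sketch (the shapes here are arbitrary placeholders):

```python
import jax
import jax.numpy as jnp

def f(A, B):
    # Scalar loss over a dense matmul, so jax.grad applies directly.
    return jnp.sum(A @ B)

key = jax.random.PRNGKey(0)
A = jax.random.normal(key, (512, 256))
B = jax.random.normal(key, (256, 128))

# Reverse mode: dL/dA = G @ B^T and dL/dB = A^T @ G are themselves dense
# matmuls, so the backward pass also runs on the matrix hardware.
dA, dB = jax.grad(f, argnums=(0, 1))(A, B)
print(dA.shape, dB.shape)  # (512, 256) (256, 128)
```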

SIMT Execution

GPUs typically use SIMT execution:

Single Instruction, Multiple Threads.

Threads are grouped into warps.

A warp executes one instruction stream across many threads simultaneously.

Example:

  • Thread 0: x_0
  • Thread 1: x_1
  • Thread 2: x_2

This works well when all threads follow the same control flow.

Warp Divergence

Branching harms GPU efficiency.

Example:

if x_i > 0.

Some threads may take the true branch while others take the false branch.

The GPU serializes divergent paths.

This reduces utilization.

Differentiable programs with irregular control flow therefore perform poorly on SIMD/SIMT hardware.
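
One standard mitigation is to predicate rather than branch: compute both sides and select per element, so every lane executes the same instruction stream. A small JAX sketch (the 2.0 and 0.5 scale factors are arbitrary placeholders):

```python
import jax
import jax.numpy as jnp

def predicated(x):
    # Instead of `if x_i > 0: ... else: ...` per element, jnp.where evaluates
    # both branches for every lane and selects, keeping the warp convergent.
    return jnp.where(x > 0, x * 2.0, x * 0.5)

x = jnp.array([-1.0, 0.5, 2.0, -3.0])
y = predicated(x)
# The gradient is also branch-free: a select over the two branch derivatives.
g = jax.grad(lambda x: predicated(x).sum())(x)
```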

Tensor Operations

Automatic differentiation heavily relies on tensor primitives.

Core operations include:

  • Matrix multiplication: neural layers
  • Convolution: CNNs
  • Attention: transformers
  • Reduction: gradient accumulation
  • Broadcast: tensor alignment
  • Elementwise ops: activation functions

Accelerator hardware is specifically optimized for these patterns.

Tensor Cores

Modern GPUs include tensor cores.

Tensor cores accelerate small matrix multiply-accumulate operations:

D = AB + C.

These units use mixed precision arithmetic.

Example:

  • float16 inputs accumulate in float32
  • bfloat16 inputs accumulate in float32
  • TF32 inputs accumulate in float32

Automatic differentiation systems must account for these precision differences during forward and backward execution.
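
As a concrete illustration, recent JAX versions let a program request exactly this pattern: low-precision inputs with a float32 accumulator, via the preferred_element_type argument (on older versions the same knob lives on lax.dot_general). A hedged sketch:

```python
import jax.numpy as jnp

a = jnp.ones((128, 128), dtype=jnp.bfloat16)
b = jnp.ones((128, 128), dtype=jnp.bfloat16)

# bfloat16 inputs, float32 accumulation: the pattern tensor cores implement.
# (Assumes a JAX version where jnp.matmul exposes preferred_element_type.)
c = jnp.matmul(a, b, preferred_element_type=jnp.float32)
print(c.dtype)  # float32
```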

TPU Architecture

TPUs are specialized tensor processors designed primarily for machine learning workloads.

They emphasize:

  • matrix throughput,
  • systolic array execution,
  • large tensor operations,
  • high-bandwidth interconnects.

A TPU core behaves differently from a GPU.

Instead of fine-grained thread scheduling, TPUs emphasize large fused tensor programs.

Systolic Arrays

TPUs use systolic arrays for matrix multiplication.

Data flows rhythmically through a grid of compute units.

This structure is extremely efficient for:

C = AB.

Matrix multiplications dominate transformer and neural network workloads, making systolic execution highly effective.

Accelerator-Friendly AD

Not all AD programs map efficiently onto accelerators.

Efficient accelerator execution favors:

  • Large dense tensors: high utilization
  • Regular memory access: better bandwidth
  • Fused kernels: less memory traffic
  • Static shapes: better compilation
  • Large batch operations: more parallel work

Poor patterns include:

  • Tiny kernels: launch overhead
  • Dynamic shapes: compiler difficulty
  • Sparse irregular graphs: poor locality
  • Branch-heavy code: warp divergence
  • Frequent synchronization: reduced throughput

Memory Hierarchy

Accelerator performance depends heavily on memory movement.

Typical hierarchy:

  • Registers: fastest, small capacity
  • Shared/local memory: very fast, small capacity
  • L2 cache: fast, moderate capacity
  • Global device memory: slower, large capacity
  • Host memory: much slower, very large capacity

Automatic differentiation systems must minimize expensive memory traffic.

Bandwidth vs Compute

Modern accelerators are often memory-bandwidth bound rather than compute bound.

A simple operation:

y = x + b

may require more memory movement than arithmetic work.

Backward propagation intensifies this because it repeatedly loads:

  • activations,
  • gradients,
  • parameters,
  • optimizer state.

Thus efficient AD systems prioritize:

  • fusion,
  • locality,
  • recomputation,
  • tensor reuse.
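
A rough back-of-the-envelope estimate for the y = x + b example above makes the imbalance concrete (assuming float32 data and counting only the reads and writes of x and y):

```python
# Arithmetic intensity of y = x + b for n float32 elements.
n = 10_000_000
bytes_moved = 2 * n * 4          # read x, write y (the broadcast b is negligible)
flops = n                        # one add per element
intensity = flops / bytes_moved  # ~0.125 FLOPs per byte
print(intensity)
# At roughly 0.1 FLOPs per byte, a device with terabytes-per-second of
# bandwidth but tens of teraFLOP/s of compute is starved for data:
# the operation is memory bound, not compute bound.
```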

Kernel Launch Overhead

Each GPU kernel launch has overhead.

Very small operations may spend more time launching kernels than computing.

Example sequence:

y = ReLU(Wx + b).

Naively:

  1. launch matrix multiply,
  2. launch bias addition,
  3. launch ReLU.

A fused kernel combines them into one execution unit.

This reduces:

  • launch overhead,
  • memory traffic,
  • intermediate allocations.
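
A minimal JAX sketch of the ReLU(Wx + b) example above: wrapping the layer in jit lets XLA fuse the bias add and ReLU into the matmul epilogue (exactly which operators get fused is a compiler decision, and the shapes here are placeholders):

```python
import jax
import jax.numpy as jnp

def layer(W, x, b):
    return jax.nn.relu(W @ x + b)

# Under jit, the bias add and ReLU can be applied to the matmul output
# without separate kernel launches or a materialized intermediate for W @ x.
layer_fused = jax.jit(layer)

W = jnp.ones((256, 128))
x = jnp.ones((128,))
b = jnp.zeros((256,))
y = layer_fused(W, x, b)
```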

Kernel Fusion

Fusion is central to accelerator optimization.

Suppose:

z = sin(x + y).

Without fusion:

  1. compute x + y,
  2. store intermediate,
  3. reload intermediate,
  4. compute sine.

Fused execution computes directly without materializing intermediates.

Benefits include:

  • Lower memory traffic: higher throughput
  • Fewer allocations: lower overhead
  • Better cache locality: improved utilization

Backward passes benefit similarly.
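
For instance, in a JAX-style sketch the gradient of the fused expression can itself be compiled as one fused program: the backward pass needs cos(x + y), which the compiler can derive from x and y directly rather than reloading a stored intermediate (a sketch, not a guarantee of what any particular compiler emits):

```python
import jax
import jax.numpy as jnp

def f(x, y):
    return jnp.sum(jnp.sin(x + y))

# Forward and backward are each compiled as fused programs under jit.
grad_f = jax.jit(jax.grad(f, argnums=(0, 1)))

x = jnp.linspace(0.0, 1.0, 1024)
y = jnp.linspace(1.0, 2.0, 1024)
dx, dy = grad_f(x, y)
```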

Asynchronous Execution

Accelerators typically execute asynchronously relative to the host CPU.

The CPU queues kernels:

  1. launch kernel,
  2. continue execution,
  3. synchronize later.

This overlaps:

  • compute,
  • communication,
  • memory transfer.

AD runtimes therefore schedule entire execution streams asynchronously.
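
JAX's asynchronous dispatch is one concrete instance of this pattern: the host enqueues device work and only blocks when a result is explicitly requested.

```python
import jax.numpy as jnp

x = jnp.ones((4096, 4096))

# Returns almost immediately: the kernel is queued on the device while the
# Python host thread keeps running.
y = jnp.matmul(x, x)

# ... the host can enqueue more work or prepare the next batch here ...

# Synchronize only when the value is actually needed.
y.block_until_ready()
```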

Streams and Queues

GPU systems use streams for concurrent execution.

Independent operations may execute simultaneously if resources allow.

Example:

  • Stream 1: matrix multiply
  • Stream 2: data transfer
  • Stream 3: reduction

The runtime must preserve dependency correctness while maximizing overlap.

Backward Pass Scheduling

Reverse mode naturally creates dependency chains.

The runtime schedules backward kernels according to graph topology.

Independent gradient computations may execute concurrently.

Where dependency paths merge, synchronization is required.

Efficient scheduling minimizes idle accelerator time.

Memory Explosion on Accelerators

Accelerators have limited memory compared to CPUs.

Typical GPU memory:

  • Consumer GPU: 8–24 GB
  • Data center GPU: 40–192 GB
  • TPU device: similar order of magnitude

Large models easily exceed these limits.

Activation storage during reverse mode is often the dominant memory consumer.

Activation Recomputation

To reduce memory usage, systems recompute activations instead of storing them.

This is especially valuable on accelerators because:

  • arithmetic throughput is high,
  • memory bandwidth is precious.

Recomputation trades extra FLOPs for reduced memory traffic.

Modern accelerators often favor this trade.
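
A minimal sketch using jax.checkpoint (also known as jax.remat); the block function here is a hypothetical stand-in for any expensive layer whose activations would otherwise be kept alive until the backward pass:

```python
import jax
import jax.numpy as jnp

def block(W, x):
    # Stand-in for an expensive sub-network.
    for _ in range(4):
        x = jnp.tanh(W @ x)
    return x

def loss(W, x):
    # jax.checkpoint tells reverse mode not to store block's intermediates;
    # they are recomputed during the backward pass instead.
    y = jax.checkpoint(block)(W, x)
    return jnp.sum(y ** 2)

W = jnp.eye(512) * 0.1
x = jnp.ones((512,))
g = jax.grad(loss)(W, x)
```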

Mixed Precision Execution

Accelerators are optimized for low-precision arithmetic.

Common training precisions include:

  • float16: high throughput
  • bfloat16: wider exponent range
  • TF32: tensor-core optimized
  • float32: accumulation and stability

Mixed precision improves throughput dramatically.

But it also introduces:

  • overflow risk,
  • underflow risk,
  • gradient quantization,
  • reproducibility issues.

Loss Scaling

Low-precision gradients may underflow.

Loss scaling multiplies the loss:

L' = αL.

Backward propagation computes:

∇L' = α∇L.

Gradients remain representable during propagation.

The gradients are later divided by α.

Dynamic loss scaling adjusts α automatically, growing it while gradients remain finite and shrinking it when overflow is detected.
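
A minimal static-scaling sketch in JAX (the loss function, shapes, and value of α are placeholders, and the dynamic adjustment just described is omitted):

```python
import jax
import jax.numpy as jnp

def loss_fn(params, batch):
    # Placeholder loss; assume the forward pass runs largely in low precision.
    x, y = batch
    return jnp.mean((x @ params - y) ** 2)

def scaled_grads(params, batch, alpha=2.0 ** 12):
    # Scale the loss so small gradients stay representable during propagation...
    scaled_loss = lambda p: alpha * loss_fn(p, batch)
    g = jax.grad(scaled_loss)(params)
    # ...then unscale before the optimizer update.
    return jax.tree_util.tree_map(lambda t: t / alpha, g)

params = jnp.ones((8,))
batch = (jnp.ones((32, 8)), jnp.zeros((32,)))
grads = scaled_grads(params, batch)
```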

TPU Compilation

TPUs generally favor statically compiled graphs.

The compiler performs:

  • graph fusion,
  • memory planning,
  • tensor layout optimization,
  • operation scheduling.

Dynamic graphs are harder to optimize.

This motivates staged or traced execution systems.

XLA and Graph Lowering

Systems like XLA lower high-level tensor programs into optimized accelerator kernels.

The compiler may:

  • Fuse operators: reduce memory traffic
  • Reorder computations: improve locality
  • Tile matrices: match the hardware
  • Vectorize operations: increase throughput
  • Plan memory reuse: reduce allocations

AD systems increasingly integrate tightly with compiler infrastructure.
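
With JAX, for example, the lowered and compiled programs can be inspected directly, which makes the compiler's fusion and layout decisions visible (the exact inspection methods vary somewhat across JAX versions):

```python
import jax
import jax.numpy as jnp

def f(x, y):
    return jnp.sin(x + y) * 2.0

x = jnp.ones((1024,))
y = jnp.ones((1024,))

lowered = jax.jit(f).lower(x, y)
print(lowered.as_text())            # high-level (StableHLO) program
print(lowered.compile().as_text())  # backend-optimized HLO, fusions visible
```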

Sparse Operations

Sparse operations remain challenging on accelerators.

Dense tensor hardware assumes:

  • regular memory access,
  • predictable compute patterns,
  • high arithmetic intensity.

Sparse workloads violate these assumptions.

Examples:

  • graph neural networks,
  • sparse attention,
  • routing systems,
  • mixture-of-experts.

Sparse AD often suffers from:

  • load imbalance,
  • poor cache utilization,
  • synchronization overhead.

Communication on Accelerators

Large models require multiple accelerators.

Communication becomes critical.

Operations include:

  • All-reduce: gradient aggregation
  • Broadcast: parameter distribution
  • All-gather: tensor assembly
  • Reduce-scatter: sharded optimization

Communication latency can dominate training time at scale.
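
A minimal data-parallel sketch in JAX: each device computes gradients on its own shard, and a pmean collective performs the all-reduce. Real systems layer this over NCCL or TPU interconnect collectives, and newer JAX code often expresses it with jit and sharding rather than pmap; the loss and shapes here are placeholders.

```python
from functools import partial
import jax
import jax.numpy as jnp

def loss_fn(params, x):
    return jnp.mean((x @ params) ** 2)

@partial(jax.pmap, axis_name='devices')
def grad_step(params, x):
    g = jax.grad(loss_fn)(params, x)
    # All-reduce: average gradients across devices before the parameter update.
    return jax.lax.pmean(g, axis_name='devices')

n_dev = jax.device_count()
params = jnp.ones((n_dev, 8))      # replicated parameters, one copy per device
x = jnp.ones((n_dev, 32, 8))       # one data shard per device
grads = grad_step(params, x)       # identical on every device after pmean
```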

Overlapping Communication and Compute

Modern systems overlap:

  • gradient communication,
  • backward computation,
  • optimizer updates.

Example:

While the gradients of layer k are being communicated, layer k-1 continues its backward computation.

This improves utilization.

Accelerator Numerical Semantics

Accelerators sometimes alter numerical behavior.

Examples:

  • Tensor cores: reduced-precision multiplies
  • Fused kernels: different rounding
  • Parallel reductions: non-associative sums
  • Fast math intrinsics: approximate transcendental functions
  • Flush-to-zero: aggressive underflow handling

Two mathematically identical programs may therefore produce different gradients across devices.
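
Non-associativity is easy to demonstrate: summing the same float32 values in a different order gives a different answer, which is exactly what changing a parallel reduction tree does.

```python
import jax.numpy as jnp

x = jnp.array([1.0e8, 1.0, -1.0e8], dtype=jnp.float32)

# Left-to-right: the 1.0 is absorbed into 1e8 and lost, so the total is 0.0.
print((x[0] + x[1]) + x[2])

# A different association keeps it: the total is 1.0.
print((x[0] + x[2]) + x[1])
```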

Dynamic Shapes

Dynamic tensor shapes complicate accelerator optimization.

Static shapes enable:

  • preallocation,
  • fusion,
  • scheduling,
  • layout optimization.

Dynamic workloads require:

  • recompilation,
  • padding,
  • shape polymorphism,
  • runtime dispatch.

This creates tension between flexibility and performance.
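
A small JAX illustration of the recompilation cost: a jitted function is traced and compiled once per distinct input shape, which is why padding variable-length inputs to fixed-size buckets is a common workaround.

```python
import jax
import jax.numpy as jnp

@jax.jit
def f(x):
    # The Python body runs once per new input shape, i.e. once per
    # compilation; the print makes each retrace visible.
    print("tracing for shape", x.shape)
    return jnp.sum(x * 2.0)

f(jnp.ones((128,)))   # traces and compiles for shape (128,)
f(jnp.ones((128,)))   # reuses the cached executable, no print
f(jnp.ones((256,)))   # new shape -> retrace and recompile

# Common workaround: pad variable-length inputs to a fixed bucket size
# (plus a mask) so a single compiled program is reused.
```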

Accelerator Utilization

A major systems goal is maximizing utilization.

Low utilization wastes expensive hardware.

Common causes include:

  • Small batch sizes: idle compute units
  • Host-device synchronization: stalls
  • Memory bottlenecks: compute starvation
  • Load imbalance: idle devices
  • Tiny kernels: launch overhead

AD runtimes therefore aggressively optimize execution graphs.

Automatic Differentiation as Accelerator Programming

Modern AD systems increasingly resemble accelerator compilers.

They must manage:

  • graph transformation,
  • scheduling,
  • fusion,
  • communication,
  • memory planning,
  • precision policy,
  • distributed execution.

Differentiation is no longer just symbolic calculus or local chain rules. It is deeply tied to hardware architecture.

Core Idea

GPU and TPU execution fundamentally shape modern automatic differentiation systems. Reverse-mode differentiation must adapt to massively parallel hardware, hierarchical memory systems, mixed precision arithmetic, and distributed communication fabrics. Efficient differentiable computation therefore depends as much on compiler and accelerator architecture as on calculus itself.