
GPU Tensor Kernels

Modern automatic differentiation systems are fundamentally tensor compiler systems. Their performance depends less on mathematical differentiation rules than on how efficiently tensor kernels execute on accelerators.

GPU tensor kernels are the computational substrate of modern deep learning, scientific computing, and differentiable simulation. Reverse-mode AD amplifies the importance of kernel efficiency because every forward operation usually induces at least one backward operation.

The challenge is therefore not only:

\text{compute derivatives},

but also:

\text{compute derivatives with high hardware efficiency}.

GPU Execution Model

GPUs are throughput-oriented processors.

They achieve performance through massive parallelism:

| Hardware Unit | Role |
|---|---|
| Thread | Smallest execution unit |
| Warp / wavefront | SIMD execution group |
| Block / workgroup | Cooperative thread group |
| Streaming multiprocessor | Parallel execution engine |
| Global memory | Large but slow |
| Shared memory | Fast local scratchpad |
| Registers | Fastest storage |

Tensor operations map naturally onto GPUs because many array elements can be processed independently.

Example:

Y_{ij} = X_{ij} + Z_{ij}

can assign one thread per element.
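
As a minimal illustration, an elementwise addition kernel in CUDA might look like the following sketch (names and launch configuration are illustrative):

```cuda
// One thread per element of Y = X + Z, with the tensors flattened to 1D.
__global__ void add_kernel(const float *x, const float *z, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        y[i] = x[i] + z[i];
}

// Launch with enough blocks to cover all n elements, e.g.:
// add_kernel<<<(n + 255) / 256, 256>>>(x, z, y, n);
```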

Matrix multiplication assigns thread blocks to output tiles.

Tensor Kernel

A tensor kernel is a low-level operation executed on accelerator hardware.

Examples:

| Kernel Type | Example |
|---|---|
| Elementwise | ReLU, exp, add |
| Reduction | sum, max |
| Contraction | GEMM |
| Gather/scatter | indexing |
| Convolution | CNN kernels |
| Softmax | normalized exponential |
| Attention | batched contraction |

An AD engine lowers tensor programs into sequences of kernels.

Forward and Backward Kernels

Every differentiable primitive induces backward kernels.

Example:

Forward:

Y = XW.

Backward:

\bar{X} = \bar{Y} W^T, \qquad \bar{W} = X^T \bar{Y}.

Thus one matrix multiplication creates two additional matrix multiplications during reverse mode.

Backward computation is therefore often more expensive than forward computation.
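
A sketch of how an AD engine might emit these kernels, assuming a hypothetical `gemm` helper (standing in for cuBLAS or a hand-written kernel) that computes C = op(A) op(B) for an [M, N] output:

```cuda
// Hypothetical helper (not a library API):
//   gemm(C, A, B, transA, transB, M, N, K) computes C[M x N] = op(A)[M x K] * op(B)[K x N].
void matmul_backward(const float *X, const float *W, const float *dY,
                     float *dX, float *dW, int m, int k, int n) {
    // Shapes: X is [m, k], W is [k, n], Y and dY are [m, n].
    gemm(dX, dY, W, /*transA=*/false, /*transB=*/true,  m, k, n);   // dX = dY * W^T
    gemm(dW, X,  dY, /*transA=*/true,  /*transB=*/false, k, n, m);  // dW = X^T * dY
}
```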

Approximate cost:

| Pass | Relative FLOPs |
|---|---|
| Forward | 1x |
| Backward | 2x to 3x |

Memory traffic may dominate arithmetic cost even more strongly in backward passes.

Arithmetic Intensity

GPU efficiency depends heavily on arithmetic intensity:

\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Memory Traffic}}.

High arithmetic intensity means many computations per byte loaded.

Examples:

| Operation | Intensity |
|---|---|
| Matrix multiplication | High |
| Elementwise add | Low |
| Reduction | Moderate |
| Scatter-add | Very low |

Matrix multiplication maps extremely well to GPUs because data reuse is high.

Elementwise kernels are often memory-bandwidth bound.
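
A rough worked example in FP32: an elementwise add performs one FLOP per roughly 12 bytes moved (two loads, one store), while an n \times n GEMM performs 2n^3 FLOPs against about 3 \cdot 4n^2 bytes of ideal traffic (read A and B, write C):

\text{add: } \frac{1 \ \text{FLOP}}{12 \ \text{bytes}} \approx 0.08 \ \text{FLOP/byte}, \qquad \text{GEMM: } \frac{2n^3 \ \text{FLOPs}}{12n^2 \ \text{bytes}} = \frac{n}{6} \ \text{FLOP/byte}.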

Matrix Multiplication Kernels

General matrix multiplication (GEMM) is the dominant dense tensor primitive.

C = AB.

Efficient GPU GEMM uses:

tiling
shared memory reuse
register blocking
warp-level matrix instructions
tensor cores

A tile-based kernel computes submatrices:

C_{tile} = A_{tile} B_{tile}.

Threads cooperatively load blocks into shared memory and reuse them repeatedly.
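
A minimal sketch of such a tile-based kernel, for square row-major matrices with n assumed to be a multiple of the tile size:

```cuda
#define TILE 16

// C = A * B, one TILE x TILE output tile per thread block.
__global__ void gemm_tiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Cooperatively load one tile of A and one tile of B into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Each thread reuses the shared tiles TILE times before the next load.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

// Launch: dim3 block(TILE, TILE); dim3 grid(n / TILE, n / TILE);
// gemm_tiled<<<grid, block>>>(A, B, C, n);
```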

Backward rules reuse GEMM:

| Gradient | Expression |
|---|---|
| Input gradient | \bar{X} = \bar{Y} W^T |
| Weight gradient | \bar{W} = X^T \bar{Y} |

Thus highly optimized GEMM kernels are critical for AD performance.

Tensor Cores

Modern GPUs include specialized matrix-multiply hardware:

| Vendor | Unit |
|---|---|
| NVIDIA | Tensor cores |
| AMD | Matrix cores |
| TPU | Systolic arrays |

These units accelerate:

D = AB + C.

They often use reduced precision:

| Precision | Common Use |
|---|---|
| FP16 | Training |
| BF16 | Training |
| TF32 | Mixed precision |
| FP8 | Emerging large-scale training |

AD systems must preserve numerical stability under reduced precision arithmetic.
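
For illustration, a single warp can compute one 16x16x16 tile on NVIDIA tensor cores through the CUDA WMMA API (FP16 inputs, FP32 accumulation, Volta-class hardware or newer). This is a minimal sketch of one tile, not a full GEMM:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 tile of D = A*B (accumulator initialized to zero).
__global__ void wmma_tile(const half *A, const half *B, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);               // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // tensor-core MMA
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
```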

Kernel Fusion

Kernel launches carry nontrivial fixed overhead.

Instead of:

tmp1 = x + y
tmp2 = relu(tmp1)
tmp3 = tmp2 * z

a compiler may fuse into:

out = relu(x + y) * z

This avoids:

| Cost | Reason |
|---|---|
| Intermediate memory writes | Temporary tensors |
| Extra global memory reads | Reloading intermediates |
| Kernel launch overhead | Multiple launches |
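
A sketch of the fused kernel for `out = relu(x + y) * z`, with the intermediates held in registers rather than written to global memory:

```cuda
__global__ void fused_add_relu_mul(const float *x, const float *y,
                                   const float *z, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = x[i] + y[i];   // tmp1, kept in a register
        t = fmaxf(t, 0.0f);      // relu(tmp1)
        out[i] = t * z[i];       // one global write instead of three
    }
}
```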

Fusion is especially important for backward passes, which otherwise produce many tiny kernels.

Backward Fusion

Reverse-mode AD naturally creates chains of small operations.

Example:

y = \sin(x^2 + z).

Backward:

\bar{x} = \bar{y} \cos(x^2+z) \cdot 2x.

Naively, this creates several intermediate tensors.

Fusion combines them into one kernel.

Without fusion:

tmp1 = x*x
tmp2 = tmp1 + z
tmp3 = cos(tmp2)
tmp4 = tmp3 * 2*x
dx   = dy * tmp4

With fusion:

dx = dy * cos(x*x + z) * 2*x

Fusion reduces memory bandwidth pressure dramatically.

Reduction Kernels

Reductions are harder than elementwise kernels because they require communication.

Example:

y = \sum_i x_i.

Parallel reduction proceeds hierarchically:

  1. Local thread accumulation.
  2. Warp reduction.
  3. Block reduction.
  4. Global reduction.
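
A sketch of this hierarchy in CUDA, using a grid-stride loop, warp shuffles, shared memory for per-warp partials, and a final atomic accumulation (block size assumed to be a multiple of 32; `out` must be zeroed before launch):

```cuda
__global__ void reduce_sum(const float *x, float *out, int n) {
    float val = 0.0f;
    // 1. Local thread accumulation (grid-stride loop).
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        val += x[i];

    // 2. Warp reduction via shuffles.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);

    // 3. Block reduction: one partial sum per warp, combined by warp 0.
    __shared__ float warp_sums[32];
    int lane = threadIdx.x % 32, warp = threadIdx.x / 32;
    if (lane == 0) warp_sums[warp] = val;
    __syncthreads();
    if (warp == 0) {
        val = (lane < (blockDim.x + 31) / 32) ? warp_sums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffff, val, offset);
        // 4. Global reduction: atomically combine block results.
        if (lane == 0) atomicAdd(out, val);
    }
}
```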

Backward rule:

\bar{x}_i = \bar{y}.

Forward reduction is communication-heavy. Backward broadcast is embarrassingly parallel.

Reductions are often synchronization-bound.

Softmax Kernels

Softmax combines reduction and elementwise operations:

y_i = \frac{e^{x_i}}{\sum_j e^{x_j}}.

Efficient kernels fuse:

max reduction
stabilized subtraction
exponentiation
sum reduction
division
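
A sketch of such a fused row-wise softmax: one thread block per row, shared-memory reductions for the max and the sum, and (for brevity) the assumptions that the row length is at most the block size and the block size is a power of two:

```cuda
// Launch: softmax_rows<<<rows, block, block * sizeof(float)>>>(x, y, cols);
__global__ void softmax_rows(const float *x, float *y, int cols) {
    extern __shared__ float buf[];
    const float *row_in = x + blockIdx.x * cols;
    float *row_out = y + blockIdx.x * cols;
    int tid = threadIdx.x;

    // 1. Max reduction (numerical stabilization).
    float v = (tid < cols) ? row_in[tid] : -INFINITY;
    buf[tid] = v;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] = fmaxf(buf[tid], buf[tid + s]);
        __syncthreads();
    }
    float row_max = buf[0];
    __syncthreads();

    // 2. Stabilized exponentiation and sum reduction.
    float e = (tid < cols) ? expf(v - row_max) : 0.0f;
    buf[tid] = e;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    float row_sum = buf[0];

    // 3. Division.
    if (tid < cols) row_out[tid] = e / row_sum;
}
```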

Backward softmax also fuses reductions:

\bar{x} = y \odot (\bar{y} - \langle \bar{y}, y \rangle).

Attention kernels heavily depend on efficient softmax implementations.

Convolution Kernels

Convolutions are structured tensor contractions.

Forward:

Y = X * K.

Backward produces:

| Gradient | Operation |
|---|---|
| Input gradient | Convolution with flipped kernel |
| Kernel gradient | Correlation |
| Bias gradient | Reduction |

GPU implementations use:

| Method | Use |
|---|---|
| Direct convolution | Small kernels |
| im2col + GEMM | General dense convolution |
| FFT convolution | Large kernels |
| Winograd | Small dense kernels |

Backward convolutions often have different optimal algorithms from forward convolutions.
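
As an example of the direct method, a 1D "valid" convolution kernel (cross-correlation in the deep-learning convention) with one thread per output element:

```cuda
// y[i] = sum_k x[i + k] * w[k], for i in [0, n_in - k_width].
__global__ void conv1d_direct(const float *x, const float *w, float *y,
                              int n_in, int k_width) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n_out = n_in - k_width + 1;
    if (i >= n_out) return;
    float acc = 0.0f;
    for (int k = 0; k < k_width; ++k)
        acc += x[i + k] * w[k];
    y[i] = acc;
}
```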

Attention Kernels

Transformer attention dominates modern large models.

Basic attention:

S = \frac{QK^T}{\sqrt{d}}, \qquad P = \operatorname{softmax}(S), \qquad Y = PV.

Backward pass involves multiple matrix multiplications and softmax gradients.
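
Written out, with the softmax gradient applied row by row as in the softmax section:

\bar{V} = P^T \bar{Y}, \qquad \bar{P} = \bar{Y} V^T, \qquad \bar{S}_i = P_i \odot (\bar{P}_i - \langle \bar{P}_i, P_i \rangle),

\bar{Q} = \frac{\bar{S} K}{\sqrt{d}}, \qquad \bar{K} = \frac{\bar{S}^T Q}{\sqrt{d}}.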

Attention is memory-intensive because:

S \in \mathbb{R}^{n \times n}.

Long sequences create quadratic memory growth.

Flash Attention

Flash Attention avoids materializing the full attention matrix.

Key idea:

stream blocks through shared memory
recompute local statistics
avoid large intermediate tensors

This reduces:

| Resource | Reduction |
|---|---|
| Global memory traffic | Large |
| HBM usage | Large |
| Intermediate allocation | Eliminated |

Backward pass recomputes local softmax statistics instead of storing them.

This is an example of recomputation trading FLOPs for memory.

Checkpointing and Recomputation

Backward passes require forward intermediates.

Storing all activations is expensive.

Checkpointing stores only selected tensors:

store sparse checkpoints
recompute missing intermediates during backward

Tradeoff:

| Strategy | Memory | Compute |
|---|---|---|
| Full storage | High | Low |
| Full recomputation | Low | High |
| Checkpointing | Medium | Medium |

Large-scale AD systems rely heavily on recomputation.

Memory Bandwidth

Many tensor kernels are memory-bound rather than compute-bound.

Example:

y_i = x_i + z_i.

This performs:

  • 1 addition per element
  • 2 loads and 1 store per element (roughly 12 bytes of traffic in FP32)

Arithmetic cost is tiny compared to memory movement.

GPU optimization therefore focuses on:

coalesced access
cache reuse
shared memory reuse
kernel fusion
layout optimization

Bandwidth dominates many backward passes.

Tensor Layout

Tensor layout affects kernel efficiency.

Common layouts:

| Layout | Description |
|---|---|
| Row-major | C layout |
| Column-major | Fortran layout |
| NHWC | Channels last |
| NCHW | Channels first |

Different kernels prefer different layouts.

Layout transforms themselves cost memory bandwidth.

AD systems often propagate layout constraints through graphs to avoid unnecessary transposes.

Strides and Views

Many tensor operations create views rather than copies.

Example:

y = x.transpose()

No data movement may occur.

The kernel instead uses modified strides.
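
A sketch of a kernel that consumes such a view through explicit strides: for a transpose view of an m x n row-major tensor the caller passes swapped strides (row_stride = 1, col_stride = n), and no copy is made.

```cuda
// Scales a 2D tensor read through arbitrary strides and writes a dense result.
__global__ void scale_strided(const float *x, float *y,
                              int rows, int cols,
                              int row_stride, int col_stride, float alpha) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows && c < cols)
        y[r * cols + c] = alpha * x[r * row_stride + c * col_stride];
}
```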

Backward kernels must preserve aliasing semantics.

View-based AD is significantly harder to implement correctly than AD over purely functional tensor programs.

Scatter and Gather

Sparse and indexing operations use:

| Operation | Description |
|---|---|
| Gather | Read scattered locations |
| Scatter | Write scattered locations |
| Scatter-add | Accumulate scattered writes |

Backward rules swap gather and scatter-add.

Example:

Forward:

y_i = x_{idx_i}.

Backward:

\bar{x}_{idx_i} \mathrel{+}= \bar{y}_i.

Scatter-add is difficult on GPUs because many threads may update the same location.

This requires:

atomic operations
segmented reductions
parallel conflict resolution
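
A sketch of the gather/scatter-add pair using atomics:

```cuda
// Forward gather: y[i] = x[idx[i]].
__global__ void gather(const float *x, const int *idx, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[idx[i]];
}

// Backward: scatter-add the incoming gradient dy into dx.
// Multiple threads may target the same dx element, so atomicAdd resolves conflicts.
__global__ void gather_backward(const float *dy, const int *idx, float *dx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&dx[idx[i]], dy[i]);
}
```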

Parallel Reduction Nondeterminism

Floating-point addition is not associative.

Parallel reductions may produce:

(a+b)+c \ne a+(b+c).
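
A concrete FP32 example: with a = 2^{24}, b = 1, c = -2^{24}, the sum (a+b)+c evaluates to 0 because a+b rounds back to 2^{24}, while a+(b+c) evaluates to 1.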

Different thread scheduling gives slightly different results.

Thus GPU gradients are often nondeterministic from run to run, especially at low precision.

Tradeoff:

| Goal | Cost |
|---|---|
| Deterministic reductions | Lower throughput |
| Maximum throughput | Nondeterministic accumulation |

Large-scale training usually prioritizes throughput.

Mixed Precision Training

Modern AD systems use mixed precision extensively.

Forward:

| Tensor Type | Precision |
|---|---|
| Activations | FP16/BF16 |
| Weights | FP16/BF16 |
| Accumulators | FP32 |

Backward:

| Quantity | Precision |
|---|---|
| Gradients | Often FP16/BF16 |
| Master weights | FP32 |

Loss scaling prevents gradient underflow:

L' = sL.

Backward computes scaled gradients, then rescales:

\nabla L = \frac{1}{s} \nabla L'.
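
A sketch of this rescaling step as a kernel that converts scaled FP16 gradients into FP32 master-precision gradients (names are illustrative, not a framework API):

```cuda
#include <cuda_fp16.h>

__global__ void unscale_grads(const __half *scaled_grad, float *master_grad,
                              float inv_scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        master_grad[i] = __half2float(scaled_grad[i]) * inv_scale;  // (1/s) * scaled gradient
}
```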

Mixed precision is essential for modern large-scale training.

Compiler Systems

Modern AD systems increasingly compile tensor graphs.

Examples:

| System | Compiler |
|---|---|
| JAX | XLA |
| PyTorch | TorchInductor |
| TensorFlow | XLA |
| TVM | Tensor compiler |
| Triton | GPU DSL |

These systems perform:

fusion
layout optimization
kernel scheduling
memory planning
vectorization
autotuning

AD is now tightly integrated with compiler optimization.

Autotuning

Kernel performance depends on:

tile sizes
warp shapes
block dimensions
shared memory usage
register pressure

Compilers often benchmark multiple variants dynamically.

Autotuning is especially important for:

| Primitive | Sensitivity |
|---|---|
| GEMM | Very high |
| Convolution | Very high |
| Attention | High |
| Reductions | Moderate |

GPU-Aware AD Design

A GPU-efficient AD system should minimize:

| Problem | Cause |
|---|---|
| Tiny kernels | Poor fusion |
| Excessive intermediates | Materialization |
| Global memory traffic | Unfused graphs |
| Layout conversions | Mismatched kernels |
| Synchronization | Reductions/scatter |
| Recomputation imbalance | Poor checkpointing |

Good AD design is therefore also hardware design.

Practical Principles

Prefer:

| Good Practice | Avoid |
|---|---|
| Fused kernels | Long chains of tiny ops |
| Tensor contractions | Scalar loops |
| Structured reductions | Random scatter |
| Recomputation checkpointing | Storing every activation |
| Mixed precision | Full FP64 everywhere |
| Layout-aware graphs | Constant transposes |

For compiler designers:

| Focus | Importance |
|---|---|
| Memory bandwidth | Critical |
| Fusion | Critical |
| Tensor contraction scheduling | Critical |
| Reduction efficiency | Major |
| Autotuning | Major |
| Sparse kernel optimization | Increasingly important |

Summary

GPU tensor kernels are the execution layer beneath modern automatic differentiation systems. The mathematical derivative rules themselves are simple compared to the systems-level complexity required to execute them efficiently.

Reverse-mode AD amplifies memory traffic, synchronization cost, and intermediate tensor allocation. Efficient implementations therefore rely on fusion, tiling, mixed precision, checkpointing, layout optimization, and compiler-guided kernel generation.

Modern AD systems are best understood not merely as differentiation engines, but as tensor compiler runtimes specialized for accelerator hardware.