
GPU Tensor Kernels

Modern automatic differentiation systems are fundamentally tensor compiler systems. Their performance depends less on mathematical differentiation rules than on how efficiently tensor kernels execute on accelerators.

GPU tensor kernels are the computational substrate of modern deep learning, scientific computing, and differentiable simulation. Reverse-mode AD amplifies the importance of kernel efficiency because every forward operation usually induces at least one backward operation.

The challenge is therefore not only:

\text{compute derivatives},

but also:

\text{compute derivatives with high hardware efficiency}.

GPU Execution Model

GPUs are throughput-oriented processors.

They achieve performance through massive parallelism:

| Hardware Unit | Role |
|---|---|
| Thread | Smallest execution unit |
| Warp / wavefront | SIMD execution group |
| Block / workgroup | Cooperative thread group |
| Streaming multiprocessor | Parallel execution engine |
| Global memory | Large but slow |
| Shared memory | Fast local scratchpad |
| Registers | Fastest storage |

Tensor operations map naturally onto GPUs because many array elements can be processed independently.

Example:

Y_{ij} = X_{ij} + Z_{ij}

can assign one thread per element.
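
As a minimal illustration, an elementwise addition kernel in CUDA might look like the following sketch (names and launch configuration are illustrative):

```cuda
// One thread per element of Y = X + Z, with the tensors flattened to 1D.
__global__ void add_kernel(const float *x, const float *z, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        y[i] = x[i] + z[i];
}

// Launch with enough blocks to cover all n elements, e.g.:
// add_kernel<<<(n + 255) / 256, 256>>>(x, z, y, n);
```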

Matrix multiplication assigns thread blocks to output tiles.

Tensor Kernel

A tensor kernel is a low-level operation executed on accelerator hardware.

Examples:

| Kernel Type | Example |
|---|---|
| Elementwise | ReLU, exp, add |
| Reduction | sum, max |
| Contraction | GEMM |
| Gather/scatter | indexing |
| Convolution | CNN kernels |
| Softmax | normalized exponential |
| Attention | batched contraction |

An AD engine lowers tensor programs into sequences of kernels.

Forward and Backward Kernels

Every differentiable primitive induces backward kernels.

Example:

Forward:

Y = XW.

Backward:

\bar{X} = \bar{Y} W^T, \qquad \bar{W} = X^T \bar{Y}.

Thus one matrix multiplication creates two additional matrix multiplications during reverse mode.

Backward computation is therefore often more expensive than forward computation.
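
A sketch of how an AD engine might emit these kernels, assuming a hypothetical `gemm` helper (standing in for cuBLAS or a hand-written kernel) that computes C = op(A) op(B) for an [M, N] output:

```cuda
// Hypothetical helper (not a library API):
//   gemm(C, A, B, transA, transB, M, N, K) computes C[M x N] = op(A)[M x K] * op(B)[K x N].
void matmul_backward(const float *X, const float *W, const float *dY,
                     float *dX, float *dW, int m, int k, int n) {
    // Shapes: X is [m, k], W is [k, n], Y and dY are [m, n].
    gemm(dX, dY, W, /*transA=*/false, /*transB=*/true,  m, k, n);   // dX = dY * W^T
    gemm(dW, X,  dY, /*transA=*/true,  /*transB=*/false, k, n, m);  // dW = X^T * dY
}
```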

Approximate cost:

| Pass | Relative FLOPs |
|---|---|
| Forward | 1x |
| Backward | 2x to 3x |

Memory traffic may dominate arithmetic cost even more strongly in backward passes.

Arithmetic Intensity

GPU efficiency depends heavily on arithmetic intensity:

\text{Arithmetic Intensity} = \frac{\text{FLOPs}}{\text{Memory Traffic}}.

High arithmetic intensity means many computations per byte loaded.

Examples:

| Operation | Intensity |
|---|---|
| Matrix multiplication | High |
| Elementwise add | Low |
| Reduction | Moderate |
| Scatter-add | Very low |

Matrix multiplication maps extremely well to GPUs because data reuse is high.

Elementwise kernels are often memory-bandwidth bound.
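
A rough worked example in FP32: an elementwise add performs one FLOP per roughly 12 bytes moved (two loads, one store), while an n \times n GEMM performs 2n^3 FLOPs against about 3 \cdot 4n^2 bytes of ideal traffic (read A and B, write C):

\text{add: } \frac{1 \ \text{FLOP}}{12 \ \text{bytes}} \approx 0.08 \ \text{FLOP/byte}, \qquad \text{GEMM: } \frac{2n^3 \ \text{FLOPs}}{12n^2 \ \text{bytes}} = \frac{n}{6} \ \text{FLOP/byte}.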

Matrix Multiplication Kernels

General matrix multiplication (GEMM) is the dominant dense tensor primitive.

C = AB.

Efficient GPU GEMM uses:

tiling
shared memory reuse
register blocking
warp-level matrix instructions
tensor cores

A tile-based kernel computes submatrices:

C_{tile} = A_{tile} B_{tile}.

Threads cooperatively load blocks into shared memory and reuse them repeatedly.
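
A minimal sketch of such a tile-based kernel, for square row-major matrices with n assumed to be a multiple of the tile size:

```cuda
#define TILE 16

// C = A * B, one TILE x TILE output tile per thread block.
__global__ void gemm_tiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Cooperatively load one tile of A and one tile of B into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Each thread reuses the shared tiles TILE times before the next load.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = acc;
}

// Launch: dim3 block(TILE, TILE); dim3 grid(n / TILE, n / TILE);
// gemm_tiled<<<grid, block>>>(A, B, C, n);
```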

Backward rules reuse GEMM:

| Gradient | Expression |
|---|---|
| Input gradient | \bar{X} = \bar{Y} W^T |
| Weight gradient | \bar{W} = X^T \bar{Y} |

Thus highly optimized GEMM kernels are critical for AD performance.

Tensor Cores

Modern GPUs include specialized matrix-multiply hardware:

| Vendor | Unit |
|---|---|
| NVIDIA | Tensor cores |
| AMD | Matrix cores |
| TPU | Systolic arrays |

These units accelerate:

D = AB + C.

They often use reduced precision:

| Precision | Common Use |
|---|---|
| FP16 | Training |
| BF16 | Training |
| TF32 | Mixed precision |
| FP8 | Emerging large-scale training |

AD systems must preserve numerical stability under reduced precision arithmetic.
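
For illustration, a single warp can compute one 16x16x16 tile on NVIDIA tensor cores through the CUDA WMMA API (FP16 inputs, FP32 accumulation, Volta-class hardware or newer). This is a minimal sketch of one tile, not a full GEMM:

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp computes a 16x16 tile of D = A*B (accumulator initialized to zero).
__global__ void wmma_tile(const half *A, const half *B, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);               // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // tensor-core MMA
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
```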

Kernel Fusion

Kernel launches carry nontrivial fixed overhead.

Instead of:

tmp1 = x + y
tmp2 = relu(tmp1)
tmp3 = tmp2 * z

a compiler may fuse into:

out = relu(x + y) * z

This avoids:

| Cost | Reason |
|---|---|
| Intermediate memory writes | Temporary tensors |
| Extra global memory reads | Reloading intermediates |
| Kernel launch overhead | Multiple launches |
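
A sketch of the fused kernel for `out = relu(x + y) * z`, with the intermediates held in registers rather than written to global memory:

```cuda
__global__ void fused_add_relu_mul(const float *x, const float *y,
                                   const float *z, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = x[i] + y[i];   // tmp1, kept in a register
        t = fmaxf(t, 0.0f);      // relu(tmp1)
        out[i] = t * z[i];       // one global write instead of three
    }
}
```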

Fusion is especially important for backward passes, which otherwise produce many tiny kernels.

Backward Fusion

Reverse-mode AD naturally creates chains of small operations.

Example:

y = \sin(x^2 + z).

Backward:

\bar{x} = \bar{y} \cos(x^2+z) \cdot 2x.

Naively, this creates several intermediate tensors.

Fusion combines them into one kernel.

Without fusion:

tmp1 = x*x
tmp2 = tmp1 + z
tmp3 = cos(tmp2)
tmp4 = tmp3 * 2*x
dx   = dy * tmp4

With fusion:

dx = dy * cos(x*x + z) * 2*x

Fusion reduces memory bandwidth pressure dramatically.

Reduction Kernels

Reductions are harder than elementwise kernels because they require communication.

Example:

y = \sum_i x_i.

Parallel reduction proceeds hierarchically:

  1. Local thread accumulation.
  2. Warp reduction.
  3. Block reduction.
  4. Global reduction.
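
A sketch of this hierarchy in CUDA, using a grid-stride loop, warp shuffles, shared memory for per-warp partials, and a final atomic accumulation (block size assumed to be a multiple of 32; `out` must be zeroed before launch):

```cuda
__global__ void reduce_sum(const float *x, float *out, int n) {
    float val = 0.0f;
    // 1. Local thread accumulation (grid-stride loop).
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        val += x[i];

    // 2. Warp reduction via shuffles.
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);

    // 3. Block reduction: one partial sum per warp, combined by warp 0.
    __shared__ float warp_sums[32];
    int lane = threadIdx.x % 32, warp = threadIdx.x / 32;
    if (lane == 0) warp_sums[warp] = val;
    __syncthreads();
    if (warp == 0) {
        val = (lane < (blockDim.x + 31) / 32) ? warp_sums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            val += __shfl_down_sync(0xffffffff, val, offset);
        // 4. Global reduction: atomically combine block results.
        if (lane == 0) atomicAdd(out, val);
    }
}
```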

Backward rule:

\bar{x}_i = \bar{y}.

Forward reduction is communication-heavy. Backward broadcast is embarrassingly parallel.

Reductions are often synchronization-bound.

Softmax Kernels

Softmax combines reduction and elementwise operations:

y_i = \frac{e^{x_i}}{\sum_j e^{x_j}}.

Efficient kernels fuse:

max reduction
stabilized subtraction
exponentiation
sum reduction
division
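
A sketch of such a fused row-wise softmax: one thread block per row, shared-memory reductions for the max and the sum, and (for brevity) the assumptions that the row length is at most the block size and the block size is a power of two:

```cuda
// Launch: softmax_rows<<<rows, block, block * sizeof(float)>>>(x, y, cols);
__global__ void softmax_rows(const float *x, float *y, int cols) {
    extern __shared__ float buf[];
    const float *row_in = x + blockIdx.x * cols;
    float *row_out = y + blockIdx.x * cols;
    int tid = threadIdx.x;

    // 1. Max reduction (numerical stabilization).
    float v = (tid < cols) ? row_in[tid] : -INFINITY;
    buf[tid] = v;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] = fmaxf(buf[tid], buf[tid + s]);
        __syncthreads();
    }
    float row_max = buf[0];
    __syncthreads();

    // 2. Stabilized exponentiation and sum reduction.
    float e = (tid < cols) ? expf(v - row_max) : 0.0f;
    buf[tid] = e;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    float row_sum = buf[0];

    // 3. Division.
    if (tid < cols) row_out[tid] = e / row_sum;
}
```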

Backward softmax also fuses reductions:

\bar{x} = y \odot (\bar{y} - \langle \bar{y}, y \rangle).

Attention kernels heavily depend on efficient softmax implementations.

Convolution Kernels

Convolutions are structured tensor contractions.

Forward:

Y = X * K.

Backward produces:

| Gradient | Operation |
|---|---|
| Input gradient | Convolution with flipped kernel |
| Kernel gradient | Correlation |
| Bias gradient | Reduction |

GPU implementations use:

| Method | Use |
|---|---|
| Direct convolution | Small kernels |
| im2col + GEMM | General dense convolution |
| FFT convolution | Large kernels |
| Winograd | Small dense kernels |

Backward convolutions often have different optimal algorithms from forward convolutions.
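
As an example of the direct method, a 1D "valid" convolution kernel (cross-correlation in the deep-learning convention) with one thread per output element:

```cuda
// y[i] = sum_k x[i + k] * w[k], for i in [0, n_in - k_width].
__global__ void conv1d_direct(const float *x, const float *w, float *y,
                              int n_in, int k_width) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int n_out = n_in - k_width + 1;
    if (i >= n_out) return;
    float acc = 0.0f;
    for (int k = 0; k < k_width; ++k)
        acc += x[i + k] * w[k];
    y[i] = acc;
}
```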

Attention Kernels

Transformer attention dominates modern large models.

Basic attention:

S = \frac{QK^T}{\sqrt{d}}, \qquad P = \operatorname{softmax}(S), \qquad Y = PV.

Backward pass involves multiple matrix multiplications and softmax gradients.
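
Written out, with the softmax gradient applied row by row as in the softmax section:

\bar{V} = P^T \bar{Y}, \qquad \bar{P} = \bar{Y} V^T, \qquad \bar{S}_i = P_i \odot (\bar{P}_i - \langle \bar{P}_i, P_i \rangle),

\bar{Q} = \frac{\bar{S} K}{\sqrt{d}}, \qquad \bar{K} = \frac{\bar{S}^T Q}{\sqrt{d}}.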

Attention is memory-intensive because:

S \in \mathbb{R}^{n \times n}.

Long sequences create quadratic memory growth.

Flash Attention

Flash Attention avoids materializing the full attention matrix.

Key idea:

stream blocks through shared memory
recompute local statistics
avoid large intermediate tensors

This reduces:

| Resource | Reduction |
|---|---|
| Global memory traffic | Large |
| HBM usage | Large |
| Intermediate allocation | Eliminated |

Backward pass recomputes local softmax statistics instead of storing them.

This is an example of recomputation trading FLOPs for memory.

Checkpointing and Recomputation

Backward passes require forward intermediates.

Storing all activations is expensive.

Checkpointing stores only selected tensors:

store sparse checkpoints
recompute missing intermediates during backward

Tradeoff:

| Strategy | Memory | Compute |
|---|---|---|
| Full storage | High | Low |
| Full recomputation | Low | High |
| Checkpointing | Medium | Medium |

Large-scale AD systems rely heavily on recomputation.

Memory Bandwidth

Many tensor kernels are memory-bound rather than compute-bound.

Example:

y_i = x_i + z_i.

This performs:

  • 1 addition per element
  • 2 loads and 1 store per element (roughly 12 bytes of traffic in FP32)

Arithmetic cost is tiny compared to memory movement.

GPU optimization therefore focuses on:

coalesced access
cache reuse
shared memory reuse
kernel fusion
layout optimization

Bandwidth dominates many backward passes.

Tensor Layout

Tensor layout affects kernel efficiency.

Common layouts:

| Layout | Description |
|---|---|
| Row-major | C layout |
| Column-major | Fortran layout |
| NHWC | Channels last |
| NCHW | Channels first |

Different kernels prefer different layouts.

Layout transforms themselves cost memory bandwidth.

AD systems often propagate layout constraints through graphs to avoid unnecessary transposes.

Strides and Views

Many tensor operations create views rather than copies.

Example:

y = x.transpose()

No data movement may occur.

The kernel instead uses modified strides.
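
A sketch of a kernel that consumes such a view through explicit strides: for a transpose view of an m x n row-major tensor the caller passes swapped strides (row_stride = 1, col_stride = n), and no copy is made.

```cuda
// Scales a 2D tensor read through arbitrary strides and writes a dense result.
__global__ void scale_strided(const float *x, float *y,
                              int rows, int cols,
                              int row_stride, int col_stride, float alpha) {
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < rows && c < cols)
        y[r * cols + c] = alpha * x[r * row_stride + c * col_stride];
}
```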

Backward kernels must preserve aliasing semantics.

View-based AD is significantly harder to implement correctly than AD over purely functional tensor programs.

Scatter and Gather

Sparse and indexing operations use:

| Operation | Description |
|---|---|
| Gather | Read scattered locations |
| Scatter | Write scattered locations |
| Scatter-add | Accumulate scattered writes |

Backward rules swap gather and scatter-add.

Example:

Forward:

y_i = x_{idx_i}.

Backward:

\bar{x}_{idx_i} \mathrel{+}= \bar{y}_i.

Scatter-add is difficult on GPUs because many threads may update the same location.

This requires:

atomic operations
segmented reductions
parallel conflict resolution
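
A sketch of the gather/scatter-add pair using atomics:

```cuda
// Forward gather: y[i] = x[idx[i]].
__global__ void gather(const float *x, const int *idx, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[idx[i]];
}

// Backward: scatter-add the incoming gradient dy into dx.
// Multiple threads may target the same dx element, so atomicAdd resolves conflicts.
__global__ void gather_backward(const float *dy, const int *idx, float *dx, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&dx[idx[i]], dy[i]);
}
```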

Parallel Reduction Nondeterminism

Floating-point addition is not associative.

Parallel reductions may produce:

(a+b)+c \ne a+(b+c).
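
A concrete FP32 example: with a = 2^{24}, b = 1, c = -2^{24}, the sum (a+b)+c evaluates to 0 because a+b rounds back to 2^{24}, while a+(b+c) evaluates to 1.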

Different thread scheduling gives slightly different results.

Thus GPU gradients are often nondeterministic from run to run, especially at low precision.

Tradeoff:

| Goal | Cost |
|---|---|
| Deterministic reductions | Lower throughput |
| Maximum throughput | Nondeterministic accumulation |

Large-scale training usually prioritizes throughput.

Mixed Precision Training

Modern AD systems use mixed precision extensively.

Forward:

| Tensor Type | Precision |
|---|---|
| Activations | FP16/BF16 |
| Weights | FP16/BF16 |
| Accumulators | FP32 |

Backward:

| Quantity | Precision |
|---|---|
| Gradients | Often FP16/BF16 |
| Master weights | FP32 |

Loss scaling prevents gradient underflow:

L' = sL.

Backward computes scaled gradients, then rescales:

\nabla L = \frac{1}{s} \nabla L'.
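
A sketch of this rescaling step as a kernel that converts scaled FP16 gradients into FP32 master-precision gradients (names are illustrative, not a framework API):

```cuda
#include <cuda_fp16.h>

__global__ void unscale_grads(const __half *scaled_grad, float *master_grad,
                              float inv_scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        master_grad[i] = __half2float(scaled_grad[i]) * inv_scale;  // (1/s) * scaled gradient
}
```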

Mixed precision is essential for modern large-scale training.

Compiler Systems

Modern AD systems increasingly compile tensor graphs.

Examples:

| System | Compiler |
|---|---|
| JAX | XLA |
| PyTorch | TorchInductor |
| TensorFlow | XLA |
| TVM | Tensor compiler |
| Triton | GPU DSL |

These systems perform:

fusion
layout optimization
kernel scheduling
memory planning
vectorization
autotuning

AD is now tightly integrated with compiler optimization.

Autotuning

Kernel performance depends on:

tile sizes
warp shapes
block dimensions
shared memory usage
register pressure

Compilers often benchmark multiple variants dynamically.

Autotuning is especially important for:

| Primitive | Sensitivity |
|---|---|
| GEMM | Very high |
| Convolution | Very high |
| Attention | High |
| Reductions | Moderate |

GPU-Aware AD Design

A GPU-efficient AD system should minimize:

| Problem | Cause |
|---|---|
| Tiny kernels | Poor fusion |
| Excessive intermediates | Materialization |
| Global memory traffic | Unfused graphs |
| Layout conversions | Mismatched kernels |
| Synchronization | Reductions/scatter |
| Recomputation imbalance | Poor checkpointing |

Good AD design is therefore also hardware design.

Practical Principles

Prefer:

| Good Practice | Avoid |
|---|---|
| Fused kernels | Long chains of tiny ops |
| Tensor contractions | Scalar loops |
| Structured reductions | Random scatter |
| Recomputation checkpointing | Storing every activation |
| Mixed precision | Full FP64 everywhere |
| Layout-aware graphs | Constant transposes |

For compiler designers:

| Focus | Importance |
|---|---|
| Memory bandwidth | Critical |
| Fusion | Critical |
| Tensor contraction scheduling | Critical |
| Reduction efficiency | Major |
| Autotuning | Major |
| Sparse kernel optimization | Increasingly important |

Summary

GPU tensor kernels are the execution layer beneath modern automatic differentiation systems. The mathematical derivative rules themselves are simple compared to the systems-level complexity required to execute them efficiently.

Reverse-mode AD amplifies memory traffic, synchronization cost, and intermediate tensor allocation. Efficient implementations therefore rely on fusion, tiling, mixed precision, checkpointing, layout optimization, and compiler-guided kernel generation.

Modern AD systems are best understood not merely as differentiation engines, but as tensor compiler runtimes specialized for accelerator hardware.