# GPU Tensor Kernels


Modern automatic differentiation systems are fundamentally tensor compiler systems. Their performance depends less on mathematical differentiation rules than on how efficiently tensor kernels execute on accelerators.

GPU tensor kernels are the computational substrate of modern deep learning, scientific computing, and differentiable simulation. Reverse-mode AD amplifies the importance of kernel efficiency because every forward operation usually induces at least one backward operation.

The challenge is therefore not only to compute derivatives correctly, but to compute them with high hardware efficiency.

## GPU Execution Model

GPUs are throughput-oriented processors.

They achieve performance through massive parallelism:

| Hardware Unit | Role |
|---|---|
| Thread | Smallest execution unit |
| Warp / wavefront | SIMD execution group |
| Block / workgroup | Cooperative thread group |
| Streaming multiprocessor | Parallel execution engine |
| Global memory | Large but slow |
| Shared memory | Fast local scratchpad |
| Registers | Fastest storage |

Tensor operations map naturally onto GPUs because many array elements can be processed independently.

Example:

$$
Y_{ij} = X_{ij} + Z_{ij}
$$

can assign one thread per element.
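
A serial Python model of this mapping, where each loop iteration stands in for one GPU thread (a conceptual sketch only, not an actual kernel):

```python
import numpy as np

X = np.arange(12, dtype=np.float32).reshape(3, 4)
Z = np.ones_like(X)
Y = np.empty_like(X)

# One logical "thread" per output element: thread t owns flat index t.
for t in range(Y.size):
    i, j = divmod(t, Y.shape[1])
    Y[i, j] = X[i, j] + Z[i, j]
```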

Matrix multiplication assigns thread blocks to output tiles.

## Tensor Kernel

A tensor kernel is a low-level operation executed on accelerator hardware.

Examples:

| Kernel Type | Example |
|---|---|
| Elementwise | ReLU, exp, add |
| Reduction | sum, max |
| Contraction | GEMM |
| Gather/scatter | indexing |
| Convolution | CNN kernels |
| Softmax | normalized exponential |
| Attention | batched contraction |

An AD engine lowers tensor programs into sequences of kernels.

## Forward and Backward Kernels

Every differentiable primitive induces backward kernels.

Example:

Forward:

$$
Y = XW.
$$

Backward:

$$
\bar{X} = \bar{Y}W^T,
$$

$$
\bar{W} = X^T\bar{Y}.
$$

Thus one matrix multiplication creates two additional matrix multiplications during reverse mode.
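
A minimal NumPy sketch of the resulting three GEMMs, assuming an upstream gradient `dY` arriving from later operations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 64)).astype(np.float32)
W = rng.standard_normal((64, 32)).astype(np.float32)

Y = X @ W             # forward GEMM

dY = np.ones_like(Y)  # upstream gradient, standing in for dL/dY
dX = dY @ W.T         # backward GEMM 1: gradient w.r.t. the input
dW = X.T @ dY         # backward GEMM 2: gradient w.r.t. the weights
```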

Backward computation is therefore often more expensive than forward computation.

Approximate cost:

| Pass | Relative FLOPs |
|---|---:|
| Forward | 1x |
| Backward | 2x to 3x |

Memory traffic may dominate arithmetic cost even more strongly in backward passes.

## Arithmetic Intensity

GPU efficiency depends heavily on arithmetic intensity:

$$
\text{Arithmetic Intensity} =
\frac{\text{FLOPs}}
{\text{Memory Traffic}}.
$$

High arithmetic intensity means many computations per byte loaded.

Examples:

| Operation | Intensity |
|---|---|
| Matrix multiplication | High |
| Elementwise add | Low |
| Reduction | Moderate |
| Scatter-add | Very low |

Matrix multiplication maps extremely well to GPUs because data reuse is high.

Elementwise kernels are often memory-bandwidth bound.
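
A back-of-envelope comparison for square FP32 operands of size $n$, assuming each operand crosses global memory exactly once (an idealized reuse model):

```python
def gemm_intensity(n, bytes_per_elem=4):
    flops = 2 * n**3                      # n^2 outputs, each a length-n dot product
    traffic = 3 * n**2 * bytes_per_elem   # read A and B once, write C once
    return flops / traffic

def add_intensity(n, bytes_per_elem=4):
    flops = n**2                          # one add per element
    traffic = 3 * n**2 * bytes_per_elem   # read X and Z, write Y
    return flops / traffic

print(gemm_intensity(4096))   # ~683 FLOPs per byte: compute-bound
print(add_intensity(4096))    # ~0.08 FLOPs per byte: bandwidth-bound
```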

## Matrix Multiplication Kernels

General matrix multiplication (GEMM) is the dominant dense tensor primitive.

$$
C = AB.
$$

Efficient GPU GEMM uses:

```text
tiling
shared memory reuse
register blocking
warp-level matrix instructions
tensor cores
```

A tile-based kernel computes each output submatrix by accumulating partial products over tiles of the shared inner dimension:

$$
C_{\text{tile}} \mathrel{+}= A_{\text{tile}} B_{\text{tile}}.
$$

Threads cooperatively load blocks into shared memory and reuse them repeatedly.
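
A serial NumPy model of the tiling loop, with shared-memory staging modeled by slicing out tiles (the tile size is an illustrative choice):

```python
import numpy as np

def tiled_matmul(A, B, T=32):
    """Compute C = A @ B tile by tile; each output tile accumulates partial
    products over tiles of the inner dimension, as a shared-memory GEMM does."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, T):
        for j in range(0, N, T):
            acc = np.zeros((min(T, M - i), min(T, N - j)), dtype=A.dtype)
            for k in range(0, K, T):
                acc += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]   # the "shared memory" tiles
            C[i:i+T, j:j+T] = acc
    return C

A = np.random.randn(96, 80).astype(np.float32)
B = np.random.randn(80, 64).astype(np.float32)
assert np.allclose(tiled_matmul(A, B), A @ B, atol=1e-3)
```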

Backward rules reuse GEMM:

| Gradient | Expression |
|---|---|
| Input gradient | $\bar{X} = \bar{Y}W^T$ |
| Weight gradient | $\bar{W} = X^T\bar{Y}$ |

Thus highly optimized GEMM kernels are critical for AD performance.

## Tensor Cores

Modern GPUs include specialized matrix-multiply hardware:

| Hardware | Unit |
|---|---|
| NVIDIA GPUs | Tensor cores |
| AMD GPUs | Matrix cores |
| Google TPUs | Systolic arrays |

These units accelerate:

$$
D = AB + C.
$$

They often use reduced precision:

| Precision | Common Use |
|---|---|
| FP16 | Mixed-precision training |
| BF16 | Mixed-precision training |
| TF32 | Accelerated FP32 matrix math |
| FP8 | Emerging large-scale training |

AD systems must preserve numerical stability under reduced precision arithmetic.
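
A small NumPy illustration of why result and accumulator precision matter; NumPy has no tensor cores, so this only contrasts an FP16 and an FP32 dot product over the same data:

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(4096).astype(np.float16)
b = rng.standard_normal(4096).astype(np.float16)

ref = np.dot(a.astype(np.float32), b.astype(np.float32))  # FP32 arithmetic
low = np.float32(np.dot(a, b))                            # FP16 inputs, FP16 result
print(ref, low)   # the FP16 path typically drifts from the FP32 reference
```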

## Kernel Fusion

Kernel launch overhead is expensive.

Instead of:

```text
tmp1 = x + y
tmp2 = relu(tmp1)
tmp3 = tmp2 * z
```

a compiler may fuse into:

```text
out = relu(x + y) * z
```

This avoids:

| Cost | Reason |
|---|---|
| Intermediate memory writes | Temporary tensors |
| Extra global memory reads | Reloading intermediates |
| Kernel launch overhead | Multiple launches |

Fusion is especially important for backward passes, which otherwise produce many tiny kernels.

## Backward Fusion

Reverse-mode AD naturally creates chains of small operations.

Example:

$$
y = \sin(x^2 + z).
$$

Backward:

$$
\bar{x} =
\bar{y}\,
\cos(x^2 + z)
\cdot 2x.
$$

Naively, this creates several intermediate tensors.

Fusion combines them into one kernel.

Without fusion:

```text
tmp1 = x*x
tmp2 = tmp1 + z
tmp3 = cos(tmp2)
tmp4 = tmp3 * 2*x
dx   = dy * tmp4
```

With fusion:

```text
dx = dy * cos(x*x + z) * 2*x
```

Fusion reduces memory bandwidth pressure dramatically.

## Reduction Kernels

Reductions are harder than elementwise kernels because they require communication.

Example:

$$
y = \sum_i x_i.
$$

Parallel reduction proceeds hierarchically:

1. Local thread accumulation.
2. Warp reduction.
3. Block reduction.
4. Global reduction.
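
A serial NumPy model of this hierarchy, where each pass sums fixed-width groups as a stand-in for the warp- and block-level stages:

```python
import numpy as np

def hierarchical_sum(x, width=32):
    """Repeatedly sum groups of `width` elements until one value remains,
    mirroring thread -> warp -> block -> global reduction stages."""
    x = np.asarray(x, dtype=np.float32)
    while x.size > 1:
        pad = (-x.size) % width                            # zero-pad to a multiple of width
        x = np.pad(x, (0, pad)).reshape(-1, width).sum(axis=1)
    return float(x[0])

print(hierarchical_sum(np.ones(1000)))   # 1000.0
```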

Backward rule:

$$
\bar{x}_i = \bar{y}.
$$

Forward reduction is communication-heavy. Backward broadcast is embarrassingly parallel.

Reductions are often synchronization-bound.

## Softmax Kernels

Softmax combines reduction and elementwise operations:

$$
y_i =
\frac{e^{x_i}}
{\sum_j e^{x_j}}.
$$

Efficient kernels fuse:

```text
max reduction
stabilized subtraction
exponentiation
sum reduction
division
```

Backward softmax also fuses reductions:

$$
\bar{x} =
y \odot
(
\bar{y} -
\langle \bar{y}, y\rangle
).
$$
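
A NumPy sketch of the fused forward pass and of the backward rule above, applied row-wise along the last axis:

```python
import numpy as np

def softmax(x):
    x_max = x.max(axis=-1, keepdims=True)         # max reduction (stabilization)
    e = np.exp(x - x_max)                         # stabilized exponentiation
    return e / e.sum(axis=-1, keepdims=True)      # sum reduction + division

def softmax_backward(y, dy):
    # x_bar = y * (dy - <dy, y>), with the inner product taken row-wise
    inner = (dy * y).sum(axis=-1, keepdims=True)
    return y * (dy - inner)
```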

Attention kernels heavily depend on efficient softmax implementations.

## Convolution Kernels

Convolutions are structured tensor contractions.

Forward:

$$
Y = X * K.
$$

Backward produces:

| Gradient | Operation |
|---|---|
| Input gradient | Convolution with flipped kernel |
| Kernel gradient | Correlation |
| Bias gradient | Reduction |

GPU implementations use:

| Method | Use |
|---|---|
| Direct convolution | Small kernels |
| im2col + GEMM | General dense convolution |
| FFT convolution | Large kernels |
| Winograd | Small dense kernels |

Backward convolutions often have different optimal algorithms from forward convolutions.
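
A one-dimensional NumPy sketch of these rules, taking the forward convolution to be "valid" cross-correlation (the usual deep-learning convention):

```python
import numpy as np

def conv1d(x, k):
    # forward: y[i] = sum_j x[i + j] * k[j]
    return np.correlate(x, k, mode="valid")

def conv1d_backward(x, k, dy):
    dx = np.convolve(dy, k, mode="full")     # input gradient: convolution with flipped kernel
    dk = np.correlate(x, dy, mode="valid")   # kernel gradient: correlation of x with dy
    return dx, dk

x, k = np.random.randn(10), np.random.randn(3)
dx, dk = conv1d_backward(x, k, np.ones_like(conv1d(x, k)))
```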

## Attention Kernels

Transformer attention dominates modern large models.

Basic attention:

$$
S = \frac{QK^T}{\sqrt{d}},
$$

$$
P = \operatorname{softmax}(S),
$$

$$
Y = PV.
$$
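
A direct NumPy implementation of these three steps, materializing the full score matrix:

```python
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    S = (Q @ K.T) / np.sqrt(d)              # n x n score matrix
    S = S - S.max(axis=-1, keepdims=True)   # stabilize before exponentiating
    P = np.exp(S)
    P /= P.sum(axis=-1, keepdims=True)      # row-wise softmax
    return P @ V
```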

Backward pass involves multiple matrix multiplications and softmax gradients.

Attention is memory-intensive because:

$$
S \in \mathbb{R}^{n \times n}.
$$

Long sequences create quadratic memory growth.

## Flash Attention

Flash Attention avoids materializing the full attention matrix.

Key idea:

```text
stream blocks through shared memory
recompute local statistics
avoid large intermediate tensors
```
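
A conceptual NumPy sketch of the blocking idea using the online-softmax rescaling trick; it models the access pattern only (a real kernel keeps each block in shared memory and fuses everything):

```python
import numpy as np

def blocked_attention(Q, K, V, block=64):
    """Stream key/value blocks while maintaining running softmax statistics
    (row max m, row sum l), so the n x n score matrix is never materialized."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(Q)
    m = np.full(n, -np.inf)                 # running row-wise max of the scores
    l = np.zeros(n)                         # running row-wise sum of exponentials
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = (Q @ Kb.T) * scale              # scores against this block only
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)           # rescale previously accumulated statistics
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=1)
        out = out * alpha[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]
```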

This reduces:

| Resource | Reduction |
|---|---|
| Global memory traffic | Large |
| HBM usage | Large |
| Intermediate allocation | Eliminated |

Backward pass recomputes local softmax statistics instead of storing them.

This is an example of recomputation trading FLOPs for memory.

## Checkpointing and Recomputation

Backward passes require forward intermediates.

Storing all activations is expensive.

Checkpointing stores only selected tensors:

```text
store sparse checkpoints
recompute missing intermediates during backward
```

Tradeoff:

| Strategy | Memory | Compute |
|---|---:|---:|
| Full storage | High | Low |
| Full recomputation | Low | High |
| Checkpointing | Medium | Medium |

Large-scale AD systems rely heavily on recomputation.
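
A toy NumPy sketch of segment-wise checkpointing for a chain of identical layers; the layer count, segment length, and `layer`/`layer_vjp` helpers are illustrative, not any particular framework's API:

```python
import numpy as np

N, K = 8, 2                                   # 8 layers, checkpoint every 2 layers

def layer(x):        return np.sin(x)
def layer_vjp(x, g): return g * np.cos(x)     # VJP of sin at x

def forward(x):
    """Run the chain, storing only every K-th activation (the checkpoints)."""
    ckpts = []
    for i in range(N):
        if i % K == 0:
            ckpts.append(x)
        x = layer(x)
    return x, ckpts

def backward(g, ckpts):
    """Sweep the segments in reverse; recompute each segment's intermediates
    from its checkpoint before applying the VJPs."""
    for s in reversed(range(0, N, K)):
        x = ckpts[s // K]
        xs = []
        for _ in range(K):                    # recompute this segment forward
            xs.append(x)
            x = layer(x)
        for x in reversed(xs):                # reverse sweep through the segment
            g = layer_vjp(x, g)
    return g

x = np.linspace(0.0, 1.0, 5)
y, ckpts = forward(x)
dx = backward(np.ones_like(y), ckpts)         # gradient of sum(y) w.r.t. x
```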

## Memory Bandwidth

Many tensor kernels are memory-bound rather than compute-bound.

Example:

$$
y_i = x_i + z_i.
$$

Per element, this performs:

- 1 addition
- 2 loads and 1 store (12 bytes of traffic in FP32)

Arithmetic cost is therefore tiny compared to memory movement.

GPU optimization therefore focuses on:

```text
coalesced access
cache reuse
shared memory reuse
kernel fusion
layout optimization
```

Bandwidth dominates many backward passes.

## Tensor Layout

Tensor layout affects kernel efficiency.

Common layouts:

| Layout | Description |
|---|---|
| Row-major | C layout |
| Column-major | Fortran layout |
| NHWC | Channels last |
| NCHW | Channels first |

Different kernels prefer different layouts.

Layout transforms themselves cost memory bandwidth.

AD systems often propagate layout constraints through graphs to avoid unnecessary transposes.

## Strides and Views

Many tensor operations create views rather than copies.

Example:

```python
y = x.transpose()
```

No data movement need occur.

The kernel instead consumes the tensor through modified strides.
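
In NumPy, for example, the transpose shares the original buffer and only swaps the strides:

```python
import numpy as np

x = np.zeros((1024, 512), dtype=np.float32)
y = x.T                       # a view: no data is moved

print(x.strides, y.strides)   # (2048, 4) and (4, 2048): the strides are swapped
print(y.base is x)            # True: y aliases x's storage
```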

Backward kernels must preserve aliasing semantics.

AD over mutable, aliasing views is significantly harder than AD over purely functional tensor programs.

## Scatter and Gather

Sparse and indexing operations use:

| Operation | Description |
|---|---|
| Gather | Read scattered locations |
| Scatter | Write scattered locations |
| Scatter-add | Accumulate scattered writes |

Backward rules swap the two: the gradient of a gather is a scatter-add, and the gradient of a scatter is a gather.

Example:

Forward:

$$
y_i = x_{\mathrm{idx}_i}.
$$

Backward:

$$
\bar{x}_{\mathrm{idx}_i}
\mathrel{+}=
\bar{y}_i.
$$
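
A NumPy sketch of this pair; `np.add.at` is the serial analogue of an atomic scatter-add and accumulates correctly even when indices repeat:

```python
import numpy as np

x = np.arange(6, dtype=np.float32)
idx = np.array([0, 3, 3, 5])     # note the repeated index 3

y = x[idx]                       # forward: gather

dy = np.ones_like(y)
dx = np.zeros_like(x)
np.add.at(dx, idx, dy)           # backward: scatter-add, resolving the collision at 3
print(dx)                        # [1. 0. 0. 2. 0. 1.]
```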

Scatter-add is difficult on GPUs because many threads may update the same location.

This requires:

```text
atomic operations
segmented reductions
parallel conflict resolution
```

## Parallel Reduction Nondeterminism

Floating-point addition is not associative.

Parallel reductions may produce:

$$
(a+b)+c
\ne
a+(b+c).
$$

Different thread scheduling can therefore give slightly different results for the same inputs.

GPU gradients that involve atomics or unordered parallel accumulation are thus often nondeterministic across runs.
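
A NumPy illustration of order sensitivity: two summation orders over the same FP32 data, where a GPU reduction effectively picks its order at run time:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

s1 = np.float32(x.sum())                                   # one accumulation order
s2 = np.float32(x.reshape(1000, 1000).sum(axis=0).sum())   # a different order
print(s1 == s2, abs(float(s1) - float(s2)))                # usually differs in the last bits
```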

Tradeoff:

| Goal | Cost |
|---|---|
| Deterministic reductions | Lower throughput |
| Maximum throughput | Nondeterministic accumulation |

Large-scale training usually prioritizes throughput.

## Mixed Precision Training

Modern AD systems use mixed precision extensively.

Forward:

| Tensor Type | Precision |
|---|---|
| Activations | FP16/BF16 |
| Weights | FP16/BF16 |
| Accumulators | FP32 |

Backward:

| Quantity | Precision |
|---|---|
| Gradients | Often FP16/BF16 |
| Master weights | FP32 |

Loss scaling prevents gradient underflow:

$$
L' = sL.
$$

Backward computes scaled gradients, then rescales:

$$
\nabla L = \frac{1}{s}\nabla L'.
$$
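
A tiny NumPy demonstration of why scaling matters; the toy gradient value is illustrative and underflows in FP16 without scaling:

```python
import numpy as np

s = np.float16(1024.0)             # loss scale
w = np.float16(0.01)               # half-precision weight
x = np.float16(1e-3)               # half-precision activation

# Toy gradient dL/dw = w * x**2 for the hypothetical loss L = 0.5 * (w * x)**2.
g_plain  = w * x * x               # underflows to 0.0 in FP16
g_scaled = (s * w) * x * x         # the scaled gradient stays representable
g_master = np.float32(g_scaled) / np.float32(s)   # unscale into the FP32 master copy

print(g_plain, g_master)           # 0.0 versus roughly 1e-8
```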

Mixed precision is essential for modern large-scale training.

## Compiler Systems

Modern AD systems increasingly compile tensor graphs.

Examples:

| System | Compiler |
|---|---|
| JAX | XLA |
| PyTorch | TorchInductor |
| TensorFlow | XLA |
| TVM | Tensor compiler |
| Triton | GPU DSL |

These systems perform:

```text
fusion
layout optimization
kernel scheduling
memory planning
vectorization
autotuning
```

AD is now tightly integrated with compiler optimization.

## Autotuning

Kernel performance depends on:

```text
tile sizes
warp shapes
block dimensions
shared memory usage
register pressure
```

Compilers often benchmark multiple variants dynamically.

Autotuning is especially important for:

| Primitive | Sensitivity |
|---|---|
| GEMM | Very high |
| Convolution | Very high |
| Attention | High |
| Reductions | Moderate |

## GPU-Aware AD Design

A GPU-efficient AD system should minimize:

| Problem | Cause |
|---|---|
| Tiny kernels | Poor fusion |
| Excessive intermediates | Materialization |
| Global memory traffic | Unfused graphs |
| Layout conversions | Mismatched kernels |
| Synchronization | Reductions/scatter |
| Recomputation imbalance | Poor checkpointing |

Good AD design is therefore also hardware design.

## Practical Principles

Prefer:

| Good Practice | Avoid |
|---|---|
| Fused kernels | Long chains of tiny ops |
| Tensor contractions | Scalar loops |
| Structured reductions | Random scatter |
| Recomputation checkpointing | Storing every activation |
| Mixed precision | Full FP64 everywhere |
| Layout-aware graphs | Constant transposes |

For compiler designers:

| Focus | Importance |
|---|---|
| Memory bandwidth | Critical |
| Fusion | Critical |
| Tensor contraction scheduling | Critical |
| Reduction efficiency | Major |
| Autotuning | Major |
| Sparse kernel optimization | Increasingly important |

## Summary

GPU tensor kernels are the execution layer beneath modern automatic differentiation systems. The mathematical derivative rules are simple compared with the systems engineering required to execute them efficiently.

Reverse-mode AD amplifies memory traffic, synchronization cost, and intermediate tensor allocation. Efficient implementations therefore rely on fusion, tiling, mixed precision, checkpointing, layout optimization, and compiler-guided kernel generation.

Modern AD systems are best understood not merely as differentiation engines, but as tensor compiler runtimes specialized for accelerator hardware.

