Tensor operations generalize scalar, vector, and matrix operations to arrays with arbitrary rank. In automatic differentiation, a tensor is usually treated as a typed array with shape, dtype, layout, and device. The mathematical object is an element of a finite-dimensional vector space. The systems object is a strided memory view with rules for indexing and mutation.
A tensor with shape $(d_1, d_2, \ldots, d_r)$ has entries

$$T_{i_1 i_2 \cdots i_r}, \qquad 1 \le i_k \le d_k.$$

The number $r$ is the rank of the tensor. The total number of elements is $d_1 d_2 \cdots d_r$.
A tensor operation is a function between shaped spaces:

$$f : \mathbb{R}^{d_1 \times \cdots \times d_r} \to \mathbb{R}^{e_1 \times \cdots \times e_s}.$$
Its derivative is still a linear map. As with matrices, AD systems rarely form the full derivative tensor. They compute how perturbations move forward and how adjoints move backward.
Elementwise Operations
An elementwise operation applies the same scalar function to every entry:

$$y_i = \phi(x_i).$$

The forward differential is

$$dy_i = \phi'(x_i)\, dx_i.$$

So

$$dy = \phi'(x) \odot dx.$$

Here $\odot$ means elementwise multiplication.

The reverse rule is

$$\bar{x} = \phi'(x) \odot \bar{y}.$$
Examples:
| Operation | Forward Rule | Reverse Rule |
|---|---|---|
| $y = \exp(x)$ | $dy = \exp(x) \odot dx$ | $\bar{x} = \exp(x) \odot \bar{y}$ |
| $y = \log(x)$ | $dy = dx / x$ | $\bar{x} = \bar{y} / x$ |
| $y = \mathrm{relu}(x)$ | $dy = \mathbf{1}[x > 0] \odot dx$ | $\bar{x} = \mathbf{1}[x > 0] \odot \bar{y}$ |
Elementwise rules are local and shape-preserving. They are among the simplest AD primitives.
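As a sketch, the elementwise reverse rule is a single multiply. The helper name `elementwise_vjp` below is illustrative, not from any particular library:

```python
import numpy as np

def elementwise_vjp(phi_prime, x, y_bar):
    # Reverse rule for y = phi(x) applied elementwise:
    # x_bar = phi'(x) * y_bar, multiplied entry by entry.
    return phi_prime(x) * y_bar

x = np.array([0.5, 1.0, 2.0])
y_bar = np.ones_like(x)

# For y = exp(x), the derivative phi'(x) is exp(x) itself.
x_bar = elementwise_vjp(np.exp, x, y_bar)
```

The adjoint has the same shape as the input, which is the defining property of shape-preserving elementwise rules.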
Reductions
A reduction removes one or more axes.
For example, if

$$y_i = \sum_j x_{ij},$$

then $y$ has the same shape as $x$, except axis $j$ is removed or kept with size $1$, depending on the API.
For a full sum,

$$y = \sum_{i_1, \ldots, i_r} x_{i_1 \cdots i_r}.$$

The forward differential is

$$dy = \sum_{i_1, \ldots, i_r} dx_{i_1 \cdots i_r}.$$

The reverse rule broadcasts the scalar adjoint back to every input element:

$$\bar{x}_{i_1 \cdots i_r} = \bar{y}.$$
So reduction and broadcasting are adjoints of each other.
For a mean reduction over $n$ elements,

$$y = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

The reverse rule is

$$\bar{x}_i = \frac{\bar{y}}{n}.$$
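A minimal NumPy sketch of the sum and mean reverse rules (the helper names are illustrative):

```python
import numpy as np

def sum_vjp(x_shape, y_bar):
    # Reverse of a full sum: broadcast the scalar adjoint
    # back to every input element.
    return np.broadcast_to(y_bar, x_shape).copy()

def mean_vjp(x_shape, y_bar):
    # Reverse of a mean over n elements: each element receives y_bar / n.
    n = np.prod(x_shape)
    return np.broadcast_to(y_bar / n, x_shape).copy()

x = np.arange(6.0).reshape(2, 3)
x_bar_sum = sum_vjp(x.shape, 1.0)    # every entry is 1.0
x_bar_mean = mean_vjp(x.shape, 1.0)  # every entry is 1/6
```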
For a maximum reduction, $y = \max_i x_i$, the rule is piecewise. If a unique index $i^{*}$ attains the maximum, then

$$\bar{x}_{i^{*}} = \bar{y}, \qquad \bar{x}_i = 0 \ \text{for } i \ne i^{*}.$$
If several entries tie for the maximum, the derivative is not unique. Libraries choose a subgradient convention, often sending the adjoint to one selected maximum or splitting it across all maxima.
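One common convention, sketched below, routes the adjoint to the first attaining index (the behavior of `np.argmax` on ties); other libraries split it across all maxima:

```python
import numpy as np

def max_vjp(x, y_bar):
    # Send the adjoint to a single attaining index.
    # np.argmax picks the first maximum on ties, which is
    # one subgradient convention among several.
    x_bar = np.zeros_like(x)
    x_bar[np.unravel_index(np.argmax(x), x.shape)] = y_bar
    return x_bar

x = np.array([1.0, 3.0, 3.0, 2.0])   # indices 1 and 2 tie
x_bar = max_vjp(x, 1.0)               # adjoint goes to index 1 only
```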
Reshape and View Operations
Reshape changes how the same elements are indexed. It does not change values.
The forward rule is

$$dy = \operatorname{reshape}(dx, \operatorname{shape}(y)).$$

The reverse rule is

$$\bar{x} = \operatorname{reshape}(\bar{y}, \operatorname{shape}(x)).$$
Transpose and permutation are similar. If

$$y = \operatorname{permute}(x, \pi),$$

then

$$dy = \operatorname{permute}(dx, \pi),$$

and the reverse rule uses the inverse permutation:

$$\bar{x} = \operatorname{permute}(\bar{y}, \pi^{-1}).$$
These operations are mathematically cheap. In a runtime, they may create views with new strides rather than copy data. AD must respect aliasing: two different views may refer to the same storage.
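A sketch of both reverse rules in NumPy; note that `np.argsort` of a permutation yields its inverse:

```python
import numpy as np

def reshape_vjp(x_shape, y_bar):
    # Reverse of reshape: reindex the adjoint back to the input shape.
    return y_bar.reshape(x_shape)

def permute_vjp(perm, y_bar):
    # Reverse of an axis permutation: apply the inverse permutation.
    inv_perm = np.argsort(perm)
    return np.transpose(y_bar, inv_perm)

y_bar = np.ones((3, 2))
x_bar = permute_vjp((1, 0), y_bar)       # back to shape (2, 3)
x_bar2 = reshape_vjp((2, 3), np.ones(6)) # back to shape (2, 3)
```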
Slicing and Indexing
Slicing selects part of a tensor:

$$y = x[a{:}b].$$

The forward rule is

$$dy = dx[a{:}b].$$

The reverse rule scatters the output adjoint back into the selected region:

$$\bar{x}[a{:}b] \mathrel{+}= \bar{y}, \qquad \bar{x} = 0 \ \text{elsewhere}.$$
For repeated indices, adjoints must accumulate. For example,

$$y = x[[0, 0, 1]]$$

means the first input element is used twice. The reverse pass must add both contributions to $\bar{x}_0$.
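In NumPy, `np.add.at` performs exactly this accumulating scatter; plain fancy-index assignment would silently overwrite on duplicates. A sketch with an illustrative helper name:

```python
import numpy as np

def gather_vjp(x_shape, idx, y_bar):
    # Reverse of y = x[idx]: scatter the adjoint back into zeros,
    # ACCUMULATING where indices repeat. np.add.at handles duplicates;
    # x_bar[idx] = y_bar would keep only the last write.
    x_bar = np.zeros(x_shape)
    np.add.at(x_bar, idx, y_bar)
    return x_bar

idx = np.array([0, 0, 2])                 # index 0 appears twice
x_bar = gather_vjp((3,), idx, np.ones(3)) # -> [2., 0., 1.]
```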
Indexing has a sharp distinction:
| Quantity | Differentiability |
|---|---|
| Indexed values | Differentiable |
| Integer indices | Usually not differentiable |
| Soft indices or interpolation weights | Differentiable |
Integer index selection is discrete. AD can propagate gradients through the selected values, but not through the selection decision itself without relaxation or specialized estimators.
Concatenation and Split
Concatenation joins tensors along an axis:

$$y = \operatorname{concat}(u, v;\ \text{axis} = k).$$

The forward differential is

$$dy = \operatorname{concat}(du, dv;\ \text{axis} = k).$$

The reverse rule splits the output adjoint into matching pieces:

$$(\bar{u}, \bar{v}) = \operatorname{split}(\bar{y};\ \text{axis} = k).$$
Split is the adjoint pattern in reverse: the forward pass slices, and the reverse pass concatenates or scatters.
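A sketch of the concatenation reverse rule, splitting the output adjoint by each input's extent along the axis (helper name illustrative):

```python
import numpy as np

def concat_vjp(sizes, axis, y_bar):
    # Reverse of concatenation: split the output adjoint into pieces
    # whose sizes along `axis` match the original inputs.
    split_points = np.cumsum(sizes)[:-1]
    return np.split(y_bar, split_points, axis=axis)

# Forward pass joined shapes (2, 3) and (2, 5) along axis 1.
y_bar = np.arange(16.0).reshape(2, 8)
u_bar, v_bar = concat_vjp([3, 5], 1, y_bar)
```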
Broadcasting
Broadcasting expands a smaller tensor across one or more axes.
For example,

$$y_{ij} = x_i$$

broadcasts a tensor of shape $(m)$ to shape $(m, n)$.

The forward differential is

$$dy_{ij} = dx_i.$$

The reverse rule sums over the broadcasted axis:

$$\bar{x}_i = \sum_j \bar{y}_{ij}.$$
This is one of the most common places where hand-written gradients are wrong. The reverse of broadcasting is reduction over exactly the axes introduced by broadcasting.
More generally, the reverse of a broadcast is a sum over exactly the broadcast axes:

$$\bar{x} = \sum_{\text{broadcast axes}} \bar{y}.$$
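A sketch of a generic "unbroadcast" helper under NumPy broadcasting semantics: sum away leading axes that were added, then sum (keeping dims) over axes that were size 1 and got expanded. The helper name is illustrative:

```python
import numpy as np

def unbroadcast(x_shape, y_bar):
    # Reverse of broadcasting x (shape x_shape) up to y_bar's shape.
    # 1) Sum away leading axes added by broadcasting.
    while y_bar.ndim > len(x_shape):
        y_bar = y_bar.sum(axis=0)
    # 2) Sum over axes that were size 1 in the input but expanded.
    for axis, size in enumerate(x_shape):
        if size == 1 and y_bar.shape[axis] != 1:
            y_bar = y_bar.sum(axis=axis, keepdims=True)
    return y_bar.reshape(x_shape)

x_bar = unbroadcast((4, 1), np.ones((4, 3)))  # shape (4, 1), entries 3.0
x_bar2 = unbroadcast((3,), np.ones((2, 3)))   # shape (3,), entries 2.0
```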
Tensor Contraction
Tensor contraction generalizes matrix multiplication. It multiplies tensors and sums over shared axes.
Matrix multiplication is

$$C_{ij} = \sum_k A_{ik} B_{kj}.$$

A general contraction has the same form,

$$C_{ij} = \sum_k A_{ik} B_{kj}.$$

Here $i$, $j$, and $k$ may each represent multiple axes.
The forward differential is

$$dC_{ij} = \sum_k \left( dA_{ik}\, B_{kj} + A_{ik}\, dB_{kj} \right).$$

The reverse rules are contractions with the complementary tensor:

$$\bar{A}_{ik} = \sum_j \bar{C}_{ij}\, B_{kj}, \qquad \bar{B}_{kj} = \sum_i A_{ik}\, \bar{C}_{ij}.$$
This is the abstract rule behind matrix multiplication, batched matrix multiplication, attention score computation, convolution lowered to matrix multiplication, and many einsum expressions.
Einsum Notation
Einstein summation notation gives a compact language for tensor contractions.
For example,

`C = einsum("ij,jk->ik", A, B)`

corresponds to matrix multiplication. A batched matrix multiplication can be written as

`C = einsum("bij,bjk->bik", A, B)`

The repeated index $j$ is summed. Indices appearing in the output are preserved.
For an einsum expression, reverse-mode rules can be derived by replacing one input with the output adjoint and solving for the missing input index pattern.
If

`C = einsum("ij,jk->ik", A, B)`

then

`A_bar = einsum("ik,jk->ij", C_bar, B)`

`B_bar = einsum("ij,ik->jk", A, C_bar)`
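These einsum-derived reverse rules can be checked against the familiar matrix identities $\bar{A} = \bar{C} B^\top$ and $\bar{B} = A^\top \bar{C}$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))
B = rng.normal(size=(3, 4))
C_bar = np.ones((2, 4))  # output adjoint

# Swap the output adjoint into the expression and solve for the
# remaining input's index pattern.
A_bar = np.einsum("ik,jk->ij", C_bar, B)   # same as C_bar @ B.T
B_bar = np.einsum("ij,ik->jk", A, C_bar)   # same as A.T @ C_bar
```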
Einsum is valuable because it makes shape relationships explicit. It also gives compiler backends a compact representation for optimization.
Norms
Norms reduce tensors to scalars or lower-rank tensors.
For the squared Euclidean norm,

$$y = \lVert x \rVert^2 = \sum_i x_i^2,$$

the differential is

$$dy = 2 \sum_i x_i\, dx_i.$$

So

$$\bar{x} = 2\, \bar{y}\, x.$$

For the Euclidean norm,

$$y = \lVert x \rVert = \sqrt{\sum_i x_i^2},$$

when $x \ne 0$,

$$\bar{x} = \bar{y}\, \frac{x}{\lVert x \rVert}.$$

At $x = 0$, the norm is not differentiable. Implementations either return a chosen subgradient, return zero, or rely on user-added numerical stabilization such as

$$y = \sqrt{\sum_i x_i^2 + \epsilon}.$$
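A sketch of the norm reverse rule with an epsilon guard; note the guard is a stabilization choice that keeps the rule finite near zero, not the true subgradient at zero:

```python
import numpy as np

def norm_vjp(x, y_bar, eps=1e-12):
    # Gradient of y = ||x||: x_bar = y_bar * x / ||x||.
    # eps prevents division by zero near x = 0.
    return y_bar * x / (np.sqrt(np.sum(x * x)) + eps)

x = np.array([3.0, 4.0])        # ||x|| = 5
x_bar = norm_vjp(x, 1.0)        # approximately [0.6, 0.8]
```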
Softmax
Softmax is a tensor operation usually applied along one axis. For a vector $x$,

$$y_i = \frac{e^{x_i}}{\sum_j e^{x_j}}.$$

The differential is

$$dy_i = y_i \left( dx_i - \sum_j y_j\, dx_j \right).$$

Given an output adjoint $\bar{y}$, the reverse rule is

$$\bar{x}_i = y_i \left( \bar{y}_i - \sum_j \bar{y}_j\, y_j \right).$$

In vector form:

$$\bar{x} = y \odot \left( \bar{y} - (y^\top \bar{y})\, \mathbf{1} \right).$$
Softmax is rarely implemented as a naive primitive. Production systems use numerically stable forms, usually subtracting the maximum value before exponentiation. For cross-entropy loss, systems often fuse softmax and log-loss to avoid unstable intermediate values.
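A sketch of the max-subtraction trick together with the reverse rule above; because softmax outputs sum to a constant, the resulting adjoint always sums to zero, which makes a convenient sanity check:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable: subtracting the max leaves the result
    # unchanged but keeps the exponentials from overflowing.
    z = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return z / np.sum(z, axis=axis, keepdims=True)

def softmax_vjp(y, y_bar, axis=-1):
    # x_bar_i = y_i * (y_bar_i - sum_j y_bar_j * y_j)
    dot = np.sum(y_bar * y, axis=axis, keepdims=True)
    return y * (y_bar - dot)

x = np.array([1.0, 2.0, 3.0])
y = softmax(x)
x_bar = softmax_vjp(y, np.array([1.0, 0.0, 0.0]))
```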
Tensor Operations and Memory Layout
A tensor has more than a shape. It also has a layout. A strided tensor view is described by:

- a pointer to underlying storage
- an offset into that storage
- a shape
- a stride for each axis

The address of an element is computed as

$$\operatorname{addr}(i_1, \ldots, i_r) = \text{offset} + \sum_{k=1}^{r} s_k\, i_k.$$

Here $s_k$ is the stride for axis $k$.
Two tensors may have the same shape but different memory layouts. Transpose can often be represented by changing strides. Slice can often be represented by changing offset and shape. Expand can sometimes be represented using a stride of zero.
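A small sketch of the address formula, using element-count strides for clarity (NumPy's own `strides` attribute is in bytes). The helper name is illustrative:

```python
import numpy as np

def element_address(offset, strides, index):
    # addr = offset + sum_k stride_k * i_k   (strides in elements)
    return offset + sum(s * i for s, i in zip(strides, index))

x = np.arange(6.0).reshape(2, 3)  # row-major, element strides (3, 1)
xt = x.T                          # transpose: same storage, strides (1, 3)

# xt[2, 1] is x[1, 2], at flat offset 1*2 + 3*1 = 5 in shared storage.
addr = element_address(0, (1, 3), (2, 1))
```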
This matters for AD because the reverse pass must accumulate into storage correctly. If multiple output elements refer to the same input storage location, adjoints must be summed, not overwritten.
Tensor Primitive Rules
| Primitive | Forward Differential | Reverse Rule |
|---|---|---|
| Elementwise $\phi$ | $dy = \phi'(x) \odot dx$ | $\bar{x} = \phi'(x) \odot \bar{y}$ |
| Sum reduction | Sum the input differentials | Broadcast the output adjoint |
| Broadcast | Broadcast the input differential | Sum over the broadcast axes |
| Reshape / permute | Apply the same reindexing to $dx$ | Apply the inverse reindexing to $\bar{y}$ |
| Slice | Slice the input differential | Scatter the adjoint into zeros, accumulating |
| Concatenate | Concatenate input differentials | Split the output adjoint |
| Contraction | Contract input differentials | Contract output adjoint with complementary inputs |
Implementation View
A tensor AD engine typically represents each primitive with:
```
op:
  inputs: Tensor[]
  output: Tensor
  forward: compute output values
  jvp: compute output tangent from input tangents
  vjp: compute input adjoints from output adjoint
```

The reverse rule for a primitive must handle:

- shape
- dtype
- layout
- broadcast axes
- reduction axes
- aliasing
- device placement
- accumulation

For example, the reverse rule for broadcast addition must know which axes were broadcast during the forward pass. Without that metadata, it cannot reduce the adjoint correctly.
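A minimal sketch of a broadcast-addition primitive that records the forward metadata (each input's original shape) needed by its reverse rule; all names here are illustrative:

```python
import numpy as np

def broadcast_add_forward(a, b):
    # Record each input's original shape: the vjp cannot recover
    # the broadcast axes from the output alone.
    return a + b, (a.shape, b.shape)

def reduce_to_shape(shape, g):
    # Sum the adjoint over exactly the axes broadcast in the forward pass.
    while g.ndim > len(shape):
        g = g.sum(axis=0)
    for ax, size in enumerate(shape):
        if size == 1 and g.shape[ax] != 1:
            g = g.sum(axis=ax, keepdims=True)
    return g.reshape(shape)

def broadcast_add_vjp(meta, y_bar):
    a_shape, b_shape = meta
    return reduce_to_shape(a_shape, y_bar), reduce_to_shape(b_shape, y_bar)

y, meta = broadcast_add_forward(np.ones((4, 1)), np.ones((1, 3)))
a_bar, b_bar = broadcast_add_vjp(meta, np.ones((4, 3)))
```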
Practical Rule
For tensor operations, the safest derivation process is:
- Write the operation with explicit indices.
- Differentiate the indexed equation.
- Attach an output adjoint.
- Sum over output indices.
- Rearrange terms so each input differential is isolated.
- Read off the reverse rule.
- Check shapes and broadcast axes.
Tensor AD is mostly disciplined bookkeeping. The calculus is local. The difficulty is preserving the exact semantics of shape, indexing, layout, and accumulation.
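A derivation produced by this process can be validated numerically: the inner product of the derived adjoint with a random tangent must match a directional finite difference. A sketch for $y = \sum_i x_i^2$, whose derived rule is $\bar{x} = 2\,\bar{y}\,x$:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=(3, 4))
y_bar = 1.0

x_bar = 2.0 * x * y_bar  # derived reverse rule for y = sum(x * x)

# Directional finite difference along a random tangent v:
v = rng.normal(size=x.shape)
eps = 1e-6
fd = (np.sum((x + eps * v) ** 2) - np.sum((x - eps * v) ** 2)) / (2 * eps)

# <x_bar, v> must equal the directional derivative of y along v.
analytic = np.sum(x_bar * v)
```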