Chapter 10. Matrix and Tensor Differentiation

Matrix Calculus

Matrix calculus is the notation and rule system used to differentiate functions whose inputs, outputs, or intermediate values are vectors, matrices, or tensors. Automatic differentiation systems do not require matrix calculus internally: they only need local derivative rules for primitive operations. However, matrix calculus is the language used to specify those rules, check them, and understand their shape.

A scalar function has the familiar form

$$ f : \mathbb{R} \to \mathbb{R}. $$

A vector function may have the form

$$ f : \mathbb{R}^n \to \mathbb{R}^m. $$

A matrix function may have the form

$$ F : \mathbb{R}^{m \times n} \to \mathbb{R}^{p \times q}. $$

The derivative of such a function is still a linear map. The difficulty is mostly notation. A derivative of a matrix-valued function with respect to a matrix input is naturally a fourth-order object. In practical AD systems, that object is almost never materialized. Instead, AD computes products with it.

The central idea is:

$$ \text{Derivative} = \text{best local linear approximation}. $$

For a function $f(x)$ near a point $x$, the first-order approximation is

$$ f(x + \Delta x) \approx f(x) + Df(x)[\Delta x]. $$

Here $Df(x)$ is the derivative, and $Df(x)[\Delta x]$ is the derivative applied to a perturbation $\Delta x$. This notation is often cleaner than writing giant Jacobians.
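As a quick numerical sanity check, the sketch below (NumPy; the function `f` and its hand-derived derivative `Df` are illustrative choices, not from the text) shows that the residual of the first-order approximation is second-order small in the perturbation:

```python
# Minimal check that f(x + dx) - f(x) ~ Df(x)[dx] for small dx.
import numpy as np

def f(x):
    # Illustrative scalar function of two variables.
    return np.sin(x[0]) + x[0] * x[1]

def Df(x, dx):
    # Hand-derived derivative applied to a perturbation dx:
    # df = (cos(x0) + x1) dx0 + x0 dx1.
    return (np.cos(x[0]) + x[1]) * dx[0] + x[0] * dx[1]

x = np.array([0.7, -1.2])
dx = 1e-6 * np.array([0.3, 0.5])

residual = f(x + dx) - f(x) - Df(x, dx)
print(abs(residual))  # O(|dx|^2): far smaller than |dx| itself
```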

Scalars, Vectors, and Matrices

Let

$$ x \in \mathbb{R}^n. $$

A scalar-valued function has type

$$ f : \mathbb{R}^n \to \mathbb{R}. $$

Its gradient is

$$ \nabla f(x) \in \mathbb{R}^n. $$

The first-order expansion is

$$ f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x. $$

A vector-valued function has type

$$ f : \mathbb{R}^n \to \mathbb{R}^m. $$

Its Jacobian is

$$ J_f(x) \in \mathbb{R}^{m \times n}. $$

The first-order expansion is

$$ f(x + \Delta x) \approx f(x) + J_f(x)\,\Delta x. $$

A matrix-valued input can be flattened into a vector, but doing so often hides structure. If

$$ X \in \mathbb{R}^{m \times n}, $$

then a scalar-valued function

$$ f : \mathbb{R}^{m \times n} \to \mathbb{R} $$

has gradient

$$ \nabla_X f(X) \in \mathbb{R}^{m \times n}. $$

The first-order expansion is written using the Frobenius inner product:

$$ f(X + \Delta X) \approx f(X) + \langle \nabla_X f(X), \Delta X \rangle_F. $$

The Frobenius inner product is

$$ \langle A, B \rangle_F = \operatorname{tr}(A^T B) = \sum_{i,j} A_{ij} B_{ij}. $$

This is the matrix analogue of the vector dot product.
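A two-line NumPy check (shapes chosen arbitrarily) confirms that the trace form and the elementwise sum agree:

```python
# The Frobenius inner product: trace(A^T B) equals sum(A * B).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

via_trace = np.trace(A.T @ B)   # tr(A^T B)
via_sum = np.sum(A * B)         # elementwise product, summed
print(np.isclose(via_trace, via_sum))  # True
```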

Differential Notation

Differential notation is often the cleanest notation for deriving matrix gradients.

For a scalar function $f(X)$, we write

$$ df = \langle \nabla_X f, dX \rangle_F. $$

The goal is to rewrite $df$ so that every occurrence of $dX$ appears in the final position of an inner product. Then the coefficient of $dX$ is the gradient.

For example, let

$$ f(X) = \operatorname{tr}(A^T X). $$

Then

$$ df = \operatorname{tr}(A^T dX). $$

So

$$ \nabla_X f = A. $$

This follows because

$$ df = \langle A, dX \rangle_F. $$

Now consider

$$ f(X) = \operatorname{tr}(X^T X). $$

Then

$$ df = \operatorname{tr}(dX^T X + X^T dX). $$

Using trace symmetry,

$$ \operatorname{tr}(dX^T X) = \operatorname{tr}(X^T dX). $$

Therefore

$$ df = 2\operatorname{tr}(X^T dX), $$

and

$$ \nabla_X f = 2X. $$

This method is mechanical. It is also close to how reverse-mode AD works: move local perturbations backward until the coefficient of each input perturbation is exposed.
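The derived gradient is easy to validate numerically. The NumPy sketch below (the finite-difference loop is an illustrative test harness, not part of the text) compares $2X$ against central differences of $\operatorname{tr}(X^T X)$:

```python
# Check that grad of tr(X^T X) is 2X via central finite differences.
import numpy as np

def f(X):
    return np.trace(X.T @ X)

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 2))
grad = 2 * X  # result of the differential derivation above

eps = 1e-6
fd = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(fd, grad))  # True
```

Because $f$ is quadratic, the central difference is exact up to roundoff.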

Common Matrix Derivatives

| Function | Shape | Gradient |
| --- | --- | --- |
| $f(x) = a^T x$ | $x, a \in \mathbb{R}^n$ | $\nabla_x f = a$ |
| $f(x) = x^T A x$ | $A \in \mathbb{R}^{n \times n}$ | $\nabla_x f = (A + A^T)x$ |
| $f(x) = \lVert x \rVert_2^2$ | $x \in \mathbb{R}^n$ | $\nabla_x f = 2x$ |
| $f(X) = \operatorname{tr}(A^T X)$ | $A, X \in \mathbb{R}^{m \times n}$ | $\nabla_X f = A$ |
| $f(X) = \operatorname{tr}(X^T X)$ | $X \in \mathbb{R}^{m \times n}$ | $\nabla_X f = 2X$ |
| $f(X) = \lVert X \rVert_F^2$ | $X \in \mathbb{R}^{m \times n}$ | $\nabla_X f = 2X$ |
| $f(X) = \log \det X$ | $X \in \mathbb{R}^{n \times n}$ nonsingular | $\nabla_X f = X^{-T}$ |
| $f(X) = \operatorname{tr}(X^{-1} A)$ | $X$ nonsingular | $\nabla_X f = -X^{-T} A^T X^{-T}$ |

The table assumes real-valued matrices and the Frobenius inner product convention.
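Individual entries can be spot-checked numerically. A hedged NumPy sketch for the $\log \det$ row (the diagonal shift is only there to keep the random $X$ well-conditioned):

```python
# Spot-check: grad of log det X is X^{-T} for nonsingular X.
import numpy as np

def f(X):
    return np.linalg.slogdet(X)[1]  # log |det X|, numerically stable

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)  # well-conditioned
grad = np.linalg.inv(X).T  # table entry X^{-T}

eps = 1e-6
fd = np.zeros_like(X)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(X)
        E[i, j] = eps
        fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(fd, grad, atol=1e-6))  # True
```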

Linear Maps and Adjoints

Automatic differentiation works locally with linear maps.

Suppose a primitive operation has the form

$$ Y = AX. $$

The forward perturbation rule is

$$ dY = A\,dX. $$

This rule computes a Jacobian-vector product. It sends an input perturbation $dX$ forward to an output perturbation $dY$.

Reverse mode needs the adjoint rule. Let $\bar{Y}$ be the adjoint of $Y$, meaning the gradient accumulated at $Y$. We want $\bar{X}$, the gradient contribution to $X$.

The adjoint rule is determined by preserving inner products:

$$ \langle \bar{Y}, dY \rangle_F = \langle \bar{X}, dX \rangle_F. $$

Since

$$ dY = A\,dX, $$

we have

$$ \langle \bar{Y}, A\,dX \rangle_F = \operatorname{tr}(\bar{Y}^T A\,dX). $$

Rearrange the trace:

$$ \operatorname{tr}(\bar{Y}^T A\,dX) = \operatorname{tr}((A^T \bar{Y})^T dX). $$

Therefore

$$ \bar{X} = A^T \bar{Y}. $$

So the reverse rule for

$$ Y = AX $$

is

$$ \bar{X} \mathrel{+}= A^T \bar{Y}. $$

If $A$ is also an input, then

$$ dY = dA\,X + A\,dX. $$

The adjoint contribution to $A$ is

$$ \bar{A} \mathrel{+}= \bar{Y} X^T. $$

So the complete reverse rule is:

$$ \bar{A} \mathrel{+}= \bar{Y} X^T, \qquad \bar{X} \mathrel{+}= A^T \bar{Y}. $$

This pattern is fundamental. Most tensor AD rules are obtained by writing the forward differential and then taking the adjoint.
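The inner-product identity at the heart of the derivation can be checked directly. A NumPy sketch with arbitrary shapes:

```python
# Verify <Ybar, A dX>_F == <A^T Ybar, dX>_F, the adjoint identity.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
dX = rng.standard_normal((3, 5))
Ybar = rng.standard_normal((4, 5))

lhs = np.sum(Ybar * (A @ dX))    # <Ybar, dY>_F with dY = A dX
rhs = np.sum((A.T @ Ybar) * dX)  # <A^T Ybar, dX>_F
print(np.isclose(lhs, rhs))  # True
```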

Matrix Multiplication

Let

$$ C = AB, $$

where

$$ A \in \mathbb{R}^{m \times k}, \qquad B \in \mathbb{R}^{k \times n}, \qquad C \in \mathbb{R}^{m \times n}. $$

The forward differential is

$$ dC = dA\,B + A\,dB. $$

Given an output adjoint

$$ \bar{C} \in \mathbb{R}^{m \times n}, $$

we derive the reverse rules from

$$ \langle \bar{C}, dC \rangle_F = \langle \bar{C}, dA\,B + A\,dB \rangle_F. $$

For the $A$ term:

$$ \langle \bar{C}, dA\,B \rangle_F = \operatorname{tr}(\bar{C}^T dA\,B) = \operatorname{tr}((\bar{C} B^T)^T dA). $$

So

$$ \bar{A} \mathrel{+}= \bar{C} B^T. $$

For the $B$ term:

$$ \langle \bar{C}, A\,dB \rangle_F = \operatorname{tr}(\bar{C}^T A\,dB) = \operatorname{tr}((A^T \bar{C})^T dB). $$

So

$$ \bar{B} \mathrel{+}= A^T \bar{C}. $$

Therefore the reverse-mode rule for matrix multiplication is

$$ C = AB, \qquad \bar{A} \mathrel{+}= \bar{C} B^T, \qquad \bar{B} \mathrel{+}= A^T \bar{C}. $$

This is the core gradient rule behind dense layers in neural networks.
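One way to test the rule numerically is to differentiate the scalar loss $L = \langle \bar{C}, AB \rangle_F$, whose exact gradients are the formulas above. A NumPy sketch (the finite-difference loop is an illustrative harness):

```python
# Check Abar = Cbar B^T against finite differences of L = <Cbar, AB>_F.
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
Cbar = rng.standard_normal((3, 2))  # fixed output sensitivity

def loss(A, B):
    return np.sum(Cbar * (A @ B))  # <Cbar, AB>_F

Abar = Cbar @ B.T  # reverse rule for A
Bbar = A.T @ Cbar  # reverse rule for B

eps = 1e-6
fd_A = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A)
        E[i, j] = eps
        fd_A[i, j] = (loss(A + E, B) - loss(A - E, B)) / (2 * eps)

print(np.allclose(fd_A, Abar))  # True; the check for Bbar is analogous
```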

Example: Linear Layer

A linear layer usually has the form

$$ Y = XW + b. $$

Let

$$ X \in \mathbb{R}^{N \times d}, \qquad W \in \mathbb{R}^{d \times h}, \qquad b \in \mathbb{R}^{h}, \qquad Y \in \mathbb{R}^{N \times h}. $$

Here $N$ is the batch size, $d$ is the input dimension, and $h$ is the output dimension.

The differential is

$$ dY = dX\,W + X\,dW + db. $$

The reverse rules are

$$ \bar{X} = \bar{Y} W^T, \qquad \bar{W} = X^T \bar{Y}, \qquad \bar{b} = \sum_{i=1}^{N} \bar{Y}_{i,:}. $$

The bias gradient sums over the batch dimension because $b$ is broadcast across rows.

This example shows why shape semantics matter. The adjoint of broadcasting is reduction, and the adjoint of reduction is broadcasting. AD systems must encode these layout rules exactly.
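A minimal NumPy sketch of the layer's backward pass under the shapes above (the function names `linear_forward` and `linear_backward` are illustrative):

```python
# Forward and reverse rules for Y = XW + b, with broadcast bias.
import numpy as np

def linear_forward(X, W, b):
    return X @ W + b  # b broadcasts across the batch dimension

def linear_backward(X, W, Ybar):
    Xbar = Ybar @ W.T        # (N, h)(h, d) -> (N, d)
    Wbar = X.T @ Ybar        # (d, N)(N, h) -> (d, h)
    bbar = Ybar.sum(axis=0)  # reduction undoes the broadcast
    return Xbar, Wbar, bbar

rng = np.random.default_rng(5)
N, d, h = 8, 5, 3
X = rng.standard_normal((N, d))
W = rng.standard_normal((d, h))
b = rng.standard_normal(h)
Ybar = rng.standard_normal((N, h))

Xbar, Wbar, bbar = linear_backward(X, W, Ybar)
print(Xbar.shape, Wbar.shape, bbar.shape)  # (8, 5) (5, 3) (3,)
```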

Jacobians Are Usually Too Large

For a function

$$ f : \mathbb{R}^n \to \mathbb{R}^m, $$

the Jacobian has $mn$ entries. For modern models, $m$ and $n$ may both be enormous. Explicit Jacobians are usually impractical.

AD systems instead compute structured products.

Forward mode computes

$$ Jv. $$

Reverse mode computes

$$ u^T J. $$

In tensor notation, these are called:

| Operation | Name | Typical AD Mode |
| --- | --- | --- |
| $Jv$ | Jacobian-vector product | Forward mode |
| $u^T J$ | Vector-Jacobian product | Reverse mode |
| $Hv$ | Hessian-vector product | Mixed mode |
| $J^T u$ | Pullback of cotangent | Reverse mode |

The key point is that AD applies derivative operators without constructing them.
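As an illustration, for the hypothetical function $f(x) = \tanh(Ax)$ both products can be computed matrix-free. The NumPy sketch below never materializes the $m \times n$ Jacobian $\operatorname{diag}(1 - \tanh(Ax)^2)\,A$:

```python
# Matrix-free JVP and VJP for f(x) = tanh(A x).
import numpy as np

rng = np.random.default_rng(6)
m, n = 200, 100
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

def jvp(v):
    # Forward mode: J v = (1 - tanh(Ax)^2) * (A v)
    return (1 - np.tanh(A @ x) ** 2) * (A @ v)

def vjp(u):
    # Reverse mode: J^T u = A^T ((1 - tanh(Ax)^2) * u)
    return A.T @ ((1 - np.tanh(A @ x) ** 2) * u)

v = rng.standard_normal(n)
u = rng.standard_normal(m)
# Consistency: <u, Jv> must equal <J^T u, v>.
print(np.isclose(u @ jvp(v), vjp(u) @ v))  # True
```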

Shape Discipline

Matrix calculus is only correct when shapes are correct. A useful derivation should annotate every variable with its shape.

For

$$ C = AB, $$

we should write

$$ A : m \times k, \qquad B : k \times n, \qquad C : m \times n. $$

Then the reverse rules must match:

$$ \bar{C} : m \times n, $$
$$ \bar{A} = \bar{C} B^T : (m \times n)(n \times k) = m \times k, $$
$$ \bar{B} = A^T \bar{C} : (k \times m)(m \times n) = k \times n. $$

Shape checking catches many gradient bugs before numerical testing.
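The bookkeeping can be made executable. A NumPy sketch (the name `matmul_vjp` is illustrative) that asserts every shape in the derivation:

```python
# Shape-checked reverse rule for C = AB.
import numpy as np

def matmul_vjp(A, B, Cbar):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions of A and B must agree"
    assert Cbar.shape == (m, n), "output adjoint must have C's shape"
    Abar = Cbar @ B.T  # (m, n)(n, k) -> (m, k)
    Bbar = A.T @ Cbar  # (k, m)(m, n) -> (k, n)
    assert Abar.shape == A.shape and Bbar.shape == B.shape
    return Abar, Bbar

Abar, Bbar = matmul_vjp(np.ones((3, 4)), np.ones((4, 2)), np.ones((3, 2)))
print(Abar.shape, Bbar.shape)  # (3, 4) (4, 2)
```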

Trace Identities

Trace identities are the main algebraic tool for deriving matrix gradients.

The most useful identities are:

$$ \operatorname{tr}(A) = \operatorname{tr}(A^T), $$
$$ \operatorname{tr}(AB) = \operatorname{tr}(BA), $$
$$ \operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB), $$
$$ \operatorname{tr}(A^T B) = \langle A, B \rangle_F, $$

when the products are defined.

The cyclic property of trace is especially important. It allows us to move $dX$ into the position required to read off the gradient.

For example, if

$$ f(X) = \operatorname{tr}(AXB), $$

then

$$ df = \operatorname{tr}(A\,dX\,B). $$

Using the cyclic property,

$$ df = \operatorname{tr}(BA\,dX). $$

To match the Frobenius form

$$ df = \operatorname{tr}(G^T dX), $$

we need

$$ G^T = BA. $$

Therefore

$$ \nabla_X f = A^T B^T. $$
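Since $f$ is linear in $X$, finite differences recover this gradient exactly up to roundoff. A NumPy sketch (the differencing loop is an illustrative harness):

```python
# Check that grad_X tr(A X B) equals A^T B^T.
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

def f(X):
    return np.trace(A @ X @ B)

grad = A.T @ B.T  # (3, 2)(2, 4) -> (3, 4), matching X

eps = 1e-6
fd = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(fd, grad))  # True
```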

Gradients Depend on Inner Product Convention

A gradient is not just a derivative. A derivative is a linear map. A gradient is the representation of that linear map under a chosen inner product.

For vectors, we usually use

$$ \langle x, y \rangle = x^T y. $$

For matrices, we usually use

$$ \langle A, B \rangle_F = \operatorname{tr}(A^T B). $$

Under this convention,

$$ df = \langle \nabla_X f, dX \rangle_F. $$

If a different inner product is used, the gradient representation changes. This matters in geometry, optimization on manifolds, natural gradients, and constrained matrix problems.

AD systems usually assume Euclidean inner products for arrays. More specialized systems may expose custom vector-space structures and custom adjoints.

Matrix Calculus in AD Systems

An AD engine needs a derivative rule for each primitive. For matrix primitives, the rule usually has two forms:

Forward rule:

$$ dY = J_f(X)[dX]. $$

Reverse rule:

$$ \bar{X} = J_f(X)^T[\bar{Y}]. $$

For example:

| Primitive | Forward Differential | Reverse Rule |
| --- | --- | --- |
| $Y = X^T$ | $dY = dX^T$ | $\bar{X} \mathrel{+}= \bar{Y}^T$ |
| $Y = AX$ | $dY = A\,dX$ | $\bar{X} \mathrel{+}= A^T \bar{Y}$ |
| $Y = XB$ | $dY = dX\,B$ | $\bar{X} \mathrel{+}= \bar{Y} B^T$ |
| $Y = AB$ | $dY = dA\,B + A\,dB$ | $\bar{A} \mathrel{+}= \bar{Y} B^T,\ \bar{B} \mathrel{+}= A^T \bar{Y}$ |
| $y = \operatorname{sum}(X)$ | $dy = \operatorname{sum}(dX)$ | $\bar{X} \mathrel{+}= \bar{y}\,\mathbf{1}$ |
| $Y = X + b$ with broadcast | $dY = dX + db$ | $\bar{b} \mathrel{+}= \operatorname{reduce}(\bar{Y})$ |

A production AD system also tracks layout, strides, broadcasting, dtype, device placement, aliasing, and mutation. The mathematics gives the rule. The runtime makes the rule correct for real arrays.
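To show how such a rule set might be organized, here is a toy registry in NumPy mirroring the table above. Every name is illustrative; production systems attach far more metadata to each rule:

```python
# A toy mapping from primitive names to reverse rules.
import numpy as np

def vjp_transpose(Ybar, X):
    return (Ybar.T,)                  # Y = X^T

def vjp_matmul(Ybar, A, B):
    return (Ybar @ B.T, A.T @ Ybar)   # Y = AB

def vjp_sum(ybar, X):
    return (ybar * np.ones_like(X),)  # y = sum(X): broadcast the scalar adjoint

VJP_RULES = {
    "transpose": vjp_transpose,
    "matmul": vjp_matmul,
    "sum": vjp_sum,
}

Abar, Bbar = VJP_RULES["matmul"](np.ones((2, 2)), np.eye(2), np.eye(2))
print(Abar, Bbar, sep="\n")
```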

Summary

Matrix calculus gives AD systems their local tensor rules. The derivative is best understood as a linear map. Forward mode applies that linear map to perturbations. Reverse mode applies its adjoint to output sensitivities.

The practical workflow is:

  1. Write the primitive operation.
  2. Write its differential.
  3. Move perturbations into Frobenius inner-product form.
  4. Read off the adjoint rules.
  5. Check every shape.

This method scales from simple matrix multiplication to large tensor programs.