Chapter 10. Matrix and Tensor Differentiation

Matrix Calculus

Matrix calculus is the notation and rule system used to differentiate functions whose inputs, outputs, or intermediate values are vectors, matrices, or tensors. Automatic differentiation systems do not require matrix calculus internally: they only need local derivative rules for primitive operations. However, matrix calculus is the language used to specify those rules, check them, and understand their shape.

A scalar function has the familiar form

$$ f : \mathbb{R} \to \mathbb{R}. $$

A vector function may have the form

$$ f : \mathbb{R}^n \to \mathbb{R}^m. $$

A matrix function may have the form

$$ F : \mathbb{R}^{m \times n} \to \mathbb{R}^{p \times q}. $$

The derivative of such a function is still a linear map. The difficulty is mostly notation. A derivative of a matrix-valued function with respect to a matrix input is naturally a fourth-order object. In practical AD systems, that object is almost never materialized. Instead, AD computes products with it.

The central idea is:

$$ \text{Derivative} = \text{best local linear approximation}. $$

For a function $f(x)$ near a point $x$, the first-order approximation is

$$ f(x + \Delta x) \approx f(x) + Df(x)[\Delta x]. $$

Here $Df(x)$ is the derivative, and $Df(x)[\Delta x]$ is the derivative applied to a perturbation $\Delta x$. This notation is often cleaner than writing giant Jacobians.
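As a quick numerical sanity check, the sketch below (NumPy; the function `f` and its hand-derived derivative `Df` are illustrative choices, not from the text) shows that the residual of the first-order approximation is second-order small in the perturbation:

```python
# Minimal check that f(x + dx) - f(x) ~ Df(x)[dx] for small dx.
import numpy as np

def f(x):
    # Illustrative scalar function of two variables.
    return np.sin(x[0]) + x[0] * x[1]

def Df(x, dx):
    # Hand-derived derivative applied to a perturbation dx:
    # df = (cos(x0) + x1) dx0 + x0 dx1.
    return (np.cos(x[0]) + x[1]) * dx[0] + x[0] * dx[1]

x = np.array([0.7, -1.2])
dx = 1e-6 * np.array([0.3, 0.5])

residual = f(x + dx) - f(x) - Df(x, dx)
print(abs(residual))  # O(|dx|^2): far smaller than |dx| itself
```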

Scalars, Vectors, and Matrices

Let

$$ x \in \mathbb{R}^n. $$

A scalar-valued function has type

$$ f : \mathbb{R}^n \to \mathbb{R}. $$

Its gradient is

$$ \nabla f(x) \in \mathbb{R}^n. $$

The first-order expansion is

$$ f(x + \Delta x) \approx f(x) + \nabla f(x)^T \Delta x. $$

A vector-valued function has type

$$ f : \mathbb{R}^n \to \mathbb{R}^m. $$

Its Jacobian is

$$ J_f(x) \in \mathbb{R}^{m \times n}. $$

The first-order expansion is

$$ f(x + \Delta x) \approx f(x) + J_f(x)\,\Delta x. $$

A matrix-valued input can be flattened into a vector, but doing so often hides structure. If

$$ X \in \mathbb{R}^{m \times n}, $$

then a scalar-valued function

$$ f : \mathbb{R}^{m \times n} \to \mathbb{R} $$

has gradient

$$ \nabla_X f(X) \in \mathbb{R}^{m \times n}. $$

The first-order expansion is written using the Frobenius inner product:

$$ f(X + \Delta X) \approx f(X) + \langle \nabla_X f(X), \Delta X \rangle_F. $$

The Frobenius inner product is

$$ \langle A, B \rangle_F = \operatorname{tr}(A^T B) = \sum_{i,j} A_{ij} B_{ij}. $$

This is the matrix analogue of the vector dot product.
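A two-line NumPy check (shapes chosen arbitrarily) confirms that the trace form and the elementwise sum agree:

```python
# The Frobenius inner product: trace(A^T B) equals sum(A * B).
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

via_trace = np.trace(A.T @ B)   # tr(A^T B)
via_sum = np.sum(A * B)         # elementwise product, summed
print(np.isclose(via_trace, via_sum))  # True
```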

Differential Notation

Differential notation is often the cleanest notation for deriving matrix gradients.

For a scalar function $f(X)$, we write

$$ df = \langle \nabla_X f, dX \rangle_F. $$

The goal is to rewrite $df$ so that every occurrence of $dX$ appears in the final position of an inner product. Then the coefficient of $dX$ is the gradient.

For example, let

$$ f(X) = \operatorname{tr}(A^T X). $$

Then

$$ df = \operatorname{tr}(A^T dX). $$

So

$$ \nabla_X f = A. $$

This follows because

$$ df = \langle A, dX \rangle_F. $$

Now consider

$$ f(X) = \operatorname{tr}(X^T X). $$

Then

$$ df = \operatorname{tr}(dX^T X + X^T dX). $$

Using trace symmetry,

$$ \operatorname{tr}(dX^T X) = \operatorname{tr}(X^T dX). $$

Therefore

$$ df = 2\operatorname{tr}(X^T dX), $$

and

$$ \nabla_X f = 2X. $$

This method is mechanical. It is also close to how reverse-mode AD works: move local perturbations backward until the coefficient of each input perturbation is exposed.
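The derived gradient is easy to validate numerically. The NumPy sketch below (the finite-difference loop is an illustrative test harness, not part of the text) compares $2X$ against central differences of $\operatorname{tr}(X^T X)$:

```python
# Check that grad of tr(X^T X) is 2X via central finite differences.
import numpy as np

def f(X):
    return np.trace(X.T @ X)

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 2))
grad = 2 * X  # result of the differential derivation above

eps = 1e-6
fd = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(fd, grad))  # True
```

Because $f$ is quadratic, the central difference is exact up to roundoff.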

Common Matrix Derivatives

| Function | Shape | Gradient |
| --- | --- | --- |
| $f(x) = a^T x$ | $x, a \in \mathbb{R}^n$ | $\nabla_x f = a$ |
| $f(x) = x^T A x$ | $A \in \mathbb{R}^{n \times n}$ | $\nabla_x f = (A + A^T)x$ |
| $f(x) = \lVert x \rVert_2^2$ | $x \in \mathbb{R}^n$ | $\nabla_x f = 2x$ |
| $f(X) = \operatorname{tr}(A^T X)$ | $A, X \in \mathbb{R}^{m \times n}$ | $\nabla_X f = A$ |
| $f(X) = \operatorname{tr}(X^T X)$ | $X \in \mathbb{R}^{m \times n}$ | $\nabla_X f = 2X$ |
| $f(X) = \lVert X \rVert_F^2$ | $X \in \mathbb{R}^{m \times n}$ | $\nabla_X f = 2X$ |
| $f(X) = \log \det X$ | $X \in \mathbb{R}^{n \times n}$ nonsingular | $\nabla_X f = X^{-T}$ |
| $f(X) = \operatorname{tr}(X^{-1} A)$ | $X$ nonsingular | $\nabla_X f = -X^{-T} A^T X^{-T}$ |

The table assumes real-valued matrices and the Frobenius inner product convention.
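Individual entries can be spot-checked numerically. A hedged NumPy sketch for the $\log \det$ row (the diagonal shift is only there to keep the random $X$ well-conditioned):

```python
# Spot-check: grad of log det X is X^{-T} for nonsingular X.
import numpy as np

def f(X):
    return np.linalg.slogdet(X)[1]  # log |det X|, numerically stable

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)  # well-conditioned
grad = np.linalg.inv(X).T  # table entry X^{-T}

eps = 1e-6
fd = np.zeros_like(X)
for i in range(4):
    for j in range(4):
        E = np.zeros_like(X)
        E[i, j] = eps
        fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(fd, grad, atol=1e-6))  # True
```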

Linear Maps and Adjoints

Automatic differentiation works locally with linear maps.

Suppose a primitive operation has the form

$$ Y = AX. $$

The forward perturbation rule is

$$ dY = A\,dX. $$

This rule computes a Jacobian-vector product. It sends an input perturbation $dX$ forward to an output perturbation $dY$.

Reverse mode needs the adjoint rule. Let $\bar{Y}$ be the adjoint of $Y$, meaning the gradient accumulated at $Y$. We want $\bar{X}$, the gradient contribution to $X$.

The adjoint rule is determined by preserving inner products:

$$ \langle \bar{Y}, dY \rangle_F = \langle \bar{X}, dX \rangle_F. $$

Since

$$ dY = A\,dX, $$

we have

$$ \langle \bar{Y}, A\,dX \rangle_F = \operatorname{tr}(\bar{Y}^T A\,dX). $$

Rearrange the trace:

$$ \operatorname{tr}(\bar{Y}^T A\,dX) = \operatorname{tr}((A^T \bar{Y})^T dX). $$

Therefore

$$ \bar{X} = A^T \bar{Y}. $$

So the reverse rule for

$$ Y = AX $$

is

$$ \bar{X} \mathrel{+}= A^T \bar{Y}. $$

If $A$ is also an input, then

$$ dY = dA\,X + A\,dX. $$

The adjoint contribution to $A$ is

$$ \bar{A} \mathrel{+}= \bar{Y} X^T. $$

So the complete reverse rule is:

$$ \bar{A} \mathrel{+}= \bar{Y} X^T, \qquad \bar{X} \mathrel{+}= A^T \bar{Y}. $$

This pattern is fundamental. Most tensor AD rules are obtained by writing the forward differential and then taking the adjoint.
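The inner-product identity at the heart of the derivation can be checked directly. A NumPy sketch with arbitrary shapes:

```python
# Verify <Ybar, A dX>_F == <A^T Ybar, dX>_F, the adjoint identity.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
dX = rng.standard_normal((3, 5))
Ybar = rng.standard_normal((4, 5))

lhs = np.sum(Ybar * (A @ dX))    # <Ybar, dY>_F with dY = A dX
rhs = np.sum((A.T @ Ybar) * dX)  # <A^T Ybar, dX>_F
print(np.isclose(lhs, rhs))  # True
```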

Matrix Multiplication

Let

$$ C = AB, $$

where

$$ A \in \mathbb{R}^{m \times k}, \qquad B \in \mathbb{R}^{k \times n}, \qquad C \in \mathbb{R}^{m \times n}. $$

The forward differential is

$$ dC = dA\,B + A\,dB. $$

Given an output adjoint

$$ \bar{C} \in \mathbb{R}^{m \times n}, $$

we derive the reverse rules from

$$ \langle \bar{C}, dC \rangle_F = \langle \bar{C}, dA\,B + A\,dB \rangle_F. $$

For the $A$ term:

$$ \langle \bar{C}, dA\,B \rangle_F = \operatorname{tr}(\bar{C}^T dA\,B) = \operatorname{tr}((\bar{C} B^T)^T dA). $$

So

$$ \bar{A} \mathrel{+}= \bar{C} B^T. $$

For the $B$ term:

$$ \langle \bar{C}, A\,dB \rangle_F = \operatorname{tr}(\bar{C}^T A\,dB) = \operatorname{tr}((A^T \bar{C})^T dB). $$

So

$$ \bar{B} \mathrel{+}= A^T \bar{C}. $$

Therefore the reverse-mode rule for matrix multiplication is

$$ C = AB, \qquad \bar{A} \mathrel{+}= \bar{C} B^T, \qquad \bar{B} \mathrel{+}= A^T \bar{C}. $$

This is the core gradient rule behind dense layers in neural networks.
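One way to test the rule numerically is to differentiate the scalar loss $L = \langle \bar{C}, AB \rangle_F$, whose exact gradients are the formulas above. A NumPy sketch (the finite-difference loop is an illustrative harness):

```python
# Check Abar = Cbar B^T against finite differences of L = <Cbar, AB>_F.
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))
Cbar = rng.standard_normal((3, 2))  # fixed output sensitivity

def loss(A, B):
    return np.sum(Cbar * (A @ B))  # <Cbar, AB>_F

Abar = Cbar @ B.T  # reverse rule for A
Bbar = A.T @ Cbar  # reverse rule for B

eps = 1e-6
fd_A = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A)
        E[i, j] = eps
        fd_A[i, j] = (loss(A + E, B) - loss(A - E, B)) / (2 * eps)

print(np.allclose(fd_A, Abar))  # True; the check for Bbar is analogous
```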

Example: Linear Layer

A linear layer usually has the form

$$ Y = XW + b. $$

Let

$$ X \in \mathbb{R}^{N \times d}, \qquad W \in \mathbb{R}^{d \times h}, \qquad b \in \mathbb{R}^{h}, \qquad Y \in \mathbb{R}^{N \times h}. $$

Here $N$ is the batch size, $d$ is the input dimension, and $h$ is the output dimension.

The differential is

$$ dY = dX\,W + X\,dW + db. $$

The reverse rules are

$$ \bar{X} = \bar{Y} W^T, \qquad \bar{W} = X^T \bar{Y}, \qquad \bar{b} = \sum_{i=1}^{N} \bar{Y}_{i,:}. $$

The bias gradient sums over the batch dimension because $b$ is broadcast across rows.

This example shows why shape semantics matter. The adjoint of broadcasting is reduction, and the adjoint of reduction is broadcasting. AD systems must encode these layout rules exactly.
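A minimal NumPy sketch of the layer's backward pass under the shapes above (the function names `linear_forward` and `linear_backward` are illustrative):

```python
# Forward and reverse rules for Y = XW + b, with broadcast bias.
import numpy as np

def linear_forward(X, W, b):
    return X @ W + b  # b broadcasts across the batch dimension

def linear_backward(X, W, Ybar):
    Xbar = Ybar @ W.T        # (N, h)(h, d) -> (N, d)
    Wbar = X.T @ Ybar        # (d, N)(N, h) -> (d, h)
    bbar = Ybar.sum(axis=0)  # reduction undoes the broadcast
    return Xbar, Wbar, bbar

rng = np.random.default_rng(5)
N, d, h = 8, 5, 3
X = rng.standard_normal((N, d))
W = rng.standard_normal((d, h))
b = rng.standard_normal(h)
Ybar = rng.standard_normal((N, h))

Xbar, Wbar, bbar = linear_backward(X, W, Ybar)
print(Xbar.shape, Wbar.shape, bbar.shape)  # (8, 5) (5, 3) (3,)
```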

Jacobians Are Usually Too Large

For a function

$$ f : \mathbb{R}^n \to \mathbb{R}^m, $$

the Jacobian has $mn$ entries. For modern models, $m$ and $n$ may both be enormous. Explicit Jacobians are usually impractical.

AD systems instead compute structured products.

Forward mode computes

$$ Jv. $$

Reverse mode computes

$$ u^T J. $$

In tensor notation, these are called:

| Operation | Name | Typical AD Mode |
| --- | --- | --- |
| $Jv$ | Jacobian-vector product | Forward mode |
| $u^T J$ | Vector-Jacobian product | Reverse mode |
| $Hv$ | Hessian-vector product | Mixed mode |
| $J^T u$ | Pullback of cotangent | Reverse mode |

The key point is that AD applies derivative operators without constructing them.
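As an illustration, for the hypothetical function $f(x) = \tanh(Ax)$ both products can be computed matrix-free. The NumPy sketch below never materializes the $m \times n$ Jacobian $\operatorname{diag}(1 - \tanh(Ax)^2)\,A$:

```python
# Matrix-free JVP and VJP for f(x) = tanh(A x).
import numpy as np

rng = np.random.default_rng(6)
m, n = 200, 100
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

def jvp(v):
    # Forward mode: J v = (1 - tanh(Ax)^2) * (A v)
    return (1 - np.tanh(A @ x) ** 2) * (A @ v)

def vjp(u):
    # Reverse mode: J^T u = A^T ((1 - tanh(Ax)^2) * u)
    return A.T @ ((1 - np.tanh(A @ x) ** 2) * u)

v = rng.standard_normal(n)
u = rng.standard_normal(m)
# Consistency: <u, Jv> must equal <J^T u, v>.
print(np.isclose(u @ jvp(v), vjp(u) @ v))  # True
```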

Shape Discipline

Matrix calculus is only correct when shapes are correct. A useful derivation should annotate every variable with its shape.

For

$$ C = AB, $$

we should write

$$ A : m \times k, \qquad B : k \times n, \qquad C : m \times n. $$

Then the reverse rules must match:

$$ \bar{C} : m \times n, $$
$$ \bar{A} = \bar{C} B^T : (m \times n)(n \times k) = m \times k, $$
$$ \bar{B} = A^T \bar{C} : (k \times m)(m \times n) = k \times n. $$

Shape checking catches many gradient bugs before numerical testing.
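The bookkeeping can be made executable. A NumPy sketch (the name `matmul_vjp` is illustrative) that asserts every shape in the derivation:

```python
# Shape-checked reverse rule for C = AB.
import numpy as np

def matmul_vjp(A, B, Cbar):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions of A and B must agree"
    assert Cbar.shape == (m, n), "output adjoint must have C's shape"
    Abar = Cbar @ B.T  # (m, n)(n, k) -> (m, k)
    Bbar = A.T @ Cbar  # (k, m)(m, n) -> (k, n)
    assert Abar.shape == A.shape and Bbar.shape == B.shape
    return Abar, Bbar

Abar, Bbar = matmul_vjp(np.ones((3, 4)), np.ones((4, 2)), np.ones((3, 2)))
print(Abar.shape, Bbar.shape)  # (3, 4) (4, 2)
```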

Trace Identities

Trace identities are the main algebraic tool for deriving matrix gradients.

The most useful identities are:

$$ \operatorname{tr}(A) = \operatorname{tr}(A^T), $$
$$ \operatorname{tr}(AB) = \operatorname{tr}(BA), $$
$$ \operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB), $$
$$ \operatorname{tr}(A^T B) = \langle A, B \rangle_F, $$

when the products are defined.

The cyclic property of trace is especially important. It allows us to move $dX$ into the position required to read off the gradient.

For example, if

$$ f(X) = \operatorname{tr}(AXB), $$

then

$$ df = \operatorname{tr}(A\,dX\,B). $$

Using the cyclic property,

$$ df = \operatorname{tr}(BA\,dX). $$

To match the Frobenius form

$$ df = \operatorname{tr}(G^T dX), $$

we need

$$ G^T = BA. $$

Therefore

$$ \nabla_X f = A^T B^T. $$
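Since $f$ is linear in $X$, finite differences recover this gradient exactly up to roundoff. A NumPy sketch (the differencing loop is an illustrative harness):

```python
# Check that grad_X tr(A X B) equals A^T B^T.
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 2))

def f(X):
    return np.trace(A @ X @ B)

grad = A.T @ B.T  # (3, 2)(2, 4) -> (3, 4), matching X

eps = 1e-6
fd = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        fd[i, j] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(fd, grad))  # True
```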

Gradients Depend on Inner Product Convention

A gradient is not just a derivative. A derivative is a linear map. A gradient is the representation of that linear map under a chosen inner product.

For vectors, we usually use

$$ \langle x, y \rangle = x^T y. $$

For matrices, we usually use

$$ \langle A, B \rangle_F = \operatorname{tr}(A^T B). $$

Under this convention,

$$ df = \langle \nabla_X f, dX \rangle_F. $$

If a different inner product is used, the gradient representation changes. This matters in geometry, optimization on manifolds, natural gradients, and constrained matrix problems.

AD systems usually assume Euclidean inner products for arrays. More specialized systems may expose custom vector-space structures and custom adjoints.

Matrix Calculus in AD Systems

An AD engine needs a derivative rule for each primitive. For matrix primitives, the rule usually has two forms:

Forward rule:

$$ dY = J_f(X)[dX]. $$

Reverse rule:

$$ \bar{X} = J_f(X)^T[\bar{Y}]. $$

For example:

| Primitive | Forward Differential | Reverse Rule |
| --- | --- | --- |
| $Y = X^T$ | $dY = dX^T$ | $\bar{X} \mathrel{+}= \bar{Y}^T$ |
| $Y = AX$ | $dY = A\,dX$ | $\bar{X} \mathrel{+}= A^T \bar{Y}$ |
| $Y = XB$ | $dY = dX\,B$ | $\bar{X} \mathrel{+}= \bar{Y} B^T$ |
| $Y = AB$ | $dY = dA\,B + A\,dB$ | $\bar{A} \mathrel{+}= \bar{Y} B^T,\ \bar{B} \mathrel{+}= A^T \bar{Y}$ |
| $y = \operatorname{sum}(X)$ | $dy = \operatorname{sum}(dX)$ | $\bar{X} \mathrel{+}= \bar{y}\,\mathbf{1}$ |
| $Y = X + b$ with broadcast | $dY = dX + db$ | $\bar{b} \mathrel{+}= \operatorname{reduce}(\bar{Y})$ |

A production AD system also tracks layout, strides, broadcasting, dtype, device placement, aliasing, and mutation. The mathematics gives the rule. The runtime makes the rule correct for real arrays.
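To show how such a rule set might be organized, here is a toy registry in NumPy mirroring the table above. Every name is illustrative; production systems attach far more metadata to each rule:

```python
# A toy mapping from primitive names to reverse rules.
import numpy as np

def vjp_transpose(Ybar, X):
    return (Ybar.T,)                  # Y = X^T

def vjp_matmul(Ybar, A, B):
    return (Ybar @ B.T, A.T @ Ybar)   # Y = AB

def vjp_sum(ybar, X):
    return (ybar * np.ones_like(X),)  # y = sum(X): broadcast the scalar adjoint

VJP_RULES = {
    "transpose": vjp_transpose,
    "matmul": vjp_matmul,
    "sum": vjp_sum,
}

Abar, Bbar = VJP_RULES["matmul"](np.ones((2, 2)), np.eye(2), np.eye(2))
print(Abar, Bbar, sep="\n")
```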

Summary

Matrix calculus gives AD systems their local tensor rules. The derivative is best understood as a linear map. Forward mode applies that linear map to perturbations. Reverse mode applies its adjoint to output sensitivities.

The practical workflow is:

  1. Write the primitive operation.
  2. Write its differential.
  3. Move perturbations into Frobenius inner-product form.
  4. Read off the adjoint rules.
  5. Check every shape.

This method scales from simple matrix multiplication to large tensor programs.