Matrix Calculus
Matrix calculus is the notation and rule system used to differentiate functions whose inputs, outputs, or intermediate values are vectors, matrices, or tensors. Automatic differentiation systems do not require matrix calculus internally: they only need local derivative rules for primitive operations. However, matrix calculus is the language used to specify those rules, check them, and understand their shape.
A scalar function has the familiar form
$$f:\mathbb{R}\to\mathbb{R}.$$
A vector function may have the form
$$f:\mathbb{R}^n\to\mathbb{R}^m.$$
A matrix function may have the form
$$F:\mathbb{R}^{m\times n}\to\mathbb{R}^{p\times q}.$$
The derivative of such a function is still a linear map. The difficulty is mostly notation. A derivative of a matrix-valued function with respect to a matrix input is naturally a fourth-order object. In practical AD systems, that object is almost never materialized. Instead, AD computes products with it.
The central idea is:
$$\text{Derivative} = \text{best local linear approximation}.$$
For a function $f(x)$ near a point $x$, the first-order approximation is
$$f(x+\Delta x) \approx f(x) + Df(x)[\Delta x].$$
Here $Df(x)$ is the derivative, and $Df(x)[\Delta x]$ is the derivative applied to a perturbation $\Delta x$. This notation is often cleaner than writing giant Jacobians.
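This approximation can be checked numerically. The sketch below, using NumPy, picks an arbitrary illustrative function `f`, applies its hand-derived derivative `df` to a small perturbation, and confirms that the residual is second order in the perturbation:

```python
import numpy as np

# Check f(x + dx) ≈ f(x) + Df(x)[dx] for an illustrative function f.
def f(x):
    return np.array([x[0] * x[1], np.sin(x[0]) + x[1] ** 2])

def df(x, dx):
    # Hand-derived derivative of f, applied to the perturbation dx.
    return np.array([x[1] * dx[0] + x[0] * dx[1],
                     np.cos(x[0]) * dx[0] + 2 * x[1] * dx[1]])

x = np.array([0.7, -1.3])
dx = 1e-6 * np.array([1.0, 2.0])

lhs = f(x + dx)
rhs = f(x) + df(x, dx)
# The error of a first-order approximation is O(||dx||^2).
assert np.max(np.abs(lhs - rhs)) < 1e-9
```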
Scalars, Vectors, and Matrices
Let $x \in \mathbb{R}^n$. A scalar-valued function has type
$$f:\mathbb{R}^n\to\mathbb{R}.$$
Its gradient is
$$\nabla f(x) \in \mathbb{R}^n.$$
The first-order expansion is
$$f(x+\Delta x) \approx f(x) + \nabla f(x)^T \Delta x.$$
A vector-valued function has type
$$f:\mathbb{R}^n\to\mathbb{R}^m.$$
Its Jacobian is
$$J_f(x) \in \mathbb{R}^{m\times n}.$$
The first-order expansion is
$$f(x+\Delta x) \approx f(x) + J_f(x)\,\Delta x.$$
A matrix-valued input can be flattened into a vector, but doing so often hides structure. If $X \in \mathbb{R}^{m\times n}$, then a scalar-valued function
$$f:\mathbb{R}^{m\times n}\to\mathbb{R}$$
has gradient
$$\nabla_X f(X) \in \mathbb{R}^{m\times n}.$$
The first-order expansion is written using the Frobenius inner product:
$$f(X+\Delta X) \approx f(X) + \langle \nabla_X f(X), \Delta X \rangle_F.$$
The Frobenius inner product is
$$\langle A, B \rangle_F = \operatorname{tr}(A^T B) = \sum_{i,j} A_{ij} B_{ij}.$$
This is the matrix analogue of the vector dot product.
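The equivalence of the trace form, the elementwise sum, and the flattened dot product is easy to confirm numerically; a minimal NumPy sketch:

```python
import numpy as np

# The Frobenius inner product <A, B>_F computed three equivalent ways.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

v1 = np.trace(A.T @ B)       # tr(A^T B)
v2 = np.sum(A * B)           # sum_ij A_ij B_ij
v3 = A.ravel() @ B.ravel()   # dot product of flattened matrices

assert np.allclose(v1, v2) and np.allclose(v2, v3)
```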
Differential Notation
Differential notation is often the cleanest notation for deriving matrix gradients.
For a scalar function $f(X)$, we write
$$df = \langle \nabla_X f, dX \rangle_F.$$
The goal is to rewrite $df$ so that every occurrence of $dX$ appears in the final position of an inner product. Then the coefficient of $dX$ is the gradient.
For example, let
$$f(X) = \operatorname{tr}(A^T X).$$
Then
$$df = \operatorname{tr}(A^T\,dX).$$
So
$$\nabla_X f = A.$$
This follows because
$$df = \langle A, dX \rangle_F.$$
Now consider
$$f(X) = \operatorname{tr}(X^T X).$$
Then
$$df = \operatorname{tr}(dX^T X + X^T\,dX).$$
Using trace symmetry,
$$\operatorname{tr}(dX^T X) = \operatorname{tr}(X^T\,dX).$$
Therefore
$$df = 2\operatorname{tr}(X^T\,dX),$$
and
$$\nabla_X f = 2X.$$
This method is mechanical. It is also close to how reverse-mode AD works: move local perturbations backward until the coefficient of each input perturbation is exposed.
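A finite-difference check of the second derivation, as a NumPy sketch: for $f(X) = \operatorname{tr}(X^T X)$, the change $df$ should match $\langle 2X, dX \rangle_F$ to first order.

```python
import numpy as np

# Finite-difference check that grad of f(X) = tr(X^T X) is 2X.
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 3))
dX = 1e-7 * rng.standard_normal((4, 3))

def f(X):
    return np.trace(X.T @ X)

grad = 2 * X
df_numeric = f(X + dX) - f(X)
df_analytic = np.sum(grad * dX)   # <grad, dX>_F
assert abs(df_numeric - df_analytic) < 1e-10
```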
Common Matrix Derivatives
| Function | Shape | Gradient |
|---|---|---|
| $f(x) = a^T x$ | $x, a \in \mathbb{R}^n$ | $\nabla_x f = a$ |
| $f(x) = x^T A x$ | $A \in \mathbb{R}^{n\times n}$ | $\nabla_x f = (A + A^T)x$ |
| $f(x) = \|x\|_2^2$ | $x \in \mathbb{R}^n$ | $\nabla_x f = 2x$ |
| $f(X) = \operatorname{tr}(A^T X)$ | $A, X \in \mathbb{R}^{m\times n}$ | $\nabla_X f = A$ |
| $f(X) = \operatorname{tr}(X^T X)$ | $X \in \mathbb{R}^{m\times n}$ | $\nabla_X f = 2X$ |
| $f(X) = \|X\|_F^2$ | $X \in \mathbb{R}^{m\times n}$ | $\nabla_X f = 2X$ |
| $f(X) = \log\det X$ | $X \in \mathbb{R}^{n\times n}$ nonsingular | $\nabla_X f = X^{-T}$ |
| $f(X) = \operatorname{tr}(X^{-1} A)$ | $X$ nonsingular | $\nabla_X f = -X^{-T} A^T X^{-T}$ |
The table assumes real-valued matrices and the Frobenius inner product convention.
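Table entries can be spot-checked the same way. A sketch for the $\log\det$ row, using a random matrix shifted to keep it well conditioned:

```python
import numpy as np

# Spot-check: for f(X) = log det X, the gradient is X^{-T}.
rng = np.random.default_rng(2)
X = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # keep X nonsingular

def f(X):
    sign, logabsdet = np.linalg.slogdet(X)
    return logabsdet

grad = np.linalg.inv(X).T
dX = 1e-6 * rng.standard_normal((4, 4))
df_numeric = f(X + dX) - f(X)
df_analytic = np.sum(grad * dX)
assert abs(df_numeric - df_analytic) < 1e-9
```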
Linear Maps and Adjoints
Automatic differentiation works locally with linear maps.
Suppose a primitive operation has the form
$$Y = AX.$$
The forward perturbation rule is
$$dY = A\,dX.$$
This rule computes a Jacobian-vector product. It sends an input perturbation $dX$ forward to an output perturbation $dY$.
Reverse mode needs the adjoint rule. Let $\bar{Y}$ be the adjoint of $Y$, meaning the gradient accumulated at $Y$. We want $\bar{X}$, the gradient contribution to $X$.
The adjoint rule is determined by preserving inner products:
$$\langle \bar{Y}, dY \rangle_F = \langle \bar{X}, dX \rangle_F.$$
Since
$$dY = A\,dX,$$
we have
$$\langle \bar{Y}, A\,dX \rangle_F = \operatorname{tr}(\bar{Y}^T A\,dX).$$
Rearrange the trace:
$$\operatorname{tr}(\bar{Y}^T A\,dX) = \operatorname{tr}((A^T \bar{Y})^T dX).$$
Therefore
$$\bar{X} = A^T \bar{Y}.$$
So the reverse rule for $Y = AX$ is
$$\bar{X} \mathrel{+}= A^T \bar{Y}.$$
If $A$ is also an input, then
$$dY = dA\,X + A\,dX.$$
The adjoint contribution to $A$ is
$$\bar{A} \mathrel{+}= \bar{Y} X^T.$$
So the complete reverse rule is:
$$\bar{A} \mathrel{+}= \bar{Y} X^T, \qquad \bar{X} \mathrel{+}= A^T \bar{Y}.$$
This pattern is fundamental. Most tensor AD rules are obtained by writing the forward differential and then taking the adjoint.
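The inner-product-preservation property that defines the adjoint can be verified directly; a NumPy sketch for $Y = AX$:

```python
import numpy as np

# Check <Ybar, dY>_F == <Xbar, dX>_F when dY = A dX and Xbar = A^T Ybar.
rng = np.random.default_rng(3)
A = rng.standard_normal((5, 4))
dX = rng.standard_normal((4, 3))
Ybar = rng.standard_normal((5, 3))

dY = A @ dX              # forward (JVP) rule
Xbar = A.T @ Ybar        # reverse (adjoint) rule

lhs = np.sum(Ybar * dY)  # <Ybar, dY>_F
rhs = np.sum(Xbar * dX)  # <Xbar, dX>_F
assert np.allclose(lhs, rhs)
```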
Matrix Multiplication
Let
$$C = AB,$$
where
$$A \in \mathbb{R}^{m\times k},\quad B \in \mathbb{R}^{k\times n},\quad C \in \mathbb{R}^{m\times n}.$$
The forward differential is
$$dC = dA\,B + A\,dB.$$
Given an output adjoint $\bar{C} \in \mathbb{R}^{m\times n}$, we derive the reverse rules from
$$\langle \bar{C}, dC \rangle_F = \langle \bar{C}, dA\,B + A\,dB \rangle_F.$$
For the $A$ term:
$$\langle \bar{C}, dA\,B \rangle_F = \operatorname{tr}(\bar{C}^T dA\,B) = \operatorname{tr}((\bar{C} B^T)^T dA).$$
So
$$\bar{A} \mathrel{+}= \bar{C} B^T.$$
For the $B$ term:
$$\langle \bar{C}, A\,dB \rangle_F = \operatorname{tr}(\bar{C}^T A\,dB) = \operatorname{tr}((A^T \bar{C})^T dB).$$
So
$$\bar{B} \mathrel{+}= A^T \bar{C}.$$
Therefore the reverse-mode rule for matrix multiplication is
$$C = AB, \qquad \bar{A} \mathrel{+}= \bar{C} B^T, \qquad \bar{B} \mathrel{+}= A^T \bar{C}.$$
This is the core gradient rule behind dense layers in neural networks.
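These rules can be tested against finite differences. The sketch below checks one entry of $\bar{A}$ for the scalar loss $L = \langle \bar{C}, AB \rangle_F$, whose exact gradients are the reverse rules above:

```python
import numpy as np

# Verify Abar = Cbar B^T on the loss L = <Cbar, AB>_F.
rng = np.random.default_rng(4)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 2))
Cbar = rng.standard_normal((3, 2))

Abar = Cbar @ B.T
Bbar = A.T @ Cbar

# Finite-difference check of one entry of Abar.
eps = 1e-6
E = np.zeros_like(A); E[1, 2] = eps
L0 = np.sum(Cbar * (A @ B))
L1 = np.sum(Cbar * ((A + E) @ B))
assert abs((L1 - L0) / eps - Abar[1, 2]) < 1e-4
```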
Example: Linear Layer
A linear layer usually has the form
$$Y = XW + b.$$
Let
$$X \in \mathbb{R}^{N\times d},\quad W \in \mathbb{R}^{d\times h},\quad b \in \mathbb{R}^h,\quad Y \in \mathbb{R}^{N\times h}.$$
Here $N$ is the batch size, $d$ is the input dimension, and $h$ is the output dimension.
The differential is
$$dY = dX\,W + X\,dW + db.$$
The reverse rules are
$$\bar{X} = \bar{Y} W^T, \qquad \bar{W} = X^T \bar{Y}, \qquad \bar{b} = \sum_{i=1}^{N} \bar{Y}_{i,:}.$$
The bias gradient sums over the batch dimension because $b$ is broadcast across rows.
This example shows why shape semantics matter. The derivative of broadcasting is reduction. The derivative of reduction is broadcasting. AD systems must encode these layout rules exactly.
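A complete forward and backward pass for this layer, as a NumPy sketch, with a finite-difference check of the bias gradient:

```python
import numpy as np

# Linear layer Y = XW + b: forward pass and reverse rules.
rng = np.random.default_rng(5)
N, d, h = 8, 4, 3
X = rng.standard_normal((N, d))
W = rng.standard_normal((d, h))
b = rng.standard_normal(h)

Y = X @ W + b                        # b broadcasts across the batch dimension

Ybar = rng.standard_normal((N, h))   # incoming adjoint
Xbar = Ybar @ W.T
Wbar = X.T @ Ybar
bbar = Ybar.sum(axis=0)              # reduction is the adjoint of broadcasting

assert Xbar.shape == X.shape and Wbar.shape == W.shape and bbar.shape == b.shape

# Finite-difference check of one bias-gradient entry.
eps = 1e-6
L = lambda b_: np.sum(Ybar * (X @ W + b_))
e0 = np.zeros(h); e0[0] = eps
assert abs((L(b + e0) - L(b)) / eps - bbar[0]) < 1e-4
```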
Jacobians Are Usually Too Large
For a function
$$f:\mathbb{R}^n\to\mathbb{R}^m,$$
the Jacobian has $mn$ entries. For modern models, $m$ and $n$ may both be enormous. Explicit Jacobians are usually impractical.
AD systems instead compute structured products.
Forward mode computes
$$Jv.$$
Reverse mode computes
$$u^T J.$$
In tensor notation, these are called:
| Operation | Name | Typical AD Mode |
|---|---|---|
| $Jv$ | Jacobian-vector product | Forward mode |
| $u^T J$ | Vector-Jacobian product | Reverse mode |
| $Hv$ | Hessian-vector product | Mixed mode |
| $J^T u$ | Pullback of cotangent | Reverse mode |
The key point is that AD applies derivative operators without constructing them.
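A sketch of this idea for the illustrative composite $f(x) = A \tanh(Bx)$: both products are computed by pushing or pulling a vector through the chain, and the explicit Jacobian is built only to validate them.

```python
import numpy as np

# Jv and u^T J for f(x) = A tanh(Bx), without forming J in the AD path.
rng = np.random.default_rng(6)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 4))
x = rng.standard_normal(4)

z = B @ x
s = np.tanh(z)
ds = 1.0 - s ** 2            # elementwise derivative of tanh

def jvp(v):                  # forward mode: push v through each step
    return A @ (ds * (B @ v))

def vjp(u):                  # reverse mode: pull u back through each step
    return B.T @ (ds * (A.T @ u))

v = rng.standard_normal(4)
u = rng.standard_normal(3)
J = A @ np.diag(ds) @ B      # explicit Jacobian, for validation only
assert np.allclose(jvp(v), J @ v)
assert np.allclose(vjp(u), J.T @ u)
assert np.allclose(u @ jvp(v), vjp(u) @ v)
```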
Shape Discipline
Matrix calculus is only correct when shapes are correct. A useful derivation should annotate every variable with its shape.
For
$$C = AB,$$
we should write
$$A: m\times k,\quad B: k\times n,\quad C: m\times n.$$
Then the reverse rules must match:
$$\bar{C}: m\times n,\qquad \bar{A} = \bar{C} B^T: (m\times n)(n\times k) = m\times k,\qquad \bar{B} = A^T \bar{C}: (k\times m)(m\times n) = k\times n.$$
Shape checking catches many gradient bugs before numerical testing.
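In code, this discipline amounts to asserting shapes at rule boundaries. A minimal sketch (the helper name `matmul_vjp` is illustrative):

```python
import numpy as np

# A shape-checked matmul reverse rule; the asserts catch transposition bugs.
def matmul_vjp(A, B, Cbar):
    m, k = A.shape
    k2, n = B.shape
    assert k == k2, "inner dimensions must agree"
    assert Cbar.shape == (m, n), "adjoint must match the output shape"
    Abar = Cbar @ B.T
    Bbar = A.T @ Cbar
    assert Abar.shape == A.shape and Bbar.shape == B.shape
    return Abar, Bbar

rng = np.random.default_rng(7)
Abar, Bbar = matmul_vjp(rng.standard_normal((3, 4)),
                        rng.standard_normal((4, 2)),
                        rng.standard_normal((3, 2)))
```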
Trace Identities
Trace identities are the main algebraic tool for deriving matrix gradients.
The most useful identities are:
$$\operatorname{tr}(A) = \operatorname{tr}(A^T),\qquad \operatorname{tr}(ABC) = \operatorname{tr}(BCA) = \operatorname{tr}(CAB),\qquad \operatorname{tr}(A^T B) = \langle A, B \rangle_F,\qquad \operatorname{tr}(AB) = \operatorname{tr}(BA),$$
when the products are defined.
The cyclic property of trace is especially important. It allows us to move dX into the position required to read off the gradient.
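These identities can be checked numerically; a NumPy sketch:

```python
import numpy as np

# Numerical check of the trace identities.
rng = np.random.default_rng(8)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 3))

t = np.trace(A @ B @ C)
assert np.allclose(t, np.trace(B @ C @ A))   # cyclic property
assert np.allclose(t, np.trace(C @ A @ B))   # cyclic property

M = rng.standard_normal((4, 4))
assert np.allclose(np.trace(M), np.trace(M.T))

D = rng.standard_normal((3, 4))
assert np.allclose(np.trace(A.T @ D), np.sum(A * D))   # tr(A^T D) = <A, D>_F
```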
For example, if
$$f(X) = \operatorname{tr}(AXB),$$
then
$$df = \operatorname{tr}(A\,dX\,B).$$
Using cyclic trace,
$$df = \operatorname{tr}(BA\,dX).$$
To match the Frobenius form
$$df = \operatorname{tr}(G^T dX),$$
we need
$$G^T = BA.$$
Therefore
$$\nabla_X f = A^T B^T.$$
Gradients Depend on Inner Product Convention
A gradient is not just a derivative. A derivative is a linear map. A gradient is the representation of that linear map under a chosen inner product.
For vectors, we usually use
$$\langle x, y \rangle = x^T y.$$
For matrices, we usually use
$$\langle A, B \rangle_F = \operatorname{tr}(A^T B).$$
Under this convention,
$$df = \langle \nabla_X f, dX \rangle_F.$$
If a different inner product is used, the gradient representation changes. This matters in geometry, optimization on manifolds, natural gradients, and constrained matrix problems.
AD systems usually assume Euclidean inner products for arrays. More specialized systems may expose custom vector-space structures and custom adjoints.
Matrix Calculus in AD Systems
An AD engine needs a derivative rule for each primitive. For matrix primitives, the rule usually has two forms:
Forward rule:
$$dY = J_f(X)[dX].$$
Reverse rule:
$$\bar{X} = J_f(X)^T[\bar{Y}].$$
For example:
| Primitive | Forward Differential | Reverse Rule |
|---|---|---|
| $Y = X^T$ | $dY = dX^T$ | $\bar{X} \mathrel{+}= \bar{Y}^T$ |
| $Y = AX$ | $dY = A\,dX$ | $\bar{X} \mathrel{+}= A^T \bar{Y}$ |
| $Y = XB$ | $dY = dX\,B$ | $\bar{X} \mathrel{+}= \bar{Y} B^T$ |
| $Y = AB$ | $dY = dA\,B + A\,dB$ | $\bar{A} \mathrel{+}= \bar{Y} B^T$, $\bar{B} \mathrel{+}= A^T \bar{Y}$ |
| $y = \operatorname{sum}(X)$ | $dy = \operatorname{sum}(dX)$ | $\bar{X} \mathrel{+}= \bar{y}\,\mathbf{1}$ (all-ones array shaped like $X$) |
| $Y = X + b$ with broadcast | $dY = dX + db$ | $\bar{b} \mathrel{+}= \operatorname{reduce}(\bar{Y})$ |
A production AD system also tracks layout, strides, broadcasting, dtype, device placement, aliasing, and mutation. The mathematics gives the rule. The runtime makes the rule correct for real arrays.
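To make the pairing of forward rules and reverse rules concrete, here is a toy tape-based sketch for the matmul primitive. The names (`Tensor`, `backward`) are illustrative, not any real library's API, and the traversal is only valid for simple chain-shaped graphs like this example:

```python
import numpy as np

# A toy reverse-mode engine: each primitive records a rule that
# distributes the output adjoint to its parents.
class Tensor:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = np.zeros_like(value)
        self.parents = parents
        self.backward_rule = None   # fills parents' grads from self.grad

def matmul(a, b):
    out = Tensor(a.value @ b.value, (a, b))
    def rule():
        a.grad += out.grad @ b.value.T   # Abar += Cbar B^T
        b.grad += a.value.T @ out.grad   # Bbar += A^T Cbar
    out.backward_rule = rule
    return out

def backward(out):
    out.grad = np.ones_like(out.value)   # seed; out is a 1x1 "scalar" here
    # Pre-order DFS from the output; adequate for chain-shaped toy graphs.
    stack, visited, order = [out], set(), []
    while stack:
        node = stack.pop()
        if id(node) not in visited:
            visited.add(id(node))
            order.append(node)
            stack.extend(node.parents)
    for node in order:
        if node.backward_rule:
            node.backward_rule()

# Usage: d/dA of the scalar x A y recovers the outer product x^T y^T.
x = Tensor(np.array([[1.0, 2.0]]))             # 1x2 row vector
A = Tensor(np.arange(6.0).reshape(2, 3))
y = Tensor(np.array([[3.0], [4.0], [5.0]]))    # 3x1 column vector
out = matmul(matmul(x, A), y)                  # shape 1x1
backward(out)
assert np.allclose(A.grad, x.value.T @ y.value.T)
```

The design choice here mirrors the section's workflow: the forward primitive computes the value, and its reverse rule is exactly the adjoint pair derived from the differential.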
Summary
Matrix calculus gives AD systems their local tensor rules. The derivative is best understood as a linear map. Forward mode applies that linear map to perturbations. Reverse mode applies its adjoint to output sensitivities.
The practical workflow is:
- Write the primitive operation.
- Write its differential.
- Move perturbations into Frobenius inner-product form.
- Read off the adjoint rules.
- Check every shape.
This method scales from simple matrix multiplication to large tensor programs.