
Chapter 139. Matrix Calculus

139.1 Introduction

Matrix calculus studies derivatives of functions involving vectors and matrices.

It extends ordinary differential calculus to multidimensional linear-algebraic settings. Matrix calculus is fundamental in optimization, statistics, control theory, machine learning, numerical analysis, and differential geometry.

The main objects are:

| Object | Example |
| --- | --- |
| Scalar-valued functions | f(x)\in\mathbb{R} |
| Vector-valued functions | F(x)\in\mathbb{R}^m |
| Matrix-valued functions | A(t)\in\mathbb{R}^{m\times n} |

Matrix calculus provides rules for gradients, Jacobians, Hessians, and derivatives of matrix expressions.

Modern optimization and machine learning rely heavily on matrix derivatives because objective functions are usually expressed using vectors and matrices.

139.2 Scalars, Vectors, and Matrices

Throughout this chapter:

| Symbol | Meaning |
| --- | --- |
| x\in\mathbb{R}^n | Column vector |
| A\in\mathbb{R}^{m\times n} | Matrix |
| f(x)\in\mathbb{R} | Scalar-valued function |
| F(x)\in\mathbb{R}^m | Vector-valued function |

A scalar function maps vectors to numbers:

f:\mathbb{R}^n\to\mathbb{R}.

A vector function maps vectors to vectors:

F:\mathbb{R}^n\to\mathbb{R}^m.

The derivative structures depend on the dimensions of the input and output.

139.3 Directional Derivatives

Let

f:\mathbb{R}^n\to\mathbb{R}.

The directional derivative of ff at xx in direction vv is

D_v f(x) = \lim_{t\to0} \frac{f(x+tv)-f(x)}{t}.

It measures the instantaneous rate of change of ff in direction vv.

The directional derivative is linear in the direction vector:

D_{a v+b w}f = aD_vf+bD_wf.

Directional derivatives lead naturally to gradients and Jacobians.

139.4 Gradient

For a differentiable scalar function

f:\mathbb{R}^n\to\mathbb{R},

the gradient is the column vector

\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}\\ \frac{\partial f}{\partial x_2}\\ \vdots\\ \frac{\partial f}{\partial x_n} \end{bmatrix}.

The gradient gives the direction of steepest increase.

The directional derivative satisfies

D_vf(x) = \nabla f(x)^T v.

Thus the gradient converts local change into a linear-algebraic inner product.
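The identity D_v f(x) = ∇f(x)^T v can be checked numerically. The sketch below, using NumPy, compares a finite-difference directional derivative against the gradient inner product; the test function f(x) = x^T x + x_1 x_2, the step size, and the random vectors are illustrative choices, not part of the text.

```python
import numpy as np

# Illustrative test function f(x) = x^T x + x_0 * x_1,
# whose gradient is 2x plus the cross terms.
def f(x):
    return float(x @ x + x[0] * x[1])

def grad_f(x):
    g = 2.0 * x
    g[0] += x[1]
    g[1] += x[0]
    return g

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
v = rng.standard_normal(4)

# Finite-difference directional derivative vs. gradient inner product
t = 1e-6
dvf_numeric = (f(x + t * v) - f(x)) / t
dvf_exact = grad_f(x) @ v
```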

139.5 Jacobian Matrix

Let

F:\mathbb{R}^n\to\mathbb{R}^m

with components

F= \begin{bmatrix} F_1\\ \vdots\\ F_m \end{bmatrix}.

The Jacobian matrix is

J_F(x) = \begin{bmatrix} \frac{\partial F_1}{\partial x_1} & \cdots & \frac{\partial F_1}{\partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial F_m}{\partial x_1} & \cdots & \frac{\partial F_m}{\partial x_n} \end{bmatrix}.

The Jacobian represents the best linear approximation of FF near xx:

F(x+h) \approx F(x)+J_F(x)h.

The Jacobian generalizes the derivative matrix from single-variable calculus.
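The linear-approximation property F(x+h) ≈ F(x) + J_F(x)h can be tested directly. Below is a minimal NumPy sketch with a hypothetical map F: R² → R³ and its hand-computed Jacobian; both are illustrative examples, not taken from the text.

```python
import numpy as np

# Illustrative vector-valued map F: R^2 -> R^3
def F(x):
    return np.array([x[0]**2, x[0]*x[1], np.sin(x[1])])

# Its Jacobian, written out entry by entry
def J_F(x):
    return np.array([[2*x[0], 0.0],
                     [x[1],   x[0]],
                     [0.0,    np.cos(x[1])]])

x = np.array([0.5, 1.2])
h = 1e-5 * np.array([1.0, -2.0])   # small displacement

lhs = F(x + h)                     # exact value at the displaced point
rhs = F(x) + J_F(x) @ h            # first-order linear approximation
```

The two sides agree to within O(‖h‖²), which is the defining property of the Jacobian as the best linear approximation.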

139.6 Hessian Matrix

For a twice-differentiable scalar function

f:\mathbb{R}^n\to\mathbb{R},

the Hessian matrix is

H_f(x) = \left( \frac{\partial^2 f}{\partial x_i\partial x_j} \right).

Explicitly,

H_f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_n\partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}.

If mixed partial derivatives commute, then

H_f(x)=H_f(x)^T.

The Hessian describes local curvature.
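A Hessian can be approximated entrywise by central differences, and its symmetry checked numerically. The sketch below uses an illustrative function f(x) = x_1² x_2 (whose Hessian is easy to compute by hand); the `hessian_fd` helper and the step size are assumptions of this example.

```python
import numpy as np

# Illustrative function: f(x) = x_0^2 * x_1,
# with Hessian [[2*x_1, 2*x_0], [2*x_0, 0]].
def f(x):
    return x[0]**2 * x[1]

def hessian_fd(f, x, eps=1e-5):
    """Central-difference approximation of the Hessian."""
    n = x.size
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = eps
            e_j = np.zeros(n); e_j[j] = eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

x = np.array([1.5, -0.7])
H = hessian_fd(f, x)
H_exact = np.array([[2*x[1], 2*x[0]],
                    [2*x[0], 0.0]])
```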

139.7 Differential Notation

Matrix calculus is often cleaner using differentials.

Suppose

y=f(x).

The differential is written

dy.

If ff is differentiable, then

dy = \nabla f(x)^T dx.

For vector functions,

dF = J_F(x)\,dx.

Differential notation treats derivatives as linear maps acting on infinitesimal increments.

This viewpoint is coordinate-free and especially useful for matrix expressions.

139.8 Derivative of Linear Functions

Let

f(x)=a^Tx,

where

a\in\mathbb{R}^n.

Then

\nabla f(x)=a.

The function is already linear, so its derivative is constant.

More generally, for

F(x)=Ax,

the Jacobian is

J_F(x)=A.

Linear maps are their own derivatives.

139.9 Derivative of Quadratic Forms

Consider the quadratic form

f(x)=x^TAx.

Its differential is

df = x^TA\,dx + (dx)^TAx.

Since (dx)^TAx is a scalar, it equals its own transpose:

(dx)^TAx = x^TA^T\,dx.

Thus:

df = x^TA\,dx + x^TA^Tdx.

Therefore,

\nabla f(x) = (A+A^T)x.

If AA is symmetric, this simplifies to

\nabla f(x)=2Ax.


Quadratic forms are central in optimization and statistics.
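The formula ∇(x^T A x) = (A + A^T)x is easy to verify numerically, including for non-symmetric A. A minimal NumPy sketch, with an illustrative random matrix and central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))   # deliberately non-symmetric
x = rng.standard_normal(4)

def f(x):
    return float(x @ A @ x)

# Closed-form gradient of the quadratic form
grad_exact = (A + A.T) @ x

# Central finite-difference gradient, coordinate by coordinate
eps = 1e-6
grad_fd = np.array([
    (f(x + eps*e) - f(x - eps*e)) / (2*eps)
    for e in np.eye(4)
])
```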

139.10 Least Squares Derivatives

Let

f(x)=\|Ax-b\|_2^2.

Expand:

f(x) = (Ax-b)^T(Ax-b).

Differentiating gives

\nabla f(x) = 2A^T(Ax-b).

Setting the gradient equal to zero yields the normal equations:

A^TAx=A^Tb.

This derivation is fundamental in least squares theory and machine learning.
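The derivation above can be exercised numerically: solving the normal equations should zero the gradient 2A^T(Ax − b) and should agree with a standard least-squares solver. A sketch with illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 3))   # overdetermined system
b = rng.standard_normal(10)

# Solve the normal equations A^T A x = A^T b
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# The gradient 2 A^T (A x - b) should vanish at the minimizer
grad_at_min = 2 * A.T @ (A @ x_star - b)

# Cross-check against NumPy's least-squares routine
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```

For ill-conditioned A, solving the normal equations directly is numerically inferior to QR- or SVD-based solvers such as `lstsq`; the normal equations are used here only to mirror the derivation.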

139.11 Chain Rule

Suppose

F:\mathbb{R}^n\to\mathbb{R}^m, \qquad G:\mathbb{R}^m\to\mathbb{R}^p.

Then

H(x)=G(F(x)).

The Jacobian satisfies

J_H(x) = J_G(F(x))\,J_F(x).

Thus derivatives compose by matrix multiplication.

This rule is the foundation of backpropagation in neural networks.
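The composition rule J_H = J_G(F(x)) J_F(x) can be checked by comparing the Jacobian product against finite differences of the composite. The maps F and G below are illustrative examples chosen so their Jacobians are easy to write by hand:

```python
import numpy as np

# Illustrative inner map F: R^2 -> R^2 and its Jacobian
def F(x):
    return np.array([x[0]*x[1], x[0] + x[1]])

def J_F(x):
    return np.array([[x[1], x[0]],
                     [1.0,  1.0]])

# Illustrative outer map G: R^2 -> R^1 and its Jacobian
def G(y):
    return np.array([np.sin(y[0]) + y[1]**2])

def J_G(y):
    return np.array([[np.cos(y[0]), 2*y[1]]])

x = np.array([0.3, -1.1])
J_H = J_G(F(x)) @ J_F(x)          # chain rule: a 1x2 matrix

# Finite-difference Jacobian of H = G(F(x))
H = lambda x: G(F(x))
eps = 1e-6
J_fd = np.array([[(H(x + eps*e)[0] - H(x - eps*e)[0]) / (2*eps)
                  for e in np.eye(2)]])
```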

139.12 Matrix-by-Matrix Derivatives

Sometimes the variable itself is a matrix.

Suppose

f(A) = \operatorname{tr}(BA).

The differential is

df = \operatorname{tr}(B\,dA).

Thus the derivative with respect to AA is

\frac{\partial f}{\partial A}=B^T.

Matrix derivatives are often expressed using trace identities because traces linearize matrix expressions.
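The claim ∂tr(BA)/∂A = B^T can be verified entry by entry: perturbing A_{ij} changes f at the rate B_{ji}. A minimal NumPy sketch with illustrative random matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((3, 3))
A = rng.standard_normal((3, 3))

def f(A):
    return np.trace(B @ A)

# Entrywise finite-difference derivative of f with respect to A
eps = 1e-6
D = np.empty_like(A)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(A); E[i, j] = eps
        D[i, j] = (f(A + E) - f(A - E)) / (2*eps)
# D should match B.T, since d tr(BA) = tr(B dA)
```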

139.13 Trace Identities

Trace identities are heavily used in matrix calculus.

Important formulas include:

| Identity | Formula |
| --- | --- |
| Cyclic property | \operatorname{tr}(ABC)=\operatorname{tr}(BCA) |
| Transpose invariance | \operatorname{tr}(A)=\operatorname{tr}(A^T) |
| Inner product | \langle A,B\rangle=\operatorname{tr}(A^TB) |

These identities allow complicated derivatives to be rewritten in manageable form.

139.14 Derivative of the Determinant

Let

A(t)

be a differentiable family of invertible matrices.

Then

\frac{d}{dt}\det(A) = \det(A)\operatorname{tr}(A^{-1}A').

This is Jacobi’s formula.

Equivalently,

d(\log\det A) = \operatorname{tr}(A^{-1}dA).

Log-determinants appear frequently in statistics, covariance estimation, and optimization.
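Jacobi's formula is straightforward to test on a one-parameter family. The sketch below uses an illustrative family A(t) = I + tA₁ (kept near the identity so invertibility is not in question) and compares a finite-difference derivative of det A(t) with det(A) tr(A⁻¹A′):

```python
import numpy as np

# Illustrative differentiable family A(t) = I + t*A1, so A'(t) = A1
rng = np.random.default_rng(4)
A1 = rng.standard_normal((3, 3))

def A(t):
    return np.eye(3) + t * A1

t0, eps = 0.1, 1e-6

# Finite-difference derivative of det(A(t)) at t0
ddet_fd = (np.linalg.det(A(t0 + eps)) - np.linalg.det(A(t0 - eps))) / (2*eps)

# Jacobi's formula: det(A) * tr(A^{-1} A')
ddet_jacobi = np.linalg.det(A(t0)) * np.trace(np.linalg.solve(A(t0), A1))
```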

139.15 Derivative of the Matrix Inverse

Suppose

A=A(t)

is invertible.

Since

AA^{-1}=I,

differentiating gives

A'A^{-1}+A(A^{-1})'=0.

Solving for the derivative:

(A^{-1})' = -A^{-1}A'A^{-1}.

This identity appears constantly in optimization and sensitivity analysis.
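The identity (A⁻¹)′ = −A⁻¹A′A⁻¹ can be confirmed numerically on a simple family. Below, A(t) = I + tC is an illustrative choice (near the identity, so the inverse exists along the path):

```python
import numpy as np

# Illustrative family A(t) = I + t*C, so A'(t) = C
rng = np.random.default_rng(5)
C = rng.standard_normal((3, 3))

def A(t):
    return np.eye(3) + t * C

t0, eps = 0.1, 1e-6
Ainv = np.linalg.inv(A(t0))

# Finite-difference derivative of the inverse
dAinv_fd = (np.linalg.inv(A(t0 + eps)) - np.linalg.inv(A(t0 - eps))) / (2*eps)

# Closed form: (A^{-1})' = -A^{-1} A' A^{-1}
dAinv_formula = -Ainv @ C @ Ainv
```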

139.16 Derivative of Eigenvalues

Suppose

A(t)

is a differentiable family of symmetric matrices.

Let

A(t)v(t)=\lambda(t)v(t),

with normalized eigenvector

v(t)^Tv(t)=1.

Differentiating gives

\lambda'(t) = v(t)^T A'(t) v(t).

Thus eigenvalue sensitivity depends on quadratic forms of the perturbation.

This formula is important in perturbation theory and optimization involving spectra.
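The sensitivity formula λ′ = v^T A′ v can be checked on a symmetric family A(t) = S₀ + tS₁, tracking the largest eigenvalue. This sketch assumes the eigenvalue is simple (generically true for a random symmetric matrix), since the formula requires a well-defined eigenvector:

```python
import numpy as np

# Illustrative symmetric family A(t) = S0 + t*S1, so A'(t) = S1
rng = np.random.default_rng(6)
M0, M1 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
S0, S1 = (M0 + M0.T) / 2, (M1 + M1.T) / 2

def largest_eig(t):
    w, V = np.linalg.eigh(S0 + t * S1)   # eigh sorts eigenvalues ascending
    return w[-1], V[:, -1]

t0, eps = 0.0, 1e-6
lam, v = largest_eig(t0)

# Sensitivity formula: lambda'(t) = v^T A'(t) v
dlam_formula = v @ S1 @ v

# Finite-difference derivative of the largest eigenvalue
dlam_fd = (largest_eig(t0 + eps)[0] - largest_eig(t0 - eps)[0]) / (2*eps)
```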

139.17 Matrix Exponential Derivatives

The matrix exponential is

e^A = \sum_{k=0}^{\infty}\frac{A^k}{k!}.

If

A=A(t),

then differentiating is more complicated because matrices may not commute.

If

A'A=AA',

then

\frac{d}{dt}e^{A(t)} = e^{A(t)}A'(t).

Without commutativity, integral formulas are needed.

Matrix exponentials appear in differential equations and control theory.
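The commuting case is easy to exercise with the family A(t) = tB, for which A′ = B always commutes with A. For symmetric B, e^{tB} can be computed via the eigendecomposition Q e^{tΛ} Q^T; both the family and this evaluation route are illustrative choices of this sketch:

```python
import numpy as np

# A(t) = t*B commutes with A'(t) = B, so d/dt e^{tB} = e^{tB} B.
rng = np.random.default_rng(7)
M = rng.standard_normal((3, 3))
B = (M + M.T) / 2                 # symmetric, so eigh applies

# e^{tB} = Q diag(e^{t*w}) Q^T from the eigendecomposition of B
w, Q = np.linalg.eigh(B)
def expm_sym(t):
    return Q @ np.diag(np.exp(t * w)) @ Q.T

t0, eps = 0.5, 1e-6
d_fd = (expm_sym(t0 + eps) - expm_sym(t0 - eps)) / (2*eps)
d_formula = expm_sym(t0) @ B      # valid because A and A' commute
```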

139.18 Automatic Differentiation

Automatic differentiation computes derivatives algorithmically using the chain rule.

It is neither symbolic differentiation nor finite-difference approximation.

Instead, computations are decomposed into elementary operations, and derivatives propagate through the computation graph.

Two major modes are:

| Mode | Efficient when |
| --- | --- |
| Forward mode | Few inputs |
| Reverse mode | Few outputs |

Reverse-mode automatic differentiation underlies backpropagation in deep learning.

Matrix calculus provides the mathematical foundation for these algorithms.

139.19 Backpropagation

Neural networks are compositions of matrix operations and nonlinearities.

A layer often has the form

x_{k+1} = \sigma(W_kx_k+b_k).

The loss function depends on the final output.

Backpropagation computes gradients efficiently by repeatedly applying the chain rule backward through the network.

At each stage:

  1. Compute local derivatives,
  2. Multiply by incoming sensitivities,
  3. Propagate backward.

This process is fundamentally matrix calculus.
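The three steps above can be sketched for a single layer y = σ(Wx + b) with loss ½‖y‖². The gradients below follow directly from the chain rule; the layer sizes, tanh nonlinearity, and loss are illustrative choices, and one entry of dW is cross-checked with finite differences:

```python
import numpy as np

rng = np.random.default_rng(8)
W, b = rng.standard_normal((3, 2)), rng.standard_normal(3)
x = rng.standard_normal(2)

sigma = np.tanh                       # elementwise nonlinearity
dsigma = lambda z: 1 - np.tanh(z)**2  # its derivative

# Forward pass: z = Wx + b, y = sigma(z), loss = 0.5*||y||^2
z = W @ x + b
y = sigma(z)
loss = 0.5 * y @ y

# Backward pass (chain rule applied right-to-left):
dy = y                                # 1. local derivative of the loss
dz = dsigma(z) * dy                   # 2. multiply by incoming sensitivity
dW = np.outer(dz, x)                  # 3. propagate to the parameters
db = dz

# Finite-difference check of one weight entry
eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
Wm = W.copy(); Wm[0, 0] -= eps
loss_p = 0.5 * np.sum(sigma(Wp @ x + b)**2)
loss_m = 0.5 * np.sum(sigma(Wm @ x + b)**2)
dW00_fd = (loss_p - loss_m) / (2*eps)
```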

139.20 Differential Geometry Perspective

Matrix calculus may also be interpreted geometrically.

The derivative of

F:\mathbb{R}^n\to\mathbb{R}^m

at xx is the linear map

DF(x):\mathbb{R}^n\to\mathbb{R}^m

best approximating FF near xx.

The Jacobian matrix is simply a coordinate representation of this linear map.

This viewpoint extends naturally to manifolds and tensor calculus.

139.21 Summary

Matrix calculus extends differentiation to vector and matrix expressions.

The main concepts are:

| Concept | Meaning |
| --- | --- |
| Gradient | First derivative of a scalar function |
| Jacobian | Derivative matrix of a vector function |
| Hessian | Matrix of second derivatives |
| Differential | Linear-approximation notation |
| Chain rule | Composition of derivatives |
| Quadratic-form derivative | (A+A^T)x |
| Least squares gradient | 2A^T(Ax-b) |
| Trace calculus | Matrix-derivative simplification |
| Determinant derivative | Jacobi's formula |
| Inverse derivative | -A^{-1}A'A^{-1} |
| Eigenvalue derivative | Spectral sensitivity |
| Automatic differentiation | Algorithmic chain-rule propagation |

Matrix calculus provides the language for optimization, machine learning, statistics, control theory, and numerical computation. It transforms multivariable differentiation into structured linear algebra.