
Chapter 139. Matrix Calculus

139.1 Introduction

Matrix calculus studies derivatives of functions involving vectors and matrices.

It extends ordinary differential calculus to multidimensional linear-algebraic settings. Matrix calculus is fundamental in optimization, statistics, control theory, machine learning, numerical analysis, and differential geometry.

The main objects are:

| Object | Example |
| --- | --- |
| Scalar-valued functions | f(x)\in\mathbb{R} |
| Vector-valued functions | F(x)\in\mathbb{R}^m |
| Matrix-valued functions | A(t)\in\mathbb{R}^{m\times n} |

Matrix calculus provides rules for gradients, Jacobians, Hessians, and derivatives of matrix expressions.

Modern optimization and machine learning rely heavily on matrix derivatives because objective functions are usually expressed using vectors and matrices.

139.2 Scalars, Vectors, and Matrices

Throughout this chapter:

| Symbol | Meaning |
| --- | --- |
| x\in\mathbb{R}^n | Column vector |
| A\in\mathbb{R}^{m\times n} | Matrix |
| f(x)\in\mathbb{R} | Scalar-valued function |
| F(x)\in\mathbb{R}^m | Vector-valued function |

A scalar function maps vectors to numbers:

f:\mathbb{R}^n\to\mathbb{R}.

A vector function maps vectors to vectors:

F:\mathbb{R}^n\to\mathbb{R}^m.

The derivative structures depend on the dimensions of the input and output.

139.3 Directional Derivatives

Let

f:\mathbb{R}^n\to\mathbb{R}.

The directional derivative of ff at xx in direction vv is

D_v f(x) = \lim_{t\to0} \frac{f(x+tv)-f(x)}{t}.

It measures the instantaneous rate of change of ff in direction vv.

The directional derivative is linear in the direction vector:

D_{a v+b w}f = aD_vf+bD_wf.

Directional derivatives lead naturally to gradients and Jacobians.

139.4 Gradient

For a differentiable scalar function

f:\mathbb{R}^n\to\mathbb{R},

the gradient is the column vector

\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}\\ \frac{\partial f}{\partial x_2}\\ \vdots\\ \frac{\partial f}{\partial x_n} \end{bmatrix}.

The gradient gives the direction of steepest increase.

The directional derivative satisfies

D_vf(x) = \nabla f(x)^T v.

Thus the gradient converts local change into a linear-algebraic inner product.
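The identity D_v f(x) = ∇f(x)^T v can be checked numerically. The sketch below, using NumPy, compares a finite-difference directional derivative against the gradient inner product; the test function f(x) = x^T x + x_1 x_2, the step size, and the random vectors are illustrative choices, not part of the text.

```python
import numpy as np

# Illustrative test function f(x) = x^T x + x_0 * x_1,
# whose gradient is 2x plus the cross terms.
def f(x):
    return float(x @ x + x[0] * x[1])

def grad_f(x):
    g = 2.0 * x
    g[0] += x[1]
    g[1] += x[0]
    return g

rng = np.random.default_rng(0)
x = rng.standard_normal(4)
v = rng.standard_normal(4)

# Finite-difference directional derivative vs. gradient inner product
t = 1e-6
dvf_numeric = (f(x + t * v) - f(x)) / t
dvf_exact = grad_f(x) @ v
```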

139.5 Jacobian Matrix

Let

F:\mathbb{R}^n\to\mathbb{R}^m

with components

F= \begin{bmatrix} F_1\\ \vdots\\ F_m \end{bmatrix}.

The Jacobian matrix is

J_F(x) = \begin{bmatrix} \frac{\partial F_1}{\partial x_1} & \cdots & \frac{\partial F_1}{\partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial F_m}{\partial x_1} & \cdots & \frac{\partial F_m}{\partial x_n} \end{bmatrix}.

The Jacobian represents the best linear approximation of FF near xx:

F(x+h) \approx F(x)+J_F(x)h.

The Jacobian generalizes the derivative matrix from single-variable calculus.
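The linear-approximation property F(x+h) ≈ F(x) + J_F(x)h can be tested directly. Below is a minimal NumPy sketch with a hypothetical map F: R² → R³ and its hand-computed Jacobian; both are illustrative examples, not taken from the text.

```python
import numpy as np

# Illustrative vector-valued map F: R^2 -> R^3
def F(x):
    return np.array([x[0]**2, x[0]*x[1], np.sin(x[1])])

# Its Jacobian, written out entry by entry
def J_F(x):
    return np.array([[2*x[0], 0.0],
                     [x[1],   x[0]],
                     [0.0,    np.cos(x[1])]])

x = np.array([0.5, 1.2])
h = 1e-5 * np.array([1.0, -2.0])   # small displacement

lhs = F(x + h)                     # exact value at the displaced point
rhs = F(x) + J_F(x) @ h            # first-order linear approximation
```

The two sides agree to within O(‖h‖²), which is the defining property of the Jacobian as the best linear approximation.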

139.6 Hessian Matrix

For a twice-differentiable scalar function

f:\mathbb{R}^n\to\mathbb{R},

the Hessian matrix is

H_f(x) = \left( \frac{\partial^2 f}{\partial x_i\partial x_j} \right).

Explicitly,

H_f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \cdots & \frac{\partial^2 f}{\partial x_1\partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial^2 f}{\partial x_n\partial x_1} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix}.

If mixed partial derivatives commute, then

H_f(x)=H_f(x)^T.

The Hessian describes local curvature.
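A Hessian can be approximated entrywise by central differences, and its symmetry checked numerically. The sketch below uses an illustrative function f(x) = x_1² x_2 (whose Hessian is easy to compute by hand); the `hessian_fd` helper and the step size are assumptions of this example.

```python
import numpy as np

# Illustrative function: f(x) = x_0^2 * x_1,
# with Hessian [[2*x_1, 2*x_0], [2*x_0, 0]].
def f(x):
    return x[0]**2 * x[1]

def hessian_fd(f, x, eps=1e-5):
    """Central-difference approximation of the Hessian."""
    n = x.size
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n); e_i[i] = eps
            e_j = np.zeros(n); e_j[j] = eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps**2)
    return H

x = np.array([1.5, -0.7])
H = hessian_fd(f, x)
H_exact = np.array([[2*x[1], 2*x[0]],
                    [2*x[0], 0.0]])
```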

139.7 Differential Notation

Matrix calculus is often cleaner using differentials.

Suppose

y=f(x).

The differential is written

dy.

If ff is differentiable, then

dy = \nabla f(x)^T dx.

For vector functions,

dF = J_F(x)\,dx.

Differential notation treats derivatives as linear maps acting on infinitesimal increments.

This viewpoint is coordinate-free and especially useful for matrix expressions.

139.8 Derivative of Linear Functions

Let

f(x)=a^Tx,

where

a\in\mathbb{R}^n.

Then

\nabla f(x)=a.

The function is already linear, so its derivative is constant.

More generally, for

F(x)=Ax,

the Jacobian is

J_F(x)=A.

Linear maps are their own derivatives.

139.9 Derivative of Quadratic Forms

Consider the quadratic form

f(x)=x^TAx.

Its differential is

df = x^TA\,dx + (dx)^TAx.

Since (dx)^TAx is a scalar, it equals its own transpose:

(dx)^TAx = x^TA^T\,dx.

Thus:

df = x^TA\,dx + x^TA^Tdx.

Therefore,

\nabla f(x) = (A+A^T)x.

If AA is symmetric, this simplifies to

\nabla f(x)=2Ax.


Quadratic forms are central in optimization and statistics.
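The formula ∇(x^T A x) = (A + A^T)x is easy to verify numerically, including for non-symmetric A. A minimal NumPy sketch, with an illustrative random matrix and central finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))   # deliberately non-symmetric
x = rng.standard_normal(4)

def f(x):
    return float(x @ A @ x)

# Closed-form gradient of the quadratic form
grad_exact = (A + A.T) @ x

# Central finite-difference gradient, coordinate by coordinate
eps = 1e-6
grad_fd = np.array([
    (f(x + eps*e) - f(x - eps*e)) / (2*eps)
    for e in np.eye(4)
])
```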

139.10 Least Squares Derivatives

Let

f(x)=\|Ax-b\|_2^2.

Expand:

f(x) = (Ax-b)^T(Ax-b).

Differentiating gives

\nabla f(x) = 2A^T(Ax-b).

Setting the gradient equal to zero yields the normal equations:

A^TAx=A^Tb.

This derivation is fundamental in least squares theory and machine learning.
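The derivation above can be exercised numerically: solving the normal equations should zero the gradient 2A^T(Ax − b) and should agree with a standard least-squares solver. A sketch with illustrative random data:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 3))   # overdetermined system
b = rng.standard_normal(10)

# Solve the normal equations A^T A x = A^T b
x_star = np.linalg.solve(A.T @ A, A.T @ b)

# The gradient 2 A^T (A x - b) should vanish at the minimizer
grad_at_min = 2 * A.T @ (A @ x_star - b)

# Cross-check against NumPy's least-squares routine
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
```

For ill-conditioned A, solving the normal equations directly is numerically inferior to QR- or SVD-based solvers such as `lstsq`; the normal equations are used here only to mirror the derivation.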

139.11 Chain Rule

Suppose

F:\mathbb{R}^n\to\mathbb{R}^m, \qquad G:\mathbb{R}^m\to\mathbb{R}^p.

Then

H(x)=G(F(x)).

The Jacobian satisfies

J_H(x) = J_G(F(x))\,J_F(x).

Thus derivatives compose by matrix multiplication.

This rule is the foundation of backpropagation in neural networks.
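The composition rule J_H = J_G(F(x)) J_F(x) can be checked by comparing the Jacobian product against finite differences of the composite. The maps F and G below are illustrative examples chosen so their Jacobians are easy to write by hand:

```python
import numpy as np

# Illustrative inner map F: R^2 -> R^2 and its Jacobian
def F(x):
    return np.array([x[0]*x[1], x[0] + x[1]])

def J_F(x):
    return np.array([[x[1], x[0]],
                     [1.0,  1.0]])

# Illustrative outer map G: R^2 -> R^1 and its Jacobian
def G(y):
    return np.array([np.sin(y[0]) + y[1]**2])

def J_G(y):
    return np.array([[np.cos(y[0]), 2*y[1]]])

x = np.array([0.3, -1.1])
J_H = J_G(F(x)) @ J_F(x)          # chain rule: a 1x2 matrix

# Finite-difference Jacobian of H = G(F(x))
H = lambda x: G(F(x))
eps = 1e-6
J_fd = np.array([[(H(x + eps*e)[0] - H(x - eps*e)[0]) / (2*eps)
                  for e in np.eye(2)]])
```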

139.12 Matrix-by-Matrix Derivatives

Sometimes the variable itself is a matrix.

Suppose

f(A) = \operatorname{tr}(BA).

The differential is

df = \operatorname{tr}(B\,dA).

Thus the derivative with respect to AA is

\frac{\partial f}{\partial A}=B^T.

Matrix derivatives are often expressed using trace identities because traces linearize matrix expressions.
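The claim ∂tr(BA)/∂A = B^T can be verified entry by entry: perturbing A_{ij} changes f at the rate B_{ji}. A minimal NumPy sketch with illustrative random matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((3, 3))
A = rng.standard_normal((3, 3))

def f(A):
    return np.trace(B @ A)

# Entrywise finite-difference derivative of f with respect to A
eps = 1e-6
D = np.empty_like(A)
for i in range(3):
    for j in range(3):
        E = np.zeros_like(A); E[i, j] = eps
        D[i, j] = (f(A + E) - f(A - E)) / (2*eps)
# D should match B.T, since d tr(BA) = tr(B dA)
```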

139.13 Trace Identities

Trace identities are heavily used in matrix calculus.

Important formulas include:

| Identity | Formula |
| --- | --- |
| Cyclic property | \operatorname{tr}(ABC)=\operatorname{tr}(BCA) |
| Transpose invariance | \operatorname{tr}(A)=\operatorname{tr}(A^T) |
| Inner product | \langle A,B\rangle=\operatorname{tr}(A^TB) |

These identities allow complicated derivatives to be rewritten in manageable form.

139.14 Derivative of the Determinant

Let

A(t)

be a differentiable family of invertible matrices.

Then

\frac{d}{dt}\det(A) = \det(A)\operatorname{tr}(A^{-1}A').

This is Jacobi’s formula.

Equivalently,

d(\log\det A) = \operatorname{tr}(A^{-1}dA).

Log-determinants appear frequently in statistics, covariance estimation, and optimization.
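Jacobi's formula is straightforward to test on a one-parameter family. The sketch below uses an illustrative family A(t) = I + tA₁ (kept near the identity so invertibility is not in question) and compares a finite-difference derivative of det A(t) with det(A) tr(A⁻¹A′):

```python
import numpy as np

# Illustrative differentiable family A(t) = I + t*A1, so A'(t) = A1
rng = np.random.default_rng(4)
A1 = rng.standard_normal((3, 3))

def A(t):
    return np.eye(3) + t * A1

t0, eps = 0.1, 1e-6

# Finite-difference derivative of det(A(t)) at t0
ddet_fd = (np.linalg.det(A(t0 + eps)) - np.linalg.det(A(t0 - eps))) / (2*eps)

# Jacobi's formula: det(A) * tr(A^{-1} A')
ddet_jacobi = np.linalg.det(A(t0)) * np.trace(np.linalg.solve(A(t0), A1))
```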

139.15 Derivative of the Matrix Inverse

Suppose

A=A(t)

is invertible.

Since

AA^{-1}=I,

differentiating gives

A'A^{-1}+A(A^{-1})'=0.

Solving for the derivative:

(A^{-1})' = -A^{-1}A'A^{-1}.

This identity appears constantly in optimization and sensitivity analysis.
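The identity (A⁻¹)′ = −A⁻¹A′A⁻¹ can be confirmed numerically on a simple family. Below, A(t) = I + tC is an illustrative choice (near the identity, so the inverse exists along the path):

```python
import numpy as np

# Illustrative family A(t) = I + t*C, so A'(t) = C
rng = np.random.default_rng(5)
C = rng.standard_normal((3, 3))

def A(t):
    return np.eye(3) + t * C

t0, eps = 0.1, 1e-6
Ainv = np.linalg.inv(A(t0))

# Finite-difference derivative of the inverse
dAinv_fd = (np.linalg.inv(A(t0 + eps)) - np.linalg.inv(A(t0 - eps))) / (2*eps)

# Closed form: (A^{-1})' = -A^{-1} A' A^{-1}
dAinv_formula = -Ainv @ C @ Ainv
```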

139.16 Derivative of Eigenvalues

Suppose

A(t)

is a differentiable family of symmetric matrices.

Let

A(t)v(t)=\lambda(t)v(t),

with normalized eigenvector

v(t)^Tv(t)=1.

Differentiating gives

\lambda'(t) = v(t)^T A'(t) v(t).

Thus eigenvalue sensitivity depends on quadratic forms of the perturbation.

This formula is important in perturbation theory and optimization involving spectra.
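The sensitivity formula λ′ = v^T A′ v can be checked on a symmetric family A(t) = S₀ + tS₁, tracking the largest eigenvalue. This sketch assumes the eigenvalue is simple (generically true for a random symmetric matrix), since the formula requires a well-defined eigenvector:

```python
import numpy as np

# Illustrative symmetric family A(t) = S0 + t*S1, so A'(t) = S1
rng = np.random.default_rng(6)
M0, M1 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
S0, S1 = (M0 + M0.T) / 2, (M1 + M1.T) / 2

def largest_eig(t):
    w, V = np.linalg.eigh(S0 + t * S1)   # eigh sorts eigenvalues ascending
    return w[-1], V[:, -1]

t0, eps = 0.0, 1e-6
lam, v = largest_eig(t0)

# Sensitivity formula: lambda'(t) = v^T A'(t) v
dlam_formula = v @ S1 @ v

# Finite-difference derivative of the largest eigenvalue
dlam_fd = (largest_eig(t0 + eps)[0] - largest_eig(t0 - eps)[0]) / (2*eps)
```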

139.17 Matrix Exponential Derivatives

The matrix exponential is

e^A = \sum_{k=0}^{\infty}\frac{A^k}{k!}.

If

A=A(t),

then differentiating is more complicated because matrices may not commute.

If

A'A=AA',

then

\frac{d}{dt}e^{A(t)} = e^{A(t)}A'(t).

Without commutativity, integral formulas are needed.

Matrix exponentials appear in differential equations and control theory.
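The commuting case is easy to exercise with the family A(t) = tB, for which A′ = B always commutes with A. For symmetric B, e^{tB} can be computed via the eigendecomposition Q e^{tΛ} Q^T; both the family and this evaluation route are illustrative choices of this sketch:

```python
import numpy as np

# A(t) = t*B commutes with A'(t) = B, so d/dt e^{tB} = e^{tB} B.
rng = np.random.default_rng(7)
M = rng.standard_normal((3, 3))
B = (M + M.T) / 2                 # symmetric, so eigh applies

# e^{tB} = Q diag(e^{t*w}) Q^T from the eigendecomposition of B
w, Q = np.linalg.eigh(B)
def expm_sym(t):
    return Q @ np.diag(np.exp(t * w)) @ Q.T

t0, eps = 0.5, 1e-6
d_fd = (expm_sym(t0 + eps) - expm_sym(t0 - eps)) / (2*eps)
d_formula = expm_sym(t0) @ B      # valid because A and A' commute
```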

139.18 Automatic Differentiation

Automatic differentiation computes derivatives algorithmically using the chain rule.

It is neither symbolic differentiation nor finite-difference approximation.

Instead, computations are decomposed into elementary operations, and derivatives propagate through the computation graph.

Two major modes are:

| Mode | Efficient when |
| --- | --- |
| Forward mode | Few inputs |
| Reverse mode | Few outputs |

Reverse-mode automatic differentiation underlies backpropagation in deep learning.

Matrix calculus provides the mathematical foundation for these algorithms.

139.19 Backpropagation

Neural networks are compositions of matrix operations and nonlinearities.

A layer often has the form

x_{k+1} = \sigma(W_kx_k+b_k).

The loss function depends on the final output.

Backpropagation computes gradients efficiently by repeatedly applying the chain rule backward through the network.

At each stage:

  1. Compute local derivatives,
  2. Multiply by incoming sensitivities,
  3. Propagate backward.

This process is fundamentally matrix calculus.
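The three steps above can be sketched for a single layer y = σ(Wx + b) with loss ½‖y‖². The gradients below follow directly from the chain rule; the layer sizes, tanh nonlinearity, and loss are illustrative choices, and one entry of dW is cross-checked with finite differences:

```python
import numpy as np

rng = np.random.default_rng(8)
W, b = rng.standard_normal((3, 2)), rng.standard_normal(3)
x = rng.standard_normal(2)

sigma = np.tanh                       # elementwise nonlinearity
dsigma = lambda z: 1 - np.tanh(z)**2  # its derivative

# Forward pass: z = Wx + b, y = sigma(z), loss = 0.5*||y||^2
z = W @ x + b
y = sigma(z)
loss = 0.5 * y @ y

# Backward pass (chain rule applied right-to-left):
dy = y                                # 1. local derivative of the loss
dz = dsigma(z) * dy                   # 2. multiply by incoming sensitivity
dW = np.outer(dz, x)                  # 3. propagate to the parameters
db = dz

# Finite-difference check of one weight entry
eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
Wm = W.copy(); Wm[0, 0] -= eps
loss_p = 0.5 * np.sum(sigma(Wp @ x + b)**2)
loss_m = 0.5 * np.sum(sigma(Wm @ x + b)**2)
dW00_fd = (loss_p - loss_m) / (2*eps)
```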

139.20 Differential Geometry Perspective

Matrix calculus may also be interpreted geometrically.

The derivative of

F:\mathbb{R}^n\to\mathbb{R}^m

at xx is the linear map

DF(x):\mathbb{R}^n\to\mathbb{R}^m

best approximating FF near xx.

The Jacobian matrix is simply a coordinate representation of this linear map.

This viewpoint extends naturally to manifolds and tensor calculus.

139.21 Summary

Matrix calculus extends differentiation to vector and matrix expressions.

The main concepts are:

| Concept | Meaning |
| --- | --- |
| Gradient | First derivative of a scalar function |
| Jacobian | Derivative matrix of a vector function |
| Hessian | Matrix of second derivatives |
| Differential | Linear-approximation notation |
| Chain rule | Composition of derivatives |
| Quadratic-form derivative | (A+A^T)x |
| Least squares gradient | 2A^T(Ax-b) |
| Trace calculus | Matrix-derivative simplification |
| Determinant derivative | Jacobi's formula |
| Inverse derivative | -A^{-1}A'A^{-1} |
| Eigenvalue derivative | Spectral sensitivity |
| Automatic differentiation | Algorithmic chain-rule propagation |

Matrix calculus provides the language for optimization, machine learning, statistics, control theory, and numerical computation. It transforms multivariable differentiation into structured linear algebra.