# Chapter 139. Matrix Calculus

## 139.1 Introduction

Matrix calculus studies derivatives of functions involving vectors and matrices.

It extends ordinary differential calculus to multidimensional linear-algebraic settings. Matrix calculus is fundamental in optimization, statistics, control theory, machine learning, numerical analysis, and differential geometry.

The main objects are:

| Object | Example |
|---|---|
| Scalar-valued functions | \(f(x)\in\mathbb{R}\) |
| Vector-valued functions | \(F(x)\in\mathbb{R}^m\) |
| Matrix-valued functions | \(A(t)\in\mathbb{R}^{m\times n}\) |

Matrix calculus provides rules for gradients, Jacobians, Hessians, and derivatives of matrix expressions.

Modern optimization and machine learning rely heavily on matrix derivatives because objective functions are usually expressed using vectors and matrices.

## 139.2 Scalars, Vectors, and Matrices

Throughout this chapter:

| Symbol | Meaning |
|---|---|
| \(x\in\mathbb{R}^n\) | Column vector |
| \(A\in\mathbb{R}^{m\times n}\) | Matrix |
| \(f(x)\in\mathbb{R}\) | Scalar-valued function |
| \(F(x)\in\mathbb{R}^m\) | Vector-valued function |

A scalar function maps vectors to numbers:

$$
f:\mathbb{R}^n\to\mathbb{R}.
$$

A vector function maps vectors to vectors:

$$
F:\mathbb{R}^n\to\mathbb{R}^m.
$$

The derivative structures depend on the dimensions of the input and output.

## 139.3 Directional Derivatives

Let

$$
f:\mathbb{R}^n\to\mathbb{R}.
$$

The directional derivative of \(f\) at \(x\) in direction \(v\) is

$$
D_v f(x) =
\lim_{t\to0}
\frac{f(x+tv)-f(x)}{t}.
$$

It measures the instantaneous rate of change of \(f\) in direction \(v\).

For differentiable \(f\), the directional derivative is linear in the direction vector:

$$
D_{a v+b w}f =
aD_vf+bD_wf.
$$

Directional derivatives lead naturally to gradients and Jacobians.
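
The following is a minimal numerical sketch of these two facts, using an arbitrary example function \(f(x)=\sin(x_1)\,x_2\) and a simple forward difference quotient (both chosen here purely for illustration).

```python
# Numerical illustration of the limit definition and of linearity in the
# direction, for the example f(x) = sin(x1) * x2.
import numpy as np

def f(x):
    return np.sin(x[0]) * x[1]

def D(f, x, v, t=1e-6):
    # Forward difference quotient (f(x + t v) - f(x)) / t approximating D_v f(x).
    return (f(x + t * v) - f(x)) / t

x = np.array([0.3, 2.0])
v = np.array([1.0, -1.0])
w = np.array([0.5, 4.0])

# Linearity in the direction: D_{2v + 3w} f ~= 2 D_v f + 3 D_w f.
print(D(f, x, 2 * v + 3 * w))
print(2 * D(f, x, v) + 3 * D(f, x, w))
```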

## 139.4 Gradient

For a differentiable scalar function

$$
f:\mathbb{R}^n\to\mathbb{R},
$$

the gradient is the column vector

$$
\nabla f(x) =
\begin{bmatrix}
\frac{\partial f}{\partial x_1}\\
\frac{\partial f}{\partial x_2}\\
\vdots\\
\frac{\partial f}{\partial x_n}
\end{bmatrix}.
$$

The gradient gives the direction of steepest increase.

The directional derivative satisfies

$$
D_vf(x) =
\nabla f(x)^T v.
$$


Thus the gradient converts local change into a linear-algebraic inner product.
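
As a quick sanity check of the identity \(D_vf(x)=\nabla f(x)^Tv\), the sketch below uses the illustrative function \(f(x)=x_1^2+3x_1x_2\), whose gradient is computed by hand.

```python
# Checking D_v f(x) = grad f(x)^T v for f(x) = x1^2 + 3 x1 x2.
import numpy as np

def f(x):
    return x[0]**2 + 3 * x[0] * x[1]

def grad_f(x):
    # Hand-computed gradient: [2 x1 + 3 x2, 3 x1].
    return np.array([2 * x[0] + 3 * x[1], 3 * x[0]])

x = np.array([1.0, -2.0])
v = np.array([0.7, 0.2])
t = 1e-6

numeric = (f(x + t * v) - f(x)) / t   # finite-difference directional derivative
print(numeric, grad_f(x) @ v)         # both close to grad f(x)^T v
```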

## 139.5 Jacobian Matrix

Let

$$
F:\mathbb{R}^n\to\mathbb{R}^m
$$

with components

$$
F=
\begin{bmatrix}
F_1\\
\vdots\\
F_m
\end{bmatrix}.
$$

The Jacobian matrix is

$$
J_F(x) =
\begin{bmatrix}
\frac{\partial F_1}{\partial x_1} & \cdots & \frac{\partial F_1}{\partial x_n}\\
\vdots & \ddots & \vdots\\
\frac{\partial F_m}{\partial x_1} & \cdots & \frac{\partial F_m}{\partial x_n}
\end{bmatrix}.
$$

The Jacobian represents the best linear approximation of \(F\) near \(x\):

$$
F(x+h)
\approx
F(x)+J_F(x)h.
$$

The Jacobian generalizes the single-variable derivative: when \(m=n=1\) it reduces to the ordinary derivative \(f'(x)\).
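
A small numerical sketch of the linear approximation property follows, using the illustrative map \(F(x)=(x_1x_2,\ \sin x_1,\ x_2^2)\) with its Jacobian computed by hand.

```python
# The Jacobian as the best linear approximation: F(x + h) ~= F(x) + J_F(x) h.
import numpy as np

def F(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1]**2])

def J_F(x):
    # Hand-computed 3x2 Jacobian of F.
    return np.array([
        [x[1],         x[0]],
        [np.cos(x[0]), 0.0],
        [0.0,          2 * x[1]],
    ])

x = np.array([0.5, 1.5])
h = np.array([1e-3, -2e-3])

# The two results should agree up to O(||h||^2).
print(F(x + h))
print(F(x) + J_F(x) @ h)
```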

## 139.6 Hessian Matrix

For a twice-differentiable scalar function

$$
f:\mathbb{R}^n\to\mathbb{R},
$$

the Hessian matrix is

$$
H_f(x) =
\left(
\frac{\partial^2 f}{\partial x_i\partial x_j}
\right).
$$

Explicitly,

$$
H_f(x) =
\begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} &
\cdots &
\frac{\partial^2 f}{\partial x_1\partial x_n}\\
\vdots & \ddots & \vdots\\
\frac{\partial^2 f}{\partial x_n\partial x_1} &
\cdots &
\frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}.
$$

If the mixed partial derivatives commute (which holds, by Clairaut's theorem, whenever \(f\) is twice continuously differentiable), then

$$
H_f(x)=H_f(x)^T.
$$

The Hessian describes local curvature.
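
A finite-difference Hessian makes the symmetry visible numerically. The sketch below uses the illustrative function \(f(x)=x_1^2x_2+e^{x_2}\); both the function and the step size are arbitrary choices.

```python
# Finite-difference Hessian for f(x) = x1^2 * x2 + exp(x2); the result is
# (nearly) symmetric, matching H = [[2*x2, 2*x1], [2*x1, exp(x2)]].
import numpy as np

def f(x):
    return x[0]**2 * x[1] + np.exp(x[1])

def hessian(f, x, h=1e-5):
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = np.eye(n)[i], np.eye(n)[j]
            # Central second difference approximating d^2 f / dx_i dx_j.
            H[i, j] = (f(x + h*e_i + h*e_j) - f(x + h*e_i - h*e_j)
                       - f(x - h*e_i + h*e_j) + f(x - h*e_i - h*e_j)) / (4 * h**2)
    return H

x = np.array([1.0, 0.5])
H = hessian(f, x)
print(H)
print(np.allclose(H, H.T, atol=1e-4))   # True
```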

## 139.7 Differential Notation

Matrix calculus is often cleaner using differentials.

Suppose

$$
y=f(x).
$$

The differential is written

$$
dy.
$$

If \(f\) is differentiable, then

$$
dy =
\nabla f(x)^T dx.
$$

For vector functions,

$$
dF =
J_F(x)\,dx.
$$

Differential notation treats derivatives as linear maps acting on infinitesimal increments.

This viewpoint is coordinate-free and especially useful for matrix expressions.

## 139.8 Derivative of Linear Functions

Let

$$
f(x)=a^Tx,
$$

where

$$
a\in\mathbb{R}^n.
$$

Then

$$
\nabla f(x)=a.
$$

The function is already linear, so its derivative is constant.

More generally, for

$$
F(x)=Ax,
$$

the Jacobian is

$$
J_F(x)=A.
$$

Linear maps are their own derivatives.
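
This is easy to confirm numerically: for a linear map the increment \(F(x+h)-F(x)\) equals \(Ah\) exactly. The matrices below are random and purely illustrative.

```python
# For F(x) = A x the Jacobian is A itself: F(x + h) - F(x) = A h exactly.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
h = 1e-6 * rng.standard_normal(4)

print(np.allclose(A @ (x + h) - A @ x, A @ h))   # True
```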

## 139.9 Derivative of Quadratic Forms

Consider the quadratic form

$$
f(x)=x^TAx.
$$

Its differential is

$$
df =
x^TA\,dx
+
(dx)^TAx.
$$

Since \((dx)^TAx\) is a scalar, it equals its own transpose:

$$
(dx)^TAx =
x^TA^Tdx.
$$

Thus

$$
df =
x^TA\,dx
+
x^TA^Tdx
=
x^T(A+A^T)\,dx.
$$

Therefore,

$$
\nabla f(x) =
(A+A^T)x.
$$

If \(A\) is symmetric, this simplifies to

$$
\nabla f(x)=2Ax.
$$


Quadratic forms are central in optimization and statistics.
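
The gradient formula \(\nabla(x^TAx)=(A+A^T)x\) can be verified against finite differences; the random, generally non-symmetric \(A\) below is chosen only for illustration.

```python
# Numerical check of grad(x^T A x) = (A + A^T) x.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

def f(x):
    return x @ A @ x

# Central-difference gradient, component by component.
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(4)])

print(np.allclose(num_grad, (A + A.T) @ x, atol=1e-4))   # True
```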

## 139.10 Least Squares Derivatives

Let

$$
f(x)=\|Ax-b\|_2^2.
$$

Expand:

$$
f(x) =
(Ax-b)^T(Ax-b).
$$

Differentiating gives

$$
\nabla f(x) =
2A^T(Ax-b).
$$

Setting the gradient equal to zero yields the normal equations:

$$
A^TAx=A^Tb.
$$

This derivation is fundamental in least squares theory and machine learning.
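
The sketch below checks the gradient formula against finite differences and solves the normal equations, comparing with NumPy's built-in least-squares routine; the data are random and illustrative.

```python
# Least squares: gradient check and the normal-equation solution.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((10, 3))
b = rng.standard_normal(10)
x = rng.standard_normal(3)

def f(x):
    return np.sum((A @ x - b)**2)

# Central-difference gradient should match 2 A^T (A x - b).
eps = 1e-6
num_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
print(np.allclose(num_grad, 2 * A.T @ (A @ x - b), atol=1e-4))   # True

# Solving the normal equations A^T A x = A^T b gives the least-squares minimizer.
x_star = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(x_star, np.linalg.lstsq(A, b, rcond=None)[0]))  # True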

## 139.11 Chain Rule

Suppose

$$
F:\mathbb{R}^n\to\mathbb{R}^m,
\qquad
G:\mathbb{R}^m\to\mathbb{R}^p.
$$

Then

$$
H(x)=G(F(x)).
$$

The Jacobian satisfies

$$
J_H(x) =
J_G(F(x))\,J_F(x).
$$


Thus derivatives compose by matrix multiplication.

This rule is the foundation of backpropagation in neural networks.
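
A numerical illustration of the composition rule follows; both maps and the finite-difference Jacobian routine are arbitrary choices made here for the sketch.

```python
# Chain-rule check: the Jacobian of G(F(x)) equals J_G(F(x)) @ J_F(x).
import numpy as np

def F(x):                       # R^2 -> R^3
    return np.array([x[0] * x[1], np.sin(x[0]), x[1]**2])

def G(y):                       # R^3 -> R^2
    return np.array([y[0] + y[1] * y[2], np.exp(y[0])])

def jac(fun, x, eps=1e-6):
    # Forward-difference Jacobian, one column per input coordinate.
    fx = fun(x)
    cols = [(fun(x + eps * e) - fx) / eps for e in np.eye(x.size)]
    return np.stack(cols, axis=1)

x = np.array([0.4, -0.7])
lhs = jac(lambda z: G(F(z)), x)
rhs = jac(G, F(x)) @ jac(F, x)
print(np.allclose(lhs, rhs, atol=1e-3))   # True
```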

## 139.12 Matrix-by-Matrix Derivatives

Sometimes the variable itself is a matrix.

Suppose

$$
f(A) =
\operatorname{tr}(BA).
$$

The differential is

$$
df =
\operatorname{tr}(B\,dA).
$$

Thus the derivative with respect to \(A\) is

$$
\frac{\partial f}{\partial A}=B^T.
$$

Matrix derivatives are often expressed using trace identities because traces linearize matrix expressions.
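
The trace derivative can be checked entry by entry; the matrices below are random and serve only as an illustration.

```python
# Component-wise check that d tr(BA) / dA = B^T.
import numpy as np

rng = np.random.default_rng(3)
B = rng.standard_normal((3, 4))
A = rng.standard_normal((4, 3))

def f(A):
    return np.trace(B @ A)

# Central differences with respect to each entry A[i, j].
eps = 1e-6
grad = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A); E[i, j] = eps
        grad[i, j] = (f(A + E) - f(A - E)) / (2 * eps)

print(np.allclose(grad, B.T, atol=1e-6))   # True
```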

## 139.13 Trace Identities

Trace identities are heavily used in matrix calculus.

Important formulas include:

| Identity | Formula |
|---|---|
| Cyclic property | \(\operatorname{tr}(ABC)=\operatorname{tr}(BCA)\) |
| Transpose invariance | \(\operatorname{tr}(A)=\operatorname{tr}(A^T)\) |
| Inner product | \(\langle A,B\rangle=\operatorname{tr}(A^TB)\) |

These identities allow complicated derivatives to be rewritten in manageable form.
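
The first two identities are easy to confirm numerically; the matrix shapes below are arbitrary, chosen only so the products are defined.

```python
# Quick numeric confirmation of the cyclic and transpose trace identities.
import numpy as np

rng = np.random.default_rng(9)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((5, 3))

print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))       # cyclic property
print(np.isclose(np.trace(A @ B @ C), np.trace((A @ B @ C).T)))   # transpose invariance
```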

## 139.14 Derivative of the Determinant

Let

$$
A(t)
$$

be a differentiable family of invertible matrices.

Then

$$
\frac{d}{dt}\det(A) =
\det(A)\operatorname{tr}(A^{-1}A').
$$

This is Jacobi's formula.

Equivalently,

$$
d(\log\det A) =
\operatorname{tr}(A^{-1}dA).
$$

Log-determinants appear frequently in statistics, covariance estimation, and optimization.
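
Jacobi's formula can be checked on the illustrative one-parameter family \(A(t)=A_0+tA_1\), where \(A_0\) is shifted to keep it invertible near the evaluation point (an assumption of this sketch).

```python
# Finite-difference check of Jacobi's formula d/dt det(A) = det(A) tr(A^{-1} A').
import numpy as np

rng = np.random.default_rng(4)
A0 = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # shift keeps A(t) invertible near t
A1 = rng.standard_normal((4, 4))

def A(t):
    return A0 + t * A1

t, h = 0.1, 1e-6
numeric = (np.linalg.det(A(t + h)) - np.linalg.det(A(t - h))) / (2 * h)
jacobi = np.linalg.det(A(t)) * np.trace(np.linalg.solve(A(t), A1))   # A' = A1 here

print(numeric, jacobi)   # the two values should agree closely
```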

## 139.15 Derivative of the Matrix Inverse

Suppose

$$
A=A(t)
$$

is invertible.

Since

$$
AA^{-1}=I,
$$

differentiating gives

$$
A'A^{-1}+A(A^{-1})'=0.
$$

Solving for the derivative:

$$
(A^{-1})' =
-A^{-1}A'A^{-1}.
$$


This identity appears constantly in optimization and sensitivity analysis.
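
A numerical sketch for the same illustrative family \(A(t)=A_0+tA_1\) confirms the formula.

```python
# Check of (A^{-1})' = -A^{-1} A' A^{-1} for A(t) = A0 + t * A1.
import numpy as np

rng = np.random.default_rng(5)
A0 = rng.standard_normal((3, 3)) + 3 * np.eye(3)   # keep A(t) invertible near t
A1 = rng.standard_normal((3, 3))

A    = lambda t: A0 + t * A1
Ainv = lambda t: np.linalg.inv(A(t))

t, h = 0.2, 1e-6
numeric = (Ainv(t + h) - Ainv(t - h)) / (2 * h)   # finite-difference derivative
formula = -Ainv(t) @ A1 @ Ainv(t)                 # A' = A1 here

print(np.allclose(numeric, formula, atol=1e-4))   # True
```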

## 139.16 Derivative of Eigenvalues

Suppose

$$
A(t)
$$

is a differentiable family of symmetric matrices.

Let

$$
A(t)v(t)=\lambda(t)v(t),
$$

with normalized eigenvector

$$
v(t)^Tv(t)=1.
$$

Differentiating gives

$$
\lambda'(t) =
v(t)^T A'(t) v(t).
$$

Thus eigenvalue sensitivity depends on quadratic forms of the perturbation.

This formula is important in perturbation theory and optimization involving spectra.
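
The sensitivity formula can be checked numerically for the largest eigenvalue of an illustrative symmetric family \(A(t)=S_0+tS_1\); for a generic random family this eigenvalue is simple, so the derivative is well defined.

```python
# Check of lambda'(t) = v^T A'(t) v for the symmetric family A(t) = S0 + t*S1.
import numpy as np

rng = np.random.default_rng(6)
M0, M1 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))
S0, S1 = (M0 + M0.T) / 2, (M1 + M1.T) / 2   # symmetrize

def largest_eig(t):
    w, V = np.linalg.eigh(S0 + t * S1)
    return w[-1], V[:, -1]                   # eigh sorts eigenvalues ascending

t, h = 0.0, 1e-6
lam_plus, _  = largest_eig(t + h)
lam_minus, _ = largest_eig(t - h)
lam, v       = largest_eig(t)

print((lam_plus - lam_minus) / (2 * h))      # numeric lambda'(t)
print(v @ S1 @ v)                            # v^T A'(t) v, with A' = S1
```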

## 139.17 Matrix Exponential Derivatives

The matrix exponential is

$$
e^A =
\sum_{k=0}^{\infty}\frac{A^k}{k!}.
$$

If

$$
A=A(t),
$$

then differentiating the series term by term is more complicated because \(A(t)\) and \(A'(t)\) need not commute.

If

$$
A'A=AA',
$$

then

$$
\frac{d}{dt}e^{A(t)} =
e^{A(t)}A'(t).
$$

Without commutativity, integral formulas are needed.

Matrix exponentials appear in differential equations and control theory.
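
The commuting case is easy to check for the illustrative family \(A(t)=tA_0\), where \(A\) and \(A'=A_0\) commute; the sketch assumes SciPy is available for the matrix exponential.

```python
# Check of d/dt e^{A(t)} = e^{A(t)} A'(t) when A(t) = t * A0 (commuting case).
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(7)
A0 = rng.standard_normal((3, 3))

A = lambda t: t * A0

t, h = 0.5, 1e-6
numeric = (expm(A(t + h)) - expm(A(t - h))) / (2 * h)   # finite-difference derivative
formula = expm(A(t)) @ A0

print(np.allclose(numeric, formula, atol=1e-4))   # True
```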

## 139.18 Automatic Differentiation

Automatic differentiation computes derivatives algorithmically using the chain rule.

It is neither symbolic differentiation nor finite-difference approximation.

Instead, computations are decomposed into elementary operations, and derivatives propagate through the computation graph.

Two major modes are:

| Mode | Efficient when |
|---|---|
| Forward mode | Few inputs relative to outputs |
| Reverse mode | Few outputs relative to inputs |

Reverse-mode automatic differentiation underlies backpropagation in deep learning.

Matrix calculus provides the mathematical foundation for these algorithms.
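
As a toy illustration of forward mode (not a production autodiff system), the sketch below propagates derivatives through a computation using dual numbers; the class and example function are invented here for demonstration.

```python
# Minimal forward-mode automatic differentiation via dual numbers.
class Dual:
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot     # value and derivative part
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule propagated through each elementary operation.
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

def f(x, y):
    return x * x + x * y                  # f(x, y) = x^2 + x y

# Seed dx/dx = 1 to obtain df/dx at (x, y) = (3, 2): expect 2*3 + 2 = 8.
x, y = Dual(3.0, 1.0), Dual(2.0, 0.0)
print(f(x, y).dot)   # 8.0
```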

## 139.19 Backpropagation

Neural networks are compositions of matrix operations and nonlinearities.

A layer often has the form

$$
x_{k+1} =
\sigma(W_kx_k+b_k).
$$

The loss function depends on the final output.

Backpropagation computes gradients efficiently by repeatedly applying the chain rule backward through the network.

At each stage:

1. Compute local derivatives,
2. Multiply by incoming sensitivities,
3. Propagate backward.

This process is fundamentally matrix calculus.
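
A minimal manual backpropagation sketch for a two-layer network with a squared-error loss is shown below; the layer sizes, tanh nonlinearity, and random initial values are illustrative assumptions.

```python
# Manual backpropagation through x1 = tanh(W1 x + b1), x2 = W2 x1 + b2.
import numpy as np

rng = np.random.default_rng(8)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(4)
W2, b2 = rng.standard_normal((2, 4)), np.zeros(2)
x, y   = rng.standard_normal(3), rng.standard_normal(2)

# Forward pass.
z1 = W1 @ x + b1
x1 = np.tanh(z1)
x2 = W2 @ x1 + b2
loss = 0.5 * np.sum((x2 - y)**2)

# Backward pass: local derivatives times incoming sensitivities, layer by layer.
d_x2 = x2 - y                        # dL/dx2
d_W2 = np.outer(d_x2, x1)            # dL/dW2
d_x1 = W2.T @ d_x2                   # propagate sensitivity backward
d_z1 = d_x1 * (1 - np.tanh(z1)**2)   # through the tanh nonlinearity
d_W1 = np.outer(d_z1, x)             # dL/dW1

# Finite-difference check of one entry of dL/dW1.
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
loss_p = 0.5 * np.sum((W2 @ np.tanh(W1p @ x + b1) + b2 - y)**2)
print((loss_p - loss) / eps, d_W1[0, 0])   # should agree closely
```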

## 139.20 Differential Geometry Perspective

Matrix calculus may also be interpreted geometrically.

The derivative of

$$
F:\mathbb{R}^n\to\mathbb{R}^m
$$

at \(x\) is the linear map

$$
DF(x):\mathbb{R}^n\to\mathbb{R}^m
$$

best approximating \(F\) near \(x\).

The Jacobian matrix is simply a coordinate representation of this linear map.

This viewpoint extends naturally to manifolds and tensor calculus.

## 139.21 Summary

Matrix calculus extends differentiation to vector and matrix expressions.

The main concepts are:

| Concept | Meaning |
|---|---|
| Gradient | First derivative of scalar function |
| Jacobian | Derivative matrix of vector function |
| Hessian | Matrix of second derivatives |
| Differential | Linear approximation notation |
| Chain rule | Composition of derivatives |
| Quadratic-form derivative | \((A+A^T)x\) |
| Least squares gradient | \(2A^T(Ax-b)\) |
| Trace calculus | Matrix derivative simplification |
| Determinant derivative | Jacobi formula |
| Inverse derivative | \(-A^{-1}A'A^{-1}\) |
| Eigenvalue derivative | Spectral sensitivity |
| Automatic differentiation | Algorithmic chain-rule propagation |

Matrix calculus provides the language for optimization, machine learning, statistics, control theory, and numerical computation. It transforms multivariable differentiation into structured linear algebra.
