Broadcasting Semantics

Broadcasting is the rule system that allows tensor operations between arrays of different shapes without explicitly materializing expanded copies. It is one of the most important structural features in modern tensor systems. It is also one of the most common sources of gradient bugs.

From a mathematical perspective, broadcasting defines an implicit linear replication operator. From a systems perspective, broadcasting defines a virtual tensor view with repeated values along selected axes.

Automatic differentiation must preserve both interpretations exactly.

Motivation

Consider the operation

Y = X + b,

where

X \in \mathbb{R}^{N \times d}, \qquad b \in \mathbb{R}^{d}.

This operation is interpreted as

Y_{ij} = X_{ij} + b_j.

The vector b is conceptually expanded across the batch axis:

b \to \begin{bmatrix} b \\ b \\ \vdots \\ b \end{bmatrix}.

A naive implementation could allocate the expanded tensor explicitly. Broadcasting avoids this allocation. The runtime instead behaves as if the tensor had been replicated.

The reverse pass must then accumulate all replicated contributions back into the original tensor.

This accumulation rule is the essential semantic property of broadcasting.

Shape Compatibility

Most tensor systems use right-aligned broadcasting rules.

Let two tensors have shapes

(a_1, \ldots, a_k)

and

(b_1, \ldots, b_l).

Align dimensions from the right:

a1 a2 ... ak
      b1 ... bl

A pair of dimensions is compatible if:

  1. The dimensions are equal.
  2. One dimension is 1.

The resulting dimension is the maximum of the two.

Example:

(8, 1, 6, 1)

broadcast with

(7, 1, 5)

aligns as

8 1 6 1
  7 1 5

Result shape:

(8, 7, 6, 5).

Dimension-by-dimension:

| Axis | Left | Right      | Result |
|------|------|------------|--------|
| 1    | 8    | implicit 1 | 8      |
| 2    | 1    | 7          | 7      |
| 3    | 6    | 1          | 6      |
| 4    | 1    | 5          | 5      |

If neither dimension equals 1 and they differ, broadcasting fails.

For example:

(3, 4)

and

(5, 4)

are incompatible.
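The right-aligned rule above can be sketched as a small shape-inference function. This is a minimal illustration, and `broadcast_shape` is an illustrative name rather than any library's API:

```python
def broadcast_shape(a, b):
    """Right-aligned broadcast: return the result shape or raise."""
    out = []
    for i in range(max(len(a), len(b))):
        # Dimensions missing on the left are treated as size 1.
        da = a[len(a) - 1 - i] if i < len(a) else 1
        db = b[len(b) - 1 - i] if i < len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"incompatible dimensions {da} and {db}")
        out.append(max(da, db))
    return tuple(reversed(out))

print(broadcast_shape((8, 1, 6, 1), (7, 1, 5)))  # (8, 7, 6, 5)
```

Passing the incompatible pair (3, 4) and (5, 4) raises a `ValueError`, matching the rule stated above.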

Broadcasting as a Linear Operator

Broadcasting is not just a convenience syntax. It is a linear map.

Suppose

x \in \mathbb{R}^d.

Broadcast x across N rows:

B(x) = \begin{bmatrix} x \\ x \\ \vdots \\ x \end{bmatrix} \in \mathbb{R}^{N \times d}.

This operator is linear:

B(\alpha x + \beta y) = \alpha B(x) + \beta B(y).

The adjoint operator is reduction by summation:

B^T(Y) = \sum_{i=1}^{N} Y_i.

This explains the reverse rule immediately.

Forward:

Y = B(x).

Reverse:

\bar{x} = B^T(\bar{Y}) = \sum_i \bar{Y}_i.

Broadcasting and reduction are adjoint operations.

This relationship is fundamental.
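The adjoint pairing can be checked numerically through its defining identity, ⟨B(x), Y⟩ = ⟨x, Bᵀ(Y)⟩. A NumPy sketch with arbitrary random values:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 3
x = rng.standard_normal(d)
Y = rng.standard_normal((N, d))

Bx = np.broadcast_to(x, (N, d))  # forward: replicate x across N rows (a view)
BtY = Y.sum(axis=0)              # adjoint: sum over the broadcast axis

# Defining property of the adjoint: <B(x), Y> == <x, B^T(Y)>
assert np.isclose((Bx * Y).sum(), x @ BtY)
```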

Forward Differential of Broadcast Operations

Suppose

Y = X + b,

with

X \in \mathbb{R}^{N \times d}, \qquad b \in \mathbb{R}^{d}.

The indexed form is

Y_{ij} = X_{ij} + b_j.

Differentiate:

dY_{ij} = dX_{ij} + db_j.

The perturbation db_j is automatically broadcast across rows.

In tensor notation:

dY = dX + \operatorname{broadcast}(db).

The local Jacobian therefore contains repeated structure.

Reverse-Mode Rule

Let

\bar{Y} \in \mathbb{R}^{N \times d}

be the output adjoint.

We compute contributions to X and b.

Since

Y_{ij} = X_{ij} + b_j,

we have

\frac{\partial Y_{ij}}{\partial X_{ij}} = 1,

so

\bar{X}_{ij} \mathrel{+}= \bar{Y}_{ij}.

For the bias:

\frac{\partial Y_{ij}}{\partial b_j} = 1.

Each b_j affects every row. Therefore:

\bar{b}_j \mathrel{+}= \sum_{i=1}^{N} \bar{Y}_{ij}.

Vector form:

\bar{b} = \operatorname{reduce\_sum}(\bar{Y}, \text{axis}=0).

This is the standard bias gradient rule in neural networks.
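The rule can be spot-checked with finite differences on the scalar loss L = Σᵢⱼ Ȳᵢⱼ (Xᵢⱼ + bⱼ), whose gradient with respect to b is exactly the claimed column sum. A sketch, not framework code:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 4, 3
X = rng.standard_normal((N, d))
b = rng.standard_normal(d)
Y_bar = rng.standard_normal((N, d))   # an arbitrary output adjoint

loss = lambda b_: (Y_bar * (X + b_)).sum()
b_bar = Y_bar.sum(axis=0)             # the claimed reverse rule

eps = 1e-6
for j in range(d):
    bp = b.copy()
    bp[j] += eps
    numeric = (loss(bp) - loss(b)) / eps   # forward difference in b_j
    assert np.isclose(numeric, b_bar[j], atol=1e-4)
```

Because the loss is linear in b, the finite difference agrees with the analytic gradient up to floating-point roundoff.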

General Broadcasting Rule

Suppose an input tensor X is broadcast into an output tensor Y.

The reverse rule is:

\bar{X} = \operatorname{reduce\_sum}(\bar{Y}, \text{broadcasted axes}).

Additionally, axes introduced implicitly must also be reduced.

For example:

X : (d), \qquad Y : (N, d).

Axis 0 was introduced during broadcasting, so the reverse rule reduces over axis 0.

Example:

X : (1, d), \qquad Y : (N, d).

Axis 0 had size 1 and was expanded to size N, so the reverse rule again reduces over axis 0.

More generally:

| Forward Expansion     | Reverse Reduction         |
|-----------------------|---------------------------|
| Missing axis inserted | Reduce over inserted axis |
| Axis size 1 → n       | Reduce over expanded axis |
| Axis unchanged        | No reduction              |
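The table can be read off as a small helper that decides which output axes to sum over. A sketch of what a tape might record; the name `reduction_axes` is illustrative, not a library API:

```python
def reduction_axes(in_shape, out_shape):
    """Output axes to sum over when un-broadcasting an adjoint."""
    n_inserted = len(out_shape) - len(in_shape)
    axes = list(range(n_inserted))           # axes inserted on the left
    for i, din in enumerate(in_shape):
        if din == 1 and out_shape[n_inserted + i] > 1:
            axes.append(n_inserted + i)      # axes expanded from size 1
    return tuple(axes)

# One example per table row:
assert reduction_axes((5,), (3, 5)) == (0,)    # missing axis inserted
assert reduction_axes((1, 5), (3, 5)) == (0,)  # axis size 1 -> n
assert reduction_axes((3, 5), (3, 5)) == ()    # axis unchanged
```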

Broadcasting and Elementwise Multiplication

Consider

Y = X \odot b,

where

X \in \mathbb{R}^{N \times d}, \qquad b \in \mathbb{R}^{d}.

Indexed form:

Y_{ij} = X_{ij} b_j.

Differentiate:

dY_{ij} = b_j \, dX_{ij} + X_{ij} \, db_j.

Reverse rules:

\bar{X}_{ij} \mathrel{+}= \bar{Y}_{ij} b_j, \qquad \bar{b}_j \mathrel{+}= \sum_i \bar{Y}_{ij} X_{ij}.

Tensor form:

\bar{X} \mathrel{+}= \bar{Y} \odot b, \qquad \bar{b} \mathrel{+}= \operatorname{reduce\_sum}(\bar{Y} \odot X, \text{axis}=0).

This pattern appears in layer normalization, attention scaling, gating mechanisms, and feature-wise affine transforms.
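Both rules can be verified against the forward differential at once: for any perturbations dX and db, the adjoints must satisfy ⟨X̄, dX⟩ + ⟨b̄, db⟩ = ⟨Ȳ, dY⟩. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 4, 3
X, dX = rng.standard_normal((2, N, d))   # primal value and perturbation
b, db = rng.standard_normal((2, d))
Y_bar = rng.standard_normal((N, d))

X_bar = Y_bar * b                   # shape (N, d): no reduction needed
b_bar = (Y_bar * X).sum(axis=0)     # reduce over the broadcast axis

dY = b * dX + X * db                # forward differential from the text
# Adjoint consistency: <X_bar, dX> + <b_bar, db> == <Y_bar, dY>
assert np.isclose((X_bar * dX).sum() + b_bar @ db, (Y_bar * dY).sum())
```

The identity holds exactly (up to roundoff) because the differential is linear in (dX, db).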

Broadcasting as Stride Manipulation

Most runtimes do not allocate broadcasted tensors.

Instead, broadcasting is represented using stride metadata.

Suppose

x \in \mathbb{R}^{d}

is broadcast to

Y \in \mathbb{R}^{N \times d}.

The runtime may assign a stride of zero along the broadcasted axis.

Conceptually:

shape   = (N, d)
strides = (0, 1)

This means:

Y_{ij}

always reads from

x_j.

Every row references the same memory location.

This is efficient, but it creates an important reverse-mode requirement:

adjoints must accumulate.

If multiple outputs map to the same storage location, gradients cannot overwrite each other.

They must sum.
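NumPy exposes this representation directly: `np.broadcast_to` returns a zero-stride view rather than a copy.

```python
import numpy as np

x = np.arange(3.0)               # shape (3,), strides (8,) for float64
Y = np.broadcast_to(x, (4, 3))   # virtual (4, 3) tensor, nothing allocated

print(Y.strides)                 # (0, 8): stepping along axis 0 goes nowhere
assert Y.strides[0] == 0
assert np.shares_memory(x, Y)    # all four rows alias x's storage
# The view is read-only precisely because one write would alias many elements.
assert not Y.flags.writeable
```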

Aliasing and Accumulation

Broadcast views create aliasing.

Example:

x = [1, 2, 3]

broadcast to shape

(4, 3).

All four rows refer to the same storage.

During reverse mode:

dx = zeros_like(x)
for i in range(4):       # one contribution per replicated row
    dx += dY[i]

The accumulation is mathematically required because the same variable contributed to multiple outputs.

This explains why in-place updates on broadcasted tensors are often forbidden or heavily restricted. A single write would ambiguously affect many virtual tensor elements.

Broadcasting and Jacobians

Broadcasting creates structured Jacobians with repeated rows or repeated blocks.

Example:

y = B(x)

with

x \in \mathbb{R}^d, \qquad y \in \mathbb{R}^{N \times d}.

Flattening tensors into vectors, the Jacobian has the form

J_B = \begin{bmatrix} I \\ I \\ \vdots \\ I \end{bmatrix}.

The transpose is

J_B^T = \begin{bmatrix} I & I & \cdots & I \end{bmatrix}.

Therefore reverse mode computes:

J_B^T \bar{y} = \sum_i \bar{y}_i.

Again, reverse broadcasting becomes reduction.
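On a small example the stacked-identity Jacobian can be materialized explicitly to confirm the reduction (illustration only; real systems never build this matrix):

```python
import numpy as np

N, d = 4, 3
J_B = np.vstack([np.eye(d)] * N)   # (N*d, d): N stacked identity blocks

y_bar = np.random.default_rng(3).standard_normal((N, d))
# Row-major flattening gives vec(y_bar) = [row_1, ..., row_N], so
# J_B^T @ vec(y_bar) is exactly the sum of the rows of y_bar.
assert np.allclose(J_B.T @ y_bar.reshape(-1), y_bar.sum(axis=0))
```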

Broadcast Semantics in Deep Learning

Broadcasting appears everywhere in deep learning systems.

Bias Addition

Y = XW + b.

The bias b broadcasts across the batch dimension.

Reverse:

\bar{b} = \sum_i \bar{Y}_i.

Layer Normalization

Affine parameters:

Y = \gamma \odot \hat{X} + \beta.

Parameters γ and β broadcast across the batch and sequence dimensions.

Attention Scaling

Attention logits:

S = \frac{Q K^T}{\sqrt{d_k}} + M.

The mask tensor M may broadcast across heads or batches.

Residual Connections

A lower-rank tensor may broadcast across multiple dimensions.

Shape semantics determine the correct reduction axes during backpropagation.

Reduction as the Adjoint of Broadcast

This duality is important enough to state explicitly.

Let

B : V \to W

be a broadcast operator.

Its adjoint is

B^T : W \to V.

The adjoint operation is reduction.

Forward:

| Operation | Effect           |
|-----------|------------------|
| Broadcast | Replicate values |

Reverse:

| Operation | Effect                         |
|-----------|--------------------------------|
| Reduction | Accumulate replicated adjoints |

This pattern appears throughout AD:

| Forward   | Reverse     |
|-----------|-------------|
| Broadcast | Reduce      |
| Gather    | Scatter-add |
| Expand    | Sum         |
| Replicate | Accumulate  |

Many reverse-mode rules are adjoints of structural tensor operations.

Shape Inference

A broadcast-aware AD engine must track:

  • input shapes
  • output shapes
  • broadcasted axes
  • inserted dimensions
  • reduction semantics

Without this metadata, reverse reduction cannot be reconstructed correctly.

Example:

y = x + b

may involve:

x.shape = (32, 128, 256)
b.shape = (256,)

The reverse rule must reduce over axes:

(0, 1).

The engine cannot infer this only from output gradients. It must preserve broadcast metadata from the forward pass.

Broadcasting Failures

Broadcasting can silently produce incorrect programs if shapes are accidentally compatible.

Example:

x.shape = (32, 128)
y.shape = (128,)

Addition succeeds.

But if the programmer intended:

y.shape = (32, 128)

the program still runs, but semantics change.

This is one reason many large systems increasingly use:

  • named tensors
  • shape typing
  • dimension labels
  • compile-time shape checking

Shape-safe tensor systems reduce silent broadcast bugs.
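A minimal reproduction of the hazard; the stated intent (a per-example `y`) is hypothetical:

```python
import numpy as np

x = np.zeros((32, 128))
y = np.ones(128)    # programmer meant a per-example tensor of shape (32, 128)

z = x + y           # broadcasts silently: every row receives the same y
assert z.shape == (32, 128)   # the shapes "work", so no error is raised
# The bug surfaces only as wrong numbers, and as a wrong gradient:
# y's adjoint is now reduced over axis 0, which the intended program
# would never do.
```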

Broadcast Fusion

Compilers often fuse broadcast operations into downstream kernels.

Instead of:

tmp = broadcast(b)
y = x + tmp

the kernel computes:

y[i,j] = x[i,j] + b[j]

without allocating the expanded tensor.

This optimization is critical for GPU efficiency. Materializing broadcasted tensors can increase memory traffic dramatically.

Reverse-mode kernels similarly fuse reduction logic into gradient kernels.

Numerical Stability

Broadcasting itself is numerically exact. However, reductions in reverse mode may accumulate large sums:

\bar{x} = \sum_i \bar{y}_i.

Parallel reductions introduce:

  • non-associativity
  • floating-point order dependence
  • nondeterminism

Different reduction trees may produce slightly different gradients.

Large distributed systems often trade exact reproducibility for throughput.
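Order dependence is easy to observe in float32. In the sketch below the two sums are mathematically equal but may differ in the low bits, depending on platform and data:

```python
import numpy as np

rng = np.random.default_rng(4)
v = rng.standard_normal(100_000).astype(np.float32)

s_fwd = v.sum()          # one accumulation order
s_rev = v[::-1].sum()    # same values, reversed order
# Equal in exact arithmetic; in float32 the results can disagree slightly.
print(abs(float(s_fwd) - float(s_rev)))
assert np.isclose(s_fwd, s_rev, atol=1e-2)
```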

Formal Rule

The general reverse rule for broadcasting is:

  1. Align input shape with output shape.
  2. Identify axes where:
    • dimensions were inserted, or
    • input size was 1 and output size exceeded 1.
  3. Sum over those axes.
  4. Reshape to original input shape.

Symbolically:

\bar{X} = \operatorname{reshape}\left( \operatorname{reduce\_sum}(\bar{Y}, \text{broadcast axes}), \operatorname{shape}(X) \right).

This rule is implemented in nearly every tensor AD framework.
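A compact NumPy sketch of the four steps, in the style of the "unbroadcast" helper found in minimal AD implementations (the name is informal):

```python
import numpy as np

def unbroadcast(grad, shape):
    """Reduce an output adjoint `grad` back to the input `shape`."""
    # Steps 1-3 for inserted axes: sum over dimensions added on the left.
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Steps 2-3 for expanded axes: sum where the input had size 1.
    for axis, size in enumerate(shape):
        if size == 1 and grad.shape[axis] > 1:
            grad = grad.sum(axis=axis, keepdims=True)
    # Step 4: the shapes now already match, so reshape is a no-op.
    return grad.reshape(shape)

Y_bar = np.ones((32, 128, 256))
b_bar = unbroadcast(Y_bar, (256,))
assert b_bar.shape == (256,)
assert b_bar[0] == 32 * 128   # every replicated contribution accumulated
```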

Summary

Broadcasting defines implicit tensor replication without materialized copies. The reverse-mode rule is reduction across broadcasted dimensions.

The key principles are:

| Concept                   | Reverse Interpretation |
|---------------------------|------------------------|
| Replicated forward values | Summed adjoints        |
| Broadcast operator        | Reduction adjoint      |
| Zero-stride views         | Gradient accumulation  |
| Shape expansion           | Axis reduction         |

Broadcasting appears simple at the API level, but it imposes strict structural rules on gradient propagation, memory aliasing, and tensor layout. A correct AD engine must treat broadcasting as a first-class semantic operation, not merely a convenience syntax.