Broadcasting Semantics

Broadcasting is the rule system that allows tensor operations between arrays of different shapes without explicitly materializing expanded copies. It is one of the most important structural features in modern tensor systems. It is also one of the most common sources of gradient bugs.

From a mathematical perspective, broadcasting defines an implicit linear replication operator. From a systems perspective, broadcasting defines a virtual tensor view with repeated values along selected axes.

Automatic differentiation must preserve both interpretations exactly.

Motivation

Consider the operation

Y = X + b,

where

X \in \mathbb{R}^{N \times d}, \qquad b \in \mathbb{R}^{d}.

This operation is interpreted as

Y_{ij} = X_{ij} + b_j.

The vector b is conceptually expanded across the batch axis:

b \to \begin{bmatrix} b \\ b \\ \vdots \\ b \end{bmatrix}.

A naive implementation could allocate the expanded tensor explicitly. Broadcasting avoids this allocation. The runtime instead behaves as if the tensor had been replicated.

The reverse pass must then accumulate all replicated contributions back into the original tensor.

This accumulation rule is the essential semantic property of broadcasting.

Shape Compatibility

Most tensor systems use right-aligned broadcasting rules.

Let two tensors have shapes

(a_1, \ldots, a_k)

and

(b_1, \ldots, b_l).

Align dimensions from the right:

a1 a2 ... ak
      b1 ... bl

A pair of dimensions is compatible if:

  1. The dimensions are equal.
  2. One dimension is 1.

The resulting dimension is the maximum of the two.

Example:

(8, 1, 6, 1)

broadcast with

(7, 1, 5)

aligns as

8 1 6 1
  7 1 5

Result shape:

(8, 7, 6, 5).

Dimension-by-dimension:

| Axis | Left | Right      | Result |
|------|------|------------|--------|
| 1    | 8    | implicit 1 | 8      |
| 2    | 1    | 7          | 7      |
| 3    | 6    | 1          | 6      |
| 4    | 1    | 5          | 5      |

If neither dimension equals 1 and they differ, broadcasting fails.

For example:

(3, 4)

and

(5, 4)

are incompatible.
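The right-aligned rule above can be sketched as a small shape-inference function. This is a minimal illustration, and `broadcast_shape` is an illustrative name rather than any library's API:

```python
def broadcast_shape(a, b):
    """Right-aligned broadcast: return the result shape or raise."""
    out = []
    for i in range(max(len(a), len(b))):
        # Dimensions missing on the left are treated as size 1.
        da = a[len(a) - 1 - i] if i < len(a) else 1
        db = b[len(b) - 1 - i] if i < len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"incompatible dimensions {da} and {db}")
        out.append(max(da, db))
    return tuple(reversed(out))

print(broadcast_shape((8, 1, 6, 1), (7, 1, 5)))  # (8, 7, 6, 5)
```

Passing the incompatible pair (3, 4) and (5, 4) raises a `ValueError`, matching the rule stated above.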

Broadcasting as a Linear Operator

Broadcasting is not just a convenience syntax. It is a linear map.

Suppose

x \in \mathbb{R}^d.

Broadcast x across N rows:

B(x) = \begin{bmatrix} x \\ x \\ \vdots \\ x \end{bmatrix} \in \mathbb{R}^{N \times d}.

This operator is linear:

B(\alpha x + \beta y) = \alpha B(x) + \beta B(y).

The adjoint operator is reduction by summation:

B^T(Y) = \sum_{i=1}^{N} Y_i.

This explains the reverse rule immediately.

Forward:

Y = B(x).

Reverse:

\bar{x} = B^T(\bar{Y}) = \sum_i \bar{Y}_i.

Broadcasting and reduction are adjoint operations.

This relationship is fundamental.
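The adjoint pairing can be checked numerically through its defining identity, ⟨B(x), Y⟩ = ⟨x, Bᵀ(Y)⟩. A NumPy sketch with arbitrary random values:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 3
x = rng.standard_normal(d)
Y = rng.standard_normal((N, d))

Bx = np.broadcast_to(x, (N, d))  # forward: replicate x across N rows (a view)
BtY = Y.sum(axis=0)              # adjoint: sum over the broadcast axis

# Defining property of the adjoint: <B(x), Y> == <x, B^T(Y)>
assert np.isclose((Bx * Y).sum(), x @ BtY)
```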

Forward Differential of Broadcast Operations

Suppose

Y = X + b,

with

X \in \mathbb{R}^{N \times d}, \qquad b \in \mathbb{R}^{d}.

The indexed form is

Y_{ij} = X_{ij} + b_j.

Differentiate:

dY_{ij} = dX_{ij} + db_j.

The perturbation db_j is automatically broadcast across rows.

In tensor notation:

dY = dX + \operatorname{broadcast}(db).

The local Jacobian therefore contains repeated structure.

Reverse-Mode Rule

Let

\bar{Y} \in \mathbb{R}^{N \times d}

be the output adjoint.

We compute contributions to X and b.

Since

Y_{ij} = X_{ij} + b_j,

we have

\frac{\partial Y_{ij}}{\partial X_{ij}} = 1,

so

\bar{X}_{ij} \mathrel{+}= \bar{Y}_{ij}.

For the bias:

\frac{\partial Y_{ij}}{\partial b_j} = 1.

Each b_j affects every row. Therefore:

\bar{b}_j \mathrel{+}= \sum_{i=1}^{N} \bar{Y}_{ij}.

Vector form:

\bar{b} = \operatorname{reduce\_sum}(\bar{Y}, \text{axis}=0).

This is the standard bias gradient rule in neural networks.
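The rule can be spot-checked with finite differences on the scalar loss L = Σᵢⱼ Ȳᵢⱼ (Xᵢⱼ + bⱼ), whose gradient with respect to b is exactly the claimed column sum. A sketch, not framework code:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 4, 3
X = rng.standard_normal((N, d))
b = rng.standard_normal(d)
Y_bar = rng.standard_normal((N, d))   # an arbitrary output adjoint

loss = lambda b_: (Y_bar * (X + b_)).sum()
b_bar = Y_bar.sum(axis=0)             # the claimed reverse rule

eps = 1e-6
for j in range(d):
    bp = b.copy()
    bp[j] += eps
    numeric = (loss(bp) - loss(b)) / eps   # forward difference in b_j
    assert np.isclose(numeric, b_bar[j], atol=1e-4)
```

Because the loss is linear in b, the finite difference agrees with the analytic gradient up to floating-point roundoff.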

General Broadcasting Rule

Suppose an input tensor X is broadcast into an output tensor Y.

The reverse rule is:

\bar{X} = \operatorname{reduce\_sum}(\bar{Y}, \text{broadcasted axes}).

Additionally, axes introduced implicitly must also be reduced.

For example:

X : (d), \qquad Y : (N, d).

Axis 0 was introduced during broadcasting, so the reverse rule reduces over axis 0.

Example:

X : (1, d), \qquad Y : (N, d).

Axis 0 had size 1 and was expanded to size N, so the reverse rule again reduces over axis 0.

More generally:

| Forward Expansion     | Reverse Reduction         |
|-----------------------|---------------------------|
| Missing axis inserted | Reduce over inserted axis |
| Axis size 1 → n       | Reduce over expanded axis |
| Axis unchanged        | No reduction              |
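The table can be read off as a small helper that decides which output axes to sum over. A sketch of what a tape might record; the name `reduction_axes` is illustrative, not a library API:

```python
def reduction_axes(in_shape, out_shape):
    """Output axes to sum over when un-broadcasting an adjoint."""
    n_inserted = len(out_shape) - len(in_shape)
    axes = list(range(n_inserted))           # axes inserted on the left
    for i, din in enumerate(in_shape):
        if din == 1 and out_shape[n_inserted + i] > 1:
            axes.append(n_inserted + i)      # axes expanded from size 1
    return tuple(axes)

# One example per table row:
assert reduction_axes((5,), (3, 5)) == (0,)    # missing axis inserted
assert reduction_axes((1, 5), (3, 5)) == (0,)  # axis size 1 -> n
assert reduction_axes((3, 5), (3, 5)) == ()    # axis unchanged
```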

Broadcasting and Elementwise Multiplication

Consider

Y = X \odot b,

where

X \in \mathbb{R}^{N \times d}, \qquad b \in \mathbb{R}^{d}.

Indexed form:

Y_{ij} = X_{ij} b_j.

Differentiate:

dY_{ij} = b_j \, dX_{ij} + X_{ij} \, db_j.

Reverse rules:

\bar{X}_{ij} \mathrel{+}= \bar{Y}_{ij} b_j, \qquad \bar{b}_j \mathrel{+}= \sum_i \bar{Y}_{ij} X_{ij}.

Tensor form:

\bar{X} \mathrel{+}= \bar{Y} \odot b, \qquad \bar{b} \mathrel{+}= \operatorname{reduce\_sum}(\bar{Y} \odot X, \text{axis}=0).

This pattern appears in layer normalization, attention scaling, gating mechanisms, and feature-wise affine transforms.
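Both rules can be verified against the forward differential at once: for any perturbations dX and db, the adjoints must satisfy ⟨X̄, dX⟩ + ⟨b̄, db⟩ = ⟨Ȳ, dY⟩. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 4, 3
X, dX = rng.standard_normal((2, N, d))   # primal value and perturbation
b, db = rng.standard_normal((2, d))
Y_bar = rng.standard_normal((N, d))

X_bar = Y_bar * b                   # shape (N, d): no reduction needed
b_bar = (Y_bar * X).sum(axis=0)     # reduce over the broadcast axis

dY = b * dX + X * db                # forward differential from the text
# Adjoint consistency: <X_bar, dX> + <b_bar, db> == <Y_bar, dY>
assert np.isclose((X_bar * dX).sum() + b_bar @ db, (Y_bar * dY).sum())
```

The identity holds exactly (up to roundoff) because the differential is linear in (dX, db).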

Broadcasting as Stride Manipulation

Most runtimes do not allocate broadcasted tensors.

Instead, broadcasting is represented using stride metadata.

Suppose

x \in \mathbb{R}^{d}

is broadcast to

Y \in \mathbb{R}^{N \times d}.

The runtime may assign a stride of zero along the broadcasted axis.

Conceptually:

shape   = (N, d)
strides = (0, 1)

This means:

Y_{ij}

always reads from

x_j.

Every row references the same memory location.

This is efficient, but it creates an important reverse-mode requirement:

adjoints must accumulate.

If multiple outputs map to the same storage location, gradients cannot overwrite each other.

They must sum.
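NumPy exposes this representation directly: `np.broadcast_to` returns a zero-stride view rather than a copy.

```python
import numpy as np

x = np.arange(3.0)               # shape (3,), strides (8,) for float64
Y = np.broadcast_to(x, (4, 3))   # virtual (4, 3) tensor, nothing allocated

print(Y.strides)                 # (0, 8): stepping along axis 0 goes nowhere
assert Y.strides[0] == 0
assert np.shares_memory(x, Y)    # all four rows alias x's storage
# The view is read-only precisely because one write would alias many elements.
assert not Y.flags.writeable
```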

Aliasing and Accumulation

Broadcast views create aliasing.

Example:

x = [1, 2, 3]

broadcast to shape

(4, 3).

All four rows refer to the same storage.

During reverse mode:

dx = zeros_like(x)
for i in range(4):       # one contribution per replicated row
    dx += dY[i]

The accumulation is mathematically required because the same variable contributed to multiple outputs.

This explains why in-place updates on broadcasted tensors are often forbidden or heavily restricted. A single write would ambiguously affect many virtual tensor elements.

Broadcasting and Jacobians

Broadcasting creates structured Jacobians with repeated rows or repeated blocks.

Example:

y = B(x)

with

x \in \mathbb{R}^d, \qquad y \in \mathbb{R}^{N \times d}.

Flattening tensors into vectors, the Jacobian has the form

J_B = \begin{bmatrix} I \\ I \\ \vdots \\ I \end{bmatrix}.

The transpose is

J_B^T = \begin{bmatrix} I & I & \cdots & I \end{bmatrix}.

Therefore reverse mode computes:

J_B^T \bar{y} = \sum_i \bar{y}_i.

Again, reverse broadcasting becomes reduction.
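On a small example the stacked-identity Jacobian can be materialized explicitly to confirm the reduction (illustration only; real systems never build this matrix):

```python
import numpy as np

N, d = 4, 3
J_B = np.vstack([np.eye(d)] * N)   # (N*d, d): N stacked identity blocks

y_bar = np.random.default_rng(3).standard_normal((N, d))
# Row-major flattening gives vec(y_bar) = [row_1, ..., row_N], so
# J_B^T @ vec(y_bar) is exactly the sum of the rows of y_bar.
assert np.allclose(J_B.T @ y_bar.reshape(-1), y_bar.sum(axis=0))
```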

Broadcast Semantics in Deep Learning

Broadcasting appears everywhere in deep learning systems.

Bias Addition

Y = XW + b.

The bias b broadcasts across the batch dimension.

Reverse:

\bar{b} = \sum_i \bar{Y}_i.

Layer Normalization

Affine parameters:

Y = \gamma \odot \hat{X} + \beta.

Parameters γ and β broadcast across the batch and sequence dimensions.

Attention Scaling

Attention logits:

S = \frac{Q K^T}{\sqrt{d_k}} + M.

The mask tensor M may broadcast across heads or batches.

Residual Connections

A lower-rank tensor may broadcast across multiple dimensions.

Shape semantics determine the correct reduction axes during backpropagation.

Reduction as the Adjoint of Broadcast

This duality is important enough to state explicitly.

Let

B : V \to W

be a broadcast operator.

Its adjoint is

B^T : W \to V.

The adjoint operation is reduction.

Forward:

| Operation | Effect           |
|-----------|------------------|
| Broadcast | Replicate values |

Reverse:

| Operation | Effect                         |
|-----------|--------------------------------|
| Reduction | Accumulate replicated adjoints |

This pattern appears throughout AD:

| Forward   | Reverse     |
|-----------|-------------|
| Broadcast | Reduce      |
| Gather    | Scatter-add |
| Expand    | Sum         |
| Replicate | Accumulate  |

Many reverse-mode rules are adjoints of structural tensor operations.

Shape Inference

A broadcast-aware AD engine must track:

  • input shapes
  • output shapes
  • broadcasted axes
  • inserted dimensions
  • reduction semantics

Without this metadata, reverse reduction cannot be reconstructed correctly.

Example:

y = x + b

may involve:

x.shape = (32, 128, 256)
b.shape = (256,)

The reverse rule must reduce over axes:

(0, 1).

The engine cannot infer this only from output gradients. It must preserve broadcast metadata from the forward pass.

Broadcasting Failures

Broadcasting can silently produce incorrect programs if shapes are accidentally compatible.

Example:

x.shape = (32, 128)
y.shape = (128,)

Addition succeeds.

But if the programmer intended:

y.shape = (32, 128)

the program still runs, but semantics change.

This is one reason many large systems increasingly use:

  • named tensors
  • shape typing
  • dimension labels
  • compile-time shape checking

Shape-safe tensor systems reduce silent broadcast bugs.
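A minimal reproduction of the hazard; the stated intent (a per-example `y`) is hypothetical:

```python
import numpy as np

x = np.zeros((32, 128))
y = np.ones(128)    # programmer meant a per-example tensor of shape (32, 128)

z = x + y           # broadcasts silently: every row receives the same y
assert z.shape == (32, 128)   # the shapes "work", so no error is raised
# The bug surfaces only as wrong numbers, and as a wrong gradient:
# y's adjoint is now reduced over axis 0, which the intended program
# would never do.
```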

Broadcast Fusion

Compilers often fuse broadcast operations into downstream kernels.

Instead of:

tmp = broadcast(b)
y = x + tmp

the kernel computes:

y[i,j] = x[i,j] + b[j]

without allocating the expanded tensor.

This optimization is critical for GPU efficiency. Materializing broadcasted tensors can increase memory traffic dramatically.

Reverse-mode kernels similarly fuse reduction logic into gradient kernels.

Numerical Stability

Broadcasting itself is numerically exact. However, reductions in reverse mode may accumulate large sums:

\bar{x} = \sum_i \bar{y}_i.

Parallel reductions introduce:

  • non-associativity
  • floating-point order dependence
  • nondeterminism

Different reduction trees may produce slightly different gradients.

Large distributed systems often trade exact reproducibility for throughput.
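Order dependence is easy to observe in float32. In the sketch below the two sums are mathematically equal but may differ in the low bits, depending on platform and data:

```python
import numpy as np

rng = np.random.default_rng(4)
v = rng.standard_normal(100_000).astype(np.float32)

s_fwd = v.sum()          # one accumulation order
s_rev = v[::-1].sum()    # same values, reversed order
# Equal in exact arithmetic; in float32 the results can disagree slightly.
print(abs(float(s_fwd) - float(s_rev)))
assert np.isclose(s_fwd, s_rev, atol=1e-2)
```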

Formal Rule

The general reverse rule for broadcasting is:

  1. Align input shape with output shape.
  2. Identify axes where:
    • dimensions were inserted, or
    • input size was 1 and output size exceeded 1.
  3. Sum over those axes.
  4. Reshape to original input shape.

Symbolically:

\bar{X} = \operatorname{reshape}\left( \operatorname{reduce\_sum}(\bar{Y}, \text{broadcast axes}), \operatorname{shape}(X) \right).

This rule is implemented in nearly every tensor AD framework.
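A compact NumPy sketch of the four steps, in the style of the "unbroadcast" helper found in minimal AD implementations (the name is informal):

```python
import numpy as np

def unbroadcast(grad, shape):
    """Reduce an output adjoint `grad` back to the input `shape`."""
    # Steps 1-3 for inserted axes: sum over dimensions added on the left.
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # Steps 2-3 for expanded axes: sum where the input had size 1.
    for axis, size in enumerate(shape):
        if size == 1 and grad.shape[axis] > 1:
            grad = grad.sum(axis=axis, keepdims=True)
    # Step 4: the shapes now already match, so reshape is a no-op.
    return grad.reshape(shape)

Y_bar = np.ones((32, 128, 256))
b_bar = unbroadcast(Y_bar, (256,))
assert b_bar.shape == (256,)
assert b_bar[0] == 32 * 128   # every replicated contribution accumulated
```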

Summary

Broadcasting defines implicit tensor replication without materialized copies. The reverse-mode rule is reduction across broadcasted dimensions.

The key principles are:

| Concept                   | Reverse Interpretation |
|---------------------------|------------------------|
| Replicated forward values | Summed adjoints        |
| Broadcast operator        | Reduction adjoint      |
| Zero-stride views         | Gradient accumulation  |
| Shape expansion           | Axis reduction         |

Broadcasting appears simple at the API level, but it imposes strict structural rules on gradient propagation, memory aliasing, and tensor layout. A correct AD engine must treat broadcasting as a first-class semantic operation, not merely a convenience syntax.