Tensor arithmetic is the basic computation layer of PyTorch. Neural networks are built from additions, multiplications, reductions, matrix products, reshapes, and nonlinear functions. Higher-level layers such as nn.Linear, nn.Conv2d, and nn.MultiheadAttention are composed from these lower-level tensor operations.
This section studies arithmetic at the tensor level. The goal is to understand which operations are elementwise, which operations reduce dimensions, and which operations combine axes through linear algebra.
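As a concrete illustration of this composition, the following sketch (with arbitrary sizes) checks that an nn.Linear layer computes the same result as an explicit matrix multiplication plus a bias addition.

import torch
import torch.nn as nn

layer = nn.Linear(4, 3)                    # weight has shape [3, 4], bias has shape [3]
x = torch.randn(2, 4)                      # a batch of 2 input vectors

manual = x @ layer.weight.T + layer.bias   # the same computation written with raw tensor ops
print(torch.allclose(layer(x), manual))    # True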
Elementwise Arithmetic
Elementwise operations apply the same scalar operation independently to each tensor entry.
If $x$ and $y$ are tensors with the same shape, then

$$(x + y)_i = x_i + y_i, \qquad (x * y)_i = x_i \, y_i,$$

and similarly for subtraction and division.

In PyTorch:
import torch
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([10.0, 20.0, 30.0])
print(x + y)
print(x - y)
print(x * y)
print(y / x)

The operator * performs elementwise multiplication, not matrix multiplication.
A = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
B = torch.tensor([[10.0, 20.0], [30.0, 40.0]])
print(A * B)

Output:
tensor([[ 10., 40.],
[ 90., 160.]])

Each output entry is computed independently:

$$(A * B)_{ij} = A_{ij} \, B_{ij}.$$
Arithmetic with Scalars
A scalar can be combined with a tensor. The scalar is applied to every entry.
x = torch.tensor([1.0, 2.0, 3.0])
print(x + 1)
print(2 * x)
print(x / 10)
print(x ** 2)

Mathematically:

$$(x + 1)_i = x_i + 1, \qquad (2x)_i = 2 x_i, \qquad (x / 10)_i = x_i / 10, \qquad (x^2)_i = x_i^2.$$
Scalar arithmetic is used throughout deep learning. For example, weight decay adds a scaled parameter tensor to a gradient update. Temperature scaling divides logits by a scalar temperature. Normalization often subtracts a mean and divides by a standard deviation.
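A rough sketch of these three uses, with made-up tensors and constants:

grad = torch.randn(10)
param = torch.randn(10)
logits = torch.randn(5)

grad_with_decay = grad + 0.01 * param                 # weight decay: add a scaled parameter tensor
scaled_logits = logits / 2.0                          # temperature scaling with temperature 2.0
normalized = (param - param.mean()) / param.std()     # subtract the mean, divide by the std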
Unary Elementwise Functions
PyTorch includes many unary functions that operate elementwise:
| Function | Meaning |
|---|---|
| torch.exp(x) | Exponential |
| torch.log(x) | Natural logarithm |
| torch.sqrt(x) | Square root |
| torch.abs(x) | Absolute value |
| torch.sin(x) | Sine |
| torch.cos(x) | Cosine |
| torch.sigmoid(x) | Logistic sigmoid |
| torch.tanh(x) | Hyperbolic tangent |
| torch.relu(x) | Rectified linear unit |
Example:
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
print(torch.exp(x))
print(torch.sigmoid(x))
print(torch.relu(x))

For a tensor $x$, an elementwise function $f$ produces a tensor with the same shape:

$$f(x)_i = f(x_i).$$
Elementwise nonlinearities are what allow deep networks to represent nonlinear functions.
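One way to see why the nonlinearity matters: without it, two stacked linear maps collapse into a single linear map. A small check with arbitrary shapes (equal up to floating-point error):

X = torch.randn(8, 16)
W1 = torch.randn(16, 32)
W2 = torch.randn(32, 4)

two_layers = (X @ W1) @ W2       # two linear layers with no nonlinearity in between
one_layer = X @ (W1 @ W2)        # a single equivalent linear layer
print(torch.allclose(two_layers, one_layer, atol=1e-4))  # True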
Reductions
A reduction combines many tensor entries into fewer entries.
Common reductions:
| Function | Meaning |
|---|---|
| sum() | Sum of entries |
| mean() | Average of entries |
| max() | Maximum value |
| min() | Minimum value |
| prod() | Product of entries |
| std() | Standard deviation |
| var() | Variance |
Example:
X = torch.tensor([
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
])
print(X.sum())
print(X.mean())

Output:
tensor(21.)
tensor(3.5000)

A reduction over all axes returns a scalar tensor.
Reducing Along an Axis
A reduction may also be applied along a specific axis.
X = torch.tensor([
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
])
print(X.sum(dim=0))
print(X.sum(dim=1))

Output:
tensor([5., 7., 9.])
tensor([ 6., 15.])

For $X \in \mathbb{R}^{m \times n}$, summing with dim=0 gives a length-$n$ vector with entries $\sum_{i=1}^{m} X_{ij}$, and summing with dim=1 gives a length-$m$ vector with entries $\sum_{j=1}^{n} X_{ij}$.
Reducing with dim=0 collapses the first axis. Reducing with dim=1 collapses the second axis.
This distinction matters in training. For example, a loss tensor may have one value per example:
loss_per_example = torch.tensor([0.4, 1.2, 0.7, 0.9])
loss = loss_per_example.mean()

The scalar loss is then used for backpropagation.
Keeping Reduced Dimensions
By default, a reduction removes the reduced axis. The option keepdim=True preserves it with size 1.
X = torch.randn(32, 64)
mean1 = X.mean(dim=1)
mean2 = X.mean(dim=1, keepdim=True)
print(mean1.shape) # torch.Size([32])
print(mean2.shape) # torch.Size([32, 1])

Keeping dimensions is useful for broadcasting.
X = torch.randn(32, 64)
mean = X.mean(dim=1, keepdim=True)
centered = X - mean
print(centered.shape) # torch.Size([32, 64])

Here mean has shape [32, 1], so it can be subtracted from X row by row.
Broadcasting
Broadcasting allows PyTorch to combine tensors with different but compatible shapes.
Suppose $X$ has shape $[32, 64]$ and $b$ has shape $[64]$:
X = torch.randn(32, 64)
b = torch.randn(64)
Y = X + b

The vector b is treated as if it were copied across the batch dimension. Conceptually,

$$Y_{ij} = X_{ij} + b_j.$$
No physical copy is usually made. PyTorch uses strides to interpret the smaller tensor as if it had the larger shape.
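One way to see this, reusing the shapes above: expanding b to the batch shape produces a view whose stride along the batch dimension is 0, so every row refers to the same underlying data.

b = torch.randn(64)
b_view = b.unsqueeze(0).expand(32, 64)   # a broadcasted view; no data is copied
print(b_view.shape)                      # torch.Size([32, 64])
print(b_view.stride())                   # (0, 1): stride 0 along the batch dimension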
Broadcasting compares dimensions from right to left. Two dimensions are compatible when they are equal or one of them is 1. If one tensor has fewer dimensions, it is treated as if missing leading dimensions of size 1 were added.
| Shape A | Shape B | Result |
|---|---|---|
| [32, 64] | [64] | [32, 64] |
| [32, 64] | [1, 64] | [32, 64] |
| [32, 10, 64] | [64] | [32, 10, 64] |
| [32, 10, 64] | [10, 64] | [32, 10, 64] |
| [32, 10, 64] | [32] | invalid |
Broadcasting is powerful, but it must be used deliberately. A mistaken shape can produce a valid result with the wrong meaning.
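Here is a sketch of one such silent mistake: subtracting a [32, 1] tensor from a [32] tensor broadcasts both to a 32 x 32 result instead of failing.

scores = torch.randn(32)              # one score per example
baseline = torch.randn(32, 1)         # intended as one baseline per example

diff = scores - baseline              # broadcasts to [32, 32], not [32]
print(diff.shape)                     # torch.Size([32, 32])

diff = scores - baseline.squeeze(1)   # shapes now match, giving the intended per-example result
print(diff.shape)                     # torch.Size([32])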
Matrix Multiplication
Matrix multiplication combines rows and columns through dot products.
If $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, then

$$C = AB$$

has shape $m \times p$. Each entry is

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}.$$
In PyTorch:
A = torch.randn(5, 3)
B = torch.randn(3, 2)
C = A @ B
print(C.shape) # torch.Size([5, 2])

The @ operator performs matrix multiplication. The equivalent function is torch.matmul.
Matrix multiplication is central to neural networks. A fully connected layer is a matrix multiplication followed by bias addition.
X = torch.randn(32, 128)
W = torch.randn(128, 64)
b = torch.randn(64)
Y = X @ W + b
print(Y.shape) # torch.Size([32, 64])

Batch Matrix Multiplication
Neural networks often need many matrix multiplications at once.
Suppose $A$ has shape $[b, m, n]$ and $B$ has shape $[b, n, p]$, where $b$ is the batch size. Then batch matrix multiplication gives a tensor $C$ of shape $[b, m, p]$ whose slices are $C_k = A_k B_k$ for each batch index $k$.
In PyTorch:
A = torch.randn(16, 5, 3)
B = torch.randn(16, 3, 2)
C = torch.bmm(A, B)
print(C.shape) # torch.Size([16, 5, 2])

torch.matmul also supports batched matrix multiplication and broadcasting over leading dimensions.
A = torch.randn(16, 5, 3)
B = torch.randn(3, 2)
C = torch.matmul(A, B)
print(C.shape) # torch.Size([16, 5, 2])

This operation appears in attention mechanisms, where each batch contains many query-key and attention-value products.
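A rough sketch of that pattern, with arbitrarily chosen dimensions: the query-key scores and attention-value products of scaled dot-product attention are two batched matrix multiplications.

import math

B, T, d = 16, 10, 64                               # batch size, sequence length, head dimension
Q = torch.randn(B, T, d)                           # queries
K = torch.randn(B, T, d)                           # keys
V = torch.randn(B, T, d)                           # values

scores = Q @ K.transpose(-2, -1) / math.sqrt(d)    # [16, 10, 10] query-key dot products
weights = torch.softmax(scores, dim=-1)            # attention weights for each query
out = weights @ V                                  # [16, 10, 64] attention-value products
print(out.shape)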
Dot Products
A dot product combines two vectors into a scalar.
For $x, y \in \mathbb{R}^n$, the dot product is

$$x \cdot y = \sum_{i=1}^{n} x_i y_i.$$
In PyTorch:
x = torch.randn(64)
y = torch.randn(64)
s = torch.dot(x, y)
print(s.shape) # torch.Size([])

Dot products measure alignment. If two vectors point in similar directions, their dot product is large and positive. If they point in opposite directions, it is negative. If they are nearly orthogonal, it is near zero.
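A small, hand-picked example of this behavior:

a = torch.tensor([1.0, 2.0, 0.0])
b = torch.tensor([2.0, 4.0, 0.0])    # same direction as a
c = torch.tensor([-1.0, -2.0, 0.0])  # opposite direction
d = torch.tensor([0.0, 0.0, 3.0])    # orthogonal to a

print(torch.dot(a, b))   # tensor(10.): large and positive
print(torch.dot(a, c))   # tensor(-5.): negative
print(torch.dot(a, d))   # tensor(0.): orthogonal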
Attention mechanisms use dot products to compare query and key vectors.
Norms
A norm measures the size of a vector or tensor.
The Euclidean norm of a vector $x \in \mathbb{R}^n$ is

$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}.$$
In PyTorch:
x = torch.tensor([3.0, 4.0])
print(torch.linalg.norm(x)) # tensor(5.)

Other common norms include the L1 norm

$$\|x\|_1 = \sum_{i=1}^{n} |x_i|$$

and the infinity norm

$$\|x\|_\infty = \max_i |x_i|.$$
PyTorch:
x = torch.tensor([1.0, -2.0, 3.0])
print(torch.linalg.norm(x, ord=1))
print(torch.linalg.norm(x, ord=2))
print(torch.linalg.norm(x, ord=float("inf")))

Norms are used in regularization, gradient clipping, normalization, and distance computation.
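For instance, gradient clipping by norm can be written directly from these pieces; the threshold below is an arbitrary choice for illustration.

grad = torch.randn(1000) * 5.0        # a stand-in for a gradient tensor
max_norm = 1.0

norm = torch.linalg.norm(grad)
if norm > max_norm:
    grad = grad * (max_norm / norm)   # rescale so the clipped norm equals max_norm
print(torch.linalg.norm(grad))        # at most 1.0, up to rounding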
Comparisons and Masks
Comparison operations return Boolean tensors.
x = torch.tensor([-1.0, 0.0, 2.0, 5.0])
mask = x > 0
print(mask)

Output:
tensor([False, False, True, True])

Masks are used to select, ignore, or modify entries.
positive = x[mask]
print(positive)

Output:
tensor([2., 5.])

A common pattern is torch.where:
x = torch.tensor([-1.0, 0.0, 2.0, 5.0])
y = torch.where(x > 0, x, torch.zeros_like(x))
print(y)

Output:
tensor([0., 0., 2., 5.])

This is equivalent to applying a ReLU-like operation.
Masks are especially important in sequence models. Padding masks prevent attention layers from attending to padded tokens.
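A minimal sketch of that idea, with invented shapes and values: padded key positions are filled with negative infinity so softmax assigns them zero weight.

scores = torch.randn(2, 4)                                # attention scores for 2 queries over 4 keys
pad_mask = torch.tensor([[False, False, True, True],
                         [False, False, False, True]])    # True marks padded key positions

masked_scores = scores.masked_fill(pad_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)            # padded positions receive weight 0
print(weights)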
In-Place Operations
Some PyTorch operations modify a tensor in place. These usually end with an underscore.
x = torch.tensor([1.0, 2.0, 3.0])
x.add_(1.0)
print(x)

Output:
tensor([2., 3., 4.])

In-place operations save memory, but they can interfere with automatic differentiation when a value is needed later for gradient computation.
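A small sketch of the kind of failure this can cause: torch.exp saves its output for the backward pass, so modifying that output in place typically raises a runtime error when backward runs.

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.exp(x)       # autograd saves y to compute the gradient of exp
y.add_(1.0)            # in-place change to a value autograd still needs
try:
    y.sum().backward()
except RuntimeError as err:
    print("backward failed:", err)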
For most model code, prefer ordinary out-of-place operations unless memory use is a measured problem.
Numerical Stability
Some arithmetic expressions are mathematically valid but numerically unstable.
For example, computing softmax naively can overflow:
x = torch.tensor([1000.0, 1001.0, 1002.0])
exp_x = torch.exp(x)            # exp(1000) overflows to inf
softmax = exp_x / exp_x.sum()   # inf / inf produces nan

A stable version subtracts the maximum value first:
x = torch.tensor([1000.0, 1001.0, 1002.0])
z = x - x.max()
softmax = torch.exp(z) / torch.exp(z).sum()
print(softmax)

PyTorch provides stable built-in functions:
softmax = torch.softmax(x, dim=0)

Similarly, use torch.logsumexp, torch.nn.functional.cross_entropy, and other built-ins when available. These functions are designed to avoid common overflow and underflow problems.
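For example, a numerically stable log-softmax can be written with torch.logsumexp instead of taking the log of an explicit softmax:

x = torch.tensor([1000.0, 1001.0, 1002.0])
log_probs = x - torch.logsumexp(x, dim=0)   # stable: no overflowing intermediate
print(log_probs)
print(torch.log_softmax(x, dim=0))          # built-in equivalent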
A Small Tensor Arithmetic Example
The following code implements a linear classifier from raw tensor operations.
import torch
import torch.nn.functional as F
B = 32 # batch size
D = 128 # input dimension
K = 10 # number of classes
X = torch.randn(B, D)
y = torch.randint(0, K, (B,))
W = torch.randn(D, K) * 0.01
b = torch.zeros(K)
logits = X @ W + b
loss = F.cross_entropy(logits, y)
print(logits.shape) # torch.Size([32, 10])
print(loss.shape) # torch.Size([])

Shape flow:
| Name | Shape | Meaning |
|---|---|---|
| X | [32, 128] | Input batch |
| W | [128, 10] | Weight matrix |
| b | [10] | Bias vector |
| logits | [32, 10] | Class scores |
| y | [32] | Class labels |
| loss | [] | Scalar training objective |
This simple example already uses matrix multiplication, broadcasting, integer labels, reduction, and a numerically stable loss function.
Summary
Tensor arithmetic provides the computational foundation for neural networks. Elementwise operations preserve shape. Reductions remove or shrink axes. Broadcasting combines compatible shapes without explicit copying. Matrix multiplication combines axes through dot products and forms the core of linear layers, attention mechanisms, and many other neural network operations.
A correct PyTorch program tracks both the mathematical operation and the tensor shape. When the two agree, the model is usually easier to debug, optimize, and extend.