Tensor arithmetic is the basic computation layer of PyTorch. Neural networks are built from additions, multiplications, reductions, matrix products, reshapes, and nonlinear functions. Higher-level layers such as nn.Linear, nn.Conv2d, and nn.MultiheadAttention are composed from these lower-level tensor operations.
This section studies arithmetic at the tensor level. The goal is to understand which operations are elementwise, which operations reduce dimensions, and which operations combine axes through linear algebra.
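As a concrete illustration of this composition, the following sketch (with arbitrary sizes) checks that an nn.Linear layer computes the same result as an explicit matrix multiplication plus a bias addition.

import torch
import torch.nn as nn

layer = nn.Linear(4, 3)                    # weight has shape [3, 4], bias has shape [3]
x = torch.randn(2, 4)                      # a batch of 2 input vectors

manual = x @ layer.weight.T + layer.bias   # the same computation written with raw tensor ops
print(torch.allclose(layer(x), manual))    # True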
Elementwise Arithmetic
Elementwise operations apply the same scalar operation independently to each tensor entry.
If $x$ and $y$ are tensors with the same shape, then

$$(x + y)_i = x_i + y_i, \qquad (x * y)_i = x_i \, y_i,$$

and similarly for subtraction and division.

In PyTorch:
import torch
x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor([10.0, 20.0, 30.0])
print(x + y)
print(x - y)
print(x * y)
print(y / x)

The operator * performs elementwise multiplication, not matrix multiplication.
A = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
B = torch.tensor([[10.0, 20.0], [30.0, 40.0]])
print(A * B)

Output:
tensor([[ 10., 40.],
[ 90., 160.]])

Each output entry is computed independently:

$$(A * B)_{ij} = A_{ij} \, B_{ij}.$$
Arithmetic with Scalars
A scalar can be combined with a tensor. The scalar is applied to every entry.
x = torch.tensor([1.0, 2.0, 3.0])
print(x + 1)
print(2 * x)
print(x / 10)
print(x ** 2)

Mathematically:

$$(x + 1)_i = x_i + 1, \qquad (2x)_i = 2 x_i, \qquad (x / 10)_i = x_i / 10, \qquad (x^2)_i = x_i^2.$$
Scalar arithmetic is used throughout deep learning. For example, weight decay adds a scaled parameter tensor to a gradient update. Temperature scaling divides logits by a scalar temperature. Normalization often subtracts a mean and divides by a standard deviation.
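A rough sketch of these three uses, with made-up tensors and constants:

grad = torch.randn(10)
param = torch.randn(10)
logits = torch.randn(5)

grad_with_decay = grad + 0.01 * param                 # weight decay: add a scaled parameter tensor
scaled_logits = logits / 2.0                          # temperature scaling with temperature 2.0
normalized = (param - param.mean()) / param.std()     # subtract the mean, divide by the std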
Unary Elementwise Functions
PyTorch includes many unary functions that operate elementwise:
| Function | Meaning |
|---|---|
| torch.exp(x) | Exponential |
| torch.log(x) | Natural logarithm |
| torch.sqrt(x) | Square root |
| torch.abs(x) | Absolute value |
| torch.sin(x) | Sine |
| torch.cos(x) | Cosine |
| torch.sigmoid(x) | Logistic sigmoid |
| torch.tanh(x) | Hyperbolic tangent |
| torch.relu(x) | Rectified linear unit |
Example:
x = torch.tensor([-2.0, -1.0, 0.0, 1.0, 2.0])
print(torch.exp(x))
print(torch.sigmoid(x))
print(torch.relu(x))

For a tensor $x$, an elementwise function $f$ produces a tensor with the same shape:

$$f(x)_i = f(x_i).$$
Elementwise nonlinearities are what allow deep networks to represent nonlinear functions.
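One way to see why the nonlinearity matters: without it, two stacked linear maps collapse into a single linear map. A small check with arbitrary shapes (equal up to floating-point error):

X = torch.randn(8, 16)
W1 = torch.randn(16, 32)
W2 = torch.randn(32, 4)

two_layers = (X @ W1) @ W2       # two linear layers with no nonlinearity in between
one_layer = X @ (W1 @ W2)        # a single equivalent linear layer
print(torch.allclose(two_layers, one_layer, atol=1e-4))  # True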
Reductions
A reduction combines many tensor entries into fewer entries.
Common reductions:
| Function | Meaning |
|---|---|
| sum() | Sum of entries |
| mean() | Average of entries |
| max() | Maximum value |
| min() | Minimum value |
| prod() | Product of entries |
| std() | Standard deviation |
| var() | Variance |
Example:
X = torch.tensor([
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
])
print(X.sum())
print(X.mean())

Output:
tensor(21.)
tensor(3.5000)

A reduction over all axes returns a scalar tensor.
Reducing Along an Axis
A reduction may also be applied along a specific axis.
X = torch.tensor([
[1.0, 2.0, 3.0],
[4.0, 5.0, 6.0],
])
print(X.sum(dim=0))
print(X.sum(dim=1))

Output:
tensor([5., 7., 9.])
tensor([ 6., 15.])

For $X \in \mathbb{R}^{m \times n}$, summing with dim=0 gives a length-$n$ vector with entries $\sum_{i=1}^{m} X_{ij}$, and summing with dim=1 gives a length-$m$ vector with entries $\sum_{j=1}^{n} X_{ij}$.
Reducing with dim=0 collapses the first axis. Reducing with dim=1 collapses the second axis.
This distinction matters in training. For example, a loss tensor may have one value per example:
loss_per_example = torch.tensor([0.4, 1.2, 0.7, 0.9])
loss = loss_per_example.mean()

The scalar loss is then used for backpropagation.
Keeping Reduced Dimensions
By default, a reduction removes the reduced axis. The option keepdim=True preserves it with size 1.
X = torch.randn(32, 64)
mean1 = X.mean(dim=1)
mean2 = X.mean(dim=1, keepdim=True)
print(mean1.shape) # torch.Size([32])
print(mean2.shape) # torch.Size([32, 1])

Keeping dimensions is useful for broadcasting.
X = torch.randn(32, 64)
mean = X.mean(dim=1, keepdim=True)
centered = X - mean
print(centered.shape) # torch.Size([32, 64])

Here mean has shape [32, 1], so it can be subtracted from X row by row.
Broadcasting
Broadcasting allows PyTorch to combine tensors with different but compatible shapes.
Suppose $X$ has shape $[32, 64]$ and $b$ has shape $[64]$:
X = torch.randn(32, 64)
b = torch.randn(64)
Y = X + b

The vector b is treated as if it were copied across the batch dimension. Conceptually,

$$Y_{ij} = X_{ij} + b_j.$$
No physical copy is usually made. PyTorch uses strides to interpret the smaller tensor as if it had the larger shape.
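One way to see this, reusing the shapes above: expanding b to the batch shape produces a view whose stride along the batch dimension is 0, so every row refers to the same underlying data.

b = torch.randn(64)
b_view = b.unsqueeze(0).expand(32, 64)   # a broadcasted view; no data is copied
print(b_view.shape)                      # torch.Size([32, 64])
print(b_view.stride())                   # (0, 1): stride 0 along the batch dimension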
Broadcasting compares dimensions from right to left. Two dimensions are compatible when they are equal or one of them is 1. If one tensor has fewer dimensions, it is treated as if missing leading dimensions of size 1 were added.
| Shape A | Shape B | Result |
|---|---|---|
| [32, 64] | [64] | [32, 64] |
| [32, 64] | [1, 64] | [32, 64] |
| [32, 10, 64] | [64] | [32, 10, 64] |
| [32, 10, 64] | [10, 64] | [32, 10, 64] |
| [32, 10, 64] | [32] | invalid |
Broadcasting is powerful, but it must be used deliberately. A mistaken shape can produce a valid result with the wrong meaning.
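Here is a sketch of one such silent mistake: subtracting a [32, 1] tensor from a [32] tensor broadcasts both to a 32 x 32 result instead of failing.

scores = torch.randn(32)              # one score per example
baseline = torch.randn(32, 1)         # intended as one baseline per example

diff = scores - baseline              # broadcasts to [32, 32], not [32]
print(diff.shape)                     # torch.Size([32, 32])

diff = scores - baseline.squeeze(1)   # shapes now match, giving the intended per-example result
print(diff.shape)                     # torch.Size([32])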
Matrix Multiplication
Matrix multiplication combines rows and columns through dot products.
If $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, then

$$C = AB$$

has shape $m \times p$. Each entry is

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}.$$
In PyTorch:
A = torch.randn(5, 3)
B = torch.randn(3, 2)
C = A @ B
print(C.shape) # torch.Size([5, 2])

The @ operator performs matrix multiplication. The equivalent function is torch.matmul.
Matrix multiplication is central to neural networks. A fully connected layer is a matrix multiplication followed by bias addition.
X = torch.randn(32, 128)
W = torch.randn(128, 64)
b = torch.randn(64)
Y = X @ W + b
print(Y.shape) # torch.Size([32, 64])

Batch Matrix Multiplication
Neural networks often need many matrix multiplications at once.
Suppose $A$ has shape $[b, m, n]$ and $B$ has shape $[b, n, p]$, where $b$ is the batch size. Then batch matrix multiplication gives a tensor $C$ of shape $[b, m, p]$ whose slices are $C_k = A_k B_k$ for each batch index $k$.
In PyTorch:
A = torch.randn(16, 5, 3)
B = torch.randn(16, 3, 2)
C = torch.bmm(A, B)
print(C.shape) # torch.Size([16, 5, 2])

torch.matmul also supports batched matrix multiplication and broadcasting over leading dimensions.
A = torch.randn(16, 5, 3)
B = torch.randn(3, 2)
C = torch.matmul(A, B)
print(C.shape) # torch.Size([16, 5, 2])

This operation appears in attention mechanisms, where each batch contains many query-key and attention-value products.
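A rough sketch of that pattern, with arbitrarily chosen dimensions: the query-key scores and attention-value products of scaled dot-product attention are two batched matrix multiplications.

import math

B, T, d = 16, 10, 64                               # batch size, sequence length, head dimension
Q = torch.randn(B, T, d)                           # queries
K = torch.randn(B, T, d)                           # keys
V = torch.randn(B, T, d)                           # values

scores = Q @ K.transpose(-2, -1) / math.sqrt(d)    # [16, 10, 10] query-key dot products
weights = torch.softmax(scores, dim=-1)            # attention weights for each query
out = weights @ V                                  # [16, 10, 64] attention-value products
print(out.shape)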
Dot Products
A dot product combines two vectors into a scalar.
For $x, y \in \mathbb{R}^n$, the dot product is

$$x \cdot y = \sum_{i=1}^{n} x_i y_i.$$
In PyTorch:
x = torch.randn(64)
y = torch.randn(64)
s = torch.dot(x, y)
print(s.shape) # torch.Size([])

Dot products measure alignment. If two vectors point in similar directions, their dot product is large and positive. If they point in opposite directions, it is negative. If they are nearly orthogonal, it is near zero.
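A small, hand-picked example of this behavior:

a = torch.tensor([1.0, 2.0, 0.0])
b = torch.tensor([2.0, 4.0, 0.0])    # same direction as a
c = torch.tensor([-1.0, -2.0, 0.0])  # opposite direction
d = torch.tensor([0.0, 0.0, 3.0])    # orthogonal to a

print(torch.dot(a, b))   # tensor(10.): large and positive
print(torch.dot(a, c))   # tensor(-5.): negative
print(torch.dot(a, d))   # tensor(0.): orthogonal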
Attention mechanisms use dot products to compare query and key vectors.
Norms
A norm measures the size of a vector or tensor.
The Euclidean norm of a vector $x \in \mathbb{R}^n$ is

$$\|x\|_2 = \sqrt{\sum_{i=1}^{n} x_i^2}.$$
In PyTorch:
x = torch.tensor([3.0, 4.0])
print(torch.linalg.norm(x)) # tensor(5.)

Other common norms include the L1 norm

$$\|x\|_1 = \sum_{i=1}^{n} |x_i|$$

and the infinity norm

$$\|x\|_\infty = \max_i |x_i|.$$
PyTorch:
x = torch.tensor([1.0, -2.0, 3.0])
print(torch.linalg.norm(x, ord=1))
print(torch.linalg.norm(x, ord=2))
print(torch.linalg.norm(x, ord=float("inf")))

Norms are used in regularization, gradient clipping, normalization, and distance computation.
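For instance, gradient clipping by norm can be written directly from these pieces; the threshold below is an arbitrary choice for illustration.

grad = torch.randn(1000) * 5.0        # a stand-in for a gradient tensor
max_norm = 1.0

norm = torch.linalg.norm(grad)
if norm > max_norm:
    grad = grad * (max_norm / norm)   # rescale so the clipped norm equals max_norm
print(torch.linalg.norm(grad))        # at most 1.0, up to rounding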
Comparisons and Masks
Comparison operations return Boolean tensors.
x = torch.tensor([-1.0, 0.0, 2.0, 5.0])
mask = x > 0
print(mask)

Output:
tensor([False, False, True, True])

Masks are used to select, ignore, or modify entries.
positive = x[mask]
print(positive)

Output:
tensor([2., 5.])

A common pattern is torch.where:
x = torch.tensor([-1.0, 0.0, 2.0, 5.0])
y = torch.where(x > 0, x, torch.zeros_like(x))
print(y)

Output:
tensor([0., 0., 2., 5.])

This is equivalent to applying a ReLU-like operation.
Masks are especially important in sequence models. Padding masks prevent attention layers from attending to padded tokens.
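A minimal sketch of that idea, with invented shapes and values: padded key positions are filled with negative infinity so softmax assigns them zero weight.

scores = torch.randn(2, 4)                                # attention scores for 2 queries over 4 keys
pad_mask = torch.tensor([[False, False, True, True],
                         [False, False, False, True]])    # True marks padded key positions

masked_scores = scores.masked_fill(pad_mask, float("-inf"))
weights = torch.softmax(masked_scores, dim=-1)            # padded positions receive weight 0
print(weights)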
In-Place Operations
Some PyTorch operations modify a tensor in place. These usually end with an underscore.
x = torch.tensor([1.0, 2.0, 3.0])
x.add_(1.0)
print(x)

Output:
tensor([2., 3., 4.])

In-place operations save memory, but they can interfere with automatic differentiation when a value is needed later for gradient computation.
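A small sketch of the kind of failure this can cause: torch.exp saves its output for the backward pass, so modifying that output in place typically raises a runtime error when backward runs.

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = torch.exp(x)       # autograd saves y to compute the gradient of exp
y.add_(1.0)            # in-place change to a value autograd still needs
try:
    y.sum().backward()
except RuntimeError as err:
    print("backward failed:", err)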
For most model code, prefer ordinary out-of-place operations unless memory use is a measured problem.
Numerical Stability
Some arithmetic expressions are mathematically valid but numerically unstable.
For example, computing softmax naively can overflow:
x = torch.tensor([1000.0, 1001.0, 1002.0])
exp_x = torch.exp(x)            # exp(1000) overflows to inf
softmax = exp_x / exp_x.sum()   # inf / inf produces nan

A stable version subtracts the maximum value first:
x = torch.tensor([1000.0, 1001.0, 1002.0])
z = x - x.max()
softmax = torch.exp(z) / torch.exp(z).sum()
print(softmax)

PyTorch provides stable built-in functions:
softmax = torch.softmax(x, dim=0)

Similarly, use torch.logsumexp, torch.nn.functional.cross_entropy, and other built-ins when available. These functions are designed to avoid common overflow and underflow problems.
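For example, a numerically stable log-softmax can be written with torch.logsumexp instead of taking the log of an explicit softmax:

x = torch.tensor([1000.0, 1001.0, 1002.0])
log_probs = x - torch.logsumexp(x, dim=0)   # stable: no overflowing intermediate
print(log_probs)
print(torch.log_softmax(x, dim=0))          # built-in equivalent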
A Small Tensor Arithmetic Example
The following code implements a linear classifier from raw tensor operations.
import torch
import torch.nn.functional as F
B = 32 # batch size
D = 128 # input dimension
K = 10 # number of classes
X = torch.randn(B, D)
y = torch.randint(0, K, (B,))
W = torch.randn(D, K) * 0.01
b = torch.zeros(K)
logits = X @ W + b
loss = F.cross_entropy(logits, y)
print(logits.shape) # torch.Size([32, 10])
print(loss.shape) # torch.Size([])

Shape flow:
| Name | Shape | Meaning |
|---|---|---|
| X | [32, 128] | Input batch |
| W | [128, 10] | Weight matrix |
| b | [10] | Bias vector |
| logits | [32, 10] | Class scores |
| y | [32] | Class labels |
| loss | [] | Scalar training objective |
This simple example already uses matrix multiplication, broadcasting, integer labels, reduction, and a numerically stable loss function.
Summary
Tensor arithmetic provides the computational foundation for neural networks. Elementwise operations preserve shape. Reductions remove or shrink axes. Broadcasting combines compatible shapes without explicit copying. Matrix multiplication combines axes through dot products and forms the core of linear layers, attention mechanisms, and many other neural network operations.
A correct PyTorch program tracks both the mathematical operation and the tensor shape. When the two agree, the model is usually easier to debug, optimize, and extend.