Self-attention compares tokens by content. By itself, it has no built-in notion of token order. If a sequence is permuted, self-attention follows the permutation. It can still compare all tokens, but it cannot tell whether a token came first, second, or last unless position information is added.
This matters because many sequences are order-sensitive.
dog bites man
man bites dog

These two sequences contain the same words, but they have different meanings. A transformer needs a representation of both content and position.
Positional encoding supplies this missing information. It adds or injects position-dependent signals into token representations before or during attention.
Given token embeddings $X \in \mathbb{R}^{T \times d}$, a common form is

$$H = X + P,$$

where $P \in \mathbb{R}^{T \times d}$ contains one position vector for each sequence index. The result $H$ has the same shape as $X$, but each token vector now contains information about both identity and position.
Why Self-Attention Needs Positions
A self-attention layer computes scores from pairwise dot products:

$$\text{score}(i, j) = \frac{q_i \cdot k_j}{\sqrt{d_k}}$$
If the input contains no positional signal, then the score between two tokens depends only on their learned content representations. The operation is permutation equivariant: permuting the input positions permutes the output positions in the same way.
Permutation equivariance is useful for sets and graphs, but it is insufficient for ordered sequences. Language, audio, time series, and action trajectories depend on order. The model must distinguish “A before B” from “B before A.”
Convolutional networks and recurrent networks have order built into their structure. Convolutions use local neighborhoods. Recurrent networks process tokens step by step. Transformers need an explicit positional mechanism.
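The permutation-equivariance claim is easy to check numerically. The sketch below runs a content-only attention operation (no projections, no positions, both simplifications for brevity) and verifies that permuting the input just permutes the output:

```python
import torch

torch.manual_seed(0)
T, D = 5, 8
x = torch.randn(T, D)

def attention(x: torch.Tensor) -> torch.Tensor:
    # content-only self-attention: no positional signal anywhere
    scores = x @ x.T / D ** 0.5
    return torch.softmax(scores, dim=-1) @ x

perm = torch.randperm(T)
out1 = attention(x)[perm]   # permute the output
out2 = attention(x[perm])   # permute the input first
print(torch.allclose(out1, out2, atol=1e-6))  # True
```

Because the two results match, the layer by itself cannot distinguish any ordering of the same tokens.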
Learned Absolute Positional Embeddings
The simplest method is to learn a vector for each position.
For a maximum context length $L_{\max}$, define an embedding table $P \in \mathbb{R}^{L_{\max} \times d}$. For a sequence of length $T \le L_{\max}$, we take the first $T$ rows and add them to the token embeddings: $H = X + P_{1:T}$.
In PyTorch:

```python
import torch
from torch import nn

class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, max_length: int, d_model: int):
        super().__init__()
        self.position_embedding = nn.Embedding(max_length, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, T, D]
        B, T, D = x.shape
        positions = torch.arange(T, device=x.device)
        pos = self.position_embedding(positions)
        # pos: [T, D], broadcast to [B, T, D]
        return x + pos
```

Usage:

```python
x = torch.randn(8, 128, 768)
pos = LearnedPositionalEmbedding(max_length=512, d_model=768)
h = pos(x)
print(h.shape)  # torch.Size([8, 128, 768])
```

Learned absolute positional embeddings are simple and effective. Their main limitation is that they are tied to the maximum length used during training. Extending them beyond the trained context length requires interpolation, extrapolation, or retraining.
Sinusoidal Positional Encoding
The original transformer used fixed sinusoidal positional encodings. Instead of learning position vectors, it defines them using sine and cosine waves at different frequencies.
For position $p$ and channel index $i$, the encoding is

$$PE(p, 2i) = \sin\!\left(\frac{p}{10000^{2i/d}}\right), \qquad PE(p, 2i+1) = \cos\!\left(\frac{p}{10000^{2i/d}}\right)$$

Even dimensions use sine. Odd dimensions use cosine. Lower dimensions vary quickly; higher dimensions vary slowly.
This gives each position a deterministic vector, and relative offsets can be expressed through linear relations among sinusoidal components.
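Concretely, for a single frequency $\omega$ and a fixed offset $k$, the shift by $k$ is a linear map that does not depend on the position $p$:

$$
\begin{pmatrix} \sin\big((p+k)\omega\big) \\ \cos\big((p+k)\omega\big) \end{pmatrix}
=
\begin{pmatrix} \cos(k\omega) & \sin(k\omega) \\ -\sin(k\omega) & \cos(k\omega) \end{pmatrix}
\begin{pmatrix} \sin(p\omega) \\ \cos(p\omega) \end{pmatrix}
$$

Because the matrix depends only on the offset $k$, a linear layer can in principle express relative shifts of sinusoidal position vectors.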
In PyTorch:

```python
import math

import torch
from torch import nn

class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, max_length: int, d_model: int):
        super().__init__()
        pe = torch.zeros(max_length, d_model)
        position = torch.arange(0, max_length).unsqueeze(1).float()
        div_term = torch.exp(
            torch.arange(0, d_model, 2).float()
            * (-math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, T, D]
        T = x.shape[1]
        return x + self.pe[:T, :]
```

Usage:

```python
x = torch.randn(4, 100, 256)
pos = SinusoidalPositionalEncoding(max_length=1000, d_model=256)
h = pos(x)
print(h.shape)  # torch.Size([4, 100, 256])
```

The tensor pe is registered as a buffer rather than a parameter. It is saved with the module and moved between devices, but it is not updated by gradient descent.
Absolute Versus Relative Position
Absolute positional methods assign a representation to each index:

```
position 0
position 1
position 2
...
```

This lets the model know where a token appears in the sequence. However, many sequence relations depend more on relative distance than absolute index.
For example, in language, local dependencies often depend on whether two words are close. In music, rhythm depends on offsets. In time series, recency often matters more than absolute timestamp.
Relative position methods represent the distance between tokens. Instead of only asking "what is the position of token $i$?", they also ask "how far is token $i$ from token $j$?"
A relative position bias modifies the attention score:

$$\text{score}(i, j) = \frac{q_i \cdot k_j}{\sqrt{d_k}} + b_{i-j}$$

Here $b_{i-j}$ is a learned bias depending on the relative offset between positions $i$ and $j$. Nearby tokens can receive different biases from distant tokens. Future tokens can receive different biases from past tokens.
This approach directly affects attention weights rather than token embeddings.
Relative Position Bias
A simple relative bias table stores one scalar per relative distance. For sequence length $T$, relative offsets range from $-(T-1)$ to $T-1$. A practical implementation usually clips distances to a fixed range.

For $H$ heads, the bias can have shape

```
[H, 2 * max_distance + 1]
```

During attention, we build a bias matrix with shape

```
[H, T, T]
```

and add it to the attention scores.
A compact implementation:

```python
import math

import torch
from torch import nn

class RelativePositionBias(nn.Module):
    def __init__(self, num_heads: int, max_distance: int):
        super().__init__()
        self.num_heads = num_heads
        self.max_distance = max_distance
        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, T: int) -> torch.Tensor:
        device = self.bias.weight.device
        positions = torch.arange(T, device=device)
        rel = positions[:, None] - positions[None, :]
        rel = rel.clamp(-self.max_distance, self.max_distance)
        rel = rel + self.max_distance
        # [T, T, H]
        bias = self.bias(rel)
        # [H, T, T]
        return bias.permute(2, 0, 1)
```

Using it inside attention:

```python
B, H, T, D = 2, 8, 16, 64
q = torch.randn(B, H, T, D)
k = torch.randn(B, H, T, D)
v = torch.randn(B, H, T, D)

scores = q @ k.transpose(-2, -1)
scores = scores / math.sqrt(D)

rel_bias = RelativePositionBias(num_heads=H, max_distance=32)
scores = scores + rel_bias(T).unsqueeze(0)

weights = torch.softmax(scores, dim=-1)
out = weights @ v
print(out.shape)  # torch.Size([2, 8, 16, 64])
```

Relative bias is widely used because it is simple, cheap, and effective.
Rotary Positional Embeddings
Rotary positional embeddings, often called RoPE, encode position by rotating query and key vectors. Instead of adding a position vector to the input, RoPE applies a position-dependent rotation in each pair of hidden dimensions.
For a two-dimensional pair, a vector is rotated by an angle depending on the token position. If the position is $m$, the rotation is

$$R_{m\theta} = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix}$$
This is applied to query and key vectors before computing attention. The dot product between a rotated query and a rotated key then depends on their relative positions.
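The relative-position property follows from rotation composition: rotations about the same axis compose by adding angles, so for a query at position $m$ and a key at position $n$,

$$
(R_{m\theta}\, q)^{\top} (R_{n\theta}\, k)
= q^{\top} R_{m\theta}^{\top} R_{n\theta}\, k
= q^{\top} R_{(n-m)\theta}\, k
$$

The score depends only on the offset $n - m$, not on $m$ and $n$ individually.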
A simplified PyTorch implementation:

```python
import torch

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # swap the two halves of the last dimension and negate the second half,
    # matching the half-split pairing used by rope_cache below
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x: [B, H, T, D]
    # cos: [T, D]
    # sin: [T, D]
    return x * cos[None, None, :, :] + rotate_half(x) * sin[None, None, :, :]

def rope_cache(T: int, D: int, device=None, base: float = 10000.0):
    inv_freq = 1.0 / (
        base ** (torch.arange(0, D, 2, device=device).float() / D)
    )
    positions = torch.arange(T, device=device).float()
    freqs = torch.einsum("t,d->td", positions, inv_freq)
    # each frequency drives one dimension pair (i, i + D // 2)
    emb = torch.cat([freqs, freqs], dim=-1)
    return emb.cos(), emb.sin()
```

Usage:

```python
B, H, T, D = 2, 8, 128, 64
q = torch.randn(B, H, T, D)
k = torch.randn(B, H, T, D)

cos, sin = rope_cache(T, D, device=q.device)
q = apply_rope(q, cos, sin)
k = apply_rope(k, cos, sin)
```

RoPE is common in modern decoder-only language models because it works naturally with causal attention and supports relative-position behavior.
ALiBi
ALiBi, short for Attention with Linear Biases, adds a head-specific linear penalty based on distance. It does not add position embeddings to token vectors. It modifies attention scores directly.
For causal attention, the bias often has the form

$$b_{h,ij} = -\,m_h\,(i - j), \qquad j \le i$$

Here $m_h$ is a slope assigned to attention head $h$. Larger distances receive a larger negative bias, so the model has an inductive preference for nearer tokens. Different heads use different slopes, allowing some heads to focus locally and others to attend more broadly.
The main appeal is extrapolation. Since the bias is defined by distance, it can be applied to longer sequences than those seen during training.
A simple causal ALiBi bias:

```python
import torch

def alibi_bias(num_heads: int, T: int, device=None) -> torch.Tensor:
    # simplified slope schedule: 1, 1/2, 1/4, ...
    slopes = torch.tensor(
        [1.0 / (2 ** i) for i in range(num_heads)],
        device=device,
    )
    positions = torch.arange(T, device=device)
    distance = positions[:, None] - positions[None, :]
    distance = distance.clamp(min=0).float()
    # [H, T, T]
    return -slopes[:, None, None] * distance[None, :, :]
```

In a real implementation, slopes are usually chosen by a specific schedule. The core idea remains simple: attention scores are biased by relative distance.
Position Interpolation and Long Contexts
Absolute learned embeddings have a fixed maximum length. Sinusoidal encodings, RoPE, and ALiBi can be applied beyond the training length, but performance may still degrade if the model was not trained on long contexts.
Long-context extension methods often modify positional encodings. One common approach is position interpolation. Instead of feeding positions directly, positions are rescaled into the range seen during training.
If a model was trained up to length $L$ and we want to use length $L' > L$, we can map test position $p$ to

$$p' = p \cdot \frac{L}{L'}$$

This compresses long positions into the trained range. Variants of this idea are used with rotary embeddings and other long-context adaptations.
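The rescaling itself is a one-line transform. The sketch below (the function name `interpolated_positions` is illustrative, not a library API) maps positions for a doubled context back into the trained range; the resulting fractional positions can then be fed to a sinusoidal or rotary cache:

```python
import torch

def interpolated_positions(T_new: int, T_train: int) -> torch.Tensor:
    # rescale positions 0 .. T_new-1 into the trained range [0, T_train)
    scale = T_train / T_new
    return torch.arange(T_new).float() * scale

pos = interpolated_positions(T_new=4096, T_train=2048)
print(pos[-1])  # tensor(2047.5000)
```

The last position stays below the trained maximum, at the cost of squeezing neighboring tokens closer together in position space.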
Long-context behavior depends on more than position encodings. Data distribution, attention implementation, optimization, memory, and evaluation all matter.
Positional Encoding in Images
Vision transformers divide an image into patches. Each patch becomes a token. Positional information tells the model where each patch came from.
For an image split into a grid of patches, a learned positional embedding may have shape

```
[1, num_patches + 1, D]
```

The extra position is often used for a class token.

For a 224×224 image with 16×16 patches, the patch grid is 14×14, so there are 196 patch tokens.
If using a class token, the sequence length is 197.
```python
import torch
from torch import nn

B = 8
num_patches = 196
D = 768

patch_tokens = torch.randn(B, num_patches, D)
cls_token = torch.randn(B, 1, D)
x = torch.cat([cls_token, patch_tokens], dim=1)

pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, D))
x = x + pos_embedding
print(x.shape)  # torch.Size([8, 197, 768])
```

For images of different resolutions, learned absolute position embeddings may need interpolation over the two-dimensional patch grid.
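A common way to do this, seen in several ViT implementations, is to reshape the patch position table into its 2D grid, resize it with image interpolation, and flatten it back. A sketch with illustrative grid sizes (14×14 trained, 16×16 target, class token kept as-is):

```python
import torch
import torch.nn.functional as F

D = 768
old_grid, new_grid = 14, 16  # e.g. 224px -> 256px at patch size 16
pos = torch.randn(1, old_grid * old_grid + 1, D)  # includes class token

cls_pos, patch_pos = pos[:, :1], pos[:, 1:]
# [1, 196, D] -> [1, D, 14, 14] so F.interpolate treats it as an image
patch_pos = patch_pos.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
patch_pos = F.interpolate(
    patch_pos, size=(new_grid, new_grid), mode="bicubic", align_corners=False
)
# [1, D, 16, 16] -> [1, 256, D]
patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)

new_pos = torch.cat([cls_pos, patch_pos], dim=1)
print(new_pos.shape)  # torch.Size([1, 257, 768])
```

The class token position is not part of the spatial grid, so it is excluded from the interpolation and re-attached afterwards.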
Positional Encoding in Time Series and Audio
Time series and audio often require additional care. Position may represent sample index, time step, frequency bin, or window index.
For raw audio, the sequence length can be very large. Direct full self-attention over all samples is usually too expensive. Models often operate on frames, patches, spectrograms, or compressed latent representations.
For time series, absolute time may matter in some tasks and relative lag may matter in others. A forecasting model may need calendar features such as hour, day, weekday, or season. These can be represented as additional embeddings or continuous features.
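As an illustration, categorical calendar features can be embedded and summed with the input exactly like position embeddings. The module below (`CalendarEmbedding` is a hypothetical name for this sketch, not a standard API) embeds hour-of-day and day-of-week indices:

```python
import torch
from torch import nn

class CalendarEmbedding(nn.Module):
    # hypothetical example: sum learned hour and weekday embeddings into x
    def __init__(self, d_model: int):
        super().__init__()
        self.hour = nn.Embedding(24, d_model)
        self.weekday = nn.Embedding(7, d_model)

    def forward(self, x, hours, weekdays):
        # x: [B, T, D]; hours, weekdays: [B, T] integer indices
        return x + self.hour(hours) + self.weekday(weekdays)

B, T, D = 4, 48, 64
x = torch.randn(B, T, D)
hours = torch.arange(T).remainder(24).expand(B, T)
weekdays = (torch.arange(T) // 24).remainder(7).expand(B, T)

emb = CalendarEmbedding(D)
out = emb(x, hours, weekdays)
print(out.shape)  # torch.Size([4, 48, 64])
```

Continuous features such as a normalized timestamp can instead be projected with a linear layer and added the same way.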
Thus positional encoding is not one universal object. It should match the structure of the data.
Choosing a Positional Method
The right positional method depends on the model and task.
| Method | Strength | Limitation |
|---|---|---|
| Learned absolute embeddings | Simple and strong within trained length | Weak extrapolation beyond max length |
| Sinusoidal encoding | Fixed, deterministic, no learned table | Less dominant in modern LLMs |
| Relative position bias | Directly models pairwise distance | Usually tied to attention implementation |
| RoPE | Strong for causal language models | Long-context scaling needs care |
| ALiBi | Simple extrapolation behavior | Less expressive than some alternatives |
| 2D position embeddings | Natural for images | Resolution changes may need interpolation |
For a first PyTorch transformer, learned absolute embeddings are easiest. For a modern decoder-only language model, RoPE is a common default. For vision transformers, learned 2D or flattened patch embeddings are common. For long-context experiments, RoPE scaling or ALiBi are useful starting points.
Summary
Self-attention needs positional information because content-only attention does not encode order. Positional encoding adds order information to token representations or attention scores.
Absolute embeddings assign vectors to positions. Sinusoidal encodings use fixed waves. Relative position bias modifies pairwise attention scores. RoPE rotates query and key vectors so dot products depend on relative position. ALiBi adds distance-based linear bias.
The choice of positional method affects extrapolation, long-context behavior, implementation complexity, and model quality.