Self-attention compares tokens to other tokens, but by itself it has no built-in notion of order. If we permute the input sequence and apply the same self-attention operation, the attention mechanism still compares all tokens in the same content-based way. A transformer therefore needs an additional signal that tells it where each token appears.
Positional encoding is the mechanism that injects order information into a transformer. It gives the model access to position, distance, and sometimes direction.
Why Position Is Needed
Consider the two sentences:
The dog chased the cat.
The cat chased the dog.

They contain almost the same words, but their meanings differ because the order differs. A model that ignores order cannot reliably distinguish subject from object.
For a sequence input $X \in \mathbb{R}^{T \times d}$, self-attention computes interactions among positions, but the operation itself does not know that position 2 comes before position 5. Positional information must be added to the token representation or to the attention computation.

The usual input form is

$$X = E + P$$

Here $E$ contains token embeddings and $P$ contains positional vectors.
Absolute Positional Embeddings
The simplest method is to learn one vector for each position.
If the maximum sequence length is $L_{\max}$, the learned positional embedding table is

$$P \in \mathbb{R}^{L_{\max} \times d}$$

For a sequence of length $T$, we select the first $T$ rows:

$$P_{1:T} = (p_1, \dots, p_T)$$

The input becomes

$$X = E + P_{1:T}$$
In PyTorch:

```python
import torch
from torch import nn

B, T, V, D = 4, 16, 30_000, 768
tokens = torch.randint(0, V, (B, T))

token_emb = nn.Embedding(V, D)
pos_emb = nn.Embedding(512, D)  # table for a maximum context length of 512

positions = torch.arange(T, device=tokens.device)
positions = positions.unsqueeze(0).expand(B, T)

x = token_emb(tokens) + pos_emb(positions)
print(x.shape)  # torch.Size([4, 16, 768])
```

Learned absolute embeddings are simple and effective. Their main limitation is that they are tied to a maximum context length. Extending a model beyond the trained length requires interpolation, extrapolation, or retraining.
Sinusoidal Positional Encoding
The original transformer used fixed sinusoidal positional encodings. Instead of learning $P$, the encoding is computed from sine and cosine functions.

For position $pos$ and feature index $i$, the encoding is

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
Each dimension varies at a different frequency. Low-index dimensions vary quickly. High-index dimensions vary slowly.
In PyTorch:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int):
    positions = torch.arange(max_len).unsqueeze(1)          # (max_len, 1)
    dims = torch.arange(0, d_model, 2)                      # even feature indices 2i
    scale = torch.exp(-math.log(10000.0) * dims / d_model)  # 10000^(-2i/d)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(positions * scale)
    pe[:, 1::2] = torch.cos(positions * scale)
    return pe

pe = sinusoidal_positional_encoding(512, 768)
print(pe.shape)  # torch.Size([512, 768])
```

Sinusoidal encoding has no learned parameters. It can be computed for positions beyond those seen during training, although extrapolation quality still depends on the model.
Relative Position
Absolute position tells the model where a token is. Relative position tells the model how far two tokens are from each other.
For many tasks, distance matters more than absolute index. In the sentence
The animal that the child saw ran away.

Here, syntax depends on relations among words, not merely on their absolute positions.
Relative positional methods modify attention scores using a term based on $i - j$, the distance between query position $i$ and key position $j$.

A simplified form is

$$\text{score}(i, j) = \frac{q_i \cdot k_j}{\sqrt{d}} + b_{i-j}$$

Here $b_{i-j}$ is a learned bias for the relative distance between two positions.
This lets the model learn patterns such as “nearby tokens are often important” or “previous tokens matter more than distant tokens” without relying only on absolute indices.
Attention Biases
Many modern transformer variants use attention biases instead of adding positional vectors to token embeddings. The idea is to alter the attention score matrix directly.
For attention scores

$$S = \frac{QK^\top}{\sqrt{d}}$$

we add a position-dependent bias

$$S' = S + B$$

The attention weights become

$$A = \mathrm{softmax}(S')$$
The bias tensor may encode distance, direction, segment information, or task-specific structure.
This approach is common because attention is where token-token relations are computed. Adding positional information at this point gives the model direct access to relative layout.
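As a sketch of the idea, a learned bias vector indexed by relative offset can be added to a score matrix before the softmax. This assumes a single attention head, and names such as `rel_bias` are illustrative:

```python
import torch

T = 8
num_buckets = 2 * T - 1                     # covers offsets -(T-1) .. (T-1)
scores = torch.randn(T, T)                  # raw scores QK^T / sqrt(d)
rel_bias = torch.nn.Parameter(torch.zeros(num_buckets))  # one learned bias per offset

i = torch.arange(T).unsqueeze(1)            # query positions, shape (T, 1)
j = torch.arange(T).unsqueeze(0)            # key positions, shape (1, T)
offsets = (i - j) + (T - 1)                 # shift offsets into [0, 2T-2]

biased = scores + rel_bias[offsets]         # S' = S + B
weights = torch.softmax(biased, dim=-1)     # A = softmax(S')
print(weights.shape)  # torch.Size([8, 8])
```

During training the entries of `rel_bias` are updated by gradient descent, so the model can learn distance-dependent attention patterns shared across all query positions.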
Rotary Positional Embeddings
Rotary positional embeddings, usually called RoPE, encode position by rotating query and key vectors. Instead of adding a positional vector to the input, RoPE applies a position-dependent rotation inside attention.
For each pair of dimensions, RoPE applies a 2D rotation:

$$\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix} = \begin{pmatrix} \cos\theta_{m,i} & -\sin\theta_{m,i} \\ \sin\theta_{m,i} & \cos\theta_{m,i} \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}$$

The angle $\theta_{m,i}$ depends on the position $m$ and the dimension pair $i$. Queries and keys are rotated before computing dot products.
The important property is that the dot product between a rotated query and a rotated key depends on their relative offset. Thus RoPE gives attention access to relative position while keeping the computation efficient.
A simplified PyTorch implementation:

```python
def rotate_half(x):
    x_even = x[..., 0::2]
    x_odd = x[..., 1::2]
    return torch.stack((-x_odd, x_even), dim=-1).flatten(-2)

def apply_rope(x, cos, sin):
    return x * cos + rotate_half(x) * sin
```

In practice, RoPE is applied to query and key tensors with a shape such as `[B, heads, T, head_dim]`. RoPE is widely used in decoder-only language models because it works well with causal attention and long-context extensions.
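The snippet above leaves the `cos` and `sin` tables unspecified. A minimal sketch of constructing them for the interleaved pair layout used by `rotate_half` (the helper `rope_tables` is illustrative, not a library function):

```python
import torch

def rope_tables(max_len: int, head_dim: int, base: float = 10000.0):
    # One angle theta_{m,i} = m * base^(-2i/d) per dimension pair,
    # repeated twice so both members of a pair share the same angle.
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    angles = torch.arange(max_len).float().unsqueeze(1) * inv_freq  # (max_len, head_dim/2)
    angles = angles.repeat_interleave(2, dim=-1)                    # (max_len, head_dim)
    return angles.cos(), angles.sin()

def rotate_half(x):
    x_even = x[..., 0::2]
    x_odd = x[..., 1::2]
    return torch.stack((-x_odd, x_even), dim=-1).flatten(-2)

def apply_rope(x, cos, sin):
    return x * cos + rotate_half(x) * sin

cos, sin = rope_tables(max_len=16, head_dim=8)
q = torch.randn(16, 8)          # one query vector per position
q_rot = apply_rope(q, cos, sin)
print(q_rot.shape)  # torch.Size([16, 8])
```

A quick sanity check of the relative-offset property is that the dot product between a query rotated to position 5 and a key rotated to position 3 equals the one for positions 7 and 5, since both pairs have the same offset.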
ALiBi
ALiBi, short for Attention with Linear Biases, adds a head-specific linear penalty based on distance. For query position $i$ and key position $j$, the attention score is modified as

$$\text{score}(i, j) = \frac{q_i \cdot k_j}{\sqrt{d}} - m_h (i - j)$$

where $m_h$ is a slope for attention head $h$.

In causal language models, $j \le i$, so $i - j \ge 0$ is the distance into the past. More distant tokens receive different bias values depending on the head.
ALiBi has no learned position embedding table and can extrapolate to longer sequences more naturally than learned absolute embeddings. It is simple and memory efficient.
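A minimal sketch of constructing the ALiBi bias tensor, using a geometric slope schedule and assuming a power-of-two head count (the helper `alibi_bias` is illustrative):

```python
import torch

def alibi_bias(num_heads: int, T: int):
    # Geometric slopes m_h = 2^(-8h/num_heads), one per head.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads)
                           for h in range(num_heads)])
    i = torch.arange(T).unsqueeze(1)      # query positions
    j = torch.arange(T).unsqueeze(0)      # key positions
    distance = (i - j).clamp(min=0)       # causal distance into the past
    return -slopes.view(num_heads, 1, 1) * distance  # (heads, T, T)

bias = alibi_bias(num_heads=8, T=16)
print(bias.shape)  # torch.Size([8, 16, 16])
```

The bias is simply added to the attention scores before the softmax; nothing is stored per position, which is why the method extends to longer sequences without a larger table.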
Positional Encoding for Vision Transformers
In Vision Transformers, position refers to patch location rather than word order.
An image of size $H \times W$ is split into patches. If each patch has size $P \times P$, the number of patch positions is

$$N = \frac{H}{P} \cdot \frac{W}{P}$$
Each patch receives a positional embedding. For a 2D image grid, position may be represented by:
| Method | Description |
|---|---|
| Learned 1D embedding | Flatten patches into a sequence and learn one vector per index |
| Learned 2D embedding | Learn separate row and column embeddings |
| Sinusoidal 2D encoding | Use fixed functions for height and width |
| Relative position bias | Add position bias based on patch offset |
| Rotary position | Apply rotations based on 2D position |
If images have different resolutions, learned absolute embeddings may need interpolation. Relative position methods are often more flexible for variable image sizes.
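The learned 2D embedding from the table above can be sketched by summing a row embedding and a column embedding per patch. Sizes follow a common ViT configuration, and the variable names are illustrative:

```python
import torch
from torch import nn

H, W, P, D = 224, 224, 16, 768
rows, cols = H // P, W // P        # 14 x 14 patch grid
N = rows * cols                    # 196 patch positions

row_emb = nn.Embedding(rows, D)
col_emb = nn.Embedding(cols, D)

r = torch.arange(rows).repeat_interleave(cols)  # row index of each flattened patch
c = torch.arange(cols).repeat(rows)             # column index of each flattened patch
pos = row_emb(r) + col_emb(c)                   # (N, D) positional embedding
print(pos.shape)  # torch.Size([196, 768])
```

Factoring position into row and column tables uses far fewer parameters than one vector per flattened index, and the row table can be reused across columns when the grid shape changes.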
Position and Length Extrapolation
Length extrapolation means using a transformer on sequences longer than those seen during training.
This is difficult because a model may learn behavior that depends on the training context length. Even if the positional method can represent longer positions, the model may not use them correctly.
Common strategies include:
| Strategy | Description |
|---|---|
| Train on longer contexts | Direct but expensive |
| Positional interpolation | Rescale positions into the trained range |
| RoPE scaling | Adjust RoPE frequencies for longer contexts |
| Sliding-window attention | Limit attention to local windows |
| Sparse global attention | Combine local and selected global tokens |
| Retrieval augmentation | Move long-range information outside the context window |
Long-context modeling requires both positional design and attention efficiency. Position encoding alone does not solve the memory cost of full attention.
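Positional interpolation from the table above can be illustrated by rescaling positions on a longer sequence back into the trained range before any encoding is computed. This is a simplified sketch, not a specific model's recipe:

```python
import torch

def interpolated_positions(new_len: int, trained_len: int):
    # Rescale positions in [0, new_len) into the trained range [0, trained_len),
    # so every position the model sees lies inside the range it was trained on.
    return torch.arange(new_len).float() * (trained_len / new_len)

pos = interpolated_positions(new_len=1024, trained_len=512)
print(pos.max().item())  # 511.5 -- stays inside the trained range
```

The resulting fractional positions can be fed to any encoding that accepts continuous inputs, such as sinusoidal tables or RoPE angles; learned embedding tables instead require interpolating between neighboring rows.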
Choosing a Positional Method
For small educational models, learned absolute embeddings are easiest. They work well when the maximum sequence length is fixed.
For encoder models, learned absolute embeddings and relative position biases are both common. For decoder-only language models, RoPE is a common default. For very long contexts, RoPE scaling, ALiBi, sparse attention, or hybrid memory mechanisms may be considered.
| Setting | Common choice |
|---|---|
| Small text classifier | Learned absolute embeddings |
| BERT-style encoder | Learned absolute embeddings or relative bias |
| GPT-style decoder | RoPE |
| Long-context decoder | RoPE scaling or ALiBi |
| Vision Transformer | Learned 2D or interpolated absolute embeddings |
| Windowed vision model | Relative position bias |
The choice affects extrapolation, memory use, implementation complexity, and compatibility with pretrained checkpoints.
Summary
Self-attention needs positional information because attention alone treats input positions symmetrically. Positional encoding gives a transformer access to order and distance.
Absolute positional embeddings add one vector per position. Sinusoidal encodings compute fixed vectors from sine and cosine functions. Relative position methods alter attention scores using token distances. RoPE rotates query and key vectors so attention depends on relative offsets. ALiBi adds distance-based linear biases.
A practical transformer implementation should make positional encoding an explicit design choice. The right method depends on the architecture, task, context length, and expected extrapolation behavior.