Positional Encoding

Self-attention compares tokens to other tokens, but by itself it has no built-in notion of order. If we permute the input sequence and apply the same self-attention operation, the attention mechanism still compares all tokens in the same content-based way. A transformer therefore needs an additional signal that tells it where each token appears.

Positional encoding is the mechanism that injects order information into a transformer. It gives the model access to position, distance, and sometimes direction.

Why Position Is Needed

Consider the two sentences:

The dog chased the cat.
The cat chased the dog.

They contain almost the same words, but their meanings differ because the order differs. A model that ignores order cannot reliably distinguish subject from object.

For sequence input

X \in \mathbb{R}^{B \times T \times D},

self-attention computes interactions among positions, but the operation itself does not know that position 2 comes before position 5. Positional information must be added to the token representation or to the attention computation.

The usual input form is

H^{(0)} = X_{\text{tok}} + X_{\text{pos}}.

Here X_{\text{tok}} contains token embeddings and X_{\text{pos}} contains positional vectors.

Absolute Positional Embeddings

The simplest method is to learn one vector for each position.

If the maximum sequence length is T_{\max}, the learned positional embedding table is

P \in \mathbb{R}^{T_{\max} \times D}.

For a sequence of length T, we select the first T rows:

P_{0:T} \in \mathbb{R}^{T \times D}.

The input becomes

H^{(0)}_{b,t,:} = E_{x_{b,t}} + P_t.

In PyTorch:

import torch
from torch import nn

# Batch size, sequence length, vocabulary size, model dimension.
B, T, V, D = 4, 16, 30_000, 768

tokens = torch.randint(0, V, (B, T))

token_emb = nn.Embedding(V, D)
pos_emb = nn.Embedding(512, D)  # one learned vector per position, up to a maximum length of 512

# Position indices 0..T-1, broadcast across the batch.
positions = torch.arange(T, device=tokens.device)
positions = positions.unsqueeze(0).expand(B, T)

# H^(0) = token embedding + positional embedding.
x = token_emb(tokens) + pos_emb(positions)

print(x.shape)  # torch.Size([4, 16, 768])

Learned absolute embeddings are simple and effective. Their main limitation is that they are tied to a maximum context length. Extending a model beyond the trained length requires interpolation, extrapolation, or retraining.

Sinusoidal Positional Encoding

The original transformer used fixed sinusoidal positional encodings. Instead of learning P, the encoding is computed from sine and cosine functions.

For position p and feature index i, the encoding is

\text{PE}(p, 2i) = \sin\left(\frac{p}{10000^{2i/D}}\right), \qquad \text{PE}(p, 2i+1) = \cos\left(\frac{p}{10000^{2i/D}}\right).

Each dimension varies at a different frequency. Low-index dimensions vary quickly. High-index dimensions vary slowly.

In PyTorch:

import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int):
    # Positions 0..max_len-1 as a column, and even feature indices 0, 2, 4, ...
    positions = torch.arange(max_len).unsqueeze(1)
    dims = torch.arange(0, d_model, 2)

    # 10000^(-2i / d_model): one frequency per dimension pair.
    scale = torch.exp(-math.log(10000.0) * dims / d_model)

    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(positions * scale)
    pe[:, 1::2] = torch.cos(positions * scale)

    return pe

pe = sinusoidal_positional_encoding(512, 768)
print(pe.shape)  # torch.Size([512, 768])
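
As a quick usage check, the first T rows of the table can be added to a batch of token embeddings just as with the learned variant. Here token_vectors is a random stand-in for token embeddings, and pe comes from the snippet above:

B, T, D = 4, 16, 768
token_vectors = torch.randn(B, T, D)     # stand-in for token embeddings
x = token_vectors + pe[:T].unsqueeze(0)  # (1, T, D) broadcasts over the batch
print(x.shape)  # torch.Size([4, 16, 768])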

Sinusoidal encoding has no learned parameters. It can be computed for positions beyond those seen during training, although extrapolation quality still depends on the model.

Relative Position

Absolute position tells the model where a token is. Relative position tells the model how far two tokens are from each other.

For many tasks, distance matters more than absolute index. In the sentence

The animal that the child saw ran away.

syntax depends on relations among words, not merely on their absolute positions.

Relative positional methods modify attention scores using a term based on i - j, the distance between query position i and key position j.

A simplified form is

S_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}} + b_{i-j}.

Here b_{i-j} is a learned bias for the relative distance between the two positions.

This lets the model learn patterns such as “nearby tokens are often important” or “previous tokens matter more than distant tokens” without relying only on absolute indices.
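
A minimal sketch of this idea, assuming a single scalar bias shared across heads and a clamped maximum distance (RelativePositionBias and max_distance are illustrative names, not a specific library API):

import torch
from torch import nn

class RelativePositionBias(nn.Module):
    def __init__(self, max_distance: int = 128):
        super().__init__()
        # One learned scalar per relative distance in [-max_distance, max_distance].
        self.max_distance = max_distance
        self.bias_table = nn.Parameter(torch.zeros(2 * max_distance + 1))

    def forward(self, T: int):
        # Matrix of relative distances i - j, clamped to the supported range.
        pos = torch.arange(T)
        rel = (pos[:, None] - pos[None, :]).clamp(-self.max_distance, self.max_distance)
        # Shift to non-negative indices, then look up b_{i-j} for every (i, j) pair.
        return self.bias_table[rel + self.max_distance]  # (T, T)

scores = torch.randn(2, 8, 16, 16)            # (B, heads, T, T) raw attention scores
scores = scores + RelativePositionBias()(16)  # broadcasts over batch and heads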

Attention Biases

Many modern transformer variants use attention biases instead of adding positional vectors to token embeddings. The idea is to alter the attention score matrix directly.

For attention scores

S \in \mathbb{R}^{B \times h \times T \times T},

we add a position-dependent bias

S' = S + B_{\text{pos}}.

The attention weights become

A = \text{softmax}(S').

The bias tensor may encode distance, direction, segment information, or task-specific structure.

This approach is common because attention is where token-token relations are computed. Adding positional information at this point gives the model direct access to relative layout.
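
In code, this amounts to adding the bias to the score tensor before the softmax. A small sketch, with a random tensor standing in for B_pos:

import torch

B, h, T = 2, 8, 16
scores = torch.randn(B, h, T, T)  # S: raw query-key scores
pos_bias = torch.randn(h, T, T)   # placeholder for B_pos, shared across the batch

attn = torch.softmax(scores + pos_bias, dim=-1)  # A = softmax(S + B_pos) over key positions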

Rotary Positional Embeddings

Rotary positional embeddings, usually called RoPE, encode position by rotating query and key vectors. Instead of adding a positional vector to the input, RoPE applies a position-dependent rotation inside attention.

For each pair of dimensions, RoPE applies a 2D rotation:

\begin{bmatrix} x'_{2i} \\ x'_{2i+1} \end{bmatrix} = \begin{bmatrix} \cos \theta_{p,i} & -\sin \theta_{p,i} \\ \sin \theta_{p,i} & \cos \theta_{p,i} \end{bmatrix} \begin{bmatrix} x_{2i} \\ x_{2i+1} \end{bmatrix}.

The angle depends on the position p and the dimension pair i. Queries and keys are rotated before computing dot products.

The important property is that the dot product between a rotated query and a rotated key depends on their relative offset. Thus RoPE gives attention access to relative position while keeping the computation efficient.
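
Concretely, for one dimension pair, write R_\theta for the 2D rotation by angle \theta and let \theta_{p,i} = p\,\omega_i for a per-pair frequency \omega_i. Because rotations compose,

(R_{\theta_{p,i}} q)^\top (R_{\theta_{p',i}} k) = q^\top R_{\theta_{p',i} - \theta_{p,i}}\, k = q^\top R_{(p'-p)\,\omega_i}\, k,

so the score between a query at position p and a key at position p' depends only on the offset p' - p.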

A simplified PyTorch implementation:

import torch

def rotate_half(x):
    # Split interleaved pairs (x_0, x_1), (x_2, x_3), ... and map each (a, b) to (-b, a).
    x_even = x[..., 0::2]
    x_odd = x[..., 1::2]
    return torch.stack((-x_odd, x_even), dim=-1).flatten(-2)

def apply_rope(x, cos, sin):
    # Per pair: x'_{2i} = x_{2i} cos - x_{2i+1} sin and x'_{2i+1} = x_{2i+1} cos + x_{2i} sin.
    return x * cos + rotate_half(x) * sin

In practice, RoPE is applied to query and key tensors with shapes such as

[B, heads, T, head_dim]
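
One way the cos and sin tables might be built, assuming the interleaved pairing used by rotate_half above (rope_cache and the base value of 10000 are illustrative choices rather than a fixed API):

import torch

def rope_cache(max_len: int, head_dim: int, base: float = 10000.0):
    # One frequency per dimension pair, following the sinusoidal schedule base^(-2i / head_dim).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(max_len).float(), inv_freq)  # (max_len, head_dim // 2)
    # Repeat each angle twice so it lines up with the interleaved pairs in rotate_half.
    angles = angles.repeat_interleave(2, dim=-1)                   # (max_len, head_dim)
    # Add singleton batch and head dimensions for broadcasting over (B, heads, T, head_dim).
    return angles.cos()[None, None], angles.sin()[None, None]

B, heads, T, head_dim = 2, 12, 16, 64
q = torch.randn(B, heads, T, head_dim)
k = torch.randn(B, heads, T, head_dim)

cos, sin = rope_cache(T, head_dim)
q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)  # rotate before the dot product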

RoPE is widely used in decoder-only language models because it works well with causal attention and long-context extensions.

ALiBi

ALiBi, short for Attention with Linear Biases, adds a head-specific linear penalty based on distance. For query position i and key position j, the attention score is modified as

S_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}} - m_h (i - j),

where m_h > 0 is a slope for attention head h.

In causal language models, j \le i, so i - j is the distance into the past. More distant tokens receive a larger penalty, and each head applies a different slope, so some heads attend mostly locally while others keep a wider view.

ALiBi has no learned position embedding table and can extrapolate to longer sequences more naturally than learned absolute embeddings. It is simple and memory efficient.
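
A minimal sketch of building the bias for all heads at once. The geometric slope sequence follows the one described in the ALiBi paper for a power-of-two head count; alibi_bias is an illustrative name:

import torch

def alibi_bias(num_heads: int, T: int):
    # Head-specific slopes: 2^(-8/n), 2^(-16/n), ..., a geometric sequence.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # Distance into the past, i - j; entries above the diagonal are removed by the causal mask anyway.
    pos = torch.arange(T)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)  # (T, T)
    # Linear penalty that grows with distance, added to scores before masking and softmax.
    return -slopes[:, None, None] * distance               # (num_heads, T, T)

scores = torch.randn(2, 8, 16, 16)   # (B, heads, T, T)
scores = scores + alibi_bias(8, 16)  # broadcasts over the batch dimension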

Positional Encoding for Vision Transformers

In Vision Transformers, position refers to patch location rather than word order.

An image of size H \times W is split into patches. If each patch has size P \times P, the number of patch positions is

T = \frac{H}{P} \cdot \frac{W}{P}.
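
For example, a 224 \times 224 image split into 16 \times 16 patches gives T = 14 \cdot 14 = 196 patch positions.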

Each patch receives a positional embedding. For a 2D image grid, position may be represented by:

Learned 1D embedding: flatten the patches into a sequence and learn one vector per index.
Learned 2D embedding: learn separate row and column embeddings.
Sinusoidal 2D encoding: use fixed sine and cosine functions of height and width.
Relative position bias: add an attention bias based on patch offset.
Rotary position: apply rotations based on 2D patch position.

If images have different resolutions, learned absolute embeddings may need interpolation. Relative position methods are often more flexible for variable image sizes.
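
A sketch of that interpolation step, assuming a square patch grid and a positional table with no class-token row (interpolate_pos_embed is an illustrative name):

import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid: int, new_grid: int):
    # pos_embed: (old_grid * old_grid, D) learned table, one row per patch position.
    D = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)  # (1, D, H, W)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(new_grid * new_grid, D)

pos = torch.randn(14 * 14, 768)               # table trained on a 14 x 14 patch grid
pos_big = interpolate_pos_embed(pos, 14, 16)  # reused on a 16 x 16 patch grid
print(pos_big.shape)  # torch.Size([256, 768])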

Position and Length Extrapolation

Length extrapolation means using a transformer on sequences longer than those seen during training.

This is difficult because a model may learn behavior that depends on the training context length. Even if the positional method can represent longer positions, the model may not use them correctly.

Common strategies include:

Train on longer contexts: direct but expensive.
Positional interpolation: rescale positions into the trained range (a sketch follows this list).
RoPE scaling: adjust RoPE frequencies for longer contexts.
Sliding-window attention: limit attention to local windows.
Sparse global attention: combine local attention with selected global tokens.
Retrieval augmentation: move long-range information outside the context window.
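
As a small illustration of the positional-interpolation idea, positions can be rescaled into the trained range before computing RoPE rotation angles. The linear scaling rule and the helper name are assumptions for this sketch:

import torch

def interpolated_rope_angles(T: int, head_dim: int, trained_len: int, base: float = 10000.0):
    # Squeeze positions 0..T-1 back into the range the model was trained on.
    scale = min(1.0, trained_len / T)
    positions = torch.arange(T).float() * scale
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions, inv_freq)  # (T, head_dim // 2) rotation angles

angles = interpolated_rope_angles(T=8192, head_dim=64, trained_len=4096)
print(angles.shape)  # torch.Size([8192, 32])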

Long-context modeling requires both positional design and attention efficiency. Position encoding alone does not solve the memory cost of full attention.

Choosing a Positional Method

For small educational models, learned absolute embeddings are easiest. They work well when the maximum sequence length is fixed.

For encoder models, learned absolute embeddings and relative position biases are both common. For decoder-only language models, RoPE is a common default. For very long contexts, RoPE scaling, ALiBi, sparse attention, or hybrid memory mechanisms may be considered.

Small text classifier: learned absolute embeddings.
BERT-style encoder: learned absolute embeddings or relative bias.
GPT-style decoder: RoPE.
Long-context decoder: RoPE scaling or ALiBi.
Vision Transformer: learned 2D or interpolated absolute embeddings.
Windowed vision model: relative position bias.

The choice affects extrapolation, memory use, implementation complexity, and compatibility with pretrained checkpoints.

Summary

Self-attention needs positional information because attention alone treats input positions symmetrically. Positional encoding gives a transformer access to order and distance.

Absolute positional embeddings add one vector per position. Sinusoidal encodings compute fixed vectors from sine and cosine functions. Relative position methods alter attention scores using token distances. RoPE rotates query and key vectors so attention depends on relative offsets. ALiBi adds distance-based linear biases.

A practical transformer implementation should make positional encoding an explicit design choice. The right method depends on the architecture, task, context length, and expected extrapolation behavior.