Positional Encoding

Self-attention compares tokens to other tokens, but by itself it has no built-in notion of order. If we permute the input sequence and apply the same self-attention operation, the attention mechanism still compares all tokens in the same content-based way. A transformer therefore needs an additional signal that tells it where each token appears.

Positional encoding is the mechanism that injects order information into a transformer. It gives the model access to position, distance, and sometimes direction.

Why Position Is Needed

Consider the two sentences:

The dog chased the cat.
The cat chased the dog.

They contain almost the same words, but their meanings differ because the order differs. A model that ignores order cannot reliably distinguish subject from object.

For sequence input

X \in \mathbb{R}^{B \times T \times D},

self-attention computes interactions among positions, but the operation itself does not know that position 2 comes before position 5. Positional information must be added to the token representation or to the attention computation.

The usual input form is

H^{(0)} = X_{\text{tok}} + X_{\text{pos}}.

Here X_{\text{tok}} contains token embeddings and X_{\text{pos}} contains positional vectors.

Absolute Positional Embeddings

The simplest method is to learn one vector for each position.

If the maximum sequence length is T_{\max}, the learned positional embedding table is

P \in \mathbb{R}^{T_{\max} \times D}.

For a sequence of length T, we select the first T rows:

P_{0:T} \in \mathbb{R}^{T \times D}.

The input becomes

H^{(0)}_{b,t,:} = E_{x_{b,t}} + P_t.

In PyTorch:

import torch
from torch import nn

# Batch size, sequence length, vocabulary size, model dimension.
B, T, V, D = 4, 16, 30_000, 768

tokens = torch.randint(0, V, (B, T))

token_emb = nn.Embedding(V, D)
pos_emb = nn.Embedding(512, D)  # one learned vector per position, up to a maximum length of 512

# Position indices 0..T-1, broadcast across the batch.
positions = torch.arange(T, device=tokens.device)
positions = positions.unsqueeze(0).expand(B, T)

# H^(0) = token embedding + positional embedding.
x = token_emb(tokens) + pos_emb(positions)

print(x.shape)  # torch.Size([4, 16, 768])

Learned absolute embeddings are simple and effective. Their main limitation is that they are tied to a maximum context length. Extending a model beyond the trained length requires interpolation, extrapolation, or retraining.

Sinusoidal Positional Encoding

The original transformer used fixed sinusoidal positional encodings. Instead of learning P, the encoding is computed from sine and cosine functions.

For position p and feature index i, the encoding is

\text{PE}(p, 2i) = \sin\left(\frac{p}{10000^{2i/D}}\right), \qquad \text{PE}(p, 2i+1) = \cos\left(\frac{p}{10000^{2i/D}}\right).

Each dimension varies at a different frequency. Low-index dimensions vary quickly. High-index dimensions vary slowly.

In PyTorch:

import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int):
    # Positions 0..max_len-1 as a column, and even feature indices 0, 2, 4, ...
    positions = torch.arange(max_len).unsqueeze(1)
    dims = torch.arange(0, d_model, 2)

    # 10000^(-2i / d_model): one frequency per dimension pair.
    scale = torch.exp(-math.log(10000.0) * dims / d_model)

    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(positions * scale)
    pe[:, 1::2] = torch.cos(positions * scale)

    return pe

pe = sinusoidal_positional_encoding(512, 768)
print(pe.shape)  # torch.Size([512, 768])
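
As a quick usage check, the first T rows of the table can be added to a batch of token embeddings just as with the learned variant. Here token_vectors is a random stand-in for token embeddings, and pe comes from the snippet above:

B, T, D = 4, 16, 768
token_vectors = torch.randn(B, T, D)     # stand-in for token embeddings
x = token_vectors + pe[:T].unsqueeze(0)  # (1, T, D) broadcasts over the batch
print(x.shape)  # torch.Size([4, 16, 768])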

Sinusoidal encoding has no learned parameters. It can be computed for positions beyond those seen during training, although extrapolation quality still depends on the model.

Relative Position

Absolute position tells the model where a token is. Relative position tells the model how far two tokens are from each other.

For many tasks, distance matters more than absolute index. In the sentence

The animal that the child saw ran away.

syntax depends on relations among words, not merely on their absolute positions.

Relative positional methods modify attention scores using a term based on i - j, the distance between query position i and key position j.

A simplified form is

S_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}} + b_{i-j}.

Here b_{i-j} is a learned bias for the relative distance between the two positions.

This lets the model learn patterns such as “nearby tokens are often important” or “previous tokens matter more than distant tokens” without relying only on absolute indices.
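
A minimal sketch of this idea, assuming a single scalar bias shared across heads and a clamped maximum distance (RelativePositionBias and max_distance are illustrative names, not a specific library API):

import torch
from torch import nn

class RelativePositionBias(nn.Module):
    def __init__(self, max_distance: int = 128):
        super().__init__()
        # One learned scalar per relative distance in [-max_distance, max_distance].
        self.max_distance = max_distance
        self.bias_table = nn.Parameter(torch.zeros(2 * max_distance + 1))

    def forward(self, T: int):
        # Matrix of relative distances i - j, clamped to the supported range.
        pos = torch.arange(T)
        rel = (pos[:, None] - pos[None, :]).clamp(-self.max_distance, self.max_distance)
        # Shift to non-negative indices, then look up b_{i-j} for every (i, j) pair.
        return self.bias_table[rel + self.max_distance]  # (T, T)

scores = torch.randn(2, 8, 16, 16)            # (B, heads, T, T) raw attention scores
scores = scores + RelativePositionBias()(16)  # broadcasts over batch and heads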

Attention Biases

Many modern transformer variants use attention biases instead of adding positional vectors to token embeddings. The idea is to alter the attention score matrix directly.

For attention scores

S \in \mathbb{R}^{B \times h \times T \times T},

we add a position-dependent bias

S' = S + B_{\text{pos}}.

The attention weights become

A = \text{softmax}(S').

The bias tensor may encode distance, direction, segment information, or task-specific structure.

This approach is common because attention is where token-token relations are computed. Adding positional information at this point gives the model direct access to relative layout.
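
In code, this amounts to adding the bias to the score tensor before the softmax. A small sketch, with a random tensor standing in for B_pos:

import torch

B, h, T = 2, 8, 16
scores = torch.randn(B, h, T, T)  # S: raw query-key scores
pos_bias = torch.randn(h, T, T)   # placeholder for B_pos, shared across the batch

attn = torch.softmax(scores + pos_bias, dim=-1)  # A = softmax(S + B_pos) over key positions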

Rotary Positional Embeddings

Rotary positional embeddings, usually called RoPE, encode position by rotating query and key vectors. Instead of adding a positional vector to the input, RoPE applies a position-dependent rotation inside attention.

For each pair of dimensions, RoPE applies a 2D rotation:

\begin{bmatrix} x'_{2i} \\ x'_{2i+1} \end{bmatrix} = \begin{bmatrix} \cos \theta_{p,i} & -\sin \theta_{p,i} \\ \sin \theta_{p,i} & \cos \theta_{p,i} \end{bmatrix} \begin{bmatrix} x_{2i} \\ x_{2i+1} \end{bmatrix}.

The angle depends on the position p and the dimension pair i. Queries and keys are rotated before computing dot products.

The important property is that the dot product between a rotated query and a rotated key depends on their relative offset. Thus RoPE gives attention access to relative position while keeping the computation efficient.
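
Concretely, for one dimension pair, write R_\theta for the 2D rotation by angle \theta and let \theta_{p,i} = p\,\omega_i for a per-pair frequency \omega_i. Because rotations compose,

(R_{\theta_{p,i}} q)^\top (R_{\theta_{p',i}} k) = q^\top R_{\theta_{p',i} - \theta_{p,i}}\, k = q^\top R_{(p'-p)\,\omega_i}\, k,

so the score between a query at position p and a key at position p' depends only on the offset p' - p.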

A simplified PyTorch implementation:

import torch

def rotate_half(x):
    # Split interleaved pairs (x_0, x_1), (x_2, x_3), ... and map each (a, b) to (-b, a).
    x_even = x[..., 0::2]
    x_odd = x[..., 1::2]
    return torch.stack((-x_odd, x_even), dim=-1).flatten(-2)

def apply_rope(x, cos, sin):
    # Per pair: x'_{2i} = x_{2i} cos - x_{2i+1} sin and x'_{2i+1} = x_{2i+1} cos + x_{2i} sin.
    return x * cos + rotate_half(x) * sin

In practice, RoPE is applied to query and key tensors with shapes such as

[B, heads, T, head_dim]
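
One way the cos and sin tables might be built, assuming the interleaved pairing used by rotate_half above (rope_cache and the base value of 10000 are illustrative choices rather than a fixed API):

import torch

def rope_cache(max_len: int, head_dim: int, base: float = 10000.0):
    # One frequency per dimension pair, following the sinusoidal schedule base^(-2i / head_dim).
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    angles = torch.outer(torch.arange(max_len).float(), inv_freq)  # (max_len, head_dim // 2)
    # Repeat each angle twice so it lines up with the interleaved pairs in rotate_half.
    angles = angles.repeat_interleave(2, dim=-1)                   # (max_len, head_dim)
    # Add singleton batch and head dimensions for broadcasting over (B, heads, T, head_dim).
    return angles.cos()[None, None], angles.sin()[None, None]

B, heads, T, head_dim = 2, 12, 16, 64
q = torch.randn(B, heads, T, head_dim)
k = torch.randn(B, heads, T, head_dim)

cos, sin = rope_cache(T, head_dim)
q, k = apply_rope(q, cos, sin), apply_rope(k, cos, sin)  # rotate before the dot product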

RoPE is widely used in decoder-only language models because it works well with causal attention and long-context extensions.

ALiBi

ALiBi, short for Attention with Linear Biases, adds a head-specific linear penalty based on distance. For query position i and key position j, the attention score is modified as

S_{ij} = \frac{q_i^\top k_j}{\sqrt{d_k}} - m_h (i - j),

where m_h > 0 is a slope for attention head h.

In causal language models, j \le i, so i - j is the distance into the past. More distant tokens receive a larger penalty, and each head applies a different slope, so some heads attend mostly locally while others keep a wider view.

ALiBi has no learned position embedding table and can extrapolate to longer sequences more naturally than learned absolute embeddings. It is simple and memory efficient.
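
A minimal sketch of building the bias for all heads at once. The geometric slope sequence follows the one described in the ALiBi paper for a power-of-two head count; alibi_bias is an illustrative name:

import torch

def alibi_bias(num_heads: int, T: int):
    # Head-specific slopes: 2^(-8/n), 2^(-16/n), ..., a geometric sequence.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # Distance into the past, i - j; entries above the diagonal are removed by the causal mask anyway.
    pos = torch.arange(T)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)  # (T, T)
    # Linear penalty that grows with distance, added to scores before masking and softmax.
    return -slopes[:, None, None] * distance               # (num_heads, T, T)

scores = torch.randn(2, 8, 16, 16)   # (B, heads, T, T)
scores = scores + alibi_bias(8, 16)  # broadcasts over the batch dimension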

Positional Encoding for Vision Transformers

In Vision Transformers, position refers to patch location rather than word order.

An image of size H \times W is split into patches. If each patch has size P \times P, the number of patch positions is

T = \frac{H}{P} \cdot \frac{W}{P}.
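
For example, a 224 \times 224 image split into 16 \times 16 patches gives T = 14 \cdot 14 = 196 patch positions.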

Each patch receives a positional embedding. For a 2D image grid, position may be represented by:

Learned 1D embedding: flatten the patches into a sequence and learn one vector per index.
Learned 2D embedding: learn separate row and column embeddings.
Sinusoidal 2D encoding: use fixed sine and cosine functions of height and width.
Relative position bias: add an attention bias based on patch offset.
Rotary position: apply rotations based on 2D patch position.

If images have different resolutions, learned absolute embeddings may need interpolation. Relative position methods are often more flexible for variable image sizes.
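
A sketch of that interpolation step, assuming a square patch grid and a positional table with no class-token row (interpolate_pos_embed is an illustrative name):

import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid: int, new_grid: int):
    # pos_embed: (old_grid * old_grid, D) learned table, one row per patch position.
    D = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)  # (1, D, H, W)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(new_grid * new_grid, D)

pos = torch.randn(14 * 14, 768)               # table trained on a 14 x 14 patch grid
pos_big = interpolate_pos_embed(pos, 14, 16)  # reused on a 16 x 16 patch grid
print(pos_big.shape)  # torch.Size([256, 768])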

Position and Length Extrapolation

Length extrapolation means using a transformer on sequences longer than those seen during training.

This is difficult because a model may learn behavior that depends on the training context length. Even if the positional method can represent longer positions, the model may not use them correctly.

Common strategies include:

Train on longer contexts: direct but expensive.
Positional interpolation: rescale positions into the trained range (a sketch follows this list).
RoPE scaling: adjust RoPE frequencies for longer contexts.
Sliding-window attention: limit attention to local windows.
Sparse global attention: combine local attention with selected global tokens.
Retrieval augmentation: move long-range information outside the context window.
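
As a small illustration of the positional-interpolation idea, positions can be rescaled into the trained range before computing RoPE rotation angles. The linear scaling rule and the helper name are assumptions for this sketch:

import torch

def interpolated_rope_angles(T: int, head_dim: int, trained_len: int, base: float = 10000.0):
    # Squeeze positions 0..T-1 back into the range the model was trained on.
    scale = min(1.0, trained_len / T)
    positions = torch.arange(T).float() * scale
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions, inv_freq)  # (T, head_dim // 2) rotation angles

angles = interpolated_rope_angles(T=8192, head_dim=64, trained_len=4096)
print(angles.shape)  # torch.Size([8192, 32])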

Long-context modeling requires both positional design and attention efficiency. Position encoding alone does not solve the memory cost of full attention.

Choosing a Positional Method

For small educational models, learned absolute embeddings are easiest. They work well when the maximum sequence length is fixed.

For encoder models, learned absolute embeddings and relative position biases are both common. For decoder-only language models, RoPE is a common default. For very long contexts, RoPE scaling, ALiBi, sparse attention, or hybrid memory mechanisms may be considered.

Small text classifier: learned absolute embeddings.
BERT-style encoder: learned absolute embeddings or relative bias.
GPT-style decoder: RoPE.
Long-context decoder: RoPE scaling or ALiBi.
Vision Transformer: learned 2D or interpolated absolute embeddings.
Windowed vision model: relative position bias.

The choice affects extrapolation, memory use, implementation complexity, and compatibility with pretrained checkpoints.

Summary

Self-attention needs positional information because attention alone treats input positions symmetrically. Positional encoding gives a transformer access to order and distance.

Absolute positional embeddings add one vector per position. Sinusoidal encodings compute fixed vectors from sine and cosine functions. Relative position methods alter attention scores using token distances. RoPE rotates query and key vectors so attention depends on relative offsets. ALiBi adds distance-based linear biases.

A practical transformer implementation should make positional encoding an explicit design choice. The right method depends on the architecture, task, context length, and expected extrapolation behavior.