
Positional Encoding

Self-attention compares tokens by content. By itself, it has no built-in notion of token order. If a sequence is permuted, self-attention follows the permutation. It can still compare all tokens, but it cannot tell whether a token came first, second, or last unless position information is added.

This matters because many sequences are order-sensitive.

dog bites man
man bites dog

These two sequences contain the same words, but they have different meanings. A transformer needs a representation of both content and position.

Positional encoding supplies this missing information. It injects position-dependent signals into the token representations before attention, or into the attention computation itself.

Given token embeddings

X \in \mathbb{R}^{B \times T \times D},

a common form is

H = X + P,

where

P \in \mathbb{R}^{T \times D}

contains one position vector for each sequence index.

The result H has the same shape as X, but each token vector now contains information about both identity and position.

Why Self-Attention Needs Positions

A self-attention layer computes scores from pairwise dot products:

S = QK^\top.

If the input contains no positional signal, then the score between two tokens depends only on their learned content representations. The operation is permutation equivariant: permuting the input positions permutes the output positions in the same way.
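
A small sketch makes this concrete. It uses identity query, key, and value projections for brevity, so it is an illustration rather than a full attention layer: permuting the input tokens permutes the outputs in exactly the same way.

import torch

def content_attention(x: torch.Tensor) -> torch.Tensor:
    # Illustrative content-only attention with identity Q/K/V projections.
    scores = x @ x.transpose(-2, -1)
    weights = torch.softmax(scores, dim=-1)
    return weights @ x

x = torch.randn(1, 5, 16)
perm = torch.randperm(5)

out = content_attention(x)
out_perm = content_attention(x[:, perm])

# Permuting the inputs permutes the outputs identically.
print(torch.allclose(out[:, perm], out_perm, atol=1e-6))  # True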

Permutation equivariance is useful for sets and graphs, but it is insufficient for ordered sequences. Language, audio, time series, and action trajectories depend on order. The model must distinguish “A before B” from “B before A.”

Convolutional networks and recurrent networks have order built into their structure. Convolutions use local neighborhoods. Recurrent networks process tokens step by step. Transformers need an explicit positional mechanism.

Learned Absolute Positional Embeddings

The simplest method is to learn a vector for each position.

For a maximum context length T_{\max}, define an embedding table

P \in \mathbb{R}^{T_{\max} \times D}.

For a sequence of length T, we take the first T rows and add them to the token embeddings:

H_{b,t,:} = X_{b,t,:} + P_{t,:}.

In PyTorch:

import torch
from torch import nn

class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, max_length: int, d_model: int):
        super().__init__()
        self.position_embedding = nn.Embedding(max_length, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, T, D]
        B, T, D = x.shape

        positions = torch.arange(T, device=x.device)
        pos = self.position_embedding(positions)

        # pos: [T, D], broadcast to [B, T, D]
        return x + pos

Usage:

x = torch.randn(8, 128, 768)

pos = LearnedPositionalEmbedding(max_length=512, d_model=768)
h = pos(x)

print(h.shape)  # torch.Size([8, 128, 768])

Learned absolute positional embeddings are simple and effective. Their main limitation is that they are tied to the maximum length used during training. Extending them beyond the trained context length requires interpolation, extrapolation, or retraining.

Sinusoidal Positional Encoding

The original transformer used fixed sinusoidal positional encodings. Instead of learning position vectors, it defines them using sine and cosine waves at different frequencies.

For position p and channel index i, the encoding is

P_{p,2i} = \sin\left(\frac{p}{10000^{2i/D}}\right), \qquad P_{p,2i+1} = \cos\left(\frac{p}{10000^{2i/D}}\right).

Even dimensions use sine. Odd dimensions use cosine. Lower dimensions vary quickly. Higher dimensions vary slowly.

This gives each position a deterministic vector, and relative offsets can be expressed through linear relations among sinusoidal components.
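
Concretely, write \omega_i = 10000^{-2i/D} for the frequency of channel pair i. The angle-addition identities give

\sin(\omega_i (p + k)) = \sin(\omega_i p)\cos(\omega_i k) + \cos(\omega_i p)\sin(\omega_i k), \qquad \cos(\omega_i (p + k)) = \cos(\omega_i p)\cos(\omega_i k) - \sin(\omega_i p)\sin(\omega_i k),

so the encoding of position p + k is a fixed linear transformation of the encoding of position p, with coefficients that depend only on the offset k.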

In PyTorch:

import math
import torch
from torch import nn

class SinusoidalPositionalEncoding(nn.Module):
    def __init__(self, max_length: int, d_model: int):
        super().__init__()

        pe = torch.zeros(max_length, d_model)
        position = torch.arange(0, max_length).unsqueeze(1).float()

        div_term = torch.exp(
            torch.arange(0, d_model, 2).float()
            * (-math.log(10000.0) / d_model)
        )

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, T, D]
        T = x.shape[1]
        return x + self.pe[:T, :]

Usage:

x = torch.randn(4, 100, 256)

pos = SinusoidalPositionalEncoding(max_length=1000, d_model=256)
h = pos(x)

print(h.shape)  # torch.Size([4, 100, 256])

The tensor pe is registered as a buffer rather than a parameter. It is saved with the module and moved between devices, but it is not updated by gradient descent.

Absolute Versus Relative Position

Absolute positional methods assign a representation to each index:

position 0
position 1
position 2
...

This lets the model know where a token appears in the sequence. However, many sequence relations depend more on relative distance than absolute index.

For example, in language, local dependencies often depend on whether two words are close. In music, rhythm depends on offsets. In time series, recency often matters more than absolute timestamp.

Relative position methods represent the distance between tokens. Instead of only asking “what is the position of token i?”, they also ask “how far is token j from token i?”

A relative position bias modifies the attention score:

s_{ij} = \frac{q_i^\top k_j}{\sqrt{d_h}} + b_{i-j}.

Here b_{i-j} is a learned bias depending on the relative offset between positions i and j. Nearby tokens can receive different biases from distant tokens. Future tokens can receive different biases from past tokens.

This approach directly affects attention weights rather than token embeddings.

Relative Position Bias

A simple relative bias table stores one scalar per relative distance. For sequence length T, relative offsets range from -(T-1) to T-1. A practical implementation usually clips distances to a fixed range.

For H heads, the bias can have shape

[H, 2 * max_distance + 1]

During attention, we build a bias matrix with shape

[H, T, T]

and add it to the attention scores.

A compact implementation:

class RelativePositionBias(nn.Module):
    def __init__(self, num_heads: int, max_distance: int):
        super().__init__()
        self.num_heads = num_heads
        self.max_distance = max_distance

        self.bias = nn.Embedding(2 * max_distance + 1, num_heads)

    def forward(self, T: int) -> torch.Tensor:
        device = self.bias.weight.device

        positions = torch.arange(T, device=device)
        rel = positions[:, None] - positions[None, :]

        rel = rel.clamp(-self.max_distance, self.max_distance)
        rel = rel + self.max_distance

        # [T, T, H]
        bias = self.bias(rel)

        # [H, T, T]
        return bias.permute(2, 0, 1)

Using it inside attention:

B, H, T, D = 2, 8, 16, 64

q = torch.randn(B, H, T, D)
k = torch.randn(B, H, T, D)
v = torch.randn(B, H, T, D)

scores = q @ k.transpose(-2, -1)
scores = scores / math.sqrt(D)

rel_bias = RelativePositionBias(num_heads=H, max_distance=32)
scores = scores + rel_bias(T).unsqueeze(0)

weights = torch.softmax(scores, dim=-1)
out = weights @ v

print(out.shape)  # torch.Size([2, 8, 16, 64])

Relative bias is widely used because it is simple, cheap, and effective.

Rotary Positional Embeddings

Rotary positional embeddings, often called RoPE, encode position by rotating query and key vectors. Instead of adding a position vector to the input, RoPE applies a position-dependent rotation in each pair of hidden dimensions.

For a two-dimensional pair, a vector is rotated by an angle depending on the token position. If the position is p, the rotation is

\begin{bmatrix} x'_1 \\ x'_2 \end{bmatrix} = \begin{bmatrix} \cos \theta_p & -\sin \theta_p \\ \sin \theta_p & \cos \theta_p \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}.

This is applied to query and key vectors before computing attention. The dot product between a rotated query and a rotated key then depends on their relative positions.
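
To see why, note that each two-dimensional rotation is orthogonal and that rotations compose by adding angles. For a query rotated to position p and a key rotated to position p',

(R(\theta_{p})\, q)^\top (R(\theta_{p'})\, k) = q^\top R(\theta_{p})^\top R(\theta_{p'})\, k = q^\top R(\theta_{p'} - \theta_{p})\, k,

and since \theta_p grows linearly with p, the score depends on the two positions only through their offset.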

A simplified PyTorch implementation:

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    # Split the last dimension into two halves and rotate each pair (x1, x2) -> (-x2, x1).
    # This pairs dimension d with dimension d + D/2, matching rope_cache below,
    # which concatenates the frequency table twice along the last dimension.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # x:   [B, H, T, D]
    # cos: [T, D]
    # sin: [T, D]
    return x * cos[None, None, :, :] + rotate_half(x) * sin[None, None, :, :]

def rope_cache(T: int, D: int, device=None, base: float = 10000.0):
    inv_freq = 1.0 / (
        base ** (torch.arange(0, D, 2, device=device).float() / D)
    )

    positions = torch.arange(T, device=device).float()
    freqs = torch.einsum("t,d->td", positions, inv_freq)

    emb = torch.cat([freqs, freqs], dim=-1)
    return emb.cos(), emb.sin()

Usage:

B, H, T, D = 2, 8, 128, 64

q = torch.randn(B, H, T, D)
k = torch.randn(B, H, T, D)

cos, sin = rope_cache(T, D, device=q.device)

q = apply_rope(q, cos, sin)
k = apply_rope(k, cos, sin)
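
As a quick, illustrative sanity check of the relative-position property, we can place the same query and key vectors at different absolute positions with the same offset and confirm the scores match. The score_at helper below is introduced only for this demonstration and reuses rotate_half and the cos, sin cache from above.

q0 = torch.randn(64)
k0 = torch.randn(64)

def score_at(p_q: int, p_k: int) -> torch.Tensor:
    # Illustrative helper: rotate q0 to position p_q and k0 to position p_k,
    # then take their dot product.
    q_rot = q0 * cos[p_q] + rotate_half(q0) * sin[p_q]
    k_rot = k0 * cos[p_k] + rotate_half(k0) * sin[p_k]
    return (q_rot * k_rot).sum()

# Same offset (4), different absolute positions -> same score.
print(torch.allclose(score_at(4, 0), score_at(14, 10), atol=1e-4))  # True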

RoPE is common in modern decoder-only language models because it works naturally with causal attention and supports relative-position behavior.

ALiBi

ALiBi, short for Attention with Linear Biases, adds a head-specific linear penalty based on distance. It does not add position embeddings to token vectors. It modifies attention scores directly.

For causal attention, the bias often has the form

s_{ij} = \frac{q_i^\top k_j}{\sqrt{d_h}} - m_h (i - j), \quad j \leq i.

Here m_h is a slope assigned to attention head h. Larger distances receive a more negative bias, so the model has an inductive preference for nearer tokens. Different heads use different slopes, allowing some heads to focus locally and others to attend more broadly.

The main appeal is extrapolation. Since the bias is defined by distance, it can be applied to longer sequences than those seen during training.

A simple causal ALiBi bias:

def alibi_bias(num_heads: int, T: int, device=None) -> torch.Tensor:
    slopes = torch.tensor(
        [1.0 / (2 ** i) for i in range(num_heads)],
        device=device,
    )

    positions = torch.arange(T, device=device)
    distance = positions[:, None] - positions[None, :]
    distance = distance.clamp(min=0).float()

    # [H, T, T]
    return -slopes[:, None, None] * distance[None, :, :]

In a real implementation, slopes are usually chosen by a specific schedule. The core idea remains simple: attention scores are biased by relative distance.

Position Interpolation and Long Contexts

Absolute learned embeddings have a fixed maximum length. Sinusoidal encodings, RoPE, and ALiBi can be applied beyond the training length, but performance may still degrade if the model was not trained on long contexts.

Long-context extension methods often modify positional encodings. One common approach is position interpolation. Instead of feeding positions 0, \ldots, T-1 directly, positions are rescaled into the range seen during training.

If a model was trained up to length T_{\text{train}} and we want to use length T_{\text{test}}, we can map test position p to

p' = p \cdot \frac{T_{\text{train}}}{T_{\text{test}}}.

This compresses long positions into the trained range. Variants of this idea are used with rotary embeddings and other long-context adaptations.
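
A minimal sketch of this idea applied to the rotary cache from earlier, keeping the same base frequencies. The name interpolated_rope_cache is introduced here for illustration and is not a standard API.

def interpolated_rope_cache(T_test: int, D: int, T_train: int, base: float = 10000.0):
    # Illustrative sketch: rescale positions so the longest test position maps
    # back into the range seen during training, then build the usual cache.
    scale = T_train / T_test
    positions = torch.arange(T_test).float() * scale

    inv_freq = 1.0 / (base ** (torch.arange(0, D, 2).float() / D))
    freqs = torch.einsum("t,d->td", positions, inv_freq)

    emb = torch.cat([freqs, freqs], dim=-1)
    return emb.cos(), emb.sin()

cos, sin = interpolated_rope_cache(T_test=4096, D=64, T_train=2048)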

Long-context behavior depends on more than position encodings. Data distribution, attention implementation, optimization, memory, and evaluation all matter.

Positional Encoding in Images

Vision transformers divide an image into patches. Each patch becomes a token. Positional information tells the model where each patch came from.

For an image split into a grid of patches, a learned positional embedding may have shape

[1, num_patches + 1, D]

The extra position is often used for a class token.

For a 224 × 224 image with 16 × 16 patches, the patch grid is 14 × 14, so there are 196 patch tokens.

If using a class token, the sequence length is 197.

B = 8
num_patches = 196
D = 768

patch_tokens = torch.randn(B, num_patches, D)
cls_token = torch.randn(B, 1, D)

x = torch.cat([cls_token, patch_tokens], dim=1)

pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, D))
x = x + pos_embedding

print(x.shape)  # torch.Size([8, 197, 768])

For images of different resolutions, learned absolute position embeddings may need interpolation over the two-dimensional patch grid.
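
A common recipe is to reshape the patch position embeddings back onto their 2D grid, resize them with bicubic interpolation, and flatten again. The sketch below assumes a square grid and a leading class token; resize_pos_embedding is a name introduced here for illustration.

import torch.nn.functional as F

def resize_pos_embedding(pos_embedding: torch.Tensor, old_grid: int, new_grid: int) -> torch.Tensor:
    # Illustrative sketch. pos_embedding: [1, 1 + old_grid**2, D], class token first.
    cls_pos, patch_pos = pos_embedding[:, :1], pos_embedding[:, 1:]
    D = patch_pos.shape[-1]

    # [1, old_grid, old_grid, D] -> [1, D, old_grid, old_grid] for interpolation
    patch_pos = patch_pos.reshape(1, old_grid, old_grid, D).permute(0, 3, 1, 2)
    patch_pos = F.interpolate(patch_pos, size=(new_grid, new_grid), mode="bicubic", align_corners=False)

    # Back to [1, new_grid * new_grid, D]
    patch_pos = patch_pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, D)
    return torch.cat([cls_pos, patch_pos], dim=1)

new_pos = resize_pos_embedding(pos_embedding, old_grid=14, new_grid=16)
print(new_pos.shape)  # torch.Size([1, 257, 768])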

Positional Encoding in Time Series and Audio

Time series and audio often require additional care. Position may represent sample index, time step, frequency bin, or window index.

For raw audio, the sequence length can be very large. Direct full self-attention over all samples is usually too expensive. Models often operate on frames, patches, spectrograms, or compressed latent representations.

For time series, absolute time may matter in some tasks and relative lag may matter in others. A forecasting model may need calendar features such as hour, day, weekday, or season. These can be represented as additional embeddings or continuous features.
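
A minimal sketch of the embedding route, assuming integer hour-of-day and day-of-week indices are available for each time step. CalendarEmbedding is a name introduced here for illustration.

class CalendarEmbedding(nn.Module):
    # Illustrative sketch: add calendar-feature embeddings to time-series tokens.
    def __init__(self, d_model: int):
        super().__init__()
        self.hour = nn.Embedding(24, d_model)
        self.weekday = nn.Embedding(7, d_model)

    def forward(self, x: torch.Tensor, hour: torch.Tensor, weekday: torch.Tensor) -> torch.Tensor:
        # x: [B, T, D]; hour, weekday: [B, T] integer indices
        return x + self.hour(hour) + self.weekday(weekday)

x = torch.randn(2, 48, 128)
hour = torch.randint(0, 24, (2, 48))
weekday = torch.randint(0, 7, (2, 48))

h = CalendarEmbedding(d_model=128)(x, hour, weekday)
print(h.shape)  # torch.Size([2, 48, 128])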

Positional encoding is therefore not a single universal recipe. It should match the structure of the data.

Choosing a Positional Method

The right positional method depends on the model and task.

Method | Strength | Limitation
Learned absolute embeddings | Simple and strong within trained length | Weak extrapolation beyond max length
Sinusoidal encoding | Fixed, deterministic, no learned table | Less dominant in modern LLMs
Relative position bias | Directly models pairwise distance | Usually tied to attention implementation
RoPE | Strong for causal language models | Long-context scaling needs care
ALiBi | Simple extrapolation behavior | Less expressive than some alternatives
2D position embeddings | Natural for images | Resolution changes may need interpolation

For a first PyTorch transformer, learned absolute embeddings are easiest. For a modern decoder-only language model, RoPE is a common default. For vision transformers, learned 2D or flattened patch embeddings are common. For long-context experiments, RoPE scaling or ALiBi are useful starting points.

Summary

Self-attention needs positional information because content-only attention does not encode order. Positional encoding adds order information to token representations or attention scores.

Absolute embeddings assign vectors to positions. Sinusoidal encodings use fixed waves. Relative position bias modifies pairwise attention scores. RoPE rotates query and key vectors so dot products depend on relative position. ALiBi adds distance-based linear bias.

The choice of positional method affects extrapolation, long-context behavior, implementation complexity, and model quality.