Attention Complexity

Attention gives a model direct access between positions in a sequence. This direct access is powerful, but it has a cost. Standard self-attention compares every position with every other position, so its memory and compute grow quadratically with sequence length.

For short and medium sequences, this cost is acceptable. For long documents, high-resolution images, long audio, video, code repositories, and retrieval contexts, attention cost becomes one of the main limits.

Where the Cost Comes From

Suppose a sequence has length T. In self-attention, each token produces a query, key, and value.

The expensive part is the score matrix:

S = QK^\top.

If

Q, K \in \mathbb{R}^{T\times d_k},

then

S \in \mathbb{R}^{T\times T}.

Every token is compared with every token. This gives T^2 pairwise scores.

For a batch with B examples and H heads, the score tensor has shape:

[B, H, T, T]

This tensor dominates memory for long sequences.
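
A minimal PyTorch sketch, with illustrative sizes, makes this shape concrete:

import torch

B, H, T, d_h = 2, 4, 128, 64  # illustrative sizes
Q = torch.randn(B, H, T, d_h)
K = torch.randn(B, H, T, d_h)

# Pairwise scores: every position compared with every position.
S = Q @ K.transpose(-2, -1)
print(S.shape)  # torch.Size([2, 4, 128, 128]) -> [B, H, T, T]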

Compute Complexity

For each head, computing

QK^\top

costs roughly

O(T^2 d_h),

where d_h is the per-head dimension.

Multiplying attention weights by values,

AV,

also costs roughly

O(T^2 d_h).

With H heads and model dimension D = H d_h, the total attention compute is approximately

O(T^2 D).

The quadratic term T^2 is the main issue.

Memory Complexity

The attention matrix for each head stores T^2 weights. With batch size B and H heads, the memory for attention weights is approximately

O(BHT^2).

This excludes activations, gradients, optimizer states, and feedforward layers.

During training, memory usage is larger because intermediate tensors must be kept for backpropagation. This is why long-context training is much harder than long-context inference.

Concrete Scale

Consider one layer with:

B = 1,\quad H = 32,\quad T = 8192.

The attention weight tensor has:

1 \times 32 \times 8192 \times 8192 = 2{,}147{,}483{,}648

entries.

In float16, each entry uses 2 bytes. The attention weights alone require about:

2{,}147{,}483{,}648 \times 2 \text{ bytes} \approx 4.29\text{ GB} \approx 4.0\text{ GiB}.

That is only one tensor in one layer. A full model has many layers and many other activations.

Cross-Attention Cost

Cross-attention has a similar structure, but the query length and source length may differ.

Let T_q be the query length and T_k be the key-value length.

The score matrix has shape:

T_q \times T_k.

So the cost is approximately

O(T_q T_k D).

Cross-attention can be cheaper than self-attention if one side is short. It can also be expensive when both sides are long, such as long-document question answering or video-text generation.
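
With illustrative sizes, a short query side against a long key-value side looks like this:

import torch

B, H, d_h = 1, 8, 64
T_q, T_k = 128, 4096  # e.g., a short question attending over a long document
q = torch.randn(B, H, T_q, d_h)
k = torch.randn(B, H, T_k, d_h)

scores = q @ k.transpose(-2, -1)
print(scores.shape)  # torch.Size([1, 8, 128, 4096]) -> [B, H, T_q, T_k]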

Attention Versus Feedforward Cost

Transformer blocks also contain feedforward networks. A typical feedforward network expands the hidden dimension from D to 4D, then projects back to D.

Its compute cost per sequence is roughly

O(TD^2).

Attention costs roughly:

O(T^2 D).

Which term dominates depends on T and D.

When T is small and D is large, feedforward layers can dominate. When T is large, attention dominates.

Setting the two estimates equal, T^2 D \approx T D^2, gives a rough crossover at:

T \approx D.

For many large language models, D is several thousand. Attention becomes especially costly when context lengths reach thousands or tens of thousands of tokens.
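
A rough sanity check of the crossover, using a hypothetical model dimension:

def attention_cost(T, D):
    # O(T^2 D) estimate, constants dropped.
    return T * T * D

def feedforward_cost(T, D):
    # O(T D^2) estimate, constants dropped.
    return T * D * D

D = 4096  # hypothetical model dimension
for T in (1024, 4096, 16384):
    print(T, attention_cost(T, D) / feedforward_cost(T, D))
# The ratio simplifies to T / D: 0.25, 1.0, 4.0.
# Attention overtakes the feedforward cost once T exceeds D.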

Autoregressive Inference and KV Cache

During autoregressive generation, tokens are produced one at a time. Without caching, the model would recompute keys and values for all previous tokens at every step.

A key-value cache avoids this.

At each layer, the model stores previous keys and values:

K_cache: [B, H, T, d_h]
V_cache: [B, H, T, d_h]

When a new token arrives, only its new query, key, and value are computed. The query attends to cached keys and values.

This reduces repeated computation, but it introduces memory cost. The cache grows linearly with sequence length, number of layers, number of heads, and head dimension.
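
A minimal sketch of one decoding step with a cache, assuming tensor shapes as above:

import torch

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    # q_new, k_new, v_new: [B, H, 1, d_h] for the single new token.
    k_cache = torch.cat([k_cache, k_new], dim=2)  # grows to [B, H, T+1, d_h]
    v_cache = torch.cat([v_cache, v_new], dim=2)
    # The new query attends over all cached positions.
    scores = q_new @ k_cache.transpose(-2, -1) / k_cache.shape[-1] ** 0.5
    out = scores.softmax(dim=-1) @ v_cache  # [B, H, 1, d_h]
    return out, k_cache, v_cache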

KV Cache Memory

For a decoder-only model with L layers, batch size B, H heads, sequence length T, and head dimension d_h, the KV cache stores both keys and values:

2 \times L \times B \times H \times T \times d_h

numbers.

Since H d_h = D, this is also:

2 \times L \times B \times T \times D.

In float16, multiply by 2 bytes per number.

This memory can become the main inference bottleneck for long-context models and large batch serving.
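
In the same style as the estimate at the end of this section, a small helper (the configuration below is hypothetical) can size the cache:

def kv_cache_memory_gib(num_layers, batch_size, seq_len, model_dim, dtype_bytes=2):
    # 2 x L x B x T x D numbers (keys and values), times bytes per number.
    numbers = 2 * num_layers * batch_size * seq_len * model_dim
    return numbers * dtype_bytes / (1024 ** 3)

print(kv_cache_memory_gib(num_layers=32, batch_size=1, seq_len=8192, model_dim=4096))
# 4.0 GiB for this hypothetical configuration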

Efficient Attention Kernels

One way to improve attention is to compute exact attention more efficiently.

Standard implementations materialize the full attention matrix in memory. Memory-efficient kernels avoid storing the full matrix. They compute attention in blocks and keep only the necessary intermediate values.

FlashAttention is the main example of this idea.

It preserves exact scaled dot-product attention but reduces memory traffic. This is important because attention is often limited by memory bandwidth rather than arithmetic alone.

The mathematical result is the same. The implementation is more efficient.
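
In PyTorch, torch.nn.functional.scaled_dot_product_attention exposes such kernels; it can dispatch to a FlashAttention-style backend when hardware and inputs allow, and the result matches naive attention up to numerical tolerance:

import torch
import torch.nn.functional as F

B, H, T, d_h = 1, 8, 1024, 64
q = torch.randn(B, H, T, d_h)
k = torch.randn(B, H, T, d_h)
v = torch.randn(B, H, T, d_h)

# Fused kernel: avoids materializing the full [B, H, T, T] weight matrix
# on backends that support it.
out = F.scaled_dot_product_attention(q, k, v)

# Naive reference: materializes the T x T weights explicitly.
weights = (q @ k.transpose(-2, -1) / d_h ** 0.5).softmax(dim=-1)
print(torch.allclose(out, weights @ v, atol=1e-5))  # True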

Sparse Attention

Sparse attention reduces the number of query-key pairs.

Instead of every token attending to every token, each token attends to a subset.

Common sparse patterns include:

Pattern          Idea
Local window     Attend only to nearby tokens
Dilated window   Attend at regular intervals
Global tokens    Special tokens attend broadly
Block sparse     Attend between selected blocks
Sliding window   Move a fixed attention window across the sequence

If each token attends to w tokens instead of T, the cost becomes roughly

O(TwD).

When w \ll T, this is much cheaper than O(T^2 D).

Sparse attention works well when the task has locality. It can be weaker when every token may need arbitrary global context.
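
A minimal sketch of a local window expressed as an attention mask follows. Note that the dense mask itself still costs T^2 memory; real sparse kernels exploit the pattern instead of materializing it.

import torch
import torch.nn.functional as F

T, w = 1024, 64
idx = torch.arange(T)
# Boolean mask: position i may attend to position j only if |i - j| <= w.
local_mask = (idx[:, None] - idx[None, :]).abs() <= w  # [T, T]

q = torch.randn(1, 8, T, 64)
k = torch.randn(1, 8, T, 64)
v = torch.randn(1, 8, T, 64)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=local_mask)
print(out.shape)  # torch.Size([1, 8, 1024, 64])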

Linear Attention

Linear attention tries to avoid forming the T \times T attention matrix.

Standard attention has:

\operatorname{softmax}(QK^\top)V.

Linear attention replaces the softmax kernel with a feature map \phi(\cdot), allowing a rearrangement such as:

\phi(Q)(\phi(K)^\top V).

This can reduce sequence complexity from quadratic to linear in T, under suitable assumptions.

The tradeoff is approximation quality. Linear attention often changes the behavior of the model and may underperform exact attention on some tasks.
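
A minimal non-causal sketch, using elu(x) + 1 as one commonly used feature map:

import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # Positive feature map; one common choice is elu(x) + 1.
    q, k = F.elu(q) + 1, F.elu(k) + 1
    # Associate the other way: phi(K)^T V is [B, H, d_h, d_h], never [T, T].
    kv = k.transpose(-2, -1) @ v
    # Normalizer: phi(Q) times the column sums of phi(K).
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # [B, H, T, 1]
    return (q @ kv) / (z + eps)

out = linear_attention(torch.randn(1, 8, 1024, 64),
                       torch.randn(1, 8, 1024, 64),
                       torch.randn(1, 8, 1024, 64))
print(out.shape)  # torch.Size([1, 8, 1024, 64])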

Low-Rank and Compressed Attention

Another approach is to compress keys and values.

Instead of attending over all T positions, the model attends over a smaller set of r learned or computed summaries, where r \ll T.

This gives cost closer to:

O(TrD).

Examples include low-rank approximations, pooling-based compression, memory tokens, and latent bottlenecks.

These methods trade exact token-level access for lower cost.
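
As one illustration, average pooling can stand in for the compression step (a toy sketch, not any specific published method):

import torch
import torch.nn.functional as F

B, H, T, d_h, r = 1, 8, 4096, 64, 256
q = torch.randn(B, H, T, d_h)
k = torch.randn(B, H, T, d_h)
v = torch.randn(B, H, T, d_h)

# Compress T key-value positions into r summaries by average pooling.
pool = T // r
k_small = k.reshape(B, H, r, pool, d_h).mean(dim=3)  # [B, H, r, d_h]
v_small = v.reshape(B, H, r, pool, d_h).mean(dim=3)

# Scores are now [B, H, T, r] instead of [B, H, T, T].
out = F.scaled_dot_product_attention(q, k_small, v_small)
print(out.shape)  # torch.Size([1, 8, 4096, 64])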

Recurrent and State-Space Alternatives

Some architectures avoid full attention by using recurrent state or state-space models.

The model processes tokens sequentially and maintains a hidden state. This can give linear-time sequence processing and constant or slowly growing memory.

The tradeoff is reduced direct access. Attention can connect any two positions in one layer. Recurrent and state-space models usually compress history into a state.

Hybrid models combine attention with recurrent or state-space components. Attention handles flexible retrieval. The recurrent component handles long streams efficiently.

Chunking and Sliding Context

For long documents, a practical method is to split input into chunks.

Each chunk is processed separately or with limited overlap. A retrieval system may select relevant chunks before the model reads them.

This changes the problem from “attend to everything” to “select what should enter the context.”

Chunking is common in retrieval-augmented generation. It moves part of the burden from attention to indexing and retrieval.
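
A minimal chunking helper; the chunk size and overlap below are illustrative:

def chunk_tokens(tokens, chunk_size=512, overlap=64):
    # Split a token sequence into overlapping chunks.
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

chunks = chunk_tokens(list(range(2000)))
print(len(chunks), len(chunks[0]))  # 5 512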

Tradeoffs

There is no free attention mechanism. Each efficient method makes a tradeoff.

Method                  Benefit                               Cost
FlashAttention          Exact result, lower memory traffic    Still quadratic in compute
Sparse attention        Lower compute and memory              Needs a good sparsity pattern
Local attention         Efficient for nearby context          Weak global access
Linear attention        Linear sequence cost                  Changes attention behavior
Low-rank attention      Reduces source length                 May lose fine detail
KV caching              Faster generation                     Large inference memory
Retrieval and chunking  Handles huge corpora                  Depends on retrieval quality

The right choice depends on sequence length, hardware, training budget, and task structure.

Python Example: Estimating Attention Weight Memory

A simple way to estimate attention weight memory is:

def attention_weight_memory(batch_size, num_heads, seq_len, dtype_bytes=2):
    # Entries in the [B, H, T, T] attention weight tensor.
    entries = batch_size * num_heads * seq_len * seq_len
    # Total bytes, converted to GiB (2 bytes per entry for float16).
    bytes_used = entries * dtype_bytes
    gib = bytes_used / (1024 ** 3)
    return entries, gib

entries, gib = attention_weight_memory(
    batch_size=1,
    num_heads=32,
    seq_len=8192,
    dtype_bytes=2,
)

print(entries)  # 2147483648
print(gib)      # 4.0

This estimates only the attention weight tensor. A real model needs more memory.

Summary

Standard self-attention has quadratic cost in sequence length because it compares every token with every other token. Its compute is roughly O(T^2 D), and its attention-weight memory is roughly O(BHT^2).

This cost is manageable for short sequences but becomes a major bottleneck for long-context models. Efficient attention methods reduce memory traffic, restrict attention patterns, compress memory, approximate attention, or combine attention with retrieval and recurrent mechanisms.

Attention complexity is therefore not just a mathematical detail. It shapes model architecture, hardware requirements, training cost, inference latency, and the maximum usable context length.