Attention Complexity

Attention gives a model direct access between positions in a sequence. This direct access is powerful, but it has a cost. Standard self-attention compares every position with every other position, so its memory and compute grow quadratically with sequence length.

For short and medium sequences, this cost is acceptable. For long documents, high-resolution images, long audio, video, code repositories, and retrieval contexts, attention cost becomes one of the main limits.

Where the Cost Comes From

Suppose a sequence has length T. In self-attention, each token produces a query, key, and value.

The expensive part is the score matrix:

S = QK^\top.

If

Q, K \in \mathbb{R}^{T\times d_k},

then

S \in \mathbb{R}^{T\times T}.

Every token is compared with every token. This gives T^2 pairwise scores.

For a batch with B examples and H heads, the score tensor has shape:

[B, H, T, T]

This tensor dominates memory for long sequences.
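
A minimal PyTorch sketch, with illustrative sizes, makes this shape concrete:

import torch

B, H, T, d_h = 2, 4, 128, 64  # illustrative sizes
Q = torch.randn(B, H, T, d_h)
K = torch.randn(B, H, T, d_h)

# Pairwise scores: every position compared with every position.
S = Q @ K.transpose(-2, -1)
print(S.shape)  # torch.Size([2, 4, 128, 128]) -> [B, H, T, T]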

Compute Complexity

For each head, computing

QK^\top

costs roughly

O(T^2 d_h),

where d_h is the per-head dimension.

Multiplying attention weights by values,

AV,

also costs roughly

O(T^2 d_h).

With H heads and model dimension D = H d_h, the total attention compute is approximately

O(T^2 D).

The quadratic term T^2 is the main issue.

Memory Complexity

The attention matrix for each head stores T^2 weights. With batch size B and H heads, the memory for attention weights is approximately

O(BHT^2).

This excludes activations, gradients, optimizer states, and feedforward layers.

During training, memory usage is larger because intermediate tensors must be kept for backpropagation. This is why long-context training is much harder than long-context inference.

Concrete Scale

Consider one layer with:

B = 1,\quad H = 32,\quad T = 8192.

The attention weight tensor has:

1 \times 32 \times 8192 \times 8192 = 2{,}147{,}483{,}648

entries.

In float16, each entry uses 2 bytes. The attention weights alone require about:

2{,}147{,}483{,}648 \times 2 \text{ bytes} \approx 4.29\text{ GB} \approx 4.0\text{ GiB}.

That is only one tensor in one layer. A full model has many layers and many other activations.

Cross-Attention Cost

Cross-attention has a similar structure, but the query length and source length may differ.

Let T_q be the query length and T_k be the key-value length.

The score matrix has shape:

T_q \times T_k.

So the cost is approximately

O(T_q T_k D).

Cross-attention can be cheaper than self-attention if one side is short. It can also be expensive when both sides are long, such as long-document question answering or video-text generation.
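
With illustrative sizes, a short query side against a long key-value side looks like this:

import torch

B, H, d_h = 1, 8, 64
T_q, T_k = 128, 4096  # e.g., a short question attending over a long document
q = torch.randn(B, H, T_q, d_h)
k = torch.randn(B, H, T_k, d_h)

scores = q @ k.transpose(-2, -1)
print(scores.shape)  # torch.Size([1, 8, 128, 4096]) -> [B, H, T_q, T_k]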

Attention Versus Feedforward Cost

Transformer blocks also contain feedforward networks. A typical feedforward network expands the hidden dimension from D to 4D, then projects back to D.

Its compute cost per sequence is roughly

O(TD^2).

Attention costs roughly:

O(T^2 D).

Which term dominates depends on T and D.

When T is small and D is large, feedforward layers can dominate. When T is large, attention dominates.

Setting the two estimates equal, T^2 D \approx T D^2, gives a rough crossover at:

T \approx D.

For many large language models, D is several thousand. Attention becomes especially costly when context lengths reach thousands or tens of thousands of tokens.
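
A rough sanity check of the crossover, using a hypothetical model dimension:

def attention_cost(T, D):
    # O(T^2 D) estimate, constants dropped.
    return T * T * D

def feedforward_cost(T, D):
    # O(T D^2) estimate, constants dropped.
    return T * D * D

D = 4096  # hypothetical model dimension
for T in (1024, 4096, 16384):
    print(T, attention_cost(T, D) / feedforward_cost(T, D))
# The ratio simplifies to T / D: 0.25, 1.0, 4.0.
# Attention overtakes the feedforward cost once T exceeds D.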

Autoregressive Inference and KV Cache

During autoregressive generation, tokens are produced one at a time. Without caching, the model would recompute keys and values for all previous tokens at every step.

A key-value cache avoids this.

At each layer, the model stores previous keys and values:

K_cache: [B, H, T, d_h]
V_cache: [B, H, T, d_h]

When a new token arrives, only its new query, key, and value are computed. The query attends to cached keys and values.

This reduces repeated computation, but it introduces memory cost. The cache grows linearly with sequence length, number of layers, number of heads, and head dimension.
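
A minimal sketch of one decoding step with a cache, assuming tensor shapes as above:

import torch

def decode_step(q_new, k_new, v_new, k_cache, v_cache):
    # q_new, k_new, v_new: [B, H, 1, d_h] for the single new token.
    k_cache = torch.cat([k_cache, k_new], dim=2)  # grows to [B, H, T+1, d_h]
    v_cache = torch.cat([v_cache, v_new], dim=2)
    # The new query attends over all cached positions.
    scores = q_new @ k_cache.transpose(-2, -1) / k_cache.shape[-1] ** 0.5
    out = scores.softmax(dim=-1) @ v_cache  # [B, H, 1, d_h]
    return out, k_cache, v_cache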

KV Cache Memory

For a decoder-only model with L layers, batch size B, H heads, sequence length T, and head dimension d_h, the KV cache stores both keys and values:

2 \times L \times B \times H \times T \times d_h

numbers.

Since H d_h = D, this is also:

2 \times L \times B \times T \times D.

In float16, multiply by 2 bytes per number.

This memory can become the main inference bottleneck for long-context models and large batch serving.
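
In the same style as the estimate at the end of this section, a small helper (the configuration below is hypothetical) can size the cache:

def kv_cache_memory_gib(num_layers, batch_size, seq_len, model_dim, dtype_bytes=2):
    # 2 x L x B x T x D numbers (keys and values), times bytes per number.
    numbers = 2 * num_layers * batch_size * seq_len * model_dim
    return numbers * dtype_bytes / (1024 ** 3)

print(kv_cache_memory_gib(num_layers=32, batch_size=1, seq_len=8192, model_dim=4096))
# 4.0 GiB for this hypothetical configuration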

Efficient Attention Kernels

One way to improve attention is to compute exact attention more efficiently.

Standard implementations materialize the full attention matrix in memory. Memory-efficient kernels avoid storing the full matrix. They compute attention in blocks and keep only the necessary intermediate values.

FlashAttention is the main example of this idea.

It preserves exact scaled dot-product attention but reduces memory traffic. This is important because attention is often limited by memory bandwidth rather than arithmetic alone.

The mathematical result is the same. The implementation is more efficient.
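
In PyTorch, torch.nn.functional.scaled_dot_product_attention exposes such kernels; it can dispatch to a FlashAttention-style backend when hardware and inputs allow, and the result matches naive attention up to numerical tolerance:

import torch
import torch.nn.functional as F

B, H, T, d_h = 1, 8, 1024, 64
q = torch.randn(B, H, T, d_h)
k = torch.randn(B, H, T, d_h)
v = torch.randn(B, H, T, d_h)

# Fused kernel: avoids materializing the full [B, H, T, T] weight matrix
# on backends that support it.
out = F.scaled_dot_product_attention(q, k, v)

# Naive reference: materializes the T x T weights explicitly.
weights = (q @ k.transpose(-2, -1) / d_h ** 0.5).softmax(dim=-1)
print(torch.allclose(out, weights @ v, atol=1e-5))  # True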

Sparse Attention

Sparse attention reduces the number of query-key pairs.

Instead of every token attending to every token, each token attends to a subset.

Common sparse patterns include:

Pattern          Idea
Local window     Attend only to nearby tokens
Dilated window   Attend at regular intervals
Global tokens    Special tokens attend broadly
Block sparse     Attend between selected blocks
Sliding window   Move a fixed attention window across the sequence

If each token attends to w tokens instead of T, the cost becomes roughly

O(TwD).

When w \ll T, this is much cheaper than O(T^2 D).

Sparse attention works well when the task has locality. It can be weaker when every token may need arbitrary global context.
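
A minimal sketch of a local window expressed as an attention mask follows. Note that the dense mask itself still costs T^2 memory; real sparse kernels exploit the pattern instead of materializing it.

import torch
import torch.nn.functional as F

T, w = 1024, 64
idx = torch.arange(T)
# Boolean mask: position i may attend to position j only if |i - j| <= w.
local_mask = (idx[:, None] - idx[None, :]).abs() <= w  # [T, T]

q = torch.randn(1, 8, T, 64)
k = torch.randn(1, 8, T, 64)
v = torch.randn(1, 8, T, 64)
out = F.scaled_dot_product_attention(q, k, v, attn_mask=local_mask)
print(out.shape)  # torch.Size([1, 8, 1024, 64])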

Linear Attention

Linear attention tries to avoid forming the T \times T attention matrix.

Standard attention has:

\operatorname{softmax}(QK^\top)V.

Linear attention replaces the softmax kernel with a feature map \phi(\cdot), allowing a rearrangement such as:

\phi(Q)(\phi(K)^\top V).

This can reduce sequence complexity from quadratic to linear in T, under suitable assumptions.

The tradeoff is approximation quality. Linear attention often changes the behavior of the model and may underperform exact attention on some tasks.
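
A minimal non-causal sketch, using elu(x) + 1 as one commonly used feature map:

import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # Positive feature map; one common choice is elu(x) + 1.
    q, k = F.elu(q) + 1, F.elu(k) + 1
    # Associate the other way: phi(K)^T V is [B, H, d_h, d_h], never [T, T].
    kv = k.transpose(-2, -1) @ v
    # Normalizer: phi(Q) times the column sums of phi(K).
    z = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # [B, H, T, 1]
    return (q @ kv) / (z + eps)

out = linear_attention(torch.randn(1, 8, 1024, 64),
                       torch.randn(1, 8, 1024, 64),
                       torch.randn(1, 8, 1024, 64))
print(out.shape)  # torch.Size([1, 8, 1024, 64])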

Low-Rank and Compressed Attention

Another approach is to compress keys and values.

Instead of attending over all T positions, the model attends over a smaller set of r learned or computed summaries, where r \ll T.

This gives cost closer to:

O(TrD).

Examples include low-rank approximations, pooling-based compression, memory tokens, and latent bottlenecks.

These methods trade exact token-level access for lower cost.
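
As one illustration, average pooling can stand in for the compression step (a toy sketch, not any specific published method):

import torch
import torch.nn.functional as F

B, H, T, d_h, r = 1, 8, 4096, 64, 256
q = torch.randn(B, H, T, d_h)
k = torch.randn(B, H, T, d_h)
v = torch.randn(B, H, T, d_h)

# Compress T key-value positions into r summaries by average pooling.
pool = T // r
k_small = k.reshape(B, H, r, pool, d_h).mean(dim=3)  # [B, H, r, d_h]
v_small = v.reshape(B, H, r, pool, d_h).mean(dim=3)

# Scores are now [B, H, T, r] instead of [B, H, T, T].
out = F.scaled_dot_product_attention(q, k_small, v_small)
print(out.shape)  # torch.Size([1, 8, 4096, 64])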

Recurrent and State-Space Alternatives

Some architectures avoid full attention by using recurrent state or state-space models.

The model processes tokens sequentially and maintains a hidden state. This can give linear-time sequence processing and constant or slowly growing memory.

The tradeoff is reduced direct access. Attention can connect any two positions in one layer. Recurrent and state-space models usually compress history into a state.

Hybrid models combine attention with recurrent or state-space components. Attention handles flexible retrieval. The recurrent component handles long streams efficiently.

Chunking and Sliding Context

For long documents, a practical method is to split input into chunks.

Each chunk is processed separately or with limited overlap. A retrieval system may select relevant chunks before the model reads them.

This changes the problem from “attend to everything” to “select what should enter the context.”

Chunking is common in retrieval-augmented generation. It moves part of the burden from attention to indexing and retrieval.
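
A minimal chunking helper; the chunk size and overlap below are illustrative:

def chunk_tokens(tokens, chunk_size=512, overlap=64):
    # Split a token sequence into overlapping chunks.
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size]
            for i in range(0, max(len(tokens) - overlap, 1), step)]

chunks = chunk_tokens(list(range(2000)))
print(len(chunks), len(chunks[0]))  # 5 512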

Tradeoffs

There is no free attention mechanism. Each efficient method makes a tradeoff.

Method                  Benefit                               Cost
FlashAttention          Exact result, lower memory traffic    Still quadratic in compute
Sparse attention        Lower compute and memory              Needs a good sparsity pattern
Local attention         Efficient for nearby context          Weak global access
Linear attention        Linear sequence cost                  Changes attention behavior
Low-rank attention      Reduces source length                 May lose fine detail
KV caching              Faster generation                     Large inference memory
Retrieval and chunking  Handles huge corpora                  Depends on retrieval quality

The right choice depends on sequence length, hardware, training budget, and task structure.

Python Example: Estimating Attention Weight Memory

A simple way to estimate attention weight memory is:

def attention_weight_memory(batch_size, num_heads, seq_len, dtype_bytes=2):
    # Entries in the [B, H, T, T] attention weight tensor.
    entries = batch_size * num_heads * seq_len * seq_len
    # Total bytes, converted to GiB (2 bytes per entry for float16).
    bytes_used = entries * dtype_bytes
    gib = bytes_used / (1024 ** 3)
    return entries, gib

entries, gib = attention_weight_memory(
    batch_size=1,
    num_heads=32,
    seq_len=8192,
    dtype_bytes=2,
)

print(entries)  # 2147483648
print(gib)      # 4.0

This estimates only the attention weight tensor. A real model needs more memory.

Summary

Standard self-attention has quadratic cost in sequence length because it compares every token with every other token. Its compute is roughly O(T^2 D), and its attention-weight memory is roughly O(BHT^2).

This cost is manageable for short sequences but becomes a major bottleneck for long-context models. Efficient attention methods reduce memory traffic, restrict attention patterns, compress memory, approximate attention, or combine attention with retrieval and recurrent mechanisms.

Attention complexity is therefore not just a mathematical detail. It shapes model architecture, hardware requirements, training cost, inference latency, and the maximum usable context length.