# Attention Complexity

Attention gives a model direct access between positions in a sequence. This direct access is powerful, but it has a cost. Standard self-attention compares every position with every other position, so its memory and compute grow quadratically with sequence length.

For short and medium sequences, this cost is acceptable. For long documents, high-resolution images, long audio, video, code repositories, and retrieval contexts, attention cost becomes one of the main limits.

### Where the Cost Comes From

Suppose a sequence has length $T$. In self-attention, each token produces a query, key, and value.

The expensive part is the score matrix:

$$
S = QK^\top.
$$

If $Q, K \in \mathbb{R}^{T \times d_k}$, then $S \in \mathbb{R}^{T \times T}$.

Every token is compared with every token. This gives $T^2$ pairwise scores.

For a batch with $B$ examples and $H$ heads, the score tensor has shape:

```python
[B, H, T, T]
```

This tensor dominates memory for long sequences.
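
The shape bookkeeping is easy to check with a small NumPy sketch (the sizes here are illustrative, far below realistic values):

```python
import numpy as np

B, H, T, d_h = 2, 4, 16, 8  # illustrative sizes

# Per-head queries and keys: [B, H, T, d_h]
Q = np.random.randn(B, H, T, d_h)
K = np.random.randn(B, H, T, d_h)

# Scores S = Q K^T over the last two axes: [B, H, T, T]
S = Q @ K.transpose(0, 1, 3, 2)

print(S.shape)  # (2, 4, 16, 16)
```

The last two axes are both $T$, which is exactly the quadratic term.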

### Compute Complexity

For each head, computing the scores $QK^\top$ costs roughly $O(T^2 d_h)$, where $d_h$ is the per-head dimension. Multiplying the attention weights by the values, $AV$, costs roughly the same, $O(T^2 d_h)$.

With $H$ heads and model dimension $D = H d_h$, the total attention compute is approximately

$$
O(T^2 D).
$$

The quadratic term $T^2$ is the main issue.
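
As a rough sanity check on these numbers, the two matmuls can be counted directly. The helper below is a sketch that counts two FLOPs per multiply-add and ignores the softmax, projections, and constants:

```python
def attention_matmul_flops(seq_len, model_dim):
    """Rough FLOPs for QK^T plus AV across all heads in one layer.

    Two FLOPs per multiply-add; the head split cancels since H * d_h = D.
    """
    qk_flops = 2 * seq_len ** 2 * model_dim  # score matrix
    av_flops = 2 * seq_len ** 2 * model_dim  # weighted sum of values
    return qk_flops + av_flops

# Doubling the sequence length quadruples the cost.
print(attention_matmul_flops(2048, 512) // attention_matmul_flops(1024, 512))  # 4
```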

### Memory Complexity

The attention matrix for each head stores $T^2$ weights. With batch size $B$ and $H$ heads, the memory for attention weights is approximately

$$
O(BHT^2).
$$

This excludes activations, gradients, optimizer states, and feedforward layers.

During training, memory usage is larger because intermediate tensors must be kept for backpropagation. This is why long-context training is much harder than long-context inference.

### Concrete Scale

Consider one layer with:

$$
B = 1,\quad H = 32,\quad T = 8192.
$$

The attention weight tensor has:

$$
1 \times 32 \times 8192 \times 8192 =
2{,}147{,}483{,}648
$$

entries.

In float16, each entry uses 2 bytes, so the attention weights alone require about:

$$
2{,}147{,}483{,}648 \times 2 \text{ bytes}
\approx 4.29\text{ GB} \;(4.0\text{ GiB}).
$$

That is only one tensor in one layer. A full model has many layers and many other activations.

### Cross-Attention Cost

Cross-attention has a similar structure, but the query length and source length may differ.

Let $T_q$ be the query length and $T_k$ be the key-value length.

The score matrix has shape:

$$
T_q \times T_k.
$$

So the cost is approximately

$$
O(T_qT_kD).
$$

Cross-attention can be cheaper than self-attention if one side is short. It can also be expensive when both sides are long, as in long-document question answering or video-text generation.
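
A quick shape check with NumPy (lengths chosen only for illustration) shows the rectangular score matrix:

```python
import numpy as np

T_q, T_k, d = 4, 100, 8  # a few queries over a longer source

Q = np.random.randn(T_q, d)
K = np.random.randn(T_k, d)

S = Q @ K.T
print(S.shape)  # (4, 100): T_q x T_k, not square
```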

### Attention Versus Feedforward Cost

Transformer blocks also contain feedforward networks. A typical feedforward network expands the hidden dimension from $D$ to $4D$, then projects back to $D$.

Its compute cost per sequence is roughly

$$
O(TD^2).
$$

Attention costs roughly:

$$
O(T^2D).
$$

Which term dominates depends on $T$ and $D$.

When $T$ is small and $D$ is large, feedforward layers can dominate. When $T$ is large, attention dominates.

A rough crossover happens when:

$$
T \approx D.
$$

For many large language models, $D$ is several thousand. Attention becomes especially costly when context lengths reach thousands or tens of thousands of tokens.
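
The crossover can be made concrete with the rough per-layer terms above (constants dropped; the dimensions are illustrative):

```python
def dominant_term(seq_len, model_dim):
    """Compare the rough attention (T^2 D) and feedforward (T D^2) terms."""
    attn = seq_len ** 2 * model_dim
    ffn = seq_len * model_dim ** 2
    return "attention" if attn > ffn else "feedforward"

print(dominant_term(seq_len=1024, model_dim=4096))   # feedforward (T < D)
print(dominant_term(seq_len=32768, model_dim=4096))  # attention (T > D)
```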

### Autoregressive Inference and KV Cache

During autoregressive generation, tokens are produced one at a time. Without caching, the model would recompute keys and values for all previous tokens at every step.

A key-value cache avoids this.

At each layer, the model stores previous keys and values:

```text
K_cache: [B, H, T, d_h]
V_cache: [B, H, T, d_h]
```

When a new token arrives, only its new query, key, and value are computed. The query attends to cached keys and values.

This reduces repeated computation, but it introduces memory cost. The cache grows linearly with sequence length, number of layers, number of heads, and head dimension.
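
The cache bookkeeping can be sketched in NumPy (no real model, just the tensor shapes):

```python
import numpy as np

B, H, d_h = 1, 2, 4  # illustrative sizes

# Caches start empty along the sequence axis.
k_cache = np.zeros((B, H, 0, d_h))
v_cache = np.zeros((B, H, 0, d_h))

def append_kv(k_cache, v_cache, k_new, v_new):
    """Append one generation step's keys/values ([B, H, 1, d_h]) to the cache."""
    k_cache = np.concatenate([k_cache, k_new], axis=2)
    v_cache = np.concatenate([v_cache, v_new], axis=2)
    return k_cache, v_cache

for step in range(3):  # three decoding steps
    k_new = np.random.randn(B, H, 1, d_h)
    v_new = np.random.randn(B, H, 1, d_h)
    k_cache, v_cache = append_kv(k_cache, v_cache, k_new, v_new)

print(k_cache.shape)  # (1, 2, 3, 4): grows linearly with generated tokens
```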

### KV Cache Memory

For a decoder-only model with $L$ layers, batch size $B$, heads $H$, sequence length $T$, and head dimension $d_h$, the KV cache stores both keys and values:

$$
2 \times L \times B \times H \times T \times d_h
$$

numbers.

Since $H d_h = D$, this is also:

$$
2 \times L \times B \times T \times D.
$$

In float16, multiply by 2 bytes per number.

This memory can become the main inference bottleneck for long-context models and large batch serving.
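
The formula translates directly into a helper; the layer count and model dimension below are assumed for illustration:

```python
def kv_cache_bytes(num_layers, batch_size, seq_len, model_dim, dtype_bytes=2):
    """Bytes for keys plus values across all layers, using H * d_h = model_dim."""
    numbers = 2 * num_layers * batch_size * seq_len * model_dim
    return numbers * dtype_bytes

bytes_used = kv_cache_bytes(num_layers=32, batch_size=1, seq_len=8192, model_dim=4096)
print(bytes_used / (1024 ** 3))  # 4.0 GiB for this configuration
```

Note that this grows linearly in both batch size and sequence length, which is why large-batch long-context serving is memory-bound.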

### Efficient Attention Kernels

One way to improve attention is to compute exact attention more efficiently.

Standard implementations materialize the full attention matrix in memory. Memory-efficient kernels avoid storing the full matrix. They compute attention in blocks and keep only the necessary intermediate values.

FlashAttention is the main example of this idea.

It preserves exact scaled dot-product attention but reduces memory traffic. This is important because attention is often limited by memory bandwidth rather than arithmetic alone.

The mathematical result is the same. The implementation is more efficient.
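
The core idea behind such kernels can be illustrated in NumPy with an online softmax over key blocks: the result is exact, but only small score tiles are ever held at once. This is a didactic sketch, not the fused GPU kernel:

```python
import numpy as np

def full_attention(q, k, v):
    """Reference: materialize the full T x T score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    a = np.exp(s - s.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    return a @ v

def blockwise_attention(q, k, v, block=4):
    """Exact attention via an online softmax over key blocks.

    Only [T, block] score tiles are materialized, never the full [T, T] matrix.
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    T = q.shape[0]
    m = np.full(T, -np.inf)          # running row-wise max
    denom = np.zeros(T)              # running softmax denominator
    out = np.zeros((T, v.shape[1]))  # running un-normalized output
    for start in range(0, k.shape[0], block):
        s = q @ k[start:start + block].T * scale
        m_new = np.maximum(m, s.max(axis=-1))
        correction = np.exp(m - m_new)  # rescale old accumulators
        p = np.exp(s - m_new[:, None])
        denom = denom * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ v[start:start + block]
        m = m_new
    return out / denom[:, None]

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 16, 8))
print(np.allclose(blockwise_attention(q, k, v), full_attention(q, k, v)))  # True
```

The running max and denominator are what make the blockwise result numerically identical to the one-shot softmax.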

### Sparse Attention

Sparse attention reduces the number of query-key pairs.

Instead of every token attending to every token, each token attends to a subset.

Common sparse patterns include:

| Pattern | Idea |
|---|---|
| Local window | Attend only to nearby tokens |
| Dilated window | Attend at regular intervals |
| Global tokens | Special tokens attend broadly |
| Block sparse | Attend between selected blocks |
| Sliding window | Move a fixed attention window across the sequence |

If each token attends to $w$ tokens instead of $T$, the cost becomes roughly

$$
O(TwD).
$$

When $w \ll T$, this is much cheaper than $O(T^2D)$.

Sparse attention works well when the task has locality. It can be weaker when every token may need arbitrary global context.
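
A local window is easy to express as a boolean mask; counting the allowed pairs shows the $O(Tw)$ scaling (the window size is illustrative):

```python
import numpy as np

def local_window_mask(seq_len, window):
    """Boolean mask where position i may attend to j iff |i - j| <= window."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

mask = local_window_mask(seq_len=8, window=2)
# Each row allows at most 2 * window + 1 positions, so roughly T * w pairs
# survive instead of T^2.
print(mask.sum(), "of", mask.size, "pairs allowed")
```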

### Linear Attention

Linear attention tries to avoid forming the $T\times T$ attention matrix.

Standard attention has:

$$
\operatorname{softmax}(QK^\top)V.
$$

Linear attention replaces the softmax kernel with a feature map $\phi(\cdot)$, allowing a rearrangement such as:

$$
\phi(Q)(\phi(K)^\top V).
$$

This can reduce sequence complexity from quadratic to linear in $T$, under suitable assumptions.

The tradeoff is approximation quality. Linear attention often changes the behavior of the model and may underperform exact attention on some tasks.
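
A sketch with the feature map $\phi(x) = \mathrm{elu}(x) + 1$, one common choice among several, shows the rearranged computation:

```python
import numpy as np

def linear_attention(q, k, v):
    """Linear attention with feature map phi(x) = elu(x) + 1 (an assumed choice).

    phi(K)^T V is computed first, so no T x T matrix is ever formed.
    """
    def phi(x):
        return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, always positive

    qf, kf = phi(q), phi(k)
    kv = kf.T @ v            # [d_k, d_v]: cost is linear in T
    z = qf @ kf.sum(axis=0)  # per-query normalizer
    return (qf @ kv) / z[:, None]

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 16, 8))
out = linear_attention(q, k, v)
print(out.shape)  # (16, 8)
```

Because the weights are positive and normalized, each output row is a weighted average of value rows, just as in softmax attention, but the weighting kernel is different.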

### Low-Rank and Compressed Attention

Another approach is to compress keys and values.

Instead of attending over all $T$ positions, the model attends over a smaller set of $r$ learned or computed summaries, where $r \ll T$.

This gives cost closer to:

$$
O(TrD).
$$

Examples include low-rank approximations, pooling-based compression, memory tokens, and latent bottlenecks.

These methods trade exact token-level access for lower cost.
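
A sketch of pooling-based compression (mean pooling is an assumed, simple choice; learned or latent summaries are common in practice) shows the reduced source length:

```python
import numpy as np

def pooled_kv(k, v, num_summaries):
    """Compress T key/value rows into r summaries by mean pooling segments."""
    bounds = np.linspace(0, k.shape[0], num_summaries + 1).astype(int)
    segments = list(zip(bounds[:-1], bounds[1:]))
    k_r = np.stack([k[a:b].mean(axis=0) for a, b in segments])
    v_r = np.stack([v[a:b].mean(axis=0) for a, b in segments])
    return k_r, v_r

k = np.random.randn(64, 8)
v = np.random.randn(64, 8)
k_r, v_r = pooled_kv(k, v, num_summaries=8)
print(k_r.shape, v_r.shape)  # (8, 8) (8, 8): queries now attend over r = 8 rows
```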

### Recurrent and State-Space Alternatives

Some architectures avoid full attention by using recurrent state or state-space models.

The model processes tokens sequentially and maintains a hidden state. This can give linear-time sequence processing and constant or slowly growing memory.

The tradeoff is reduced direct access. Attention can connect any two positions in one layer. Recurrent and state-space models usually compress history into a state.

Hybrid models combine attention with recurrent or state-space components. Attention handles flexible retrieval. The recurrent component handles long streams efficiently.

### Chunking and Sliding Context

For long documents, a practical method is to split input into chunks.

Each chunk is processed separately or with limited overlap. A retrieval system may select relevant chunks before the model reads them.

This changes the problem from “attend to everything” to “select what should enter the context.”

Chunking is common in retrieval-augmented generation. It moves part of the burden from attention to indexing and retrieval.

### Tradeoffs

There is no free lunch here: each efficient attention method makes a tradeoff.

| Method | Benefit | Cost |
|---|---|---|
| FlashAttention | Exact result, lower memory traffic | Still quadratic in compute |
| Sparse attention | Lower compute and memory | Needs a good sparsity pattern |
| Local attention | Efficient for nearby context | Weak global access |
| Linear attention | Linear sequence cost | Changes attention behavior |
| Low-rank attention | Reduces source length | May lose fine detail |
| KV caching | Faster generation | Large inference memory |
| Retrieval and chunking | Handles huge corpora | Depends on retrieval quality |

The right choice depends on sequence length, hardware, training budget, and task structure.

### PyTorch Example: Measuring Attention Tensor Size

A simple way to estimate attention weight memory is:

```python
def attention_weight_memory(batch_size, num_heads, seq_len, dtype_bytes=2):
    entries = batch_size * num_heads * seq_len * seq_len
    bytes_used = entries * dtype_bytes
    gib = bytes_used / (1024 ** 3)
    return entries, gib

entries, gib = attention_weight_memory(
    batch_size=1,
    num_heads=32,
    seq_len=8192,
    dtype_bytes=2,
)

print(entries)  # 2147483648
print(gib)      # 4.0
```

This estimates only the attention weight tensor. A real model needs more memory.

### Summary

Standard self-attention has quadratic cost in sequence length because it compares every token with every other token. Its compute is roughly $O(T^2D)$, and its attention-weight memory is roughly $O(BHT^2)$.

This cost is manageable for short sequences but becomes a major bottleneck for long-context models. Efficient attention methods reduce memory traffic, restrict attention patterns, compress memory, approximate attention, or combine attention with retrieval and recurrent mechanisms.

Attention complexity is therefore not just a mathematical detail. It shapes model architecture, hardware requirements, training cost, inference latency, and the maximum usable context length.

