# Efficient Transformers

Standard transformer attention scales quadratically with sequence length. For a sequence of length $T$, self-attention constructs a score matrix of size

$$
T \times T.
$$

This gives attention complexity

$$
O(T^2D),
$$

where $D$ is the model dimension.

Quadratic scaling becomes expensive for long documents, code repositories, videos, audio streams, retrieval contexts, and agent memory traces. Efficient transformer methods reduce attention cost, memory usage, or latency while preserving as much modeling quality as possible.

Efficient transformers are therefore both an algorithmic and a systems topic.

### The Quadratic Attention Bottleneck

Standard self-attention computes

$$
S = \frac{QK^\top}{\sqrt{d_k}},
$$

where

$$
Q,K \in \mathbb{R}^{T \times d_k}.
$$

The matrix multiplication

$$
QK^\top
$$

produces

$$
S\in\mathbb{R}^{T\times T}.
$$

The memory required for $S$ grows quadratically with sequence length.

For example:

| Sequence length | Attention matrix size |
|---|---:|
| 1,024 | $\approx 10^6$ entries |
| 8,192 | $\approx 6.7\times10^7$ entries |
| 32,768 | $\approx 1.1\times10^9$ entries |

Even when activations are stored in low precision, long-context attention quickly becomes expensive.
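
The entry counts above can be reproduced with a few lines of Python; the 2-bytes-per-entry cost below is an illustrative fp16 assumption, per head and per batch element:

```python
# Size of the T x T attention score matrix for a single head and batch element.
for T in (1_024, 8_192, 32_768):
    entries = T * T
    # Assumes 2 bytes per entry (fp16/bf16); purely illustrative.
    print(f"T={T:>6}: {entries:.1e} entries, ~{entries * 2 / 2**20:,.0f} MiB in fp16")
```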

Efficient transformers aim to reduce one or more of:

| Resource | Problem |
|---|---|
| Compute | Too many attention operations |
| Memory | Large attention matrices |
| Latency | Slow autoregressive decoding |
| Bandwidth | Excessive movement of KV tensors |

### Categories of Efficient Attention

Efficient transformer methods can be grouped into several broad categories.

| Category | Main idea |
|---|---|
| Sparse attention | Attend only to selected positions |
| Local attention | Restrict attention to nearby tokens |
| Linear attention | Avoid explicit $T\times T$ matrices |
| Low-rank attention | Approximate attention structure |
| Memory compression | Reduce stored sequence representations |
| Retrieval augmentation | Move information outside attention |
| State-space models | Replace attention with recurrent dynamics |
| Kernel fusion | Improve implementation efficiency |

Different methods optimize different bottlenecks. Some reduce theoretical complexity. Others mainly improve hardware efficiency.

### Local Attention

Local attention restricts each token to a fixed-size neighborhood.

Suppose each token attends only to positions within a distance $w$ of its own position. Then token $i$ attends to positions

$$
[i-w, \ldots, i+w].
$$

Instead of attending to all $T$ tokens, each token attends to roughly $2w+1$ positions.

The attention complexity becomes

$$
O(TwD).
$$

If $w \ll T$, this is much cheaper than quadratic attention.

Local attention works well when most useful dependencies are nearby. This is common in language, audio, and images.

However, purely local attention struggles with long-range dependencies unless information propagates gradually across layers.
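
A minimal NumPy sketch of the corresponding banded mask; `window` plays the role of $w$, and positions outside the band would be set to $-\infty$ in the score matrix before the softmax:

```python
import numpy as np

def local_attention_mask(T: int, window: int) -> np.ndarray:
    """Boolean mask: entry (i, j) is True iff |i - j| <= window."""
    idx = np.arange(T)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# Each row shows which positions one token may attend to.
print(local_attention_mask(T=8, window=1).astype(int))
```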

### Sliding Window Attention

Sliding window attention is a practical form of local attention.

Each attention window overlaps with nearby windows:

```text id="qqn0y8"
Window 1: [1 2 3 4]
Window 2:   [3 4 5 6]
Window 3:     [5 6 7 8]
```

Overlapping windows allow information to propagate across long sequences over multiple layers.

If each layer expands the receptive field slightly, deep stacks can still model long-range structure.

Sliding-window attention is common in long-context language models because it preserves locality while remaining efficient.
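
A back-of-the-envelope sketch of that receptive-field growth, with illustrative values for the window and depth (ignoring how much attention weight actually travels that far):

```python
window = 512   # one-sided window per layer (illustrative)
layers = 32
print("approximate receptive field:", window * layers, "tokens")   # 16384
```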

### Global Tokens

Local attention can be augmented with global tokens.

A global token attends to all positions and is visible to all positions. This creates a communication pathway across distant regions.

Examples include:

| Global structure | Purpose |
|---|---|
| CLS token | Sequence summarization |
| Retrieval tokens | External memory access |
| Sink tokens | Stabilize long-context attention |
| Landmark tokens | Hierarchical compression |

A hybrid local-global pattern may look like:

```text id="9cegjg"
Local windows + selected global connections
```

This preserves efficient local computation while enabling long-range interaction.
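
One way to realize such a pattern is to open full rows and columns of a local mask for a few designated global positions; a minimal sketch (the choice of position 0 as the global token is an assumption):

```python
import numpy as np

def local_global_mask(T: int, window: int, global_positions) -> np.ndarray:
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= window   # local band
    for g in global_positions:
        mask[g, :] = True   # the global token attends everywhere
        mask[:, g] = True   # and every token can attend to it
    return mask

print(local_global_mask(T=8, window=1, global_positions=[0]).astype(int))
```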

### Sparse Attention

Sparse attention allows each token to attend only to selected positions rather than to the full sequence.

Instead of a dense $T\times T$ attention matrix, sparse attention uses a sparse graph.

Common sparse patterns include:

| Pattern | Description |
|---|---|
| Local | Nearby tokens only |
| Strided | Every $k$-th token |
| Block sparse | Dense blocks with sparse global structure |
| Random | Randomly selected links |
| Dilated | Exponentially spaced positions |

A sparse attention mask might look conceptually like:

```text id="m5mjlwm"
x . . x . . x
. x . . x . .
. . x . . x .
x . . x . . x
```

Here `x` indicates allowed attention positions.
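
The mask above is the strided pattern, in which each token attends to positions offset from itself by a multiple of $k$; a sketch that generates it:

```python
import numpy as np

def strided_mask(T: int, stride: int) -> np.ndarray:
    """True where token i may attend to token j, i.e. (i - j) is a multiple of `stride`."""
    idx = np.arange(T)
    return (idx[:, None] - idx[None, :]) % stride == 0

# The first four rows reproduce the conceptual mask above.
print(strided_mask(T=7, stride=3).astype(int))
```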

Sparse attention reduces complexity approximately to

$$
O(TsD),
$$

where $s$ is the average number of attended positions per token.

The challenge is preserving information flow and hardware efficiency. Sparse patterns may reduce theoretical complexity but perform poorly on GPUs if memory access becomes irregular.

### Block Sparse Attention

Block sparse attention groups tokens into blocks and applies sparse connectivity among blocks.

Suppose the sequence is divided into blocks of size $b$. Instead of attending token-by-token, attention is computed block-by-block.

This improves hardware utilization because matrix operations remain dense inside blocks.

A block sparse pattern may include:

| Connection type | Purpose |
|---|---|
| Local neighboring blocks | Nearby context |
| Global blocks | Shared information |
| Random blocks | Long-range connectivity |

Block sparse methods often achieve better accelerator efficiency than fine-grained sparse masks.
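
A sketch of block-level connectivity: the mask is defined between blocks and then expanded to tokens, so each allowed block pair stays a dense tile (the block size and the choice of one global block are illustrative):

```python
import numpy as np

def block_sparse_mask(n_blocks: int, local: int = 1, global_blocks=(0,)) -> np.ndarray:
    idx = np.arange(n_blocks)
    mask = np.abs(idx[:, None] - idx[None, :]) <= local    # neighboring blocks
    for g in global_blocks:
        mask[g, :] = True                                   # global block
        mask[:, g] = True
    return mask

block_mask = block_sparse_mask(n_blocks=6)
b = 4                                                       # tokens per block
token_mask = np.kron(block_mask.astype(int), np.ones((b, b), dtype=int)).astype(bool)
print(block_mask.astype(int))
print(token_mask.shape)   # (24, 24)
```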

### Linear Attention

Linear attention avoids explicitly constructing the $T\times T$ attention matrix.

Standard attention computes

$$
\text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V.
$$

Linear attention methods rewrite or approximate this expression so computation scales linearly with sequence length.

A common idea is to approximate attention using feature maps:

$$
\phi(Q)\phi(K)^\top.
$$

Then attention becomes

$$
\text{Attention}(Q,K,V) =
\frac{
\phi(Q)
\left(
\phi(K)^\top V
\right)
}{
\phi(Q)
\left(
\phi(K)^\top \mathbf{1}
\right)
}.
$$

The key observation is that

$$
\phi(K)^\top V
$$

can be accumulated incrementally without forming a full attention matrix.

This changes complexity from

$$
O(T^2D)
$$

to approximately

$$
O(TD^2).
$$

Linear attention is attractive for very long sequences, but approximating softmax attention accurately is difficult. Many linear methods trade modeling quality for efficiency.
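
A minimal NumPy sketch of linear attention with the feature map $\phi(x) = \mathrm{elu}(x) + 1$, which is one commonly used choice; causal masking and numerical safeguards are omitted:

```python
import numpy as np

def elu_plus_one(x):
    return np.where(x > 0, x + 1.0, np.exp(x))              # elu(x) + 1, always positive

def linear_attention(Q, K, V):
    """Q, K: (T, d_k), V: (T, d_v); cost is O(T * d_k * d_v), no T x T matrix."""
    Qf, Kf = elu_plus_one(Q), elu_plus_one(K)
    kv = Kf.T @ V                                           # (d_k, d_v) summary of keys/values
    normalizer = Qf @ Kf.sum(axis=0)                        # (T,) softmax-style denominator
    return (Qf @ kv) / normalizer[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 16)) for _ in range(3))
print(linear_attention(Q, K, V).shape)                      # (128, 16)
```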

### Kernelized Attention

Linear attention often uses kernel methods.

Suppose softmax attention is approximated by a positive kernel:

$$
\exp(q^\top k)
\approx
\phi(q)^\top \phi(k).
$$

The feature map $\phi$ maps queries and keys into a space where the attention computation reduces to ordinary matrix products, which can be reordered by associativity.

This allows reordering computation:

$$
(QK^\top)V =
Q(K^\top V).
$$

The right-hand side avoids constructing a $T\times T$ matrix explicitly.

Kernelized attention methods differ mainly in their feature map design and numerical stabilization.
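
The reordering is exact whenever the interaction is a plain matrix product (i.e., without the softmax), which a quick numerical check confirms:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d_k, d_v = 64, 8, 8
Q, K = rng.standard_normal((T, d_k)), rng.standard_normal((T, d_k))
V = rng.standard_normal((T, d_v))

left = (Q @ K.T) @ V     # materializes a T x T intermediate
right = Q @ (K.T @ V)    # only a d_k x d_v intermediate
print(np.allclose(left, right))   # True, up to floating-point error
```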

### Low-Rank Attention

Attention matrices often have redundant structure. Low-rank methods approximate the attention matrix using fewer dimensions.

A low-rank approximation may represent

$$
S \approx UV^\top,
$$

where

$$
U,V\in\mathbb{R}^{T\times r},
$$

and $r \ll T$.

Low-rank methods reduce memory and compute cost, especially when the attention matrix is approximately compressible.

However, some tasks require highly detailed token-token interactions, making aggressive compression harmful.
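
A sketch of the factorization using a truncated SVD; the synthetic score matrix below is constructed to have low rank, so the rank-$r$ approximation is essentially exact (real attention matrices are only approximately compressible):

```python
import numpy as np

rng = np.random.default_rng(0)
T, r = 256, 16
# A synthetic score matrix of rank (at most) 16.
S = rng.standard_normal((T, 16)) @ rng.standard_normal((16, T))

U_full, sing, Vt = np.linalg.svd(S, full_matrices=False)
U = U_full[:, :r] * sing[:r]          # (T, r)
V = Vt[:r, :].T                       # (T, r)
rel_err = np.linalg.norm(S - U @ V.T) / np.linalg.norm(S)
print(rel_err)                        # ~1e-15 here; larger when S has a slowly decaying spectrum
```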

### Attention Pooling and Compression

Another approach compresses the sequence itself.

Suppose the original sequence length is $T$. A compression layer produces a shorter representation of length $T'$:

$$
T' < T.
$$

Attention is then computed over the compressed representation.

Compression methods include:

| Method | Description |
|---|---|
| Pooling | Average or max pooling |
| Learned projection | Linear compression |
| Clustering | Group similar tokens |
| Landmark selection | Keep important positions |
| Downsampling | Reduce temporal resolution |

This is common in hierarchical transformers for documents, videos, and multimodal systems.
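
A sketch of the simplest row of the table above: non-overlapping average pooling that shortens the sequence by a fixed factor, so attention is computed over the shorter sequence:

```python
import numpy as np

def average_pool_sequence(x: np.ndarray, factor: int) -> np.ndarray:
    """x: (T, D) with T divisible by `factor`; returns (T // factor, D)."""
    T, D = x.shape
    return x.reshape(T // factor, factor, D).mean(axis=1)

x = np.random.default_rng(0).standard_normal((1024, 64))
print(average_pool_sequence(x, factor=4).shape)   # (256, 64): attention now costs (T/4)^2, not T^2
```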

### Memory-Augmented Transformers

Memory-augmented transformers extend context using external memory.

Instead of keeping the entire history inside the attention window, the model maintains memory states:

| Memory type | Description |
|---|---|
| Cached hidden states | Reuse previous activations |
| Retrieval memory | External database or vector store |
| Learned memory tokens | Persistent trainable memory |
| Compressed summaries | Historical context compression |

Transformer-XL introduced segment recurrence. Hidden states from previous segments become memory for future segments.

Retrieval-augmented systems use external retrieval instead of expanding the context window indefinitely.
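
A highly simplified sketch of segment recurrence in this spirit: cached states from the previous segment extend the keys and values of the current one, while queries come only from the new segment. The `attend` helper is a plain softmax-attention stand-in, not Transformer-XL's actual layer:

```python
import numpy as np

def attend(q, k, v):
    """Plain softmax attention, standing in for one attention sublayer."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
seg_len, d = 4, 8
memory = np.zeros((0, d))                          # no memory before the first segment

for _ in range(3):                                 # process segments one at a time
    segment = rng.standard_normal((seg_len, d))
    context = np.concatenate([memory, segment])    # previous states extend K and V
    out = attend(segment, context, context)
    memory = segment                               # cache (detached) for the next segment
print(out.shape)                                   # (4, 8)
```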

### FlashAttention

Some efficient transformer methods do not change the mathematical attention formula. Instead, they improve implementation efficiency.

FlashAttention computes exact attention while reducing data movement between GPU high-bandwidth memory (HBM) and on-chip SRAM.

The key idea is tiled attention computation:

1. Load small query/key/value blocks.
2. Compute partial attention.
3. Fuse softmax and matrix multiplication.
4. Avoid materializing the full attention matrix.

FlashAttention reduces memory overhead and improves throughput while preserving exact softmax attention.

This is important because modern accelerators are often bandwidth-bound rather than arithmetic-bound.

In practice, FlashAttention is widely used even in standard transformers because it improves efficiency without changing model behavior.
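
A NumPy sketch of the tiling idea using the online-softmax recurrence: key/value blocks are processed one at a time and the running softmax statistics are rescaled, so the full $T \times T$ matrix is never stored. This illustrates the algorithmic idea only, not the fused GPU kernel:

```python
import numpy as np

def blockwise_attention(Q, K, V, block: int = 64):
    """Exact softmax attention computed one key/value block at a time."""
    T, d = Q.shape
    out = np.zeros_like(V)
    row_max = np.full(T, -np.inf)      # running max of scores per query
    row_sum = np.zeros(T)              # running softmax denominator per query

    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T / np.sqrt(d)                   # (T, block) tile only
        new_max = np.maximum(row_max, scores.max(axis=1))
        scale = np.exp(row_max - new_max)                # rescale previous accumulators
        p = np.exp(scores - new_max[:, None])
        row_sum = row_sum * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vb
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
scores = Q @ K.T / np.sqrt(32)
w = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (w / w.sum(axis=1, keepdims=True)) @ V
print(np.allclose(blockwise_attention(Q, K, V), reference))   # True: the result is exact
```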

### Grouped and Multi-Query Attention

During autoregressive decoding, the key-value (KV) cache becomes a major memory and bandwidth cost.

Standard multi-head attention stores separate keys and values for every head:

$$
K,V\in\mathbb{R}^{B\times h\times T\times d_h}.
$$

Grouped-query attention and multi-query attention reduce KV storage.

| Method | KV sharing |
|---|---|
| Multi-head attention | Separate KV per head |
| Multi-query attention | One shared KV for all heads |
| Grouped-query attention | One KV per head group |

This reduces memory bandwidth and inference latency.

Grouped-query attention is widely used in large decoder-only language models because it improves serving efficiency while preserving quality better than fully shared KV.
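
A back-of-the-envelope comparison of cached key/value entries under the three schemes; the head counts, sequence length, and head dimension below are illustrative assumptions:

```python
# Cached KV entries for one sequence (2x accounts for keys and values).
batch, T, d_h = 1, 8_192, 128
query_heads, kv_groups = 32, 8            # 4 query heads share each KV head in GQA

mha = 2 * batch * query_heads * T * d_h   # one K/V pair per query head
gqa = 2 * batch * kv_groups * T * d_h     # one K/V pair per group
mqa = 2 * batch * 1 * T * d_h             # a single shared K/V pair

print(mha // gqa, mha // mqa)             # 4x and 32x fewer cached values
```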

### Chunked and Streaming Attention

Streaming systems process sequences incrementally rather than as complete fixed contexts.

For audio or live interaction, tokens arrive continuously:

```text id="vqud9l"
x1 → x2 → x3 → ...
```

Streaming attention uses chunked processing:

| Method | Description |
|---|---|
| Fixed chunks | Process windows independently |
| Overlapping chunks | Preserve continuity |
| Rolling memory | Carry limited state forward |
| Streaming causal attention | Online autoregressive processing |

Streaming models must balance latency and context size.
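
A sketch of the "rolling memory" row of the table above: a key/value cache that keeps only the most recent `max_len` entries as tokens stream in (the class and its sizes are illustrative, not a specific system's API):

```python
import numpy as np

class RollingKVCache:
    """Keep only the most recent `max_len` key/value vectors."""
    def __init__(self, max_len: int, d: int):
        self.max_len = max_len
        self.k = np.zeros((0, d))
        self.v = np.zeros((0, d))

    def append(self, k_t: np.ndarray, v_t: np.ndarray):
        self.k = np.concatenate([self.k, k_t[None]])[-self.max_len:]
        self.v = np.concatenate([self.v, v_t[None]])[-self.max_len:]

cache = RollingKVCache(max_len=4, d=8)
for t in range(10):                           # tokens arriving one at a time
    cache.append(np.full(8, float(t)), np.full(8, float(t)))
print(cache.k.shape, cache.k[:, 0])           # (4, 8) [6. 7. 8. 9.]
```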

### State Space Alternatives

Some architectures replace attention entirely.

State space models process sequences using learned dynamical systems rather than token-token attention.

A simplified state-space recurrence is

$$
h_t = Ah_{t-1} + Bx_t,
$$

$$
y_t = Ch_t.
$$

Modern state-space models use structured parameterizations and parallel scan algorithms to scale efficiently.
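
A minimal sketch of the recurrence above in its sequential form; real state-space models use structured parameterizations of $A$ and a parallel scan rather than this Python loop, and the matrices here are arbitrary illustrative values:

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """h_t = A h_{t-1} + B x_t,  y_t = C h_t, applied step by step."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                        # one state update per time step
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
n, d, T = 16, 4, 100
A = 0.9 * np.eye(n)                      # a stable, trivially structured state matrix
B = 0.1 * rng.standard_normal((n, d))
C = rng.standard_normal((d, n))
print(ssm_scan(A, B, C, rng.standard_normal((T, d))).shape)   # (100, 4)
```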

Advantages include:

| Property | Benefit |
|---|---|
| Linear sequence scaling | Better long-context efficiency |
| Constant-size hidden state | Compact memory |
| Streaming-friendly | Natural online processing |

Attention remains stronger for some reasoning and retrieval tasks, but state-space methods are increasingly competitive for long sequences.

### Efficient Attention in Vision

Vision transformers often use structured efficient attention.

Common strategies include:

| Method | Purpose |
|---|---|
| Window attention | Restrict attention locally |
| Shifted windows | Cross-window interaction |
| Hierarchical pooling | Reduce resolution |
| Patch merging | Compress sequence length |
| Sparse global tokens | Long-range communication |

Images have strong spatial locality, so efficient local attention works particularly well.

### Choosing an Efficient Transformer Design

The best efficient transformer depends on the application.

| Application | Common strategy |
|---|---|
| Long documents | Sliding-window or sparse attention |
| Real-time generation | KV-efficient decoding |
| Mobile inference | Quantized compact attention |
| Video modeling | Hierarchical compression |
| Streaming speech | Chunked causal attention |
| Retrieval systems | Retrieval augmentation |
| Million-token contexts | Sparse/local/retrieval hybrids |

There is no universally best efficient transformer. Tradeoffs involve:

| Tradeoff | Example |
|---|---|
| Quality vs speed | Dense attention is usually strongest |
| Simplicity vs scalability | Sparse patterns complicate kernels |
| Memory vs compute | Checkpointing recomputes activations |
| Exactness vs approximation | Linear attention approximates softmax |

Practical systems often combine multiple techniques.

### Summary

Standard transformer attention scales quadratically with sequence length because it constructs a full attention matrix. Efficient transformers reduce compute, memory, latency, or bandwidth cost using sparse attention, local attention, linear attention, low-rank approximations, sequence compression, retrieval augmentation, optimized kernels, or alternative sequence models.

Some methods modify the attention algorithm itself. Others keep exact attention but optimize implementation. Modern large-scale systems often combine efficient attention, KV-efficient decoding, FlashAttention kernels, retrieval systems, and distributed memory management.

