Efficient Transformers

Standard transformer attention scales quadratically with sequence length. For a sequence of length $T$, self-attention constructs a score matrix of size

$$T \times T.$$

This gives attention complexity

$$O(T^2 D),$$

where $D$ is the model dimension.

Quadratic scaling becomes expensive for long documents, code repositories, videos, audio streams, retrieval contexts, and agent memory traces. Efficient transformer methods reduce attention cost, memory usage, or latency while preserving as much modeling quality as possible.

Efficient transformers are therefore both an algorithmic and systems topic.

The Quadratic Attention Bottleneck

Standard self-attention computes

$$S = \frac{QK^\top}{\sqrt{d_k}},$$

where

$$Q, K \in \mathbb{R}^{T \times d_k}.$$

The matrix multiplication $QK^\top$ produces

$$S \in \mathbb{R}^{T \times T}.$$

The memory required for $S$ grows quadratically with sequence length.

For example:

| Sequence length | Attention matrix size |
|---|---|
| 1,024 | $\approx 10^6$ entries |
| 8,192 | $\approx 6.7 \times 10^7$ entries |
| 32,768 | $\approx 1.1 \times 10^9$ entries |

Even when activations are stored in low precision, long-context attention quickly becomes expensive.

Efficient transformers aim to reduce one or more of:

| Resource | Problem |
|---|---|
| Compute | Too many attention operations |
| Memory | Large attention matrices |
| Latency | Slow autoregressive decoding |
| Bandwidth | Excessive movement of KV tensors |

Categories of Efficient Attention

Efficient transformer methods can be grouped into several broad categories.

| Category | Main idea |
|---|---|
| Sparse attention | Attend only to selected positions |
| Local attention | Restrict attention to nearby tokens |
| Linear attention | Avoid explicit $T \times T$ matrices |
| Low-rank attention | Approximate attention structure |
| Memory compression | Reduce stored sequence representations |
| Retrieval augmentation | Move information outside attention |
| State-space models | Replace attention with recurrent dynamics |
| Kernel fusion | Improve implementation efficiency |

Different methods optimize different bottlenecks. Some reduce theoretical complexity. Others mainly improve hardware efficiency.

Local Attention

Local attention restricts each token to a fixed-size neighborhood.

Suppose each token attends only within a window of size $w$. Then token $i$ attends to positions

$$[i-w, \ldots, i+w].$$

Instead of attending to all $T$ tokens, each token attends to roughly $2w+1$ positions.

The attention complexity becomes

$$O(TwD).$$

If $w \ll T$, this is much cheaper than quadratic attention.

Local attention works well when most useful dependencies are nearby. This is common in language, audio, and images.

However, purely local attention struggles with long-range dependencies unless information propagates gradually across layers.
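
To make the pattern concrete, here is a minimal NumPy sketch of a windowed (banded) attention mask; the function name and the window size are illustrative choices, not taken from any particular library.

```python
import numpy as np

def local_attention_mask(T: int, w: int) -> np.ndarray:
    """Boolean mask where entry (i, j) is True iff token i may attend to token j,
    i.e. |i - j| <= w. Each row allows at most 2*w + 1 positions."""
    idx = np.arange(T)
    return np.abs(idx[:, None] - idx[None, :]) <= w

mask = local_attention_mask(T=8, w=1)
print(mask.astype(int))
# Each query attends to ~2w+1 keys, so masked attention costs O(T * w * D)
# instead of O(T^2 * D).
```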

Sliding Window Attention

Sliding window attention is a practical form of local attention.

Each attention window overlaps with nearby windows:

Window 1: [1 2 3 4]
Window 2:   [3 4 5 6]
Window 3:     [5 6 7 8]

Overlapping windows allow information to propagate across long sequences over multiple layers.

If each layer expands the receptive field slightly, deep stacks can still model long-range structure.

Sliding-window attention is common in long-context language models because it preserves locality while remaining efficient.

Global Tokens

Local attention can be augmented with global tokens.

A global token attends to all positions and is visible to all positions. This creates a communication pathway across distant regions.

Examples include:

| Global structure | Purpose |
|---|---|
| CLS token | Sequence summarization |
| Retrieval tokens | External memory access |
| Sink tokens | Stabilize long-context attention |
| Landmark tokens | Hierarchical compression |

A hybrid local-global pattern may look like:

Local windows + selected global connections

This preserves efficient local computation while enabling long-range interaction.
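
Continuing the mask sketch from the local-attention section, a hybrid pattern can be built by marking a few global token indices on top of the local band. The indices below are hypothetical; the point is only the connectivity pattern.

```python
import numpy as np

def local_global_mask(T: int, w: int, global_idx) -> np.ndarray:
    """Local banded mask plus global tokens that attend to, and are attended
    by, every position (the illustrative indices in `global_idx`)."""
    idx = np.arange(T)
    mask = np.abs(idx[:, None] - idx[None, :]) <= w  # local windows
    mask[global_idx, :] = True                       # global tokens see everything
    mask[:, global_idx] = True                       # everything sees global tokens
    return mask

print(local_global_mask(T=8, w=1, global_idx=[0]).astype(int))
```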

Sparse Attention

Sparse attention allows each token to attend only to selected positions rather than to the full sequence.

Instead of a dense $T \times T$ attention matrix, sparse attention uses a sparse graph.

Common sparse patterns include:

| Pattern | Description |
|---|---|
| Local | Nearby tokens only |
| Strided | Every $k$-th token |
| Block sparse | Dense blocks with sparse global structure |
| Random | Randomly selected links |
| Dilated | Exponentially spaced positions |

A sparse attention mask might look conceptually like:

x . . x . . x
. x . . x . .
. . x . . x .
x . . x . . x

Here x indicates allowed attention positions.

Sparse attention reduces complexity approximately to

$$O(TsD),$$

where $s$ is the average number of attended positions per token.

The challenge is preserving information flow and hardware efficiency. Sparse patterns may reduce theoretical complexity but perform poorly on GPUs if memory access becomes irregular.

Block Sparse Attention

Block sparse attention groups tokens into blocks and applies sparse connectivity among blocks.

Suppose the sequence is divided into blocks of size $b$. Instead of attending token-by-token, attention is computed block-by-block.

This improves hardware utilization because matrix operations remain dense inside blocks.

A block sparse pattern may include:

| Connection type | Purpose |
|---|---|
| Local neighboring blocks | Nearby context |
| Global blocks | Shared information |
| Random blocks | Long-range connectivity |

Block sparse methods often achieve better accelerator efficiency than fine-grained sparse masks.

Linear Attention

Linear attention avoids explicitly constructing the $T \times T$ attention matrix.

Standard attention computes

$$\text{softmax}(QK^\top)V.$$

Linear attention methods rewrite or approximate this expression so computation scales linearly with sequence length.

A common idea is to approximate attention using feature maps:

$$\phi(Q)\phi(K)^\top.$$

Then attention becomes

$$\text{Attention}(Q,K,V) = \frac{\phi(Q)\left(\phi(K)^\top V\right)}{\phi(Q)\left(\phi(K)^\top \mathbf{1}\right)}.$$

The key observation is that $\phi(K)^\top V$ can be accumulated incrementally without forming a full attention matrix.

This changes complexity from $O(T^2 D)$ to approximately $O(TD^2)$.

Linear attention is attractive for very long sequences, but approximating softmax attention accurately is difficult. Many linear methods trade modeling quality for efficiency.

Kernelized Attention

Linear attention often uses kernel methods.

Suppose softmax attention is approximated by a positive kernel:

$$\exp(q^\top k) \approx \phi(q)^\top \phi(k).$$

The feature map $\phi$ transforms queries and keys into a space where the attention interaction becomes associative.

This allows reordering computation:

$$(QK^\top)V = Q(K^\top V).$$

The right-hand side avoids constructing a $T \times T$ matrix explicitly.

Kernelized attention methods differ mainly in their feature map design and numerical stabilization.
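
As an illustration, the NumPy sketch below implements the reordered computation with the simple positive feature map $\phi(x) = \mathrm{elu}(x) + 1$, one common choice in the linear-attention literature. It is an approximation scheme, not exact softmax attention, and the shapes and names are illustrative.

```python
import numpy as np

def phi(x):
    """A simple positive feature map (ELU + 1); other kernels, e.g. random
    features, make different tradeoffs."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Computes phi(Q)(phi(K)^T V) / (phi(Q) phi(K)^T 1) without ever forming
    the T x T matrix. Cost is O(T * d_k * d_v)."""
    Qf, Kf = phi(Q), phi(K)        # (T, d_k)
    KV = Kf.T @ V                  # (d_k, d_v), accumulated once over the sequence
    Z = Kf.sum(axis=0)             # (d_k,), the phi(K)^T 1 normalizer
    num = Qf @ KV                  # (T, d_v)
    den = Qf @ Z                   # (T,)
    return num / den[:, None]

T, dk, dv = 16, 8, 8
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(T, dk)), rng.normal(size=(T, dk)), rng.normal(size=(T, dv))
out = linear_attention(Q, K, V)    # (T, dv), computed in time linear in T
```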

Low-Rank Attention

Attention matrices often have redundant structure. Low-rank methods approximate the attention matrix using fewer dimensions.

A low-rank approximation may represent

$$S \approx UV^\top,$$

where

$$U, V \in \mathbb{R}^{T \times r},$$

and $r \ll T$.

Low-rank methods reduce memory and compute cost, especially when the attention matrix is approximately compressible.

However, some tasks require highly detailed token-token interactions, making aggressive compression harmful.
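
A minimal sketch of the projection idea, in the spirit of Linformer: keys and values are projected along the sequence axis from length $T$ to $r$ by a matrix $E$, which is random here purely for illustration (in practice it would be learned).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_attention(Q, K, V, E):
    """Project K and V along the sequence axis (T -> r) before attention,
    so the score matrix is T x r instead of T x T."""
    Kp = E @ K                                 # (r, d)
    Vp = E @ V                                 # (r, d)
    S = Q @ Kp.T / np.sqrt(Q.shape[-1])        # (T, r)
    return softmax(S) @ Vp                     # (T, d)

T, r, d = 64, 8, 16
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, T, d))
E = rng.normal(size=(r, T)) / np.sqrt(T)       # illustrative projection matrix
out = low_rank_attention(Q, K, V, E)
```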

Attention Pooling and Compression

Another approach compresses the sequence itself.

Suppose the original sequence length is $T$. A compression layer produces a shorter representation of length $T'$:

$$T' < T.$$

Attention is then computed over the compressed representation.

Compression methods include:

| Method | Description |
|---|---|
| Pooling | Average or max pooling |
| Learned projection | Linear compression |
| Clustering | Group similar tokens |
| Landmark selection | Keep important positions |
| Downsampling | Reduce temporal resolution |

This is common in hierarchical transformers for documents, videos, and multimodal systems.

Memory-Augmented Transformers

Memory-augmented transformers extend context using external memory.

Instead of storing everything in attention, the model maintains memory states:

| Memory type | Description |
|---|---|
| Cached hidden states | Reuse previous activations |
| Retrieval memory | External database or vector store |
| Learned memory tokens | Persistent trainable memory |
| Compressed summaries | Historical context compression |

Transformer-XL introduced segment recurrence. Hidden states from previous segments become memory for future segments.

Retrieval-augmented systems use external retrieval instead of expanding the context window indefinitely.

FlashAttention

Some efficient transformer methods do not change the mathematical attention formula. Instead, they improve implementation efficiency.

FlashAttention computes exact attention while reducing memory movement between high-bandwidth GPU memory and on-chip SRAM.

The key idea is tiled attention computation:

  1. Load small query/key/value blocks.
  2. Compute partial attention.
  3. Fuse softmax and matrix multiplication.
  4. Avoid materializing the full attention matrix.

FlashAttention reduces memory overhead and improves throughput while preserving exact softmax attention.

This is important because modern accelerators are often bandwidth-bound rather than arithmetic-bound.

In practice, FlashAttention is widely used even in standard transformers because it improves efficiency without changing model behavior.
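
The sketch below is not the fused GPU kernel itself; it only illustrates the tiling idea in NumPy, processing key/value blocks with a running (online) softmax so the full $T \times T$ matrix is never materialized. Real FlashAttention also tiles queries and keeps the working blocks in on-chip memory; the block size here is arbitrary.

```python
import numpy as np

def tiled_attention(Q, K, V, block=4):
    """Exact softmax attention computed one key/value block at a time, using a
    running row maximum and running normalizer (the 'online softmax' trick)."""
    T, d = Q.shape
    out = np.zeros((T, V.shape[1]))
    m = np.full(T, -np.inf)                     # running row max
    l = np.zeros(T)                             # running softmax normalizer
    for start in range(0, T, block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        S = Q @ Kb.T / np.sqrt(d)               # (T, block) partial scores
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)               # rescale previous accumulators
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ Vb
        m = m_new
    return out / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 12, 8))
out = tiled_attention(Q, K, V, block=4)  # matches dense softmax attention up to fp error
```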

Grouped and Multi-Query Attention

During autoregressive decoding, KV cache memory becomes expensive.

Standard multi-head attention stores separate keys and values for every head:

$$K, V \in \mathbb{R}^{B \times h \times T \times d_h},$$

where $B$ is the batch size, $h$ the number of heads, and $d_h$ the per-head dimension.

Grouped-query attention and multi-query attention reduce KV storage.

| Method | KV sharing |
|---|---|
| Multi-head attention | Separate KV per head |
| Multi-query attention | One shared KV for all heads |
| Grouped-query attention | One KV per head group |

This reduces memory bandwidth and inference latency.

Grouped-query attention is widely used in large decoder-only language models because it improves serving efficiency while preserving quality better than fully shared KV.
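
A rough NumPy sketch of the sharing pattern, assuming 8 query heads and 2 key/value heads (illustrative numbers): each group of query heads reuses one KV head, so the cached $K$ and $V$ tensors are a factor of $h_q / h_{kv}$ smaller than with full multi-head KV.

```python
import numpy as np

def grouped_query_attention(Q, K, V):
    """Q has more heads than K and V; each group of query heads shares one KV
    head. Shapes: Q (h_q, T, d), K and V (h_kv, T, d) with h_kv dividing h_q."""
    group = Q.shape[0] // K.shape[0]
    K = np.repeat(K, group, axis=0)             # expand KV heads to match query heads
    V = np.repeat(V, group, axis=0)
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(Q.shape[-1])
    S = S - S.max(axis=-1, keepdims=True)
    P = np.exp(S)
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V                                # (h_q, T, d)

T, d = 16, 8
rng = np.random.default_rng(0)
Q = rng.normal(size=(8, T, d))                  # 8 query heads
K = rng.normal(size=(2, T, d))                  # 2 shared KV heads (grouped-query)
V = rng.normal(size=(2, T, d))
out = grouped_query_attention(Q, K, V)
```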

Chunked and Streaming Attention

Streaming systems process sequences incrementally rather than as complete fixed contexts.

For audio or live interaction, tokens arrive continuously:

x1 → x2 → x3 → ...

Streaming attention uses chunked processing:

| Method | Description |
|---|---|
| Fixed chunks | Process windows independently |
| Overlapping chunks | Preserve continuity |
| Rolling memory | Carry limited state forward |
| Streaming causal attention | Online autoregressive processing |

Streaming models must balance latency and context size.
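
One way to picture rolling memory is a KV cache capped at a fixed length; the sketch below uses an arbitrary cap and hypothetical function names, and processes one token per step.

```python
import numpy as np

def streaming_step(q, cache, k, v, max_len=16):
    """One decoding step with a rolling KV cache: append the new key/value,
    drop entries older than `max_len`, then attend over the retained window."""
    cache["K"] = np.concatenate([cache["K"], k[None]], axis=0)[-max_len:]
    cache["V"] = np.concatenate([cache["V"], v[None]], axis=0)[-max_len:]
    s = cache["K"] @ q / np.sqrt(q.shape[0])    # (window,) scores
    p = np.exp(s - s.max())
    p /= p.sum()
    return p @ cache["V"]                       # attended output for this step

d = 8
rng = np.random.default_rng(0)
cache = {"K": np.zeros((0, d)), "V": np.zeros((0, d))}
for t in range(100):                            # tokens arrive one at a time
    q = k = v = rng.normal(size=d)
    y = streaming_step(q, cache, k, v)
```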

State Space Alternatives

Some architectures replace attention entirely.

State space models process sequences using learned dynamical systems rather than token-token attention.

A simplified state-space recurrence is

$$h_t = A h_{t-1} + B x_t, \qquad y_t = C h_t.$$

Modern state-space models use structured parameterizations and parallel scan algorithms to scale efficiently.

Advantages include:

| Property | Benefit |
|---|---|
| Linear sequence scaling | Better long-context efficiency |
| Constant-size hidden state | Compact memory |
| Streaming-friendly | Natural online processing |

Attention remains stronger for some reasoning and retrieval tasks, but state-space methods are increasingly competitive for long sequences.
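
A minimal sketch of the recurrence above with dense, randomly chosen $A$, $B$, $C$ used purely for illustration; actual structured state-space models parameterize these matrices carefully and replace the Python loop with a parallel scan.

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Runs h_t = A h_{t-1} + B x_t, y_t = C h_t sequentially.
    The hidden state has a fixed size, so memory does not grow with T."""
    h = np.zeros(A.shape[0])
    ys = np.empty((x.shape[0], C.shape[0]))
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]
        ys[t] = C @ h
    return ys

d_state, d_in, d_out, T = 16, 4, 4, 128
rng = np.random.default_rng(0)
A = 0.9 * np.eye(d_state)                       # illustrative stable dynamics
B = 0.1 * rng.normal(size=(d_state, d_in))
C = 0.1 * rng.normal(size=(d_out, d_state))
ys = ssm_scan(A, B, C, rng.normal(size=(T, d_in)))
```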

Efficient Attention in Vision

Vision transformers often use structured efficient attention.

Common strategies include:

| Method | Purpose |
|---|---|
| Window attention | Restrict attention locally |
| Shifted windows | Cross-window interaction |
| Hierarchical pooling | Reduce resolution |
| Patch merging | Compress sequence length |
| Sparse global tokens | Long-range communication |

Images have strong spatial locality, so efficient local attention works particularly well.

Choosing an Efficient Transformer Design

The best efficient transformer depends on the application.

| Application | Common strategy |
|---|---|
| Long documents | Sliding-window or sparse attention |
| Real-time generation | KV-efficient decoding |
| Mobile inference | Quantized compact attention |
| Video modeling | Hierarchical compression |
| Streaming speech | Chunked causal attention |
| Retrieval systems | Retrieval augmentation |
| Million-token contexts | Sparse/local/retrieval hybrids |

There is no universally best efficient transformer. Tradeoffs involve:

| Tradeoff | Example |
|---|---|
| Quality vs speed | Dense attention is usually strongest |
| Simplicity vs scalability | Sparse patterns complicate kernels |
| Memory vs compute | Checkpointing recomputes activations |
| Exactness vs approximation | Linear attention approximates softmax |

Practical systems often combine multiple techniques.

Summary

Standard transformer attention scales quadratically with sequence length because it constructs a full attention matrix. Efficient transformers reduce compute, memory, latency, or bandwidth cost using sparse attention, local attention, linear attention, low-rank approximations, sequence compression, retrieval augmentation, optimized kernels, or alternative sequence models.

Some methods modify the attention algorithm itself. Others keep exact attention but optimize implementation. Modern large-scale systems often combine efficient attention, KV-efficient decoding, FlashAttention kernels, retrieval systems, and distributed memory management.