Additive attention was one of the first successful neural attention mechanisms. It was introduced for neural machine translation to allow a decoder to selectively focus on different encoder states during generation.
The key idea is simple. Instead of measuring similarity between vectors using only a dot product, additive attention learns a small neural network that computes how well a query and a key match.
This approach is sometimes called Bahdanau attention, after the work that introduced it.
From Fixed Context to Dynamic Context
Consider an encoder-decoder model.
The encoder processes an input sequence $x_1, \dots, x_T$ and produces hidden states $h_1, \dots, h_T$. At decoder step $t$, the decoder has a hidden state $s_t$.
Without attention, the decoder would rely mainly on a single fixed encoder representation. With attention, the decoder computes a context vector $c_t$ that depends on all encoder states.
The context vector changes at every decoding step. Different output tokens can therefore focus on different input tokens.
For example, in translation:
| Decoder output | Important encoder positions |
|---|---|
| Subject noun | Early source tokens |
| Verb | Middle source tokens |
| Adjective | Nearby descriptive words |
Attention lets the decoder adapt dynamically instead of using one static summary.
The Alignment Problem
The decoder must decide which encoder states matter most.
Suppose we have:
- decoder state $s_t$,
- encoder states $h_1, \dots, h_T$.

We define an alignment score $e_{t,i}$, which measures how relevant encoder state $h_i$ is for decoder step $t$.

Additive attention computes this score using a learned neural network:

$$e_{t,i} = a(s_t, h_i)$$

where $a$ is called the alignment model.
The original additive attention formulation uses:
$$e_{t,i} = v^\top \tanh(W_s s_t + W_h h_i)$$
This equation is central to additive attention.
Understanding the Components
The equation contains several learned parameters.
| Symbol | Meaning |
|---|---|
| $s_t$ | Decoder hidden state |
| $h_i$ | Encoder hidden state |
| $W_s$ | Projection matrix for decoder state |
| $W_h$ | Projection matrix for encoder state |
| $v$ | Output projection vector |
| $\tanh$ | Nonlinear activation |
The process works as follows:
- Project the decoder state.
- Project the encoder state.
- Add them together.
- Apply a nonlinearity.
- Project to a scalar score.
The scalar score represents compatibility between the decoder state and encoder state.
Large positive values indicate strong relevance.
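The five steps above can be sketched for a single (decoder state, encoder state) pair. The hidden size and the random initialization are purely illustrative:

```python
import torch

torch.manual_seed(0)
D = 4                      # hidden size (illustrative)

W_s = torch.randn(D, D)    # projection matrix for the decoder state
W_h = torch.randn(D, D)    # projection matrix for the encoder state
v = torch.randn(D)         # output projection vector

s_t = torch.randn(D)       # decoder hidden state
h_i = torch.randn(D)       # one encoder hidden state

# e_{t,i} = v^T tanh(W_s s_t + W_h h_i): project, add, tanh, project to scalar
e_ti = v @ torch.tanh(W_s @ s_t + W_h @ h_i)
```

The result is a single scalar score for this one pair; a real implementation computes it for every encoder position at once.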
Why Use a Neural Network for Scoring
A dot product assumes that similarity can be measured directly in the representation space. Additive attention instead learns a nonlinear compatibility function.
This provides more flexibility.
For example, the model may learn:
- syntactic alignment,
- semantic compatibility,
- positional preferences,
- language-specific translation patterns.
The neural scoring function can represent more complex relationships than a simple inner product.
This was particularly useful in early recurrent sequence models, where hidden states often had different distributions and scales.
From Scores to Probabilities
The raw scores $e_{t,1}, \dots, e_{t,T}$ must be normalized.
The model applies softmax:
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T} \exp(e_{t,j})}$$
The resulting coefficients satisfy $\alpha_{t,i} \ge 0$ and $\sum_{i=1}^{T} \alpha_{t,i} = 1$.
The coefficients are attention weights.
Each weight $\alpha_{t,i}$ indicates how strongly the decoder attends to encoder position $i$.
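The normalization step is a single softmax over the scores. A minimal sketch with three hand-picked scores:

```python
import torch

scores = torch.tensor([2.0, 0.5, -1.0])   # raw alignment scores e_{t,i} for T = 3
weights = torch.softmax(scores, dim=-1)   # attention weights alpha_{t,i}
```

The weights are non-negative, sum to one, and preserve the ordering of the raw scores.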
Computing the Context Vector
The final context vector is a weighted sum of encoder states:
$$c_t = \sum_{i=1}^{T} \alpha_{t,i} h_i$$
The context vector summarizes the relevant encoder information for decoder step $t$.

If attention weight $\alpha_{t,i}$ is large, encoder state $h_i$ contributes strongly to the context vector.
This context vector is then combined with the decoder state to produce the output distribution.
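The weighted sum can be written in two equivalent ways, as an elementwise broadcast or as a vector-matrix product. A small sketch with illustrative dimensions:

```python
import torch

torch.manual_seed(0)
T, D = 3, 4
h = torch.randn(T, D)                          # encoder states h_1..h_T
alpha = torch.softmax(torch.randn(T), dim=0)   # attention weights for one decoder step

# c_t = sum_i alpha_{t,i} h_i
c_t = (alpha.unsqueeze(-1) * h).sum(dim=0)     # broadcast-and-sum form
```

The same result is obtained with `alpha @ h`, which is the form vectorized implementations use.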
Attention as Soft Alignment
Additive attention creates a soft alignment between input and output positions.
Suppose a translation model generates:
| Source sentence | The cat sat on the mat |
|---|---|
| Target sentence | Le chat était assis sur le tapis |
When generating “chat”, the decoder may place high attention weight on “cat”.
When generating “tapis”, the decoder may focus on “mat”.
The alignment is soft because the weights form a probability distribution. The decoder can attend to multiple input positions simultaneously.
This differs from classical alignment methods that selected one discrete position.
Attention Matrix
For an output sequence of length $T'$ and input sequence of length $T$, the attention weights form a matrix $A \in \mathbb{R}^{T' \times T}$.
Each row corresponds to one decoder step.
Each row sums to 1 because of the softmax normalization.
Visualizing this matrix often reveals interpretable patterns:
- diagonal alignments in translation,
- local focus in speech recognition,
- structural dependencies in parsing.
Attention heatmaps became one of the first practical interpretability tools for neural sequence models.
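The matrix view can be sketched directly: softmax over the last dimension of a score matrix yields one probability distribution per decoder step. The sequence lengths here are arbitrary:

```python
import torch

torch.manual_seed(0)
T_out, T_in = 5, 7
scores = torch.randn(T_out, T_in)   # e_{t,i} for every (decoder, encoder) pair
A = torch.softmax(scores, dim=-1)   # attention matrix: one row per decoder step
```

Plotting `A` as a heatmap gives exactly the alignment visualizations described above.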
Additive Attention Versus Dot-Product Attention
Additive attention and dot-product attention solve the same problem differently.
| Property | Additive attention | Dot-product attention |
|---|---|---|
| Score function | Small neural network | Inner product |
| Complexity | Higher | Lower |
| Flexibility | More expressive | Simpler |
| Parallel efficiency | Lower | Higher |
| Transformer usage | Rare | Standard |
Additive attention was common in recurrent sequence models. Dot-product attention became dominant in transformers because matrix multiplication is highly optimized on GPUs and TPUs.
Still, additive attention remains conceptually important because it clearly separates:
- scoring,
- normalization,
- retrieval.
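For contrast, dot-product scoring replaces the learned scoring network with a single batched matrix multiplication. A minimal sketch, using the transformer convention of scaling by $1/\sqrt{D}$:

```python
import torch

torch.manual_seed(0)
B, T_out, T_in, D = 2, 5, 7, 16
Q = torch.randn(B, T_out, D)   # decoder-side queries
K = torch.randn(B, T_in, D)    # encoder-side keys

# All pairwise scores in one batched matmul; no extra parameters are learned
# for scoring itself.
scores = torch.bmm(Q, K.transpose(1, 2)) / D ** 0.5   # [B, T_out, T_in]
```

This is why dot-product attention maps so well onto GPU and TPU matrix units: the entire scoring step is one matmul.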
Vectorized Formulation
Modern implementations avoid loops over sequence positions.
Suppose a tensor $S \in \mathbb{R}^{B \times T' \times D}$ contains the decoder states and $H \in \mathbb{R}^{B \times T \times D}$ contains the encoder states.
The attention mechanism computes all pairwise scores simultaneously.
The output score tensor has shape $[B, T', T]$.
Each decoder position compares itself with every encoder position.
This fully vectorized structure is important for accelerator efficiency.
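The pairwise scores can be computed without any loops by broadcasting the two projections against each other. A sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
B, T_out, T_in, D = 2, 5, 7, 16
S = torch.randn(B, T_out, D)   # decoder states
H = torch.randn(B, T_in, D)    # encoder states

W_s = nn.Linear(D, D, bias=False)   # projects decoder states
W_h = nn.Linear(D, D, bias=False)   # projects encoder states
v = nn.Linear(D, 1, bias=False)     # maps features to scalar scores

# Broadcast-add: [B, T_out, 1, D] + [B, 1, T_in, D] -> [B, T_out, T_in, D]
features = torch.tanh(W_s(S).unsqueeze(2) + W_h(H).unsqueeze(1))
scores = v(features).squeeze(-1)    # [B, T_out, T_in]
```

Note the cost: the broadcast materializes a $[B, T', T, D]$ intermediate, which is one reason additive attention scales worse than dot-product attention.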
PyTorch Implementation
A simple additive attention module may look like this:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_q = nn.Linear(hidden_dim, hidden_dim)
        self.W_k = nn.Linear(hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1)

    def forward(self, query, keys, values):
        """
        query:  [B, 1, D]
        keys:   [B, T, D]
        values: [B, T, D]
        """
        q = self.W_q(query)                                 # [B, 1, D]
        k = self.W_k(keys)                                  # [B, T, D]
        scores = self.v(torch.tanh(q + k)).squeeze(-1)      # [B, T]
        weights = torch.softmax(scores, dim=-1)             # [B, T]
        context = torch.bmm(weights.unsqueeze(1), values)   # [B, 1, D]
        return context, weights
```

The tensor shapes are important.
| Tensor | Shape |
|---|---|
| Query | [B, 1, D] |
| Keys | [B, T, D] |
| Scores | [B, T] |
| Weights | [B, T] |
| Context | [B, 1, D] |
The batch matrix multiplication `torch.bmm` computes the weighted sum efficiently across the batch dimension.
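To exercise the module end to end, its definition is repeated here so the snippet runs standalone; the batch size and dimensions are arbitrary, and the encoder states serve as both keys and values, as in the original encoder-decoder setting:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.W_q = nn.Linear(hidden_dim, hidden_dim)
        self.W_k = nn.Linear(hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1)

    def forward(self, query, keys, values):
        q = self.W_q(query)                                 # [B, 1, D]
        k = self.W_k(keys)                                  # [B, T, D]
        scores = self.v(torch.tanh(q + k)).squeeze(-1)      # [B, T]
        weights = torch.softmax(scores, dim=-1)             # [B, T]
        context = torch.bmm(weights.unsqueeze(1), values)   # [B, 1, D]
        return context, weights

torch.manual_seed(0)
B, T, D = 2, 6, 32
attn = AdditiveAttention(D)
query = torch.randn(B, 1, D)     # current decoder state
keys = torch.randn(B, T, D)      # encoder states

context, weights = attn(query, keys, keys)   # keys double as values
```

At each decoding step the decoder would call the module again with its new state, producing a fresh context vector.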
Attention and Gradient Flow
Attention also improves optimization.
Without attention, gradients must pass through many recurrent transitions to connect distant positions. This creates long computational paths and weak gradient signals.
Attention creates shorter paths.
A decoder state can directly access any encoder state through one attention operation. This improves gradient flow and helps the model learn long-range dependencies.
This advantage became even more important as sequence lengths increased.
Limitations of Additive Attention
Although additive attention was influential, it has several limitations.
First, recurrent encoders and decoders still process sequences sequentially. Parallelism remains limited.
Second, the learned scoring network increases computational cost.
Third, pairwise recurrent attention becomes slow for very long sequences.
These limitations motivated the transition toward transformer architectures based entirely on self-attention and matrix operations.
Historical Importance
Additive attention marked a major transition in deep learning.
Before attention, neural sequence models relied heavily on fixed-size compression. Attention introduced dynamic information access.
Several ideas introduced here became foundational:
- differentiable memory access,
- soft alignment,
- weighted retrieval,
- content-dependent computation.
Modern transformers generalize these ideas to fully attention-based architectures.
Summary
Additive attention computes relevance scores between queries and keys using a learned neural network. These scores are normalized into attention weights, which produce a weighted combination of value vectors.
The mechanism solves the fixed-vector bottleneck of early encoder-decoder systems and enables dynamic access to input information. It improves alignment, gradient flow, and long-range dependency modeling.
Although transformers mainly use dot-product attention today, additive attention established the core conceptual structure used by nearly all modern attention systems.