Additive Attention

Additive attention was one of the first successful neural attention mechanisms. It was introduced for neural machine translation to allow a decoder to selectively focus on different encoder states during generation.

The key idea is simple. Instead of measuring similarity between vectors using only a dot product, additive attention learns a small neural network that computes how well a query and a key match.

This approach is often called Bahdanau attention, after the 2014 paper by Bahdanau, Cho, and Bengio that introduced it.

From Fixed Context to Dynamic Context

Consider an encoder-decoder model.

The encoder processes an input sequence:

x_1, x_2, \ldots, x_T

and produces hidden states:

h_1, h_2, \ldots, h_T.

At decoder step t, the decoder has a hidden state:

s_t.

Without attention, the decoder would rely mainly on a single fixed encoder representation. With attention, the decoder computes a context vector:

c_t,

which depends on all encoder states.

The context vector changes at every decoding step. Different output tokens can therefore focus on different input tokens.

For example, in translation:

| Decoder output | Important encoder positions |
|---|---|
| Subject noun | Early source tokens |
| Verb | Middle source tokens |
| Adjective | Nearby descriptive words |

Attention lets the decoder adapt dynamically instead of using one static summary.

The Alignment Problem

The decoder must decide which encoder states matter most.

Suppose we have:

  • decoder state s_t,
  • encoder states h_1, h_2, \ldots, h_T.

We define an alignment score:

e_{t,i},

which measures how relevant encoder state h_i is for decoder step t.

Additive attention computes this score using a learned neural network:

e_{t,i} = a(s_t, h_i),

where a(\cdot) is called the alignment model.

The original additive attention formulation uses:

e_{t,i} = v^\top \tanh(W_s s_t + W_h h_i).

genui{“math_block_widget_always_prefetch_v2”:{“content”:“e_{t,i}=v^\top\tanh(W_s s_t + W_h h_i)”}}

This equation is central to additive attention.

Understanding the Components

The equation contains several learned parameters.

| Symbol | Meaning |
|---|---|
| s_t | Decoder hidden state |
| h_i | Encoder hidden state |
| W_s | Projection matrix for decoder state |
| W_h | Projection matrix for encoder state |
| v | Output projection vector |
| \tanh | Nonlinear activation |

The process works as follows:

  1. Project the decoder state.
  2. Project the encoder state.
  3. Add them together.
  4. Apply a nonlinearity.
  5. Project to a scalar score.

The scalar score represents compatibility between the decoder state and encoder state.

Large positive values indicate strong relevance.
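
To make these steps concrete, here is a minimal sketch of the scoring computation for one decoder state and one encoder state. The dimension size and random tensors are illustrative assumptions, not values from the original paper.

import torch

hidden_dim = 8                             # illustrative size

W_s = torch.randn(hidden_dim, hidden_dim)  # decoder projection
W_h = torch.randn(hidden_dim, hidden_dim)  # encoder projection
v = torch.randn(hidden_dim)                # output projection vector

s_t = torch.randn(hidden_dim)              # decoder state at step t
h_i = torch.randn(hidden_dim)              # one encoder state

# Steps 1-5: project both states, add, apply tanh, reduce to a scalar
e_ti = v @ torch.tanh(W_s @ s_t + W_h @ h_i)
print(e_ti.item())                         # one compatibility score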

Why Use a Neural Network for Scoring

A dot product assumes that similarity can be measured directly in the representation space. Additive attention instead learns a nonlinear compatibility function.

This provides more flexibility.

For example, the model may learn:

  • syntactic alignment,
  • semantic compatibility,
  • positional preferences,
  • language-specific translation patterns.

The neural scoring function can represent more complex relationships than a simple inner product.

This was particularly useful in early recurrent sequence models, where hidden states often had different distributions and scales.

From Scores to Probabilities

The raw scores

e_{t,1}, e_{t,2}, \ldots, e_{t,T}

must be normalized.

The model applies softmax:

\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{j=1}^{T}\exp(e_{t,j})}.

genui{“math_block_widget_always_prefetch_v2”:{“content”:"\alpha_{t,i}=\frac{\exp(e_{t,i})}{\sum_{j=1}^{T}\exp(e_{t,j})}"}}

The resulting coefficients satisfy:

\alpha_{t,i} \ge 0, \qquad \sum_{i=1}^{T}\alpha_{t,i}=1.

The coefficients are attention weights.

Each weight indicates how strongly the decoder attends to encoder position i.
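
As a quick numeric illustration with made-up scores, softmax turns raw alignment scores into a valid probability distribution:

import torch

e_t = torch.tensor([2.0, 0.5, -1.0])        # raw alignment scores
alpha_t = torch.softmax(e_t, dim=-1)

print(alpha_t)                              # approximately [0.786, 0.175, 0.039]
print(alpha_t.sum())                        # 1.0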

Computing the Context Vector

The final context vector is a weighted sum of encoder states:

c_t = \sum_{i=1}^{T} \alpha_{t,i} h_i.

genui{“math_block_widget_always_prefetch_v2”:{“content”:“c_t=\sum_{i=1}^{T}\alpha_{t,i} h_i”}}

The context vector summarizes the relevant encoder information for decoder step t.

If attention weight \alpha_{t,i} is large, encoder state h_i contributes strongly to the context vector.

This context vector is then combined with the decoder state to produce the output distribution.
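
Continuing in the same toy setting, the context vector is just the attention-weighted sum of encoder states; the shapes below are illustrative:

import torch

T, D = 3, 4
H = torch.randn(T, D)                            # encoder states h_1 ... h_T
alpha_t = torch.softmax(torch.randn(T), dim=-1)  # attention weights

c_t = alpha_t @ H                                # weighted sum, shape [D]
print(c_t.shape)                                 # torch.Size([4])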

Attention as Soft Alignment

Additive attention creates a soft alignment between input and output positions.

Suppose a translation model generates:

Source sentence: The cat sat on the mat
Target sentence: Le chat était assis sur le tapis

When generating “chat”, the decoder may place high attention weight on “cat”.

When generating “tapis”, the decoder may focus on “mat”.

The alignment is soft because the weights form a probability distribution. The decoder can attend to multiple input positions simultaneously.

This differs from classical alignment methods that selected one discrete position.

Attention Matrix

For an output sequence of length T_y and input sequence of length T_x, the attention weights form a matrix:

A \in \mathbb{R}^{T_y \times T_x}.

Each row corresponds to one decoder step.

Each row sums to 1 because of the softmax normalization.

Visualizing this matrix often reveals interpretable patterns:

  • diagonal alignments in translation,
  • local focus in speech recognition,
  • structural dependencies in parsing.

Attention heatmaps became one of the first practical interpretability tools for neural sequence models.
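
A minimal sketch of such a heatmap, assuming matplotlib is available and using a random stand-in for a real attention matrix:

import torch
import matplotlib.pyplot as plt

A = torch.softmax(torch.randn(5, 6), dim=-1)   # stand-in [T_y, T_x] weights

plt.imshow(A.numpy(), aspect="auto")
plt.xlabel("Encoder position")
plt.ylabel("Decoder step")
plt.colorbar(label="Attention weight")
plt.show()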

Additive Attention Versus Dot-Product Attention

Additive attention and dot-product attention solve the same problem differently.

| Property | Additive attention | Dot-product attention |
|---|---|---|
| Score function | Small neural network | Inner product |
| Complexity | Higher | Lower |
| Flexibility | More expressive | Simpler |
| Parallel efficiency | Lower | Higher |
| Transformer usage | Rare | Standard |

Additive attention was common in recurrent sequence models. Dot-product attention became dominant in transformers because matrix multiplication is highly optimized on GPUs and TPUs.
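
For contrast, a scaled dot-product score needs no learned scorer, only a matrix multiplication; the tensor shapes here are illustrative:

import torch

B, T_y, T_x, d = 2, 5, 7, 16
Q = torch.randn(B, T_y, d)                  # queries (decoder states)
K = torch.randn(B, T_x, d)                  # keys (encoder states)

scores = Q @ K.transpose(-2, -1) / d**0.5   # [B, T_y, T_x]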

Still, additive attention remains conceptually important because it clearly separates:

  • scoring,
  • normalization,
  • retrieval.

Vectorized Formulation

Modern implementations avoid loops over sequence positions.

Suppose:

S \in \mathbb{R}^{B \times T_y \times d},

contains decoder states and

H \in \mathbb{R}^{B \times T_x \times d},

contains encoder states.

The attention mechanism computes all pairwise scores simultaneously.

The output score tensor has shape:

[B, T_y, T_x].

Each decoder position compares itself with every encoder position.

This fully vectorized structure is important for accelerator efficiency.
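
A sketch of this broadcast computation, reusing the parameter names from the scoring equation; the linear layers and sizes are assumptions for illustration:

import torch
import torch.nn as nn

B, T_y, T_x, d = 2, 4, 6, 8
S = torch.randn(B, T_y, d)                  # decoder states
H = torch.randn(B, T_x, d)                  # encoder states

W_s = nn.Linear(d, d)
W_h = nn.Linear(d, d)
v = nn.Linear(d, 1)

# Broadcast [B, T_y, 1, d] + [B, 1, T_x, d] -> [B, T_y, T_x, d]
features = torch.tanh(W_s(S).unsqueeze(2) + W_h(H).unsqueeze(1))
scores = v(features).squeeze(-1)            # [B, T_y, T_x]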

PyTorch Implementation

A simple additive attention module may look like this:

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()

        # Learned parameters W_s, W_h, and v from the scoring equation
        self.W_q = nn.Linear(hidden_dim, hidden_dim)
        self.W_k = nn.Linear(hidden_dim, hidden_dim)
        self.v = nn.Linear(hidden_dim, 1)

    def forward(self, query, keys, values):
        """
        query:  [B, 1, D]
        keys:   [B, T, D]
        values: [B, T, D]
        """

        q = self.W_q(query)      # [B, 1, D]
        k = self.W_k(keys)       # [B, T, D]

        # Broadcast-add, apply tanh, project to one score per position
        scores = self.v(
            torch.tanh(q + k)    # [B, T, D]
        ).squeeze(-1)            # [B, T]

        # Normalize scores into attention weights over encoder positions
        weights = torch.softmax(scores, dim=-1)

        # Weighted sum of values: [B, 1, T] x [B, T, D] -> [B, 1, D]
        context = torch.bmm(
            weights.unsqueeze(1),
            values
        )

        return context, weights

The tensor shapes are important.

| Tensor | Shape |
|---|---|
| Query | [B, 1, D] |
| Keys | [B, T, D] |
| Scores | [B, T] |
| Weights | [B, T] |
| Context | [B, 1, D] |

The batch matrix multiplication torch.bmm computes the weighted sum efficiently across the batch dimension.
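
A quick usage sketch with random tensors, checking only shapes rather than trained behavior:

B, T, D = 2, 5, 16
attn = AdditiveAttention(D)

query = torch.randn(B, 1, D)
keys = torch.randn(B, T, D)
values = torch.randn(B, T, D)

context, weights = attn(query, keys, values)
print(context.shape)                        # torch.Size([2, 1, 16])
print(weights.shape)                        # torch.Size([2, 5])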

Attention and Gradient Flow

Attention also improves optimization.

Without attention, gradients must pass through many recurrent transitions to connect distant positions. This creates long computational paths and weak gradient signals.

Attention creates shorter paths.

A decoder state can directly access any encoder state through one attention operation. This improves gradient flow and helps the model learn long-range dependencies.

This advantage became even more important as sequence lengths increased.

Limitations of Additive Attention

Although additive attention was influential, it has several limitations.

First, recurrent encoders and decoders still process sequences sequentially. Parallelism remains limited.

Second, the learned scoring network increases computational cost.

Third, computing pairwise attention scores over recurrent states becomes slow for very long sequences.

These limitations motivated the transition toward transformer architectures based entirely on self-attention and matrix operations.

Historical Importance

Additive attention marked a major transition in deep learning.

Before attention, neural sequence models relied heavily on fixed-size compression. Attention introduced dynamic information access.

Several ideas introduced here became foundational:

  • differentiable memory access,
  • soft alignment,
  • weighted retrieval,
  • content-dependent computation.

Modern transformers generalize these ideas to fully attention-based architectures.

Summary

Additive attention computes relevance scores between queries and keys using a learned neural network. These scores are normalized into attention weights, which produce a weighted combination of value vectors.

The mechanism solves the fixed-vector bottleneck of early encoder-decoder systems and enables dynamic access to input information. It improves alignment, gradient flow, and long-range dependency modeling.

Although transformers mainly use dot-product attention today, additive attention established the core conceptual structure used by nearly all modern attention systems.