Named Entity Recognition

Named entity recognition, usually abbreviated NER, identifies spans of text that refer to named or typed entities. Typical entity types include people, organizations, locations, dates, products, events, quantities, and domain-specific terms.

For example:

Ada Lovelace worked with Charles Babbage in London.

A named entity recognizer may produce:

Span              Entity type
Ada Lovelace      PERSON
Charles Babbage   PERSON
London            LOCATION

NER is a sequence labeling problem. Unlike text classification, which assigns one label to the whole input, NER assigns a label to each token.

Token-Level Labels

Given a token sequence

x = (x_1, x_2, \dots, x_T),

a sequence labeling model predicts a label sequence

y = (y_1, y_2, \dots, y_T).

Each input token receives one output label.

For example:

Token      Label
Ada        B-PER
Lovelace   I-PER
worked     O
with       O
Charles    B-PER
Babbage    I-PER
in         O
London     B-LOC

The label O means “outside any entity.”

BIO Tagging

NER commonly uses the BIO tagging scheme.

Prefix   Meaning
B        Beginning of an entity
I        Inside an entity
O        Outside any entity

If the entity types are PER, ORG, and LOC, the label set may be:

O
B-PER
I-PER
B-ORG
I-ORG
B-LOC
I-LOC

BIO labels allow the model to represent multi-token entities:

Token   Label
New     B-LOC
York    I-LOC
City    I-LOC

A new entity of the same type begins with another B label:

Token    Label
Paris    B-LOC
and      O
London   B-LOC

This avoids ambiguity between one long entity and two separate entities.

Tokenization and Label Alignment

NER becomes more complicated when using subword tokenization. A word may be split into several tokens.

For example:

Lovelace

might become:

Love ##lace

The original word has one NER label, but the tokenizer produces two model tokens. We must align word-level labels with token-level inputs.

One common rule is to assign the label to the first subword and ignore the remaining subwords during loss computation.

Word       Word label   Tokens         Token labels
Ada        B-PER        Ada            B-PER
Lovelace   I-PER        Love, ##lace   I-PER, IGNORE

Another rule assigns I-PER to continuation subwords:

Word       Word label   Tokens         Token labels
Lovelace   I-PER        Love, ##lace   I-PER, I-PER

Both conventions are used. The first avoids overweighting words that split into many subwords. The second gives every token a valid label. The chosen convention must be used consistently in training and evaluation.
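The alignment rule can be sketched as a small function. This is a sketch, not a library API: it assumes label ids for the words and a word-id list that maps each subword token to the word it came from (with None for special tokens), mirroring the word-to-token mapping many tokenizers expose.

```python
# Sketch: align word-level BIO labels with subword tokens.
# `word_labels` has one label id per word; `word_ids` maps each subword
# token to the index of its source word (None for special tokens).

IGNORE = -100

def align_labels(word_labels, word_ids, label_all_subwords=False):
    aligned = []
    prev_word = None
    for wid in word_ids:
        if wid is None:                  # special token such as [CLS]
            aligned.append(IGNORE)
        elif wid != prev_word:           # first subword of a word
            aligned.append(word_labels[wid])
        elif label_all_subwords:         # continuation subword, rule 2
            # A fuller version would also map a B- label to its
            # I- counterpart on continuations.
            aligned.append(word_labels[wid])
        else:                            # continuation subword, rule 1
            aligned.append(IGNORE)
        prev_word = wid
    return aligned

# "Ada Lovelace" -> [CLS] Ada Love ##lace [SEP]
word_labels = [1, 2]                     # B-PER, I-PER
word_ids = [None, 0, 1, 1, None]
print(align_labels(word_labels, word_ids))
# -> [-100, 1, 2, -100, -100]
```

Switching `label_all_subwords=True` implements the second convention, labeling every continuation subword instead of ignoring it.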

In PyTorch, ignored positions are often labeled with -100, because nn.CrossEntropyLoss ignores targets with ignore_index=-100 by default.

labels = torch.tensor([
    [1, 2, -100, 0, 3],
])

Here the third token is ignored during loss computation.

A Transformer NER Model

A transformer NER model produces one contextual vector per token. A linear classifier maps each token vector to entity-label logits.

If the input has shape

input_ids: [B, T]

then the transformer output has shape

hidden_states: [B, T, D]

The classifier maps this to:

logits: [B, T, K]

where K is the number of NER labels.

A simplified PyTorch classifier head:

import torch
import torch.nn as nn

class TokenClassificationHead(nn.Module):
    def __init__(self, hidden_dim, num_labels, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, hidden_states):
        # hidden_states: [B, T, D]
        x = self.dropout(hidden_states)
        logits = self.classifier(x)
        # logits: [B, T, K]
        return logits

The same classifier is applied independently at every token position. Context comes from the transformer layers before the classifier.
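A quick shape check makes the per-token application concrete. The head is written inline here as dropout plus a linear layer; the dimensions are arbitrary for illustration.

```python
import torch
import torch.nn as nn

# Shape check: a head equivalent to the one above, applied to random
# "hidden states". nn.Linear acts on the last axis, so the same weights
# are applied independently at every token position.
B, T, D, K = 2, 6, 8, 5
head = nn.Sequential(nn.Dropout(0.1), nn.Linear(D, K))
hidden_states = torch.randn(B, T, D)   # [B, T, D]
logits = head(hidden_states)
print(logits.shape)                    # torch.Size([2, 6, 5])
```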

Loss Function for NER

NER usually uses cross-entropy loss at each token position. Padding tokens and ignored subwords should be excluded from the loss.

Suppose:

logits: [B, T, K]
labels: [B, T]

PyTorch’s CrossEntropyLoss expects class logits with shape [N, K] and labels with shape [N]. We flatten the batch and sequence axes:

loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

B, T, K = logits.shape

loss = loss_fn(
    logits.reshape(B * T, K),
    labels.reshape(B * T),
)

The ignored labels do not contribute to the loss.

Attention Masks and Padding

NER batches are padded just like other NLP batches. The attention mask prevents the transformer from attending to padding positions. The label tensor also marks padding positions as ignored.

Example:

input_ids = torch.tensor([
    [101, 2030, 2293, 102, 0, 0],
    [101, 2759, 3000, 1999, 2414, 102],
])

attention_mask = torch.tensor([
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
])

labels = torch.tensor([
    [-100, 0, 1, -100, -100, -100],
    [-100, 1, 2, 0, 3, -100],
])

Special tokens such as [CLS] and [SEP] are often assigned -100, because they do not correspond to original text tokens.

Padding must be handled in both places: attention masks for the model, ignored labels for the loss.
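Both pieces are typically produced in one collate step. The sketch below is a minimal illustration, assuming examples arrive as plain Python lists of token ids and label ids; the id values are arbitrary.

```python
import torch

# Sketch of a collate step: pad variable-length examples to the batch
# maximum, build the attention mask, and pad labels with -100.
PAD_ID, IGNORE = 0, -100

def collate(batch):
    # batch: list of (input_ids, labels) pairs as plain Python lists
    max_len = max(len(ids) for ids, _ in batch)
    input_ids, attention_mask, labels = [], [], []
    for ids, labs in batch:
        pad = max_len - len(ids)
        input_ids.append(ids + [PAD_ID] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
        labels.append(labs + [IGNORE] * pad)
    return (torch.tensor(input_ids),
            torch.tensor(attention_mask),
            torch.tensor(labels))

ids_a, labs_a = [101, 2030, 102], [-100, 0, -100]
ids_b, labs_b = [101, 2759, 3000, 102], [-100, 1, 2, -100]
ids, mask, labs = collate([(ids_a, labs_a), (ids_b, labs_b)])
print(mask.tolist())   # [[1, 1, 1, 0], [1, 1, 1, 1]]
```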

Building a Small NER Model

The following example shows the structure of a complete token classification model assuming an encoder module that returns hidden states.

class NERModel(nn.Module):
    def __init__(self, encoder, hidden_dim, num_labels, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, input_ids, attention_mask=None, labels=None):
        # encoder output: [B, T, D]
        hidden_states = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )

        hidden_states = self.dropout(hidden_states)
        logits = self.classifier(hidden_states)
        # logits: [B, T, K]

        if labels is None:
            return logits

        B, T, K = logits.shape
        loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

        loss = loss_fn(
            logits.reshape(B * T, K),
            labels.reshape(B * T),
        )

        return loss, logits

For a pretrained transformer, the encoder is usually loaded from a model library. The classifier head is randomly initialized and fine-tuned with the transformer.

Decoding Token Predictions

At inference time, the model returns logits for every token. We choose the highest-scoring label at each position:

pred_ids = logits.argmax(dim=-1)

The result has shape:

[B, T]

To recover entity spans, we must map token predictions back to text. Ignored tokens, padding tokens, and special tokens are skipped. BIO labels are then converted into spans.
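The BIO-to-span conversion can be sketched as follows. This version works over word-level labels, returns (start, end, type) with an exclusive end index, and leniently treats a stray I- tag as the start of a new entity.

```python
# Sketch: convert a BIO label sequence over words into (start, end, type)
# spans, where `end` is exclusive.

def bio_to_spans(labels):
    spans, start, ent_type = [], None, None
    for i, label in enumerate(labels):
        if label.startswith("B-") or (
            label.startswith("I-") and ent_type != label[2:]
        ):
            # B- always starts an entity; a stray I- is treated leniently
            # as a new entity of its type.
            if start is not None:
                spans.append((start, i, ent_type))
            start, ent_type = i, label[2:]
        elif label == "O":
            if start is not None:
                spans.append((start, i, ent_type))
            start, ent_type = None, None
    if start is not None:
        spans.append((start, len(labels), ent_type))
    return spans

labels = ["B-PER", "I-PER", "O", "O", "B-PER", "I-PER", "O", "B-LOC"]
print(bio_to_spans(labels))
# -> [(0, 2, 'PER'), (4, 6, 'PER'), (7, 8, 'LOC')]
```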

Example token predictions:

Token    Predicted label
Ada      B-PER
Love     I-PER
##lace   IGNORE
worked   O
London   B-LOC

After merging subwords, the output spans are:

Span           Type
Ada Lovelace   PER
London         LOC

Decoding is part of the model interface. A useful NER system returns spans with character offsets, not only token labels.

For example:

[
  {
    "text": "Ada Lovelace",
    "type": "PER",
    "start": 0,
    "end": 12
  }
]

Character offsets are important for highlighting text, storing annotations, and downstream information extraction.

Conditional Random Fields

A token classifier predicts each label independently given the encoder output. This can produce invalid BIO sequences.

For example:

Token    Predicted label
worked   I-ORG

An I-ORG label should usually follow B-ORG or another I-ORG, not begin an entity by itself.

A conditional random field, or CRF, adds a structured decoding layer. It learns transition scores between labels and searches for the best full label sequence rather than selecting each token independently.

The score of a label sequence can be written as:

s(x, y) = \sum_{t=1}^{T} e_t(y_t) + \sum_{t=2}^{T} a(y_{t-1}, y_t),

where e_t(y_t) is the emission score from the neural network and a(y_{t-1}, y_t) is the transition score from the previous label to the current label.

A CRF can learn that B-PER may be followed by I-PER, while B-PER followed by I-LOC is unlikely.

CRFs were especially common in pre-transformer NER systems. With strong pretrained transformers, a simple token classifier often works well. CRFs remain useful when label consistency is important or training data is limited.
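The search over full label sequences is done with the Viterbi algorithm. The sketch below decodes the score s(x, y) above given emission scores from the network and a transition matrix; the shapes and values here are arbitrary, and a trained CRF would learn the transitions.

```python
import torch

# Sketch of Viterbi decoding: emissions[t, j] is the emission score of
# label j at position t, trans[i, j] the score of moving from label i
# to label j.

def viterbi_decode(emissions, trans):
    # emissions: [T, K], trans: [K, K]
    T, K = emissions.shape
    score = emissions[0]            # best score ending in each label
    backptr = []
    for t in range(1, T):
        # total[i, j] = score[i] + trans[i, j] + emissions[t, j]
        total = score.unsqueeze(1) + trans + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backptr.append(best_prev)
    # Walk back from the best final label.
    path = [int(score.argmax())]
    for best_prev in reversed(backptr):
        path.append(int(best_prev[path[-1]]))
    return list(reversed(path))

emissions = torch.randn(6, 5)
trans = torch.randn(5, 5)
path = viterbi_decode(emissions, trans)
print(path)                         # a length-6 list of label indices
```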

Evaluation

NER evaluation usually happens at the entity-span level, not the token level. A predicted entity is correct only when its span boundaries and type match the gold annotation.

Suppose the gold entity is:

[Ada Lovelace]PER

If the model predicts only:

[Ada]PER

then the entity is usually counted as wrong, even though one token was correct.

Common metrics are precision, recall, and F1.

Metric      Meaning
Precision   Fraction of predicted entities that are correct
Recall      Fraction of gold entities that were found
F1          Harmonic mean of precision and recall

NER systems should also report per-entity-type metrics. A model may perform well on PERSON and poorly on ORG or domain-specific entities.
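Entity-level scoring can be sketched directly over (start, end, type) span tuples. A prediction counts only on an exact boundary-and-type match, as described above.

```python
# Sketch: entity-level precision, recall, and F1 over (start, end, type)
# spans. A prediction is a true positive only on an exact match.

def span_f1(gold, pred):
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = [(0, 2, "PER"), (4, 5, "LOC")]
pred = [(0, 1, "PER"), (4, 5, "LOC")]   # boundary error on the first span
print(span_f1(gold, pred))              # (0.5, 0.5, 0.5)
```

Note how the boundary error costs both a false positive and a false negative, which is why token-level accuracy can look much higher than span-level F1.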

Common Sources of Error

NER errors often fall into a few categories.

Error type            Example
Boundary error        Predicts New York instead of New York City
Type error            Predicts Amazon as LOC instead of ORG
Missed entity         Fails to detect a rare product name
Spurious entity       Marks an ordinary noun as an entity
Nested entity error   Struggles with entities inside entities
Tokenization error    Splits or aligns labels incorrectly

Boundary errors are common because entity spans are not always linguistically obvious. For example, should University of California include Berkeley when the text says University of California, Berkeley? The answer depends on the annotation guidelines.

Domain-Specific NER

General NER labels such as PERSON, ORG, and LOC are useful, but many real systems need domain-specific labels.

In medicine, labels may include:

DISEASE
DRUG
DOSAGE
SYMPTOM
ANATOMY

In law, labels may include:

CASE_NAME
STATUTE
COURT
JUDGE
DATE

In finance, labels may include:

TICKER
COMPANY
METRIC
CURRENCY_AMOUNT
FISCAL_PERIOD

Domain-specific NER usually requires carefully annotated data. The quality of the annotation guidelines often matters as much as the model architecture.

Nested and Overlapping Entities

Standard BIO tagging assumes that each token has at most one label. This works for flat entities, but some text contains nested entities.

Example:

University of California, Berkeley

This may be both an organization and contain a location:

[University of California, Berkeley]ORG
                           [Berkeley]LOC

A single BIO label sequence cannot represent both spans at once. Alternatives include span classification, layered sequence labeling, hypergraph methods, and generative extraction.

For many practical systems, flat NER is sufficient. For legal, biomedical, and scientific text, nested entities may be important.

Span-Based NER

Instead of labeling each token, a span-based model considers candidate spans and classifies each span.

For a sequence of length T, possible spans are pairs:

(i, j), \quad 1 \le i \le j \le T.

Each span receives a label such as PERSON, ORG, or NONE.

Span-based NER can naturally represent overlapping and nested entities. The main cost is that there are many possible spans. In practice, systems often limit the maximum span length.

A span representation may concatenate the start token vector, end token vector, and a pooled vector over the span:

span_repr = [h_start; h_end; mean(h_start...h_end)]

A classifier then predicts the entity type.
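Span enumeration and the concatenated representation can be sketched in a few lines. T, D, and the maximum span width are arbitrary here; a real system would feed `span_reprs` into a classifier over entity types plus NONE.

```python
import torch

# Sketch: enumerate candidate spans up to a maximum width and build the
# representation [h_start; h_end; mean(h_start..h_end)] for each.
T, D, max_width = 6, 8, 3
h = torch.randn(T, D)                  # one contextual vector per token

spans, reprs = [], []
for i in range(T):
    for j in range(i, min(i + max_width, T)):
        spans.append((i, j))           # inclusive (start, end) indices
        reprs.append(torch.cat([h[i], h[j], h[i:j + 1].mean(dim=0)]))

span_reprs = torch.stack(reprs)        # [num_spans, 3 * D]
print(len(spans), span_reprs.shape)    # 15 torch.Size([15, 24])
```

Capping the span width keeps the candidate count linear in T instead of quadratic, which is the usual practical compromise mentioned above.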

NER in Information Extraction Pipelines

NER is often the first stage in a larger information extraction pipeline.

A typical pipeline may be:

raw text
-> sentence splitting
-> tokenization
-> named entity recognition
-> entity normalization
-> relation extraction
-> database insertion

Entity normalization maps a detected mention to a canonical database entry. For example, IBM, International Business Machines, and IBM Corp. may all refer to the same company.

NER finds spans. Normalization decides what real-world object each span denotes.

Practical Guidelines

Start with a pretrained transformer and a token classification head. Use a simple BIO scheme before considering more complex models.

Keep annotation guidelines precise. Many NER failures come from inconsistent labels rather than weak models.

Inspect examples, not only aggregate metrics. Boundary errors and type errors require different fixes.

Handle subword alignment carefully. Incorrect alignment silently damages training.

Evaluate at the entity level. Token-level accuracy can look high even when span extraction quality is poor, because most tokens are usually labeled O.

Summary

Named entity recognition identifies typed spans in text. It is usually formulated as token classification with BIO labels. A transformer encoder produces contextual token vectors, and a classifier predicts one label per token.

NER requires careful handling of tokenization, padding, ignored labels, decoding, and span-level evaluation. Simple token classifiers are strong baselines. CRFs and span-based methods add structure when label consistency, nesting, or overlapping entities are important.