Named Entity Recognition

Named entity recognition, or NER, is the task of finding spans of text that refer to entities and assigning each span a type. Common entity types include people, organizations, locations, dates, products, medical terms, legal references, gene names, and monetary amounts.

For example:

Apple hired John Smith in California.

A named entity recognizer may produce:

| Span | Entity type |
| --- | --- |
| Apple | ORG |
| John Smith | PERSON |
| California | LOCATION |

NER is a sequence labeling problem. Instead of assigning one label to the whole input, the model assigns a label to each token.

If the input sequence is

x = (x_1, x_2, \ldots, x_T),

then the output is

y = (y_1, y_2, \ldots, y_T).

Each token receives one label.

Entity Spans and Token Labels

Entity recognition must solve two problems at the same time. First, it must find the boundary of each entity. Second, it must classify the entity type.

Consider:

Barack Obama was born in Honolulu.

The entity spans are:

| Span | Type |
| --- | --- |
| Barack Obama | PERSON |
| Honolulu | LOCATION |

Since neural models usually operate token by token, spans are encoded as token labels. A common convention is the BIO scheme:

| Prefix | Meaning |
| --- | --- |
| B | Beginning of an entity |
| I | Inside an entity |
| O | Outside any entity |

The sentence above becomes:

| Token | Label |
| --- | --- |
| Barack | B-PERSON |
| Obama | I-PERSON |
| was | O |
| born | O |
| in | O |
| Honolulu | B-LOCATION |
| . | O |

BIO encoding allows the model to represent multi-token entities.
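Going from annotated spans to token labels is mechanical. A minimal sketch (using exclusive span ends, a convention assumed here):

```python
def spans_to_bio(num_tokens, spans):
    """Convert (start, end, type) spans (end exclusive) to BIO labels."""
    labels = ["O"] * num_tokens
    for start, end, ent_type in spans:
        labels[start] = f"B-{ent_type}"          # first token gets B-
        for i in range(start + 1, end):
            labels[i] = f"I-{ent_type}"          # remaining tokens get I-
    return labels

tokens = ["Barack", "Obama", "was", "born", "in", "Honolulu", "."]
labels = spans_to_bio(len(tokens), [(0, 2, "PERSON"), (5, 6, "LOCATION")])
# ["B-PERSON", "I-PERSON", "O", "O", "O", "B-LOCATION", "O"]
```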

A stricter alternative is BIOES:

| Prefix | Meaning |
| --- | --- |
| B | Beginning |
| I | Inside |
| O | Outside |
| E | End |
| S | Single-token entity |

BIO is simpler and more widely used. BIOES encodes more boundary information, which can improve span-level accuracy in some systems.
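The two schemes carry the same spans, so converting between them only requires looking one label ahead. A minimal BIO-to-BIOES converter, assuming well-formed BIO input:

```python
def bio_to_bioes(labels):
    """Convert BIO labels to BIOES by checking whether the entity continues."""
    out = []
    for i, label in enumerate(labels):
        if label == "O":
            out.append(label)
            continue
        prefix, typ = label.split("-", 1)
        nxt = labels[i + 1] if i + 1 < len(labels) else "O"
        extended = nxt == f"I-{typ}"             # does the entity continue?
        if prefix == "B":
            out.append(f"B-{typ}" if extended else f"S-{typ}")
        else:  # prefix == "I"
            out.append(f"I-{typ}" if extended else f"E-{typ}")
    return out

bio_to_bioes(["B-PERSON", "I-PERSON", "O", "B-LOCATION"])
# ["B-PERSON", "E-PERSON", "O", "S-LOCATION"]
```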

Tokenization Alignment

NER becomes more complicated when using subword tokenization. Transformer tokenizers may split one word into multiple pieces.

For example:

unbelievable

may become:

["un", "##bel", "##ievable"]

If the original word has one entity label, we must decide how to assign labels to its subword pieces.

Common strategies include:

| Strategy | Description |
| --- | --- |
| First-piece labeling | Compute loss only on the first subword |
| All-piece labeling | Copy the word label to every subword |
| Masked subword loss | Ignore non-first subwords during loss computation |

First-piece labeling is common with transformer models. The model predicts labels for all subwords, but the loss ignores continuation pieces.

In PyTorch, ignored labels are often set to -100, because nn.CrossEntropyLoss ignores targets with that value by default.

Example label alignment:

| Word | Label | Subwords | Training labels |
| --- | --- | --- | --- |
| John | B-PERSON | John | B-PERSON |
| Washington | B-LOCATION | Washing, ##ton | B-LOCATION, -100 |
| arrived | O | arrived | O |

This prevents one word from contributing multiple loss terms merely because it was split into several subwords.
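The alignment itself can be sketched in a few lines. This version assumes the subwords for each word are already available as a list per word (real tokenizers expose this through word-to-token mappings):

```python
def align_labels(word_labels, word_subwords, ignore_index=-100):
    """First-piece labeling: keep the word label on the first subword,
    mark continuation pieces with ignore_index so they skip the loss."""
    aligned = []
    for label, pieces in zip(word_labels, word_subwords):
        aligned.append(label)                          # first piece keeps the label
        aligned.extend([ignore_index] * (len(pieces) - 1))  # rest are ignored
    return aligned

word_labels = ["B-PERSON", "B-LOCATION", "O"]
word_subwords = [["John"], ["Washing", "##ton"], ["arrived"]]
align_labels(word_labels, word_subwords)
# ["B-PERSON", "B-LOCATION", -100, "O"]
```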

Model Output Shape

A token classifier produces one logit vector per token.

For a batch of token IDs,

X \in \mathbb{Z}^{B \times T},

the encoder produces hidden states

H \in \mathbb{R}^{B \times T \times D}.

The token classification head maps each hidden state into entity-label logits:

Z \in \mathbb{R}^{B \times T \times K},

where K is the number of token labels.

For example, if the label set is

O
B-PERSON
I-PERSON
B-ORG
I-ORG
B-LOCATION
I-LOCATION

then K = 7.

The prediction at token t is

\hat{y}_t = \arg\max_k Z_{t,k}.

In PyTorch, logits usually have shape:

[B, T, K]

but nn.CrossEntropyLoss expects class logits on the second dimension. So we often flatten both logits and labels:

loss = loss_fn(
    logits.reshape(-1, num_labels),
    labels.reshape(-1),
)

Here labels has shape [B, T].
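The flattening can be exercised end to end on random tensors as a quick shape check (a sketch; the values of B, T, and K are arbitrary here):

```python
import torch
import torch.nn as nn

B, T, K = 2, 5, 7                      # batch size, tokens, label count
logits = torch.randn(B, T, K)          # [B, T, K], as a model would emit
labels = torch.randint(0, K, (B, T))   # [B, T] gold label IDs
labels[:, -1] = -100                   # pretend the last position is padding

loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fn(logits.reshape(-1, K), labels.reshape(-1))  # scalar loss

preds = logits.argmax(dim=-1)          # [B, T] predicted label IDs
```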

A Transformer Token Classifier

A modern NER model is usually a pretrained transformer encoder with a token-level classification head.

import torch
import torch.nn as nn

class TransformerNER(nn.Module):
    def __init__(self, encoder, hidden_dim: int, num_labels: int):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )

        hidden = outputs.last_hidden_state        # [B, T, D]
        hidden = self.dropout(hidden)

        logits = self.classifier(hidden)          # [B, T, K]

        if labels is None:
            return logits

        loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
        loss = loss_fn(
            logits.reshape(-1, logits.size(-1)),  # [B*T, K]
            labels.reshape(-1),                   # [B*T]
        )

        return loss, logits

This architecture is simple. The transformer handles contextual representation learning. The linear layer assigns a label to each token representation.

Why Context Matters

Entity type often depends on context.

The word Amazon may refer to a company, a river, or a region:

Amazon reported higher revenue.
The Amazon flows through Brazil.
Species in the Amazon are under pressure.

A context-free model cannot reliably distinguish these meanings. A transformer encoder uses surrounding words to produce different contextual embeddings for the same token.

This is why pretrained language models are effective for NER. They encode syntactic and semantic context before the classification head makes token-level predictions.

Sequence Constraints

A plain token classifier predicts each token label independently after contextual encoding. This can produce invalid BIO sequences.

For example:

O I-PERSON O

The label I-PERSON should usually follow B-PERSON or another I-PERSON, not O.

A conditional random field, or CRF, can enforce sequence-level consistency. Instead of choosing each token label independently, a CRF scores the whole label sequence.

The score of a label sequence can be written as

s(x, y) = \sum_{t=1}^{T} A_{y_{t-1}, y_t} + \sum_{t=1}^{T} Z_{t, y_t},

where A contains transition scores between labels and Z_{t, y_t} is the model score for assigning label y_t to token t.

The predicted sequence is

\hat{y} = \arg\max_y s(x, y).

This decoding is usually done with the Viterbi algorithm.

CRFs were common in earlier neural NER systems. With large transformers, the gain from CRFs is often smaller, but CRFs still help when exact span boundaries and valid label transitions matter.
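Viterbi decoding itself is short. A pure-Python sketch that takes per-token emission scores Z (T x K) and a transition matrix A (K x K) as nested lists, with labels represented as integer indices:

```python
def viterbi(emissions, transitions):
    """Find the label sequence maximizing the sum of emission scores
    plus transition scores between consecutive labels."""
    T, K = len(emissions), len(emissions[0])
    score = list(emissions[0])        # best score ending in each label so far
    back = []                         # backpointers, one list per step
    for t in range(1, T):
        new_score, ptr = [], []
        for k in range(K):
            # Best previous label to transition into label k.
            best = max(range(K), key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best] + transitions[best][k] + emissions[t][k])
            ptr.append(best)
        score, back = new_score, back + [ptr]
    # Trace back from the best final label.
    path = [max(range(K), key=lambda k: score[k])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With a transition matrix that heavily penalizes an invalid transition, the decoder routes around it even when the per-token scores prefer the invalid label.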

Training Objective

Without a CRF, token classification uses cross-entropy at each valid token position.

For one sequence, the loss is

\mathcal{L} = -\sum_{t \in M} \log p_\theta(y_t \mid x),

where M is the set of token positions included in the loss. Padding tokens and ignored subword tokens are excluded.

For a batch, the loss is averaged over valid positions.

In PyTorch:

loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

loss = loss_fn(
    logits.reshape(-1, num_labels),
    labels.reshape(-1),
)

The ignore_index=-100 setting ensures that padding and ignored subwords do not affect the gradient.

Evaluation

NER should be evaluated at the entity-span level, not merely at the token level.

A token-level metric may give partial credit for almost-correct entities. Span-level evaluation requires the predicted entity boundaries and type to match the gold annotation.

Example:

| Gold entity | Predicted entity | Correct? |
| --- | --- | --- |
| New York City, LOCATION | New York, LOCATION | No |
| John Smith, PERSON | John Smith, PERSON | Yes |
| OpenAI, ORG | OpenAI, PRODUCT | No |

The standard metrics are precision, recall, and F1.

Precision measures how many predicted entities are correct:

\text{precision} = \frac{\text{correct predicted entities}}{\text{predicted entities}}.

Recall measures how many gold entities are found:

\text{recall} = \frac{\text{correct predicted entities}}{\text{gold entities}}.

F1 combines them:

F_1 = \frac{2PR}{P + R}.

NER systems should report both overall F1 and per-entity-type performance. A model may perform well on common entities such as PERSON and LOCATION, while failing on rare types such as LAW, CHEMICAL, or DISEASE.
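Span-level scoring reduces to set intersection when entities are represented as (start, end, type) tuples. A minimal sketch:

```python
def span_f1(gold, predicted):
    """Span-level precision, recall, and F1.
    Entities are (start, end, type) tuples; only exact matches count."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [(0, 3, "LOCATION"), (5, 7, "PERSON")]
pred = [(0, 2, "LOCATION"), (5, 7, "PERSON")]  # boundary error on the first
span_f1(gold, pred)
# (0.5, 0.5, 0.5)
```

Note that the boundary error on the first entity gets no partial credit, which is exactly the behavior span-level evaluation is meant to enforce.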

Common Error Types

NER errors usually fall into several categories.

| Error type | Example |
| --- | --- |
| Boundary error | Predicts New York instead of New York City |
| Type error | Predicts Amazon as LOCATION instead of ORG |
| Missed entity | Fails to detect an entity span |
| Spurious entity | Marks ordinary text as an entity |
| Nested entity error | Fails when one entity contains another |
| Abbreviation error | Misses short forms such as UN or FDA |
| Domain shift | Performs poorly on medical, legal, or scientific text |

Boundary errors are especially common. Many entity spans include titles, suffixes, punctuation, or multiword names.

Nested and Overlapping Entities

Basic BIO tagging assumes that each token belongs to at most one entity. This works for many datasets, but some domains contain nested entities.

Example:

University of California, Berkeley

This may be annotated as one organization, while California may also be a location.

Flat BIO tagging cannot represent both spans at the same time. Common alternatives include span classification, layered tagging, hypergraph methods, and sequence-to-sequence extraction.

In span classification, the model considers candidate spans and classifies each one:

(i, j) \rightarrow \text{entity type}.

This approach can represent nested and overlapping entities, but it is more expensive because the number of candidate spans grows with sequence length.
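The cost is easy to see by enumerating candidates. A sketch (the function name and the max-width cap are illustrative; capping span width is a common way to keep the candidate set manageable):

```python
def candidate_spans(num_tokens, max_width):
    """Enumerate all (start, end) candidate spans up to max_width tokens,
    with exclusive ends. A span classifier scores each candidate."""
    return [
        (start, start + width)
        for width in range(1, max_width + 1)
        for start in range(num_tokens - width + 1)
    ]

len(candidate_spans(100, 10))
# 955 candidates for a 100-token input with spans up to 10 tokens
```

Without the width cap, the number of candidates grows quadratically with sequence length.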

Domain-Specific NER

General NER models often recognize people, organizations, and locations. Many real applications need specialized entity types.

Examples:

| Domain | Entity types |
| --- | --- |
| Medical | disease, symptom, drug, dosage, gene |
| Legal | statute, case name, court, party, date |
| Finance | company, ticker, currency, amount, instrument |
| Scientific | material, method, dataset, metric, organism |
| Software | library, function, file path, error code |

Domain-specific NER usually needs domain-specific annotation. A general model may know ordinary names, but fail on specialized terminology.

Useful adaptation methods include fine-tuning a domain language model, adding domain data, using weak supervision, and combining rules with neural models.

Rule-Based and Hybrid NER

Not every entity recognizer must be purely neural. Some entities are better handled with rules.

Examples include:

| Entity type | Useful method |
| --- | --- |
| Email address | Regular expression |
| URL | Regular expression |
| Phone number | Pattern matching |
| Date | Rule-based parser |
| Currency amount | Rule plus numeric parser |
| Product code | Domain-specific pattern |

Hybrid systems often work best. A neural model handles ambiguous natural language entities. Rules handle precise patterns that have stable syntax.

A practical NER pipeline may combine:

  1. Regex extractors
  2. Dictionary matchers
  3. Neural token classifier
  4. Span merger
  5. Conflict resolution
  6. Entity normalization

Entity normalization maps a detected span to a canonical ID. For example, IBM, International Business Machines, and IBM Corp. may all map to the same company identifier.
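A simple normalizer is an alias dictionary keyed on a lowercased surface form. The aliases and the canonical ID below are toy placeholders; production systems use large knowledge-base dictionaries plus fuzzy matching:

```python
# Toy alias table mapping surface forms to a placeholder canonical ID.
CANONICAL = {
    "ibm": "IBM",
    "international business machines": "IBM",
    "ibm corp.": "IBM",
}

def normalize(span_text):
    """Map a detected span to a canonical entity ID, or None if unknown."""
    return CANONICAL.get(span_text.strip().lower())

normalize("International Business Machines")
# "IBM"
```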

Practical Decoding

After a model predicts token labels, the labels must be converted back into spans.

Example labels:

John        B-PERSON
Smith       I-PERSON
works       O
at          O
OpenAI      B-ORG
.           O

Decoded spans:

| Start | End | Text | Type |
| --- | --- | --- | --- |
| 0 | 2 | John Smith | PERSON |
| 4 | 5 | OpenAI | ORG |

Here Start and End are token indices, with End exclusive.

A simple BIO decoder scans from left to right. When it sees B-X, it starts a new entity of type X. Consecutive I-X labels extend the entity. A new B-Y closes the current entity and starts another.

Invalid transitions require a policy. For example, I-ORG after O may be treated as B-ORG, or it may be discarded. The policy should be consistent between training, validation, and inference.
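One such policy can be applied as a pre-pass before decoding. A minimal sketch of the "treat a stray I-X as B-X" option:

```python
def repair_bio(labels):
    """Rewrite any I-X that does not continue an entity of type X as B-X,
    so downstream decoding sees only valid BIO transitions."""
    repaired = []
    prev_type = None                      # type of the entity still open
    for label in labels:
        if label.startswith("I-"):
            typ = label[2:]
            if prev_type != typ:
                label = "B-" + typ        # invalid continuation: start fresh
        prev_type = label[2:] if label != "O" else None
        repaired.append(label)
    return repaired

repair_bio(["O", "I-PERSON", "I-PERSON", "O"])
# ["O", "B-PERSON", "I-PERSON", "O"]
```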

Minimal BIO Decoder

def decode_bio(tokens, labels):
    """Decode BIO labels into (start, end, type) spans with exclusive ends."""
    entities = []
    start = None      # start index of the entity currently being built
    ent_type = None   # its type

    for i, label in enumerate(labels):
        if label == "O":
            # O closes any open entity.
            if start is not None:
                entities.append((start, i, ent_type))
                start = None
                ent_type = None
            continue

        prefix, typ = label.split("-", 1)

        if prefix == "B":
            # B always starts a new entity, closing the previous one.
            if start is not None:
                entities.append((start, i, ent_type))
            start = i
            ent_type = typ

        elif prefix == "I":
            if start is None:
                # Policy: treat a stray I-X as the start of an entity.
                start = i
                ent_type = typ
            elif typ != ent_type:
                # Policy: a type change closes the entity and starts another.
                entities.append((start, i, ent_type))
                start = i
                ent_type = typ

    # Close an entity that runs to the end of the sequence.
    if start is not None:
        entities.append((start, len(labels), ent_type))

    return [
        {
            "text": " ".join(tokens[start:end]),
            "start": start,
            "end": end,
            "type": typ,
        }
        for start, end, typ in entities
    ]

This decoder is intentionally simple. Production systems often need character offsets, subword merging, punctuation handling, and normalization.

Character Offsets

Applications often need entity locations in the original text, not just token indices.

For example:

{
  "text": "John Smith works at OpenAI.",
  "entities": [
    {"start": 0, "end": 10, "type": "PERSON"},
    {"start": 20, "end": 26, "type": "ORG"}
  ]
}

Character offsets make it possible to highlight entities in a document, link entities to databases, redact sensitive information, or pass spans to downstream systems.

When using subword tokenizers, keep offset mappings from the tokenizer. These mappings record which character range corresponds to each token.
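Given such a mapping, converting token-index spans into character offsets is a lookup at each span edge. A sketch, assuming an offset mapping in the style of per-token (char_start, char_end) pairs and token spans with exclusive ends:

```python
def spans_to_char_offsets(entities, offset_mapping):
    """Convert token-index spans (end exclusive) into character offsets
    using the tokenizer's per-token (char_start, char_end) pairs."""
    return [
        {
            "start": offset_mapping[start][0],   # first char of first token
            "end": offset_mapping[end - 1][1],   # last char of last token
            "type": ent_type,
        }
        for start, end, ent_type in entities
    ]

# Per-token character offsets for "John Smith works at OpenAI."
offsets = [(0, 4), (5, 10), (11, 16), (17, 19), (20, 26), (26, 27)]
spans_to_char_offsets([(0, 2, "PERSON"), (4, 5, "ORG")], offsets)
# [{"start": 0, "end": 10, "type": "PERSON"},
#  {"start": 20, "end": 26, "type": "ORG"}]
```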

Summary

Named entity recognition identifies entity spans and assigns entity types. In deep learning, NER is usually formulated as token classification with BIO or BIOES labels.

A transformer-based NER model maps token IDs to contextual hidden states, then applies a linear classifier at each token position. Training uses cross-entropy over valid token positions, while padding and ignored subword pieces are excluded from the loss.

NER quality should be measured at the span level. Correct predictions require both the boundary and entity type to match. Practical systems must handle tokenization alignment, invalid BIO transitions, domain-specific entities, nested entities, character offsets, and entity normalization.