Named Entity Recognition

Named entity recognition, or NER, is the task of finding spans of text that refer to entities and assigning each span a type. Common entity types include people, organizations, locations, dates, products, medical terms, legal references, gene names, and monetary amounts.

For example:

Apple hired John Smith in California.

A named entity recognizer may produce:

| Span | Entity type |
| --- | --- |
| Apple | ORG |
| John Smith | PERSON |
| California | LOCATION |

NER is a sequence labeling problem. Instead of assigning one label to the whole input, the model assigns a label to each token.

If the input sequence is

x = (x_1, x_2, \ldots, x_T),

then the output is

y = (y_1, y_2, \ldots, y_T).

Each token receives one label.

Entity Spans and Token Labels

Entity recognition must solve two problems at the same time. First, it must find the boundary of each entity. Second, it must classify the entity type.

Consider:

Barack Obama was born in Honolulu.

The entity spans are:

| Span | Type |
| --- | --- |
| Barack Obama | PERSON |
| Honolulu | LOCATION |

Since neural models usually operate token by token, spans are encoded as token labels. A common convention is the BIO scheme:

| Prefix | Meaning |
| --- | --- |
| B | Beginning of an entity |
| I | Inside an entity |
| O | Outside any entity |

The sentence above becomes:

| Token | Label |
| --- | --- |
| Barack | B-PERSON |
| Obama | I-PERSON |
| was | O |
| born | O |
| in | O |
| Honolulu | B-LOCATION |
| . | O |

BIO encoding allows the model to represent multi-token entities.
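Going from annotated spans to token labels is mechanical. A minimal sketch (using exclusive span ends, a convention assumed here):

```python
def spans_to_bio(num_tokens, spans):
    """Convert (start, end, type) spans (end exclusive) to BIO labels."""
    labels = ["O"] * num_tokens
    for start, end, ent_type in spans:
        labels[start] = f"B-{ent_type}"          # first token gets B-
        for i in range(start + 1, end):
            labels[i] = f"I-{ent_type}"          # remaining tokens get I-
    return labels

tokens = ["Barack", "Obama", "was", "born", "in", "Honolulu", "."]
labels = spans_to_bio(len(tokens), [(0, 2, "PERSON"), (5, 6, "LOCATION")])
# ["B-PERSON", "I-PERSON", "O", "O", "O", "B-LOCATION", "O"]
```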

A stricter alternative is BIOES:

| Prefix | Meaning |
| --- | --- |
| B | Beginning |
| I | Inside |
| O | Outside |
| E | End |
| S | Single-token entity |

BIO is simpler and more widely used. BIOES encodes more boundary information, which can improve span-level accuracy in some systems.
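The two schemes carry the same spans, so converting between them only requires looking one label ahead. A minimal BIO-to-BIOES converter, assuming well-formed BIO input:

```python
def bio_to_bioes(labels):
    """Convert BIO labels to BIOES by checking whether the entity continues."""
    out = []
    for i, label in enumerate(labels):
        if label == "O":
            out.append(label)
            continue
        prefix, typ = label.split("-", 1)
        nxt = labels[i + 1] if i + 1 < len(labels) else "O"
        extended = nxt == f"I-{typ}"             # does the entity continue?
        if prefix == "B":
            out.append(f"B-{typ}" if extended else f"S-{typ}")
        else:  # prefix == "I"
            out.append(f"I-{typ}" if extended else f"E-{typ}")
    return out

bio_to_bioes(["B-PERSON", "I-PERSON", "O", "B-LOCATION"])
# ["B-PERSON", "E-PERSON", "O", "S-LOCATION"]
```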

Tokenization Alignment

NER becomes more complicated when using subword tokenization. Transformer tokenizers may split one word into multiple pieces.

For example:

unbelievable

may become:

["un", "##bel", "##ievable"]

If the original word has one entity label, we must decide how to assign labels to its subword pieces.

Common strategies include:

| Strategy | Description |
| --- | --- |
| First-piece labeling | Compute loss only on the first subword |
| All-piece labeling | Copy the word label to every subword |
| Masked subword loss | Ignore non-first subwords during loss computation |

First-piece labeling is common with transformer models. The model predicts labels for all subwords, but the loss ignores continuation pieces.

In PyTorch, ignored labels are often set to -100, because nn.CrossEntropyLoss ignores targets with that value by default.

Example label alignment:

| Word | Label | Subwords | Training labels |
| --- | --- | --- | --- |
| John | B-PERSON | John | B-PERSON |
| Washington | B-LOCATION | Washing, ##ton | B-LOCATION, -100 |
| arrived | O | arrived | O |

This prevents one word from contributing multiple loss terms merely because it was split into several subwords.
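The alignment itself can be sketched in a few lines. This version assumes the subwords for each word are already available as a list per word (real tokenizers expose this through word-to-token mappings):

```python
def align_labels(word_labels, word_subwords, ignore_index=-100):
    """First-piece labeling: keep the word label on the first subword,
    mark continuation pieces with ignore_index so they skip the loss."""
    aligned = []
    for label, pieces in zip(word_labels, word_subwords):
        aligned.append(label)                          # first piece keeps the label
        aligned.extend([ignore_index] * (len(pieces) - 1))  # rest are ignored
    return aligned

word_labels = ["B-PERSON", "B-LOCATION", "O"]
word_subwords = [["John"], ["Washing", "##ton"], ["arrived"]]
align_labels(word_labels, word_subwords)
# ["B-PERSON", "B-LOCATION", -100, "O"]
```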

Model Output Shape

A token classifier produces one logit vector per token.

For a batch of token IDs,

X \in \mathbb{Z}^{B \times T},

the encoder produces hidden states

H \in \mathbb{R}^{B \times T \times D}.

The token classification head maps each hidden state into entity-label logits:

Z \in \mathbb{R}^{B \times T \times K},

where K is the number of token labels.

For example, if the label set is

O
B-PERSON
I-PERSON
B-ORG
I-ORG
B-LOCATION
I-LOCATION

then K = 7.

The prediction at token t is

\hat{y}_t = \arg\max_k Z_{t,k}.

In PyTorch, logits usually have shape:

[B, T, K]

but nn.CrossEntropyLoss expects class logits on the second dimension. So we often flatten both logits and labels:

loss = loss_fn(
    logits.reshape(-1, num_labels),
    labels.reshape(-1),
)

Here labels has shape [B, T].
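The flattening can be exercised end to end on random tensors as a quick shape check (a sketch; the values of B, T, and K are arbitrary here):

```python
import torch
import torch.nn as nn

B, T, K = 2, 5, 7                      # batch size, tokens, label count
logits = torch.randn(B, T, K)          # [B, T, K], as a model would emit
labels = torch.randint(0, K, (B, T))   # [B, T] gold label IDs
labels[:, -1] = -100                   # pretend the last position is padding

loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fn(logits.reshape(-1, K), labels.reshape(-1))  # scalar loss

preds = logits.argmax(dim=-1)          # [B, T] predicted label IDs
```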

A Transformer Token Classifier

A modern NER model is usually a pretrained transformer encoder with a token-level classification head.

import torch
import torch.nn as nn

class TransformerNER(nn.Module):
    def __init__(self, encoder, hidden_dim: int, num_labels: int):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )

        hidden = outputs.last_hidden_state        # [B, T, D]
        hidden = self.dropout(hidden)

        logits = self.classifier(hidden)          # [B, T, K]

        if labels is None:
            return logits

        loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
        loss = loss_fn(
            logits.reshape(-1, logits.size(-1)),  # [B*T, K]
            labels.reshape(-1),                   # [B*T]
        )

        return loss, logits

This architecture is simple. The transformer handles contextual representation learning. The linear layer assigns a label to each token representation.

Why Context Matters

Entity type often depends on context.

The word Amazon may refer to a company, a river, or a region:

Amazon reported higher revenue.
The Amazon flows through Brazil.
Species in the Amazon are under pressure.

A context-free model cannot reliably distinguish these meanings. A transformer encoder uses surrounding words to produce different contextual embeddings for the same token.

This is why pretrained language models are effective for NER. They encode syntactic and semantic context before the classification head makes token-level predictions.

Sequence Constraints

A plain token classifier predicts each token label independently after contextual encoding. This can produce invalid BIO sequences.

For example:

O I-PERSON O

The label I-PERSON should usually follow B-PERSON or another I-PERSON, not O.

A conditional random field, or CRF, can enforce sequence-level consistency. Instead of choosing each token label independently, a CRF scores the whole label sequence.

The score of a label sequence can be written as

s(x, y) = \sum_{t=1}^{T} A_{y_{t-1}, y_t} + \sum_{t=1}^{T} Z_{t, y_t},

where A contains transition scores between labels and Z_{t, y_t} is the model score for assigning label y_t to token t.

The predicted sequence is

\hat{y} = \arg\max_y s(x, y).

This decoding is usually done with the Viterbi algorithm.

CRFs were common in earlier neural NER systems. With large transformers, the gain from CRFs is often smaller, but CRFs still help when exact span boundaries and valid label transitions matter.
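Viterbi decoding itself is short. A pure-Python sketch that takes per-token emission scores Z (T x K) and a transition matrix A (K x K) as nested lists, with labels represented as integer indices:

```python
def viterbi(emissions, transitions):
    """Find the label sequence maximizing the sum of emission scores
    plus transition scores between consecutive labels."""
    T, K = len(emissions), len(emissions[0])
    score = list(emissions[0])        # best score ending in each label so far
    back = []                         # backpointers, one list per step
    for t in range(1, T):
        new_score, ptr = [], []
        for k in range(K):
            # Best previous label to transition into label k.
            best = max(range(K), key=lambda j: score[j] + transitions[j][k])
            new_score.append(score[best] + transitions[best][k] + emissions[t][k])
            ptr.append(best)
        score, back = new_score, back + [ptr]
    # Trace back from the best final label.
    path = [max(range(K), key=lambda k: score[k])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

With a transition matrix that heavily penalizes an invalid transition, the decoder routes around it even when the per-token scores prefer the invalid label.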

Training Objective

Without a CRF, token classification uses cross-entropy at each valid token position.

For one sequence, the loss is

\mathcal{L} = -\sum_{t \in M} \log p_\theta(y_t \mid x),

where M is the set of token positions included in the loss. Padding tokens and ignored subword tokens are excluded.

For a batch, the loss is averaged over valid positions.

In PyTorch:

loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

loss = loss_fn(
    logits.reshape(-1, num_labels),
    labels.reshape(-1),
)

The ignore_index=-100 setting ensures that padding and ignored subwords do not affect the gradient.

Evaluation

NER should be evaluated at the entity-span level, not merely at the token level.

A token-level metric may give partial credit for almost-correct entities. Span-level evaluation requires the predicted entity boundaries and type to match the gold annotation.

Example:

| Gold entity | Predicted entity | Correct? |
| --- | --- | --- |
| New York City, LOCATION | New York, LOCATION | No |
| John Smith, PERSON | John Smith, PERSON | Yes |
| OpenAI, ORG | OpenAI, PRODUCT | No |

The standard metrics are precision, recall, and F1.

Precision measures how many predicted entities are correct:

\text{precision} = \frac{\text{correct predicted entities}}{\text{predicted entities}}.

Recall measures how many gold entities are found:

\text{recall} = \frac{\text{correct predicted entities}}{\text{gold entities}}.

F1 combines them:

F_1 = \frac{2PR}{P + R}.

NER systems should report both overall F1 and per-entity-type performance. A model may perform well on common entities such as PERSON and LOCATION, while failing on rare types such as LAW, CHEMICAL, or DISEASE.
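Span-level scoring reduces to set intersection when entities are represented as (start, end, type) tuples. A minimal sketch:

```python
def span_f1(gold, predicted):
    """Span-level precision, recall, and F1.
    Entities are (start, end, type) tuples; only exact matches count."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [(0, 3, "LOCATION"), (5, 7, "PERSON")]
pred = [(0, 2, "LOCATION"), (5, 7, "PERSON")]  # boundary error on the first
span_f1(gold, pred)
# (0.5, 0.5, 0.5)
```

Note that the boundary error on the first entity gets no partial credit, which is exactly the behavior span-level evaluation is meant to enforce.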

Common Error Types

NER errors usually fall into several categories.

| Error type | Example |
| --- | --- |
| Boundary error | Predicts New York instead of New York City |
| Type error | Predicts Amazon as LOCATION instead of ORG |
| Missed entity | Fails to detect an entity span |
| Spurious entity | Marks ordinary text as an entity |
| Nested entity error | Fails when one entity contains another |
| Abbreviation error | Misses short forms such as UN or FDA |
| Domain shift | Performs poorly on medical, legal, or scientific text |

Boundary errors are especially common. Many entity spans include titles, suffixes, punctuation, or multiword names.

Nested and Overlapping Entities

Basic BIO tagging assumes that each token belongs to at most one entity. This works for many datasets, but some domains contain nested entities.

Example:

University of California, Berkeley

This may be annotated as one organization, while California may also be a location.

Flat BIO tagging cannot represent both spans at the same time. Common alternatives include span classification, layered tagging, hypergraph methods, and sequence-to-sequence extraction.

In span classification, the model considers candidate spans and classifies each one:

(i, j) \rightarrow \text{entity type}.

This approach can represent nested and overlapping entities, but it is more expensive because the number of candidate spans grows with sequence length.
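The cost is easy to see by enumerating candidates. A sketch (the function name and the max-width cap are illustrative; capping span width is a common way to keep the candidate set manageable):

```python
def candidate_spans(num_tokens, max_width):
    """Enumerate all (start, end) candidate spans up to max_width tokens,
    with exclusive ends. A span classifier scores each candidate."""
    return [
        (start, start + width)
        for width in range(1, max_width + 1)
        for start in range(num_tokens - width + 1)
    ]

len(candidate_spans(100, 10))
# 955 candidates for a 100-token input with spans up to 10 tokens
```

Without the width cap, the number of candidates grows quadratically with sequence length.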

Domain-Specific NER

General NER models often recognize people, organizations, and locations. Many real applications need specialized entity types.

Examples:

| Domain | Entity types |
| --- | --- |
| Medical | disease, symptom, drug, dosage, gene |
| Legal | statute, case name, court, party, date |
| Finance | company, ticker, currency, amount, instrument |
| Scientific | material, method, dataset, metric, organism |
| Software | library, function, file path, error code |

Domain-specific NER usually needs domain-specific annotation. A general model may know ordinary names, but fail on specialized terminology.

Useful adaptation methods include fine-tuning a domain language model, adding domain data, using weak supervision, and combining rules with neural models.

Rule-Based and Hybrid NER

Not every entity recognizer must be purely neural. Some entities are better handled with rules.

Examples include:

| Entity type | Useful method |
| --- | --- |
| Email address | Regular expression |
| URL | Regular expression |
| Phone number | Pattern matching |
| Date | Rule-based parser |
| Currency amount | Rule plus numeric parser |
| Product code | Domain-specific pattern |

Hybrid systems often work best. A neural model handles ambiguous natural language entities. Rules handle precise patterns that have stable syntax.

A practical NER pipeline may combine:

  1. Regex extractors
  2. Dictionary matchers
  3. Neural token classifier
  4. Span merger
  5. Conflict resolution
  6. Entity normalization

Entity normalization maps a detected span to a canonical ID. For example, IBM, International Business Machines, and IBM Corp. may all map to the same company identifier.
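A simple normalizer is an alias dictionary keyed on a lowercased surface form. The aliases and the canonical ID below are toy placeholders; production systems use large knowledge-base dictionaries plus fuzzy matching:

```python
# Toy alias table mapping surface forms to a placeholder canonical ID.
CANONICAL = {
    "ibm": "IBM",
    "international business machines": "IBM",
    "ibm corp.": "IBM",
}

def normalize(span_text):
    """Map a detected span to a canonical entity ID, or None if unknown."""
    return CANONICAL.get(span_text.strip().lower())

normalize("International Business Machines")
# "IBM"
```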

Practical Decoding

After a model predicts token labels, the labels must be converted back into spans.

Example labels:

John        B-PERSON
Smith       I-PERSON
works       O
at          O
OpenAI      B-ORG
.           O

Decoded spans:

| Start | End | Text | Type |
| --- | --- | --- | --- |
| 0 | 2 | John Smith | PERSON |
| 4 | 5 | OpenAI | ORG |

Here Start and End are token indices, with End exclusive.

A simple BIO decoder scans from left to right. When it sees B-X, it starts a new entity of type X. Consecutive I-X labels extend the entity. A new B-Y closes the current entity and starts another.

Invalid transitions require a policy. For example, I-ORG after O may be treated as B-ORG, or it may be discarded. The policy should be consistent between training, validation, and inference.
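One such policy can be applied as a pre-pass before decoding. A minimal sketch of the "treat a stray I-X as B-X" option:

```python
def repair_bio(labels):
    """Rewrite any I-X that does not continue an entity of type X as B-X,
    so downstream decoding sees only valid BIO transitions."""
    repaired = []
    prev_type = None                      # type of the entity still open
    for label in labels:
        if label.startswith("I-"):
            typ = label[2:]
            if prev_type != typ:
                label = "B-" + typ        # invalid continuation: start fresh
        prev_type = label[2:] if label != "O" else None
        repaired.append(label)
    return repaired

repair_bio(["O", "I-PERSON", "I-PERSON", "O"])
# ["O", "B-PERSON", "I-PERSON", "O"]
```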

Minimal BIO Decoder

def decode_bio(tokens, labels):
    """Decode BIO labels into (start, end, type) spans with exclusive ends."""
    entities = []
    start = None      # start index of the entity currently being built
    ent_type = None   # its type

    for i, label in enumerate(labels):
        if label == "O":
            # O closes any open entity.
            if start is not None:
                entities.append((start, i, ent_type))
                start = None
                ent_type = None
            continue

        prefix, typ = label.split("-", 1)

        if prefix == "B":
            # B always starts a new entity, closing the previous one.
            if start is not None:
                entities.append((start, i, ent_type))
            start = i
            ent_type = typ

        elif prefix == "I":
            if start is None:
                # Policy: treat a stray I-X as the start of an entity.
                start = i
                ent_type = typ
            elif typ != ent_type:
                # Policy: a type change closes the entity and starts another.
                entities.append((start, i, ent_type))
                start = i
                ent_type = typ

    # Close an entity that runs to the end of the sequence.
    if start is not None:
        entities.append((start, len(labels), ent_type))

    return [
        {
            "text": " ".join(tokens[start:end]),
            "start": start,
            "end": end,
            "type": typ,
        }
        for start, end, typ in entities
    ]

This decoder is intentionally simple. Production systems often need character offsets, subword merging, punctuation handling, and normalization.

Character Offsets

Applications often need entity locations in the original text, not just token indices.

For example:

{
  "text": "John Smith works at OpenAI.",
  "entities": [
    {"start": 0, "end": 10, "type": "PERSON"},
    {"start": 20, "end": 26, "type": "ORG"}
  ]
}

Character offsets make it possible to highlight entities in a document, link entities to databases, redact sensitive information, or pass spans to downstream systems.

When using subword tokenizers, keep offset mappings from the tokenizer. These mappings record which character range corresponds to each token.
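Given such a mapping, converting token-index spans into character offsets is a lookup at each span edge. A sketch, assuming an offset mapping in the style of per-token (char_start, char_end) pairs and token spans with exclusive ends:

```python
def spans_to_char_offsets(entities, offset_mapping):
    """Convert token-index spans (end exclusive) into character offsets
    using the tokenizer's per-token (char_start, char_end) pairs."""
    return [
        {
            "start": offset_mapping[start][0],   # first char of first token
            "end": offset_mapping[end - 1][1],   # last char of last token
            "type": ent_type,
        }
        for start, end, ent_type in entities
    ]

# Per-token character offsets for "John Smith works at OpenAI."
offsets = [(0, 4), (5, 10), (11, 16), (17, 19), (20, 26), (26, 27)]
spans_to_char_offsets([(0, 2, "PERSON"), (4, 5, "ORG")], offsets)
# [{"start": 0, "end": 10, "type": "PERSON"},
#  {"start": 20, "end": 26, "type": "ORG"}]
```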

Summary

Named entity recognition identifies entity spans and assigns entity types. In deep learning, NER is usually formulated as token classification with BIO or BIOES labels.

A transformer-based NER model maps token IDs to contextual hidden states, then applies a linear classifier at each token position. Training uses cross-entropy over valid token positions, while padding and ignored subword pieces are excluded from the loss.

NER quality should be measured at the span level. Correct predictions require both the boundary and entity type to match. Practical systems must handle tokenization alignment, invalid BIO transitions, domain-specific entities, nested entities, character offsets, and entity normalization.