Named entity recognition, or NER, is the task of finding spans of text that refer to entities and assigning each span a type. Common entity types include people, organizations, locations, dates, products, medical terms, legal references, gene names, and monetary amounts.
For example:
```
Apple hired John Smith in California.
```

A named entity recognizer may produce:
| Span | Entity type |
|---|---|
| Apple | ORG |
| John Smith | PERSON |
| California | LOCATION |
NER is a sequence labeling problem. Instead of assigning one label to the whole input, the model assigns a label to each token.
If the input sequence is

```
Apple hired John Smith in California .
```

then the output is a label sequence of the same length, for example

```
ORG O PERSON PERSON O LOCATION O
```

Each token receives one label.
Entity Spans and Token Labels
Entity recognition must solve two problems at the same time. First, it must find the boundary of each entity. Second, it must classify the entity type.
Consider:
```
Barack Obama was born in Honolulu.
```

The entity spans are:
| Span | Type |
|---|---|
| Barack Obama | PERSON |
| Honolulu | LOCATION |
Since neural models usually operate token by token, spans are encoded as token labels. A common convention is the BIO scheme:
| Prefix | Meaning |
|---|---|
| B | Beginning of an entity |
| I | Inside an entity |
| O | Outside any entity |
The sentence above becomes:
| Token | Label |
|---|---|
| Barack | B-PERSON |
| Obama | I-PERSON |
| was | O |
| born | O |
| in | O |
| Honolulu | B-LOCATION |
| . | O |
BIO encoding allows the model to represent multi-token entities.
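As a concrete illustration, the following sketch converts word-level entity spans into BIO labels. The (start, end, type) span format with an exclusive end index is an assumption of the example, not a fixed standard:

```python
def spans_to_bio(num_words, spans):
    # spans: list of (start_word, end_word, type) with an exclusive end index.
    labels = ["O"] * num_words
    for start, end, ent_type in spans:
        labels[start] = f"B-{ent_type}"
        for i in range(start + 1, end):
            labels[i] = f"I-{ent_type}"
    return labels

words = ["Barack", "Obama", "was", "born", "in", "Honolulu", "."]
spans = [(0, 2, "PERSON"), (5, 6, "LOCATION")]
print(spans_to_bio(len(words), spans))
# ['B-PERSON', 'I-PERSON', 'O', 'O', 'O', 'B-LOCATION', 'O']
```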
A stricter alternative is BIOES:
| Prefix | Meaning |
|---|---|
| B | Beginning |
| I | Inside |
| O | Outside |
| E | End |
| S | Single-token entity |
BIO is simpler and more common. BIOES gives more boundary information, which can improve span-level accuracy in some systems.
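The two schemes are mechanically related. As a small sketch (assuming the input BIO sequence is already valid), BIO tags can be rewritten into BIOES by looking one label ahead:

```python
def bio_to_bioes(labels):
    # Assumes a valid BIO sequence with no invalid transitions.
    bioes = []
    for i, label in enumerate(labels):
        if label == "O":
            bioes.append(label)
            continue
        prefix, ent_type = label.split("-", 1)
        nxt = labels[i + 1] if i + 1 < len(labels) else "O"
        continues = nxt == f"I-{ent_type}"
        if prefix == "B":
            bioes.append(f"B-{ent_type}" if continues else f"S-{ent_type}")
        else:  # prefix == "I"
            bioes.append(f"I-{ent_type}" if continues else f"E-{ent_type}")
    return bioes

print(bio_to_bioes(["B-PERSON", "I-PERSON", "O", "B-LOCATION", "O"]))
# ['B-PERSON', 'E-PERSON', 'O', 'S-LOCATION', 'O']
```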
Tokenization Alignment
NER becomes more complicated when using subword tokenization. Transformer tokenizers may split one word into multiple pieces.
For example:
unbelievable

may become:

```
["un", "##bel", "##ievable"]
```

If the original word has one entity label, we must decide how to assign labels to its subword pieces.
Common strategies include:
| Strategy | Description |
|---|---|
| First-piece labeling | Compute loss only on the first subword |
| All-piece labeling | Copy the word label to every subword |
| Masked subword loss | Ignore non-first subwords during loss computation |
First-piece labeling is common with transformer models. The model predicts labels for all subwords, but the loss ignores continuation pieces.
In PyTorch, ignored labels are often set to -100, because nn.CrossEntropyLoss ignores targets with that value by default.
Example label alignment:
| Word | Label | Subwords | Training labels |
|---|---|---|---|
| John | B-PERSON | John | B-PERSON |
| Washington | B-LOCATION | Washing, ##ton | B-LOCATION, -100 |
| arrived | O | arrived | O |
This prevents one word from contributing multiple loss terms merely because it was split into several subwords.
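In practice this alignment is often driven by a tokenizer that reports which word each subword came from. The sketch below assumes a Hugging Face "fast" tokenizer (word_ids() is only available there) and a label2id mapping; it is an illustration, not a fixed recipe:

```python
def align_labels(words, word_labels, tokenizer, label2id):
    # words: list of word strings; word_labels: one BIO label per word.
    enc = tokenizer(words, is_split_into_words=True, truncation=True)
    aligned = []
    previous_word = None
    for word_id in enc.word_ids():
        if word_id is None:                # special tokens such as [CLS], [SEP]
            aligned.append(-100)
        elif word_id != previous_word:     # first subword of a word: keep its label
            aligned.append(label2id[word_labels[word_id]])
        else:                              # continuation subword: ignored in the loss
            aligned.append(-100)
        previous_word = word_id
    enc["labels"] = aligned
    return enc
```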
Model Output Shape
A token classifier produces one logit vector per token.
For a batch of token IDs

$$X \in \mathbb{Z}^{B \times T},$$

the encoder produces hidden states

$$H \in \mathbb{R}^{B \times T \times D}.$$

The token classification head maps each hidden state into entity-label logits:

$$Z = H W + b, \qquad Z \in \mathbb{R}^{B \times T \times K},$$

where $K$ is the number of token labels.
For example, if the label set is

```
O
B-PERSON
I-PERSON
B-ORG
I-ORG
B-LOCATION
I-LOCATION
```

then $K = 7$.

The prediction at token $t$ is

$$\hat{y}_t = \arg\max_{k} Z_{t,k}.$$
In PyTorch, logits usually have shape:
```
[B, T, K]
```

but nn.CrossEntropyLoss expects class logits on the second dimension. So we often flatten both logits and labels:

```python
loss = loss_fn(
    logits.reshape(-1, num_labels),
    labels.reshape(-1),
)
```

Here labels has shape [B, T].
A Transformer Token Classifier
A modern NER model is usually a pretrained transformer encoder with a token-level classification head.
```python
import torch
import torch.nn as nn


class TransformerNER(nn.Module):
    def __init__(self, encoder, hidden_dim: int, num_labels: int):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask,
        )
        hidden = outputs.last_hidden_state   # [B, T, D]
        hidden = self.dropout(hidden)
        logits = self.classifier(hidden)     # [B, T, K]
        if labels is None:
            return logits
        loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
        loss = loss_fn(
            logits.reshape(-1, logits.size(-1)),  # [B*T, K]
            labels.reshape(-1),                   # [B*T]
        )
        return loss, logits
```

This architecture is simple. The transformer handles contextual representation learning. The linear layer assigns a label to each token representation.
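As a usage sketch, the encoder can be any module whose output exposes last_hidden_state. The checkpoint name below and the use of transformers.AutoModel are illustrative assumptions, not requirements of the architecture:

```python
from transformers import AutoModel, AutoTokenizer  # assumed available

# Hypothetical choice of checkpoint; any encoder with last_hidden_state works.
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoder = AutoModel.from_pretrained("bert-base-cased")

model = TransformerNER(encoder, hidden_dim=encoder.config.hidden_size, num_labels=7)

batch = tokenizer(["John Smith works at OpenAI."], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])  # [1, T, 7]
```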
Why Context Matters
Entity type often depends on context.
The word Amazon may refer to a company, a river, or a region:
```
Amazon reported higher revenue.
The Amazon flows through Brazil.
Species in the Amazon are under pressure.
```

A context-free model cannot reliably distinguish these meanings. A transformer encoder uses surrounding words to produce different contextual embeddings for the same token.
This is why pretrained language models are effective for NER. They encode syntactic and semantic context before the classification head makes token-level predictions.
Sequence Constraints
A plain token classifier predicts each token label independently after contextual encoding. This can produce invalid BIO sequences.
For example:
```
O I-PERSON O
```

The label I-PERSON should usually follow B-PERSON or another I-PERSON, not O.
A conditional random field, or CRF, can enforce sequence-level consistency. Instead of choosing each token label independently, a CRF scores the whole label sequence.
The score of a label sequence $y = (y_1, \dots, y_T)$ for an input $x$ can be written as

$$s(x, y) = \sum_{t=1}^{T} \left( A_{y_{t-1}, y_t} + P_{t, y_t} \right),$$

where $A$ contains transition scores between labels and $P_{t, y_t}$ is the model score for assigning label $y_t$ to token $t$.

The predicted sequence is

$$\hat{y} = \arg\max_{y} \, s(x, y).$$
This decoding is usually done with the Viterbi algorithm.
CRFs were common in earlier neural NER systems. With large transformers, the gain from CRFs is often smaller, but CRFs still help when exact span boundaries and valid label transitions matter.
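To illustrate the decoding step, here is a minimal Viterbi sketch over emission scores P of shape [T, K] and transition scores A of shape [K, K]. The tensor layout is an assumption, and real CRF layers add start/end transitions and batching:

```python
import torch

def viterbi_decode(emissions, transitions):
    # emissions: [T, K] token-label scores; transitions: [K, K] label-to-label scores.
    T, K = emissions.shape
    score = emissions[0]                     # best score of paths ending in each label
    backpointers = []
    for t in range(1, T):
        # score[i] + transitions[i, j] + emissions[t, j] for every (prev, curr) pair
        total = score.unsqueeze(1) + transitions + emissions[t].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)
    # Follow backpointers from the best final label.
    best_last = int(score.argmax())
    path = [best_last]
    for best_prev in reversed(backpointers):
        best_last = int(best_prev[best_last])
        path.append(best_last)
    return list(reversed(path))
```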
Training Objective
Without a CRF, token classification uses cross-entropy at each valid token position.
For one sequence, the loss is

$$\mathcal{L} = -\sum_{t \in \mathcal{V}} \log p_\theta(y_t \mid x),$$

where $\mathcal{V}$ is the set of token positions included in the loss. Padding tokens and ignored subword tokens are excluded.
For a batch, the loss is averaged over valid positions.
In PyTorch:
```python
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
loss = loss_fn(
    logits.reshape(-1, num_labels),
    labels.reshape(-1),
)
```

The ignore_index=-100 setting ensures that padding and ignored subwords do not affect the gradient.
Evaluation
NER should be evaluated at the entity-span level, not merely at the token level.
A token-level metric may give partial credit for almost-correct entities. Span-level evaluation requires the predicted entity boundaries and type to match the gold annotation.
Example:
| Gold entity | Predicted entity | Correct? |
|---|---|---|
| New York City, LOCATION | New York, LOCATION | No |
| John Smith, PERSON | John Smith, PERSON | Yes |
| OpenAI, ORG | OpenAI, PRODUCT | No |
The standard metrics are precision, recall, and F1.
Precision measures how many predicted entities are correct:

$$\text{Precision} = \frac{\text{correct predicted entities}}{\text{all predicted entities}}.$$

Recall measures how many gold entities are found:

$$\text{Recall} = \frac{\text{correct predicted entities}}{\text{all gold entities}}.$$

F1 combines them:

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$
NER systems should report both overall F1 and per-entity-type performance. A model may perform well on common entities such as PERSON and LOCATION, while failing on rare types such as LAW, CHEMICAL, or DISEASE.
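A minimal span-level scorer can be written by comparing sets of (start, end, type) tuples. This sketch ignores per-type breakdowns and assumes each entity appears at most once per position:

```python
def span_f1(gold_spans, pred_spans):
    # gold_spans, pred_spans: sets of (start, end, type) tuples.
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = {(0, 2, "PERSON"), (4, 5, "ORG")}
pred = {(0, 2, "PERSON"), (4, 5, "PRODUCT")}
print(span_f1(gold, pred))  # (0.5, 0.5, 0.5)
```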
Common Error Types
NER errors usually fall into several categories.
| Error type | Example |
|---|---|
| Boundary error | Predicts New York instead of New York City |
| Type error | Predicts Amazon as LOCATION instead of ORG |
| Missed entity | Fails to detect an entity span |
| Spurious entity | Marks ordinary text as an entity |
| Nested entity error | Fails when one entity contains another |
| Abbreviation error | Misses short forms such as UN or FDA |
| Domain shift | Performs poorly on medical, legal, or scientific text |
Boundary errors are especially common. Many entity spans include titles, suffixes, punctuation, or multiword names.
Nested and Overlapping Entities
Basic BIO tagging assumes that each token belongs to at most one entity. This works for many datasets, but some domains contain nested entities.
Example:
```
University of California, Berkeley
```

This may be annotated as one organization, while California may also be a location.
Flat BIO tagging cannot represent both spans at the same time. Common alternatives include span classification, layered tagging, hypergraph methods, and sequence-to-sequence extraction.
In span classification, the model considers candidate spans and classifies each one, assigning every candidate span either an entity type or a null label.
This approach can represent nested and overlapping entities, but it is more expensive because the number of candidate spans grows with sequence length.
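A rough sketch of this idea under two illustrative assumptions: span representations concatenate the start and end token states, and only spans up to a maximum width are scored (with one extra output for the null label):

```python
import torch
import torch.nn as nn

class SpanClassifier(nn.Module):
    def __init__(self, hidden_dim: int, num_types: int, max_width: int = 8):
        super().__init__()
        self.max_width = max_width
        # +1 output for the "not an entity" null type.
        self.scorer = nn.Linear(2 * hidden_dim, num_types + 1)

    def forward(self, hidden):                 # hidden: [T, D] for one sentence
        T = hidden.size(0)
        spans, reps = [], []
        for i in range(T):
            for j in range(i, min(i + self.max_width, T)):
                spans.append((i, j + 1))       # exclusive end index
                reps.append(torch.cat([hidden[i], hidden[j]], dim=-1))
        logits = self.scorer(torch.stack(reps))  # [num_spans, num_types + 1]
        return spans, logits
```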
Domain-Specific NER
General NER models often recognize people, organizations, and locations. Many real applications need specialized entity types.
Examples:
| Domain | Entity types |
|---|---|
| Medical | disease, symptom, drug, dosage, gene |
| Legal | statute, case name, court, party, date |
| Finance | company, ticker, currency, amount, instrument |
| Scientific | material, method, dataset, metric, organism |
| Software | library, function, file path, error code |
Domain-specific NER usually needs domain-specific annotation. A general model may know ordinary names, but fail on specialized terminology.
Useful adaptation methods include fine-tuning a domain language model, adding domain data, using weak supervision, and combining rules with neural models.
Rule-Based and Hybrid NER
Not every entity recognizer must be purely neural. Some entities are better handled with rules.
Examples include:
| Entity type | Useful method |
|---|---|
| Email address | Regular expression |
| URL | Regular expression |
| Phone number | Pattern matching |
| Date | Rule-based parser |
| Currency amount | Rule plus numeric parser |
| Product code | Domain-specific pattern |
Hybrid systems often work best. A neural model handles ambiguous natural language entities. Rules handle precise patterns that have stable syntax.
A practical NER pipeline may combine:
- Regex extractors
- Dictionary matchers
- Neural token classifier
- Span merger
- Conflict resolution
- Entity normalization
Entity normalization maps a detected span to a canonical ID. For example, IBM, International Business Machines, and IBM Corp. may all map to the same company identifier.
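As an illustration of the rule-based side of such a pipeline, the following sketch combines a regular expression for email addresses with a small alias dictionary for normalization. The alias table and the canonical IDs are invented for the example:

```python
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

# Hypothetical alias table mapping surface forms to one canonical company ID.
ALIASES = {
    "ibm": "COMPANY_001",
    "international business machines": "COMPANY_001",
    "ibm corp.": "COMPANY_001",
}

def extract_emails(text):
    return [
        {"start": m.start(), "end": m.end(), "text": m.group(), "type": "EMAIL"}
        for m in EMAIL_RE.finditer(text)
    ]

def normalize(span_text):
    return ALIASES.get(span_text.lower())

print(extract_emails("Contact press@example.com for details."))
print(normalize("International Business Machines"))  # 'COMPANY_001'
```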
Practical Decoding
After a model predicts token labels, the labels must be converted back into spans.
Example labels:
```
John    B-PERSON
Smith   I-PERSON
works   O
at      O
OpenAI  B-ORG
.       O
```

Decoded spans:
| Start | End | Text | Type |
|---|---|---|---|
| 0 | 2 | John Smith | PERSON |
| 4 | 5 | OpenAI | ORG |
A simple BIO decoder scans from left to right. When it sees B-X, it starts a new entity of type X. Consecutive I-X labels extend the entity. A new B-Y closes the current entity and starts another.
Invalid transitions require a policy. For example, I-ORG after O may be treated as B-ORG, or it may be discarded. The policy should be consistent between training, validation, and inference.
Minimal BIO Decoder
```python
def decode_bio(tokens, labels):
    entities = []            # collected (start, end, type) triples, end exclusive
    start = None
    ent_type = None
    for i, label in enumerate(labels):
        if label == "O":
            if start is not None:            # close any open entity
                entities.append((start, i, ent_type))
                start = None
                ent_type = None
            continue
        prefix, typ = label.split("-", 1)
        if prefix == "B":
            if start is not None:            # B always closes the previous entity
                entities.append((start, i, ent_type))
            start = i
            ent_type = typ
        elif prefix == "I":
            if start is None:                # stray I after O: treat it as a new entity
                start = i
                ent_type = typ
            elif typ != ent_type:            # type change: close and reopen
                entities.append((start, i, ent_type))
                start = i
                ent_type = typ
    if start is not None:                    # close an entity that ends the sequence
        entities.append((start, len(labels), ent_type))
    return [
        {
            "text": " ".join(tokens[start:end]),
            "start": start,
            "end": end,
            "type": typ,
        }
        for start, end, typ in entities
    ]
```

This decoder is intentionally simple. Production systems often need character offsets, subword merging, punctuation handling, and normalization.
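Applied to the earlier example, the decoder produces:

```python
tokens = ["John", "Smith", "works", "at", "OpenAI", "."]
labels = ["B-PERSON", "I-PERSON", "O", "O", "B-ORG", "O"]

print(decode_bio(tokens, labels))
# [{'text': 'John Smith', 'start': 0, 'end': 2, 'type': 'PERSON'},
#  {'text': 'OpenAI', 'start': 4, 'end': 5, 'type': 'ORG'}]
```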
Character Offsets
Applications often need entity locations in the original text, not just token indices.
For example:
```json
{
  "text": "John Smith works at OpenAI.",
  "entities": [
    {"start": 0, "end": 10, "type": "PERSON"},
    {"start": 20, "end": 26, "type": "ORG"}
  ]
}
```

Character offsets make it possible to highlight entities in a document, link entities to databases, redact sensitive information, or pass spans to downstream systems.
When using subword tokenizers, keep offset mappings from the tokenizer. These mappings record which character range corresponds to each token.
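With a Hugging Face fast tokenizer, for instance, these mappings can be requested directly; the snippet below assumes such a tokenizer object is already available:

```python
# Sketch assuming `tokenizer` is a fast tokenizer from the transformers library.
enc = tokenizer("John Smith works at OpenAI.", return_offsets_mapping=True)
for token_id, (char_start, char_end) in zip(enc["input_ids"], enc["offset_mapping"]):
    # Special tokens receive the empty character range (0, 0).
    print(token_id, char_start, char_end)
```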
Summary
Named entity recognition identifies entity spans and assigns entity types. In deep learning, NER is usually formulated as token classification with BIO or BIOES labels.
A transformer-based NER model maps token IDs to contextual hidden states, then applies a linear classifier at each token position. Training uses cross-entropy over valid token positions, while padding and ignored subword pieces are excluded from the loss.
NER quality should be measured at the span level. Correct predictions require both the boundary and entity type to match. Practical systems must handle tokenization alignment, invalid BIO transitions, domain-specific entities, nested entities, character offsets, and entity normalization.