16.18 Unicode and Normalization

Many string algorithms assume that a string is simply a sequence of characters.

16.18 Unicode and Normalization

Problem

Many string algorithms assume that a string is simply a sequence of characters.

For English text, this assumption often appears correct.

hello
world
algorithm

A character seems to correspond to:

  • One symbol
  • One storage unit
  • One display position

Unicode breaks this assumption.

Consider:

é

This character may be represented as:

U+00E9

or as:

U+0065 U+0301

which means:

e + combining acute accent

Visually they appear identical.

Byte-by-byte comparison says they are different.

Human readers say they are the same.

The challenge is to define what "equal" means and ensure string algorithms operate consistently.

Solution

Separate three concepts:

  1. Bytes
  2. Code points
  3. Grapheme clusters

Then apply normalization before performing comparisons, indexing, searching, or hashing.

Many Unicode bugs arise because these concepts are mixed accidentally.

Bytes

A file ultimately contains bytes.

For example:

hello

in UTF-8 becomes:

68 65 6C 6C 6F

Each byte is storage.

Bytes are useful for:

  • File I/O
  • Network protocols
  • Compression
  • Binary formats

They are usually not the correct unit for text algorithms.

Consider:

é

UTF-8 encoding:

C3 A9

Two bytes.

Treating bytes as characters produces incorrect results.

Code Points

Unicode assigns every character a code point.

Examples:

Character Code Point
A U+0041
a U+0061
é U+00E9
U+65E5
😀 U+1F600

A code point is a logical character identifier.

Many string algorithms operate naturally on code points.

For example:

A
B
C

becomes:

U+0041
U+0042
U+0043

This is usually better than byte-level processing.

However, even code points are not always what users perceive as characters.

Grapheme Clusters

Consider:

e + combining acute accent

represented as:

U+0065
U+0301

A user sees:

é

one visual character.

Unicode calls this a grapheme cluster.

Similarly:

🇺🇸

appears as one symbol but consists of multiple code points.

Many user-facing operations should count grapheme clusters rather than code points.

For example:

String length
Cursor movement
Text selection

often operate at the grapheme level.

Equivalent Representations

Consider:

é

Representation 1:

U+00E9

Representation 2:

U+0065 U+0301

Visually:

é
é

Byte comparison:

Different

Human comparison:

Equal

This discrepancy creates problems for:

  • Search
  • Hashing
  • Sorting
  • Deduplication
  • Databases

The solution is normalization.

Unicode Normalization

Normalization converts equivalent representations into a canonical form.

The most common forms are:

Form Meaning
NFC Canonical composition
NFD Canonical decomposition
NFKC Compatibility composition
NFKD Compatibility decomposition

Most applications use NFC.

Example:

e + combining accent

becomes:

é

After normalization:

byte comparison

works correctly.

Why Normalization Matters

Suppose a database contains:

José

stored as:

J o s é

A search query uses:

J o s e + accent

Without normalization:

No match

With normalization:

Match

The same issue affects:

  • Hash tables
  • Suffix arrays
  • Tries
  • Search indexes

Normalization should occur before building the data structure.

Unicode and String Length

What is the length of:

é

Depending on the representation:

Measurement Length
Bytes 2
Code points 1 or 2
Grapheme clusters 1

Similarly:

🇺🇸

may be:

Measurement Length
Bytes 8
Code points 2
Grapheme clusters 1

Whenever an algorithm refers to:

length

the unit must be defined explicitly.

Unicode-Aware Comparison

A robust comparison pipeline often follows:

Input
  ↓
Normalize
  ↓
Case fold (optional)
  ↓
Compare

Example:

CAFÉ
café

Case-sensitive:

Different

Case-folded:

Equal

The desired behavior depends on the application.

Search engines often ignore case.

Compilers usually do not.

Unicode and Hashing

Hashing must be consistent with equality.

Suppose:

é

and:

e + accent

are considered equal.

Then:

hash(x) = hash(y)

must also hold.

Otherwise:

HashMap lookup

may fail unexpectedly.

Therefore normalization should occur before hashing.

Unicode and Tries

Suppose a trie stores:

café

in NFC.

A query arrives in NFD.

Without normalization:

Path not found

because the code-point sequence differs.

Normalize both insertion and lookup.

Consistency is more important than the chosen normalization form.

Unicode and Suffix Arrays

Suffix arrays assume a stable ordering.

Consider:

éclair

represented in multiple ways.

If normalization changes after construction:

Sorting order changes

The suffix array becomes invalid.

Therefore:

  1. Normalize first.
  2. Build indexes afterward.
  3. Use the same normalization for queries.

This rule applies to:

  • Suffix arrays
  • Tries
  • Search indexes
  • Databases

Unicode and Edit Distance

Edit distance depends on representation.

Compare:

é

with:

e

At the byte level:

Multiple edits

At the grapheme level:

One substitution

The algorithm is unchanged.

The unit of comparison changes.

This distinction often matters in spell checking and natural language processing.

Lean Sketch

A Unicode-aware pipeline might look like:

def normalize
  (s : String)
  : String :=
by
  -- NFC normalization.
  sorry

Then:

def equalText
  (a b : String)
  : Bool :=
  normalize a = normalize b

Most production systems rely on specialized Unicode libraries rather than implementing normalization directly.

Complexity Considerations

Let:

Symbol Meaning
n Number of code points
Operation Complexity
NFC normalization O(n)
Case folding O(n)
Unicode comparison O(n)
Unicode hashing O(n)

Normalization is linear and typically performed once during ingestion.

The cost is usually negligible compared with the correctness benefits.

Common Pitfalls

Do not assume:

one byte = one character

This is false for UTF-8.

Do not assume:

one code point = one visible character

This is also false.

Normalize before hashing, indexing, or storing keys.

Use the same normalization strategy everywhere.

Define clearly whether algorithms operate on:

  • Bytes
  • Code points
  • Grapheme clusters

Mixing these units causes subtle bugs.

Correctness

A Unicode-aware algorithm is correct only when equality, ordering, hashing, and indexing all operate on the same textual representation.

Normalization provides that representation.

Once normalization is applied consistently, equivalent text sequences map to the same canonical form, allowing comparisons and data structures to behave predictably.

Takeaway

Unicode transforms text processing from a simple sequence-of-characters problem into a representation problem. Before applying string algorithms, you must decide what constitutes a character, what constitutes equality, and how equivalent representations are handled. Normalization provides the foundation for these decisions. Without it, otherwise correct algorithms can produce surprising results. With it, classic techniques such as hashing, tries, suffix arrays, edit distance, and pattern matching continue to work reliably across modern multilingual text.