16.18 Unicode and Normalization
Many string algorithms assume that a string is simply a sequence of characters.
16.18 Unicode and Normalization
Problem
Many string algorithms assume that a string is simply a sequence of characters.
For English text, this assumption often appears correct.
hello
world
algorithm
A character seems to correspond to:
- One symbol
- One storage unit
- One display position
Unicode breaks this assumption.
Consider:
é
This character may be represented as:
U+00E9
or as:
U+0065 U+0301
which means:
e + combining acute accent
Visually they appear identical.
Byte-by-byte comparison says they are different.
Human readers say they are the same.
The challenge is to define what "equal" means and ensure string algorithms operate consistently.
Solution
Separate three concepts:
- Bytes
- Code points
- Grapheme clusters
Then apply normalization before performing comparisons, indexing, searching, or hashing.
Many Unicode bugs arise because these concepts are mixed accidentally.
Bytes
A file ultimately contains bytes.
For example:
hello
in UTF-8 becomes:
68 65 6C 6C 6F
Each byte is storage.
Bytes are useful for:
- File I/O
- Network protocols
- Compression
- Binary formats
They are usually not the correct unit for text algorithms.
Consider:
é
UTF-8 encoding:
C3 A9
Two bytes.
Treating bytes as characters produces incorrect results.
Code Points
Unicode assigns every character a code point.
Examples:
| Character | Code Point |
|---|---|
| A | U+0041 |
| a | U+0061 |
| é | U+00E9 |
| 日 | U+65E5 |
| 😀 | U+1F600 |
A code point is a logical character identifier.
Many string algorithms operate naturally on code points.
For example:
A
B
C
becomes:
U+0041
U+0042
U+0043
This is usually better than byte-level processing.
However, even code points are not always what users perceive as characters.
Grapheme Clusters
Consider:
e + combining acute accent
represented as:
U+0065
U+0301
A user sees:
é
one visual character.
Unicode calls this a grapheme cluster.
Similarly:
🇺🇸
appears as one symbol but consists of multiple code points.
Many user-facing operations should count grapheme clusters rather than code points.
For example:
String length
Cursor movement
Text selection
often operate at the grapheme level.
Equivalent Representations
Consider:
é
Representation 1:
U+00E9
Representation 2:
U+0065 U+0301
Visually:
é
é
Byte comparison:
Different
Human comparison:
Equal
This discrepancy creates problems for:
- Search
- Hashing
- Sorting
- Deduplication
- Databases
The solution is normalization.
Unicode Normalization
Normalization converts equivalent representations into a canonical form.
The most common forms are:
| Form | Meaning |
|---|---|
| NFC | Canonical composition |
| NFD | Canonical decomposition |
| NFKC | Compatibility composition |
| NFKD | Compatibility decomposition |
Most applications use NFC.
Example:
e + combining accent
becomes:
é
After normalization:
byte comparison
works correctly.
Why Normalization Matters
Suppose a database contains:
José
stored as:
J o s é
A search query uses:
J o s e + accent
Without normalization:
No match
With normalization:
Match
The same issue affects:
- Hash tables
- Suffix arrays
- Tries
- Search indexes
Normalization should occur before building the data structure.
Unicode and String Length
What is the length of:
é
Depending on the representation:
| Measurement | Length |
|---|---|
| Bytes | 2 |
| Code points | 1 or 2 |
| Grapheme clusters | 1 |
Similarly:
🇺🇸
may be:
| Measurement | Length |
|---|---|
| Bytes | 8 |
| Code points | 2 |
| Grapheme clusters | 1 |
Whenever an algorithm refers to:
length
the unit must be defined explicitly.
Unicode-Aware Comparison
A robust comparison pipeline often follows:
Input
↓
Normalize
↓
Case fold (optional)
↓
Compare
Example:
CAFÉ
café
Case-sensitive:
Different
Case-folded:
Equal
The desired behavior depends on the application.
Search engines often ignore case.
Compilers usually do not.
Unicode and Hashing
Hashing must be consistent with equality.
Suppose:
é
and:
e + accent
are considered equal.
Then:
hash(x) = hash(y)
must also hold.
Otherwise:
HashMap lookup
may fail unexpectedly.
Therefore normalization should occur before hashing.
Unicode and Tries
Suppose a trie stores:
café
in NFC.
A query arrives in NFD.
Without normalization:
Path not found
because the code-point sequence differs.
Normalize both insertion and lookup.
Consistency is more important than the chosen normalization form.
Unicode and Suffix Arrays
Suffix arrays assume a stable ordering.
Consider:
éclair
represented in multiple ways.
If normalization changes after construction:
Sorting order changes
The suffix array becomes invalid.
Therefore:
- Normalize first.
- Build indexes afterward.
- Use the same normalization for queries.
This rule applies to:
- Suffix arrays
- Tries
- Search indexes
- Databases
Unicode and Edit Distance
Edit distance depends on representation.
Compare:
é
with:
e
At the byte level:
Multiple edits
At the grapheme level:
One substitution
The algorithm is unchanged.
The unit of comparison changes.
This distinction often matters in spell checking and natural language processing.
Lean Sketch
A Unicode-aware pipeline might look like:
def normalize
(s : String)
: String :=
by
-- NFC normalization.
sorry
Then:
def equalText
(a b : String)
: Bool :=
normalize a = normalize b
Most production systems rely on specialized Unicode libraries rather than implementing normalization directly.
Complexity Considerations
Let:
| Symbol | Meaning |
|---|---|
| n | Number of code points |
| Operation | Complexity |
|---|---|
| NFC normalization | O(n) |
| Case folding | O(n) |
| Unicode comparison | O(n) |
| Unicode hashing | O(n) |
Normalization is linear and typically performed once during ingestion.
The cost is usually negligible compared with the correctness benefits.
Common Pitfalls
Do not assume:
one byte = one character
This is false for UTF-8.
Do not assume:
one code point = one visible character
This is also false.
Normalize before hashing, indexing, or storing keys.
Use the same normalization strategy everywhere.
Define clearly whether algorithms operate on:
- Bytes
- Code points
- Grapheme clusters
Mixing these units causes subtle bugs.
Correctness
A Unicode-aware algorithm is correct only when equality, ordering, hashing, and indexing all operate on the same textual representation.
Normalization provides that representation.
Once normalization is applied consistently, equivalent text sequences map to the same canonical form, allowing comparisons and data structures to behave predictably.
Takeaway
Unicode transforms text processing from a simple sequence-of-characters problem into a representation problem. Before applying string algorithms, you must decide what constitutes a character, what constitutes equality, and how equivalent representations are handled. Normalization provides the foundation for these decisions. Without it, otherwise correct algorithms can produce surprising results. With it, classic techniques such as hashing, tries, suffix arrays, edit distance, and pattern matching continue to work reliably across modern multilingual text.