16.15 Lexicographic Order
Many string algorithms depend on a precise ordering of strings.
16.15 Lexicographic Order
Problem
Many string algorithms depend on a precise ordering of strings.
Suffix arrays sort suffixes. Dictionaries sort words. Search systems compare terms. Databases compare keys. Compression algorithms order repeated fragments. In all of these cases, the algorithm must answer a basic question:
Which string comes first?
For example:
apple < banana
app < apple
cat < catalog
The challenge is to define this ordering exactly and implement comparisons without hidden ambiguity.
Solution
Use lexicographic order.
Lexicographic order compares strings from left to right. At the first position where the strings differ, the smaller character determines the smaller string.
If one string ends before any mismatch occurs, the shorter string is smaller.
This is the same principle used in dictionary ordering, with one important caveat: actual program behavior depends on the character representation and comparison rules.
Basic Rule
Compare:
algorithm
algebra
Step by step:
a = a
l = l
g > g? no, equal
o > e
The first differing position is:
o
e
Since:
e < o
we have:
algebra < algorithm
The comparison stops at the first mismatch.
Prefix Rule
If one string is a prefix of another, the shorter string comes first.
app < apple
because:
app
app le
The first three characters match, then the first string ends.
This rule is essential for suffix arrays.
For example, in:
banana
the suffix:
a
comes before:
ana
because a is a prefix of ana.
Comparator Contract
A string comparator should satisfy three properties.
| Property | Meaning |
|---|---|
| Antisymmetry | If a < b, then b < a is false |
| Transitivity | If a < b and b < c, then a < c |
| Totality | Any two strings can be compared |
Sorting algorithms assume this contract. If a comparator violates it, sort results can become unstable, incorrect, or implementation-dependent.
A reliable comparator returns one of three outcomes:
less
equal
greater
rather than only a Boolean.
Lean Sketch
A comparison type makes the contract explicit.
inductive Ordering where
| lt
| eq
| gt
A lexicographic comparison can then be described recursively:
def compareChars
(a b : List Char)
: Ordering :=
by
-- Compare heads.
-- Recurse on tails when equal.
-- Shorter list wins when one side ends.
sorry
The shape of the algorithm is simple:
compare([], []) = equal
compare([], _ :: _) = less
compare(_ :: _, []) = greater
compare(a::as, b::bs) =
if a < b then less
else if a > b then greater
else compare(as, bs)
Comparing Suffixes
For suffix arrays, strings are not usually copied. Instead, suffixes are represented by starting positions.
Given:
text = banana
suffix position 1 means:
anana
suffix position 3 means:
ana
A suffix comparator compares:
text[1..]
text[3..]
without allocating new strings.
The result is:
ana < anana
because ana is a prefix of anana.
This distinction matters. A production suffix-array implementation stores indexes, not substrings.
Prefix-Limited Comparison
Pattern search with a suffix array uses a slightly different comparison.
Suppose:
pattern = ana
suffix = anana
For search, this should count as a match, because the suffix begins with the pattern.
So the comparison is not:
ana < anana
Instead, it asks:
Does suffix[0..pattern.length) equal pattern?
That gives:
equal for search purposes
This prefix-limited comparison is central to correct suffix-array queries.
Example: Sorting Suffixes
Text:
mississippi
Suffixes:
| Position | Suffix |
|---|---|
| 0 | mississippi |
| 1 | ississippi |
| 2 | ssissippi |
| 3 | sissippi |
| 4 | issippi |
| 5 | ssippi |
| 6 | sippi |
| 7 | ippi |
| 8 | ppi |
| 9 | pi |
| 10 | i |
Sorted order begins:
| Rank | Position | Suffix |
|---|---|---|
| 0 | 10 | i |
| 1 | 7 | ippi |
| 2 | 4 | issippi |
| 3 | 1 | ississippi |
| 4 | 0 | mississippi |
The order is determined by the first differing character, then by the prefix rule.
Case Sensitivity
Lexicographic order depends on the character ordering.
In many encodings:
A < Z < a < z
Therefore:
Apple < apple
under raw code-point ordering.
For user-facing text, this may be undesirable. You may need case folding:
Apple
apple
both become:
apple
before comparison.
Algorithmic string structures usually prefer raw, deterministic ordering. User interfaces often prefer locale-aware ordering.
Do not mix the two casually.
Unicode and Locale
Unicode complicates lexicographic order.
These may look similar:
é
e + combining acute accent
but they can have different internal representations.
Locale also matters. Sorting rules for names and words may differ across languages.
For algorithmic structures such as suffix arrays, choose a normalized representation first:
- Decide the unit: bytes, code points, or grapheme clusters.
- Normalize if required.
- Compare using a deterministic rule.
- Use the same rule everywhere.
Consistency is more important than cultural correctness inside low-level indexes.
Lexicographic Order in Algorithms
Lexicographic order appears throughout string processing.
| Use Case | Role |
|---|---|
| Suffix arrays | Sort suffixes |
| Tries | Order outgoing edges |
| Search indexes | Sort dictionary terms |
| Compression | Group repeated substrings |
| Range queries | Find prefix intervals |
| Databases | Compare text keys |
The most important property is contiguity.
All strings with a common prefix occupy one contiguous interval in sorted lexicographic order.
For example:
car
carbon
card
care
cat
dog
All strings beginning with:
car
are adjacent.
This is why binary search can find prefix ranges.
Prefix Ranges
Suppose a sorted dictionary contains:
app
apple
application
apply
banana
band
All words beginning with:
app
form a contiguous interval:
app
apple
application
apply
A prefix query becomes two binary searches:
- First word not less than
app. - First word greater than all strings beginning with
app.
This technique underlies autocomplete, term dictionaries, and suffix-array pattern search.
Complexity Analysis
Let m be the length of the shorter string.
| Operation | Complexity |
|---|---|
| Compare two strings | O(m) |
Sort n strings naively |
O(n log n * m) |
| Prefix-range query | O(m log n) |
| Compare suffixes naively | O(n) worst case |
The cost of comparison is often the hidden factor in string algorithms. Sorting suffixes with naive comparisons can be much more expensive than the sort complexity suggests.
Correctness
Lexicographic comparison is correct because it follows the first position at which two strings can be distinguished.
If such a position exists, the character ordering decides the result.
If no such position exists before one string ends, then the shorter string must precede the longer string because it has no additional characters.
If neither string ends first and no mismatch exists, the strings are equal.
These cases are exhaustive.
Common Pitfalls
Do not use locale-aware comparison inside data structures that require stable binary ordering unless the locale is fixed and part of the data model.
Do not allocate suffix strings unnecessarily. Compare by index.
Do not use full lexicographic comparison when prefix-limited comparison is required.
Do not assume byte length equals character length.
Do not change normalization rules after building an index.
Takeaway
Lexicographic order is the ordering rule that makes many string structures work. It places related strings next to each other, turns prefix search into interval search, and gives suffix arrays their power. The rule itself is simple, but correctness depends on using one representation and one comparison policy consistently from construction through query.