16.15 Lexicographic Order

Many string algorithms depend on a precise ordering of strings.

16.15 Lexicographic Order

Problem

Many string algorithms depend on a precise ordering of strings.

Suffix arrays sort suffixes. Dictionaries sort words. Search systems compare terms. Databases compare keys. Compression algorithms order repeated fragments. In all of these cases, the algorithm must answer a basic question:

Which string comes first?

For example:

apple < banana
app   < apple
cat   < catalog

The challenge is to define this ordering exactly and implement comparisons without hidden ambiguity.

Solution

Use lexicographic order.

Lexicographic order compares strings from left to right. At the first position where the strings differ, the smaller character determines the smaller string.

If one string ends before any mismatch occurs, the shorter string is smaller.

This is the same principle used in dictionary ordering, with one important caveat: actual program behavior depends on the character representation and comparison rules.

Basic Rule

Compare:

algorithm
algebra

Step by step:

a = a
l = l
g > g? no, equal
o > e

The first differing position is:

o
e

Since:

e < o

we have:

algebra < algorithm

The comparison stops at the first mismatch.

Prefix Rule

If one string is a prefix of another, the shorter string comes first.

app < apple

because:

app
app le

The first three characters match, then the first string ends.

This rule is essential for suffix arrays.

For example, in:

banana

the suffix:

a

comes before:

ana

because a is a prefix of ana.

Comparator Contract

A string comparator should satisfy three properties.

Property Meaning
Antisymmetry If a < b, then b < a is false
Transitivity If a < b and b < c, then a < c
Totality Any two strings can be compared

Sorting algorithms assume this contract. If a comparator violates it, sort results can become unstable, incorrect, or implementation-dependent.

A reliable comparator returns one of three outcomes:

less
equal
greater

rather than only a Boolean.

Lean Sketch

A comparison type makes the contract explicit.

inductive Ordering where
  | lt
  | eq
  | gt

A lexicographic comparison can then be described recursively:

def compareChars
  (a b : List Char)
  : Ordering :=
by
  -- Compare heads.
  -- Recurse on tails when equal.
  -- Shorter list wins when one side ends.
  sorry

The shape of the algorithm is simple:

compare([], [])       = equal
compare([], _ :: _)   = less
compare(_ :: _, [])   = greater
compare(a::as, b::bs) =
    if a < b then less
    else if a > b then greater
    else compare(as, bs)

Comparing Suffixes

For suffix arrays, strings are not usually copied. Instead, suffixes are represented by starting positions.

Given:

text = banana

suffix position 1 means:

anana

suffix position 3 means:

ana

A suffix comparator compares:

text[1..]
text[3..]

without allocating new strings.

The result is:

ana < anana

because ana is a prefix of anana.

This distinction matters. A production suffix-array implementation stores indexes, not substrings.

Prefix-Limited Comparison

Pattern search with a suffix array uses a slightly different comparison.

Suppose:

pattern = ana
suffix  = anana

For search, this should count as a match, because the suffix begins with the pattern.

So the comparison is not:

ana < anana

Instead, it asks:

Does suffix[0..pattern.length) equal pattern?

That gives:

equal for search purposes

This prefix-limited comparison is central to correct suffix-array queries.

Example: Sorting Suffixes

Text:

mississippi

Suffixes:

Position Suffix
0 mississippi
1 ississippi
2 ssissippi
3 sissippi
4 issippi
5 ssippi
6 sippi
7 ippi
8 ppi
9 pi
10 i

Sorted order begins:

Rank Position Suffix
0 10 i
1 7 ippi
2 4 issippi
3 1 ississippi
4 0 mississippi

The order is determined by the first differing character, then by the prefix rule.

Case Sensitivity

Lexicographic order depends on the character ordering.

In many encodings:

A < Z < a < z

Therefore:

Apple < apple

under raw code-point ordering.

For user-facing text, this may be undesirable. You may need case folding:

Apple
apple

both become:

apple

before comparison.

Algorithmic string structures usually prefer raw, deterministic ordering. User interfaces often prefer locale-aware ordering.

Do not mix the two casually.

Unicode and Locale

Unicode complicates lexicographic order.

These may look similar:

é
e + combining acute accent

but they can have different internal representations.

Locale also matters. Sorting rules for names and words may differ across languages.

For algorithmic structures such as suffix arrays, choose a normalized representation first:

  1. Decide the unit: bytes, code points, or grapheme clusters.
  2. Normalize if required.
  3. Compare using a deterministic rule.
  4. Use the same rule everywhere.

Consistency is more important than cultural correctness inside low-level indexes.

Lexicographic Order in Algorithms

Lexicographic order appears throughout string processing.

Use Case Role
Suffix arrays Sort suffixes
Tries Order outgoing edges
Search indexes Sort dictionary terms
Compression Group repeated substrings
Range queries Find prefix intervals
Databases Compare text keys

The most important property is contiguity.

All strings with a common prefix occupy one contiguous interval in sorted lexicographic order.

For example:

car
carbon
card
care
cat
dog

All strings beginning with:

car

are adjacent.

This is why binary search can find prefix ranges.

Prefix Ranges

Suppose a sorted dictionary contains:

app
apple
application
apply
banana
band

All words beginning with:

app

form a contiguous interval:

app
apple
application
apply

A prefix query becomes two binary searches:

  1. First word not less than app.
  2. First word greater than all strings beginning with app.

This technique underlies autocomplete, term dictionaries, and suffix-array pattern search.

Complexity Analysis

Let m be the length of the shorter string.

Operation Complexity
Compare two strings O(m)
Sort n strings naively O(n log n * m)
Prefix-range query O(m log n)
Compare suffixes naively O(n) worst case

The cost of comparison is often the hidden factor in string algorithms. Sorting suffixes with naive comparisons can be much more expensive than the sort complexity suggests.

Correctness

Lexicographic comparison is correct because it follows the first position at which two strings can be distinguished.

If such a position exists, the character ordering decides the result.

If no such position exists before one string ends, then the shorter string must precede the longer string because it has no additional characters.

If neither string ends first and no mismatch exists, the strings are equal.

These cases are exhaustive.

Common Pitfalls

Do not use locale-aware comparison inside data structures that require stable binary ordering unless the locale is fixed and part of the data model.

Do not allocate suffix strings unnecessarily. Compare by index.

Do not use full lexicographic comparison when prefix-limited comparison is required.

Do not assume byte length equals character length.

Do not change normalization rules after building an index.

Takeaway

Lexicographic order is the ordering rule that makes many string structures work. It places related strings next to each other, turns prefix search into interval search, and gives suffix arrays their power. The rule itself is simple, but correctness depends on using one representation and one comparison policy consistently from construction through query.