String sorting method that groups strings by prefixes in a trie and sorts leaf buckets for cache efficient lexicographic order.
Burstsort string sorting is a trie based method for lexicographic string sorting. It stores strings in buckets attached to trie leaves. When a bucket grows too large, the bucket is split by the next character position.
The method is designed for large string collections where comparison sorting repeatedly examines the same prefixes and causes poor cache behavior.
Problem
Given an array of strings, sort the strings in lexicographic order.
For two strings and , lexicographic order compares the first position where they differ. If one string is a prefix of the other, the shorter string comes first.
Idea
Burstsort avoids repeated full string comparisons by grouping strings according to prefixes.
At a trie node at depth :
- strings in the node share the same prefix of length
- the next character decides the child bucket
- leaf buckets store unsorted strings until they exceed a threshold
- large buckets burst into child buckets
After all strings are inserted, an ordered trie traversal emits buckets in lexicographic order.
Algorithm
burstsort_strings(A, threshold):
root = new leaf node
for s in A:
insert(root, s, 0, threshold)
result = empty list
collect(root, result)
return result
insert(node, s, d, threshold):
if node is leaf:
append s to node.bucket
if size(node.bucket) > threshold:
burst(node, d, threshold)
return
c = char_at(s, d)
insert(node.child[c], s, d + 1, threshold)
burst(node, d, threshold):
old = node.bucket
node.bucket = empty
node becomes internal node
for s in old:
c = char_at(s, d)
create child c if missing
append s to child c bucket
collect(node, result):
if node is leaf:
sort node.bucket lexicographically
append node.bucket to result
return
for c in characters in sorted order:
collect(node.child[c], result)Sentinel Character
Variable length strings require a sentinel character that sorts before all regular characters. This handles prefix relationships correctly.
For example:
because after matching c, a, and r, the shorter string reaches the sentinel first.
Example
Sort:
After prefix grouping:
| prefix | bucket |
|---|---|
| a | absolute, acorn |
| b | banana, band, bee |
The a bucket is sorted before the b bucket. Inside the b bucket, further prefix grouping separates ban and be.
Final order:
Correctness
Each trie edge corresponds to one character position. Therefore, an ordered traversal visits prefixes in lexicographic order. All strings in an earlier child are smaller than all strings in a later child.
Leaf buckets contain strings with a common prefix. Sorting each bucket completes the order among strings that were not split further. Since every string is inserted into exactly one leaf and every leaf is visited in lexicographic order, the final output is sorted.
Complexity
Let:
- be the number of strings
- be the average string length
- be the bucket threshold
- be the alphabet size
Typical time:
plus the cost of sorting leaf buckets.
Space:
where is the number of trie nodes.
Properties
| property | value |
|---|---|
| stable | no |
| in place | no |
| comparison based | hybrid |
| cache efficient | yes |
| best for | large string datasets |
When to Use
Use burstsort string sorting when sorting many strings with shared prefixes, especially when ordinary comparison sorts spend too much time revisiting the same characters.
Avoid it for small arrays, short strings, or memory constrained settings where a simple comparison sort is easier and sufficient.
Implementation Sketch
class Node:
def __init__(self):
self.children = None
self.bucket = []
def burstsort_strings(a, threshold=32):
root = Node()
for s in a:
insert(root, s, 0, threshold)
out = []
collect(root, out)
return out
def char_at(s, d):
if d == len(s):
return ""
return s[d]
def insert(node, s, d, threshold):
if node.children is None:
node.bucket.append(s)
if len(node.bucket) > threshold:
burst(node, d)
return
c = char_at(s, d)
if c not in node.children:
node.children[c] = Node()
insert(node.children[c], s, d + 1, threshold)
def burst(node, d):
old = node.bucket
node.bucket = []
node.children = {}
for s in old:
c = char_at(s, d)
if c not in node.children:
node.children[c] = Node()
node.children[c].bucket.append(s)
def collect(node, out):
if node.children is None:
out.extend(sorted(node.bucket))
return
for c in sorted(node.children):
collect(node.children[c], out)