A suffix-removal solution for finding the shortest reference string that can encode every word.
## Problem Restatement
We are given an array of words.
A valid encoding uses:
| Item | Meaning |
|---|---|
| s | A reference string ending with # |
| indices | Starting positions inside s |
For each word, if we start reading from its index in s and stop before the next #, we must recover that word.
For example:

```
s = "time#bell#"
indices = [0, 2, 5]
```

This encodes:

- "time" from index 0
- "me" from index 2
- "bell" from index 5

The word "me" does not need its own "me#" because it is already a suffix of "time".
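The decoding rule can be sketched as a small helper; `read_word` is a name introduced here for illustration, not part of the problem statement:

```python
def read_word(s, index):
    # Read characters from index up to (but not including) the next '#'
    end = s.index("#", index)
    return s[index:end]

s = "time#bell#"
print(read_word(s, 0))  # time
print(read_word(s, 2))  # me
print(read_word(s, 5))  # bell
```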
We need to return the length of the shortest possible reference string. The official statement defines a valid encoding this way and asks for the minimum length of s.
## Input and Output
| Item | Meaning |
|---|---|
| Input | An array words |
| Output | Length of the shortest valid reference string |
| Separator | Each encoded word ends with # |
| Main saving rule | A word that is a suffix of another word does not need separate storage |
## Examples

Example 1:

```
words = ["time", "me", "bell"]
```

The shortest encoding is "time#bell#", which has length 10, so the answer is 10.

Example 2:

```
words = ["t"]
```

The shortest encoding is "t#", so the answer is 2.

## First Thought: Encode Every Word
The simplest valid encoding is to write every word followed by #.
For example:

```
words = ["time", "me", "bell"]
```

could become "time#me#bell#", which has length 13.
But "me" is already inside "time#" as a suffix ending at the same #.
So writing "me#" separately is unnecessary.
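The naive length is just the total word length plus one separator per word; a quick sketch:

```python
words = ["time", "me", "bell"]
# Each word costs its own length plus one trailing '#'
naive_length = sum(len(w) + 1 for w in words)
print(naive_length)  # 13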
## Key Insight
Only suffix relationships matter.
If word a is a suffix of word b, then a can be represented by starting inside b in the encoded string.
For example, "time#" can encode both "time" and "me".

But if a word appears only as a middle substring, it does not help.
For example, "im" appears inside "time", but it does not end at the same #, so it cannot be recovered from "time#".
Therefore, the final encoding only needs words that are not suffixes of any other word.
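In Python the suffix relationship is a single `str.endswith` check; a small sketch contrasting it with a plain substring test:

```python
# A suffix can share the longer word's segment
print("time".endswith("me"))  # True

# A middle substring cannot help: "im" occurs inside "time"
# but does not end at the same '#'
print("im" in "time")         # True, yet useless here
print("time".endswith("im"))  # False
```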
## Algorithm
Use a set to remove duplicates:

```
unique = set(words)
```

Then for each word, remove all of its proper suffixes (non-empty and strictly shorter than the word) from the set.
For a word like "time", the removable suffixes are "ime", "me", and "e". The full word itself stays in the set unless it is removed as a suffix of some longer word.
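Generating those removable suffixes takes one slice per starting position; a sketch:

```python
word = "time"
# Proper suffixes: start at every position except 0
suffixes = [word[i:] for i in range(1, len(word))]
print(suffixes)  # ['ime', 'me', 'e']
```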
After processing all words, every remaining word must appear explicitly in the encoding.
The answer is:

```
sum(len(word) + 1 for word in unique)
```

The + 1 counts the trailing #.
## Correctness
If a word is a suffix of another word, the algorithm removes it from the set. Such a word does not need its own encoded segment because it can be recovered by starting inside the longer word’s segment and reading until the same #.
If a word remains in the set, then it was not found as a suffix of any other word. No other encoded word can represent it, because the only way to recover a word from inside another encoded segment is for it to be a suffix of that segment. Therefore, it must be written explicitly.
The algorithm removes exactly the words that can be represented by longer words and keeps exactly the words that must be written. Summing len(word) + 1 over the remaining words gives the minimum reference string length.
## Complexity
Let n = len(words) and let L be the maximum word length.
| Metric | Value | Why |
|---|---|---|
| Time | O(n * L^2) | For each word, we build up to L - 1 suffix strings, each of length up to L |
| Space | O(n * L) | The set stores unique words |
In this problem, L <= 7, so this approach is easily fast enough.
## Implementation
```python
class Solution:
    def minimumLengthEncoding(self, words: list[str]) -> int:
        # Keep only distinct words; duplicates encode once
        unique = set(words)
        # Remove every proper suffix of every word
        for word in words:
            for i in range(1, len(word)):
                unique.discard(word[i:])
        # Each surviving word needs its own segment plus '#'
        answer = 0
        for word in unique:
            answer += len(word) + 1
        return answer
```

## Code Explanation
We start by removing duplicate words:

```
unique = set(words)
```

Duplicate words only need to be encoded once.
Then we scan every word and remove its suffixes:

```
for word in words:
    for i in range(1, len(word)):
        unique.discard(word[i:])
```

The loop starts at 1 so we do not remove the whole word from itself.
The method discard is useful because it does nothing if the suffix is not in the set.
After suffix removal, the set contains only words that need their own encoded segment.
Each such segment contributes len(word) + 1 because of the final #.
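For completeness, the surviving set also determines an explicit s and indices. A sketch using a hypothetical helper `build_encoding` (not part of the solution above; note that set iteration order, and therefore the exact string, may vary):

```python
def build_encoding(words):
    # Keep only words that are not proper suffixes of others
    unique = set(words)
    for word in words:
        for i in range(1, len(word)):
            unique.discard(word[i:])
    # Concatenate each surviving word with its '#' terminator
    s = "".join(w + "#" for w in unique)
    # A word's index is where its segment word + '#' begins in s
    indices = [s.index(w + "#") for w in words]
    return s, indices

s, indices = build_encoding(["time", "me", "bell"])
print(len(s))  # 10
```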
## Testing
```python
def run_tests():
    s = Solution()
    assert s.minimumLengthEncoding(["time", "me", "bell"]) == 10
    assert s.minimumLengthEncoding(["t"]) == 2
    assert s.minimumLengthEncoding(["time", "time", "me"]) == 5
    assert s.minimumLengthEncoding(["me", "time"]) == 5
    assert s.minimumLengthEncoding(["time", "atime", "btime"]) == 12
    assert s.minimumLengthEncoding(["a", "ba", "cba"]) == 4
    print("all tests passed")

run_tests()
```

| Test | Why |
|---|---|
| ["time", "me", "bell"] | Official suffix-sharing example |
| ["t"] | Single word |
| Duplicate words | Encodes duplicates once |
| Suffix appears before full word | Order should not matter |
| Several words sharing same suffix | Keeps only non-suffix words |
| Chain of suffixes | Only longest word needs encoding |