LeetCode 820: Short Encoding of Words

A suffix-removal solution for finding the shortest reference string that can encode every word.

Problem Restatement

We are given an array of words.

A valid encoding uses:

Item       Meaning
s          A reference string ending with #
indices    Starting positions inside s

For each word, if we start reading from its index in s and stop before the next #, we must recover that word.

For example:

s = "time#bell#"
indices = [0, 2, 5]

This encodes:

"time" from index 0
"me" from index 2
"bell" from index 5

The word "me" does not need its own "me#" because it is already a suffix of "time".

We need to return the length of the shortest possible reference string. The official statement defines a valid encoding this way and asks for the minimum length of s.
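To make the decoding rule concrete, here is a small sketch (the helper name `decode` is my own, not part of the problem) that recovers each word from `s` and `indices`:

```python
def decode(s: str, indices: list[int]) -> list[str]:
    # Read from each index up to (but not including) the next '#'.
    return [s[i:s.index("#", i)] for i in indices]

print(decode("time#bell#", [0, 2, 5]))  # ['time', 'me', 'bell']
```

Starting at index 2 lands in the middle of the "time#" segment, which is exactly how "me" is recovered for free.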

Input and Output

Item              Meaning
Input             An array words
Output            Length of the shortest valid reference string
Separator         Each encoded word ends with #
Main saving rule  A word that is a suffix of another word does not need separate storage

Examples

Example 1:

words = ["time", "me", "bell"]

The shortest encoding is:

"time#bell#"

Its length is:

10

So the answer is:

10

Example 2:

words = ["t"]

The shortest encoding is:

"t#"

The answer is:

2

First Thought: Encode Every Word

The simplest valid encoding is to write every word followed by #.

For example:

words = ["time", "me", "bell"]

could become:

"time#me#bell#"

This has length 13.

But "me" is already inside "time#" as a suffix ending at the same #.

So writing "me#" separately is unnecessary.
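The naive encoding length is easy to express directly; a minimal sketch (the helper name `naive_length` is my own):

```python
def naive_length(words: list[str]) -> int:
    # Every word gets its own "word#" segment, suffix sharing ignored.
    return sum(len(w) + 1 for w in words)

print(naive_length(["time", "me", "bell"]))  # 13
```

This is an upper bound; the rest of the article is about removing the redundant segments.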

Key Insight

Only suffix relationships matter.

If word a is a suffix of word b, then a can be represented by starting inside b in the encoded string.

For example:

"time#"

can encode both:

"time"
"me"

But if a word appears only as a middle substring, it does not help.

For example:

"im"

appears inside "time", but it does not end at the same #, so it cannot be recovered from "time#".

Therefore, the final encoding only needs words that are not suffixes of any other word.
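The suffix relationship can be checked directly with `str.endswith`, which makes the contrast between "me" and "im" explicit:

```python
# "me" ends "time" at the same position the '#' follows,
# so it can share a segment with "time".
print("time".endswith("me"))  # True

# "im" is only a middle substring; reading from its start
# would produce "ime", not "im", before hitting '#'.
print("time".endswith("im"))  # False
```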

Algorithm

Use a set to remove duplicates:

unique = set(words)

Then, for each word, remove all of its proper non-empty suffixes from the set (every suffix strictly shorter than the word itself).

For a word like:

"time"

the removable suffixes are:

"ime"
"me"
"e"

The full word itself should stay unless it is removed as a suffix of some longer word.

After processing all words, every remaining word must appear explicitly in the encoding.

The answer is:

sum(len(word) + 1 for word in unique)

The + 1 counts the trailing #.
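Tracing the steps above on the first example shows the set shrinking to exactly the words that must be stored:

```python
words = ["time", "me", "bell"]
unique = set(words)                      # {'time', 'me', 'bell'}

for word in words:
    for i in range(1, len(word)):        # every proper suffix
        unique.discard(word[i:])         # "me" is dropped as a suffix of "time"

print(sorted(unique))                    # ['bell', 'time']
print(sum(len(w) + 1 for w in unique))   # 10
```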

Correctness

If a word is a suffix of another word, the algorithm removes it from the set. Such a word does not need its own encoded segment because it can be recovered by starting inside the longer word’s segment and reading until the same #.

If a word remains in the set, then it was not found as a suffix of any other word. No other encoded word can represent it, because the only way to recover a word from inside another encoded segment is for it to be a suffix of that segment. Therefore, it must be written explicitly.

The algorithm removes exactly the words that can be represented by longer words and keeps exactly the words that must be written. Summing len(word) + 1 over the remaining words gives the minimum reference string length.

Complexity

Let n = len(words) and let L be the maximum word length.

Metric  Value       Why
Time    O(n * L^2)  Each word has up to L - 1 proper suffixes, and building and hashing each suffix costs up to O(L)
Space   O(n * L)    The set stores unique words

In this problem, L <= 7, so this approach is easily fast enough.

Implementation

class Solution:
    def minimumLengthEncoding(self, words: list[str]) -> int:
        unique = set(words)  # duplicates only need to be encoded once

        for word in words:
            for i in range(1, len(word)):   # every proper suffix of word
                unique.discard(word[i:])    # no-op if the suffix is absent

        # Each remaining word needs its own "word#" segment.
        return sum(len(word) + 1 for word in unique)

Code Explanation

We start by removing duplicate words:

unique = set(words)

Duplicate words only need to be encoded once.

Then we scan every word and remove its suffixes:

for word in words:
    for i in range(1, len(word)):
        unique.discard(word[i:])

The loop starts at 1 so we do not remove the whole word from itself.

The method discard is useful because it does nothing if the suffix is not in the set.

After suffix removal, the set contains only words that need their own encoded segment.

Each such segment contributes:

len(word) + 1

because of the final #.
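The remaining words can also be concatenated into an actual reference string whose length matches the returned answer. Since set order is arbitrary, the string may be "bell#time#" rather than "time#bell#", but its length is the same:

```python
remaining = {"time", "bell"}  # the set after suffix removal, per the example
s = "".join(word + "#" for word in remaining)
print(len(s))  # 10
```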

Testing

def run_tests():
    s = Solution()

    assert s.minimumLengthEncoding(["time", "me", "bell"]) == 10
    assert s.minimumLengthEncoding(["t"]) == 2
    assert s.minimumLengthEncoding(["time", "time", "me"]) == 5
    assert s.minimumLengthEncoding(["me", "time"]) == 5
    assert s.minimumLengthEncoding(["time", "atime", "btime"]) == 12
    assert s.minimumLengthEncoding(["a", "ba", "cba"]) == 4

    print("all tests passed")

run_tests()
Test                               Why
["time", "me", "bell"]             Official suffix-sharing example
["t"]                              Single word
Duplicate words                    Encodes duplicates once
Suffix appears before full word    Order should not matter
Several words sharing same suffix  Keeps only non-suffix words
Chain of suffixes                  Only longest word needs encoding