A suffix-removal solution for finding the shortest reference string that can encode every word.
## Problem Restatement
We are given an array of words.
A valid encoding uses:
| Item | Meaning |
|---|---|
| s | A reference string ending with # |
| indices | Starting positions inside s |
For each word, if we start reading from its index in s and stop before the next #, we must recover that word.
For example:

```
s = "time#bell#"
indices = [0, 2, 5]
```

This encodes:

- "time" from index 0
- "me" from index 2
- "bell" from index 5

The word "me" does not need its own "me#" because it is already a suffix of "time".
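The decoding rule can be sketched as a small helper; `read_word` is a name introduced here for illustration, not part of the problem statement:

```python
def read_word(s, index):
    # Read characters from index up to (but not including) the next '#'
    end = s.index("#", index)
    return s[index:end]

s = "time#bell#"
print(read_word(s, 0))  # time
print(read_word(s, 2))  # me
print(read_word(s, 5))  # bell
```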
We need to return the length of the shortest possible reference string. The official statement defines a valid encoding this way and asks for the minimum length of s.
## Input and Output
| Item | Meaning |
|---|---|
| Input | An array words |
| Output | Length of the shortest valid reference string |
| Separator | Each encoded word ends with # |
| Main saving rule | A word that is a suffix of another word does not need separate storage |
## Examples

Example 1:

```
words = ["time", "me", "bell"]
```

The shortest encoding is "time#bell#", which has length 10, so the answer is 10.

Example 2:

```
words = ["t"]
```

The shortest encoding is "t#", so the answer is 2.

## First Thought: Encode Every Word
The simplest valid encoding is to write every word followed by #.
For example:

```
words = ["time", "me", "bell"]
```

could become "time#me#bell#", which has length 13.
But "me" is already inside "time#" as a suffix ending at the same #.
So writing "me#" separately is unnecessary.
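The naive length is just the total word length plus one separator per word; a quick sketch:

```python
words = ["time", "me", "bell"]
# Each word costs its own length plus one trailing '#'
naive_length = sum(len(w) + 1 for w in words)
print(naive_length)  # 13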
## Key Insight
Only suffix relationships matter.
If word a is a suffix of word b, then a can be represented by starting inside b in the encoded string.
For example, "time#" can encode both "time" and "me".

But if a word appears only as a middle substring, it does not help.
For example, "im" appears inside "time", but it does not end at the same #, so it cannot be recovered from "time#".
Therefore, the final encoding only needs words that are not suffixes of any other word.
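In Python the suffix relationship is a single `str.endswith` check; a small sketch contrasting it with a plain substring test:

```python
# A suffix can share the longer word's segment
print("time".endswith("me"))  # True

# A middle substring cannot help: "im" occurs inside "time"
# but does not end at the same '#'
print("im" in "time")         # True, yet useless here
print("time".endswith("im"))  # False
```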
## Algorithm
Use a set to remove duplicates:

```
unique = set(words)
```

Then for each word, remove all of its proper suffixes (non-empty and strictly shorter than the word) from the set.
For a word like "time", the removable suffixes are "ime", "me", and "e". The full word itself stays in the set unless it is removed as a suffix of some longer word.
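Generating those removable suffixes takes one slice per starting position; a sketch:

```python
word = "time"
# Proper suffixes: start at every position except 0
suffixes = [word[i:] for i in range(1, len(word))]
print(suffixes)  # ['ime', 'me', 'e']
```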
After processing all words, every remaining word must appear explicitly in the encoding.
The answer is:

```
sum(len(word) + 1 for word in unique)
```

The + 1 counts the trailing #.
## Correctness
If a word is a suffix of another word, the algorithm removes it from the set. Such a word does not need its own encoded segment because it can be recovered by starting inside the longer word’s segment and reading until the same #.
If a word remains in the set, then it was not found as a suffix of any other word. No other encoded word can represent it, because the only way to recover a word from inside another encoded segment is for it to be a suffix of that segment. Therefore, it must be written explicitly.
The algorithm removes exactly the words that can be represented by longer words and keeps exactly the words that must be written. Summing len(word) + 1 over the remaining words gives the minimum reference string length.
## Complexity
Let n = len(words) and let L be the maximum word length.
| Metric | Value | Why |
|---|---|---|
| Time | O(n * L^2) | For each word, we build up to L - 1 suffix strings, each of length up to L |
| Space | O(n * L) | The set stores unique words |
In this problem, L <= 7, so this approach is easily fast enough.
## Implementation
```python
class Solution:
    def minimumLengthEncoding(self, words: list[str]) -> int:
        # Keep only distinct words; duplicates encode once
        unique = set(words)
        # Remove every proper suffix of every word
        for word in words:
            for i in range(1, len(word)):
                unique.discard(word[i:])
        # Each surviving word needs its own segment plus '#'
        answer = 0
        for word in unique:
            answer += len(word) + 1
        return answer
```

## Code Explanation
We start by removing duplicate words:

```
unique = set(words)
```

Duplicate words only need to be encoded once.
Then we scan every word and remove its suffixes:

```
for word in words:
    for i in range(1, len(word)):
        unique.discard(word[i:])
```

The loop starts at 1 so we do not remove the whole word from itself.
The method discard is useful because it does nothing if the suffix is not in the set.
After suffix removal, the set contains only words that need their own encoded segment.
Each such segment contributes len(word) + 1 because of the final #.
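For completeness, the surviving set also determines an explicit s and indices. A sketch using a hypothetical helper `build_encoding` (not part of the solution above; note that set iteration order, and therefore the exact string, may vary):

```python
def build_encoding(words):
    # Keep only words that are not proper suffixes of others
    unique = set(words)
    for word in words:
        for i in range(1, len(word)):
            unique.discard(word[i:])
    # Concatenate each surviving word with its '#' terminator
    s = "".join(w + "#" for w in unique)
    # A word's index is where its segment word + '#' begins in s
    indices = [s.index(w + "#") for w in words]
    return s, indices

s, indices = build_encoding(["time", "me", "bell"])
print(len(s))  # 10
```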
## Testing
```python
def run_tests():
    s = Solution()
    assert s.minimumLengthEncoding(["time", "me", "bell"]) == 10
    assert s.minimumLengthEncoding(["t"]) == 2
    assert s.minimumLengthEncoding(["time", "time", "me"]) == 5
    assert s.minimumLengthEncoding(["me", "time"]) == 5
    assert s.minimumLengthEncoding(["time", "atime", "btime"]) == 12
    assert s.minimumLengthEncoding(["a", "ba", "cba"]) == 4
    print("all tests passed")

run_tests()
```

| Test | Why |
|---|---|
| ["time", "me", "bell"] | Official suffix-sharing example |
| ["t"] | Single word |
| Duplicate words | Encodes duplicates once |
| Suffix appears before full word | Order should not matter |
| Several words sharing same suffix | Keeps only non-suffix words |
| Chain of suffixes | Only longest word needs encoding |