A hash map guide for grouping file paths by identical file content and returning only duplicate groups.
Problem Restatement
We are given a list of directory information strings.
Each string has this format:
"directory_path file1.txt(content1) file2.txt(content2) ..."For example:
"root/a 1.txt(abcd) 2.txt(efgh)"means directory root/a contains:
| File | Content |
|---|---|
| 1.txt | abcd |
| 2.txt | efgh |
We need to return groups of duplicate files.
Two files are duplicates if they have exactly the same content, even if they have different names or live in different directories. Each returned group must contain at least two full file paths. The answer may be returned in any order.
Input and Output
Function signature:
```python
def findDuplicate(paths: list[str]) -> list[list[str]]:
    ...
```

Input:
| Parameter | Meaning |
|---|---|
| paths | List of directory information strings |
Output:
| Type | Meaning |
|---|---|
| list[list[str]] | Groups of file paths whose files have identical content |
A full file path is formed as:
```
directory_path/file_name
```

Example
Input:
```python
paths = [
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root/c/d 4.txt(efgh)",
    "root 4.txt(efgh)"
]
```

The files are:
| Full Path | Content |
|---|---|
| root/a/1.txt | abcd |
| root/a/2.txt | efgh |
| root/c/3.txt | abcd |
| root/c/d/4.txt | efgh |
| root/4.txt | efgh |
Group by content:
| Content | File Paths |
|---|---|
| abcd | root/a/1.txt, root/c/3.txt |
| efgh | root/a/2.txt, root/c/d/4.txt, root/4.txt |
Output:
```python
[
    ["root/a/1.txt", "root/c/3.txt"],
    ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"]
]
```

The order of groups and the order inside each group do not matter.
First Thought: Compare Every Pair of Files
A direct approach is to parse every file and compare its content with every other file.
If there are m files, this requires on the order of m² comparisons:

- file1 with file2
- file1 with file3
- file1 with file4
- ...

This is unnecessary.
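For intuition, here is what that naive scan looks like. This is only a sketch: it assumes the input has already been parsed into `(full_path, content)` pairs, which the real solution does on the fly.

```python
def pairwise_duplicates(files):
    """Naive O(m^2) scan over already-parsed (full_path, content) pairs."""
    duplicate_pairs = []
    for i in range(len(files)):
        for j in range(i + 1, len(files)):
            if files[i][1] == files[j][1]:  # identical content
                duplicate_pairs.append((files[i][0], files[j][0]))
    return duplicate_pairs
```

Besides the quadratic cost, this produces pairs rather than groups, so the results would still need to be merged.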
We only need to know which files share the same content. A hash map gives that grouping directly.
Key Insight
Use the file content as the hash map key.
For each parsed file:
| Key | Value |
|---|---|
| File content | List of full paths with that content |
For example:
```python
{
    "abcd": ["root/a/1.txt", "root/c/3.txt"],
    "efgh": ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"]
}
```

After processing all files, every map value with length at least 2 is a duplicate group.
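A minimal sketch of that final filter; the `"zzzz"` entry is hypothetical, added only to show what happens to a singleton:

```python
content_to_paths = {
    "abcd": ["root/a/1.txt", "root/c/3.txt"],
    "efgh": ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"],
    "zzzz": ["root/e/5.txt"],  # hypothetical file with unique content
}

# Keep only values of length >= 2; the "zzzz" singleton is dropped.
duplicate_groups = [paths for paths in content_to_paths.values() if len(paths) > 1]
```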
Algorithm
Create a dictionary:

```python
content_to_paths = defaultdict(list)
```

For each directory string:

- Split it by spaces.
- The first part is the directory path.
- Every later part is one file entry, such as `1.txt(abcd)`.
- Find the index of `"("`.
- The substring before `"("` is the file name.
- The substring inside parentheses is the content.
- Build the full path with `directory + "/" + file_name`.
- Append the full path to `content_to_paths[content]`.
Finally, return all groups whose length is greater than 1.
Implementation
```python
from collections import defaultdict
from typing import List

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        # Maps file content -> list of full paths with that content.
        content_to_paths = defaultdict(list)

        for path_info in paths:
            parts = path_info.split()
            directory = parts[0]

            for file_info in parts[1:]:
                # Each entry looks like "1.txt(abcd)".
                open_paren = file_info.find("(")
                file_name = file_info[:open_paren]
                content = file_info[open_paren + 1:-1]
                full_path = directory + "/" + file_name
                content_to_paths[content].append(full_path)

        # Only groups with at least two paths are duplicates.
        result = []
        for group in content_to_paths.values():
            if len(group) > 1:
                result.append(group)
        return result
```

Code Explanation
We use a dictionary from content to file paths:
```python
content_to_paths = defaultdict(list)
```

Each directory string is split by spaces:
```python
parts = path_info.split()
directory = parts[0]
```

For this input:

```
"root/a 1.txt(abcd) 2.txt(efgh)"
```

the split result is:
| Index | Value |
|---|---|
| 0 | root/a |
| 1 | 1.txt(abcd) |
| 2 | 2.txt(efgh) |
Each file entry contains a file name followed by content in parentheses.
We find the opening parenthesis:
```python
open_paren = file_info.find("(")
```

Then parse the file name:

```python
file_name = file_info[:open_paren]
```

and parse the content:

```python
content = file_info[open_paren + 1:-1]
```

The `-1` removes the closing parenthesis.
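As a quick sanity check, here is what those slices produce for a single entry (a standalone snippet, not part of the solution):

```python
file_info = "1.txt(abcd)"
open_paren = file_info.find("(")        # 5
file_name = file_info[:open_paren]      # "1.txt"
content = file_info[open_paren + 1:-1]  # "abcd"
```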
Then we build the full path:
```python
full_path = directory + "/" + file_name
```

Finally, we group it by content:
```python
content_to_paths[content].append(full_path)
```

After all files are processed, only groups with at least two paths are duplicates:
```python
for group in content_to_paths.values():
    if len(group) > 1:
        result.append(group)
```

Correctness
Each file entry is parsed into exactly one full file path and one content string.
The algorithm inserts that full path into the dictionary entry for its content. Therefore, after all input is processed, every dictionary group contains exactly the files with the same content.
A duplicate file group is defined as at least two files with identical content. The final loop returns exactly the dictionary values whose size is greater than one.
Therefore, every returned group is a valid duplicate group, and every valid duplicate group is returned.
Complexity
Let L be the total length of all strings in paths.
| Metric | Value | Why |
|---|---|---|
| Time | O(L) | Each input character is parsed a constant number of times |
| Space | O(L) | The dictionary stores contents and full paths |
The number of files is bounded by the total input size, so grouping all file paths also fits within O(L) space.
Follow-up: Real File System
In a real file system, we should not start by reading every file's entire contents into memory.
A practical duplicate-file search usually works in stages:
| Stage | Purpose |
|---|---|
| Group by file size | Files with different sizes cannot be identical |
| Hash candidate files | Compute a content hash only for files with the same size |
| Verify byte-by-byte | Avoid false positives from hash collisions |
If files are very large, read them in chunks, for example 1 KB or 4 KB at a time.
The most expensive part is usually reading file contents from disk. Grouping by file size first avoids unnecessary reads.
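A minimal sketch of the first two stages, using only the Python standard library. The names `find_duplicate_files` and `hash_file` are illustrative, not part of the original problem:

```python
import hashlib
import os
from collections import defaultdict

def hash_file(path, chunk_size=4096):
    """Hash a file in fixed-size chunks so large files never sit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_duplicate_files(root):
    # Stage 1: group files by size; files of different sizes cannot match.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            by_size[os.path.getsize(full)].append(full)

    # Stage 2: hash only files that share a size with at least one other file.
    by_hash = defaultdict(list)
    for candidates in by_size.values():
        if len(candidates) > 1:
            for path in candidates:
                by_hash[hash_file(path)].append(path)

    # Stage 3 (byte-by-byte verification) is omitted here; with SHA-256 the
    # collision risk is negligible, but a cautious tool would still compare
    # the grouped files directly.
    return [group for group in by_hash.values() if len(group) > 1]
```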
Testing
```python
def normalize(groups):
    return sorted(sorted(group) for group in groups)

def run_tests():
    s = Solution()

    paths1 = [
        "root/a 1.txt(abcd) 2.txt(efgh)",
        "root/c 3.txt(abcd)",
        "root/c/d 4.txt(efgh)",
        "root 4.txt(efgh)"
    ]
    expected1 = [
        ["root/a/1.txt", "root/c/3.txt"],
        ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"]
    ]
    assert normalize(s.findDuplicate(paths1)) == normalize(expected1)

    paths2 = [
        "root/a 1.txt(abcd) 2.txt(efgh)",
        "root/c 3.txt(abcd)",
        "root/c/d 4.txt(efgh)"
    ]
    expected2 = [
        ["root/a/1.txt", "root/c/3.txt"],
        ["root/a/2.txt", "root/c/d/4.txt"]
    ]
    assert normalize(s.findDuplicate(paths2)) == normalize(expected2)

    paths3 = [
        "root/a a.txt(x)",
        "root/b b.txt(y)"
    ]
    assert s.findDuplicate(paths3) == []

    print("all tests passed")

run_tests()
```

Test coverage:
| Case | Why |
|---|---|
| Multiple duplicate groups | Confirms grouping by content |
| Duplicate group of size two | Confirms minimum valid group |
| No duplicates | Confirms singleton groups are removed |