LeetCode 609: Find Duplicate File in System

A hash map guide for grouping file paths by identical file content and returning only duplicate groups.

Problem Restatement

We are given a list of directory information strings.

Each string has this format:

"directory_path file1.txt(content1) file2.txt(content2) ..."

For example:

"root/a 1.txt(abcd) 2.txt(efgh)"

means directory root/a contains:

| File  | Content |
|-------|---------|
| 1.txt | abcd    |
| 2.txt | efgh    |

We need to return groups of duplicate files.

Two files are duplicates if they have exactly the same content, even if they have different names or live in different directories. Each returned group must contain at least two full file paths. The answer may be returned in any order.

Input and Output

Function signature:

def findDuplicate(paths: list[str]) -> list[list[str]]:
    ...

Input:

| Parameter | Meaning |
|-----------|---------|
| paths     | List of directory information strings |

Output:

| Type            | Meaning |
|-----------------|---------|
| list[list[str]] | Groups of file paths whose files have identical content |

A full file path is formed as:

directory_path/file_name

Example

Input:

paths = [
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root/c/d 4.txt(efgh)",
    "root 4.txt(efgh)"
]

The files are:

| Full Path      | Content |
|----------------|---------|
| root/a/1.txt   | abcd    |
| root/a/2.txt   | efgh    |
| root/c/3.txt   | abcd    |
| root/c/d/4.txt | efgh    |
| root/4.txt     | efgh    |

Group by content:

| Content | File Paths |
|---------|------------|
| abcd    | root/a/1.txt, root/c/3.txt |
| efgh    | root/a/2.txt, root/c/d/4.txt, root/4.txt |

Output:

[
    ["root/a/1.txt", "root/c/3.txt"],
    ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"]
]

The order of groups and the order inside each group do not matter.

First Thought: Compare Every Pair of Files

A direct approach is to parse every file and compare its content with every other file.

If there are m files, pairwise comparison requires O(m²) checks:

file1 with file2
file1 with file3
file1 with file4
...

This is unnecessary.

We only need to know which files share the same content. A hash map gives that grouping directly.

Key Insight

Use the file content as the hash map key.

For each parsed file:

| Key          | Value                                |
|--------------|--------------------------------------|
| File content | List of full paths with that content |

For example:

{
    "abcd": ["root/a/1.txt", "root/c/3.txt"],
    "efgh": ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"]
}

After processing all files, every map value with length at least 2 is a duplicate group.
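This filtering step can be sketched on its own; the sample entries below mirror part of the example above:

```python
from collections import defaultdict

content_to_paths = defaultdict(list)
for content, path in [
    ("abcd", "root/a/1.txt"),
    ("efgh", "root/a/2.txt"),
    ("abcd", "root/c/3.txt"),
]:
    content_to_paths[content].append(path)

# Keep only the groups that contain at least two paths.
duplicates = [group for group in content_to_paths.values() if len(group) > 1]
print(duplicates)  # [['root/a/1.txt', 'root/c/3.txt']]
```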

Algorithm

Create a dictionary:

content_to_paths = defaultdict(list)

For each directory string:

  1. Split it by spaces.
  2. The first part is the directory path.
  3. Every later part is one file entry, such as 1.txt(abcd).
  4. Find the index of "(".
  5. The substring before "(" is the file name.
  6. The substring inside parentheses is the content.
  7. Build the full path with directory + "/" + file_name.
  8. Append the full path to content_to_paths[content].

Finally, return all groups whose length is greater than 1.

Implementation

from collections import defaultdict
from typing import List

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        # Map each content string to every full path that contains it.
        content_to_paths = defaultdict(list)

        for path_info in paths:
            parts = path_info.split()
            directory = parts[0]

            for file_info in parts[1:]:
                # Each entry looks like "name.txt(content)".
                open_paren = file_info.find("(")

                file_name = file_info[:open_paren]
                content = file_info[open_paren + 1:-1]  # strip "(" and ")"

                full_path = directory + "/" + file_name
                content_to_paths[content].append(full_path)

        # Only groups with at least two paths are duplicates.
        result = []
        for group in content_to_paths.values():
            if len(group) > 1:
                result.append(group)

        return result

Code Explanation

We use a dictionary from content to file paths:

content_to_paths = defaultdict(list)

Each directory string is split by spaces:

parts = path_info.split()
directory = parts[0]

For this input:

"root/a 1.txt(abcd) 2.txt(efgh)"

the split result is:

| Index | Value       |
|-------|-------------|
| 0     | root/a      |
| 1     | 1.txt(abcd) |
| 2     | 2.txt(efgh) |
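This split can be checked directly:

```python
parts = "root/a 1.txt(abcd) 2.txt(efgh)".split()
print(parts)  # ['root/a', '1.txt(abcd)', '2.txt(efgh)']
```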

Each file entry contains a file name followed by content in parentheses.

We find the opening parenthesis:

open_paren = file_info.find("(")

Then parse the file name:

file_name = file_info[:open_paren]

and parse the content:

content = file_info[open_paren + 1:-1]

The -1 removes the closing parenthesis.
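The slicing can be illustrated on a single entry (a standalone sketch, separate from the solution class):

```python
file_info = "1.txt(abcd)"
open_paren = file_info.find("(")        # index 5

file_name = file_info[:open_paren]       # everything before "("
content = file_info[open_paren + 1:-1]   # everything between "(" and ")"

print(file_name, content)  # 1.txt abcd
```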

Then we build the full path:

full_path = directory + "/" + file_name

Finally, we group it by content:

content_to_paths[content].append(full_path)

After all files are processed, only groups with at least two paths are duplicates:

for group in content_to_paths.values():
    if len(group) > 1:
        result.append(group)

Correctness

Each file entry is parsed into exactly one full file path and one content string.

The algorithm inserts that full path into the dictionary entry for its content. Therefore, after all input is processed, every dictionary group contains exactly the files with the same content.

A duplicate file group is defined as at least two files with identical content. The final loop returns exactly the dictionary values whose size is greater than one.

Therefore, every returned group is a valid duplicate group, and every valid duplicate group is returned.

Complexity

Let L be the total length of all strings in paths.

| Metric | Value | Why |
|--------|-------|-----|
| Time   | O(L)  | Each input character is processed a constant number of times |
| Space  | O(L)  | The dictionary stores contents and full paths |

The number of files is bounded by the total input size, so grouping all file paths also fits within O(L) space.

Follow-up: Real File System

In a real file system, we should not immediately read every whole file into memory.

A practical duplicate-file search usually works in stages:

| Stage                | Purpose |
|----------------------|---------|
| Group by file size   | Files with different sizes cannot be identical |
| Hash candidate files | Compute a content hash only for files with the same size |
| Verify byte-by-byte  | Avoid false positives from hash collisions |

If files are very large, read them in chunks, for example 1 KB or 4 KB at a time.

The most expensive part is usually reading file contents from disk. Grouping by file size first avoids unnecessary reads.
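A minimal sketch of the first two stages, using only Python's standard library (the function names `file_hash` and `find_duplicates` are illustrative, not part of the LeetCode problem):

```python
import hashlib
import os
from collections import defaultdict

def file_hash(path, chunk_size=4096):
    """Hash a file's content in fixed-size chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    # Stage 1: group by file size; files with different sizes
    # cannot be duplicates, so they are never read at all.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            by_size[os.path.getsize(full)].append(full)

    # Stage 2: hash only the files that share a size.
    by_hash = defaultdict(list)
    for files in by_size.values():
        if len(files) > 1:
            for path in files:
                by_hash[file_hash(path)].append(path)

    # Stage 3 (byte-by-byte verification) is omitted here; with a
    # cryptographic hash, collisions are negligible in practice.
    return [group for group in by_hash.values() if len(group) > 1]
```

The size check in stage 1 is what keeps disk reads to a minimum: unique-size files are skipped entirely.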

Testing

def normalize(groups):
    return sorted(sorted(group) for group in groups)

def run_tests():
    s = Solution()

    paths1 = [
        "root/a 1.txt(abcd) 2.txt(efgh)",
        "root/c 3.txt(abcd)",
        "root/c/d 4.txt(efgh)",
        "root 4.txt(efgh)"
    ]

    expected1 = [
        ["root/a/1.txt", "root/c/3.txt"],
        ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"]
    ]

    assert normalize(s.findDuplicate(paths1)) == normalize(expected1)

    paths2 = [
        "root/a 1.txt(abcd) 2.txt(efgh)",
        "root/c 3.txt(abcd)",
        "root/c/d 4.txt(efgh)"
    ]

    expected2 = [
        ["root/a/1.txt", "root/c/3.txt"],
        ["root/a/2.txt", "root/c/d/4.txt"]
    ]

    assert normalize(s.findDuplicate(paths2)) == normalize(expected2)

    paths3 = [
        "root/a a.txt(x)",
        "root/b b.txt(y)"
    ]

    assert s.findDuplicate(paths3) == []

    print("all tests passed")

run_tests()

Test coverage:

| Case                        | Why |
|-----------------------------|-----|
| Multiple duplicate groups   | Confirms grouping by content |
| Duplicate group of size two | Confirms the minimum valid group |
| No duplicates               | Confirms singleton groups are removed |