A hash map guide for grouping file paths by identical file content and returning only duplicate groups.
Problem Restatement
We are given a list of directory information strings.
Each string has this format:
"directory_path file1.txt(content1) file2.txt(content2) ..."For example:
"root/a 1.txt(abcd) 2.txt(efgh)"means directory root/a contains:
| File | Content |
|---|---|
| 1.txt | abcd |
| 2.txt | efgh |
We need to return groups of duplicate files.
Two files are duplicates if they have exactly the same content, even if they have different names or live in different directories. Each returned group must contain at least two full file paths. The answer may be returned in any order.
Input and Output
Function signature:
```python
def findDuplicate(paths: list[str]) -> list[list[str]]:
    ...
```

Input:
| Parameter | Meaning |
|---|---|
| paths | List of directory information strings |
Output:
| Type | Meaning |
|---|---|
| list[list[str]] | Groups of file paths whose files have identical content |
A full file path is formed as:
```
directory_path/file_name
```

Example
Input:
```python
paths = [
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root/c/d 4.txt(efgh)",
    "root 4.txt(efgh)"
]
```

The files are:
| Full Path | Content |
|---|---|
| root/a/1.txt | abcd |
| root/a/2.txt | efgh |
| root/c/3.txt | abcd |
| root/c/d/4.txt | efgh |
| root/4.txt | efgh |
Group by content:
| Content | File Paths |
|---|---|
| abcd | root/a/1.txt, root/c/3.txt |
| efgh | root/a/2.txt, root/c/d/4.txt, root/4.txt |
Output:
```python
[
    ["root/a/1.txt", "root/c/3.txt"],
    ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"]
]
```

The order of groups and the order inside each group do not matter.
First Thought: Compare Every Pair of Files
A direct approach is to parse every file and compare its content with every other file.
If there are m files, this requires on the order of m² comparisons:

- file1 with file2
- file1 with file3
- file1 with file4
- ...

This is unnecessary.
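For intuition, here is what that naive scan looks like. This is only a sketch: it assumes the input has already been parsed into `(full_path, content)` pairs, which the real solution does on the fly.

```python
def pairwise_duplicates(files):
    """Naive O(m^2) scan over already-parsed (full_path, content) pairs."""
    duplicate_pairs = []
    for i in range(len(files)):
        for j in range(i + 1, len(files)):
            if files[i][1] == files[j][1]:  # identical content
                duplicate_pairs.append((files[i][0], files[j][0]))
    return duplicate_pairs
```

Besides the quadratic cost, this produces pairs rather than groups, so the results would still need to be merged.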
We only need to know which files share the same content. A hash map gives that grouping directly.
Key Insight
Use the file content as the hash map key.
For each parsed file:
| Key | Value |
|---|---|
| File content | List of full paths with that content |
For example:
```python
{
    "abcd": ["root/a/1.txt", "root/c/3.txt"],
    "efgh": ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"]
}
```

After processing all files, every map value with length at least 2 is a duplicate group.
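A minimal sketch of that final filter; the `"zzzz"` entry is hypothetical, added only to show what happens to a singleton:

```python
content_to_paths = {
    "abcd": ["root/a/1.txt", "root/c/3.txt"],
    "efgh": ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"],
    "zzzz": ["root/e/5.txt"],  # hypothetical file with unique content
}

# Keep only values of length >= 2; the "zzzz" singleton is dropped.
duplicate_groups = [paths for paths in content_to_paths.values() if len(paths) > 1]
```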
Algorithm
Create a dictionary:

```python
content_to_paths = defaultdict(list)
```

For each directory string:

- Split it by spaces.
- The first part is the directory path.
- Every later part is one file entry, such as `1.txt(abcd)`.
- Find the index of `"("`.
- The substring before `"("` is the file name.
- The substring inside parentheses is the content.
- Build the full path with `directory + "/" + file_name`.
- Append the full path to `content_to_paths[content]`.
Finally, return all groups whose length is greater than 1.
Implementation
```python
from collections import defaultdict
from typing import List

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        # Maps file content -> list of full paths with that content.
        content_to_paths = defaultdict(list)

        for path_info in paths:
            parts = path_info.split()
            directory = parts[0]

            for file_info in parts[1:]:
                # Each entry looks like "1.txt(abcd)".
                open_paren = file_info.find("(")
                file_name = file_info[:open_paren]
                content = file_info[open_paren + 1:-1]
                full_path = directory + "/" + file_name
                content_to_paths[content].append(full_path)

        # Only groups with at least two paths are duplicates.
        result = []
        for group in content_to_paths.values():
            if len(group) > 1:
                result.append(group)
        return result
```

Code Explanation
We use a dictionary from content to file paths:
```python
content_to_paths = defaultdict(list)
```

Each directory string is split by spaces:
```python
parts = path_info.split()
directory = parts[0]
```

For this input:

```
"root/a 1.txt(abcd) 2.txt(efgh)"
```

the split result is:
| Index | Value |
|---|---|
| 0 | root/a |
| 1 | 1.txt(abcd) |
| 2 | 2.txt(efgh) |
Each file entry contains a file name followed by content in parentheses.
We find the opening parenthesis:
```python
open_paren = file_info.find("(")
```

Then parse the file name:

```python
file_name = file_info[:open_paren]
```

and parse the content:

```python
content = file_info[open_paren + 1:-1]
```

The `-1` removes the closing parenthesis.
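As a quick sanity check, here is what those slices produce for a single entry (a standalone snippet, not part of the solution):

```python
file_info = "1.txt(abcd)"
open_paren = file_info.find("(")        # 5
file_name = file_info[:open_paren]      # "1.txt"
content = file_info[open_paren + 1:-1]  # "abcd"
```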
Then we build the full path:
```python
full_path = directory + "/" + file_name
```

Finally, we group it by content:
```python
content_to_paths[content].append(full_path)
```

After all files are processed, only groups with at least two paths are duplicates:
```python
for group in content_to_paths.values():
    if len(group) > 1:
        result.append(group)
```

Correctness
Each file entry is parsed into exactly one full file path and one content string.
The algorithm inserts that full path into the dictionary entry for its content. Therefore, after all input is processed, every dictionary group contains exactly the files with the same content.
A duplicate file group is defined as at least two files with identical content. The final loop returns exactly the dictionary values whose size is greater than one.
Therefore, every returned group is a valid duplicate group, and every valid duplicate group is returned.
Complexity
Let L be the total length of all strings in paths.
| Metric | Value | Why |
|---|---|---|
| Time | O(L) | Each input character is parsed a constant number of times |
| Space | O(L) | The dictionary stores contents and full paths |
The number of files is bounded by the total input size, so grouping all file paths also fits within O(L) space.
Follow-up: Real File System
In a real file system, we should not start by reading every file's entire contents into memory.
A practical duplicate-file search usually works in stages:
| Stage | Purpose |
|---|---|
| Group by file size | Files with different sizes cannot be identical |
| Hash candidate files | Compute a content hash only for files with the same size |
| Verify byte-by-byte | Avoid false positives from hash collisions |
If files are very large, read them in chunks, for example 1 KB or 4 KB at a time.
The most expensive part is usually reading file contents from disk. Grouping by file size first avoids unnecessary reads.
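A minimal sketch of the first two stages, using only the Python standard library. The names `find_duplicate_files` and `hash_file` are illustrative, not part of the original problem:

```python
import hashlib
import os
from collections import defaultdict

def hash_file(path, chunk_size=4096):
    """Hash a file in fixed-size chunks so large files never sit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def find_duplicate_files(root):
    # Stage 1: group files by size; files of different sizes cannot match.
    by_size = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            by_size[os.path.getsize(full)].append(full)

    # Stage 2: hash only files that share a size with at least one other file.
    by_hash = defaultdict(list)
    for candidates in by_size.values():
        if len(candidates) > 1:
            for path in candidates:
                by_hash[hash_file(path)].append(path)

    # Stage 3 (byte-by-byte verification) is omitted here; with SHA-256 the
    # collision risk is negligible, but a cautious tool would still compare
    # the grouped files directly.
    return [group for group in by_hash.values() if len(group) > 1]
```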
Testing
```python
def normalize(groups):
    return sorted(sorted(group) for group in groups)

def run_tests():
    s = Solution()

    paths1 = [
        "root/a 1.txt(abcd) 2.txt(efgh)",
        "root/c 3.txt(abcd)",
        "root/c/d 4.txt(efgh)",
        "root 4.txt(efgh)"
    ]
    expected1 = [
        ["root/a/1.txt", "root/c/3.txt"],
        ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"]
    ]
    assert normalize(s.findDuplicate(paths1)) == normalize(expected1)

    paths2 = [
        "root/a 1.txt(abcd) 2.txt(efgh)",
        "root/c 3.txt(abcd)",
        "root/c/d 4.txt(efgh)"
    ]
    expected2 = [
        ["root/a/1.txt", "root/c/3.txt"],
        ["root/a/2.txt", "root/c/d/4.txt"]
    ]
    assert normalize(s.findDuplicate(paths2)) == normalize(expected2)

    paths3 = [
        "root/a a.txt(x)",
        "root/b b.txt(y)"
    ]
    assert s.findDuplicate(paths3) == []

    print("all tests passed")

run_tests()
```

Test coverage:
| Case | Why |
|---|---|
| Multiple duplicate groups | Confirms grouping by content |
| Duplicate group of size two | Confirms minimum valid group |
| No duplicates | Confirms singleton groups are removed |