# LeetCode 609: Find Duplicate File in System

## Problem Restatement

We are given a list of directory information strings.

Each string has this format:

```text
"directory_path file1.txt(content1) file2.txt(content2) ..."
```

For example:

```text
"root/a 1.txt(abcd) 2.txt(efgh)"
```

means directory `root/a` contains:

| File | Content |
|---|---|
| `1.txt` | `abcd` |
| `2.txt` | `efgh` |

We need to return groups of duplicate files.

Two files are duplicates if they have exactly the same content, even if they have different names or live in different directories. Each returned group must contain at least two full file paths. The answer may be returned in any order.

## Input and Output

Function signature:

```python
def findDuplicate(paths: list[str]) -> list[list[str]]:
    ...
```

Input:

| Parameter | Meaning |
|---|---|
| `paths` | List of directory information strings |

Output:

| Type | Meaning |
|---|---|
| `list[list[str]]` | Groups of file paths whose files have identical content |

A full file path is formed as:

```text
directory_path/file_name
```

## Example

Input:

```python
paths = [
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root/c/d 4.txt(efgh)",
    "root 4.txt(efgh)"
]
```

The files are:

| Full Path | Content |
|---|---|
| `root/a/1.txt` | `abcd` |
| `root/a/2.txt` | `efgh` |
| `root/c/3.txt` | `abcd` |
| `root/c/d/4.txt` | `efgh` |
| `root/4.txt` | `efgh` |

Group by content:

| Content | File Paths |
|---|---|
| `abcd` | `root/a/1.txt`, `root/c/3.txt` |
| `efgh` | `root/a/2.txt`, `root/c/d/4.txt`, `root/4.txt` |

Output:

```python
[
    ["root/a/1.txt", "root/c/3.txt"],
    ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"]
]
```

The order of groups and the order inside each group do not matter.

## First Thought: Compare Every Pair of Files

A direct approach is to parse every file entry and compare its content against every other file's content.

If there are `m` files, this requires up to `m * (m - 1) / 2` comparisons, which is `O(m^2)`:

```text
file1 with file2
file1 with file3
file1 with file4
...
```
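
A minimal sketch of this brute-force idea is shown below; the helpers `parse_files` and `find_duplicates_brute_force` are illustrative names, not part of the final solution:

```python
def parse_files(paths):
    # Yield (full_path, content) for every file entry in the input.
    for path_info in paths:
        directory, *entries = path_info.split()
        for entry in entries:
            name, _, content = entry.partition("(")
            yield directory + "/" + name, content[:-1]  # drop the ")"

def find_duplicates_brute_force(paths):
    files = list(parse_files(paths))
    grouped = set()  # indices already placed in some group
    result = []
    for i, (path_i, content_i) in enumerate(files):
        if i in grouped:
            continue
        group = [path_i]
        for j in range(i + 1, len(files)):  # compare with every later file
            if j not in grouped and files[j][1] == content_i:
                group.append(files[j][0])
                grouped.add(j)
        if len(group) > 1:
            result.append(group)
    return result
```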

This is unnecessary.

We only need to know which files share the same content. A hash map gives that grouping directly.

## Key Insight

Use the file content as the hash map key.

For each parsed file:

| Key | Value |
|---|---|
| File content | List of full paths with that content |

For example:

```python
{
    "abcd": ["root/a/1.txt", "root/c/3.txt"],
    "efgh": ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"]
}
```

After processing all files, every map value with length at least `2` is a duplicate group.
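
In code, that final filter is a one-line comprehension, equivalent to the explicit loop used in the implementation below:

```python
# Keep only contents shared by two or more files.
duplicates = [group for group in content_to_paths.values() if len(group) > 1]
```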

## Algorithm

Create a dictionary:

```python
content_to_paths = defaultdict(list)
```

For each directory string:

1. Split it by spaces.
2. The first part is the directory path.
3. Each remaining part is one file entry, such as `1.txt(abcd)`.
4. Find the index of `"("`.
5. The substring before `"("` is the file name.
6. The substring inside parentheses is the content.
7. Build the full path with `directory + "/" + file_name`.
8. Append the full path to `content_to_paths[content]`.

Finally, return all groups whose length is greater than `1`.
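
As a quick check on steps 4 through 6, here is how a single entry is sliced apart:

```python
file_info = "1.txt(abcd)"

open_paren = file_info.find("(")        # 5
file_name = file_info[:open_paren]      # "1.txt"
content = file_info[open_paren + 1:-1]  # "abcd"
```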

## Implementation

```python
from collections import defaultdict
from typing import List

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        # Map each content string to all full paths that contain it.
        content_to_paths = defaultdict(list)

        for path_info in paths:
            parts = path_info.split()
            directory = parts[0]  # the first token is the directory path

            for file_info in parts[1:]:
                # Each entry has the form "name.txt(content)".
                open_paren = file_info.find("(")

                file_name = file_info[:open_paren]
                content = file_info[open_paren + 1:-1]  # drop the trailing ")"

                full_path = directory + "/" + file_name
                content_to_paths[content].append(full_path)

        # Only groups with at least two paths are duplicate groups.
        result = []

        for group in content_to_paths.values():
            if len(group) > 1:
                result.append(group)

        return result
```
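
Running the implementation on the example input reproduces the expected grouping. On CPython 3.7+ dictionaries preserve insertion order, so this exact output order is stable:

```python
solution = Solution()
print(solution.findDuplicate([
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root/c/d 4.txt(efgh)",
    "root 4.txt(efgh)",
]))
# [['root/a/1.txt', 'root/c/3.txt'],
#  ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']]
```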

## Code Explanation

We use a dictionary from content to file paths:

```python
content_to_paths = defaultdict(list)
```

Each directory string is split by spaces:

```python
parts = path_info.split()
directory = parts[0]
```

For this input:

```text
"root/a 1.txt(abcd) 2.txt(efgh)"
```

the split result is:

| Index | Value |
|---:|---|
| `0` | `root/a` |
| `1` | `1.txt(abcd)` |
| `2` | `2.txt(efgh)` |

Each file entry contains a file name followed by content in parentheses.

We find the opening parenthesis:

```python
open_paren = file_info.find("(")
```

Then parse the file name:

```python
file_name = file_info[:open_paren]
```

and parse the content:

```python
content = file_info[open_paren + 1:-1]
```

The `-1` removes the closing parenthesis.

Then we build the full path:

```python
full_path = directory + "/" + file_name
```

Finally, we append the full path under its content key:

```python
content_to_paths[content].append(full_path)
```

After all files are processed, only groups with at least two paths are duplicates:

```python
for group in content_to_paths.values():
    if len(group) > 1:
        result.append(group)
```

## Correctness

Each file entry is parsed into exactly one full file path and one content string.

The algorithm inserts that full path into the dictionary entry for its content. Therefore, after all input is processed, every dictionary group contains exactly the files with the same content.

A duplicate file group is defined as at least two files with identical content. The final loop returns exactly the dictionary values whose size is greater than one.

Therefore, every returned group is a valid duplicate group, and every valid duplicate group is returned.

## Complexity

Let `L` be the total length of all strings in `paths`.

| Metric | Value | Why |
|---|---:|---|
| Time | `O(L)` | Each input character is parsed a constant number of times |
| Space | `O(L)` | The dictionary stores contents and full paths |

The number of files is bounded by the total input size, so grouping all file paths also fits within `O(L)` space.

## Follow-up: Real File System

In a real file system, we should not immediately read every whole file into memory.

A practical duplicate-file search usually works in stages:

| Stage | Purpose |
|---|---|
| Group by file size | Files with different sizes cannot be identical |
| Hash candidate files | Compute a content hash only for files with the same size |
| Verify byte-by-byte | Avoid false positives from hash collisions |

If files are very large, read them in fixed-size chunks, for example 1 KB or 4 KB at a time, so that no file has to fit in memory at once.

The most expensive part is usually reading file contents from disk. Grouping by file size first avoids unnecessary reads.
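
Below is a minimal sketch of that staged search, assuming a local directory tree. The name `find_duplicate_files`, the SHA-256 choice, and the 4 KB chunk size are illustrative, and the final byte-by-byte verification stage is left as a comment:

```python
import hashlib
import os
from collections import defaultdict

CHUNK_SIZE = 4096  # read 4 KB at a time so large files never sit fully in memory

def hash_file(path):
    # Stream the file through SHA-256 in fixed-size chunks.
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicate_files(root):
    # Stage 1: group by size; files of different sizes cannot be identical.
    size_groups = defaultdict(list)
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            size_groups[os.path.getsize(full)].append(full)

    # Stage 2: hash only files that share a size with at least one other file.
    result = []
    for same_size in size_groups.values():
        if len(same_size) < 2:
            continue
        hash_groups = defaultdict(list)
        for path in same_size:
            hash_groups[hash_file(path)].append(path)
        # Stage 3 (omitted here): verify candidates byte-by-byte
        # to rule out hash collisions.
        result.extend(group for group in hash_groups.values() if len(group) > 1)
    return result
```

The size pass is nearly free, one `stat` call per file, while hashing touches every byte on disk, which is why filtering by size before hashing matters.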

## Testing

```python
def normalize(groups):
    # Sort within each group and across groups so ordering does not affect comparison.
    return sorted(sorted(group) for group in groups)

def run_tests():
    s = Solution()

    paths1 = [
        "root/a 1.txt(abcd) 2.txt(efgh)",
        "root/c 3.txt(abcd)",
        "root/c/d 4.txt(efgh)",
        "root 4.txt(efgh)"
    ]

    expected1 = [
        ["root/a/1.txt", "root/c/3.txt"],
        ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"]
    ]

    assert normalize(s.findDuplicate(paths1)) == normalize(expected1)

    paths2 = [
        "root/a 1.txt(abcd) 2.txt(efgh)",
        "root/c 3.txt(abcd)",
        "root/c/d 4.txt(efgh)"
    ]

    expected2 = [
        ["root/a/1.txt", "root/c/3.txt"],
        ["root/a/2.txt", "root/c/d/4.txt"]
    ]

    assert normalize(s.findDuplicate(paths2)) == normalize(expected2)

    paths3 = [
        "root/a a.txt(x)",
        "root/b b.txt(y)"
    ]

    assert s.findDuplicate(paths3) == []

    print("all tests passed")

run_tests()
```

Test coverage:

| Case | Why |
|---|---|
| Multiple duplicate groups | Confirms grouping by content |
| Duplicate group of size two | Confirms minimum valid group |
| No duplicates | Confirms singleton groups are removed |

