# LeetCode 196: Delete Duplicate Emails

## Problem Restatement

We are given a table named `Person`.

Schema:

| Column | Type |
|---|---|
| `id` | int |
| `email` | varchar |

The `id` column is unique.

Some rows may contain duplicate email addresses.

We need to delete duplicate rows so that:

- Every email appears only once.
- The row with the smallest `id` is kept.
- All other duplicate rows are removed.

Example table:

| id | email |
|---|---|
| 1 | john@example.com |
| 2 | bob@example.com |
| 3 | john@example.com |

After deletion:

| id | email |
|---|---|
| 1 | john@example.com |
| 2 | bob@example.com |

The row with `id = 3` is deleted because its email already exists with a smaller `id`.

## Input and Output

| Item | Meaning |
|---|---|
| Input | Table `Person` |
| Output | Delete duplicate email rows |
| Keep | Smallest `id` for each email |
| Database | MySQL |

## Examples

Initial table:

| id | email |
|---|---|
| 1 | a@b.com |
| 2 | c@d.com |
| 3 | a@b.com |

The email `a@b.com` appears twice.

We keep:

| id | email |
|---|---|
| 1 | a@b.com |

and delete:

| id | email |
|---|---|
| 3 | a@b.com |

because `1 < 3`.

Another example:

| id | email |
|---|---|
| 1 | x@test.com |
| 2 | x@test.com |
| 3 | x@test.com |

We keep only:

| id | email |
|---|---|
| 1 | x@test.com |

and delete rows `2` and `3`.

## First Thought

We need to identify rows that:

1. Have the same email as another row.
2. Have a larger `id`.

Those larger rows are duplicates and should be deleted.

This suggests comparing rows against other rows in the same table.

That is exactly what a self join does.

## Key Insight

Suppose we compare the table with itself.

Call the first copy `p1`.

Call the second copy `p2`.

If:

```sql
p1.email = p2.email
```

then both rows represent the same email.

If:

```sql
p1.id > p2.id
```

then `p1` is the duplicate row we should remove.

So every row satisfying both conditions should be deleted.

## Algorithm

Join the table with itself:

```sql
Person p1
JOIN Person p2
```

Match rows with the same email:

```sql
p1.email = p2.email
```

But keep only rows where `p1.id` is larger:

```sql
p1.id > p2.id
```

Then delete `p1`.

That removes all duplicate rows while preserving the smallest `id`.

## Correctness

Consider any email value.

Let the smallest `id` for that email be:

```text
min_id
```

For the row with `id = min_id`:

- There is no smaller row with the same email.
- Therefore, the condition:

```sql
p1.id > p2.id
```

is never true for that row.
- So it is never deleted.

Now consider any other row with the same email.

Its `id` is larger than `min_id`.

So there exists another row with:

- The same email
- A smaller `id`

Therefore, the row satisfies both join conditions and is deleted.

So:

- Exactly one row per email remains.
- The remaining row always has the smallest `id`.

## Complexity

Suppose the table contains `n` rows.

| Metric | Value | Why |
|---|---|---|
| Time | `O(n²)` worst case | The self join compares rows |
| Space | `O(1)` extra | Deletion happens in place |

Real database engines may optimize the join using indexes.

## Implementation

```sql
DELETE p1
FROM Person p1
JOIN Person p2
ON p1.email = p2.email
AND p1.id > p2.id;
```

## Code Explanation

The query starts with:

```sql
DELETE p1
```

This means rows from alias `p1` will be deleted.

We then join the table with itself:

```sql
FROM Person p1
JOIN Person p2
```

The condition:

```sql
p1.email = p2.email
```

matches rows having the same email.

The condition:

```sql
p1.id > p2.id
```

selects only rows whose `id` is larger.

So if two rows share the same email:

| p1.id | p2.id | Action |
|---|---|---|
| 3 | 1 | Delete row 3 |
| 1 | 3 | Do not delete row 1 |

Only the larger duplicate rows are removed.

## Testing

Create the table:

```sql
CREATE TABLE Person (
    id INT,
    email VARCHAR(255)
);
```

Insert sample data:

```sql
INSERT INTO Person (id, email) VALUES
(1, 'john@example.com'),
(2, 'bob@example.com'),
(3, 'john@example.com');
```

Before deletion:

| id | email |
|---|---|
| 1 | john@example.com |
| 2 | bob@example.com |
| 3 | john@example.com |

Run the query:

```sql
DELETE p1
FROM Person p1
JOIN Person p2
ON p1.email = p2.email
AND p1.id > p2.id;
```

After deletion:

| id | email |
|---|---|
| 1 | john@example.com |
| 2 | bob@example.com |

Another test:

Input:

| id | email |
|---|---|
| 1 | a@test.com |
| 2 | a@test.com |
| 3 | a@test.com |
| 4 | b@test.com |

Expected result:

| id | email |
|---|---|
| 1 | a@test.com |
| 4 | b@test.com |