LeetCode 196: Delete Duplicate Emails

Problem Restatement

We are given a table named Person.

Schema:

Column	Type
`id`	int
`email`	varchar

The id column is unique.

Some rows may contain duplicate email addresses.

We need to delete duplicate rows so that:

Every email appears only once.
The row with the smallest id is kept.
All other duplicate rows are removed.

Example table:

id	email
1	[email protected]
2	[email protected]
3	[email protected]

After deletion:

id	email
1	[email protected]
2	[email protected]

The row with id = 3 is deleted because its email already exists with a smaller id.

Input and Output

Item	Meaning
Input	Table `Person`
Output	Delete duplicate email rows
Keep	Smallest `id` for each email
Database	MySQL

Examples

Initial table:

id	email
1	[email protected]
2	[email protected]
3	[email protected]

The email [email protected] appears twice.

We keep:

id	email
1	[email protected]

and delete:

id	email
3	[email protected]

because 1 < 3.

Another example:

id	email
1	[email protected]
2	[email protected]
3	[email protected]

We keep only:

id	email
1	[email protected]

and delete rows 2 and 3.

First Thought

We need to identify rows that:

Have the same email as another row.
Have a larger id.

Those larger rows are duplicates and should be deleted.

This suggests comparing rows against other rows in the same table.

That is exactly what a self join does.

Key Insight

Suppose we compare the table with itself.

Call the first copy p1.

Call the second copy p2.

If:

p1.email = p2.email

then both rows represent the same email.

If:

p1.id > p2.id

then p1 is the duplicate row we should remove.

So every row satisfying both conditions should be deleted.

Algorithm

Join the table with itself:

Person p1
JOIN Person p2

Match rows with the same email:

p1.email = p2.email

But keep only rows where p1.id is larger:

p1.id > p2.id

Then delete p1.

That removes all duplicate rows while preserving the smallest id.

Correctness

Consider any email value.

Let the smallest id for that email be:

min_id

For the row with id = min_id:

There is no smaller row with the same email.
Therefore, the condition:

p1.id > p2.id

is never true for that row.

So it is never deleted.

Now consider any other row with the same email.

Its id is larger than min_id.

So there exists another row with:

The same email
A smaller id

Therefore, the row satisfies both join conditions and is deleted.

So:

Exactly one row per email remains.
The remaining row always has the smallest id.

Complexity

Suppose the table contains n rows.

Metric	Value	Why
Time	`O(n²)` worst case	The self join compares rows
Space	`O(1)` extra	Deletion happens in place

Real database engines may optimize the join using indexes.

Implementation

DELETE p1
FROM Person p1
JOIN Person p2
ON p1.email = p2.email
AND p1.id > p2.id;

Code Explanation

The query starts with:

DELETE p1

This means rows from alias p1 will be deleted.

We then join the table with itself:

FROM Person p1
JOIN Person p2

The condition:

p1.email = p2.email

matches rows having the same email.

The condition:

p1.id > p2.id

selects only rows whose id is larger.

So if two rows share the same email:

p1.id	p2.id	Action
3	1	Delete row 3
1	3	Do not delete row 1

Only the larger duplicate rows are removed.

Testing

Create the table:

CREATE TABLE Person (
    id INT,
    email VARCHAR(255)
);

Insert sample data:

INSERT INTO Person (id, email) VALUES
(1, '[email protected]'),
(2, '[email protected]'),
(3, '[email protected]');

Before deletion:

id	email
1	[email protected]
2	[email protected]
3	[email protected]

Run the query:

DELETE p1
FROM Person p1
JOIN Person p2
ON p1.email = p2.email
AND p1.id > p2.id;

After deletion:

id	email
1	[email protected]
2	[email protected]

Another test:

Input:

id	email
1	[email protected]
2	[email protected]
3	[email protected]
4	[email protected]

Expected result:

id	email
1	[email protected]
4	[email protected]