Skip to content

LeetCode 196: Delete Duplicate Emails

A clear explanation of the Delete Duplicate Emails SQL problem using DELETE with a self join.

Problem Restatement

We are given a table named Person.

Schema:

ColumnType
idint
emailvarchar

The id column is unique.

Some rows may contain duplicate email addresses.

We need to delete duplicate rows so that:

  • Every email appears only once.
  • The row with the smallest id is kept.
  • All other duplicate rows are removed.

Example table:

idemail
1[email protected]
2[email protected]
3[email protected]

After deletion:

idemail
1[email protected]
2[email protected]

The row with id = 3 is deleted because its email already exists with a smaller id.

Input and Output

ItemMeaning
InputTable Person
OutputDelete duplicate email rows
KeepSmallest id for each email
DatabaseMySQL

Examples

Initial table:

idemail
1[email protected]
2[email protected]
3[email protected]

The email [email protected] appears twice.

We keep:

idemail
1[email protected]

and delete:

idemail
3[email protected]

because 1 < 3.

Another example:

idemail
1[email protected]
2[email protected]
3[email protected]

We keep only:

idemail
1[email protected]

and delete rows 2 and 3.

First Thought

We need to identify rows that:

  1. Have the same email as another row.
  2. Have a larger id.

Those larger rows are duplicates and should be deleted.

This suggests comparing rows against other rows in the same table.

That is exactly what a self join does.

Key Insight

Suppose we compare the table with itself.

Call the first copy p1.

Call the second copy p2.

If:

p1.email = p2.email

then both rows represent the same email.

If:

p1.id > p2.id

then p1 is the duplicate row we should remove.

So every row satisfying both conditions should be deleted.

Algorithm

Join the table with itself:

Person p1
JOIN Person p2

Match rows with the same email:

p1.email = p2.email

But keep only rows where p1.id is larger:

p1.id > p2.id

Then delete p1.

That removes all duplicate rows while preserving the smallest id.

Correctness

Consider any email value.

Let the smallest id for that email be:

min_id

For the row with id = min_id:

  • There is no smaller row with the same email.
  • Therefore, the condition:
p1.id > p2.id

is never true for that row.

  • So it is never deleted.

Now consider any other row with the same email.

Its id is larger than min_id.

So there exists another row with:

  • The same email
  • A smaller id

Therefore, the row satisfies both join conditions and is deleted.

So:

  • Exactly one row per email remains.
  • The remaining row always has the smallest id.

Complexity

Suppose the table contains n rows.

MetricValueWhy
TimeO(n²) worst caseThe self join compares rows
SpaceO(1) extraDeletion happens in place

Real database engines may optimize the join using indexes.

Implementation

DELETE p1
FROM Person p1
JOIN Person p2
ON p1.email = p2.email
AND p1.id > p2.id;

Code Explanation

The query starts with:

DELETE p1

This means rows from alias p1 will be deleted.

We then join the table with itself:

FROM Person p1
JOIN Person p2

The condition:

p1.email = p2.email

matches rows having the same email.

The condition:

p1.id > p2.id

selects only rows whose id is larger.

So if two rows share the same email:

p1.idp2.idAction
31Delete row 3
13Do not delete row 1

Only the larger duplicate rows are removed.

Testing

Create the table:

CREATE TABLE Person (
    id INT,
    email VARCHAR(255)
);

Insert sample data:

INSERT INTO Person (id, email) VALUES
(1, '[email protected]'),
(2, '[email protected]'),
(3, '[email protected]');

Before deletion:

idemail
1[email protected]
2[email protected]
3[email protected]

Run the query:

DELETE p1
FROM Person p1
JOIN Person p2
ON p1.email = p2.email
AND p1.id > p2.id;

After deletion:

idemail
1[email protected]
2[email protected]

Another test:

Input:

idemail
1[email protected]
2[email protected]
3[email protected]
4[email protected]

Expected result:

idemail
1[email protected]
4[email protected]