A clear explanation of the Delete Duplicate Emails SQL problem using DELETE with a self join.
Problem Restatement
We are given a table named Person.
Schema:
| Column | Type |
|---|---|
id | int |
email | varchar |
The id column is unique.
Some rows may contain duplicate email addresses.
We need to delete duplicate rows so that:
- Every email appears only once.
- The row with the smallest
idis kept. - All other duplicate rows are removed.
Example table:
| id | |
|---|---|
| 1 | [email protected] |
| 2 | [email protected] |
| 3 | [email protected] |
After deletion:
| id | |
|---|---|
| 1 | [email protected] |
| 2 | [email protected] |
The row with id = 3 is deleted because its email already exists with a smaller id.
Input and Output
| Item | Meaning |
|---|---|
| Input | Table Person |
| Output | Delete duplicate email rows |
| Keep | Smallest id for each email |
| Database | MySQL |
Examples
Initial table:
| id | |
|---|---|
| 1 | [email protected] |
| 2 | [email protected] |
| 3 | [email protected] |
The email [email protected] appears twice.
We keep:
| id | |
|---|---|
| 1 | [email protected] |
and delete:
| id | |
|---|---|
| 3 | [email protected] |
because 1 < 3.
Another example:
| id | |
|---|---|
| 1 | [email protected] |
| 2 | [email protected] |
| 3 | [email protected] |
We keep only:
| id | |
|---|---|
| 1 | [email protected] |
and delete rows 2 and 3.
First Thought
We need to identify rows that:
- Have the same email as another row.
- Have a larger
id.
Those larger rows are duplicates and should be deleted.
This suggests comparing rows against other rows in the same table.
That is exactly what a self join does.
Key Insight
Suppose we compare the table with itself.
Call the first copy p1.
Call the second copy p2.
If:
p1.email = p2.emailthen both rows represent the same email.
If:
p1.id > p2.idthen p1 is the duplicate row we should remove.
So every row satisfying both conditions should be deleted.
Algorithm
Join the table with itself:
Person p1
JOIN Person p2Match rows with the same email:
p1.email = p2.emailBut keep only rows where p1.id is larger:
p1.id > p2.idThen delete p1.
That removes all duplicate rows while preserving the smallest id.
Correctness
Consider any email value.
Let the smallest id for that email be:
min_idFor the row with id = min_id:
- There is no smaller row with the same email.
- Therefore, the condition:
p1.id > p2.idis never true for that row.
- So it is never deleted.
Now consider any other row with the same email.
Its id is larger than min_id.
So there exists another row with:
- The same email
- A smaller
id
Therefore, the row satisfies both join conditions and is deleted.
So:
- Exactly one row per email remains.
- The remaining row always has the smallest
id.
Complexity
Suppose the table contains n rows.
| Metric | Value | Why |
|---|---|---|
| Time | O(n²) worst case | The self join compares rows |
| Space | O(1) extra | Deletion happens in place |
Real database engines may optimize the join using indexes.
Implementation
DELETE p1
FROM Person p1
JOIN Person p2
ON p1.email = p2.email
AND p1.id > p2.id;Code Explanation
The query starts with:
DELETE p1This means rows from alias p1 will be deleted.
We then join the table with itself:
FROM Person p1
JOIN Person p2The condition:
p1.email = p2.emailmatches rows having the same email.
The condition:
p1.id > p2.idselects only rows whose id is larger.
So if two rows share the same email:
| p1.id | p2.id | Action |
|---|---|---|
| 3 | 1 | Delete row 3 |
| 1 | 3 | Do not delete row 1 |
Only the larger duplicate rows are removed.
Testing
Create the table:
CREATE TABLE Person (
id INT,
email VARCHAR(255)
);Insert sample data:
INSERT INTO Person (id, email) VALUES
(1, '[email protected]'),
(2, '[email protected]'),
(3, '[email protected]');Before deletion:
| id | |
|---|---|
| 1 | [email protected] |
| 2 | [email protected] |
| 3 | [email protected] |
Run the query:
DELETE p1
FROM Person p1
JOIN Person p2
ON p1.email = p2.email
AND p1.id > p2.id;After deletion:
| id | |
|---|---|
| 1 | [email protected] |
| 2 | [email protected] |
Another test:
Input:
| id | |
|---|---|
| 1 | [email protected] |
| 2 | [email protected] |
| 3 | [email protected] |
| 4 | [email protected] |
Expected result:
| id | |
|---|---|
| 1 | [email protected] |
| 4 | [email protected] |