20.18 Building a Plagiarism Detector
Design a plagiarism detector that compares documents and reports suspicious similarity.
20.18 Building a Plagiarism Detector
Problem
Design a plagiarism detector that compares documents and reports suspicious similarity.
Plagiarism detection appears in education platforms, publishing systems, code review tools, legal workflows, hiring assessments, and content moderation systems. The input is a collection of documents. The output should identify pairs or passages that are unusually similar.
A naive implementation compares every document against every other document character by character. That approach is slow, noisy, and poor at detecting lightly edited text.
The practical goal is narrower:
Find documents that share substantial overlapping content.
The algorithmic problem is similarity search over text. The engineering problem is reducing the number of comparisons while keeping enough evidence to explain the result.
Solution
Use shingling and hashing.
A shingle is a contiguous sequence of tokens. For example, with word shingles of size 4:
the quick brown fox jumps over the lazy dog
produces:
the quick brown fox
quick brown fox jumps
brown fox jumps over
fox jumps over the
jumps over the lazy
over the lazy dog
If two documents share many shingles, they likely share text.
Tokenizing Documents
Start with normalized word tokens.
import re
from collections import defaultdict, Counter
import hashlib
TOKEN_RE = re.compile(r"[a-z0-9]+")
def tokenize(text):
return TOKEN_RE.findall(text.lower())
Example:
text = "The quick brown fox jumps over the lazy dog."
print(tokenize(text))
Output:
['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
Normalization makes comparison more stable. Case and punctuation differences should not hide copied text.
Building Shingles
Generate fixed-length token windows.
def shingles(tokens, size=5):
if len(tokens) < size:
return []
result = []
for i in range(len(tokens) - size + 1):
result.append(tuple(tokens[i:i + size]))
return result
Example:
tokens = tokenize("the quick brown fox jumps over the lazy dog")
print(shingles(tokens, size=4))
Output:
[
('the', 'quick', 'brown', 'fox'),
('quick', 'brown', 'fox', 'jumps'),
('brown', 'fox', 'jumps', 'over'),
('fox', 'jumps', 'over', 'the'),
('jumps', 'over', 'the', 'lazy'),
('over', 'the', 'lazy', 'dog')
]
Small shingles catch shorter overlaps but create more false positives. Larger shingles are more precise but miss heavily edited passages.
Hashing Shingles
Store hashes instead of full shingle tuples.
def hash_shingle(shingle):
text = "\x1f".join(shingle)
digest = hashlib.sha256(text.encode("utf-8")).digest()
return int.from_bytes(digest[:8], "little")
Build a document signature as a set of shingle hashes.
def shingle_hashes(text, size=5):
tokens = tokenize(text)
return {
hash_shingle(shingle)
for shingle in shingles(tokens, size)
}
A set removes duplicate shingles inside the same document. That is useful for measuring overlap, although frequency can be useful later for passage-level evidence.
Measuring Similarity
Use Jaccard similarity:
intersection size / union size
Implementation:
def jaccard(left, right):
if not left and not right:
return 0.0
return len(left & right) / len(left | right)
Example:
a = shingle_hashes(
"the quick brown fox jumps over the lazy dog",
size=4,
)
b = shingle_hashes(
"the quick brown fox jumps over the sleepy dog",
size=4,
)
print(jaccard(a, b))
The score will be less than 1.0 because some shingles differ, but the shared prefix creates overlap.
Comparing All Pairs
A direct pairwise comparison is simple.
def compare_all(documents, shingle_size=5, threshold=0.3):
signatures = {
document_id: shingle_hashes(text, shingle_size)
for document_id, text in documents.items()
}
ids = list(documents)
matches = []
for i, left_id in enumerate(ids):
for right_id in ids[i + 1:]:
score = jaccard(
signatures[left_id],
signatures[right_id],
)
if score >= threshold:
matches.append({
"left": left_id,
"right": right_id,
"score": score,
})
return matches
This is fine for hundreds of documents. It fails for millions.
Complexity:
O(n² × s)
where n is the number of documents and s is the average signature size.
Candidate Generation with an Inverted Index
Avoid comparing every pair. Build an inverted index from shingle hash to documents containing that shingle.
def build_shingle_index(signatures):
index = defaultdict(list)
for document_id, signature in signatures.items():
for shingle_hash in signature:
index[shingle_hash].append(document_id)
return dict(index)
Generate candidate pairs that share at least one shingle.
def candidate_pairs(index):
counts = Counter()
for document_ids in index.values():
document_ids = sorted(document_ids)
for i, left in enumerate(document_ids):
for right in document_ids[i + 1:]:
counts[(left, right)] += 1
return counts
Now compare only pairs with shared shingles.
def detect_plagiarism(documents, shingle_size=5, threshold=0.3):
signatures = {
document_id: shingle_hashes(text, shingle_size)
for document_id, text in documents.items()
}
index = build_shingle_index(signatures)
candidates = candidate_pairs(index)
matches = []
for left_id, right_id in candidates:
score = jaccard(
signatures[left_id],
signatures[right_id],
)
if score >= threshold:
matches.append({
"left": left_id,
"right": right_id,
"score": score,
"shared_shingles": candidates[(left_id, right_id)],
})
matches.sort(key=lambda item: item["score"], reverse=True)
return matches
This is the same pattern used in search systems: cheap candidate generation first, more expensive scoring second.
Filtering Common Shingles
Some shingles are common and uninformative.
Example:
in conclusion this essay
as a result of
the purpose of this
If a shingle appears in thousands of documents, it creates too many candidate pairs and adds little evidence.
Filter shingles by document frequency.
def filter_common_shingles(index, max_document_frequency):
return {
shingle_hash: document_ids
for shingle_hash, document_ids in index.items()
if len(document_ids) <= max_document_frequency
}
Use it before generating candidates:
index = filter_common_shingles(
index,
max_document_frequency=100,
)
This improves both performance and precision.
MinHash
For very large corpora, even storing and comparing all shingle sets can be expensive. MinHash compresses a large set into a small signature that approximates Jaccard similarity.
The key property is:
P(minhash(A) == minhash(B)) = Jaccard(A, B)
A simple MinHash implementation uses multiple salted hash functions.
def minhash_signature(values, num_hashes=64):
if not values:
return [0] * num_hashes
signature = []
for seed in range(num_hashes):
minimum = None
for value in values:
data = f"{seed}:{value}".encode("utf-8")
digest = hashlib.sha256(data).digest()
hashed = int.from_bytes(digest[:8], "little")
if minimum is None or hashed < minimum:
minimum = hashed
signature.append(minimum)
return signature
Estimate similarity:
def minhash_similarity(left_signature, right_signature):
matches = 0
for left, right in zip(left_signature, right_signature):
if left == right:
matches += 1
return matches / len(left_signature)
MinHash trades exactness for compactness. It is especially useful when documents have many shingles.
Locality-Sensitive Hashing
Locality-sensitive hashing, or LSH, uses MinHash signatures to find likely similar documents without comparing every pair.
Split each signature into bands.
def lsh_buckets(signatures, bands=16):
buckets = defaultdict(list)
for document_id, signature in signatures.items():
rows_per_band = len(signature) // bands
for band in range(bands):
start = band * rows_per_band
end = start + rows_per_band
band_values = tuple(signature[start:end])
buckets[(band, band_values)].append(document_id)
return dict(buckets)
Documents that share an entire band become candidates.
def lsh_candidate_pairs(signatures, bands=16):
buckets = lsh_buckets(signatures, bands)
pairs = set()
for document_ids in buckets.values():
document_ids = sorted(document_ids)
for i, left in enumerate(document_ids):
for right in document_ids[i + 1:]:
pairs.add((left, right))
return pairs
LSH gives a tunable tradeoff. More bands increase recall. Fewer bands increase precision.
Passage-Level Evidence
A plagiarism detector should not only report a score. It should show matching passages.
Store the original shingles and their positions.
def positioned_shingles(text, size=5):
tokens = tokenize(text)
result = defaultdict(list)
for i, shingle in enumerate(shingles(tokens, size)):
key = hash_shingle(shingle)
result[key].append(i)
return result
Find shared positions:
def shared_positions(left_text, right_text, size=5):
left = positioned_shingles(left_text, size)
right = positioned_shingles(right_text, size)
shared = []
for key in left.keys() & right.keys():
for left_pos in left[key]:
for right_pos in right[key]:
shared.append((left_pos, right_pos))
return sorted(shared)
Shared positions can be merged into longer matching regions.
Merging Adjacent Matches
If consecutive shingles match, they likely represent a longer copied passage.
def merge_adjacent_matches(matches):
if not matches:
return []
matches = sorted(matches)
regions = []
start_left, start_right = matches[0]
end_left, end_right = matches[0]
for left, right in matches[1:]:
if left == end_left + 1 and right == end_right + 1:
end_left = left
end_right = right
else:
regions.append(
(start_left, end_left, start_right, end_right)
)
start_left, start_right = left, right
end_left, end_right = left, right
regions.append(
(start_left, end_left, start_right, end_right)
)
return regions
This converts scattered shingle matches into evidence blocks. A reviewer can inspect these blocks directly.
Handling Paraphrases
Shingle matching detects copied or lightly edited text. It performs poorly on paraphrases.
Original:
Dijkstra's algorithm computes shortest paths with non-negative edge weights.
Paraphrase:
For graphs without negative edges, Dijkstra finds minimum-cost routes.
The meaning is similar, but exact shingle overlap is low.
Possible extensions:
| Technique | Use |
|---|---|
| Stemming | Match related word forms |
| Synonym expansion | Catch simple word substitutions |
| Character shingles | Catch spelling changes |
| Embeddings | Detect semantic similarity |
| Citation-aware checks | Ignore quoted or cited text |
| Code AST comparison | Detect renamed variables in code |
Semantic methods are useful but less explainable. A production detector often combines exact overlap evidence with semantic similarity scores.
Code Plagiarism
Source code requires different normalization.
Common transformations:
- Rename variables.
- Reformat whitespace.
- Reorder helper functions.
- Change comments.
- Replace loops with equivalent constructs.
A code-oriented detector should tokenize by language syntax, not plain words.
Example normalization:
for i in range(n):
total += a[i]
can become token categories:
FOR IDENT IN IDENT LPAREN IDENT RPAREN IDENT PLUS_EQ IDENT LBRACKET IDENT RBRACKET
For stronger detection, parse code into ASTs and compare structure. This reduces sensitivity to formatting and identifier renaming.
Thresholds
A similarity score needs interpretation.
Example policy:
| Score | Interpretation |
|---|---|
| 0.00 to 0.10 | Likely unrelated |
| 0.10 to 0.30 | Weak overlap |
| 0.30 to 0.60 | Review recommended |
| 0.60 to 1.00 | Strong similarity |
These thresholds are workload-dependent. Short documents naturally produce unstable scores. Common templates can inflate similarity. Technical assignments with shared starter code require exclusion rules.
Always tune thresholds against labeled examples.
Exclusions
A good plagiarism detector excludes known shared material.
Examples:
- Assignment prompt.
- Starter code.
- Boilerplate headers.
- Legal notices.
- Bibliography.
- Quoted passages.
- Common templates.
Implement exclusions by subtracting known shingle hashes.
def remove_excluded_shingles(signature, excluded):
return signature - excluded
Build the exclusion set from trusted template documents.
excluded = shingle_hashes(template_text, size=5)
student_signature = remove_excluded_shingles(
shingle_hashes(student_text, size=5),
excluded,
)
Exclusions reduce false positives dramatically.
Complexity
For exact shingle indexing:
| Operation | Complexity |
|---|---|
| Tokenize documents | O(total tokens) |
| Build shingles | O(total tokens) |
| Build inverted index | O(total shingles) |
| Candidate generation | Depends on common shingles |
| Exact pair scoring | O(candidate pairs × signature size) |
For MinHash:
| Operation | Complexity |
|---|---|
| Build signatures | O(total shingles × num_hashes) |
| LSH bucketing | O(documents × signature length) |
| Candidate scoring | O(candidate pairs × signature length) |
Exact shingle indexing is simpler. MinHash and LSH become valuable when the corpus is large enough that exact candidate generation is too expensive.
Testing
Test identical documents.
def test_identical_documents_match():
documents = {
"a": "the quick brown fox jumps over the lazy dog",
"b": "the quick brown fox jumps over the lazy dog",
}
matches = detect_plagiarism(
documents,
shingle_size=4,
threshold=0.9,
)
assert len(matches) == 1
assert matches[0]["left"] == "a"
assert matches[0]["right"] == "b"
Test unrelated documents.
def test_unrelated_documents_do_not_match():
documents = {
"a": "graph algorithms and shortest paths",
"b": "baking bread with flour and water",
}
matches = detect_plagiarism(
documents,
shingle_size=3,
threshold=0.3,
)
assert matches == []
Test template exclusion.
def test_template_exclusion_reduces_similarity():
template = "this assignment asks you to implement a graph algorithm"
left = template + " using dijkstra with a priority queue"
right = template + " using dynamic programming"
excluded = shingle_hashes(template, size=4)
left_sig = remove_excluded_shingles(
shingle_hashes(left, size=4),
excluded,
)
right_sig = remove_excluded_shingles(
shingle_hashes(right, size=4),
excluded,
)
assert jaccard(left_sig, right_sig) < 0.5
These tests verify exact overlap, non-overlap, and false-positive reduction.
Common Bugs
The most common bug is comparing raw strings. Punctuation, case, and formatting differences hide obvious matches.
Another common bug is choosing shingles that are too small. Small shingles create many accidental overlaps.
Common boilerplate can dominate similarity scores. Exclude templates and shared prompts.
Pairwise comparison does not scale. Use an inverted index, MinHash, or LSH.
Short documents are hard to score reliably because a few shared shingles can produce high similarity.
Hash collisions are rare with strong hashes, but weak hashes can create false evidence.
A detector that reports only a score is hard to trust. Always include matching passages or shared regions.
Recipe
Build the detector in layers.
Normalize text into tokens. Generate word shingles. Hash shingles into document signatures. Use Jaccard similarity for exact comparison. Add a shingle inverted index to reduce candidate pairs. Filter common shingles. Add MinHash and LSH when the corpus becomes large. Store shingle positions so you can report matching passages. Exclude templates, boilerplate, and quoted material. Tune thresholds against real examples.
The main lesson is that plagiarism detection is similarity search with evidence. Hashes and indexes make search efficient, but the result is only useful when the system can show why two documents were flagged.