20.18 Building a Plagiarism Detector

Problem

Design a plagiarism detector that compares documents and reports suspicious similarity.

Plagiarism detection appears in education platforms, publishing systems, code review tools, legal workflows, hiring assessments, and content moderation systems. The input is a collection of documents. The output should identify pairs or passages that are unusually similar.

A naive implementation compares every document against every other document character by character. That approach is slow, noisy, and poor at detecting lightly edited text.

The practical goal is narrower:

Find documents that share substantial overlapping content.

The algorithmic problem is similarity search over text. The engineering problem is reducing the number of comparisons while keeping enough evidence to explain the result.

Solution

Use shingling and hashing.

A shingle is a contiguous sequence of tokens. For example, with word shingles of size 4:

the quick brown fox jumps over the lazy dog

produces:

the quick brown fox
quick brown fox jumps
brown fox jumps over
fox jumps over the
jumps over the lazy
over the lazy dog

If two documents share many shingles, they likely share text.

Tokenizing Documents

Start with normalized word tokens.

import re
from collections import defaultdict, Counter
import hashlib

TOKEN_RE = re.compile(r"[a-z0-9]+")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

Example:

text = "The quick brown fox jumps over the lazy dog."

print(tokenize(text))

Output:

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

Normalization makes comparison more stable. Case and punctuation differences should not hide copied text.

Building Shingles

Generate fixed-length token windows.

def shingles(tokens, size=5):
    if len(tokens) < size:
        return []

    result = []

    for i in range(len(tokens) - size + 1):
        result.append(tuple(tokens[i:i + size]))

    return result

Example:

tokens = tokenize("the quick brown fox jumps over the lazy dog")

print(shingles(tokens, size=4))

Output:

[
  ('the', 'quick', 'brown', 'fox'),
  ('quick', 'brown', 'fox', 'jumps'),
  ('brown', 'fox', 'jumps', 'over'),
  ('fox', 'jumps', 'over', 'the'),
  ('jumps', 'over', 'the', 'lazy'),
  ('over', 'the', 'lazy', 'dog')
]

Small shingles catch shorter overlaps but create more false positives. Larger shingles are more precise but miss heavily edited passages.

Hashing Shingles

Store hashes instead of full shingle tuples.

def hash_shingle(shingle):
    text = "\x1f".join(shingle)
    digest = hashlib.sha256(text.encode("utf-8")).digest()

    return int.from_bytes(digest[:8], "little")

Build a document signature as a set of shingle hashes.

def shingle_hashes(text, size=5):
    tokens = tokenize(text)

    return {
        hash_shingle(shingle)
        for shingle in shingles(tokens, size)
    }

A set removes duplicate shingles inside the same document. That is useful for measuring overlap, although frequency can be useful later for passage-level evidence.

Measuring Similarity

Use Jaccard similarity:

intersection size / union size

Implementation:

def jaccard(left, right):
    if not left and not right:
        return 0.0

    return len(left & right) / len(left | right)

Example:

a = shingle_hashes(
    "the quick brown fox jumps over the lazy dog",
    size=4,
)

b = shingle_hashes(
    "the quick brown fox jumps over the sleepy dog",
    size=4,
)

print(jaccard(a, b))

The score will be less than 1.0 because some shingles differ, but the shared prefix creates overlap.

Comparing All Pairs

A direct pairwise comparison is simple.

def compare_all(documents, shingle_size=5, threshold=0.3):
    signatures = {
        document_id: shingle_hashes(text, shingle_size)
        for document_id, text in documents.items()
    }

    ids = list(documents)
    matches = []

    for i, left_id in enumerate(ids):
        for right_id in ids[i + 1:]:
            score = jaccard(
                signatures[left_id],
                signatures[right_id],
            )

            if score >= threshold:
                matches.append({
                    "left": left_id,
                    "right": right_id,
                    "score": score,
                })

    return matches

This is fine for hundreds of documents. It fails for millions.

Complexity:

O(n² × s)

where n is the number of documents and s is the average signature size.

Candidate Generation with an Inverted Index

Avoid comparing every pair. Build an inverted index from shingle hash to documents containing that shingle.

def build_shingle_index(signatures):
    index = defaultdict(list)

    for document_id, signature in signatures.items():
        for shingle_hash in signature:
            index[shingle_hash].append(document_id)

    return dict(index)

Generate candidate pairs that share at least one shingle.

def candidate_pairs(index):
    counts = Counter()

    for document_ids in index.values():
        document_ids = sorted(document_ids)

        for i, left in enumerate(document_ids):
            for right in document_ids[i + 1:]:
                counts[(left, right)] += 1

    return counts

Now compare only pairs with shared shingles.

def detect_plagiarism(documents, shingle_size=5, threshold=0.3):
    signatures = {
        document_id: shingle_hashes(text, shingle_size)
        for document_id, text in documents.items()
    }

    index = build_shingle_index(signatures)
    candidates = candidate_pairs(index)
    matches = []

    for left_id, right_id in candidates:
        score = jaccard(
            signatures[left_id],
            signatures[right_id],
        )

        if score >= threshold:
            matches.append({
                "left": left_id,
                "right": right_id,
                "score": score,
                "shared_shingles": candidates[(left_id, right_id)],
            })

    matches.sort(key=lambda item: item["score"], reverse=True)

    return matches

This is the same pattern used in search systems: cheap candidate generation first, more expensive scoring second.

Filtering Common Shingles

Some shingles are common and uninformative.

Example:

in conclusion this essay
as a result of
the purpose of this

If a shingle appears in thousands of documents, it creates too many candidate pairs and adds little evidence.

Filter shingles by document frequency.

def filter_common_shingles(index, max_document_frequency):
    return {
        shingle_hash: document_ids
        for shingle_hash, document_ids in index.items()
        if len(document_ids) <= max_document_frequency
    }

Use it before generating candidates:

index = filter_common_shingles(
    index,
    max_document_frequency=100,
)

This improves both performance and precision.

MinHash

For very large corpora, even storing and comparing all shingle sets can be expensive. MinHash compresses a large set into a small signature that approximates Jaccard similarity.

The key property is:

P(minhash(A) == minhash(B)) = Jaccard(A, B)

A simple MinHash implementation uses multiple salted hash functions.

def minhash_signature(values, num_hashes=64):
    if not values:
        return [0] * num_hashes

    signature = []

    for seed in range(num_hashes):
        minimum = None

        for value in values:
            data = f"{seed}:{value}".encode("utf-8")
            digest = hashlib.sha256(data).digest()
            hashed = int.from_bytes(digest[:8], "little")

            if minimum is None or hashed < minimum:
                minimum = hashed

        signature.append(minimum)

    return signature

Estimate similarity:

def minhash_similarity(left_signature, right_signature):
    matches = 0

    for left, right in zip(left_signature, right_signature):
        if left == right:
            matches += 1

    return matches / len(left_signature)

MinHash trades exactness for compactness. It is especially useful when documents have many shingles.

Locality-Sensitive Hashing

Locality-sensitive hashing, or LSH, uses MinHash signatures to find likely similar documents without comparing every pair.

Split each signature into bands.

def lsh_buckets(signatures, bands=16):
    buckets = defaultdict(list)

    for document_id, signature in signatures.items():
        rows_per_band = len(signature) // bands

        for band in range(bands):
            start = band * rows_per_band
            end = start + rows_per_band
            band_values = tuple(signature[start:end])

            buckets[(band, band_values)].append(document_id)

    return dict(buckets)

Documents that share an entire band become candidates.

def lsh_candidate_pairs(signatures, bands=16):
    buckets = lsh_buckets(signatures, bands)
    pairs = set()

    for document_ids in buckets.values():
        document_ids = sorted(document_ids)

        for i, left in enumerate(document_ids):
            for right in document_ids[i + 1:]:
                pairs.add((left, right))

    return pairs

LSH gives a tunable tradeoff. More bands increase recall. Fewer bands increase precision.

Passage-Level Evidence

A plagiarism detector should not only report a score. It should show matching passages.

Store the original shingles and their positions.

def positioned_shingles(text, size=5):
    tokens = tokenize(text)
    result = defaultdict(list)

    for i, shingle in enumerate(shingles(tokens, size)):
        key = hash_shingle(shingle)
        result[key].append(i)

    return result

Find shared positions:

def shared_positions(left_text, right_text, size=5):
    left = positioned_shingles(left_text, size)
    right = positioned_shingles(right_text, size)

    shared = []

    for key in left.keys() & right.keys():
        for left_pos in left[key]:
            for right_pos in right[key]:
                shared.append((left_pos, right_pos))

    return sorted(shared)

Shared positions can be merged into longer matching regions.

Merging Adjacent Matches

If consecutive shingles match, they likely represent a longer copied passage.

def merge_adjacent_matches(matches):
    if not matches:
        return []

    matches = sorted(matches)
    regions = []

    start_left, start_right = matches[0]
    end_left, end_right = matches[0]

    for left, right in matches[1:]:
        if left == end_left + 1 and right == end_right + 1:
            end_left = left
            end_right = right
        else:
            regions.append(
                (start_left, end_left, start_right, end_right)
            )
            start_left, start_right = left, right
            end_left, end_right = left, right

    regions.append(
        (start_left, end_left, start_right, end_right)
    )

    return regions

This converts scattered shingle matches into evidence blocks. A reviewer can inspect these blocks directly.

Handling Paraphrases

Shingle matching detects copied or lightly edited text. It performs poorly on paraphrases.

Original:

Dijkstra's algorithm computes shortest paths with non-negative edge weights.

Paraphrase:

For graphs without negative edges, Dijkstra finds minimum-cost routes.

The meaning is similar, but exact shingle overlap is low.

Possible extensions:

Technique	Use
Stemming	Match related word forms
Synonym expansion	Catch simple word substitutions
Character shingles	Catch spelling changes
Embeddings	Detect semantic similarity
Citation-aware checks	Ignore quoted or cited text
Code AST comparison	Detect renamed variables in code

Semantic methods are useful but less explainable. A production detector often combines exact overlap evidence with semantic similarity scores.

Code Plagiarism

Source code requires different normalization.

Common transformations:

Rename variables.
Reformat whitespace.
Reorder helper functions.
Change comments.
Replace loops with equivalent constructs.

A code-oriented detector should tokenize by language syntax, not plain words.

Example normalization:

for i in range(n):
    total += a[i]

can become token categories:

FOR IDENT IN IDENT LPAREN IDENT RPAREN IDENT PLUS_EQ IDENT LBRACKET IDENT RBRACKET

For stronger detection, parse code into ASTs and compare structure. This reduces sensitivity to formatting and identifier renaming.

Thresholds

A similarity score needs interpretation.

Example policy:

Score	Interpretation
0.00 to 0.10	Likely unrelated
0.10 to 0.30	Weak overlap
0.30 to 0.60	Review recommended
0.60 to 1.00	Strong similarity

These thresholds are workload-dependent. Short documents naturally produce unstable scores. Common templates can inflate similarity. Technical assignments with shared starter code require exclusion rules.

Always tune thresholds against labeled examples.

Exclusions

A good plagiarism detector excludes known shared material.

Examples:

Assignment prompt.
Starter code.
Boilerplate headers.
Legal notices.
Bibliography.
Quoted passages.
Common templates.

Implement exclusions by subtracting known shingle hashes.

def remove_excluded_shingles(signature, excluded):
    return signature - excluded

Build the exclusion set from trusted template documents.

excluded = shingle_hashes(template_text, size=5)

student_signature = remove_excluded_shingles(
    shingle_hashes(student_text, size=5),
    excluded,
)

Exclusions reduce false positives dramatically.

Complexity

For exact shingle indexing:

Operation	Complexity
Tokenize documents	O(total tokens)
Build shingles	O(total tokens)
Build inverted index	O(total shingles)
Candidate generation	Depends on common shingles
Exact pair scoring	O(candidate pairs × signature size)

For MinHash:

Operation	Complexity
Build signatures	O(total shingles × num_hashes)
LSH bucketing	O(documents × signature length)
Candidate scoring	O(candidate pairs × signature length)

Exact shingle indexing is simpler. MinHash and LSH become valuable when the corpus is large enough that exact candidate generation is too expensive.

Testing

Test identical documents.

def test_identical_documents_match():
    documents = {
        "a": "the quick brown fox jumps over the lazy dog",
        "b": "the quick brown fox jumps over the lazy dog",
    }

    matches = detect_plagiarism(
        documents,
        shingle_size=4,
        threshold=0.9,
    )

    assert len(matches) == 1
    assert matches[0]["left"] == "a"
    assert matches[0]["right"] == "b"

Test unrelated documents.

def test_unrelated_documents_do_not_match():
    documents = {
        "a": "graph algorithms and shortest paths",
        "b": "baking bread with flour and water",
    }

    matches = detect_plagiarism(
        documents,
        shingle_size=3,
        threshold=0.3,
    )

    assert matches == []

Test template exclusion.

def test_template_exclusion_reduces_similarity():
    template = "this assignment asks you to implement a graph algorithm"

    left = template + " using dijkstra with a priority queue"
    right = template + " using dynamic programming"

    excluded = shingle_hashes(template, size=4)

    left_sig = remove_excluded_shingles(
        shingle_hashes(left, size=4),
        excluded,
    )

    right_sig = remove_excluded_shingles(
        shingle_hashes(right, size=4),
        excluded,
    )

    assert jaccard(left_sig, right_sig) < 0.5

These tests verify exact overlap, non-overlap, and false-positive reduction.

Common Bugs

The most common bug is comparing raw strings. Punctuation, case, and formatting differences hide obvious matches.

Another common bug is choosing shingles that are too small. Small shingles create many accidental overlaps.

Common boilerplate can dominate similarity scores. Exclude templates and shared prompts.

Pairwise comparison does not scale. Use an inverted index, MinHash, or LSH.

Short documents are hard to score reliably because a few shared shingles can produce high similarity.

Hash collisions are rare with strong hashes, but weak hashes can create false evidence.

A detector that reports only a score is hard to trust. Always include matching passages or shared regions.

Recipe

Build the detector in layers.

Normalize text into tokens. Generate word shingles. Hash shingles into document signatures. Use Jaccard similarity for exact comparison. Add a shingle inverted index to reduce candidate pairs. Filter common shingles. Add MinHash and LSH when the corpus becomes large. Store shingle positions so you can report matching passages. Exclude templates, boilerplate, and quoted material. Tune thresholds against real examples.

The main lesson is that plagiarism detection is similarity search with evidence. Hashes and indexes make search efficient, but the result is only useful when the system can show why two documents were flagged.