20.13 Building a Web Crawler Frontier

Design a crawler frontier that decides which URLs a web crawler should visit next.

20.13 Building a Web Crawler Frontier

Problem

Design a crawler frontier that decides which URLs a web crawler should visit next.

A crawler frontier is the scheduling core of a web crawler. It stores discovered URLs, avoids duplicates, respects host politeness, prioritizes important pages, and feeds workers with crawl tasks. The crawler may discover millions or billions of URLs, but each worker only needs one answer at a time:

What URL should I fetch next?

A naive crawler uses a single FIFO queue:

push discovered URLs
pop next URL
fetch
extract links
repeat

This works for a toy crawler, but it fails quickly in real systems. It may crawl the same URL repeatedly, overload one host, ignore high-value pages, or let unbounded queues consume memory.

The frontier needs algorithms for deduplication, priority scheduling, rate limiting, and partitioning.

Solution

Use three cooperating structures:

  1. A seen set to avoid duplicate URLs.
  2. Per-host queues to enforce politeness.
  3. A priority queue to select the next host that may be crawled.

At a high level:

discovered URL
  -> normalize URL
  -> check seen set
  -> enqueue by host
  -> schedule host by next allowed fetch time

Workers request URLs from the frontier. The frontier returns only URLs whose host is ready.

URL Normalization

The same page may appear under several syntactic forms.

https://example.com
https://example.com/
https://example.com/index.html
https://EXAMPLE.com/

Before deduplication, normalize URLs.

A minimal normalizer:

from urllib.parse import urlsplit, urlunsplit, urldefrag

def normalize_url(url):
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)

    scheme = parts.scheme.lower() or "https"
    netloc = parts.netloc.lower()
    path = parts.path or "/"

    if scheme == "http" and netloc.endswith(":80"):
        netloc = netloc[:-3]

    if scheme == "https" and netloc.endswith(":443"):
        netloc = netloc[:-4]

    return urlunsplit((
        scheme,
        netloc,
        path,
        parts.query,
        "",
    ))

This handles case, fragments, missing paths, and default ports. Production crawlers add more rules, but every crawler should normalize before deduplication.

Deduplicating URLs

Use a hash set for exact deduplication.

class SeenSet:
    def __init__(self):
        self.seen = set()

    def add_if_new(self, url):
        if url in self.seen:
            return False

        self.seen.add(url)
        return True

Use it in the frontier:

def add_url(frontier, seen, url):
    normalized = normalize_url(url)

    if not seen.add_if_new(normalized):
        return False

    frontier.enqueue(normalized)
    return True

For very large crawls, an exact set may become too large. A Bloom filter can reduce memory, but it introduces false positives. A false positive means the crawler may skip a page it has not actually seen. That is acceptable for broad discovery crawls and unacceptable for crawls that require completeness.

Extracting the Host

Politeness rules usually apply per host.

from urllib.parse import urlsplit

def host_of(url):
    return urlsplit(url).netloc.lower()

The frontier groups URLs by host:

example.com -> /a, /b, /c
docs.example.com -> /guide, /api
another.org -> /index

This grouping lets the crawler avoid hitting the same host too aggressively.

Per-Host Queues

Each host gets its own FIFO queue.

from collections import defaultdict, deque

class HostQueues:
    def __init__(self):
        self.queues = defaultdict(deque)

    def push(self, url):
        host = host_of(url)
        self.queues[host].append(url)

    def pop(self, host):
        queue = self.queues[host]

        if not queue:
            return None

        return queue.popleft()

    def has_urls(self, host):
        return bool(self.queues[host])

A single global queue cannot enforce host-level fairness. Per-host queues make host scheduling explicit.

Scheduling Hosts

To respect politeness, each host has a next allowed fetch time.

Use a min-heap keyed by ready_at.

import heapq
import time

class HostScheduler:
    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds
        self.heap = []
        self.scheduled = set()

    def schedule_host(self, host, ready_at=None):
        if host in self.scheduled:
            return

        if ready_at is None:
            ready_at = time.time()

        heapq.heappush(self.heap, (ready_at, host))
        self.scheduled.add(host)

    def pop_ready_host(self, now=None):
        if now is None:
            now = time.time()

        if not self.heap:
            return None

        ready_at, host = self.heap[0]

        if ready_at > now:
            return None

        heapq.heappop(self.heap)
        self.scheduled.remove(host)

        return host

    def reschedule_after_fetch(self, host, now=None):
        if now is None:
            now = time.time()

        self.schedule_host(host, now + self.delay_seconds)

The heap lets the frontier find the next crawlable host efficiently.

Combining the Structures

The frontier owns the seen set, host queues, and host scheduler.

class CrawlFrontier:
    def __init__(self, delay_seconds=1.0):
        self.seen = SeenSet()
        self.host_queues = HostQueues()
        self.scheduler = HostScheduler(delay_seconds)

    def add(self, url):
        normalized = normalize_url(url)

        if not self.seen.add_if_new(normalized):
            return False

        host = host_of(normalized)

        self.host_queues.push(normalized)
        self.scheduler.schedule_host(host)

        return True

    def next_url(self, now=None):
        host = self.scheduler.pop_ready_host(now)

        if host is None:
            return None

        url = self.host_queues.pop(host)

        if self.host_queues.has_urls(host):
            self.scheduler.reschedule_after_fetch(host, now)

        return url

Example:

frontier = CrawlFrontier(delay_seconds=2.0)

frontier.add("https://example.com/a")
frontier.add("https://example.com/b")
frontier.add("https://another.org/")

print(frontier.next_url(now=0))
print(frontier.next_url(now=0))

The first two URLs may come from different hosts, depending on scheduling order. A second URL from example.com should not be returned until the politeness delay has passed.

Priority

Not all URLs are equally valuable.

A crawler may prefer:

  • URLs near the seed set.
  • Pages from authoritative hosts.
  • Fresh pages.
  • Sitemaps.
  • Product pages.
  • Documentation pages.
  • URLs with high predicted content value.

Priority can be added within each host queue.

Instead of FIFO queues, use per-host priority queues.

class PriorityHostQueues:
    def __init__(self):
        self.queues = defaultdict(list)
        self.counter = 0

    def push(self, url, priority=0):
        host = host_of(url)
        heapq.heappush(
            self.queues[host],
            (-priority, self.counter, url),
        )
        self.counter += 1

    def pop(self, host):
        queue = self.queues[host]

        if not queue:
            return None

        _, _, url = heapq.heappop(queue)
        return url

    def has_urls(self, host):
        return bool(self.queues[host])

The negative priority makes larger priority values come out first.

Priority should not bypass politeness. It should decide which URL to fetch within a host, while host scheduling decides when that host may be fetched.

Depth Control

A crawler often starts from seed URLs and follows links to a limited depth.

seed page: depth 0
links from seed: depth 1
links from depth 1 pages: depth 2

Represent crawl tasks:

from dataclasses import dataclass

@dataclass(frozen=True)
class CrawlTask:
    url: str
    depth: int

Add a max depth policy:

class DepthLimitedFrontier:
    def __init__(self, max_depth, delay_seconds=1.0):
        self.max_depth = max_depth
        self.seen = SeenSet()
        self.host_queues = defaultdict(deque)
        self.scheduler = HostScheduler(delay_seconds)

    def add(self, url, depth):
        if depth > self.max_depth:
            return False

        normalized = normalize_url(url)

        if not self.seen.add_if_new(normalized):
            return False

        host = host_of(normalized)
        self.host_queues[host].append(CrawlTask(normalized, depth))
        self.scheduler.schedule_host(host)

        return True

Depth control prevents the crawler from expanding forever into low-value areas.

Robots and Exclusion Rules

A responsible crawler checks host rules before fetching URLs. These rules are usually expressed in robots.txt.

A full implementation is outside this recipe, but the frontier should expose a policy hook.

class CrawlPolicy:
    def allowed(self, url):
        return True

    def priority(self, url):
        return 0

Use it when adding URLs:

def add_with_policy(frontier, policy, url):
    normalized = normalize_url(url)

    if not policy.allowed(normalized):
        return False

    priority = policy.priority(normalized)

    return frontier.add(normalized, priority=priority)

The key design point is separation. The frontier should schedule URLs. Policy code should decide whether a URL is eligible.

Failure and Retry

Fetches fail. The frontier should support retries without producing infinite loops.

Represent attempts:

@dataclass(frozen=True)
class CrawlTask:
    url: str
    depth: int
    attempt: int = 0

Backoff policy:

def retry_delay(attempt):
    delays = [5, 30, 300]

    if attempt >= len(delays):
        return None

    return delays[attempt]

On failure:

def handle_fetch_failure(frontier, task, now):
    delay = retry_delay(task.attempt)

    if delay is None:
        return False

    retry = CrawlTask(
        url=task.url,
        depth=task.depth,
        attempt=task.attempt + 1,
    )

    frontier.requeue(task.url, retry, ready_at=now + delay)

    return True

Retries are scheduled events. They should obey both backoff and host politeness.

Distributed Frontier

A large crawler runs many workers. The frontier must distribute work while preserving deduplication and host politeness.

A common partitioning strategy assigns hosts to shards:

import hashlib

def shard_for_host(host, shard_count):
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "little")

    return value % shard_count

All URLs from the same host go to the same shard. That shard can enforce host-level politeness locally.

URL -> normalize -> host -> host shard -> frontier shard

Partitioning by full URL would spread the same host across multiple shards, breaking politeness.

Storage

A frontier may need persistence.

In-memory queues are fine for small crawlers, but large crawls require durable storage.

Typical storage:

State Storage option
Seen URLs Key-value store, Bloom filter, database table
Host queues Embedded log, message queue, database table
Host schedule Priority queue, sorted set
Fetch history Database or append-only log
Failures Retry table or dead-letter queue

A Redis sorted set can store hosts by next allowed fetch time. A database can store crawl tasks with status, host, priority, and ready_at.

Backpressure

A crawler must avoid discovering URLs faster than it can fetch them.

Backpressure policies include:

Policy Effect
Per-host queue limit Avoids one host dominating memory
Global queue limit Bounds total frontier size
Drop low-priority URLs Keeps high-value URLs
Delay link extraction Slows discovery
Spill to disk Trades latency for capacity

Without backpressure, a few pages with many links can flood the frontier.

Metrics

Track frontier health.

Metric Meaning
queued_urls Total pending URLs
queued_hosts Hosts with pending URLs
duplicate_urls URLs skipped as already seen
ready_hosts Hosts eligible now
delayed_hosts Hosts waiting for politeness
fetch_failures Failed fetches
retry_count Scheduled retries
frontier_latency Time from discovery to fetch
per_host_queue_size Skew indicator

These metrics reveal whether the crawler is CPU-bound, network-bound, storage-bound, or blocked by politeness.

Testing

Use a fake clock to test scheduling.

def test_same_host_respects_delay():
    frontier = CrawlFrontier(delay_seconds=10.0)

    frontier.add("https://example.com/a")
    frontier.add("https://example.com/b")

    assert frontier.next_url(now=0) == "https://example.com/a"
    assert frontier.next_url(now=0) is None
    assert frontier.next_url(now=10) == "https://example.com/b"

Test deduplication:

def test_duplicate_url_is_not_added():
    frontier = CrawlFrontier()

    assert frontier.add("https://example.com/a") is True
    assert frontier.add("https://example.com/a") is False

Test normalization:

def test_fragment_is_ignored():
    frontier = CrawlFrontier()

    assert frontier.add("https://example.com/a#top") is True
    assert frontier.add("https://example.com/a#bottom") is False

These tests catch the core frontier contract: no duplicates, no host overload, deterministic scheduling.

Common Bugs

The most common crawler bug is deduplicating raw URLs instead of normalized URLs. Fragments, case, default ports, and trailing paths can create duplicates.

Another common bug is enforcing politeness globally rather than per host. That slows the crawler unnecessarily while still allowing one host to be overloaded if scheduling is wrong.

Priority can accidentally bypass politeness if URLs are stored only in a global priority queue.

Distributed crawlers often break politeness by partitioning on URL hash instead of host hash.

Retry systems can create infinite loops if attempt counts are not bounded.

Frontiers can leak memory when seen sets and queues grow without limits.

URL normalization can also be too aggressive. Removing meaningful query parameters may collapse distinct pages.

Recipe

Build the frontier in layers.

Normalize URLs before deduplication. Store seen URLs in a hash set or Bloom filter. Group pending URLs by host. Schedule hosts with a min-heap keyed by next allowed fetch time. Add priority inside each host queue. Add depth limits and policy hooks. Add bounded retries with backoff. Partition by host for distributed crawlers. Persist frontier state when crashes matter. Track duplicate rate, queue size, host skew, retry count, and frontier latency.

The main lesson is that a crawler frontier is a scheduler, not just a queue. It combines deduplication, fairness, priority, delay, and durability into one control point. A clean frontier makes the rest of the crawler simpler because workers can fetch the next task without knowing the full scheduling policy.