20.13 Building a Web Crawler Frontier
Design a crawler frontier that decides which URLs a web crawler should visit next.
20.13 Building a Web Crawler Frontier
Problem
Design a crawler frontier that decides which URLs a web crawler should visit next.
A crawler frontier is the scheduling core of a web crawler. It stores discovered URLs, avoids duplicates, respects host politeness, prioritizes important pages, and feeds workers with crawl tasks. The crawler may discover millions or billions of URLs, but each worker only needs one answer at a time:
What URL should I fetch next?
A naive crawler uses a single FIFO queue:
push discovered URLs
pop next URL
fetch
extract links
repeat
This works for a toy crawler, but it fails quickly in real systems. It may crawl the same URL repeatedly, overload one host, ignore high-value pages, or let unbounded queues consume memory.
The frontier needs algorithms for deduplication, priority scheduling, rate limiting, and partitioning.
Solution
Use three cooperating structures:
- A seen set to avoid duplicate URLs.
- Per-host queues to enforce politeness.
- A priority queue to select the next host that may be crawled.
At a high level:
discovered URL
-> normalize URL
-> check seen set
-> enqueue by host
-> schedule host by next allowed fetch time
Workers request URLs from the frontier. The frontier returns only URLs whose host is ready.
URL Normalization
The same page may appear under several syntactic forms.
https://example.com
https://example.com/
https://example.com/index.html
https://EXAMPLE.com/
Before deduplication, normalize URLs.
A minimal normalizer:
from urllib.parse import urlsplit, urlunsplit, urldefrag
def normalize_url(url):
url, _fragment = urldefrag(url)
parts = urlsplit(url)
scheme = parts.scheme.lower() or "https"
netloc = parts.netloc.lower()
path = parts.path or "/"
if scheme == "http" and netloc.endswith(":80"):
netloc = netloc[:-3]
if scheme == "https" and netloc.endswith(":443"):
netloc = netloc[:-4]
return urlunsplit((
scheme,
netloc,
path,
parts.query,
"",
))
This handles case, fragments, missing paths, and default ports. Production crawlers add more rules, but every crawler should normalize before deduplication.
Deduplicating URLs
Use a hash set for exact deduplication.
class SeenSet:
def __init__(self):
self.seen = set()
def add_if_new(self, url):
if url in self.seen:
return False
self.seen.add(url)
return True
Use it in the frontier:
def add_url(frontier, seen, url):
normalized = normalize_url(url)
if not seen.add_if_new(normalized):
return False
frontier.enqueue(normalized)
return True
For very large crawls, an exact set may become too large. A Bloom filter can reduce memory, but it introduces false positives. A false positive means the crawler may skip a page it has not actually seen. That is acceptable for broad discovery crawls and unacceptable for crawls that require completeness.
Extracting the Host
Politeness rules usually apply per host.
from urllib.parse import urlsplit
def host_of(url):
return urlsplit(url).netloc.lower()
The frontier groups URLs by host:
example.com -> /a, /b, /c
docs.example.com -> /guide, /api
another.org -> /index
This grouping lets the crawler avoid hitting the same host too aggressively.
Per-Host Queues
Each host gets its own FIFO queue.
from collections import defaultdict, deque
class HostQueues:
def __init__(self):
self.queues = defaultdict(deque)
def push(self, url):
host = host_of(url)
self.queues[host].append(url)
def pop(self, host):
queue = self.queues[host]
if not queue:
return None
return queue.popleft()
def has_urls(self, host):
return bool(self.queues[host])
A single global queue cannot enforce host-level fairness. Per-host queues make host scheduling explicit.
Scheduling Hosts
To respect politeness, each host has a next allowed fetch time.
Use a min-heap keyed by ready_at.
import heapq
import time
class HostScheduler:
def __init__(self, delay_seconds):
self.delay_seconds = delay_seconds
self.heap = []
self.scheduled = set()
def schedule_host(self, host, ready_at=None):
if host in self.scheduled:
return
if ready_at is None:
ready_at = time.time()
heapq.heappush(self.heap, (ready_at, host))
self.scheduled.add(host)
def pop_ready_host(self, now=None):
if now is None:
now = time.time()
if not self.heap:
return None
ready_at, host = self.heap[0]
if ready_at > now:
return None
heapq.heappop(self.heap)
self.scheduled.remove(host)
return host
def reschedule_after_fetch(self, host, now=None):
if now is None:
now = time.time()
self.schedule_host(host, now + self.delay_seconds)
The heap lets the frontier find the next crawlable host efficiently.
Combining the Structures
The frontier owns the seen set, host queues, and host scheduler.
class CrawlFrontier:
def __init__(self, delay_seconds=1.0):
self.seen = SeenSet()
self.host_queues = HostQueues()
self.scheduler = HostScheduler(delay_seconds)
def add(self, url):
normalized = normalize_url(url)
if not self.seen.add_if_new(normalized):
return False
host = host_of(normalized)
self.host_queues.push(normalized)
self.scheduler.schedule_host(host)
return True
def next_url(self, now=None):
host = self.scheduler.pop_ready_host(now)
if host is None:
return None
url = self.host_queues.pop(host)
if self.host_queues.has_urls(host):
self.scheduler.reschedule_after_fetch(host, now)
return url
Example:
frontier = CrawlFrontier(delay_seconds=2.0)
frontier.add("https://example.com/a")
frontier.add("https://example.com/b")
frontier.add("https://another.org/")
print(frontier.next_url(now=0))
print(frontier.next_url(now=0))
The first two URLs may come from different hosts, depending on scheduling order. A second URL from example.com should not be returned until the politeness delay has passed.
Priority
Not all URLs are equally valuable.
A crawler may prefer:
- URLs near the seed set.
- Pages from authoritative hosts.
- Fresh pages.
- Sitemaps.
- Product pages.
- Documentation pages.
- URLs with high predicted content value.
Priority can be added within each host queue.
Instead of FIFO queues, use per-host priority queues.
class PriorityHostQueues:
def __init__(self):
self.queues = defaultdict(list)
self.counter = 0
def push(self, url, priority=0):
host = host_of(url)
heapq.heappush(
self.queues[host],
(-priority, self.counter, url),
)
self.counter += 1
def pop(self, host):
queue = self.queues[host]
if not queue:
return None
_, _, url = heapq.heappop(queue)
return url
def has_urls(self, host):
return bool(self.queues[host])
The negative priority makes larger priority values come out first.
Priority should not bypass politeness. It should decide which URL to fetch within a host, while host scheduling decides when that host may be fetched.
Depth Control
A crawler often starts from seed URLs and follows links to a limited depth.
seed page: depth 0
links from seed: depth 1
links from depth 1 pages: depth 2
Represent crawl tasks:
from dataclasses import dataclass
@dataclass(frozen=True)
class CrawlTask:
url: str
depth: int
Add a max depth policy:
class DepthLimitedFrontier:
def __init__(self, max_depth, delay_seconds=1.0):
self.max_depth = max_depth
self.seen = SeenSet()
self.host_queues = defaultdict(deque)
self.scheduler = HostScheduler(delay_seconds)
def add(self, url, depth):
if depth > self.max_depth:
return False
normalized = normalize_url(url)
if not self.seen.add_if_new(normalized):
return False
host = host_of(normalized)
self.host_queues[host].append(CrawlTask(normalized, depth))
self.scheduler.schedule_host(host)
return True
Depth control prevents the crawler from expanding forever into low-value areas.
Robots and Exclusion Rules
A responsible crawler checks host rules before fetching URLs. These rules are usually expressed in robots.txt.
A full implementation is outside this recipe, but the frontier should expose a policy hook.
class CrawlPolicy:
def allowed(self, url):
return True
def priority(self, url):
return 0
Use it when adding URLs:
def add_with_policy(frontier, policy, url):
normalized = normalize_url(url)
if not policy.allowed(normalized):
return False
priority = policy.priority(normalized)
return frontier.add(normalized, priority=priority)
The key design point is separation. The frontier should schedule URLs. Policy code should decide whether a URL is eligible.
Failure and Retry
Fetches fail. The frontier should support retries without producing infinite loops.
Represent attempts:
@dataclass(frozen=True)
class CrawlTask:
url: str
depth: int
attempt: int = 0
Backoff policy:
def retry_delay(attempt):
delays = [5, 30, 300]
if attempt >= len(delays):
return None
return delays[attempt]
On failure:
def handle_fetch_failure(frontier, task, now):
delay = retry_delay(task.attempt)
if delay is None:
return False
retry = CrawlTask(
url=task.url,
depth=task.depth,
attempt=task.attempt + 1,
)
frontier.requeue(task.url, retry, ready_at=now + delay)
return True
Retries are scheduled events. They should obey both backoff and host politeness.
Distributed Frontier
A large crawler runs many workers. The frontier must distribute work while preserving deduplication and host politeness.
A common partitioning strategy assigns hosts to shards:
import hashlib
def shard_for_host(host, shard_count):
digest = hashlib.sha256(host.encode("utf-8")).digest()
value = int.from_bytes(digest[:8], "little")
return value % shard_count
All URLs from the same host go to the same shard. That shard can enforce host-level politeness locally.
URL -> normalize -> host -> host shard -> frontier shard
Partitioning by full URL would spread the same host across multiple shards, breaking politeness.
Storage
A frontier may need persistence.
In-memory queues are fine for small crawlers, but large crawls require durable storage.
Typical storage:
| State | Storage option |
|---|---|
| Seen URLs | Key-value store, Bloom filter, database table |
| Host queues | Embedded log, message queue, database table |
| Host schedule | Priority queue, sorted set |
| Fetch history | Database or append-only log |
| Failures | Retry table or dead-letter queue |
A Redis sorted set can store hosts by next allowed fetch time. A database can store crawl tasks with status, host, priority, and ready_at.
Backpressure
A crawler must avoid discovering URLs faster than it can fetch them.
Backpressure policies include:
| Policy | Effect |
|---|---|
| Per-host queue limit | Avoids one host dominating memory |
| Global queue limit | Bounds total frontier size |
| Drop low-priority URLs | Keeps high-value URLs |
| Delay link extraction | Slows discovery |
| Spill to disk | Trades latency for capacity |
Without backpressure, a few pages with many links can flood the frontier.
Metrics
Track frontier health.
| Metric | Meaning |
|---|---|
| queued_urls | Total pending URLs |
| queued_hosts | Hosts with pending URLs |
| duplicate_urls | URLs skipped as already seen |
| ready_hosts | Hosts eligible now |
| delayed_hosts | Hosts waiting for politeness |
| fetch_failures | Failed fetches |
| retry_count | Scheduled retries |
| frontier_latency | Time from discovery to fetch |
| per_host_queue_size | Skew indicator |
These metrics reveal whether the crawler is CPU-bound, network-bound, storage-bound, or blocked by politeness.
Testing
Use a fake clock to test scheduling.
def test_same_host_respects_delay():
frontier = CrawlFrontier(delay_seconds=10.0)
frontier.add("https://example.com/a")
frontier.add("https://example.com/b")
assert frontier.next_url(now=0) == "https://example.com/a"
assert frontier.next_url(now=0) is None
assert frontier.next_url(now=10) == "https://example.com/b"
Test deduplication:
def test_duplicate_url_is_not_added():
frontier = CrawlFrontier()
assert frontier.add("https://example.com/a") is True
assert frontier.add("https://example.com/a") is False
Test normalization:
def test_fragment_is_ignored():
frontier = CrawlFrontier()
assert frontier.add("https://example.com/a#top") is True
assert frontier.add("https://example.com/a#bottom") is False
These tests catch the core frontier contract: no duplicates, no host overload, deterministic scheduling.
Common Bugs
The most common crawler bug is deduplicating raw URLs instead of normalized URLs. Fragments, case, default ports, and trailing paths can create duplicates.
Another common bug is enforcing politeness globally rather than per host. That slows the crawler unnecessarily while still allowing one host to be overloaded if scheduling is wrong.
Priority can accidentally bypass politeness if URLs are stored only in a global priority queue.
Distributed crawlers often break politeness by partitioning on URL hash instead of host hash.
Retry systems can create infinite loops if attempt counts are not bounded.
Frontiers can leak memory when seen sets and queues grow without limits.
URL normalization can also be too aggressive. Removing meaningful query parameters may collapse distinct pages.
Recipe
Build the frontier in layers.
Normalize URLs before deduplication. Store seen URLs in a hash set or Bloom filter. Group pending URLs by host. Schedule hosts with a min-heap keyed by next allowed fetch time. Add priority inside each host queue. Add depth limits and policy hooks. Add bounded retries with backoff. Partition by host for distributed crawlers. Persist frontier state when crashes matter. Track duplicate rate, queue size, host skew, retry count, and frontier latency.
The main lesson is that a crawler frontier is a scheduler, not just a queue. It combines deduplication, fairness, priority, delay, and durability into one control point. A clean frontier makes the rest of the crawler simpler because workers can fetch the next task without knowing the full scheduling policy.