16. Dictionaries and Sets

Dictionaries and sets are CPython’s primary hash-table containers. A dictionary maps keys to values. A set stores keys without associated values.

They are used everywhere in Python:

module globals
class namespaces
instance attributes
keyword arguments
function annotations
import caches
memoization tables
membership indexes
deduplication sets

A large part of Python’s runtime performance depends on dictionary performance.

16.1 Dictionary Semantics

A dictionary maps hashable keys to values.

d = {
    "name": "Ada",
    "age": 36,
}

print(d["name"])    # Ada

Keys must be hashable. Values can be any object.

d = {}

d["x"] = []
d[42] = object()
d[(1, 2)] = "tuple key"

A key is hashable when it has a stable hash value and supports equality comparison compatible with that hash.

hash("name")        # works
hash((1, 2))        # works
hash([1, 2])        # TypeError

Lists are unhashable because their contents can change.

16.2 Set Semantics

A set stores unique hashable elements.

seen = set()

seen.add("x")
seen.add("x")

print(seen)         # {'x'}

A set is useful when you need membership testing, deduplication, or set algebra.

a = {"red", "green", "blue"}
b = {"green", "yellow"}

print(a & b)        # intersection
print(a | b)        # union
print(a - b)        # difference

A frozenset is immutable and hashable if all elements are hashable.

key = frozenset(["read", "write"])
d = {key: "permissions"}

16.3 Hash Tables

Dictionaries and sets are hash tables.

A hash table uses a hash value to find where an object should live in an internal table.

For dictionary lookup:

hash key
find candidate slot
compare stored key with lookup key
return value if equal

For set membership:

hash element
find candidate slot
compare stored element with lookup element
return present or absent

Average-case lookup is constant time:

d[key]      average O(1)
key in d    average O(1)
x in set    average O(1)

Worst-case behavior can degrade when many keys collide, but CPython includes collision handling and hash randomization defenses.

16.4 Hash and Equality Contract

The hash contract is:

if a == b, then hash(a) == hash(b)

The reverse is not required:

if hash(a) == hash(b), a may still be different from b

A collision happens when two unequal objects have the same hash.

Dictionaries handle collisions by comparing keys.

Example custom key:

class Key:
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Key) and self.value == other.value

    def __hash__(self):
        return hash(self.value)

If you define __eq__, define a compatible __hash__ only when the object is immutable enough for hashing.

16.5 Mutable Keys Are Dangerous

A key’s hash must remain stable while the key is in a dictionary or set.

This is bad:

class BadKey:
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, BadKey) and self.value == other.value

    def __hash__(self):
        return hash(self.value)

k = BadKey("a")
d = {k: "value"}

k.value = "b"

After mutation, the object may no longer be findable in the table.

print(d.get(k))     # may fail unexpectedly

The dictionary placed the key according to the old hash. Lookup uses the new hash.

Use immutable key data.

key = ("user", 123)

16.6 Dictionary Object Layout

A CPython dictionary object stores metadata and points to table data.

Conceptually:

PyDictObject
    object header
    used count
    version tag
    keys table pointer
    values pointer for split-table dicts

The internal layout is optimized and version-specific, but the key idea is:

dictionary object
    controls a hash table
    stores keys and values
    resizes as needed

A dictionary does not store key and value objects inline as raw values. It stores references to Python objects.

dict entry
    key pointer   ---> Python object
    value pointer ---> Python object
    hash value

16.7 Combined and Split Tables

CPython dictionaries can use different internal layouts.

A combined table stores keys and values together.

entry
    hash
    key pointer
    value pointer

A split table separates shared keys from per-instance values.

This is useful for instance dictionaries of objects of the same class.

Example:

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

a = Point(1, 2)
b = Point(3, 4)

Both instances have attributes x and y.

CPython can share key metadata for the class’s instance layout while each instance stores its own values.

Conceptually:

shared keys table
    "x"
    "y"

instance a values
    1
    2

instance b values
    3
    4

This reduces memory usage for many instances with the same attribute names.

16.8 Insertion Order

Dictionaries preserve insertion order.

d = {}
d["a"] = 1
d["b"] = 2
d["c"] = 3

print(list(d))      # ['a', 'b', 'c']

Updating an existing key does not move it:

d["b"] = 20
print(list(d))      # ['a', 'b', 'c']

Deleting and reinserting creates a new insertion position:

del d["b"]
d["b"] = 200

print(list(d))      # ['a', 'c', 'b']

Insertion order is language behavior for dictionaries in modern Python, not merely an accidental CPython detail.

16.9 Dictionary Lookup

Dictionary lookup has several stages.

value = d[key]

Conceptually:

compute hash(key)
choose initial table slot
inspect slot
    if empty: key missing
    if stored hash differs: continue probing
    if stored key is key: found
    if stored key == key: found
    otherwise continue probing

Identity is checked before equality when possible. If the same object is used as the key, lookup can avoid equality work.

key = "name"
d = {key: "Ada"}

print(d[key])

For strings, cached hashes and interning can make lookup especially fast.

16.10 Collision Handling

Different keys can have the same hash.

class Collide:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 1

    def __eq__(self, other):
        return isinstance(other, Collide) and self.value == other.value

All instances hash to 1.

d = {}
for i in range(100):
    d[Collide(i)] = i

The dictionary still works, but performance suffers because many probes and equality checks are needed.

Hash randomization helps protect string-like keys from predictable collision attacks in network-facing programs.

16.11 Resizing

A dictionary resizes when it becomes too full.

The goal is to keep lookup efficient by maintaining empty space in the table.

Conceptually:

insert key
    if table load is high:
        allocate larger table
        reinsert active entries
    insert new key

Resizing costs O(n) at the moment it happens, but insertions remain efficient on average.

This is similar in spirit to list over-allocation. The container uses extra memory to keep common operations fast.

16.12 Deletion and Dummy Entries

Deleting a dictionary key cannot always mark a slot as simply empty.

Why?

Because lookup may rely on probe chains. If deletion broke the chain, later keys could become unreachable.

So hash tables often use dummy markers for deleted slots.

Conceptually:

active slot
    contains key and value

empty slot
    never used

dummy slot
    used before, now deleted

Insertion can reuse dummy slots. Lookup must skip over them.

Too many dummy slots can hurt performance, so resizing or rebuilding can clean them up.

16.13 Dictionary Views

Dictionary view objects expose dynamic views of keys, values, or items.

d = {"a": 1, "b": 2}

keys = d.keys()
values = d.values()
items = d.items()

Views reflect later mutations:

keys = d.keys()

d["c"] = 3

print(keys)     # dict_keys(['a', 'b', 'c'])

Views avoid copying. They are lightweight objects pointing back to the dictionary.

To make a snapshot, explicitly copy:

keys_snapshot = list(d.keys())

16.14 Iterating a Dictionary

Iterating a dictionary yields keys.

for key in d:
    print(key)

Equivalent:

for key in d.keys():
    print(key)

During iteration, changing the size of the dictionary raises an error:

for key in d:
    d["new"] = 1      # RuntimeError

Mutating values without changing keys is usually allowed:

for key in d:
    d[key] += 1

The iterator relies on dictionary structure staying consistent.

16.15 Dictionary Version Tags

CPython dictionaries can maintain version information that changes when the dictionary changes.

Version tags are useful for runtime optimizations, especially caches.

For example, attribute lookup and global lookup can cache information when a dictionary remains unchanged.

Conceptually:

dict version = 100
cache lookup result for name "x"

later:
    if dict version still 100:
        cached result may be valid
    else:
        redo lookup

This is a major tool for speeding up dynamic lookup without changing Python semantics.

16.16 Globals Are Dictionaries

A module’s global namespace is a dictionary.

x = 10

globals()["x"] = 20

print(x)        # 20

When Python code reads a global variable, CPython performs a dictionary lookup in the module globals, with fallback to builtins.

Conceptually:

LOAD_GLOBAL name
    look in globals dict
    if missing, look in builtins dict

Because globals are dictionaries, global variable access depends on dict performance. CPython uses inline caches and version tags to speed this path.

16.17 Instance Attributes Are Often Dictionaries

Ordinary object instances usually store attributes in a dictionary.

class User:
    pass

u = User()
u.name = "Ada"
u.age = 36

print(u.__dict__)

Output:

{'name': 'Ada', 'age': 36}

Attribute access:

u.name

often becomes a dictionary lookup, after descriptor checks and type-level lookup.

CPython optimizes this heavily with split dictionaries, shared keys, and attribute caches.

16.18 Class Namespaces Are Dictionaries

A class body executes in a namespace mapping.

class User:
    species = "human"

    def greet(self):
        return "hello"

The class namespace collects:

species
greet
__module__
__qualname__

Then the metaclass creates the class object from that namespace.

The resulting class has a dictionary-like mapping of attributes.

print(User.__dict__)

Class dictionaries are exposed through a read-only mappingproxy view.

16.19 `mappingproxy`

A mappingproxy is a read-only view of a mapping.

class C:
    x = 1

print(C.__dict__)

You cannot directly assign through the mappingproxy:

C.__dict__["y"] = 2      # TypeError

But you can assign through normal class attribute syntax:

C.y = 2

The proxy prevents direct mutation of the internal class dictionary while still allowing introspection.

16.20 Keyword Arguments

Keyword arguments are represented using mapping-like mechanisms.

def f(**kwargs):
    print(kwargs)

f(a=1, b=2)

Inside the function, kwargs is a dictionary.

Keyword argument passing is performance-sensitive because calls are common.

CPython has specialized call conventions to avoid unnecessary dictionary creation when possible, but when **kwargs is needed, a dictionary is materialized.

16.21 Sets Use Hash Tables Too

A set is similar to a dictionary without values.

s = {"a", "b", "c"}

Conceptually:

set table entry
    hash
    key pointer

Membership:

"x" in s

uses hashing and probing, like dict lookup.

Sets resize and handle collisions similarly in concept.

16.22 Set Operations

Set operations use hash-table membership internally.

a = {1, 2, 3}
b = {3, 4, 5}

print(a & b)    # {3}
print(a | b)    # {1, 2, 3, 4, 5}
print(a - b)    # {1, 2}

Performance depends on input sizes.

For intersection, it is usually better to iterate over the smaller set and test membership in the larger set.

Conceptually:

intersection(a, b)
    for each item in smaller set:
        if item in larger set:
            add to result

CPython’s implementation uses these kinds of size-aware strategies.

16.23 Set Iteration Order

Sets do not promise insertion order.

s = {"a", "b", "c"}
print(list(s))

The order can depend on hashes, table size, insertion history, and runtime hash randomization.

Do not write code that depends on set iteration order.

If order matters, use a dictionary as an ordered membership map or sort explicitly.

ordered_unique = list(dict.fromkeys(items))

16.24 `dict.fromkeys`

A common deduplication pattern:

items = ["b", "a", "b", "c", "a"]

unique = list(dict.fromkeys(items))

print(unique)   # ['b', 'a', 'c']

This uses dictionary insertion order to preserve first occurrence.

A set would deduplicate but lose order:

set(items)

Use the container whose semantics match the requirement.

16.25 Default Values and Missing Keys

Several dictionary APIs handle missing keys.

d.get("key", default)

get returns a default without inserting.

d.setdefault("key", [])

setdefault inserts the default if the key is missing.

Common pattern:

groups = {}

for item in items:
    groups.setdefault(item.category, []).append(item)

For repeated grouping, collections.defaultdict is often clearer:

from collections import defaultdict

groups = defaultdict(list)

for item in items:
    groups[item.category].append(item)

16.26 Dictionary Comprehensions

A dictionary comprehension builds a dictionary from an iterable.

squares = {x: x * x for x in range(10)}

If duplicate keys appear, later values replace earlier values while the insertion position of the first key is retained.

d = {x % 2: x for x in range(5)}

print(d)        # {0: 4, 1: 3}

The keys 0 and 1 were inserted early, then values were updated.

16.27 Set Comprehensions

A set comprehension builds a set.

mods = {x % 10 for x in range(100)}

Duplicates collapse.

Set comprehensions are useful for extracting unique computed values.

domains = {url.split("/")[2] for url in urls}

Again, iteration order is not a semantic guarantee.

16.28 Hash Randomization

CPython uses hash randomization for some hashable types such as strings and bytes.

This means hash values can differ between processes.

print(hash("name"))

Run the program twice and the value may differ.

Do not persist hash() values across processes. Do not use them as stable IDs.

For stable hashing, use a defined hash function:

import hashlib

digest = hashlib.sha256(b"name").hexdigest()

hash() is for in-process hash tables, not durable identity.

16.29 Common Performance Patterns

Task	Container	Reason
Key-value lookup	`dict`	Hash-table lookup
Membership testing	`set`	Average `O(1)`
Preserve first unique values	`dict.fromkeys`	Ordered keys
Count items	`collections.Counter`	Specialized dict subclass
Group items	`defaultdict(list)`	Avoid repeated missing-key checks
Fixed composite key	`tuple`	Hashable if elements hashable
Ordered sorted keys	`sorted(d)`	Explicit ordering
Read-only class namespace view	`mappingproxy`	Prevent direct mutation

16.30 Common Mistakes

Using a list for frequent membership:

if user_id in user_ids:
    ...

If user_ids is large and lookup happens often, use a set.

user_ids = set(user_ids)

Using mutable objects as conceptual keys:

key = [1, 2]
d[key] = "value"     # TypeError

Use a tuple:

key = (1, 2)
d[key] = "value"

Depending on set order:

first = next(iter(my_set))

This selects an arbitrary element, not a stable first element.

Persisting hash():

id_value = hash(username)

Use a stable hash function instead.

16.31 C API Sketch

Creating a dictionary:

PyObject *d = PyDict_New();
if (d == NULL) {
    return NULL;
}

Setting an item:

PyObject *key = PyUnicode_FromString("name");
PyObject *value = PyUnicode_FromString("Ada");

if (key == NULL || value == NULL) {
    Py_XDECREF(key);
    Py_XDECREF(value);
    Py_DECREF(d);
    return NULL;
}

if (PyDict_SetItem(d, key, value) < 0) {
    Py_DECREF(key);
    Py_DECREF(value);
    Py_DECREF(d);
    return NULL;
}

Py_DECREF(key);
Py_DECREF(value);

PyDict_SetItem does not steal references. It increments what it stores. The caller must release its owned references.

Getting an item can return a borrowed reference depending on the API:

PyObject *value = PyDict_GetItemWithError(d, key);
if (value == NULL) {
    if (PyErr_Occurred()) {
        return NULL;
    }
    Py_RETURN_NONE;
}

/* borrowed reference */
Py_INCREF(value);
return value;

Reference ownership rules matter.

16.32 Set C API Sketch

Creating a set:

PyObject *s = PySet_New(NULL);
if (s == NULL) {
    return NULL;
}

Adding an item:

PyObject *item = PyUnicode_FromString("x");
if (item == NULL) {
    Py_DECREF(s);
    return NULL;
}

if (PySet_Add(s, item) < 0) {
    Py_DECREF(item);
    Py_DECREF(s);
    return NULL;
}

Py_DECREF(item);

The set owns its own reference after insertion. The local reference is still released.

Membership:

int present = PySet_Contains(s, item);
if (present < 0) {
    return NULL;
}

The result is:

1 if present
0 if absent
-1 on error

16.33 Mental Model

Use this model:

dict
    hash table mapping keys to values
    keys must be hashable
    insertion order preserved
    stores references to keys and values
    used throughout CPython runtime

set
    hash table storing keys only
    keys must be hashable
    no insertion-order guarantee
    optimized for membership and set algebra

For both containers:

hash chooses where to look
equality confirms the key
collisions are handled by probing
resizing keeps lookup efficient
mutation changes table structure
iteration depends on table state

16.34 Summary

Dictionaries and sets are hash-table containers. Dictionaries map hashable keys to values and preserve insertion order. Sets store unique hashable elements and optimize membership testing and set operations.

CPython uses dictionaries as a core runtime structure for globals, attributes, classes, modules, keyword arguments, and caches. Their implementation includes hash caching, collision handling, resizing, split-table layouts, version tags, and order preservation.

Understanding dictionaries and sets explains much of Python’s speed, object model, attribute lookup, namespace behavior, and common performance patterns.

16. Dictionaries and Sets

16. Dictionaries and Sets

16.1 Dictionary Semantics

16.2 Set Semantics

16.3 Hash Tables

16.4 Hash and Equality Contract

16.5 Mutable Keys Are Dangerous

16.6 Dictionary Object Layout

16.7 Combined and Split Tables

16.8 Insertion Order

16.9 Dictionary Lookup

16.10 Collision Handling

16.11 Resizing

16.12 Deletion and Dummy Entries

16.13 Dictionary Views

16.14 Iterating a Dictionary

16.15 Dictionary Version Tags

16.16 Globals Are Dictionaries

16.17 Instance Attributes Are Often Dictionaries

16.18 Class Namespaces Are Dictionaries

16.19 mappingproxy

16.20 Keyword Arguments

16.21 Sets Use Hash Tables Too

16.22 Set Operations

16.23 Set Iteration Order

16.24 dict.fromkeys

16.25 Default Values and Missing Keys

16.26 Dictionary Comprehensions

16.27 Set Comprehensions

16.28 Hash Randomization

16.29 Common Performance Patterns

16.30 Common Mistakes

16.31 C API Sketch

16.32 Set C API Sketch

16.33 Mental Model

16.34 Summary

16.19 `mappingproxy`

16.24 `dict.fromkeys`