# 16. Dictionaries and Sets

# 16. Dictionaries and Sets

Dictionaries and sets are CPython’s primary hash-table containers. A dictionary maps keys to values. A set stores keys without associated values.

They are used everywhere in Python:

```text id="cp7dx7"
module globals
class namespaces
instance attributes
keyword arguments
function annotations
import caches
memoization tables
membership indexes
deduplication sets
```

A large part of Python’s runtime performance depends on dictionary performance.

## 16.1 Dictionary Semantics

A dictionary maps hashable keys to values.

```python id="cvrlq9"
d = {
    "name": "Ada",
    "age": 36,
}

print(d["name"])    # Ada
```

Keys must be hashable. Values can be any object.

```python id="mb4tqx"
d = {}

d["x"] = []
d[42] = object()
d[(1, 2)] = "tuple key"
```

A key is hashable when it has a stable hash value and supports equality comparison compatible with that hash.

```python id="100k2v"
hash("name")        # works
hash((1, 2))        # works
hash([1, 2])        # TypeError
```

Lists are unhashable because their contents can change.

## 16.2 Set Semantics

A set stores unique hashable elements.

```python id="a9i1rs"
seen = set()

seen.add("x")
seen.add("x")

print(seen)         # {'x'}
```

A set is useful when you need membership testing, deduplication, or set algebra.

```python id="utkml3"
a = {"red", "green", "blue"}
b = {"green", "yellow"}

print(a & b)        # intersection
print(a | b)        # union
print(a - b)        # difference
```

A `frozenset` is immutable and hashable if all elements are hashable.

```python id="8f3u4m"
key = frozenset(["read", "write"])
d = {key: "permissions"}
```

## 16.3 Hash Tables

Dictionaries and sets are hash tables.

A hash table uses a hash value to find where an object should live in an internal table.

For dictionary lookup:

```text id="kobvyn"
hash key
find candidate slot
compare stored key with lookup key
return value if equal
```

For set membership:

```text id="hbs3i1"
hash element
find candidate slot
compare stored element with lookup element
return present or absent
```

Average-case lookup is constant time:

```text id="qme3op"
d[key]      average O(1)
key in d    average O(1)
x in set    average O(1)
```

Worst-case behavior can degrade when many keys collide, but CPython includes collision handling and hash randomization defenses.

## 16.4 Hash and Equality Contract

The hash contract is:

```text id="u4q6w3"
if a == b, then hash(a) == hash(b)
```

The reverse is not required:

```text id="2ibsgd"
if hash(a) == hash(b), a may still be different from b
```

A collision happens when two unequal objects have the same hash.

Dictionaries handle collisions by comparing keys.

Example custom key:

```python id="7h6xvk"
class Key:
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, Key) and self.value == other.value

    def __hash__(self):
        return hash(self.value)
```

If you define `__eq__`, define a compatible `__hash__` only when the object is immutable enough for hashing.

## 16.5 Mutable Keys Are Dangerous

A key’s hash must remain stable while the key is in a dictionary or set.

This is bad:

```python id="51z0wo"
class BadKey:
    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        return isinstance(other, BadKey) and self.value == other.value

    def __hash__(self):
        return hash(self.value)

k = BadKey("a")
d = {k: "value"}

k.value = "b"
```

After mutation, the object may no longer be findable in the table.

```python id="u2wkp9"
print(d.get(k))     # may fail unexpectedly
```

The dictionary placed the key according to the old hash. Lookup uses the new hash.

Use immutable key data.

```python id="mzfzuj"
key = ("user", 123)
```

## 16.6 Dictionary Object Layout

A CPython dictionary object stores metadata and points to table data.

Conceptually:

```text id="36zah2"
PyDictObject
    object header
    used count
    version tag
    keys table pointer
    values pointer for split-table dicts
```

The internal layout is optimized and version-specific, but the key idea is:

```text id="kpjby5"
dictionary object
    controls a hash table
    stores keys and values
    resizes as needed
```

A dictionary does not store key and value objects inline as raw values. It stores references to Python objects.

```text id="jxgd18"
dict entry
    key pointer   ---> Python object
    value pointer ---> Python object
    hash value
```

## 16.7 Combined and Split Tables

CPython dictionaries can use different internal layouts.

A combined table stores keys and values together.

```text id="kswrs2"
entry
    hash
    key pointer
    value pointer
```

A split table separates shared keys from per-instance values.

This is useful for instance dictionaries of objects of the same class.

Example:

```python id="pr3ahp"
class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

a = Point(1, 2)
b = Point(3, 4)
```

Both instances have attributes `x` and `y`.

CPython can share key metadata for the class’s instance layout while each instance stores its own values.

Conceptually:

```text id="4gdmgx"
shared keys table
    "x"
    "y"

instance a values
    1
    2

instance b values
    3
    4
```

This reduces memory usage for many instances with the same attribute names.

## 16.8 Insertion Order

Dictionaries preserve insertion order.

```python id="jf7ccx"
d = {}
d["a"] = 1
d["b"] = 2
d["c"] = 3

print(list(d))      # ['a', 'b', 'c']
```

Updating an existing key does not move it:

```python id="2d4s8h"
d["b"] = 20
print(list(d))      # ['a', 'b', 'c']
```

Deleting and reinserting creates a new insertion position:

```python id="2tmw4q"
del d["b"]
d["b"] = 200

print(list(d))      # ['a', 'c', 'b']
```

Insertion order is language behavior for dictionaries in modern Python, not merely an accidental CPython detail.

## 16.9 Dictionary Lookup

Dictionary lookup has several stages.

```python id="fq8zvc"
value = d[key]
```

Conceptually:

```text id="5w58mp"
compute hash(key)
choose initial table slot
inspect slot
    if empty: key missing
    if stored hash differs: continue probing
    if stored key is key: found
    if stored key == key: found
    otherwise continue probing
```

Identity is checked before equality when possible. If the same object is used as the key, lookup can avoid equality work.

```python id="b3vf1x"
key = "name"
d = {key: "Ada"}

print(d[key])
```

For strings, cached hashes and interning can make lookup especially fast.

## 16.10 Collision Handling

Different keys can have the same hash.

```python id="u61im9"
class Collide:
    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 1

    def __eq__(self, other):
        return isinstance(other, Collide) and self.value == other.value
```

All instances hash to `1`.

```python id="sn5hoa"
d = {}
for i in range(100):
    d[Collide(i)] = i
```

The dictionary still works, but performance suffers because many probes and equality checks are needed.

Hash randomization helps protect string-like keys from predictable collision attacks in network-facing programs.

## 16.11 Resizing

A dictionary resizes when it becomes too full.

The goal is to keep lookup efficient by maintaining empty space in the table.

Conceptually:

```text id="km324o"
insert key
    if table load is high:
        allocate larger table
        reinsert active entries
    insert new key
```

Resizing costs `O(n)` at the moment it happens, but insertions remain efficient on average.

This is similar in spirit to list over-allocation. The container uses extra memory to keep common operations fast.

## 16.12 Deletion and Dummy Entries

Deleting a dictionary key cannot always mark a slot as simply empty.

Why?

Because lookup may rely on probe chains. If deletion broke the chain, later keys could become unreachable.

So hash tables often use dummy markers for deleted slots.

Conceptually:

```text id="o8jbeg"
active slot
    contains key and value

empty slot
    never used

dummy slot
    used before, now deleted
```

Insertion can reuse dummy slots. Lookup must skip over them.

Too many dummy slots can hurt performance, so resizing or rebuilding can clean them up.

## 16.13 Dictionary Views

Dictionary view objects expose dynamic views of keys, values, or items.

```python id="ovth2c"
d = {"a": 1, "b": 2}

keys = d.keys()
values = d.values()
items = d.items()
```

Views reflect later mutations:

```python id="bjim9l"
keys = d.keys()

d["c"] = 3

print(keys)     # dict_keys(['a', 'b', 'c'])
```

Views avoid copying. They are lightweight objects pointing back to the dictionary.

To make a snapshot, explicitly copy:

```python id="k65165"
keys_snapshot = list(d.keys())
```

## 16.14 Iterating a Dictionary

Iterating a dictionary yields keys.

```python id="ra7i88"
for key in d:
    print(key)
```

Equivalent:

```python id="4hpn52"
for key in d.keys():
    print(key)
```

During iteration, changing the size of the dictionary raises an error:

```python id="4085f6"
for key in d:
    d["new"] = 1      # RuntimeError
```

Mutating values without changing keys is usually allowed:

```python id="t0n9jz"
for key in d:
    d[key] += 1
```

The iterator relies on dictionary structure staying consistent.

## 16.15 Dictionary Version Tags

CPython dictionaries can maintain version information that changes when the dictionary changes.

Version tags are useful for runtime optimizations, especially caches.

For example, attribute lookup and global lookup can cache information when a dictionary remains unchanged.

Conceptually:

```text id="pfdsrf"
dict version = 100
cache lookup result for name "x"

later:
    if dict version still 100:
        cached result may be valid
    else:
        redo lookup
```

This is a major tool for speeding up dynamic lookup without changing Python semantics.

## 16.16 Globals Are Dictionaries

A module’s global namespace is a dictionary.

```python id="h0m5e5"
x = 10

globals()["x"] = 20

print(x)        # 20
```

When Python code reads a global variable, CPython performs a dictionary lookup in the module globals, with fallback to builtins.

Conceptually:

```text id="v5m31f"
LOAD_GLOBAL name
    look in globals dict
    if missing, look in builtins dict
```

Because globals are dictionaries, global variable access depends on dict performance. CPython uses inline caches and version tags to speed this path.

## 16.17 Instance Attributes Are Often Dictionaries

Ordinary object instances usually store attributes in a dictionary.

```python id="zdzzxk"
class User:
    pass

u = User()
u.name = "Ada"
u.age = 36

print(u.__dict__)
```

Output:

```text id="ijc7p2"
{'name': 'Ada', 'age': 36}
```

Attribute access:

```python id="6pnbk6"
u.name
```

often becomes a dictionary lookup, after descriptor checks and type-level lookup.

CPython optimizes this heavily with split dictionaries, shared keys, and attribute caches.

## 16.18 Class Namespaces Are Dictionaries

A class body executes in a namespace mapping.

```python id="d3f5st"
class User:
    species = "human"

    def greet(self):
        return "hello"
```

The class namespace collects:

```text id="y3ded7"
species
greet
__module__
__qualname__
```

Then the metaclass creates the class object from that namespace.

The resulting class has a dictionary-like mapping of attributes.

```python id="mtwgd9"
print(User.__dict__)
```

Class dictionaries are exposed through a read-only `mappingproxy` view.

## 16.19 `mappingproxy`

A `mappingproxy` is a read-only view of a mapping.

```python id="t40d69"
class C:
    x = 1

print(C.__dict__)
```

You cannot directly assign through the mappingproxy:

```python id="a5po6e"
C.__dict__["y"] = 2      # TypeError
```

But you can assign through normal class attribute syntax:

```python id="j36rbw"
C.y = 2
```

The proxy prevents direct mutation of the internal class dictionary while still allowing introspection.

## 16.20 Keyword Arguments

Keyword arguments are represented using mapping-like mechanisms.

```python id="7yetex"
def f(**kwargs):
    print(kwargs)

f(a=1, b=2)
```

Inside the function, `kwargs` is a dictionary.

Keyword argument passing is performance-sensitive because calls are common.

CPython has specialized call conventions to avoid unnecessary dictionary creation when possible, but when `**kwargs` is needed, a dictionary is materialized.

## 16.21 Sets Use Hash Tables Too

A set is similar to a dictionary without values.

```python id="j5dt5t"
s = {"a", "b", "c"}
```

Conceptually:

```text id="1c2tu2"
set table entry
    hash
    key pointer
```

Membership:

```python id="gc0h7j"
"x" in s
```

uses hashing and probing, like dict lookup.

Sets resize and handle collisions similarly in concept.

## 16.22 Set Operations

Set operations use hash-table membership internally.

```python id="d85y3h"
a = {1, 2, 3}
b = {3, 4, 5}

print(a & b)    # {3}
print(a | b)    # {1, 2, 3, 4, 5}
print(a - b)    # {1, 2}
```

Performance depends on input sizes.

For intersection, it is usually better to iterate over the smaller set and test membership in the larger set.

Conceptually:

```text id="s5h98o"
intersection(a, b)
    for each item in smaller set:
        if item in larger set:
            add to result
```

CPython’s implementation uses these kinds of size-aware strategies.

## 16.23 Set Iteration Order

Sets do not promise insertion order.

```python id="fmu4b8"
s = {"a", "b", "c"}
print(list(s))
```

The order can depend on hashes, table size, insertion history, and runtime hash randomization.

Do not write code that depends on set iteration order.

If order matters, use a dictionary as an ordered membership map or sort explicitly.

```python id="jc3ogv"
ordered_unique = list(dict.fromkeys(items))
```

## 16.24 `dict.fromkeys`

A common deduplication pattern:

```python id="0og08r"
items = ["b", "a", "b", "c", "a"]

unique = list(dict.fromkeys(items))

print(unique)   # ['b', 'a', 'c']
```

This uses dictionary insertion order to preserve first occurrence.

A set would deduplicate but lose order:

```python id="nozz18"
set(items)
```

Use the container whose semantics match the requirement.

## 16.25 Default Values and Missing Keys

Several dictionary APIs handle missing keys.

```python id="860ztd"
d.get("key", default)
```

`get` returns a default without inserting.

```python id="kkfpw8"
d.setdefault("key", [])
```

`setdefault` inserts the default if the key is missing.

Common pattern:

```python id="t4jd1i"
groups = {}

for item in items:
    groups.setdefault(item.category, []).append(item)
```

For repeated grouping, `collections.defaultdict` is often clearer:

```python id="27x3y1"
from collections import defaultdict

groups = defaultdict(list)

for item in items:
    groups[item.category].append(item)
```

## 16.26 Dictionary Comprehensions

A dictionary comprehension builds a dictionary from an iterable.

```python id="uxw8zk"
squares = {x: x * x for x in range(10)}
```

If duplicate keys appear, later values replace earlier values while the insertion position of the first key is retained.

```python id="yct6sa"
d = {x % 2: x for x in range(5)}

print(d)        # {0: 4, 1: 3}
```

The keys `0` and `1` were inserted early, then values were updated.

## 16.27 Set Comprehensions

A set comprehension builds a set.

```python id="3sgagh"
mods = {x % 10 for x in range(100)}
```

Duplicates collapse.

Set comprehensions are useful for extracting unique computed values.

```python id="n3vtrw"
domains = {url.split("/")[2] for url in urls}
```

Again, iteration order is not a semantic guarantee.

## 16.28 Hash Randomization

CPython uses hash randomization for some hashable types such as strings and bytes.

This means hash values can differ between processes.

```python id="og1g48"
print(hash("name"))
```

Run the program twice and the value may differ.

Do not persist `hash()` values across processes. Do not use them as stable IDs.

For stable hashing, use a defined hash function:

```python id="0clx65"
import hashlib

digest = hashlib.sha256(b"name").hexdigest()
```

`hash()` is for in-process hash tables, not durable identity.

## 16.29 Common Performance Patterns

| Task                           | Container             | Reason                            |
| ------------------------------ | --------------------- | --------------------------------- |
| Key-value lookup               | `dict`                | Hash-table lookup                 |
| Membership testing             | `set`                 | Average `O(1)`                    |
| Preserve first unique values   | `dict.fromkeys`       | Ordered keys                      |
| Count items                    | `collections.Counter` | Specialized dict subclass         |
| Group items                    | `defaultdict(list)`   | Avoid repeated missing-key checks |
| Fixed composite key            | `tuple`               | Hashable if elements hashable     |
| Ordered sorted keys            | `sorted(d)`           | Explicit ordering                 |
| Read-only class namespace view | `mappingproxy`        | Prevent direct mutation           |

## 16.30 Common Mistakes

Using a list for frequent membership:

```python id="89pu6v"
if user_id in user_ids:
    ...
```

If `user_ids` is large and lookup happens often, use a set.

```python id="2yd104"
user_ids = set(user_ids)
```

Using mutable objects as conceptual keys:

```python id="wotz0s"
key = [1, 2]
d[key] = "value"     # TypeError
```

Use a tuple:

```python id="lpvy37"
key = (1, 2)
d[key] = "value"
```

Depending on set order:

```python id="hmuq1q"
first = next(iter(my_set))
```

This selects an arbitrary element, not a stable first element.

Persisting `hash()`:

```python id="pchqlg"
id_value = hash(username)
```

Use a stable hash function instead.

## 16.31 C API Sketch

Creating a dictionary:

```c id="eclt1e"
PyObject *d = PyDict_New();
if (d == NULL) {
    return NULL;
}
```

Setting an item:

```c id="lk7i25"
PyObject *key = PyUnicode_FromString("name");
PyObject *value = PyUnicode_FromString("Ada");

if (key == NULL || value == NULL) {
    Py_XDECREF(key);
    Py_XDECREF(value);
    Py_DECREF(d);
    return NULL;
}

if (PyDict_SetItem(d, key, value) < 0) {
    Py_DECREF(key);
    Py_DECREF(value);
    Py_DECREF(d);
    return NULL;
}

Py_DECREF(key);
Py_DECREF(value);
```

`PyDict_SetItem` does not steal references. It increments what it stores. The caller must release its owned references.

Getting an item can return a borrowed reference depending on the API:

```c id="m18ydq"
PyObject *value = PyDict_GetItemWithError(d, key);
if (value == NULL) {
    if (PyErr_Occurred()) {
        return NULL;
    }
    Py_RETURN_NONE;
}

/* borrowed reference */
Py_INCREF(value);
return value;
```

Reference ownership rules matter.

## 16.32 Set C API Sketch

Creating a set:

```c id="l7rkf0"
PyObject *s = PySet_New(NULL);
if (s == NULL) {
    return NULL;
}
```

Adding an item:

```c id="gjqb0m"
PyObject *item = PyUnicode_FromString("x");
if (item == NULL) {
    Py_DECREF(s);
    return NULL;
}

if (PySet_Add(s, item) < 0) {
    Py_DECREF(item);
    Py_DECREF(s);
    return NULL;
}

Py_DECREF(item);
```

The set owns its own reference after insertion. The local reference is still released.

Membership:

```c id="b4gg5h"
int present = PySet_Contains(s, item);
if (present < 0) {
    return NULL;
}
```

The result is:

```text id="eari3m"
1 if present
0 if absent
-1 on error
```

## 16.33 Mental Model

Use this model:

```text id="78tgbw"
dict
    hash table mapping keys to values
    keys must be hashable
    insertion order preserved
    stores references to keys and values
    used throughout CPython runtime

set
    hash table storing keys only
    keys must be hashable
    no insertion-order guarantee
    optimized for membership and set algebra
```

For both containers:

```text id="e36bns"
hash chooses where to look
equality confirms the key
collisions are handled by probing
resizing keeps lookup efficient
mutation changes table structure
iteration depends on table state
```

## 16.34 Summary

Dictionaries and sets are hash-table containers. Dictionaries map hashable keys to values and preserve insertion order. Sets store unique hashable elements and optimize membership testing and set operations.

CPython uses dictionaries as a core runtime structure for globals, attributes, classes, modules, keyword arguments, and caches. Their implementation includes hash caching, collision handling, resizing, split-table layouts, version tags, and order preservation.

Understanding dictionaries and sets explains much of Python’s speed, object model, attribute lookup, namespace behavior, and common performance patterns.

