# 98. Serialization Internals

Serialization is the process of converting Python objects into a representation that can be stored, transmitted, copied, or reconstructed later.

In CPython, serialization appears in several forms:

```text
pickle
marshal
json
copy protocols
multiprocessing transfer
bytecode cache files
interpreter state exchange
```

These systems are related, but they have different goals.

The most important distinction is between general object serialization and CPython-internal serialization.

`pickle` is a public object serialization protocol. It is designed to represent many Python objects and reconstruct them later.

`marshal` is a lower-level CPython format used mainly for internal code object storage. It is not a general persistence format.

`json` is a text data interchange format. It represents a small set of portable data types.

This chapter focuses on how CPython thinks about serialization: object identity, type reconstruction, code object storage, recursive object graphs, extension hooks, security boundaries, and version stability.

## 98.1 Why Serialization Exists

Python objects live in process memory.

A list such as:

```python
items = ["a", "b", "c"]
```

is not stored as a contiguous portable data record. It is a CPython object graph:

```text
list object
    pointer to string object "a"
    pointer to string object "b"
    pointer to string object "c"
```

Serialization converts that graph into bytes or text.

Conceptually:

```text
object graph
    ↓
serializer
    ↓
byte stream
    ↓
deserializer
    ↓
new object graph
```

The reconstructed graph usually has equal values, but not the same object identities.
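This round trip can be observed directly with `pickle` (a minimal sketch; any serializer behaves similarly here):

```python
import pickle

items = ["a", "b", "c"]

blob = pickle.dumps(items)        # object graph -> byte stream
restored = pickle.loads(blob)     # byte stream -> new object graph

print(restored == items)          # True: equal values
print(restored is items)          # False: a new object
```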

## 98.2 Object Graphs

Most Python values are not isolated objects.

Example:

```python
a = []
b = [a, a]
```

The object graph is:

```text
b
    index 0 -> a
    index 1 -> a
```

The same list object appears twice.

A correct serializer must decide whether to preserve aliasing.

If aliasing is preserved:

```python
restored = deserialize(serialize(b))
restored[0] is restored[1]
```

should be:

```text
True
```

For `pickle`, preserving shared references is part of the protocol.

For simpler formats such as JSON, aliasing is lost because JSON represents trees, not arbitrary object graphs.
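The difference is easy to demonstrate. In this sketch, `pickle` keeps the aliasing while JSON silently duplicates the shared list:

```python
import json
import pickle

a = []
b = [a, a]                            # one list object, referenced twice

# pickle preserves the shared reference.
restored = pickle.loads(pickle.dumps(b))
print(restored[0] is restored[1])     # True

# JSON encodes a tree, so each occurrence becomes a separate list.
tree = json.loads(json.dumps(b))
print(tree[0] is tree[1])             # False
```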

## 98.3 Cycles

Python object graphs can contain cycles.

Example:

```python
x = []
x.append(x)
```

This graph points back to itself:

```text
x
    index 0 -> x
```

A naive serializer would recurse forever.

`pickle` handles cycles through memoization.

Conceptually:

```text
first time object is seen
    assign memo id
    serialize object contents

later reference to same object
    emit reference to memo id
```

This allows recursive graphs to be reconstructed.
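A quick sketch of cycle handling: pickling the self-referential list above terminates, and the reconstructed graph has the same cyclic shape:

```python
import pickle

x = []
x.append(x)                       # x contains itself

blob = pickle.dumps(x)            # memoization prevents infinite recursion
restored = pickle.loads(blob)

print(restored[0] is restored)    # True: the cycle is rebuilt
```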

## 98.4 Pickle

`pickle` is Python’s general object serialization protocol.

It stores instructions for rebuilding objects.

A pickle stream is not merely raw data. It is closer to a small stack-machine program.

Example:

```python
import pickle

data = {"x": [1, 2, 3]}
blob = pickle.dumps(data)
restored = pickle.loads(blob)
```

The pickle byte stream contains opcodes such as:

```text
create dict
create list
push constants
set item
memoize object
refer to memoized object
```

The unpickler executes these instructions to rebuild the graph.
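The opcode stream can be inspected with the standard `pickletools` module, which disassembles a pickle much as `dis` disassembles bytecode:

```python
import pickle
import pickletools

blob = pickle.dumps({"x": [1, 2, 3]})

# Prints one line per opcode: EMPTY_DICT, MEMOIZE, SHORT_BINUNICODE,
# EMPTY_LIST, APPENDS, SETITEM, STOP, and so on (names vary by protocol).
pickletools.dis(blob)
```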

## 98.5 Pickle Protocol Versions

Pickle has multiple protocol versions.

Newer protocols add features such as:

```text
more compact encodings
better bytes handling
large object support
out-of-band buffers
faster opcodes
```

The protocol version controls the serialized format.

Example:

```python
import pickle

blob = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
```

Higher protocols are usually more efficient, but may require newer Python versions to read.
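For protocols 2 and higher, the chosen protocol is visible at the start of the stream: a `PROTO` opcode byte (`0x80`) followed by the protocol number:

```python
import pickle

obj = {"x": 1}

for proto in range(2, pickle.HIGHEST_PROTOCOL + 1):
    blob = pickle.dumps(obj, protocol=proto)
    print(proto, blob[:2])        # b'\x80' followed by the protocol byte
```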

## 98.6 Pickle Memo Table

The memo table is central to pickle correctness.

It maps object identities to memo indexes.

Conceptually:

```text
id(obj) -> memo index
```

When pickling this:

```python
shared = []
obj = [shared, shared]
```

pickle records that both positions refer to the same object.

The result is reconstructed as:

```python
restored[0] is restored[1]
```

with result:

```text
True
```

Without a memo table, pickle would create two independent lists.
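Running the example confirms this, and the shared structure survives mutation after the round trip:

```python
import pickle

shared = []
obj = [shared, shared]

restored = pickle.loads(pickle.dumps(obj))
print(restored[0] is restored[1])     # True: one memoized object

# Because both slots alias one list, a change is visible through either.
restored[0].append(1)
print(restored[1])                    # [1]
```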

## 98.7 Pickle and Object Reconstruction

Pickle usually reconstructs objects by importing their class and then rebuilding state.

For a normal class:

```python
class User:
    def __init__(self, name):
        self.name = name
```

Pickle records enough information to locate the class:

```text
module name
qualified class name
object state
```

During unpickling, Python imports the module and looks up the class.

This means the class definition must be importable by the same name when loading.

If the class moves or is renamed, old pickles may fail.

## 98.8 `__getstate__` and `__setstate__`

Classes can customize pickling.

```python
class User:
    def __init__(self, name, cache):
        self.name = name
        self.cache = cache

    def __getstate__(self):
        return {"name": self.name}

    def __setstate__(self, state):
        self.name = state["name"]
        self.cache = {}
```

`__getstate__` returns serialized state.

`__setstate__` restores it.

This is useful when an object contains non-serializable fields:

```text
locks
open files
database connections
sockets
thread handles
caches
```

The serialized form should describe durable state, not live process resources.
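A runnable sketch of this pattern (the `Connection` class and its fields are illustrative, not a real API): the lock is dropped on save and recreated on load:

```python
import pickle
import threading

class Connection:
    def __init__(self, host):
        self.host = host
        self.lock = threading.Lock()      # live resource: not picklable

    def __getstate__(self):
        # Persist only durable state; drop the live resource.
        return {"host": self.host}

    def __setstate__(self, state):
        self.host = state["host"]
        self.lock = threading.Lock()      # recreate a fresh resource

conn = pickle.loads(pickle.dumps(Connection("db.local")))
print(conn.host)                          # 'db.local'
```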

## 98.9 `__reduce__`

The lower-level customization hook is `__reduce__`.

It tells pickle how to reconstruct an object.

A reduce tuple usually describes:

```text
callable
arguments
state
list items
dict items
```

Example shape:

```python
def __reduce__(self):
    return (User, (self.name,), self.__dict__)
```

Unpickling calls the reconstruction callable with arguments, then applies state.

This mechanism is powerful because it can represent objects that ordinary attribute copying cannot.

It is also dangerous because unpickling can call arbitrary callables.

## 98.10 Pickle Security

Pickle is not safe for untrusted data.

Unpickling can execute code.

A pickle stream may instruct the runtime to import a module and call a function during reconstruction.

Therefore:

```text
never unpickle data from an untrusted source
```

This is a hard security boundary.

Safe interchange formats include:

```text
JSON
MessagePack with careful schema validation
Protocol Buffers
Cap'n Proto
FlatBuffers
custom validated formats
```

Pickle is appropriate for trusted Python-to-Python persistence or transport, not hostile input.
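The danger is easy to demonstrate with a benign payload. The `__reduce__` hook below returns `print` as the reconstruction callable, so merely loading the bytes runs it; a hostile stream could name something like `os.system` instead:

```python
import pickle

class Payload:
    def __reduce__(self):
        # Any importable callable could appear here.
        return (print, ("side effect: this ran during unpickling",))

blob = pickle.dumps(Payload())

result = pickle.loads(blob)       # executes print(...) as a side effect
print(result)                     # None: the "object" is print's return value
```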

## 98.11 `marshal`

`marshal` is a CPython internal serialization format.

It is used for writing and reading code objects in `.pyc` files.

Example:

```python
import marshal

code = compile("x = 1", "<string>", "exec")
blob = marshal.dumps(code)
restored = marshal.loads(blob)
```

`marshal` can serialize certain built-in objects:

```text
None
booleans
integers
floats
strings
bytes
tuples
lists
dicts
sets
frozensets
code objects
```

But it is not a general object serialization system.

It does not preserve arbitrary class instances in the way pickle does.
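A short sketch of the boundary: marshal round-trips interpreter value types but rejects instances outright:

```python
import marshal

class User:
    pass

# Interpreter-internal value types round-trip.
blob = marshal.dumps((1, "a", b"b", frozenset({2})))
print(marshal.loads(blob))

# Arbitrary instances do not.
try:
    marshal.dumps(User())
except ValueError as exc:
    print(exc)                    # unmarshallable object
```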

## 98.12 Why `marshal` Exists

CPython needs a compact way to store compiled code objects.

When Python imports a source module, it may compile it and write a `.pyc` file.

A `.pyc` file contains:

```text
header
marshaled code object
```

The code object contains bytecode and metadata.

`marshal` exists because CPython needs to reload compiled code quickly.

It is optimized for interpreter internals, not long-term compatibility.

## 98.13 `.pyc` Files

A `.pyc` file stores cached bytecode.

Conceptually:

```text
magic number
flags
timestamp or source hash
source size or validation data
marshaled code object
```

The magic number identifies the bytecode format version.

If CPython changes bytecode or code object layout, the magic number changes.

This prevents the interpreter from loading incompatible cache files.
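The current interpreter's magic number is exposed as `importlib.util.MAGIC_NUMBER`: two version bytes followed by `b"\r\n"`, a trailer chosen so that text-mode corruption of a `.pyc` file is caught early:

```python
import importlib.util

magic = importlib.util.MAGIC_NUMBER

print(magic)          # 4 bytes; changes when the bytecode format changes
print(magic[-2:])     # b'\r\n'
```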

## 98.14 Code Object Serialization

A code object contains:

```text
bytecode
constants
names
local variable names
free variables
cell variables
filename
function name
line table
exception table
flags
stack size
argument metadata
```

When serialized into a `.pyc`, CPython stores this information using `marshal`.

The reconstructed code object can then be executed by the interpreter.

Example:

```python
source = "print(1 + 2)"
code = compile(source, "example.py", "exec")
exec(code)
```

The code object is the real executable unit, not the source string.
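The fields listed above are visible as `co_*` attributes on any code object:

```python
source = "x = 1\nprint(x + 2)"
code = compile(source, "example.py", "exec")

print(code.co_names)      # global names used: ('x', 'print')
print(code.co_consts)     # constants referenced by the bytecode
print(code.co_filename)   # 'example.py'
print(len(code.co_code))  # size of the raw bytecode in bytes
```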

## 98.15 Pickle vs Marshal

| Feature | `pickle` | `marshal` |
|---|---|---|
| Purpose | General Python object serialization | CPython internal code object storage |
| Public persistence format | Yes, with care | No |
| Handles user classes | Yes | No |
| Preserves shared references | Yes | Partially, version-dependent |
| Handles cycles | Yes | Not for general use |
| Version stability | Better protocol story | CPython-version dependent |
| Security risk | Unsafe for untrusted input | Also unsafe for untrusted input |
| Main use | Trusted object persistence | `.pyc` code cache |

Use `pickle` when you need Python object persistence.

Use `marshal` only when working with CPython internals or code objects.

## 98.16 JSON Serialization

JSON serializes a small portable data model:

```text
object
array
string
number
true
false
null
```

Python maps these roughly to:

```text
dict
list
str
int or float
True
False
None
```

Example:

```python
import json

blob = json.dumps({"name": "Ada", "age": 36})
data = json.loads(blob)
```

JSON does not represent:

```text
tuples
sets
object identity
cycles
custom classes
bytes
datetime values
NaN and Infinity without portability caveats
```

JSON is an interchange format, not a Python object graph format.
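The boundary is enforced with a `TypeError`, and widened only through explicit conversion hooks such as the `default` parameter (here `sorted` turns the set into a list, so the round trip is lossy):

```python
import json

record = {"tags": {"a", "b"}}

# Types outside the JSON data model are rejected.
try:
    json.dumps(record)
except TypeError as exc:
    print(exc)

# A default hook converts them explicitly; the set comes back as a list.
blob = json.dumps(record, default=sorted)
print(json.loads(blob))           # {'tags': ['a', 'b']}
```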

## 98.17 Copy Protocols

The `copy` module uses related object reconstruction protocols.

```python
import copy

x = [[1], [2]]
y = copy.deepcopy(x)
```

Deep copy must preserve graph structure similarly to pickle.

For example:

```python
shared = []
x = [shared, shared]
y = copy.deepcopy(x)

print(y[0] is y[1])
```

The output is:

```text
True
```

`deepcopy` uses a memo dictionary to avoid infinite recursion and preserve aliasing.

Classes can customize copying with methods such as:

```text
__copy__
__deepcopy__
__reduce__
__getstate__
__setstate__
```
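The memo dictionary makes `deepcopy` safe even for cyclic graphs:

```python
import copy

x = []
x.append(x)               # self-referential list

y = copy.deepcopy(x)      # the memo dict stops the recursion

print(y is x)             # False: a new graph
print(y[0] is y)          # True: with the same cyclic shape
```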

## 98.18 Multiprocessing Serialization

`multiprocessing` often serializes Python objects to send them between processes.

Because processes do not share ordinary Python heaps, arguments and results usually cross process boundaries as bytes.

Conceptually:

```text
parent process object
    ↓
pickle
    ↓
pipe or queue
    ↓
unpickle
    ↓
child process object
```

This explains why objects sent to worker processes must usually be pickleable.

Example:

```python
from multiprocessing import Pool

def square(x):
    return x * x

with Pool() as pool:
    print(pool.map(square, [1, 2, 3]))
```

The function and arguments must be serializable under the platform’s start method.

## 98.19 Start Methods and Serialization

`multiprocessing` behavior differs by start method.

Common start methods:

```text
fork
spawn
forkserver
```

With `fork`, the child initially inherits a copy of the parent process's memory from the operating system.

With `spawn`, the child starts fresh and imports the main module.

This makes serialization and importability more important under `spawn`.

A function passed to a spawned process must be importable by name.

Bad pattern:

```python
def main():
    def worker(x):
        return x * x
```

Nested functions are often not pickleable by the standard pickle mechanism.

Better:

```python
def worker(x):
    return x * x

def main():
    ...
```

Top-level functions are importable by module and name.
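A quick sketch of the difference. The exact exception raised for the nested function varies by Python version, so it is caught broadly here:

```python
import pickle

def worker(x):                    # top-level: importable by module and name
    return x * x

def make_worker():
    def inner(x):                 # nested: no importable name
        return x * x
    return inner

# The top-level function pickles as a module/name reference.
print(pickle.loads(pickle.dumps(worker))(3))      # 9

# The nested function cannot be located by name, so pickling fails.
try:
    pickle.dumps(make_worker())
except Exception as exc:
    print(type(exc).__name__)
```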

## 98.20 Serialization and Object Identity

Serialization usually creates new objects.

Example:

```python
import pickle

x = [1, 2, 3]
y = pickle.loads(pickle.dumps(x))

print(x is y)
```

The result is:

```text
False
```

The value is preserved, but identity changes.

For shared references inside the serialized graph, pickle can preserve identity relationships.

For references outside the serialized graph, identity is not preserved.

## 98.21 Serialization and Types

Pickle records references to types by module and qualified name.

Example:

```text
myapp.models.User
```

This means the unpickling environment must have:

```text
module myapp.models
attribute User
compatible class behavior
```

Changing package layout can break old serialized data.

For long-term persistence, it is often better to serialize explicit schema data:

```json
{
  "type": "user",
  "version": 1,
  "name": "Ada"
}
```

and write migration logic.

## 98.22 Versioning Serialized Data

Durable serialized data needs a version field.

Example:

```python
record = {
    "version": 2,
    "name": "Ada",
    "email": "ada@example.com",
}
```

Without versioning, format evolution becomes fragile.

Problems include:

```text
renamed fields
removed fields
changed types
new required fields
different class locations
```

A robust serializer treats the byte stream as a public data format, not as an incidental memory dump.
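The version field pays off in a small migration step at load time (an illustrative sketch; the field names are invented):

```python
def migrate(record):
    # Version 2 added an email field; upgrade version-1 records on read.
    if record["version"] == 1:
        record = {**record, "version": 2, "email": ""}
    return record

old = {"version": 1, "name": "Ada"}
print(migrate(old))       # {'version': 2, 'name': 'Ada', 'email': ''}
```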

## 98.23 Buffers and Large Data

Large binary data needs special handling.

Pickle protocol 5 introduced support for out-of-band buffers.

This is useful for objects such as:

```text
large arrays
memoryviews
binary tensors
image buffers
numeric matrices
```

The idea is to separate small object metadata from large raw buffers.

Conceptually:

```text
pickle stream
    object structure and metadata

external buffer
    large contiguous bytes
```

This avoids unnecessary copies in some systems.
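A minimal protocol 5 sketch: `buffer_callback` collects the large buffers at dump time, and the same buffers must be handed back, in order, at load time:

```python
import pickle

payload = bytearray(b"x" * 1024)          # stands in for a large buffer
buffers = []

# The raw bytes go to buffer_callback instead of into the pickle stream.
blob = pickle.dumps(pickle.PickleBuffer(payload),
                    protocol=5,
                    buffer_callback=buffers.append)

restored = pickle.loads(blob, buffers=buffers)
print(len(buffers), bytes(restored) == bytes(payload))
```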

## 98.24 The Buffer Protocol

The buffer protocol allows objects to expose raw memory to other objects.

Examples:

```text
bytes
bytearray
memoryview
array.array
NumPy arrays
```

Serialization systems can use buffers to avoid copying large data repeatedly.

Example:

```python
data = bytearray(b"abcdef")
view = memoryview(data)
```

The `memoryview` sees the same underlying bytes.

For serialization, the question becomes:

```text
copy the bytes
share the bytes
transfer ownership
reference external storage
```

Each option has different safety and lifetime rules.
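Sharing is the zero-copy option, and it is observable: writing through the view mutates the original storage:

```python
data = bytearray(b"abcdef")
view = memoryview(data)

view[0] = ord("z")        # writes through to the shared storage
print(data)               # bytearray(b'zbcdef')

sub = view[2:4]           # slicing a memoryview is also zero-copy
print(bytes(sub))         # b'cd'
```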

## 98.25 Serialization and Subinterpreters

Subinterpreters usually cannot share arbitrary Python objects directly.

Serialization gives a clean boundary:

```text
interpreter A object
    ↓
serialize
    ↓
bytes or shareable data
    ↓
deserialize
    ↓
interpreter B object
```

This avoids cross-interpreter object ownership problems.

However, serialization has costs:

```text
CPU time
memory copies
schema constraints
loss of identity
conversion errors
```

A future subinterpreter communication model needs careful choices between copying, sharing immutable data, and transferring buffers.

## 98.26 Serialization and Security Boundaries

Serialization format choice is a security decision.

| Format | Safe for untrusted input | Notes |
|---|---|---|
| `pickle` | No | Can execute code |
| `marshal` | No | Internal format, not validated for hostile input |
| `json` | Usually, with validation | Data only, but schemas still matter |
| `yaml` | Depends on loader | Unsafe loaders can construct objects |
| Protobuf | Generally safer | Schema-based |
| MessagePack | Usually, with validation | Data format, not object protocol |

Parsing untrusted data still requires resource limits.

Even safe formats can be abused through:

```text
huge inputs
deep nesting
large integers
compression bombs
schema confusion
memory exhaustion
```
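Deep nesting is a concrete example: even a tiny, syntactically valid input can exhaust the JSON parser's recursion budget:

```python
import json

hostile = "[" * 100_000 + "]" * 100_000   # a few hundred KB of brackets

try:
    json.loads(hostile)
except RecursionError:
    print("rejected: nesting too deep")
```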

## 98.27 Practical Rules

Use these rules:

```text
Use pickle only for trusted Python-to-Python data.

Do not use pickle as a network protocol for untrusted clients.

Use JSON or a schema format for public APIs.

Use marshal only for CPython internals, not application persistence.

Put version fields in durable serialized data.

Keep serialized data independent of class layout when long-term compatibility matters.

Avoid serializing live resources such as sockets, files, locks, and threads.

Use buffers for large binary payloads when copy cost matters.
```

## 98.28 Mental Model

Use this model:

```text
Python objects form graphs.

Serialization writes a representation of that graph.

Pickle preserves many Python-specific graph properties:
    shared references
    cycles
    custom reconstruction
    class-based objects

Marshal stores CPython internal values, especially code objects.

JSON stores portable data trees, not Python object graphs.

Long-term serialized data should use explicit schemas.

Unpickling is code execution and must be treated as unsafe for untrusted input.
```

## 98.29 Chapter Summary

Serialization in CPython spans public protocols and internal formats.

`pickle` serializes Python object graphs using opcodes, memoization, and reconstruction hooks.

`marshal` stores CPython internal values, especially code objects in `.pyc` files.

`json` serializes portable data trees and avoids Python-specific object identity.

The hard problems are:

```text
cycles
shared references
custom class reconstruction
versioning
security
large buffers
cross-process and cross-interpreter transfer
```

Serialization is not just data conversion. It defines a boundary between object identity, runtime state, storage, transport, and trust.
