98. Serialization Internals

pickle protocol versions, the __reduce__ / __reduce_ex__ protocol, marshal format, and shelve internals.

Serialization is the process of converting Python objects into a representation that can be stored, transmitted, copied, or reconstructed later.

In CPython, serialization appears in several forms:

pickle
marshal
json
copy protocols
multiprocessing transfer
bytecode cache files
interpreter state exchange

These systems are related, but they have different goals.

The most important distinction is between general object serialization and CPython-internal serialization.

pickle is a public object serialization protocol. It is designed to represent many Python objects and reconstruct them later.

marshal is a lower-level CPython format used mainly for internal code object storage. It is not a general persistence format.

json is a text data interchange format. It represents a small set of portable data types.

This chapter focuses on how CPython thinks about serialization: object identity, type reconstruction, code object storage, recursive object graphs, extension hooks, security boundaries, and version stability.

98.1 Why Serialization Exists

Python objects live in process memory.

A list such as:

items = ["a", "b", "c"]

is not stored as a contiguous portable data record. It is a CPython object graph:

list object
    pointer to string object "a"
    pointer to string object "b"
    pointer to string object "c"

Serialization converts that graph into bytes or text.

Conceptually:

object graph
serializer
byte stream
deserializer
new object graph

The reconstructed graph usually has equal values, but not the same object identities.

98.2 Object Graphs

Most Python values are not isolated objects.

Example:

a = []
b = [a, a]

The object graph is:

b
    index 0 -> a
    index 1 -> a

The same list object appears twice.

A correct serializer must decide whether to preserve aliasing.

If aliasing is preserved:

restored = deserialize(serialize(b))
restored[0] is restored[1]

should be:

True

For pickle, preserving shared references is part of the protocol.

For simpler formats such as JSON, aliasing is lost because JSON represents trees, not arbitrary object graphs.

98.3 Cycles

Python object graphs can contain cycles.

Example:

x = []
x.append(x)

This graph points back to itself:

x
    index 0 -> x

A naive serializer would recurse forever.

pickle handles cycles through memoization.

Conceptually:

first time object is seen
    assign memo id
    serialize object contents

later reference to same object
    emit reference to memo id

This allows recursive graphs to be reconstructed.
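
A short round trip with the standard pickle module shows that the cycle survives:

import pickle

x = []
x.append(x)   # a list that contains itself

restored = pickle.loads(pickle.dumps(x))
print(restored[0] is restored)   # True: the memo lets the cycle be rebuilt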

98.4 Pickle

pickle is Python’s general object serialization protocol.

It stores instructions for rebuilding objects.

A pickle stream is not merely raw data. It is closer to a small stack-machine program.

Example:

import pickle

data = {"x": [1, 2, 3]}
blob = pickle.dumps(data)
restored = pickle.loads(blob)

The pickle byte stream contains opcodes such as:

create dict
create list
push constants
set item
memoize object
refer to memoized object

The unpickler executes these instructions to rebuild the graph.
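
The opcode stream can be inspected with the standard pickletools module; a small sketch:

import pickle
import pickletools

blob = pickle.dumps({"x": [1, 2, 3]})
pickletools.dis(blob)   # prints opcodes such as EMPTY_DICT, EMPTY_LIST, MEMOIZE, SETITEM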

98.5 Pickle Protocol Versions

Pickle has multiple protocol versions.

Newer protocols add features such as:

more compact encodings
better bytes handling
large object support
out-of-band buffers
faster opcodes

The protocol version controls the serialized format.

Example:

import pickle

obj = {"x": [1, 2, 3]}
blob = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

Higher protocols are usually more efficient, but may require newer Python versions to read.

98.6 Pickle Memo Table

The memo table is central to pickle correctness.

It maps object identities to memo indexes.

Conceptually:

id(obj) -> memo index

When pickling this:

import pickle

shared = []
obj = [shared, shared]
restored = pickle.loads(pickle.dumps(obj))

pickle records that both positions refer to the same object, so:

restored[0] is restored[1]

with result:

True

Without a memo table, pickle would create two independent lists.

98.7 Pickle and Object Reconstruction

Pickle usually reconstructs objects by importing their class and then rebuilding state.

For a normal class:

class User:
    def __init__(self, name):
        self.name = name

Pickle records enough information to locate the class:

module name
qualified class name
object state

During unpickling, Python imports the module and looks up the class.

This means the class definition must be importable by the same name when loading.

If the class moves or is renamed, old pickles may fail.

98.8 __getstate__ and __setstate__

Classes can customize pickling.

class User:
    def __init__(self, name, cache):
        self.name = name
        self.cache = cache

    def __getstate__(self):
        return {"name": self.name}

    def __setstate__(self, state):
        self.name = state["name"]
        self.cache = {}

__getstate__ returns the state to be pickled.

__setstate__ restores it.

This is useful when an object contains non-serializable fields:

locks
open files
database connections
sockets
thread handles
caches

The serialized form should describe durable state, not live process resources.
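
A usage sketch with the User class above: the cache is dropped when pickling and rebuilt empty when loading.

import pickle

u = User("Ada", cache={"expensive": "result"})
restored = pickle.loads(pickle.dumps(u))

print(restored.name)    # Ada
print(restored.cache)   # {} -- recreated by __setstate__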

98.9 __reduce__

The lower-level customization hook is __reduce__.

It tells pickle how to reconstruct an object.

A reduce tuple usually describes:

callable
arguments
state
list items
dict items

Example shape:

def __reduce__(self):
    return (User, (self.name,), self.__dict__)

Unpickling calls the reconstruction callable with arguments, then applies state.
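
A minimal runnable sketch of that flow, assuming a simple User class:

import pickle

class User:
    def __init__(self, name):
        self.name = name

    def __reduce__(self):
        # Rebuild by calling User(self.name); remaining attributes come from the state dict.
        return (User, (self.name,), self.__dict__)

u = pickle.loads(pickle.dumps(User("Ada")))
print(u.name)   # Ada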

This mechanism is powerful because it can represent objects that ordinary attribute copying cannot.

It is also dangerous because unpickling can call arbitrary callables.

98.10 Pickle Security

Pickle is not safe for untrusted data.

Unpickling can execute code.

A pickle stream may instruct the runtime to import a module and call a function during reconstruction.

Therefore:

never unpickle data from an untrusted source

This is a hard security boundary.
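
A harmless demonstration of the mechanism: the stream below names a callable (here print), and loading it performs the call. A hostile stream could name any importable callable instead.

import pickle

class Payload:
    def __reduce__(self):
        return (print, ("this ran during unpickling",))

blob = pickle.dumps(Payload())
pickle.loads(blob)   # prints the message: loading the stream executed a call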

Safe interchange formats include:

JSON
MessagePack with careful schema validation
Protocol Buffers
Cap'n Proto
FlatBuffers
custom validated formats

Pickle is appropriate for trusted Python-to-Python persistence or transport, not hostile input.

98.11 marshal

marshal is a CPython internal serialization format.

It is used for writing and reading code objects in .pyc files.

Example:

import marshal

code = compile("x = 1", "<string>", "exec")
blob = marshal.dumps(code)
restored = marshal.loads(blob)

marshal can serialize certain built-in objects:

None
booleans
integers
floats
strings
bytes
tuples
lists
dicts
sets
frozensets
code objects

But it is not a general object serialization system.

It does not preserve arbitrary class instances in the way pickle does.

98.12 Why marshal Exists

CPython needs a compact way to store compiled code objects.

When Python imports a source module, it may compile it and write a .pyc file.

A .pyc file contains:

header
marshaled code object

The code object contains bytecode and metadata.

marshal exists because CPython needs to reload compiled code quickly.

It is optimized for interpreter internals, not long-term compatibility.

98.13 .pyc Files

A .pyc file stores cached bytecode.

Conceptually:

magic number
flags
timestamp or source hash
source size or validation data
marshaled code object

The magic number identifies the bytecode format version.

If CPython changes bytecode or code object layout, the magic number changes.

This prevents the interpreter from loading incompatible cache files.
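
The interpreter exposes its magic number and cache-file naming through importlib; a small sketch (the cache tag in the printed path depends on the CPython version):

import importlib.util

print(importlib.util.MAGIC_NUMBER)                     # first 4 bytes of every .pyc this interpreter writes
print(importlib.util.cache_from_source("example.py"))  # e.g. __pycache__/example.cpython-XY.pyc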

98.14 Code Object Serialization

A code object contains:

bytecode
constants
names
local variable names
free variables
cell variables
filename
function name
line table
exception table
flags
stack size
argument metadata

When serialized into a .pyc, CPython stores this information using marshal.

The reconstructed code object can then be executed by the interpreter.

Example:

source = "print(1 + 2)"
code = compile(source, "example.py", "exec")
exec(code)

The code object is the real executable unit, not the source string.
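
Most of these fields are visible as attributes on the code object; a quick inspection sketch:

code = compile("x = 1 + 2", "<string>", "exec")

print(code.co_names)      # global names assigned or referenced: ('x',)
print(code.co_consts)     # constants available to the bytecode
print(code.co_filename)   # '<string>'
print(code.co_stacksize)  # maximum value-stack depth the bytecode needs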

98.15 Pickle vs Marshal

Feature                       pickle                                 marshal
Purpose                       General Python object serialization    CPython internal code object storage
Public persistence format     Yes, with care                         No
Handles user classes          Yes                                    No
Preserves shared references   Yes                                    Limited by format behavior
Handles cycles                Yes                                    Not for general use
Version stability             Better protocol story                  CPython-version dependent
Security risk                 Unsafe for untrusted input             Also unsafe for untrusted input
Main use                      Trusted object persistence             .pyc code cache

Use pickle when you need Python object persistence.

Use marshal only when working with CPython internals or code objects.

98.16 JSON Serialization

JSON serializes a small portable data model:

object
array
string
number
true
false
null

Python maps these roughly to:

dict
list
str
int or float
True
False
None

Example:

import json

blob = json.dumps({"name": "Ada", "age": 36})
data = json.loads(blob)

JSON does not represent:

tuples
sets
object identity
cycles
custom classes
bytes
datetime values
NaN and Infinity without portability caveats

JSON is an interchange format, not a Python object graph format.
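
Two small consequences of that data model: tuples come back as lists, and sets cannot be encoded at all.

import json

print(json.loads(json.dumps((1, 2, 3))))   # [1, 2, 3]

try:
    json.dumps({1, 2, 3})
except TypeError as exc:
    print(exc)                             # Object of type set is not JSON serializable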

98.17 Copy Protocols

The copy module uses related object reconstruction protocols.

import copy

x = [[1], [2]]
y = copy.deepcopy(x)

Deep copy must preserve graph structure similarly to pickle.

For example:

shared = []
x = [shared, shared]
y = copy.deepcopy(x)

print(y[0] is y[1])

The output is:

True

deepcopy uses a memo dictionary to avoid infinite recursion and preserve aliasing.

Classes can customize copying with methods such as:

__copy__
__deepcopy__
__reduce__
__getstate__
__setstate__
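
A minimal __deepcopy__ sketch for a hypothetical Node class. The important detail is registering the new object in the memo before copying children, so references back to self resolve to the copy.

import copy

class Node:
    def __init__(self, payload, neighbors=None):
        self.payload = payload
        self.neighbors = neighbors if neighbors is not None else []

    def __deepcopy__(self, memo):
        new = type(self).__new__(type(self))
        memo[id(self)] = new   # register first so cycles back to self find the copy
        new.payload = copy.deepcopy(self.payload, memo)
        new.neighbors = copy.deepcopy(self.neighbors, memo)
        return new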

98.18 Multiprocessing Serialization

multiprocessing often serializes Python objects to send them between processes.

Because processes do not share ordinary Python heaps, arguments and results usually cross process boundaries as bytes.

Conceptually:

parent process object
pickle
pipe or queue
unpickle
child process object

This explains why objects sent to worker processes must usually be pickleable.

Example:

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":   # required when the start method re-imports this module
    with Pool() as pool:
        print(pool.map(square, [1, 2, 3]))

The function and arguments must be serializable under the platform’s start method.

98.19 Start Methods and Serialization

multiprocessing behavior differs by start method.

Common start methods:

fork
spawn
forkserver

With fork, the child initially inherits the parent process memory through the operating system.

With spawn, the child starts fresh and imports the main module.

This makes serialization and importability more important.

A function passed to a spawned process must be importable by name.

Bad pattern:

def main():
    def worker(x):
        return x * x

Nested functions are often not pickleable by the standard pickle mechanism.

Better:

def worker(x):
    return x * x

def main():
    ...

if __name__ == "__main__":
    main()

Top-level functions are importable by module and name.

98.20 Serialization and Object Identity

Serialization usually creates new objects.

Example:

import pickle

x = [1, 2, 3]
y = pickle.loads(pickle.dumps(x))

print(x is y)

The result is:

False

The value is preserved, but identity changes.

For shared references inside the serialized graph, pickle can preserve identity relationships.

For references outside the serialized graph, identity is not preserved.

98.21 Serialization and Types

Pickle records references to types by module and qualified name.

Example:

myapp.models.User

This means the unpickling environment must have:

module myapp.models
attribute User
compatible class behavior

Changing package layout can break old serialized data.

For long-term persistence, it is often better to serialize explicit schema data:

{
  "type": "user",
  "version": 1,
  "name": "Ada"
}

and write migration logic.

98.22 Versioning Serialized Data

Durable serialized data needs a version field.

Example:

record = {
    "version": 2,
    "name": "Ada",
    "email": "[email protected]",
}

Without versioning, format evolution becomes fragile.

Problems include:

renamed fields
removed fields
changed types
new required fields
different class locations

A robust serializer treats the byte stream as a public data format, not as an incidental memory dump.
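
A hypothetical migration sketch: load_record upgrades older records to the current shape before the application sees them.

def load_record(record):
    version = record.get("version", 1)
    if version == 1:
        # Version 1 records had no email field; add a default and bump the version.
        record = {**record, "email": None, "version": 2}
    return record

print(load_record({"version": 1, "name": "Ada"}))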

98.23 Buffers and Large Data

Large binary data needs special handling.

Pickle protocol 5 introduced support for out-of-band buffers.

This is useful for objects such as:

large arrays
memoryviews
binary tensors
image buffers
numeric matrices

The idea is to separate small object metadata from large raw buffers.

Conceptually:

pickle stream
    object structure and metadata

external buffer
    large contiguous bytes

This avoids unnecessary copies in some systems.
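
A minimal sketch of the protocol 5 API: pickle.PickleBuffer marks data as eligible for out-of-band handling, buffer_callback collects the buffers on the way out, and loads receives them again through the buffers argument.

import pickle

payload = bytearray(b"x" * 1_000_000)   # stands in for a large binary buffer
buffers = []

blob = pickle.dumps(pickle.PickleBuffer(payload), protocol=5,
                    buffer_callback=buffers.append)

restored = pickle.loads(blob, buffers=buffers)
print(len(blob))                   # small: the megabyte payload is not inline
print(bytes(restored) == payload)  # True: the data travelled out of band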

98.24 The Buffer Protocol

The buffer protocol allows objects to expose raw memory to other objects.

Examples:

bytes
bytearray
memoryview
array.array
NumPy arrays

Serialization systems can use buffers to avoid copying large data repeatedly.

Example:

data = bytearray(b"abcdef")
view = memoryview(data)
view[0] = ord("z")   # writes through to the underlying bytearray

The memoryview sees the same underlying bytes, so data is now bytearray(b"zbcdef").

For serialization, the question becomes:

copy the bytes
share the bytes
transfer ownership
reference external storage

Each option has different safety and lifetime rules.

98.25 Serialization and Subinterpreters

Subinterpreters usually cannot share arbitrary Python objects directly.

Serialization gives a clean boundary:

interpreter A object
serialize
bytes or shareable data
deserialize
interpreter B object

This avoids cross-interpreter object ownership problems.

However, serialization has costs:

CPU time
memory copies
schema constraints
loss of identity
conversion errors

A future subinterpreter communication model needs careful choices between copying, sharing immutable data, and transferring buffers.

98.26 Serialization and Security Boundaries

Serialization format choice is a security decision.

Format        Safe for untrusted input    Notes
pickle        No                          Can execute code
marshal       No                          Internal format, not validated for hostile input
json          Usually, with validation    Data only, but schemas still matter
yaml          Depends on loader           Unsafe loaders can construct objects
Protobuf      Generally safer             Schema-based
MessagePack   Usually, with validation    Data format, not object protocol

Parsing untrusted data still requires resource limits.

Even safe formats can be abused through:

huge inputs
deep nesting
large integers
compression bombs
schema confusion
memory exhaustion

98.27 Practical Rules

Use these rules:

Use pickle only for trusted Python-to-Python data.

Do not use pickle as a network protocol for untrusted clients.

Use JSON or a schema format for public APIs.

Use marshal only for CPython internals, not application persistence.

Put version fields in durable serialized data.

Keep serialized data independent of class layout when long-term compatibility matters.

Avoid serializing live resources such as sockets, files, locks, and threads.

Use buffers for large binary payloads when copy cost matters.

98.28 Mental Model

Use this model:

Python objects form graphs.

Serialization writes a representation of that graph.

Pickle preserves many Python-specific graph properties:
    shared references
    cycles
    custom reconstruction
    class-based objects

Marshal stores CPython internal values, especially code objects.

JSON stores portable data trees, not Python object graphs.

Long-term serialized data should use explicit schemas.

Unpickling is code execution and must be treated as unsafe for untrusted input.

98.29 Chapter Summary

Serialization in CPython spans public protocols and internal formats.

pickle serializes Python object graphs using opcodes, memoization, and reconstruction hooks.

marshal stores CPython internal values, especially code objects in .pyc files.

json serializes portable data trees and avoids Python-specific object identity.

The hard problems are:

cycles
shared references
custom class reconstruction
versioning
security
large buffers
cross-process and cross-interpreter transfer

Serialization is not just data conversion. It defines a boundary between object identity, runtime state, storage, transport, and trust.