This chapter covers pickle protocol versions, the __reduce__ / __reduce_ex__ protocol, the marshal format, and shelve internals.
Serialization is the process of converting Python objects into a representation that can be stored, transmitted, copied, or reconstructed later.
In CPython, serialization appears in several forms:
pickle
marshal
json
copy protocols
multiprocessing transfer
bytecode cache files
interpreter state exchange
These systems are related, but they have different goals.
The most important distinction is between general object serialization and CPython-internal serialization.
pickle is a public object serialization protocol. It is designed to represent many Python objects and reconstruct them later.
marshal is a lower-level CPython format used mainly for internal code object storage. It is not a general persistence format.
json is a text data interchange format. It represents a small set of portable data types.
This chapter focuses on how CPython thinks about serialization: object identity, type reconstruction, code object storage, recursive object graphs, extension hooks, security boundaries, and version stability.
98.1 Why Serialization Exists
Python objects live in process memory.
A list such as:
items = ["a", "b", "c"]
is not stored as a contiguous portable data record. It is a CPython object graph:
list object
pointer to string object "a"
pointer to string object "b"
pointer to string object "c"
Serialization converts that graph into bytes or text.
Conceptually:
object graph
↓
serializer
↓
byte stream
↓
deserializer
↓
new object graph
The reconstructed graph usually has equal values, but not the same object identities.
98.2 Object Graphs
Most Python values are not isolated objects.
Example:
a = []
b = [a, a]
The object graph is:
b
index 0 -> a
index 1 -> a
The same list object appears twice.
A correct serializer must decide whether to preserve aliasing.
If aliasing is preserved:
restored = deserialize(serialize(b))
restored[0] is restored[1]
should be:
True
For pickle, preserving shared references is part of the protocol.
For simpler formats such as JSON, aliasing is lost because JSON represents trees, not arbitrary object graphs.
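The difference can be checked directly with the standard library; a minimal sketch comparing pickle and json on a graph with aliasing:

```python
import json
import pickle

# Two references to the same list object.
a = []
b = [a, a]

# pickle preserves aliasing: both slots of the restored list are one object.
restored = pickle.loads(pickle.dumps(b))
print(restored[0] is restored[1])  # True

# JSON flattens the graph into a tree: the shared list is written out twice.
tree = json.loads(json.dumps(b))
print(tree[0] is tree[1])  # False
```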
98.3 Cycles
Python object graphs can contain cycles.
Example:
x = []
x.append(x)
This graph points back to itself:
x
index 0 -> x
A naive serializer would recurse forever.
pickle handles cycles through memoization.
Conceptually:
first time object is seen
assign memo id
serialize object contents
later reference to same object
emit reference to memo id
This allows recursive graphs to be reconstructed.
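A minimal sketch showing the memo at work on the cyclic list from above:

```python
import pickle

# A self-referential list: x contains itself.
x = []
x.append(x)

# The memo table lets pickle emit a back-reference instead of recursing forever.
y = pickle.loads(pickle.dumps(x))
print(y[0] is y)  # True: the cycle is rebuilt in the new graph
```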
98.4 Pickle
pickle is Python’s general object serialization protocol.
It stores instructions for rebuilding objects.
A pickle stream is not merely raw data. It is closer to a small stack-machine program.
Example:
import pickle
data = {"x": [1, 2, 3]}
blob = pickle.dumps(data)
restored = pickle.loads(blob)
The pickle byte stream contains opcodes such as:
create dict
create list
push constants
set item
memoize object
refer to memoized object
The unpickler executes these instructions to rebuild the graph.
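The opcode stream can be inspected with the standard pickletools module; a minimal sketch:

```python
import pickle
import pickletools

data = {"x": [1, 2, 3]}
blob = pickle.dumps(data)

# Disassemble the stream; expect opcodes such as EMPTY_DICT, EMPTY_LIST,
# MEMOIZE, and SETITEM (the exact opcodes vary by protocol version).
pickletools.dis(blob)
```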
98.5 Pickle Protocol Versions
Pickle has multiple protocol versions.
Newer protocols add features such as:
more compact encodings
better bytes handling
large object support
out-of-band buffers
faster opcodes
The protocol version controls the serialized format.
Example:
import pickle
blob = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)
Higher protocols are usually more efficient, but may require newer Python versions to read.
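The size difference between protocols can be measured directly; a minimal sketch:

```python
import pickle

obj = {"numbers": list(range(100))}

# Newer protocols are usually more compact, but need newer readers.
for proto in range(pickle.HIGHEST_PROTOCOL + 1):
    print(proto, len(pickle.dumps(obj, protocol=proto)))
```

Protocol 0, the oldest text-based format, is noticeably larger than the binary protocols for the same object.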
98.6 Pickle Memo Table
The memo table is central to pickle correctness.
It maps object identities to memo indexes.
Conceptually:
id(obj) -> memo index
When pickling this:
shared = []
obj = [shared, shared]
pickle records that both positions refer to the same object.
The result is reconstructed as:
restored[0] is restored[1]with result:
True
Without a memo table, pickle would create two independent lists.
98.7 Pickle and Object Reconstruction
Pickle usually reconstructs objects by importing their class and then rebuilding state.
For a normal class:
class User:
    def __init__(self, name):
        self.name = name
Pickle records enough information to locate the class:
module name
qualified class name
object state
During unpickling, Python imports the module and looks up the class.
This means the class definition must be importable by the same name when loading.
If the class moves or is renamed, old pickles may fail.
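The recorded location is visible on any class; a sketch using a hypothetical User class (pickling it requires the class to still be reachable under this module and name at load time):

```python
import pickle

class User:
    def __init__(self, name):
        self.name = name

u = User("Ada")

# pickle stores a reference to the class, not the class body.
print(User.__module__, User.__qualname__)

# Unpickling imports the module, looks up the class, and applies state.
restored = pickle.loads(pickle.dumps(u))
print(restored.name)  # Ada
```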
98.8 __getstate__ and __setstate__
Classes can customize pickling.
class User:
    def __init__(self, name, cache):
        self.name = name
        self.cache = cache

    def __getstate__(self):
        return {"name": self.name}

    def __setstate__(self, state):
        self.name = state["name"]
        self.cache = {}
__getstate__ returns serialized state.
__setstate__ restores it.
This is useful when an object contains non-serializable fields:
locks
open files
database connections
sockets
thread handles
caches
The serialized form should describe durable state, not live process resources.
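A runnable variant of the pattern, using a lock as the non-serializable field (a hypothetical Cache class for illustration):

```python
import pickle
import threading

class Cache:
    def __init__(self):
        self.data = {"hits": 0}
        self.lock = threading.Lock()   # a live resource; not pickleable

    def __getstate__(self):
        # Serialize only the durable state; drop the lock.
        return {"data": self.data}

    def __setstate__(self, state):
        self.data = state["data"]
        self.lock = threading.Lock()   # recreate the live resource on load

c = Cache()
restored = pickle.loads(pickle.dumps(c))   # would raise TypeError without the hooks
print(restored.data)  # {'hits': 0}
```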
98.9 __reduce__
The lower-level customization hook is __reduce__.
It tells pickle how to reconstruct an object.
A reduce tuple usually describes:
callable
arguments
state
list items
dict items
Example shape:
def __reduce__(self):
    return (User, (self.name,), self.__dict__)
Unpickling calls the reconstruction callable with arguments, then applies state.
This mechanism is powerful because it can represent objects that ordinary attribute copying cannot.
It is also dangerous because unpickling can call arbitrary callables.
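A minimal runnable sketch of the mechanism, using a hypothetical Point class:

```python
import pickle

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __reduce__(self):
        # (callable, args): the unpickler calls Point(self.x, self.y).
        return (Point, (self.x, self.y))

p = pickle.loads(pickle.dumps(Point(1, 2)))
print(p.x, p.y)  # 1 2
```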
98.10 Pickle Security
Pickle is not safe for untrusted data.
Unpickling can execute code.
A pickle stream may instruct the runtime to import a module and call a function during reconstruction.
Therefore:
never unpickle data from an untrusted source
This is a hard security boundary.
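A benign demonstration of why the boundary exists: the reduce tuple names an arbitrary callable, and loads() invokes it. Here the callable is list, but a hostile stream can name something far more dangerous.

```python
import pickle

class Innocent:
    def __reduce__(self):
        # The stream stores "call list('abc')", not the object's data.
        return (list, ("abc",))

blob = pickle.dumps(Innocent())
result = pickle.loads(blob)
print(result)  # ['a', 'b', 'c'] -- loads() executed the recorded call
```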
Safe interchange formats include:
JSON
MessagePack with careful schema validation
Protocol Buffers
Cap'n Proto
FlatBuffers
custom validated formats
Pickle is appropriate for trusted Python-to-Python persistence or transport, not hostile input.
98.11 marshal
marshal is a CPython internal serialization format.
It is used for writing and reading code objects in .pyc files.
Example:
import marshal
code = compile("x = 1", "<string>", "exec")
blob = marshal.dumps(code)
restored = marshal.loads(blob)
marshal can serialize certain built-in objects:
None
booleans
integers
floats
strings
bytes
tuples
lists
dicts
sets
frozensets
code objects
But it is not a general object serialization system.
It does not preserve arbitrary class instances in the way pickle does.
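The asymmetry is easy to observe: code objects round-trip, while a plain class instance is rejected. A minimal sketch:

```python
import marshal

# Code objects round-trip through marshal and remain executable.
code = compile("x = 1", "<demo>", "exec")
restored = marshal.loads(marshal.dumps(code))
ns = {}
exec(restored, ns)
print(ns["x"])  # 1

# Arbitrary class instances do not.
class User:
    pass

try:
    marshal.dumps(User())
except ValueError as exc:
    print("rejected:", exc)
```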
98.12 Why marshal Exists
CPython needs a compact way to store compiled code objects.
When Python imports a source module, it may compile it and write a .pyc file.
A .pyc file contains:
header
marshaled code object
The code object contains bytecode and metadata.
marshal exists because CPython needs to reload compiled code quickly.
It is optimized for interpreter internals, not long-term compatibility.
98.13 .pyc Files
A .pyc file stores cached bytecode.
Conceptually:
magic number
flags
timestamp or source hash
source size or validation data
marshaled code object
The magic number identifies the bytecode format version.
If CPython changes bytecode or code object layout, the magic number changes.
This prevents the interpreter from loading incompatible cache files.
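The header check can be reproduced with the standard library; a sketch that compiles a throwaway module and compares the first four bytes of its cache file to the running interpreter's magic number:

```python
import importlib.util
import pathlib
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = pathlib.Path(tmp) / "demo.py"
    src.write_text("x = 1\n")

    # py_compile writes the .pyc and returns its path.
    pyc_path = py_compile.compile(str(src))
    header = pathlib.Path(pyc_path).read_bytes()[:16]

# The first four bytes are the bytecode magic number for this interpreter.
print(header[:4] == importlib.util.MAGIC_NUMBER)  # True
```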
98.14 Code Object Serialization
A code object contains:
bytecode
constants
names
local variable names
free variables
cell variables
filename
function name
line table
exception table
flags
stack size
argument metadata
When serialized into a .pyc, CPython stores this information using marshal.
The reconstructed code object can then be executed by the interpreter.
Example:
source = "print(1 + 2)"
code = compile(source, "example.py", "exec")
exec(code)
The code object is the real executable unit, not the source string.
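Several of the fields listed above are visible as attributes of the compiled code object:

```python
code = compile("print(1 + 2)", "example.py", "exec")

# Fields that marshal must store and restore:
print(code.co_filename)   # example.py
print(code.co_names)      # ('print',) -- global names referenced
print(code.co_consts)     # constants; 1 + 2 is folded to 3 at compile time
print(code.co_stacksize)  # stack depth the interpreter must reserve
```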
98.15 Pickle vs Marshal
| Feature | pickle | marshal |
|---|---|---|
| Purpose | General Python object serialization | CPython internal code object storage |
| Public persistence format | Yes, with care | No |
| Handles user classes | Yes | No |
| Preserves shared references | Yes | Limited by format behavior |
| Handles cycles | Yes | Not for general use |
| Version stability | Better protocol story | CPython-version dependent |
| Security risk | Unsafe for untrusted input | Also unsafe for untrusted input |
| Main use | Trusted object persistence | .pyc code cache |
Use pickle when you need Python object persistence.
Use marshal only when working with CPython internals or code objects.
98.16 JSON Serialization
JSON serializes a small portable data model:
object
array
string
number
true
false
null
Python maps these roughly to:
dict
list
str
int or float
True
False
None
Example:
import json
blob = json.dumps({"name": "Ada", "age": 36})
data = json.loads(blob)
JSON does not represent:
tuples
sets
object identity
cycles
custom classes
bytes
datetime values
NaN portability without caveats
JSON is an interchange format, not a Python object graph format.
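These limits are visible at the API level; a minimal sketch:

```python
import json

# Tuples are written as arrays and come back as lists.
print(json.loads(json.dumps((1, 2))))  # [1, 2]

# Sets and bytes are rejected with TypeError.
for value in ({1, 2}, b"raw"):
    try:
        json.dumps(value)
    except TypeError as exc:
        print(type(value).__name__, "->", exc)
```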
98.17 Copy Protocols
The copy module uses related object reconstruction protocols.
import copy
x = [[1], [2]]
y = copy.deepcopy(x)
Deep copy must preserve graph structure similarly to pickle.
For example:
shared = []
x = [shared, shared]
y = copy.deepcopy(x)
print(y[0] is y[1])
The output is:
True
deepcopy uses a memo dictionary to avoid infinite recursion and preserve aliasing.
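The memo also makes cyclic structures safe to copy; a minimal sketch:

```python
import copy

# A cyclic list: without a memo, deepcopy would recurse forever.
x = []
x.append(x)

y = copy.deepcopy(x)
print(y is not x, y[0] is y)  # True True: a new object, with the cycle preserved
```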
Classes can customize copying with methods such as:
__copy__
__deepcopy__
__reduce__
__getstate__
__setstate__
98.18 Multiprocessing Serialization
multiprocessing often serializes Python objects to send them between processes.
Because processes do not share ordinary Python heaps, arguments and results usually cross process boundaries as bytes.
Conceptually:
parent process object
↓
pickle
↓
pipe or queue
↓
unpickle
↓
child process object
This explains why objects sent to worker processes must usually be pickleable.
Example:
from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == "__main__":
    with Pool() as pool:
        print(pool.map(square, [1, 2, 3]))
The function and arguments must be serializable under the platform’s start method. The __main__ guard is required on spawn-based platforms, where the child re-imports the main module.
98.19 Start Methods and Serialization
multiprocessing behavior differs by start method.
Common start methods:
fork
spawn
forkserver
With fork, the child initially inherits the parent process memory through the operating system.
With spawn, the child starts fresh and imports the main module.
This makes serialization and importability more important.
A function passed to a spawned process must be importable by name.
Bad pattern:
def main():
    def worker(x):
        return x * x
Nested functions are often not pickleable by the standard pickle mechanism.
Better:
def worker(x):
    return x * x

def main():
    ...
Top-level functions are importable by module and name.
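The failure mode is easy to trigger deliberately; a sketch (the exception type varies slightly across Python versions):

```python
import pickle

def worker(x):            # top-level: pickled by module and name
    return x * x

def main():
    def nested(x):        # local: has no importable name
        return x * x
    return nested

pickle.dumps(worker)       # fine
try:
    pickle.dumps(main())
except (pickle.PicklingError, AttributeError) as exc:
    print("cannot pickle nested function:", exc)
```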
98.20 Serialization and Object Identity
Serialization usually creates new objects.
Example:
import pickle
x = [1, 2, 3]
y = pickle.loads(pickle.dumps(x))
print(x is y)
The result is:
False
The value is preserved, but identity changes.
For shared references inside the serialized graph, pickle can preserve identity relationships.
For references outside the serialized graph, identity is not preserved.
98.21 Serialization and Types
Pickle records references to types by module and qualified name.
Example:
myapp.models.User
This means the unpickling environment must have:
module myapp.models
attribute User
compatible class behavior
Changing package layout can break old serialized data.
For long-term persistence, it is often better to serialize explicit schema data:
{
    "type": "user",
    "version": 1,
    "name": "Ada"
}
and write migration logic.
98.22 Versioning Serialized Data
Durable serialized data needs a version field.
Example:
record = {
    "version": 2,
    "name": "Ada",
    "email": "[email protected]",
}
Without versioning, format evolution becomes fragile.
Problems include:
renamed fields
removed fields
changed types
new required fields
different class locations
A robust serializer treats the byte stream as a public data format, not as an incidental memory dump.
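One common defense is an explicit migration step keyed on the version field; a sketch using hypothetical field names:

```python
def migrate(record):
    # Hypothetical upgrade: version 1 records predate the "email" field.
    if record["version"] == 1:
        record = {**record, "version": 2, "email": None}
    return record

old = {"version": 1, "name": "Ada"}
print(migrate(old))  # {'version': 2, 'name': 'Ada', 'email': None}
```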
98.23 Buffers and Large Data
Large binary data needs special handling.
Pickle protocol 5 introduced support for out-of-band buffers.
This is useful for objects such as:
large arrays
memoryviews
binary tensors
image buffers
numeric matrices
The idea is to separate small object metadata from large raw buffers.
Conceptually:
pickle stream
object structure and metadata
external buffer
large contiguous bytes
This avoids unnecessary copies in some systems.
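A minimal sketch of out-of-band buffers (assumes Python 3.8+, where protocol 5 and pickle.PickleBuffer are available):

```python
import pickle

payload = bytearray(b"x" * 10_000)   # stands in for a large binary buffer
buffers = []                          # collects the out-of-band buffers

# buffer_callback receives each PickleBuffer instead of copying it in-band,
# so the main stream stays small.
blob = pickle.dumps(pickle.PickleBuffer(payload),
                    protocol=5,
                    buffer_callback=buffers.append)

# The receiver supplies the buffers separately at load time.
restored = pickle.loads(blob, buffers=buffers)
print(len(blob), bytes(restored) == payload)
```

In a real system the buffers would travel over a zero-copy channel such as shared memory, while only the small metadata stream is copied.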
98.24 The Buffer Protocol
The buffer protocol allows objects to expose raw memory to other objects.
Examples:
bytes
bytearray
memoryview
array.array
NumPy arrays
Serialization systems can use buffers to avoid copying large data repeatedly.
Example:
data = bytearray(b"abcdef")
view = memoryview(data)
The memoryview sees the same underlying bytes.
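Mutation through the view shows that no copy was made:

```python
data = bytearray(b"abcdef")
view = memoryview(data)

# Writing through the view changes the same underlying bytes.
view[0] = ord("Z")
print(data)  # bytearray(b'Zbcdef')
```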
For serialization, the question becomes:
copy the bytes
share the bytes
transfer ownership
reference external storage
Each option has different safety and lifetime rules.
98.25 Serialization and Subinterpreters
Subinterpreters usually cannot share arbitrary Python objects directly.
Serialization gives a clean boundary:
interpreter A object
↓
serialize
↓
bytes or shareable data
↓
deserialize
↓
interpreter B object
This avoids cross-interpreter object ownership problems.
However, serialization has costs:
CPU time
memory copies
schema constraints
loss of identity
conversion errors
A future subinterpreter communication model needs careful choices between copying, sharing immutable data, and transferring buffers.
98.26 Serialization and Security Boundaries
Serialization format choice is a security decision.
| Format | Safe for untrusted input | Notes |
|---|---|---|
| pickle | No | Can execute code |
| marshal | No | Internal format, not validated for hostile input |
| json | Usually, with validation | Data only, but schemas still matter |
| yaml | Depends on loader | Unsafe loaders can construct objects |
| Protobuf | Generally safer | Schema-based |
| MessagePack | Usually, with validation | Data format, not object protocol |
Parsing untrusted data still requires resource limits.
Even safe formats can be abused through:
huge inputs
deep nesting
large integers
compression bombs
schema confusion
memory exhaustion
98.27 Practical Rules
Use these rules:
Use pickle only for trusted Python-to-Python data.
Do not use pickle as a network protocol for untrusted clients.
Use JSON or a schema format for public APIs.
Use marshal only for CPython internals, not application persistence.
Put version fields in durable serialized data.
Keep serialized data independent of class layout when long-term compatibility matters.
Avoid serializing live resources such as sockets, files, locks, and threads.
Use buffers for large binary payloads when copy cost matters.
98.28 Mental Model
Use this model:
Python objects form graphs.
Serialization writes a representation of that graph.
Pickle preserves many Python-specific graph properties:
shared references
cycles
custom reconstruction
class-based objects
Marshal stores CPython internal values, especially code objects.
JSON stores portable data trees, not Python object graphs.
Long-term serialized data should use explicit schemas.
Unpickling is code execution and must be treated as unsafe for untrusted input.
98.29 Chapter Summary
Serialization in CPython spans public protocols and internal formats.
pickle serializes Python object graphs using opcodes, memoization, and reconstruction hooks.
marshal stores CPython internal values, especially code objects in .pyc files.
json serializes portable data trees and avoids Python-specific object identity.
The hard problems are:
cycles
shared references
custom class reconstruction
versioning
security
large buffers
cross-process and cross-interpreter transfer
Serialization is not just data conversion. It defines a boundary between object identity, runtime state, storage, transport, and trust.