PyUnicodeObject internal encodings, the interning table, bytes vs. bytearray, and the codec infrastructure.
Text and binary data are separate object families in Python. str represents Unicode text. bytes represents immutable binary data. bytearray represents mutable binary data.
This separation is one of Python 3’s most important runtime design choices. Text has characters and encodings. Binary data has bytes. CPython implements these concepts with different object layouts, APIs, and invariants.
14.1 Text vs Binary Data
A string is text:
s = "hello"
A bytes object is binary data:
b = b"hello"
They may look similar for ASCII content, but they are different types.
print(type("hello")) # <class 'str'>
print(type(b"hello")) # <class 'bytes'>
Python does not implicitly mix them:
"hello" + b"world" # TypeError
This is deliberate. Combining text and bytes requires an encoding decision.
text = "hello"
data = text.encode("utf-8")
again = data.decode("utf-8")
The conversion boundary is explicit.
14.2 str Represents Unicode Text
A Python str is a sequence of Unicode code points.
s = "Python 🐍"
print(len(s))
len(s) counts code points, not encoded bytes.
s = "é"
print(len(s)) # 1
print(len(s.encode("utf-8"))) # 2
The string contains one Unicode code point. Its UTF-8 encoding uses two bytes.
This distinction appears throughout CPython internals:
str
    logical Unicode text
bytes
    encoded or arbitrary binary data
14.3 Unicode Code Points
A Unicode code point is an integer value assigned by the Unicode standard.
Examples:
| Character | Code point |
|---|---|
| A | U+0041 |
| é | U+00E9 |
| 中 | U+4E2D |
| 🐍 | U+1F40D |
Python exposes this through ord and chr.
print(ord("A")) # 65
print(hex(ord("🐍"))) # 0x1f40d
print(chr(0x1f40d)) # 🐍
CPython does not store every string internally as UTF-8 bytes. It chooses an internal layout optimized for the string’s contents.
14.4 Flexible String Representation
CPython uses a flexible internal representation for Unicode strings.
The key idea is simple:
Use the smallest fixed-width storage that can represent every code point in the string.
Conceptually:
| Maximum code point in string | Storage width |
|---|---|
| ASCII only | 1 byte per character |
| Up to U+00FF | 1 byte per character |
| Up to U+FFFF | 2 bytes per character |
| Above U+FFFF | 4 bytes per character |
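The storage widths in this table can be observed from Python by measuring how much each extra character adds to a string's size. This is a minimal sketch that assumes CPython, where sys.getsizeof includes the inline character buffer. The helper name bytes_per_char is made up for illustration.

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate storage width from two strings whose lengths differ by n."""
    small = ch * n
    large = ch * (2 * n)
    # The fixed object header cancels out in the difference (CPython-specific).
    return (sys.getsizeof(large) - sys.getsizeof(small)) // n

for ch in ["a", "é", "中", "🐍"]:
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch)} byte(s) per character")
```

On CPython 3.3 and later this should report 1, 1, 2, and 4 bytes per character. Other implementations may store strings differently.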
Example:
"abc" # compact ASCII path
"café" # fits in 1-byte (Latin-1) storage
"中" # needs 2-byte storage
"🐍" # needs 4-byte storage
This design avoids wasting four bytes per character for common ASCII-heavy text while still supporting all Unicode code points.
14.5 String Object Metadata
A CPython string stores more than character data.
Conceptual fields include:
object header
length
hash cache
state flags
kind
compact flag
ASCII flag
ready flag
character data
optional UTF-8 cache
The exact C layout is version-specific, but the invariants matter more than the field names.
Important metadata:
| Metadata | Purpose |
|---|---|
| Length | Number of Unicode code points |
| Hash cache | Stores hash after first computation |
| Kind | Storage width |
| ASCII flag | Fast path for ASCII strings |
| Compact flag | Whether data is stored close to object header |
| UTF-8 cache | Cached encoded form for C API use |
Strings are immutable, so cached metadata is safe. Once computed, the hash remains valid.
14.6 String Immutability
Python strings are immutable.
s = "hello"
s[0] = "H" # TypeError
Operations create new strings:
s = "hello"
t = "H" + s[1:]
print(s) # hello
print(t) # Hello
Immutability gives CPython several advantages:
hash values can be cached
strings can be safely interned
strings can be shared across dictionaries
strings can be used as dict keys
substring operations cannot mutate originals
The cost is that repeated string concatenation can allocate many intermediate strings if written poorly.
14.7 String Hashing
Strings are hashable.
hash("name")
A string’s hash is computed from its contents. Because strings are immutable, CPython can cache the result inside the string object.
This matters because strings are used heavily as dictionary keys:
module globals
object attribute names
class dictionaries
keyword argument names
JSON-like data
configuration maps
protocol field names
Without cached string hashes, attribute lookup and dictionary lookup would be more expensive.
14.8 String Interning
Interning means reusing one string object for equal string values in selected cases.
a = "identifier"
b = "identifier"
print(a is b) # may be True
CPython interns many identifier-like strings because they are common in attribute lookup and namespaces.
Interning is useful for:
attribute names
variable names
keyword names
module names
common internal strings
With interned strings, many internal equality checks can succeed on a pointer comparison alone, without comparing string contents.
But user code should not depend on arbitrary string identity.
Correct:
if a == b:
    ...
Incorrect:
if a is b:
    ...
Use is only for identity semantics, especially documented singletons such as None.
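When code genuinely needs one shared object per value, sys.intern requests it explicitly. The sketch below builds two equal strings at runtime (the value "user_name" is arbitrary) and shows that equality holds regardless of identity, while the interned results are the same object.

```python
import sys

# Build two equal strings at runtime so the compiler cannot share one constant.
a = "".join(["user_", "name"])
b = "".join(["user_", "name"])

print(a == b)    # True: same contents
# Whether "a is b" holds is an implementation detail; do not rely on it.

# sys.intern returns the canonical object for a given string value.
ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)  # True
```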
14.9 Encoding
Encoding converts text to bytes.
s = "café"
b = s.encode("utf-8")
print(b) # b'caf\xc3\xa9'
The string is Unicode text. UTF-8 is one external byte representation.
Common encodings:
| Encoding | Use |
|---|---|
| UTF-8 | Web, files, APIs, Unix systems |
| UTF-16 | Some platform APIs and file formats |
| UTF-32 | Fixed-width Unicode storage |
| Latin-1 | Legacy Western byte mapping |
| ASCII | 7-bit English subset |
CPython does not treat encoding as a property of a str. A string is decoded text. Encoding is used when crossing a byte boundary.
14.10 Decoding
Decoding converts bytes to text.
b = b"caf\xc3\xa9"
s = b.decode("utf-8")
print(s) # café
If the bytes are invalid for the selected encoding, decoding fails unless an error handler is used.
b = b"\xff"
b.decode("utf-8") # UnicodeDecodeError
b.decode("utf-8", errors="ignore")
b.decode("utf-8", errors="replace")
Error handlers include:
strict
ignore
replace
backslashreplace
surrogateescape
surrogatepass
Encoding and decoding errors arise at the text boundary, not in ordinary string operations.
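To make the difference between handlers concrete, the following sketch decodes the same invalid input with several of them. The sample bytes are arbitrary. Results are printed with the ascii conversion so that lone surrogates do not have to be written to the terminal.

```python
raw = b"caf\xff"  # 0xFF never appears in valid UTF-8

for handler in ("ignore", "replace", "backslashreplace", "surrogateescape"):
    decoded = raw.decode("utf-8", errors=handler)
    print(f"{handler:17} -> {decoded!a}")

# surrogateescape maps the bad byte to a lone surrogate that encodes back
# to the original byte when the same handler is used.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")
print(restored == raw)  # True
```

Only surrogateescape preserves the original byte. ignore and replace lose information, and backslashreplace keeps it only as visible escape text.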
14.11 UTF-8 Boundary
Most modern external text is UTF-8.
Examples:
source files
JSON
HTTP payloads
HTML
Markdown
logs
configuration files
database text protocols
A practical rule:
inside Python
    use str
at file, network, process, and binary protocol boundaries
    encode or decode explicitly
Example:
from pathlib import Path
text = Path("notes.txt").read_text(encoding="utf-8")
data = text.encode("utf-8")
This keeps the boundary clear.
14.12 Source Code Encoding
Python source files are decoded before parsing.
The default source encoding is UTF-8 unless an encoding declaration says otherwise.
Example:
# -*- coding: utf-8 -*-
name = "café"
The tokenizer works on decoded source text. String literals then become Python str objects unless they are bytes literals.
s = "abc" # str
b = b"abc" # bytes
This distinction is made during parsing and compilation.
14.13 String Literals
Python has several string literal forms.
"hello"
'hello'
"""hello"""
'''hello'''
r"\n"
f"value = {x}"
Literal prefixes affect parsing:
| Prefix | Meaning |
|---|---|
| r | Raw string literal |
| f | Formatted string literal |
| b | Bytes literal |
| u | Historical compatibility prefix |
| combinations | fr, rf, br, rb |
A raw string changes how escapes are interpreted by the parser.
s = r"\n"
print(s) # \n
It still creates a normal str.
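A short comparison may help here. The variable names raw and cooked are only illustrative. Both literals produce the same type, and only the resulting contents differ.

```python
raw = r"\n"      # two characters: a backslash and the letter n
cooked = "\n"    # one character: a line feed

print(type(raw) is type(cooked))  # True
print(len(raw), len(cooked))      # 2 1
print(raw == "\\n")               # True
```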
14.14 Bytes Objects
bytes represents immutable binary data.
b = b"hello"
A bytes object is a sequence of integers in the range 0 to 255.
b = b"ABC"
print(b[0]) # 65
print(b[1]) # 66
Slicing returns another bytes object:
print(b[1:]) # b'BC'
Because bytes is immutable, it is hashable.
d = {b"key": "value"}
Bytes are useful for:
network protocols
binary files
cryptographic data
compressed data
encoded text
database wire formats
image and audio formats
14.15 Bytes Object Layout
A bytes object is variable-sized.
Conceptually:
PyBytesObject
    PyVarObject header
        ob_size = number of bytes
    hash cache
    byte data
    trailing NUL byte for C compatibility
The trailing NUL byte helps when passing data to C APIs that expect C strings, but bytes may contain embedded NUL bytes:
b = b"a\0b"
print(len(b)) # 3
So bytes must not be treated as ordinary NUL-terminated strings unless the API specifically permits it and the content is known safe.
14.16 Bytearray Objects
bytearray is mutable binary data.
buf = bytearray(b"hello")
buf[0] = ord("H")
print(buf) # bytearray(b'Hello')
A bytearray supports in-place modification:
buf.append(33)
buf.extend(b" world")
It is not hashable because its contents can change.
hash(bytearray(b"abc")) # TypeError
Conceptually, bytearray is closer to a mutable list of bytes, but implemented as a compact byte buffer rather than an array of Python integer object references.
14.17 Bytes vs List of Ints
Compare:
b = b"abc"
xs = [97, 98, 99]
The bytes object stores raw bytes compactly.
The list stores references to Python integer objects.
Conceptually:
bytes
    [97][98][99]
list
    [ptr][ptr][ptr]
      |    |    |
      v    v    v
     int  int  int
For binary data, bytes and bytearray are far more memory efficient.
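A rough measurement illustrates the gap. This sketch assumes CPython, and the exact numbers depend on the platform and version. Note that sys.getsizeof on a list counts only the pointer array, not the integer objects it refers to.

```python
import sys

n = 1000
as_bytes = bytes(n)   # n zero bytes in one contiguous buffer
as_list = [0] * n     # n pointers to int objects

print(sys.getsizeof(as_bytes))  # roughly n plus a small header
print(sys.getsizeof(as_list))   # roughly n pointers plus a header
```

On a typical 64-bit build the list is about eight times larger, before counting the integers themselves.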
14.18 Memoryview
memoryview exposes the buffer of another object.
buf = bytearray(b"hello")
view = memoryview(buf)
view[0] = ord("H")
print(buf) # bytearray(b'Hello')
A memoryview avoids copying.
It is useful when slicing or passing binary data between APIs:
data = bytearray(b"abcdef")
view = memoryview(data)[2:5]
print(view.tobytes()) # b'cde'
The view keeps the exporter alive and enforces buffer lifetime rules.
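The difference between a copying slice and a view slice shows up as soon as something is written. In this sketch, a write through the sliced view reaches the underlying bytearray, while a copy taken earlier is unaffected.

```python
data = bytearray(b"abcdef")

copy = bytes(data[2:5])        # slicing a bytearray copies the bytes
view = memoryview(data)[2:5]   # slicing a memoryview shares the buffer

view[0] = ord("X")             # writes through to data

print(data)            # bytearray(b'abXdef')
print(copy)            # b'cde'
print(view.tobytes())  # b'Xde'
```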
14.19 Buffer Protocol
The buffer protocol is the C-level protocol behind memoryview.
It lets objects expose raw memory to other objects.
Common buffer exporters:
bytes
bytearray
array.array
memoryview
mmap
third-party arrays such as NumPy arrays
custom extension objects
The protocol describes:
pointer to memory
length
item size
format
number of dimensions
shape
strides
readonly flag
lifetime callbacks
This makes zero-copy binary processing possible.
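Most of these fields are visible from Python as memoryview attributes. The sketch below inspects a small array.array of C doubles. The sample values are arbitrary, and the item size is platform-defined, so the comments avoid hard-coding it.

```python
from array import array

values = array("d", [1.0, 2.0, 3.0])  # C doubles in one contiguous buffer
view = memoryview(values)

print(view.format)    # 'd'
print(view.itemsize)  # size of a C double, typically 8
print(view.ndim)      # 1
print(view.shape)     # (3,)
print(view.strides)   # one stride equal to itemsize
print(view.readonly)  # False
print(view.nbytes)    # itemsize * 3

print(memoryview(b"abc").readonly)  # True: bytes is immutable
```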
14.20 Text I/O vs Binary I/O
Files can be opened in text mode or binary mode.
Text mode decodes bytes into strings:
with open("notes.txt", "r", encoding="utf-8") as f:
    text = f.read()
Binary mode returns bytes:
with open("notes.txt", "rb") as f:
    data = f.read()
Text mode handles encoding, decoding, and newline translation.
Binary mode gives raw bytes.
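The two modes can be compared on the same file. This sketch writes a temporary file (the path comes from tempfile, so nothing here refers to a real notes.txt) and reads it back both ways. Passing newline="\n" on write is a choice made here to keep the stored bytes identical across platforms.

```python
import os
import tempfile

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Text mode encodes str to UTF-8; newline="\n" disables newline translation.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("café\n")

    with open(path, "r", encoding="utf-8") as f:
        text = f.read()   # str, decoded from UTF-8

    with open(path, "rb") as f:
        data = f.read()   # bytes, exactly as stored on disk
finally:
    os.remove(path)

print(repr(text))  # 'café\n'
print(data)        # b'caf\xc3\xa9\n'
```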
Choose based on the data model:
human-readable text
    text mode, str
protocol bytes or exact file bytes
    binary mode, bytes
14.21 Common Boundary Bug
A common bug is mixing text and bytes at boundaries.
Incorrect:
def send(sock, message):
    sock.sendall(message) # fails if message is str
Correct:
def send(sock, message):
    data = message.encode("utf-8")
    sock.sendall(data)
For receiving:
data = sock.recv(4096)
text = data.decode("utf-8")
Keep the conversion explicit so the encoding is visible.
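One caveat on the receiving side: a single recv call may return a chunk that ends in the middle of a multi-byte character, and decoding that chunk alone would fail. A hedged sketch of one way to handle this uses an incremental decoder from the codecs module. The chunks list below is a made-up stand-in for successive recv results.

```python
import codecs

# Hypothetical chunks as they might arrive; the two bytes of "é" are split.
chunks = [b"caf\xc3", b"\xa9 au lait"]

decoder = codecs.getincrementaldecoder("utf-8")()
parts = []
for chunk in chunks:
    # Incomplete trailing bytes are buffered until the next chunk arrives.
    parts.append(decoder.decode(chunk))
# final=True raises if the stream ended in the middle of a character.
parts.append(decoder.decode(b"", final=True))

text = "".join(parts)
print(text)  # café au lait
```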
14.22 String Concatenation
Strings are immutable, so concatenation creates a new string.
s = "a"
s = s + "b"
s = s + "c"
This can allocate repeatedly.
For many parts, prefer join:
parts = ["a", "b", "c"]
s = "".join(parts)
For streaming text, use io.StringIO:
from io import StringIO
buf = StringIO()
buf.write("a")
buf.write("b")
buf.write("c")
s = buf.getvalue()
CPython has optimizations for some local concatenation cases, but code should not depend on them for general performance.
14.23 Bytes Building
Bytes are immutable too.
Repeated bytes concatenation has the same allocation problem.
data = b""
for chunk in chunks:
    data += chunk
Better:
data = b"".join(chunks)
For mutable incremental construction:
buf = bytearray()
for chunk in chunks:
    buf.extend(chunk)
data = bytes(buf)
For file-like binary building:
from io import BytesIO
buf = BytesIO()
buf.write(b"abc")
buf.write(b"def")
data = buf.getvalue()
14.24 Unicode Normalization
Different Unicode sequences can look the same.
Example:
a = "é" # single code point U+00E9
b = "e\u0301" # e plus combining acute accent
print(a == b) # False
They may render similarly, but they are different sequences of code points.
Normalize when comparing human text:
import unicodedata
a = unicodedata.normalize("NFC", a)
b = unicodedata.normalize("NFC", b)
print(a == b) # True
Common normalization forms:
| Form | Meaning |
|---|---|
| NFC | Canonical composition |
| NFD | Canonical decomposition |
| NFKC | Compatibility composition |
| NFKD | Compatibility decomposition |
Normalization is a Unicode-level concern, not a CPython object layout concern. But it matters for correct text processing.
14.25 Case Folding
Case-insensitive comparison should often use casefold, not lower.
a = "Straße"
b = "strasse"
print(a.lower() == b.lower()) # False: lower() keeps "ß"
print(a.casefold() == b.casefold()) # True
casefold is designed for caseless matching.
For identifiers, filenames, usernames, and protocol fields, define exact normalization and case rules. Do not guess.
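When both normalization and caseless matching are needed, they have to be combined deliberately. The helper below is a hypothetical sketch, not a standard function: it decomposes, casefolds, and decomposes again, which is one common recipe for caseless comparison. An application may need a different form such as NFKC.

```python
import unicodedata

def caseless_key(s: str) -> str:
    """Hypothetical helper: NFD-normalize, casefold, then NFD-normalize again."""
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())

print(caseless_key("Stra\u00dfe") == caseless_key("STRASSE"))  # True
print(caseless_key("\u00c9") == caseless_key("e\u0301"))       # True
print(caseless_key("cafe") == caseless_key("caf\u00e9"))       # False
```

The last comparison stays False because this key does not strip accents. It only removes case and composition differences.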
14.26 Grapheme Clusters
A user-visible character may contain multiple code points.
Example:
s = "👨\u200d👩\u200d👧\u200d👦"
print(len(s)) # 7
This prints 7 rather than 1 because the visible family emoji is a sequence of four emoji code points joined by three zero-width joiners (U+200D).
Python str indexes code points, not grapheme clusters.
s[0] may return only part of a visible character.
For UI text editing, cursor movement, truncation, and display width, code point indexing may be insufficient. You need Unicode-aware grapheme segmentation.
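The standard library does not provide grapheme segmentation, but it can show what a cluster is made of. This sketch spells the family emoji with explicit escapes so the joiners are visible in the source, then lists each code point by name.

```python
import unicodedata

# Family emoji written with explicit escapes so the joiners are visible.
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"

print(len(family))  # 7 code points for one visible character
for ch in family:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Truncating by code point can cut the cluster apart.
print(family[:1] == "\U0001F468")  # True: only the first person remains
```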
14.27 Encoding Error Strategies
Encoding can fail if a character cannot be represented.
"é".encode("ascii") # UnicodeEncodeError
Error strategies control behavior:
"é".encode("ascii", errors="ignore")
"é".encode("ascii", errors="replace")
"é".encode("ascii", errors="backslashreplace")
For file and process boundaries, choose error handling deliberately.
A common Unix-facing strategy is surrogateescape, which allows undecodable bytes to round-trip through str without data loss in some filesystem and environment contexts.
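For comparison, here is what several encode-side handlers produce for the same input. The sample text is arbitrary, and xmlcharrefreplace is included as an additional built-in handler that applies only when encoding.

```python
text = "caf\u00e9"  # "café"

for handler in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace"):
    print(f"{handler:17} -> {text.encode('ascii', errors=handler)!r}")
```

The handlers yield b'caf', b'caf?', b'caf\\xe9', and b'caf&#233;' respectively. Only the last two keep enough information to recover the original character.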
14.28 Filesystem Encoding
File paths cross a text and bytes boundary.
Python usually exposes paths as str, but operating systems often operate on encoded byte sequences or platform-native string formats.
CPython has filesystem encoding logic to convert between Python strings and platform path representations.
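This conversion logic is reachable from Python through os.fsencode and os.fsdecode. The sketch below prints the active settings, which vary by platform and configuration, and round-trips a simple file name.

```python
import os
import sys

print(sys.getfilesystemencoding())      # for example 'utf-8'
print(sys.getfilesystemencodeerrors())  # for example 'surrogateescape' on POSIX

name = "notes.txt"
as_bytes = os.fsencode(name)     # str -> bytes using the filesystem encoding
as_text = os.fsdecode(as_bytes)  # bytes -> str using the same rules

print(as_bytes)         # b'notes.txt'
print(as_text == name)  # True
```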
Practical rule:
use pathlib and str paths for normal application code
use bytes paths only when you need exact low-level byte behavior
Example:
from pathlib import Path
p = Path("data") / "notes.txt"
text = p.read_text(encoding="utf-8")
14.29 C API View of Unicode
C extension code often needs to accept or produce strings.
Common operations include:
create Unicode from UTF-8
convert Unicode to UTF-8
inspect length
read code points
parse arguments as str
Conceptual examples:
PyObject *s = PyUnicode_FromString("hello");
and:
const char *p = PyUnicode_AsUTF8(obj);
The UTF-8 pointer returned by some APIs may point into a cache owned by the string object. Its lifetime depends on the owning object remaining alive.
Extension code must not free that pointer.
14.30 C API View of Bytes
Bytes expose contiguous binary memory.
Conceptual examples:
PyObject *b = PyBytes_FromStringAndSize(data, len);
Access:
char *p = PyBytes_AS_STRING(b);
Py_ssize_t n = PyBytes_GET_SIZE(b);
Fast macros assume the object is really a bytes object.
Safer checked APIs should be used at public boundaries.
Bytes may contain NUL bytes, so always use explicit length.
14.31 Choosing the Right Type
| Need | Type |
|---|---|
| Human text | str |
| Encoded text | bytes |
| Binary protocol data | bytes |
| Mutable binary buffer | bytearray |
| Zero-copy view | memoryview |
| Many text fragments | list[str] plus "".join |
| Many binary fragments | list[bytes] plus b"".join |
| Incremental text writer | io.StringIO |
| Incremental binary writer | io.BytesIO or bytearray |
The type should reflect the data model. Avoid using str for arbitrary bytes. Avoid using bytes for decoded text.
14.32 Mental Model
Use this model:
str
    immutable Unicode code point sequence
    optimized internal storage
    cached hash
    explicit encode/decode boundary
bytes
    immutable byte sequence
    compact binary storage
    hashable
    no text semantics unless decoded
bytearray
    mutable byte sequence
    compact binary storage
    useful for incremental construction
memoryview
    zero-copy view over buffer-exporting object
    lifetime tied to exporter
Text processing bugs often come from confusing code points, bytes, encodings, grapheme clusters, normalization, and display width. CPython’s object model keeps text and bytes separate so these decisions remain explicit.
14.33 Summary
str is CPython’s Unicode text object. It is immutable, hashable, and internally optimized through flexible storage, cached hashes, and interning. It represents decoded text, not encoded bytes.
bytes is immutable binary data. bytearray is mutable binary data. memoryview provides zero-copy access to buffer-exporting objects.
The boundary between text and binary data is explicit: encode str to produce bytes, and decode bytes to produce str. This boundary is essential for correct file I/O, network protocols, source decoding, extension code, and Unicode-aware applications.