14. Strings, Bytes, and Unicode

PyUnicodeObject internal encodings, the interning table, bytes vs. bytearray, and the codec infrastructure.

Text and binary data are separate object families in Python. str represents Unicode text. bytes represents immutable binary data. bytearray represents mutable binary data.

This separation is one of Python 3’s most important runtime design choices. Text has characters and encodings. Binary data has bytes. CPython implements these concepts with different object layouts, APIs, and invariants.

14.1 Text vs Binary Data

A string is text:

s = "hello"

A bytes object is binary data:

b = b"hello"

They may look similar for ASCII content, but they are different types.

print(type("hello"))    # <class 'str'>
print(type(b"hello"))   # <class 'bytes'>

Python does not implicitly mix them:

"hello" + b"world"      # TypeError

This is deliberate. Combining text and bytes requires an encoding decision.

text = "hello"
data = text.encode("utf-8")

again = data.decode("utf-8")

The conversion boundary is explicit.

14.2 str Represents Unicode Text

A Python str is a sequence of Unicode code points.

s = "Python 🐍"
print(len(s))

len(s) counts code points, not encoded bytes.

s = "é"

print(len(s))                  # 1
print(len(s.encode("utf-8")))  # 2

The string contains one Unicode code point. Its UTF-8 encoding uses two bytes.

This distinction appears throughout CPython internals:

str
    logical Unicode text

bytes
    encoded or arbitrary binary data

14.3 Unicode Code Points

A Unicode code point is an integer value assigned by the Unicode standard.

Examples:

| Character | Code point |
| --- | --- |
| A | U+0041 |
| é | U+00E9 |
| 中 | U+4E2D |
| 🐍 | U+1F40D |

Python exposes this through ord and chr.

print(ord("A"))        # 65
print(hex(ord("🐍")))  # 0x1f40d

print(chr(0x1f40d))    # 🐍

CPython does not store every string as UTF-8 bytes internally. It chooses an internal layout based on the string's contents.

14.4 Flexible String Representation

CPython uses a flexible internal representation for Unicode strings.

The key idea is simple:

Use the smallest fixed-width storage that can represent every code point in the string.

Conceptually:

| Maximum code point in string | Storage width |
| --- | --- |
| ASCII only | 1 byte per character |
| Up to U+00FF | 1 byte per character |
| Up to U+FFFF | 2 bytes per character |
| Above U+FFFF | 4 bytes per character |

Example:

"abc"       # compact ASCII path
"café"      # é is U+00E9, so 1-byte storage is enough
"中"        # needs 2-byte storage
"🐍"        # needs 4-byte storage

This design avoids wasting four bytes per character for common ASCII-heavy text while still supporting all Unicode code points.
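
The storage widths in the table can be observed indirectly from Python. The following is a rough sketch that assumes CPython, where `sys.getsizeof` reports the object's allocated size; `bytes_per_char` is a hypothetical helper written for this illustration, not a CPython API.

```python
import sys

def bytes_per_char(ch: str) -> int:
    """Estimate storage width by comparing two strings made of repeated `ch`."""
    short = sys.getsizeof(ch * 100)
    longer = sys.getsizeof(ch * 200)
    # The fixed header cancels out, leaving 100 characters of payload.
    return (longer - short) // 100

widths = {ch: bytes_per_char(ch) for ch in ("a", "é", "中", "🐍")}
print(widths)   # on CPython: {'a': 1, 'é': 1, '中': 2, '🐍': 4}
```

The measurement depends on implementation details, so treat it as a diagnostic, not as behavior to rely on.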

14.5 String Object Metadata

A CPython string stores more than character data.

Conceptual fields include:

object header
length
hash cache
state flags
kind
compact flag
ASCII flag
ready flag (legacy; removed in CPython 3.12)
character data
optional UTF-8 cache

The exact C layout is version-specific, but the invariants matter more than the field names.

Important metadata:

| Metadata | Purpose |
| --- | --- |
| Length | Number of Unicode code points |
| Hash cache | Stores hash after first computation |
| Kind | Storage width |
| ASCII flag | Fast path for ASCII strings |
| Compact flag | Whether data is stored close to object header |
| UTF-8 cache | Cached encoded form for C API use |

Strings are immutable, so cached metadata is safe. Once computed, the hash remains valid.

14.6 String Immutability

Python strings are immutable.

s = "hello"
s[0] = "H"       # TypeError

Operations create new strings:

s = "hello"
t = "H" + s[1:]

print(s)    # hello
print(t)    # Hello

Immutability gives CPython several advantages:

hash values can be cached
strings can be safely interned
strings can be shared across dictionaries
strings can be used as dict keys
substring operations cannot mutate originals

The cost is that repeated string concatenation can allocate many intermediate strings if written poorly.

14.7 String Hashing

Strings are hashable.

hash("name")

A string’s hash is computed from its contents. Because strings are immutable, CPython can cache the result inside the string object.

This matters because strings are used heavily as dictionary keys:

module globals
object attribute names
class dictionaries
keyword argument names
JSON-like data
configuration maps
protocol field names

Without cached string hashes, attribute lookup and dictionary lookup would be more expensive.

14.8 String Interning

Interning means reusing one string object for equal string values in selected cases.

a = "identifier"
b = "identifier"

print(a is b)   # may be True

CPython interns many identifier-like strings because they are common in attribute lookup and namespaces.

Interning is useful for:

attribute names
variable names
keyword names
module names
common internal strings

With interned strings, internal paths such as dictionary lookup can often confirm a match with a single pointer comparison and skip the character-by-character comparison.

But user code should not depend on arbitrary string identity.

Correct:

if a == b:
    ...

Incorrect:

if a is b:
    ...

Use is only for identity semantics, especially documented singletons such as None.
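
When identity between equal strings is genuinely needed, `sys.intern` requests it explicitly. The sketch below builds two equal strings at run time; the variable names are illustrative, and the result of the plain `is` check is a CPython implementation detail.

```python
import sys

# Build two equal strings at run time so the compiler cannot merge them.
parts = ["user", "_", "name"]
a = "".join(parts)
b = "".join(parts)

print(a == b)    # True: same contents
print(a is b)    # typically False in CPython: two separate objects

# sys.intern returns the canonical object for a given string value.
ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)  # True
```

Interning is mainly an optimization for strings that are used repeatedly as dictionary keys. Ordinary comparisons should still use ==.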

14.9 Encoding

Encoding converts text to bytes.

s = "café"
b = s.encode("utf-8")

print(b)        # b'caf\xc3\xa9'

The string is Unicode text. UTF-8 is one external byte representation.

Common encodings:

| Encoding | Use |
| --- | --- |
| UTF-8 | Web, files, APIs, Unix systems |
| UTF-16 | Some platform APIs and file formats |
| UTF-32 | Fixed-width Unicode storage |
| Latin-1 | Legacy Western byte mapping |
| ASCII | 7-bit English subset |

CPython does not treat encoding as a property of a str. A string is decoded text. Encoding is used when crossing a byte boundary.
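
To make the point concrete, the same str can be encoded in several ways, and each produces a different byte sequence. This is a small sketch; the `sizes` dictionary is only a convenience for the illustration, and the explicit little-endian codec names are chosen so that no byte order mark is added.

```python
text = "café"
sizes = {}

for encoding in ("ascii", "latin-1", "utf-8", "utf-16-le", "utf-32-le"):
    try:
        data = text.encode(encoding)
    except UnicodeEncodeError:
        # ASCII has no representation for "é".
        sizes[encoding] = None
        print(f"{encoding:10} cannot represent {text!r}")
    else:
        sizes[encoding] = len(data)
        print(f"{encoding:10} {len(data):2} bytes  {data!r}")
```

The text has four code points in every case. Only the byte representation changes.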

14.10 Decoding

Decoding converts bytes to text.

b = b"caf\xc3\xa9"
s = b.decode("utf-8")

print(s)        # café

If the bytes are invalid for the selected encoding, decoding fails unless an error handler is used.

b = b"\xff"

b.decode("utf-8")                     # UnicodeDecodeError
b.decode("utf-8", errors="ignore")    # ''
b.decode("utf-8", errors="replace")   # '\ufffd' (U+FFFD REPLACEMENT CHARACTER)

Error handlers include:

strict
ignore
replace
backslashreplace
surrogateescape
surrogatepass

Encoding and decoding errors arise at the text boundary, not in ordinary string operations.
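
The handlers differ in what they do with the offending byte. A short comparison, using an illustrative input whose last byte can never appear in valid UTF-8:

```python
raw = b"caf\xff"   # 0xFF is never valid in UTF-8

results = {
    handler: raw.decode("utf-8", errors=handler)
    for handler in ("ignore", "replace", "backslashreplace")
}

for handler, text in results.items():
    # ascii() keeps the output printable on any terminal.
    print(f"{handler:16} {ascii(text)}")
# ignore           'caf'
# replace          'caf\ufffd'
# backslashreplace 'caf\\xff'
```

ignore silently loses data, replace marks the loss, and backslashreplace keeps the byte value visible as text.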

14.11 UTF-8 Boundary

Most modern external text is UTF-8.

Examples:

source files
JSON
HTTP payloads
HTML
Markdown
logs
configuration files
database text protocols

A practical rule:

inside Python
    use str

at file, network, process, and binary protocol boundaries
    encode or decode explicitly

Example:

from pathlib import Path

text = Path("notes.txt").read_text(encoding="utf-8")
data = text.encode("utf-8")

This keeps the boundary clear.

14.12 Source Code Encoding

Python source files are decoded before parsing.

The default source encoding is UTF-8 unless an encoding declaration says otherwise.

Example:

# -*- coding: utf-8 -*-

name = "café"

The tokenizer works on decoded source text. String literals then become Python str objects unless they are bytes literals.

s = "abc"       # str
b = b"abc"      # bytes

This distinction is made during parsing and compilation.
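
The effect of an encoding declaration can be observed by compiling source given as bytes. This is a sketch under the assumption that `compile` applies the declared encoding to bytes input, as the language reference describes; the file name `<demo>` and the `namespace` dictionary are placeholders.

```python
# Source bytes in Latin-1: the byte 0xE9 means "é" only under that encoding.
source = b'# -*- coding: latin-1 -*-\nname = "caf\xe9"\n'

namespace = {}
exec(compile(source, "<demo>", "exec"), namespace)

print(ascii(namespace["name"]))   # 'caf\xe9'
print(type(namespace["name"]))    # <class 'str'>
```

The resulting object is an ordinary str. The source encoding affects only how the file's bytes were decoded before parsing.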

14.13 String Literals

Python has several string literal forms.

"hello"
'hello'
"""hello"""
'''hello'''
r"\n"
f"value = {x}"

Literal prefixes affect parsing:

| Prefix | Meaning |
| --- | --- |
| r | Raw string literal |
| f | Formatted string literal |
| b | Bytes literal |
| u | Historical compatibility prefix |
| combinations | fr, rf, br, rb |

The r prefix changes how the parser treats backslashes: they are kept as literal characters instead of starting escape sequences.

s = r"\n"
print(s)        # \n

It still creates a normal str.

14.14 Bytes Objects

bytes represents immutable binary data.

b = b"hello"

A bytes object is a sequence of integers in the range 0 to 255.

b = b"ABC"

print(b[0])     # 65
print(b[1])     # 66

Slicing returns another bytes object:

print(b[1:])    # b'BC'

Because bytes is immutable, it is hashable.

d = {b"key": "value"}

Bytes are useful for:

network protocols
binary files
cryptographic data
compressed data
encoded text
database wire formats
image and audio formats

14.15 Bytes Object Layout

A bytes object is variable-sized.

Conceptually:

PyBytesObject
    PyVarObject header
        ob_size = number of bytes
    hash cache
    byte data
    trailing NUL byte for C compatibility

The trailing NUL byte helps when passing data to C APIs that expect C strings, but bytes may contain embedded NUL bytes:

b = b"a\0b"
print(len(b))   # 3

So bytes must not be treated as ordinary NUL-terminated strings unless the API specifically permits it and the content is known safe.

14.16 Bytearray Objects

bytearray is mutable binary data.

buf = bytearray(b"hello")
buf[0] = ord("H")

print(buf)      # bytearray(b'Hello')

A bytearray supports in-place modification:

buf.append(33)
buf.extend(b" world")

It is not hashable because its contents can change.

hash(bytearray(b"abc"))     # TypeError

Conceptually, bytearray is closer to a mutable list of bytes, but implemented as a compact byte buffer rather than an array of Python integer object references.

14.17 Bytes vs List of Ints

Compare:

b = b"abc"
xs = [97, 98, 99]

The bytes object stores raw bytes compactly.

The list stores references to Python integer objects.

Conceptually:

bytes
    [97][98][99]

list
    [ptr][ptr][ptr]
      |    |    |
      v    v    v
     int  int  int

For binary data, bytes and bytearray are far more memory efficient.
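
A rough measurement shows the gap. The sketch assumes CPython, where `sys.getsizeof` reports only the container itself; for the list, that means the pointer array and not the integer objects it refers to.

```python
import sys

payload = bytes(range(256)) * 4        # 1024 bytes of binary data
as_list = list(payload)                # the same values as Python ints

bytes_size = sys.getsizeof(payload)
list_size = sys.getsizeof(as_list)     # pointer array only, ints excluded

print(bytes_size, list_size)
print(list_size / bytes_size)          # several times larger on 64-bit builds
```

The exact numbers vary by version and platform, so none are quoted here.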

14.18 Memoryview

memoryview exposes the buffer of another object.

buf = bytearray(b"hello")
view = memoryview(buf)

view[0] = ord("H")

print(buf)      # bytearray(b'Hello')

A memoryview avoids copying.

It is useful when slicing or passing binary data between APIs:

data = bytearray(b"abcdef")
view = memoryview(data)[2:5]

print(view.tobytes())     # b'cde'

The view keeps the exporter alive and enforces buffer lifetime rules.
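
Both properties, zero-copy access and lifetime enforcement, can be seen in a few lines. This sketch assumes CPython's reference-counting behavior; the `resized` flag exists only to record the outcome.

```python
data = bytearray(b"abcdef")
view = memoryview(data)[2:5]

# Writing through the slice changes the original buffer: no copy was made.
view[:] = b"XYZ"
print(data)                 # bytearray(b'abXYZf')

# While a view is exported, the bytearray refuses to change size.
try:
    data.append(0x21)
    resized = True
except BufferError:
    resized = False
print(resized)              # False

# Releasing the view lifts the restriction.
view.release()
data.append(0x21)
print(data)                 # bytearray(b'abXYZf!')
```

The BufferError protects the view from pointing at memory that a resize could have moved.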

14.19 Buffer Protocol

The buffer protocol is the C-level protocol behind memoryview.

It lets objects expose raw memory to other objects.

Common buffer exporters:

bytes
bytearray
array.array
memoryview
mmap
third-party arrays such as NumPy arrays
custom extension objects

The protocol describes:

pointer to memory
length
item size
format
number of dimensions
shape
strides
readonly flag
lifetime callbacks

This makes zero-copy binary processing possible.
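
Several of these fields are visible from Python as memoryview attributes. A small sketch using `array.array` as the exporter; the `samples` name is illustrative.

```python
from array import array

samples = array("d", [1.0, 2.0, 3.0])   # C doubles in one contiguous block
view = memoryview(samples)

print(view.format)     # 'd'
print(view.itemsize)   # 8
print(view.ndim)       # 1
print(view.shape)      # (3,)
print(view.strides)    # (8,)
print(view.readonly)   # False
print(view.nbytes)     # 24

# bytes exports a read-only buffer.
print(memoryview(b"abc").readonly)   # True
```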

14.20 Text I/O vs Binary I/O

Files can be opened in text mode or binary mode.

Text mode decodes bytes into strings:

with open("notes.txt", "r", encoding="utf-8") as f:
    text = f.read()

Binary mode returns bytes:

with open("notes.txt", "rb") as f:
    data = f.read()

Text mode handles encoding, decoding, and newline translation.

Binary mode gives raw bytes.

Choose based on the data model:

human-readable text
    text mode, str

protocol bytes or exact file bytes
    binary mode, bytes

14.21 Common Boundary Bug

A common bug is mixing text and bytes at boundaries.

Incorrect:

def send(sock, message):
    sock.sendall(message)       # fails if message is str

Correct:

def send(sock, message):
    data = message.encode("utf-8")
    sock.sendall(data)

For receiving:

data = sock.recv(4096)
text = data.decode("utf-8")

Keep the conversion explicit so the encoding is visible.
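
One caveat when receiving: a stream read can end in the middle of a multi-byte character, in which case decoding that chunk alone raises an error. A possible approach is an incremental decoder from the codecs module. In this sketch, `decode_chunks` is a hypothetical helper and the chunk list stands in for successive `recv` results.

```python
import codecs

def decode_chunks(chunks, encoding="utf-8"):
    """Yield text from byte chunks, tolerating splits inside a character."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for chunk in chunks:
        text = decoder.decode(chunk)   # incomplete trailing bytes are buffered
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the stream ended mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail

# "é" is 0xC3 0xA9 in UTF-8; here it is split across two chunks.
received = [b"caf\xc3", b"\xa9 au lait"]
message = "".join(decode_chunks(received))
print(ascii(message))   # 'caf\xe9 au lait'
```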

14.22 String Concatenation

Strings are immutable, so concatenation creates a new string.

s = "a"
s = s + "b"
s = s + "c"

This can allocate repeatedly.

For many parts, prefer join:

parts = ["a", "b", "c"]
s = "".join(parts)

For streaming text, use io.StringIO:

from io import StringIO

buf = StringIO()
buf.write("a")
buf.write("b")
buf.write("c")

s = buf.getvalue()

CPython has optimizations for some local concatenation cases, but code should not depend on them for general performance.

14.23 Bytes Building

Bytes are immutable too.

Repeated bytes concatenation has the same allocation problem.

data = b""
for chunk in chunks:
    data += chunk

Better:

data = b"".join(chunks)

For mutable incremental construction:

buf = bytearray()
for chunk in chunks:
    buf.extend(chunk)

data = bytes(buf)

For file-like binary building:

from io import BytesIO

buf = BytesIO()
buf.write(b"abc")
buf.write(b"def")

data = buf.getvalue()

14.24 Unicode Normalization

Different Unicode sequences can look the same.

Example:

a = "é"          # single code point U+00E9
b = "e\u0301"    # e plus combining acute accent

print(a == b)    # False

They may render similarly, but they are different sequences of code points.

Normalize when comparing human text:

import unicodedata

a = unicodedata.normalize("NFC", a)
b = unicodedata.normalize("NFC", b)

print(a == b)    # True

Common normalization forms:

FormMeaning
NFCCanonical composition
NFDCanonical decomposition
NFKCCompatibility composition
NFKDCompatibility decomposition

Normalization is a Unicode-level concern, not a CPython object layout concern. But it matters for correct text processing.

14.25 Case Folding

Case-insensitive comparison should often use casefold, not lower.

a = "Straße"
b = "strasse"

print(a.lower() == b.lower())       # False
print(a.casefold() == b.casefold()) # True

casefold is designed for caseless matching.

For identifiers, filenames, usernames, and protocol fields, define exact normalization and case rules. Do not guess.
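
As one example of writing such a rule down, the sketch below combines normalization with case folding. `caseless_key` is a hypothetical helper, and NFC plus casefold is only one possible policy; a real system should choose its rule deliberately.

```python
import unicodedata

def caseless_key(text: str) -> str:
    """Return a comparison key: normalize, case-fold, then normalize again."""
    folded = unicodedata.normalize("NFC", text).casefold()
    # Folding can produce sequences that are no longer normalized.
    return unicodedata.normalize("NFC", folded)

print(caseless_key("Straße") == caseless_key("STRASSE"))        # True
print(caseless_key("CAFE\u0301") == caseless_key("caf\u00e9"))  # True
```

Store the original text for display and use the key only for comparison or lookup.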

14.26 Grapheme Clusters

A user-visible character may contain multiple code points.

Example:

s = "👨‍👩‍👧‍👦"
print(len(s))

For the standard family sequence this prints 7: the visible emoji is four emoji code points joined by three zero-width joiners.

Python str indexes code points, not grapheme clusters.

s[0]

may return only part of a visible character.

For UI text editing, cursor movement, truncation, and display width, code point indexing may be insufficient. You need Unicode-aware grapheme segmentation.
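
The code points inside such a sequence can be listed with the standard library. The sketch writes the family emoji with escapes so the sequence is unambiguous regardless of how the source text was copied; it inspects code points only and does not perform grapheme segmentation.

```python
import unicodedata

# Man, woman, girl, boy, joined by U+200D ZERO WIDTH JOINER.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

print(len(family))   # 7 code points, one visible character

for cp in family:
    print(f"U+{ord(cp):04X}  {unicodedata.name(cp)}")

joiners = family.count("\u200D")
print(joiners)       # 3
```

The standard library has no grapheme cluster iterator, so segmentation requires a third-party Unicode library.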

14.27 Encoding Error Strategies

Encoding can fail if a character cannot be represented.

"é".encode("ascii")       # UnicodeEncodeError

Error strategies control behavior:

"é".encode("ascii", errors="ignore")            # b''
"é".encode("ascii", errors="replace")           # b'?'
"é".encode("ascii", errors="backslashreplace")  # b'\\xe9'

For file and process boundaries, choose error handling deliberately.

A common Unix-facing strategy is surrogateescape, which allows undecodable bytes to round-trip through str without data loss in some filesystem and environment contexts.
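
The round trip can be sketched as follows. The file name is invented for the illustration; the important detail is that the same handler is used in both directions.

```python
raw = b"report-\xff.txt"    # not valid UTF-8

# The undecodable byte 0xFF becomes the lone surrogate U+DCFF.
name = raw.decode("utf-8", errors="surrogateescape")
print(ascii(name))          # 'report-\udcff.txt'

# Encoding with the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")
print(restored == raw)      # True
```

A string carrying escaped surrogates cannot be encoded with the strict handler, so it should be kept away from ordinary text output.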

14.28 Filesystem Encoding

File paths cross a text and bytes boundary.

Python usually exposes paths as str, but operating systems often operate on encoded byte sequences or platform-native string formats.

CPython has filesystem encoding logic to convert between Python strings and platform path representations.

Practical rule:

use pathlib and str paths for normal application code
use bytes paths only when you need exact low-level byte behavior

Example:

from pathlib import Path

p = Path("data") / "notes.txt"
text = p.read_text(encoding="utf-8")
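
When the byte form of a path is needed, `os.fsencode` and `os.fsdecode` apply the filesystem encoding and its error handler. A brief sketch; the reported encoding depends on platform and configuration.

```python
import os
import sys

print(sys.getfilesystemencoding())        # commonly 'utf-8'

as_bytes = os.fsencode("data/notes.txt")  # str -> bytes, filesystem rules
as_text = os.fsdecode(as_bytes)           # bytes -> str, same rules

print(as_bytes)                           # b'data/notes.txt'
print(as_text == "data/notes.txt")        # True
```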

14.29 C API View of Unicode

C extension code often needs to accept or produce strings.

Common operations include:

create Unicode from UTF-8
convert Unicode to UTF-8
inspect length
read code points
parse arguments as str

Conceptual examples:

PyObject *s = PyUnicode_FromString("hello");

and:

const char *p = PyUnicode_AsUTF8(obj);

The pointer returned by PyUnicode_AsUTF8 points into a UTF-8 cache owned by the string object. It stays valid only while that object remains alive.

Extension code must not free that pointer.

14.30 C API View of Bytes

Bytes expose contiguous binary memory.

Conceptual examples:

PyObject *b = PyBytes_FromStringAndSize(data, len);

Access:

char *p = PyBytes_AS_STRING(b);
Py_ssize_t n = PyBytes_GET_SIZE(b);

Fast macros assume the object is really a bytes object.

Safer checked APIs should be used at public boundaries.

Bytes may contain NUL bytes, so always use explicit length.

14.31 Choosing the Right Type

| Need | Type |
| --- | --- |
| Human text | str |
| Encoded text | bytes |
| Binary protocol data | bytes |
| Mutable binary buffer | bytearray |
| Zero-copy view | memoryview |
| Many text fragments | list[str] plus "".join |
| Many binary fragments | list[bytes] plus b"".join |
| Incremental text writer | io.StringIO |
| Incremental binary writer | io.BytesIO or bytearray |

The type should reflect the data model. Avoid using str for arbitrary bytes. Avoid using bytes for decoded text.

14.32 Mental Model

Use this model:

str
    immutable Unicode code point sequence
    optimized internal storage
    cached hash
    explicit encode/decode boundary

bytes
    immutable byte sequence
    compact binary storage
    hashable
    no text semantics unless decoded

bytearray
    mutable byte sequence
    compact binary storage
    useful for incremental construction

memoryview
    zero-copy view over buffer-exporting object
    lifetime tied to exporter

Text processing bugs often come from confusing code points, bytes, encodings, grapheme clusters, normalization, and display width. CPython’s object model keeps text and bytes separate so these decisions remain explicit.

14.33 Summary

str is CPython’s Unicode text object. It is immutable, hashable, and internally optimized through flexible storage, cached hashes, and interning. It represents decoded text, not encoded bytes.

bytes is immutable binary data. bytearray is mutable binary data. memoryview provides zero-copy access to buffer-exporting objects.

The boundary between text and binary data is explicit: encode str to produce bytes, and decode bytes to produce str. This boundary is essential for correct file I/O, network protocols, source decoding, extension code, and Unicode-aware applications.