14. Strings, Bytes, and Unicode

Text and binary data are separate object families in Python. str represents Unicode text. bytes represents immutable binary data. bytearray represents mutable binary data.

This separation is one of Python 3’s most important runtime design choices. Text has characters and encodings. Binary data has bytes. CPython implements these concepts with different object layouts, APIs, and invariants.

14.1 Text vs Binary Data

A string is text:

s = "hello"

A bytes object is binary data:

b = b"hello"

They may look similar for ASCII content, but they are different types.

print(type("hello"))    # <class 'str'>
print(type(b"hello"))   # <class 'bytes'>

Python does not implicitly mix them:

"hello" + b"world"      # TypeError

This is deliberate. Combining text and bytes requires an encoding decision.

text = "hello"
data = text.encode("utf-8")

again = data.decode("utf-8")

The conversion boundary is explicit.

14.2 `str` Represents Unicode Text

A Python str is a sequence of Unicode code points.

s = "Python 🐍"
print(len(s))

len(s) counts code points, not encoded bytes.

s = "é"

print(len(s))                  # 1
print(len(s.encode("utf-8")))  # 2

The string contains one Unicode code point. Its UTF-8 encoding uses two bytes.

This distinction appears throughout CPython internals:

str
    logical Unicode text

bytes
    encoded or arbitrary binary data
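As a rough illustration of the gap between code points and encoded bytes, the sketch below encodes a few one-code-point strings with three encodings. The helper name encoded_sizes is made up for this example and is not part of any library.

```python
def encoded_sizes(ch: str) -> dict[str, int]:
    """Return the encoded byte length of ch under several encodings."""
    # The -le variants are used so that no byte order mark is prepended.
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}


# One code point each: A, e-acute, a CJK ideograph, and the snake emoji.
for ch in ("A", "\u00e9", "\u4e2d", "\U0001F40D"):
    print(f"U+{ord(ch):04X}", len(ch), encoded_sizes(ch))
```

Every string here has len 1, yet the byte counts differ by encoding, which is why a length is only meaningful once you know whether it counts code points or bytes.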

14.3 Unicode Code Points

A Unicode code point is an integer value assigned by the Unicode standard.

Examples:

Character	Code point
`A`	`U+0041`
`é`	`U+00E9`
`中`	`U+4E2D`
`🐍`	`U+1F40D`

Python exposes this through ord and chr.

print(ord("A"))        # 65
print(hex(ord("🐍")))  # 0x1f40d

print(chr(0x1f40d))    # 🐍

CPython does not simply store every string as UTF-8 bytes. Instead, it chooses an internal layout based on the string’s contents.

14.4 Flexible String Representation

CPython uses a flexible internal representation for Unicode strings.

The key idea is simple:

Use the smallest fixed-width storage that can represent every code point in the string.

Conceptually:

Maximum code point in string	Storage width
ASCII only	1 byte per character
Up to `U+00FF`	1 byte per character
Up to `U+FFFF`	2 bytes per character
Above `U+FFFF`	4 bytes per character

Example:

"abc"       # compact ASCII path
"café"      # may fit in 1-byte storage
"中"        # needs 2-byte storage
"🐍"        # needs 4-byte storage

This design avoids wasting four bytes per character for common ASCII-heavy text while still supporting all Unicode code points.
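One way to observe the storage width from Python is to compare the sizes of two strings that differ only in length. This is a sketch that assumes CPython; sys.getsizeof reports implementation-specific numbers, and bytes_per_char is a hypothetical helper written for this illustration.

```python
import sys


def bytes_per_char(ch: str) -> int:
    """Estimate storage per code point for strings made only of ch."""
    # Subtracting the two sizes cancels the fixed object header,
    # leaving only the cost of the 1000 extra characters.
    short, long = ch * 100, ch * 1100
    return (sys.getsizeof(long) - sys.getsizeof(short)) // 1000


for ch in ("a", "\u00e9", "\u4e2d", "\U0001F40D"):
    print(f"U+{ord(ch):04X}", bytes_per_char(ch))
```

On CPython 3.3 and later the loop should report 1, 1, 2, and 4 bytes per character, matching the table above. Other implementations may store strings differently.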

14.5 String Object Metadata

A CPython string stores more than character data.

Conceptual fields include:

object header
length
hash cache
state flags
kind
compact flag
ASCII flag
ready flag (older CPython versions only)
character data
optional UTF-8 cache

The exact C layout is version-specific, but the invariants matter more than the field names.

Important metadata:

Metadata	Purpose
Length	Number of Unicode code points
Hash cache	Stores hash after first computation
Kind	Storage width
ASCII flag	Fast path for ASCII strings
Compact flag	Whether data is stored close to object header
UTF-8 cache	Cached encoded form for C API use

Strings are immutable, so cached metadata is safe. Once computed, the hash remains valid.

14.6 String Immutability

Python strings are immutable.

s = "hello"
s[0] = "H"       # TypeError

Operations create new strings:

s = "hello"
t = "H" + s[1:]

print(s)    # hello
print(t)    # Hello

Immutability gives CPython several advantages:

hash values can be cached
strings can be safely interned
strings can be shared across dictionaries
strings can be used as dict keys
substring operations cannot mutate originals

The cost is that repeated string concatenation can allocate many intermediate strings if written poorly.

14.7 String Hashing

Strings are hashable.

hash("name")

A string’s hash is computed from its contents. Because strings are immutable, CPython can cache the result inside the string object.

This matters because strings are used heavily as dictionary keys:

module globals
object attribute names
class dictionaries
keyword argument names
JSON-like data
configuration maps
protocol field names

Without cached string hashes, attribute lookup and dictionary lookup would be more expensive.
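The contract that makes this work can be sketched in a few lines: equal strings must produce equal hashes, no matter how each object was built. The variable names are illustrative only.

```python
# Two equal strings built in different ways.
a = "name"
b = "".join(["na", "me"])

print(a == b)                # True: same contents
print(hash(a) == hash(b))    # True: equal strings must hash equally

# Either object finds the same dictionary entry.
namespace = {a: 1}
print(namespace[b])          # 1
```

The actual hash values are not stable across runs: by default CPython randomizes string hashing per process, unless PYTHONHASHSEED is set. Code should rely on equality of hashes within one process, never on specific values.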

14.8 String Interning

Interning means reusing one string object for equal string values in selected cases.

a = "identifier"
b = "identifier"

print(a is b)   # may be True

CPython interns many identifier-like strings because they are common in attribute lookup and namespaces.

Interning is useful for:

attribute names
variable names
keyword names
module names
common internal strings

When both strings are interned, internal lookup paths can often settle equality with a single pointer comparison instead of comparing character data.

But user code should not depend on arbitrary string identity.

Correct:

if a == b:
    ...

Incorrect:

if a is b:
    ...

Use is only for identity semantics, especially documented singletons such as None.
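When identity sharing is actually wanted, for example to deduplicate many repeated keys, sys.intern requests it explicitly. A minimal sketch, with strings built at run time so that they start out as separate objects:

```python
import sys

# Build two equal strings at run time so they start as separate objects.
a = "".join(["hello", " ", "world"])
b = "".join(["hello", " ", "world"])

print(a == b)    # True: same contents
print(a is b)    # False in CPython: two distinct objects

# sys.intern returns the canonical object for that value.
a = sys.intern(a)
b = sys.intern(b)
print(a is b)    # True: both names now refer to the interned object
```

Even after explicit interning, == remains the right comparison in application code. Interning is a memory and speed optimization, not a change in semantics.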

14.9 Encoding

Encoding converts text to bytes.

s = "café"
b = s.encode("utf-8")

print(b)        # b'caf\xc3\xa9'

The string is Unicode text. UTF-8 is one external byte representation.

Common encodings:

Encoding	Use
UTF-8	Web, files, APIs, Unix systems
UTF-16	Some platform APIs and file formats
UTF-32	Fixed-width Unicode storage
Latin-1	Legacy Western byte mapping
ASCII	7-bit English subset

CPython does not treat encoding as a property of a str. A string is decoded text. Encoding is used when crossing a byte boundary.

14.10 Decoding

Decoding converts bytes to text.

b = b"caf\xc3\xa9"
s = b.decode("utf-8")

print(s)        # café

If the bytes are invalid for the selected encoding, decoding fails unless an error handler is used.

b = b"\xff"

b.decode("utf-8")                 # UnicodeDecodeError
b.decode("utf-8", errors="ignore")
b.decode("utf-8", errors="replace")

Error handlers include:

strict
ignore
replace
backslashreplace
surrogateescape
surrogatepass

Encoding and decoding errors arise at the text boundary, not in ordinary string operations.
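To make the handlers concrete, the sketch below decodes the same invalid input with several of them. The input bytes are invented for the example; ascii() is used only so that the result is unambiguous when printed.

```python
raw = b"caf\xff"   # 0xFF can never appear in valid UTF-8

for handler in ("ignore", "replace", "backslashreplace", "surrogateescape"):
    print(f"{handler:17}", ascii(raw.decode("utf-8", errors=handler)))

# surrogateescape carries the bad byte through str so it can be restored.
text = raw.decode("utf-8", errors="surrogateescape")
restored = text.encode("utf-8", errors="surrogateescape")
print(restored == raw)   # True: the original bytes round-trip
```

ignore silently drops data and replace substitutes U+FFFD, so both lose information. backslashreplace keeps the byte value visible, and surrogateescape is the only one of the four that can reproduce the original bytes.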

14.11 UTF-8 Boundary

Most modern external text is UTF-8.

Examples:

source files
JSON
HTTP payloads
HTML
Markdown
logs
configuration files
database text protocols

A practical rule:

inside Python
    use str

at file, network, process, and binary protocol boundaries
    encode or decode explicitly

Example:

from pathlib import Path

text = Path("notes.txt").read_text(encoding="utf-8")
data = text.encode("utf-8")

This keeps the boundary clear.

14.12 Source Code Encoding

Python source files are decoded before parsing.

The default source encoding is UTF-8 unless an encoding declaration says otherwise.

Example:

# -*- coding: utf-8 -*-

name = "café"

The tokenizer works on decoded source text. String literals then become Python str objects unless they are bytes literals.

s = "abc"       # str
b = b"abc"      # bytes

This distinction is made during parsing and compilation.

14.13 String Literals

Python has several string literal forms.

"hello"
'hello'
"""hello"""
'''hello'''
r"\n"
f"value = {x}"

Literal prefixes affect parsing:

Prefix	Meaning
`r`	Raw string literal
`f`	Formatted string literal
`b`	Bytes literal
`u`	Historical compatibility prefix
combinations	`fr`, `rf`, `br`, `rb`

In a raw string, the parser leaves backslashes in place instead of processing them as escape sequences.

s = r"\n"
print(s)        # \n

It still creates a normal str.

14.14 Bytes Objects

bytes represents immutable binary data.

b = b"hello"

A bytes object is a sequence of integers in the range 0 to 255.

b = b"ABC"

print(b[0])     # 65
print(b[1])     # 66

Slicing returns another bytes object:

print(b[1:])    # b'BC'

Because bytes is immutable, it is hashable.

d = {b"key": "value"}

Bytes are useful for:

network protocols
binary files
cryptographic data
compressed data
encoded text
database wire formats
image and audio formats

14.15 Bytes Object Layout

A bytes object is variable-sized.

Conceptually:

PyBytesObject
    PyVarObject header
        ob_size = number of bytes
    hash cache
    byte data
    trailing NUL byte for C compatibility

The trailing NUL byte helps when passing data to C APIs that expect C strings, but bytes may contain embedded NUL bytes:

b = b"a\0b"
print(len(b))   # 3

So bytes must not be treated as ordinary NUL-terminated strings unless the API specifically permits it and the content is known safe.

14.16 Bytearray Objects

bytearray is mutable binary data.

buf = bytearray(b"hello")
buf[0] = ord("H")

print(buf)      # bytearray(b'Hello')

A bytearray supports in-place modification:

buf.append(33)
buf.extend(b" world")

It is not hashable because its contents can change.

hash(bytearray(b"abc"))     # TypeError

Conceptually, bytearray is closer to a mutable list of bytes, but it is implemented as a compact byte buffer rather than an array of references to Python integer objects.

14.17 Bytes vs List of Ints

Compare:

b = b"abc"
xs = [97, 98, 99]

The bytes object stores raw bytes compactly.

The list stores references to Python integer objects.

Conceptually:

bytes
    [97][98][99]

list
    [ptr][ptr][ptr]
      |    |    |
      v    v    v
     int  int  int

For binary data, bytes and bytearray are far more memory efficient.
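The difference can be measured roughly with sys.getsizeof. This is a sketch; the exact numbers depend on the platform and CPython version, so none are shown.

```python
import sys

n = 1000
as_bytes = bytes(n)        # n zero bytes in one contiguous buffer
as_list = list(as_bytes)   # n references to int objects

print(sys.getsizeof(as_bytes))
print(sys.getsizeof(as_list))   # counts the pointer array only, not the ints
```

The list figure already exceeds the bytes figure several times over, and it still excludes the integer objects the pointers refer to.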

14.18 Memoryview

memoryview exposes the buffer of another object.

buf = bytearray(b"hello")
view = memoryview(buf)

view[0] = ord("H")

print(buf)      # bytearray(b'Hello')

A memoryview avoids copying.

It is useful when slicing or passing binary data between APIs:

data = bytearray(b"abcdef")
view = memoryview(data)[2:5]

print(view.tobytes())     # b'cde'

The view keeps the exporter alive and enforces buffer lifetime rules.
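One visible consequence of those lifetime rules is that a bytearray cannot be resized while a view onto it exists. A small sketch of that behavior:

```python
buf = bytearray(b"hello")
view = memoryview(buf)

# Writes to the exporter are visible through the view: no copy was made.
buf[0] = ord("J")
print(view.tobytes())        # b'Jello'

# While the view is alive, the bytearray refuses to change size.
try:
    buf.extend(b"!")
except BufferError as exc:
    print("resize blocked:", exc)

# Releasing the view lifts the restriction.
view.release()
buf.extend(b"!")
print(buf)                   # bytearray(b'Jello!')
```

Resizing could move the underlying memory and leave the view pointing at freed storage, which is why the exporter rejects it. Using memoryview as a context manager releases the view automatically at the end of the with block.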

14.19 Buffer Protocol

The buffer protocol is the C-level protocol behind memoryview.

It lets objects expose raw memory to other objects.

Common buffer exporters:

bytes
bytearray
array.array
memoryview
mmap
third-party arrays such as NumPy arrays
custom extension objects

The protocol describes:

pointer to memory
length
item size
format
number of dimensions
shape
strides
readonly flag
lifetime callbacks

This makes zero-copy binary processing possible.
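Several of these fields are visible from Python as memoryview attributes. The sketch below inspects a view over an array.array; the array contents are arbitrary.

```python
from array import array

values = array("H", [1, 2, 3])     # unsigned short items
view = memoryview(values)

print(view.format)      # struct-style type code of each item
print(view.itemsize)    # bytes per item
print(view.ndim, view.shape, view.strides)
print(view.readonly)    # False: array.array exports a writable buffer
print(view.nbytes)      # itemsize * number of items

print(memoryview(b"abc").readonly)   # True: bytes exports read-only memory
```

The raw pointer and the release callbacks stay at the C level; Python code only sees the descriptive metadata.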

14.20 Text I/O vs Binary I/O

Files can be opened in text mode or binary mode.

Text mode decodes bytes into strings:

with open("notes.txt", "r", encoding="utf-8") as f:
    text = f.read()

Binary mode returns bytes:

with open("notes.txt", "rb") as f:
    data = f.read()

Text mode handles encoding, decoding, and newline translation.

Binary mode gives raw bytes.

Choose based on the data model:

human-readable text
    text mode, str

protocol bytes or exact file bytes
    binary mode, bytes
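The two modes can be compared on the same file. This sketch writes a temporary file, then reads it back both ways; the file name and contents are made up, and newline="\n" is passed on write so the stored bytes are the same on every platform.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notes.txt"

    # Text mode: str goes in, encoded bytes land on disk.
    # newline="\n" disables platform newline translation for this demo.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("café\n")

    with open(path, "r", encoding="utf-8") as f:
        text = f.read()     # decoded str
    with open(path, "rb") as f:
        data = f.read()     # exact bytes on disk

print(repr(text), len(text))   # 5 code points
print(repr(data), len(data))   # 6 bytes: é takes two in UTF-8
```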

14.21 Common Boundary Bug

A common bug is mixing text and bytes at boundaries.

Incorrect:

def send(sock, message):
    sock.sendall(message)       # fails if message is str

Correct:

def send(sock, message):
    data = message.encode("utf-8")
    sock.sendall(data)

For receiving:

data = sock.recv(4096)
text = data.decode("utf-8")

Keep the conversion explicit so the encoding is visible.
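One caveat for the receiving side: a stream socket can return data in arbitrary pieces, so a multi-byte character may arrive split across two recv calls, and decoding each piece on its own can then fail. A minimal sketch using an incremental decoder from the codecs module, where the chunks list stands in for the results of successive sock.recv calls:

```python
import codecs

# "café" in UTF-8, split in the middle of the two-byte "é".
chunks = [b"caf\xc3", b"\xa9"]

# Decoding each chunk on its own fails on the truncated sequence.
try:
    chunks[0].decode("utf-8")
except UnicodeDecodeError as exc:
    print("per-chunk decode failed:", exc.reason)

# An incremental decoder buffers the incomplete bytes until the rest arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
parts = [decoder.decode(chunk) for chunk in chunks]
parts.append(decoder.decode(b"", final=True))   # flush; raises if bytes are left over

text = "".join(parts)
print(parts, text)
```

The alternative is to collect all bytes of a complete message first, using whatever framing the protocol defines, and decode once at the end.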

14.22 String Concatenation

Strings are immutable, so concatenation creates a new string.

s = "a"
s = s + "b"
s = s + "c"

This can allocate repeatedly.

For many parts, prefer join:

parts = ["a", "b", "c"]
s = "".join(parts)

For streaming text, use io.StringIO:

from io import StringIO

buf = StringIO()
buf.write("a")
buf.write("b")
buf.write("c")

s = buf.getvalue()

CPython has optimizations for some local concatenation cases, but code should not depend on them for general performance.

14.23 Bytes Building

Bytes are immutable too.

Repeated bytes concatenation has the same allocation problem.

data = b""
for chunk in chunks:
    data += chunk

Better:

data = b"".join(chunks)

For mutable incremental construction:

buf = bytearray()
for chunk in chunks:
    buf.extend(chunk)

data = bytes(buf)

For file-like binary building:

from io import BytesIO

buf = BytesIO()
buf.write(b"abc")
buf.write(b"def")

data = buf.getvalue()

14.24 Unicode Normalization

Different Unicode sequences can look the same.

Example:

a = "é"          # single code point U+00E9
b = "e\u0301"    # e plus combining acute accent

print(a == b)    # False

They may render similarly, but they are different sequences of code points.

Normalize when comparing human text:

import unicodedata

a = unicodedata.normalize("NFC", a)
b = unicodedata.normalize("NFC", b)

print(a == b)    # True

Common normalization forms:

Form	Meaning
NFC	Canonical composition
NFD	Canonical decomposition
NFKC	Compatibility composition
NFKD	Compatibility decomposition

Normalization is a Unicode-level concern, not a CPython object layout concern. But it matters for correct text processing.

14.25 Case Folding

Case-insensitive comparison should often use casefold, not lower.

a = "Straße"
b = "strasse"

print(a.lower() == b.lower())       # False
print(a.casefold() == b.casefold()) # True

casefold is designed for caseless matching.

For identifiers, filenames, usernames, and protocol fields, define exact normalization and case rules. Do not guess.
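Normalization and case folding are often needed together. The sketch below combines them into one comparison key; caseless_key is a hypothetical helper, and NFD is only one reasonable choice of form, so a real system should pick and document its own rules.

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Hypothetical helper: a key for caseless, canonical comparison."""
    # Decompose, fold case, then decompose again because folding
    # can produce characters that are not in normalized form.
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


print(caseless_key("Straße") == caseless_key("STRASSE"))    # True
print(caseless_key("\u00c9") == caseless_key("e\u0301"))    # True
```

The second comparison matches a precomposed capital É against a lowercase e followed by a combining acute accent, which neither casefold nor normalization alone would equate.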

14.26 Grapheme Clusters

A user-visible character may contain multiple code points.

Example:

s = "👨‍👩‍👧‍👦"
print(len(s))

For the four-person family emoji this prints 7, because the visible glyph is a sequence of four emoji code points joined by three zero-width joiners.

Python str indexes code points, not grapheme clusters.

s[0]

may return only part of a visible character.

For UI text editing, cursor movement, truncation, and display width, code point indexing may be insufficient. You need Unicode-aware grapheme segmentation.
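Writing the family emoji with explicit escapes makes its structure visible. This is only an illustration of code point counting; it does not perform grapheme segmentation.

```python
# Family emoji written with escapes so every code point is visible:
# man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

print(len(family))                            # 7 code points, one visible glyph
print([f"U+{ord(c):04X}" for c in family])    # four emoji and three U+200D
print(len(family.encode("utf-8")))            # 25 bytes
```

Truncating this string at any index from 1 to 6 splits the visible character, which is the failure mode that grapheme-aware segmentation avoids.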

14.27 Encoding Error Strategies

Encoding can fail if a character cannot be represented.

"é".encode("ascii")       # UnicodeEncodeError

Error strategies control behavior:

"é".encode("ascii", errors="ignore")
"é".encode("ascii", errors="replace")
"é".encode("ascii", errors="backslashreplace")

For file and process boundaries, choose error handling deliberately.

A common Unix-facing strategy is surrogateescape, which allows undecodable bytes to round-trip through str without data loss in some filesystem and environment contexts.
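The effect of each encoding strategy is easiest to see side by side. In this sketch, xmlcharrefreplace is an additional built-in handler not listed above; it applies to encoding only.

```python
text = "caf\u00e9"   # "café"

for handler in ("ignore", "replace", "backslashreplace", "xmlcharrefreplace"):
    print(f"{handler:18}", text.encode("ascii", errors=handler))
```

The results are b'caf', b'caf?', b'caf\\xe9', and b'caf&#233;' respectively. Only the last two preserve enough information to recover the original character.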

14.28 Filesystem Encoding

File paths cross a text and bytes boundary.

Python usually exposes paths as str, but operating systems often operate on encoded byte sequences or platform-native string formats.

CPython has filesystem encoding logic to convert between Python strings and platform path representations.

Practical rule:

use pathlib and str paths for normal application code
use bytes paths only when you need exact low-level byte behavior

Example:

from pathlib import Path

p = Path("data") / "notes.txt"
text = p.read_text(encoding="utf-8")

14.29 C API View of Unicode

C extension code often needs to accept or produce strings.

Common operations include:

create Unicode from UTF-8
convert Unicode to UTF-8
inspect length
read code points
parse arguments as str

Conceptual examples:

PyObject *s = PyUnicode_FromString("hello");

and:

const char *p = PyUnicode_AsUTF8(obj);

The UTF-8 pointer returned by some APIs may point into a cache owned by the string object. Its lifetime depends on the owning object remaining alive.

Extension code must not free that pointer.

14.30 C API View of Bytes

Bytes expose contiguous binary memory.

Conceptual examples:

PyObject *b = PyBytes_FromStringAndSize(data, len);

Access:

char *p = PyBytes_AS_STRING(b);
Py_ssize_t n = PyBytes_GET_SIZE(b);

Fast macros assume the object is really a bytes object.

Checked functions such as PyBytes_AsStringAndSize validate the type first and are the safer choice at public boundaries.

Bytes may contain NUL bytes, so always use explicit length.

14.31 Choosing the Right Type

Need	Type
Human text	`str`
Encoded text	`bytes`
Binary protocol data	`bytes`
Mutable binary buffer	`bytearray`
Zero-copy view	`memoryview`
Many text fragments	`list[str]` plus `"".join`
Many binary fragments	`list[bytes]` plus `b"".join`
Incremental text writer	`io.StringIO`
Incremental binary writer	`io.BytesIO` or `bytearray`

The type should reflect the data model. Avoid using str for arbitrary bytes. Avoid using bytes for decoded text.

14.32 Mental Model

Use this model:

str
    immutable Unicode code point sequence
    optimized internal storage
    cached hash
    explicit encode/decode boundary

bytes
    immutable byte sequence
    compact binary storage
    hashable
    no text semantics unless decoded

bytearray
    mutable byte sequence
    compact binary storage
    useful for incremental construction

memoryview
    zero-copy view over buffer-exporting object
    lifetime tied to exporter

Text processing bugs often come from confusing code points, bytes, encodings, grapheme clusters, normalization, and display width. CPython’s object model keeps text and bytes separate so these decisions remain explicit.

14.33 Summary

str is CPython’s Unicode text object. It is immutable, hashable, and internally optimized through flexible storage, cached hashes, and interning. It represents decoded text, not encoded bytes.

bytes is immutable binary data. bytearray is mutable binary data. memoryview provides zero-copy access to buffer-exporting objects.

The boundary between text and binary data is explicit: encode str to produce bytes, and decode bytes to produce str. This boundary is essential for correct file I/O, network protocols, source decoding, extension code, and Unicode-aware applications.
