# 14. Strings, Bytes, and Unicode

Text and binary data are separate object families in Python. `str` represents Unicode text. `bytes` represents immutable binary data. `bytearray` represents mutable binary data.

This separation is one of Python 3’s most important runtime design choices. Text has characters and encodings. Binary data has bytes. CPython implements these concepts with different object layouts, APIs, and invariants.

## 14.1 Text vs Binary Data

A string is text:

```python id="ly45du"
s = "hello"
```

A bytes object is binary data:

```python id="1wh8hi"
b = b"hello"
```

They may look similar for ASCII content, but they are different types.

```python id="iqtxra"
print(type("hello"))    # <class 'str'>
print(type(b"hello"))   # <class 'bytes'>
```

Python does not implicitly mix them:

```python id="8bzmko"
"hello" + b"world"      # TypeError
```

This is deliberate. Combining text and bytes requires an encoding decision.

```python id="8w5t7o"
text = "hello"
data = text.encode("utf-8")

again = data.decode("utf-8")
```

The conversion boundary is explicit.

## 14.2 `str` Represents Unicode Text

A Python `str` is a sequence of Unicode code points.

```python id="tq2xfz"
s = "Python 🐍"
print(len(s))    # 8
```

`len(s)` counts code points, not encoded bytes.

```python id="lxgm2u"
s = "é"

print(len(s))                  # 1
print(len(s.encode("utf-8")))  # 2
```

The string contains one Unicode code point. Its UTF-8 encoding uses two bytes.

This distinction appears throughout CPython internals:

```text id="c3cd4w"
str
    logical Unicode text

bytes
    encoded or arbitrary binary data
```
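
To make the difference concrete, the following sketch compares the code point count and the UTF-8 byte count for a few single-character strings. The sample characters are chosen only for illustration.

```python
samples = ["A", "é", "中", "🐍"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # len(ch) counts code points; len(encoded) counts UTF-8 bytes.
    print(f"U+{ord(ch):04X}  code points={len(ch)}  utf-8 bytes={len(encoded)}")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
print(utf8_lengths)   # [1, 2, 3, 4]
```

Each string has length 1 as a `str`, while the encoded length grows with the code point value.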

## 14.3 Unicode Code Points

A Unicode code point is an integer value assigned by the Unicode standard.

Examples:

| Character | Code point |
| --------- | ---------- |
| `A`       | `U+0041`   |
| `é`       | `U+00E9`   |
| `中`       | `U+4E2D`   |
| `🐍`      | `U+1F40D`  |

Python exposes this through `ord` and `chr`.

```python id="cvt38d"
print(ord("A"))        # 65
print(hex(ord("🐍")))  # 0x1f40d

print(chr(0x1f40d))    # 🐍
```

CPython does not store every string as UTF-8 internally. Instead, it chooses an internal layout based on the string’s contents.

## 14.4 Flexible String Representation

CPython uses a flexible internal representation for Unicode strings.

The key idea is simple:

```text id="hj5qoo"
Use the smallest fixed-width storage that can represent every code point in the string.
```

Conceptually:

| Maximum code point in string |         Storage width |
| ---------------------------- | --------------------: |
| ASCII only                   |  1 byte per character |
| Up to `U+00FF`               |  1 byte per character |
| Up to `U+FFFF`               | 2 bytes per character |
| Above `U+FFFF`               | 4 bytes per character |

Example:

```python id="k8wqod"
"abc"       # compact ASCII path
"café"      # fits in 1-byte storage (all code points <= U+00FF)
"中"        # needs 2-byte storage
"🐍"        # needs 4-byte storage
```

This design avoids wasting four bytes per character for common ASCII-heavy text while still supporting all Unicode code points.
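
One way to observe the storage width from Python is to measure how much `sys.getsizeof` grows when one more character is appended. This is a sketch that relies on CPython implementation details; the helper name `bytes_per_char` is hypothetical, and other implementations may report different numbers.

```python
import sys

def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate storage width by measuring the cost of one extra character."""
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)

# On CPython 3.3 and later, the expected widths are 1, 1, 2, and 4 bytes.
widths = {ch: bytes_per_char(ch) for ch in ["a", "é", "中", "🐍"]}
print(widths)
```

Note that a single wide character raises the width for the whole string, so one emoji in a long ASCII text makes every character cost four bytes.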

## 14.5 String Object Metadata

A CPython string stores more than character data.

Conceptual fields include:

```text id="xr6949"
object header
length
hash cache
state flags
kind
compact flag
ASCII flag
ready flag (legacy; removed in Python 3.12)
character data
optional UTF-8 cache
```

The exact C layout is version-specific, but the invariants matter more than the field names.

Important metadata:

| Metadata     | Purpose                                       |
| ------------ | --------------------------------------------- |
| Length       | Number of Unicode code points                 |
| Hash cache   | Stores hash after first computation           |
| Kind         | Storage width                                 |
| ASCII flag   | Fast path for ASCII strings                   |
| Compact flag | Whether character data immediately follows the object header in the same allocation |
| UTF-8 cache  | Cached encoded form for C API use             |

Strings are immutable, so cached metadata is safe. Once computed, the hash remains valid.

## 14.6 String Immutability

Python strings are immutable.

```python id="8lnw2h"
s = "hello"
s[0] = "H"       # TypeError
```

Operations create new strings:

```python id="q4ux6e"
s = "hello"
t = "H" + s[1:]

print(s)    # hello
print(t)    # Hello
```

Immutability gives CPython several advantages:

```text id="ylorrg"
hash values can be cached
strings can be safely interned
strings can be shared across dictionaries
strings can be used as dict keys
substring operations cannot mutate originals
```

The cost is that repeated string concatenation can allocate many intermediate strings if written poorly.

## 14.7 String Hashing

Strings are hashable.

```python id="7pmw4s"
hash("name")
```

A string’s hash is computed from its contents. Because strings are immutable, CPython can cache the result inside the string object.

This matters because strings are used heavily as dictionary keys:

```text id="n69066"
module globals
object attribute names
class dictionaries
keyword argument names
JSON-like data
configuration maps
protocol field names
```

Without cached string hashes, attribute lookup and dictionary lookup would be more expensive.

## 14.8 String Interning

Interning means reusing one string object for equal string values in selected cases.

```python id="alsrhm"
a = "identifier"
b = "identifier"

print(a is b)   # may be True
```

CPython interns many identifier-like strings because they are common in attribute lookup and namespaces.

Interning is useful for:

```text id="te68vu"
attribute names
variable names
keyword names
module names
common internal strings
```

When both strings are interned, internal lookup paths can often confirm a match with a single pointer comparison instead of comparing character data.

But user code should not depend on arbitrary string identity.

Correct:

```python id="k4ah3b"
if a == b:
    ...
```

Incorrect:

```python id="tgxz95"
if a is b:
    ...
```

Use `is` only for identity semantics, especially documented singletons such as `None`.
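
When identity really is wanted for strings, `sys.intern` makes it explicit. The sketch below builds two equal strings at run time and then interns them; whether the un-interned objects happen to be identical is an implementation detail.

```python
import sys

# Build two equal strings at run time so they are not compile-time constants.
a = "".join(["ident", "ifier"])
b = "".join(["ident", "ifier"])

print(a == b)    # True: same contents
print(a is b)    # implementation detail; do not rely on either answer

# sys.intern returns the canonical object for a given string value.
ia = sys.intern(a)
ib = sys.intern(b)
print(ia is ib)  # True
```

Explicit interning is mostly useful when a program holds many copies of a small set of strings, such as repeated field names.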

## 14.9 Encoding

Encoding converts text to bytes.

```python id="n587av"
s = "café"
b = s.encode("utf-8")

print(b)        # b'caf\xc3\xa9'
```

The string is Unicode text. UTF-8 is one external byte representation.

Common encodings:

| Encoding | Use                                 |
| -------- | ----------------------------------- |
| UTF-8    | Web, files, APIs, Unix systems      |
| UTF-16   | Some platform APIs and file formats |
| UTF-32   | Fixed-width Unicode storage         |
| Latin-1  | Legacy Western byte mapping         |
| ASCII    | 7-bit English subset                |

CPython does not treat encoding as a property of a `str`. A string is decoded text. Encoding is used when crossing a byte boundary.
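
The same text produces different byte sequences under different encodings. This sketch encodes one string several ways and compares the sizes; the `-le` variants are used so that no byte order mark is added.

```python
text = "café"

# The -le variants fix the byte order, so no BOM is prepended.
encodings = ["utf-8", "utf-16-le", "utf-32-le", "latin-1"]
sizes = {name: len(text.encode(name)) for name in encodings}

for name, size in sizes.items():
    print(f"{name:10} {size} bytes")
# utf-8      5 bytes
# utf-16-le  8 bytes
# utf-32-le  16 bytes
# latin-1    4 bytes
```

The `str` is the same in every case. Only the byte representation chosen at the boundary changes.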

## 14.10 Decoding

Decoding converts bytes to text.

```python id="n1sd4m"
b = b"caf\xc3\xa9"
s = b.decode("utf-8")

print(s)        # café
```

If the bytes are invalid for the selected encoding, decoding fails unless an error handler is used.

```python id="83jkr9"
b = b"\xff"

b.decode("utf-8")                 # UnicodeDecodeError
b.decode("utf-8", errors="ignore")    # ''
b.decode("utf-8", errors="replace")   # '\ufffd' (the replacement character)
```

Error handlers include:

```text id="j4u46t"
strict
ignore
replace
backslashreplace
surrogateescape
surrogatepass
```

Encoding and decoding errors arise at the boundary between text and bytes, not in ordinary string operations.
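
As a small illustration, the sketch below decodes the same invalid input with several handlers and records what each one produces. The input bytes are made up for the example.

```python
data = b"caf\xff"   # \xff can never appear in valid UTF-8

results = {}
for handler in ("ignore", "replace", "backslashreplace"):
    results[handler] = data.decode("utf-8", errors=handler)
    print(f"{handler:17} {results[handler]!r}")

strict_error = None
try:
    data.decode("utf-8")          # errors="strict" is the default
except UnicodeDecodeError as exc:
    strict_error = exc
    print("strict            raised UnicodeDecodeError at byte", exc.start)
```

`ignore` silently loses data, `replace` marks the damage with U+FFFD, and `backslashreplace` keeps the offending byte visible as an escape.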

## 14.11 UTF-8 Boundary

Most modern external text is UTF-8.

Examples:

```text id="ps2rks"
source files
JSON
HTTP payloads
HTML
Markdown
logs
configuration files
database text protocols
```

A practical rule:

```text id="cszoaq"
inside Python
    use str

at file, network, process, and binary protocol boundaries
    encode or decode explicitly
```

Example:

```python id="le5dmj"
from pathlib import Path

text = Path("notes.txt").read_text(encoding="utf-8")
data = text.encode("utf-8")
```

This keeps the boundary clear.

## 14.12 Source Code Encoding

Python source files are decoded before parsing.

The default source encoding is UTF-8 unless an encoding declaration says otherwise.

Example:

```python id="cndzxu"
# -*- coding: utf-8 -*-

name = "café"
```

The tokenizer works on decoded source text. String literals then become Python `str` objects unless they are bytes literals.

```python id="wu9nv8"
s = "abc"       # str
b = b"abc"      # bytes
```

This distinction is made during parsing and compilation.
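
The standard library's `tokenize.detect_encoding` applies the same declaration rules and can be used to inspect which encoding a piece of source would be read with. The helper name `source_encoding` and the sample sources below are hypothetical.

```python
import io
import tokenize

def source_encoding(source: bytes) -> str:
    """Return the encoding reported by tokenize.detect_encoding for source bytes."""
    encoding, _consumed_lines = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding

default = source_encoding(b'name = "cafe"\n')
declared = source_encoding(b'# -*- coding: latin-1 -*-\nname = "caf\xe9"\n')

print(default)    # utf-8
print(declared)   # iso-8859-1 (the normalized name for latin-1)
```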

## 14.13 String Literals

Python has several string literal forms.

```python id="9uk2iv"
"hello"
'hello'
"""hello"""
'''hello'''
r"\n"
f"value = {x}"
```

Literal prefixes affect parsing:

| Prefix       | Meaning                         |
| ------------ | ------------------------------- |
| `r`          | Raw string literal              |
| `f`          | Formatted string literal        |
| `b`          | Bytes literal                   |
| `u`          | Historical compatibility prefix |
| combinations | `fr`, `rf`, `br`, `rb`          |

A raw string literal tells the parser to leave backslash escapes uninterpreted.

```python id="5il9aw"
s = r"\n"
print(s)        # \n
```

It still creates a normal `str`.
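
A quick check makes the point: the prefix changes how the literal is read, not the type or behavior of the resulting object.

```python
cooked = "\n"       # one character: a newline
raw = r"\n"         # two characters: a backslash and 'n'

print(len(cooked), len(raw))   # 1 2
print(type(raw) is str)        # True

# Equal contents compare equal regardless of how the literal was written.
print(raw == "\\n")            # True

# Raw bytes literals follow the same rule.
print(rb"\x00" == b"\\x00")    # True
```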

## 14.14 Bytes Objects

`bytes` represents immutable binary data.

```python id="rdc547"
b = b"hello"
```

A bytes object is a sequence of integers in the range 0 to 255.

```python id="a55532"
b = b"ABC"

print(b[0])     # 65
print(b[1])     # 66
```

Slicing returns another bytes object:

```python id="ry5nl7"
print(b[1:])    # b'BC'
```

Because `bytes` is immutable, it is hashable.

```python id="524c01"
d = {b"key": "value"}
```

Bytes are useful for:

```text id="ujfy6z"
network protocols
binary files
cryptographic data
compressed data
encoded text
database wire formats
image and audio formats
```

## 14.15 Bytes Object Layout

A bytes object is variable-sized.

Conceptually:

```text id="kr49t3"
PyBytesObject
    PyVarObject header
        ob_size = number of bytes
    hash cache
    byte data
    trailing NUL byte for C compatibility
```

The trailing NUL byte helps when passing data to C APIs that expect C strings, but bytes may contain embedded NUL bytes:

```python id="z96jr9"
b = b"a\0b"
print(len(b))   # 3
```

So bytes must not be treated as ordinary NUL-terminated strings unless the API specifically permits it and the content is known safe.

## 14.16 Bytearray Objects

`bytearray` is mutable binary data.

```python id="cltuuo"
buf = bytearray(b"hello")
buf[0] = ord("H")

print(buf)      # bytearray(b'Hello')
```

A bytearray supports in-place modification:

```python id="vf2c2m"
buf.append(33)
buf.extend(b" world")
```

It is not hashable because its contents can change.

```python id="2mamf7"
hash(bytearray(b"abc"))     # TypeError
```

Conceptually, bytearray is closer to a mutable list of bytes, but implemented as a compact byte buffer rather than an array of Python integer object references.

## 14.17 Bytes vs List of Ints

Compare:

```python id="mm43w3"
b = b"abc"
xs = [97, 98, 99]
```

The bytes object stores raw bytes compactly.

The list stores references to Python integer objects.

Conceptually:

```text id="6zghb9"
bytes
    [97][98][99]

list
    [ptr][ptr][ptr]
      |    |    |
      v    v    v
     int  int  int
```

For binary data, `bytes` and `bytearray` are far more memory efficient.
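
The difference can be measured roughly with `sys.getsizeof`. This is a sketch that depends on CPython and platform details, so the exact numbers will vary.

```python
import sys

n = 10_000
as_bytes = bytes(n)          # n zero bytes stored inline
as_list = list(as_bytes)     # n references to int objects

bytes_size = sys.getsizeof(as_bytes)
# getsizeof on a list counts only the pointer array, not the int objects.
list_size = sys.getsizeof(as_list)

print(f"bytes: {bytes_size} bytes for {n} items")
print(f"list:  {list_size} bytes of pointers for {n} items")
```

Even before counting the integer objects themselves, the list needs one pointer per element, while `bytes` needs one byte per element plus a small fixed header.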

## 14.18 Memoryview

`memoryview` exposes the buffer of another object.

```python id="lk88r7"
buf = bytearray(b"hello")
view = memoryview(buf)

view[0] = ord("H")

print(buf)      # bytearray(b'Hello')
```

A memoryview avoids copying.

It is useful when slicing or passing binary data between APIs:

```python id="fmmfbg"
data = bytearray(b"abcdef")
view = memoryview(data)[2:5]

print(view.tobytes())     # b'cde'
```

The view holds a reference to the exporter, so the underlying buffer stays valid for as long as the view exists.
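
The following sketch contrasts a copying slice with a shared view, and shows one visible consequence of the lifetime rules: a `bytearray` refuses to resize while a view is exported.

```python
data = bytearray(b"abcdef")

copy = bytes(data[2:5])      # slicing a bytearray copies the bytes
full = memoryview(data)
view = full[2:5]             # slicing a memoryview shares them

data[2] = ord("X")
seen = view.tobytes()
print(copy)                  # b'cde'  (the copy is unaffected)
print(seen)                  # b'Xde'  (the view sees the change)

# While a view is exported, the bytearray cannot be resized.
try:
    data.append(0)
    resized = True
except BufferError:
    resized = False

# Releasing every view ends the export, after which resizing works again.
view.release()
full.release()
data.append(0)
```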

## 14.19 Buffer Protocol

The buffer protocol is the C-level protocol behind `memoryview`.

It lets objects expose raw memory to other objects.

Common buffer exporters:

```text id="yt656l"
bytes
bytearray
array.array
memoryview
mmap
third-party arrays such as NumPy arrays
custom extension objects
```

The protocol describes:

```text id="va15hu"
pointer to memory
length
item size
format
number of dimensions
shape
strides
readonly flag
lifetime callbacks
```

This makes zero-copy binary processing possible.
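
Most of these fields are visible from Python as `memoryview` attributes. The sketch below inspects the buffer exported by an `array.array` of C doubles.

```python
from array import array

numbers = array("d", [1.0, 2.0, 3.0])   # C doubles, 8 bytes each
view = memoryview(numbers)

print(view.format)     # 'd'
print(view.itemsize)   # 8
print(view.ndim)       # 1
print(view.shape)      # (3,)
print(view.strides)    # (8,)
print(view.readonly)   # False
print(view.nbytes)     # 24

print(memoryview(b"abc").readonly)   # True: bytes is immutable
```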

## 14.20 Text I/O vs Binary I/O

Files can be opened in text mode or binary mode.

Text mode decodes bytes into strings:

```python id="ztfr4f"
with open("notes.txt", "r", encoding="utf-8") as f:
    text = f.read()
```

Binary mode returns bytes:

```python id="2vcxb8"
with open("notes.txt", "rb") as f:
    data = f.read()
```

Text mode handles encoding, decoding, and newline translation.

Binary mode gives raw bytes.

Choose based on the data model:

```text id="ifzmcj"
human-readable text
    text mode, str

protocol bytes or exact file bytes
    binary mode, bytes
```
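
To see both modes on the same file, this sketch writes a short text to a temporary file and reads it back each way. The file name is hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notes.txt"   # hypothetical file name

    # newline="\n" disables platform newline translation on write.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("café\n")

    with open(path, "r", encoding="utf-8") as f:
        text = f.read()

    with open(path, "rb") as f:
        data = f.read()

print(repr(text), len(text))   # 'café\n' 5
print(repr(data), len(data))   # b'caf\xc3\xa9\n' 6
```

The text-mode result has five code points. The binary-mode result has six bytes, because `é` occupies two bytes in UTF-8.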

## 14.21 Common Boundary Bug

A common bug is mixing text and bytes at boundaries.

Incorrect:

```python id="zctul0"
def send(sock, message):
    sock.sendall(message)       # TypeError if message is str
```

Correct:

```python id="daxwma"
def send(sock, message):
    data = message.encode("utf-8")
    sock.sendall(data)
```

For receiving:

```python id="jynmly"
data = sock.recv(4096)
text = data.decode("utf-8")
```

Keep the conversion explicit so the encoding is visible.
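
One caveat with the receiving side: a single `recv` call may return a chunk that ends in the middle of a multi-byte character, in which case decoding that chunk alone fails. A possible approach is an incremental decoder from the `codecs` module, sketched below with chunks simulated by slicing. The helper name `decode_chunks` is hypothetical.

```python
import codecs

def decode_chunks(chunks, encoding="utf-8"):
    """Decode an iterable of byte chunks, tolerating characters split across chunks."""
    decoder = codecs.getincrementaldecoder(encoding)()
    parts = []
    for chunk in chunks:
        # Incomplete trailing bytes are buffered until the next chunk arrives.
        parts.append(decoder.decode(chunk))
    # final=True raises UnicodeDecodeError if the input ended mid-character.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)

payload = "café 🐍".encode("utf-8")
# Simulated network chunks that split both multi-byte characters.
chunks = [payload[:4], payload[4:7], payload[7:]]

try:
    chunks[0].decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

print(naive_ok)                 # False: the first chunk ends mid-character
print(decode_chunks(chunks))    # café 🐍
```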

## 14.22 String Concatenation

Strings are immutable, so concatenation creates a new string.

```python id="fnu5rd"
s = "a"
s = s + "b"
s = s + "c"
```

This can allocate repeatedly.

For many parts, prefer `join`:

```python id="xnzlkg"
parts = ["a", "b", "c"]
s = "".join(parts)
```

For streaming text, use `io.StringIO`:

```python id="1icewe"
from io import StringIO

buf = StringIO()
buf.write("a")
buf.write("b")
buf.write("c")

s = buf.getvalue()
```

CPython has optimizations for some local concatenation cases, but code should not depend on them for general performance.

## 14.23 Bytes Building

Bytes are immutable too.

Repeated bytes concatenation has the same allocation problem.

```python id="bufv11"
data = b""
for chunk in chunks:
    data += chunk
```

Better:

```python id="5ojx6g"
data = b"".join(chunks)
```

For mutable incremental construction:

```python id="bpfowr"
buf = bytearray()
for chunk in chunks:
    buf.extend(chunk)

data = bytes(buf)
```

For file-like binary building:

```python id="s6mtb3"
from io import BytesIO

buf = BytesIO()
buf.write(b"abc")
buf.write(b"def")

data = buf.getvalue()
```

## 14.24 Unicode Normalization

Different Unicode sequences can look the same.

Example:

```python id="d3pxs7"
a = "é"          # single code point U+00E9
b = "e\u0301"    # e plus combining acute accent

print(a == b)    # False
```

They may render similarly, but they are different sequences of code points.

Normalize when comparing human text:

```python id="7m77tv"
import unicodedata

a = unicodedata.normalize("NFC", a)
b = unicodedata.normalize("NFC", b)

print(a == b)    # True
```

Common normalization forms:

| Form | Meaning                     |
| ---- | --------------------------- |
| NFC  | Canonical composition       |
| NFD  | Canonical decomposition     |
| NFKC | Compatibility composition   |
| NFKD | Compatibility decomposition |

Normalization is a Unicode-level concern, not a CPython object layout concern. But it matters for correct text processing.

## 14.25 Case Folding

Case-insensitive comparison should often use `casefold`, not `lower`.

```python id="xo53gl"
a = "Straße"
b = "strasse"

print(a.lower() == b.lower())       # False
print(a.casefold() == b.casefold()) # True
```

`casefold` is designed for caseless matching.

For identifiers, filenames, usernames, and protocol fields, define exact normalization and case rules. Do not guess.
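
For general human text, one commonly used recipe combines normalization with case folding. The sketch below is one reasonable choice rather than a universal rule, and the helper names `caseless_key` and `caseless_equal` are hypothetical.

```python
import unicodedata

def caseless_key(text: str) -> str:
    """Build a comparison key: normalize, case-fold, then normalize again."""
    nfd = unicodedata.normalize("NFD", text)
    # Case folding can produce text that is no longer normalized.
    return unicodedata.normalize("NFD", nfd.casefold())

def caseless_equal(a: str, b: str) -> bool:
    return caseless_key(a) == caseless_key(b)

print(caseless_equal("Straße", "STRASSE"))   # True
print(caseless_equal("É", "e\u0301"))        # True
print(caseless_equal("resume", "résumé"))    # False
```

Accents are still significant under this rule. Ignoring them is a separate policy decision that the application has to make explicitly.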

## 14.26 Grapheme Clusters

A user-visible character may contain multiple code points.

Example:

```python id="g55aom"
s = "👨‍👩‍👧‍👦"
print(len(s))
```

This prints 7 for the standard family sequence, not 1, because the visible emoji is four emoji code points joined by three zero-width joiners (U+200D).

Python `str` indexes code points, not grapheme clusters.

```python id="1i0lb4"
s[0]
```

may return only part of a visible character.

For UI text editing, cursor movement, truncation, and display width, code point indexing may be insufficient. You need Unicode-aware grapheme segmentation.
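
The standard library has no grapheme segmenter. The sketch below handles only one small piece of the problem, attaching combining marks to their base character, and is not a substitute for full Unicode segmentation rules. The helper name `group_combining_marks` is hypothetical.

```python
import unicodedata

def group_combining_marks(text: str) -> list[str]:
    """Attach combining marks to the preceding base character.

    This is NOT full grapheme segmentation: it ignores zero-width joiner
    sequences, Hangul syllable rules, regional indicators, and more.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # nonzero combining class: extend the cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

s = "cafe\u0301"                    # 'e' followed by a combining acute accent
clusters = group_combining_marks(s)

print(len(s))                       # 5 code points
print(len(clusters))                # 4 clusters
```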

## 14.27 Encoding Error Strategies

Encoding can fail if a character cannot be represented.

```python id="j37zya"
"é".encode("ascii")       # UnicodeEncodeError
```

Error strategies control behavior:

```python id="9s35v1"
"é".encode("ascii", errors="ignore")            # b''
"é".encode("ascii", errors="replace")           # b'?'
"é".encode("ascii", errors="backslashreplace")  # b'\\xe9'
```

For file and process boundaries, choose error handling deliberately.

A common Unix-facing strategy is `surrogateescape`, which allows undecodable bytes to round-trip through `str` without data loss in some filesystem and environment contexts.
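
The round trip can be sketched as follows. The byte string stands in for a file name that is not valid UTF-8 and is made up for the example.

```python
raw = b"report-\xff.txt"    # hypothetical name bytes that are not valid UTF-8

# The undecodable byte 0xFF becomes the lone surrogate U+DCFF.
name = raw.decode("utf-8", errors="surrogateescape")
print(repr(name))           # 'report-\udcff.txt'

# The lone surrogate cannot be encoded strictly...
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but the same handler restores the original bytes exactly.
restored = name.encode("utf-8", errors="surrogateescape")
print(restored == raw)      # True
```

Strings produced this way are safe to pass back to the operating system, but they can fail when written to a strict UTF-8 destination such as a log file or JSON document.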

## 14.28 Filesystem Encoding

File paths cross a text and bytes boundary.

Python usually exposes paths as `str`, but operating systems often operate on encoded byte sequences or platform-native string formats.

CPython has filesystem encoding logic to convert between Python strings and platform path representations.

Practical rule:

```text id="pqbj4c"
use pathlib and str paths for normal application code
use bytes paths only when you need exact low-level byte behavior
```

Example:

```python id="w0jnqu"
from pathlib import Path

p = Path("data") / "notes.txt"
text = p.read_text(encoding="utf-8")
```
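
When a bytes path is needed, `os.fsencode` and `os.fsdecode` apply the filesystem encoding and its error handler. A minimal sketch:

```python
import os
import sys

print(sys.getfilesystemencoding())   # commonly 'utf-8'

name = "notes.txt"
as_bytes = os.fsencode(name)         # str -> bytes using the filesystem encoding
as_text = os.fsdecode(as_bytes)      # bytes -> str, the inverse operation

print(as_bytes, as_text)
```

Using these helpers instead of a hard-coded `encode("utf-8")` keeps path handling consistent with what the interpreter itself does.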

## 14.29 C API View of Unicode

C extension code often needs to accept or produce strings.

Common operations include:

```text id="psub45"
create Unicode from UTF-8
convert Unicode to UTF-8
inspect length
read code points
parse arguments as str
```

Conceptual examples:

```c id="ffcssi"
PyObject *s = PyUnicode_FromString("hello");
```

and:

```c id="umgbtx"
const char *p = PyUnicode_AsUTF8(obj);
```

The pointer returned by `PyUnicode_AsUTF8` points into a UTF-8 cache owned by the string object, so it is valid only while that object remains alive. On failure the call returns `NULL` with an exception set.

Extension code must not free that pointer.

## 14.30 C API View of Bytes

Bytes expose contiguous binary memory.

Conceptual examples:

```c id="g34cn9"
PyObject *b = PyBytes_FromStringAndSize(data, len);
```

Access:

```c id="q3e1ai"
char *p = PyBytes_AS_STRING(b);
Py_ssize_t n = PyBytes_GET_SIZE(b);
```

Fast macros assume the object is really a bytes object.

Safer checked APIs should be used at public boundaries.

Bytes may contain NUL bytes, so always use explicit length.

## 14.31 Choosing the Right Type

| Need                      | Type                          |
| ------------------------- | ----------------------------- |
| Human text                | `str`                         |
| Encoded text              | `bytes`                       |
| Binary protocol data      | `bytes`                       |
| Mutable binary buffer     | `bytearray`                   |
| Zero-copy view            | `memoryview`                  |
| Many text fragments       | `list[str]` plus `"".join`    |
| Many binary fragments     | `list[bytes]` plus `b"".join` |
| Incremental text writer   | `io.StringIO`                 |
| Incremental binary writer | `io.BytesIO` or `bytearray`   |

The type should reflect the data model. Avoid using `str` for arbitrary bytes. Avoid using `bytes` for decoded text.

## 14.32 Mental Model

Use this model:

```text id="dluupp"
str
    immutable Unicode code point sequence
    optimized internal storage
    cached hash
    explicit encode/decode boundary

bytes
    immutable byte sequence
    compact binary storage
    hashable
    no text semantics unless decoded

bytearray
    mutable byte sequence
    compact binary storage
    useful for incremental construction

memoryview
    zero-copy view over buffer-exporting object
    lifetime tied to exporter
```

Text processing bugs often come from confusing code points, bytes, encodings, grapheme clusters, normalization, and display width. CPython’s object model keeps text and bytes separate so these decisions remain explicit.

## 14.33 Summary

`str` is CPython’s Unicode text object. It is immutable, hashable, and internally optimized through flexible storage, cached hashes, and interning. It represents decoded text, not encoded bytes.

`bytes` is immutable binary data. `bytearray` is mutable binary data. `memoryview` provides zero-copy access to buffer-exporting objects.

The boundary between text and binary data is explicit: encode `str` to produce `bytes`, and decode `bytes` to produce `str`. This boundary is essential for correct file I/O, network protocols, source decoding, extension code, and Unicode-aware applications.

