1.21 Data Representation

Data representation is the choice of concrete form used to store the objects in a problem. The same mathematical object can have many representations, and each representation makes some operations cheap while making others expensive. Choosing the representation is therefore part of algorithm design.

A graph can be stored as an edge list, adjacency list, adjacency matrix, or compressed sparse structure. A set can be stored as a sorted array, hash table, bitset, or balanced tree. A string can be stored as bytes, Unicode code points, tokens, or a suffix structure. These choices directly affect correctness details, time complexity, space complexity, and implementation risk.

Problem

You know the abstract data in the problem, but you need a concrete representation that supports the required operations efficiently and safely.

Solution

Choose the representation by listing the operations first.

Use this template:

Object:
  What mathematical object is being represented?

Required operations:
  Which operations are frequent?
  Which operations must be fast?
  Which operations are rare?

Candidate representations:
  What are the options?

Decision:
  Which representation matches the operation pattern?

For example, if a problem repeatedly asks whether an integer value has appeared before, a hash set or bitset may be appropriate. If the problem repeatedly asks for the smallest remaining value, a heap or balanced tree may be appropriate. If the problem repeatedly scans all neighbors of a graph vertex, an adjacency list is often the natural representation.

Example: Set of Integers

Suppose you need to store a set of integers and support insertion and membership queries.

Representation	Membership	Insert	Space	Good When
Unsorted array	(O(n))	(O(1))	(O(n))	Very small data
Sorted array	(O(\log n))	(O(n))	(O(n))	Mostly read-only data
Hash set	Expected (O(1))	Expected (O(1))	(O(n))	General membership queries
Balanced tree	(O(\log n))	(O(\log n))	(O(n))	Ordered traversal needed
Bitset	(O(1))	(O(1))	(O(U))	Values lie in small universe (0..U-1)

There is no universally best representation. A bitset is excellent when the value universe is small. It is wasteful when values are sparse and large. A sorted array is poor for frequent insertions but excellent for compact storage and binary search.

Example: Graph Representation

For a graph with (n) vertices and (m) edges, the common representations are:

Representation	Space	Iterate Neighbors	Edge Query
Edge list	(O(m))	(O(m))	(O(m))
Adjacency list	(O(n + m))	(O(\deg(v)))	(O(\deg(v)))
Adjacency matrix	(O(n^2))	(O(n))	(O(1))
Hash adjacency sets	(O(n + m))	(O(\deg(v)))	Expected (O(1))

For breadth-first search or depth-first search, adjacency lists are usually best because traversal needs to iterate outgoing edges. For dense graph algorithms that frequently test whether an edge exists, an adjacency matrix can be better despite its larger space cost.

Representation Invariants

A representation should have invariants. These are facts that must remain true whenever the data structure is used.

Examples:

Sorted array:
  Elements are stored in nondecreasing order.

Heap:
  Every parent has priority no greater than its children.

Adjacency list:
  For every stored edge u -> v, both u and v are valid vertex labels.

Union-find:
  Parent pointers form rooted trees.

Representation invariants make operations easier to reason about. They also give useful assertion checks during testing.

Encoding Choices

Some representation decisions are small but important.

For intervals:

Closed interval:      [l, r]
Half-open interval:   [l, r)

Half-open intervals often simplify length calculations because the length is r - l. Closed intervals often match binary search explanations. Either convention can work, but mixing them causes bugs.

For grids:

Cell index:
  row, col

Flattened index:
  id = row * width + col

Flattening can simplify graph algorithms on grids, but explicit row and column coordinates may be easier to read.

For graphs:

Zero-based labels:
  0..n-1

One-based labels:
  1..n

Most array-based implementations prefer zero-based labels because labels can be used directly as indices.

Space and Locality

Representation also affects cache behavior and constant factors. A compact array can be much faster than a pointer-heavy structure even when asymptotic bounds look similar. For example, a sorted array with binary search may outperform a balanced tree for moderate-size read-heavy data because the array uses contiguous memory.

This does not change the asymptotic model, but it matters in implementation.

Common Pitfalls

Do not choose a representation before listing operations. The best representation follows from the workload.

Do not ignore memory limits. An adjacency matrix for (n = 10^5) is impossible in ordinary memory.

Do not expose internal representation unless callers need it. Encapsulation lets you change the representation without changing the algorithm’s interface.

Do not forget representation invariants after updates. Insertions, deletions, and swaps must preserve the facts that later operations rely on.

Takeaway

Data representation determines which operations are cheap, which invariants are natural, and which complexity bounds are realistic. Choose the representation from the operation pattern, state its invariants, and verify that it fits both time and space constraints.