# High Performance Concurrent Design


High performance concurrent design means using several threads or tasks without making the program slower, more fragile, or harder to reason about.

Concurrency does not automatically make a program fast.

A bad concurrent program can be slower than a simple single-threaded program. It can waste time on locks, memory allocation, thread scheduling, cache misses, and communication between threads.

The goal is not to use many threads.

The goal is to keep useful work moving.

#### Start with the Work

Before adding concurrency, ask what kind of work the program performs.

| Work type | Usually good tool |
|---|---|
| CPU-heavy independent work | Worker threads |
| Many waiting network operations | Event loop or async I/O |
| Simple shared counters | Atomics |
| Shared complex state | Mutexes |
| Producer-consumer pipelines | Queues |
| Periodic background work | Threads, timers, or event loop tasks |

A program that compresses many files has a different shape from a program that handles many sockets.

Do not choose the concurrency tool first. Choose it after you understand the work.

#### Avoid Shared Mutable State

The fastest lock is the lock you do not need.

Shared mutable state forces coordination. Coordination costs time and creates bugs.

Prefer this shape:

```text
thread 1 owns data A
thread 2 owns data B
thread 3 owns data C

main combines the results later
```

over this shape:

```text
all threads update one shared object
```

For example, instead of one shared counter:

```zig
var counter = std.atomic.Value(u64).init(0);
```

you can often give each worker its own local counter:

```zig
fn worker(result: *u64) void {
    var local: u64 = 0;

    // do work
    local += 1;

    result.* = local;
}
```

Then combine results after `join`.

```zig
const total = a + b + c;
```

Local data is cheap. Shared data is expensive.
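Putting those pieces together, a minimal sketch might look like the following. The loop body is a stand-in for real work, and a recent Zig standard library (`std.Thread.spawn` with a config struct) is assumed:

```zig
const std = @import("std");

fn worker(result: *u64) void {
    var local: u64 = 0;

    // stand-in for real work: count privately, no locks needed
    var i: u64 = 0;
    while (i < 1000) : (i += 1) local += 1;

    result.* = local;
}

pub fn main() !void {
    var a: u64 = 0;
    var b: u64 = 0;
    var c: u64 = 0;

    const t1 = try std.Thread.spawn(.{}, worker, .{&a});
    const t2 = try std.Thread.spawn(.{}, worker, .{&b});
    const t3 = try std.Thread.spawn(.{}, worker, .{&c});

    t1.join();
    t2.join();
    t3.join();

    // safe: every worker has finished writing its own result
    const total = a + b + c;
    std.debug.print("total = {}\n", .{total});
}
```

Each thread writes only to the `u64` it was given, so no synchronization is needed until after `join`.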

#### Partition the Input

A good parallel program often splits input into independent chunks.

For example, suppose you need to process a large array.

Bad shape:

```text
all threads pull one item at a time from one shared queue
```

Better shape:

```text
thread 1 processes items 0..1000
thread 2 processes items 1000..2000
thread 3 processes items 2000..3000
```

Each thread owns a range.

```zig
const Range = struct {
    start: usize,
    end: usize,
};
```

The worker receives its range:

```zig
fn worker(items: []const u64, range: Range, result: *u64) void {
    var sum: u64 = 0;

    var i = range.start;
    while (i < range.end) : (i += 1) {
        sum += items[i];
    }

    result.* = sum;
}
```

This avoids locking inside the loop.
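A sketch of the driver side, reusing the `Range` and `worker` shapes above. The chunk count and input are made up for illustration, and the input length is assumed to divide evenly:

```zig
const std = @import("std");

const Range = struct {
    start: usize,
    end: usize,
};

fn worker(items: []const u64, range: Range, result: *u64) void {
    var sum: u64 = 0;
    var i = range.start;
    while (i < range.end) : (i += 1) {
        sum += items[i];
    }
    result.* = sum;
}

pub fn main() !void {
    const items = [_]u64{1} ** 3000;

    var results = [_]u64{0} ** 3;
    var threads: [3]std.Thread = undefined;

    // each thread owns one contiguous range of the input
    const chunk = items.len / threads.len;
    for (&threads, 0..) |*t, n| {
        const range = Range{ .start = n * chunk, .end = (n + 1) * chunk };
        t.* = try std.Thread.spawn(.{}, worker, .{ items[0..], range, &results[n] });
    }
    for (threads) |t| t.join();

    var total: u64 = 0;
    for (results) |r| total += r;
    std.debug.print("total = {}\n", .{total});
}
```

Each worker reads shared immutable input and writes to its own result slot, so the hot loop needs no locks at all.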

#### Lock Outside Hot Loops

A hot loop is code that runs many times.

This is usually bad:

```zig
var i: usize = 0;
while (i < items.len) : (i += 1) {
    mutex.lock();
    shared_sum += items[i];
    mutex.unlock();
}
```

Every iteration locks and unlocks the mutex.

Better:

```zig
var local_sum: u64 = 0;
var i: usize = 0;

while (i < items.len) : (i += 1) {
    local_sum += items[i];
}

mutex.lock();
shared_sum += local_sum;
mutex.unlock();
```

Now the lock is used once.

Do as much work locally as possible. Communicate less often.

#### Reduce Contention

Contention means several threads want the same resource at the same time.

The resource might be a mutex, allocator, queue, file, socket, cache line, or atomic counter.

High contention means threads spend time waiting instead of working.

Common fixes:

| Problem | Better design |
|---|---|
| One shared counter | Per-thread counters, then merge |
| One global queue | Work stealing or per-thread queues |
| One shared allocator | Arena per worker or fixed buffers |
| One large mutex | Smaller locks around independent state |
| Frequent tiny messages | Batch messages |

A useful question is:

```text
What are all threads fighting over?
```

Then remove or split that thing.
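One of the fixes above, an arena per worker, is cheap to sketch. This assumes `std.heap.ArenaAllocator` from the standard library; the buffer size is arbitrary:

```zig
const std = @import("std");

fn worker(base: std.mem.Allocator) !void {
    // each worker gets its own arena, so allocations never contend
    var arena = std.heap.ArenaAllocator.init(base);
    defer arena.deinit(); // frees everything this worker allocated, at once

    const alloc = arena.allocator();
    const buf = try alloc.alloc(u8, 4096);
    _ = buf; // ... use buf as this worker's private scratch space ...
}
```

Inside the worker, allocation is just bumping a pointer in memory no other thread touches. The shared base allocator is hit only when the arena grows.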

#### Be Careful with Atomics

Atomics can look cheaper than mutexes, but they can still be expensive under contention.

This can become a bottleneck:

```zig
_ = counter.fetchAdd(1, .seq_cst);
```

If every thread increments the same atomic millions of times, the CPU cores must constantly coordinate ownership of that memory location.

Better:

```text
each thread counts locally
one final merge happens at the end
```

Atomics are good for small shared facts. They are not magic performance tools.
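The same idea in code: keep the hot loop in a register, and touch the shared atomic once per thread. The iteration count is arbitrary:

```zig
const std = @import("std");

var counter = std.atomic.Value(u64).init(0);

fn worker(n: u64) void {
    var local: u64 = 0;

    var i: u64 = 0;
    while (i < n) : (i += 1) {
        local += 1; // hot loop touches only a local variable
    }

    // one contended operation per thread instead of n
    _ = counter.fetchAdd(local, .seq_cst);
}
```

A million increments become a million cheap local additions plus one `fetchAdd`.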

#### Cache Lines Matter

Modern CPUs move memory in cache lines, not individual bytes.

A cache line is commonly 64 bytes. If two threads write different variables that happen to sit on the same cache line, the CPU cores can still interfere with each other.

This is called false sharing.

Example:

```zig
const Counters = struct {
    a: u64,
    b: u64,
};
```

If thread 1 writes `a` and thread 2 writes `b`, they may still fight over the same cache line.

For very hot per-thread counters, you may need padding or alignment so each thread writes to separate cache lines.

The beginner rule is simpler:

Do not put heavily written per-thread values tightly next to each other unless you have measured and know it is safe.
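If measurement does show false sharing, one common fix is to force each hot value onto its own cache line. A sketch, assuming a 64-byte line size (common, but not guaranteed on every CPU):

```zig
// align(64) forces each value to start on its own 64-byte boundary,
// so neighboring counters cannot share a cache line
const PaddedCounter = struct {
    value: u64 align(64) = 0,
};

// one slot per thread; adjacent slots no longer interfere
var counters = [_]PaddedCounter{.{}} ** 4;
```

The cost is memory: each counter now occupies a full line instead of 8 bytes. That trade is worth it only for heavily written values.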

#### Use Batching

Batching means doing many operations together instead of one at a time.

Instead of pushing one job at a time:

```text
push job
signal
push job
signal
push job
signal
```

push a batch:

```text
lock
append many jobs
signal or broadcast
unlock
```

Batching reduces lock overhead, wakeups, allocator calls, and queue traffic.

The tradeoff is latency. A batch may make an individual job wait a little longer.

For high throughput, batching is often excellent.
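A batched push might look like this sketch. The fixed-size queue, `Job` type, and capacity are illustrative, and overflow handling is omitted:

```zig
const std = @import("std");

const Job = struct { id: u64 };

var mutex: std.Thread.Mutex = .{};
var cond: std.Thread.Condition = .{};
var queue: [1024]Job = undefined;
var queue_len: usize = 0;

// sketch only: assumes the batch always fits;
// real code would wait for space or split the batch
fn pushBatch(batch: []const Job) void {
    mutex.lock();
    defer mutex.unlock();

    @memcpy(queue[queue_len..][0..batch.len], batch);
    queue_len += batch.len;

    cond.broadcast(); // one wakeup for the whole batch
}
```

One lock acquisition and one broadcast cover many jobs, instead of one of each per job.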

#### Keep Ownership Clear

High performance code must still be readable.

A fast design with unclear ownership is dangerous.

Every object should have an owner:

```text
this buffer belongs to worker 1
this queue owns these jobs
this result array belongs to main until workers finish
this connection owns its read buffer
```

Ownership gives you two benefits.

It prevents races.

It also reduces synchronization.

If only one thread owns a buffer, that buffer needs no lock.

#### Avoid Unbounded Thread Creation

Creating a thread has a cost.

This is usually bad:

```text
for each request:
    create a new thread
```

A busy server could create thousands of threads. That can waste memory and overwhelm the scheduler.

Better:

```text
create a fixed worker pool
send jobs to workers
reuse the same threads
```

A worker pool keeps concurrency under control.

The number of workers should usually be related to the work type.

For CPU-heavy work, start near the number of CPU cores.

For blocking I/O work, more threads may help, but measure.
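Sizing the pool can start from the core count. A sketch, assuming `std.Thread.getCpuCount` from the standard library (the standard library also ships a reusable `std.Thread.Pool`):

```zig
const std = @import("std");

pub fn main() !void {
    // start near the core count for CPU-bound work;
    // fall back to a fixed guess if the count is unavailable
    const n = std.Thread.getCpuCount() catch 4;
    std.debug.print("spawning {} workers\n", .{n});

    // ... spawn n workers once and reuse them for all jobs ...
}
```

The fallback value of 4 is arbitrary; the point is that the pool size is a deliberate, bounded choice rather than one thread per request.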

#### Backpressure

Backpressure means the system slows producers down when consumers cannot keep up.

Without backpressure, queues can grow until memory runs out.

Bad design:

```text
producer creates jobs forever
queue grows forever
workers fall behind
memory usage grows
```

Better design:

```text
queue has a maximum size
producer waits when queue is full
workers drain the queue
```

A bounded queue protects the program under load.

High performance is not only about speed when things are normal. It is also about controlled behavior when things are overloaded.
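A bounded queue can be sketched with a mutex and two condition variables. This is a minimal ring buffer; real code would also need shutdown handling:

```zig
const std = @import("std");

fn BoundedQueue(comptime T: type, comptime capacity: usize) type {
    return struct {
        items: [capacity]T = undefined,
        head: usize = 0,
        count: usize = 0,
        mutex: std.Thread.Mutex = .{},
        not_full: std.Thread.Condition = .{},
        not_empty: std.Thread.Condition = .{},

        const Self = @This();

        pub fn push(self: *Self, item: T) void {
            self.mutex.lock();
            defer self.mutex.unlock();
            // backpressure: the producer blocks while the queue is full
            while (self.count == capacity) self.not_full.wait(&self.mutex);
            self.items[(self.head + self.count) % capacity] = item;
            self.count += 1;
            self.not_empty.signal();
        }

        pub fn pop(self: *Self) T {
            self.mutex.lock();
            defer self.mutex.unlock();
            while (self.count == 0) self.not_empty.wait(&self.mutex);
            const item = self.items[self.head];
            self.head = (self.head + 1) % capacity;
            self.count -= 1;
            self.not_full.signal();
            return item;
        }
    };
}
```

The `while` loops around `wait` matter: condition variables can wake spuriously, so the condition must be rechecked under the lock.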

#### Measure Before Optimizing

Concurrency bugs are hard. Performance guesses are often wrong.

Measure before making the design more complex.

Useful things to measure:

| Metric | Question |
|---|---|
| Throughput | How much work finishes per second? |
| Latency | How long does one job wait? |
| CPU usage | Are cores busy or idle? |
| Lock contention | Are threads waiting on locks? |
| Queue length | Are producers faster than consumers? |
| Allocation count | Is memory allocation dominating? |
| Cache misses | Is memory layout hurting performance? |

Do not add lock-free structures because they sound fast. Add them only when measurement shows the current design is the bottleneck.

#### Prefer Simple Correct Designs First

A good first concurrent design is often:

```text
main thread creates work
fixed workers process work
workers keep local results
main joins workers
main merges results
```

This design is boring, but strong.

It has limited sharing.

It has clear lifetimes.

It has predictable shutdown.

It is easy to test.

Only move to more advanced designs when this shape is not enough.

#### A Practical Pattern

For CPU-heavy batch work:

```text
split input into N chunks
start N workers
each worker processes one chunk
each worker writes one result
join workers
merge results
```

For I/O-heavy servers:

```text
event loop handles sockets
small handlers do minimal work
blocking or CPU-heavy work goes to workers
bounded queues provide backpressure
shutdown wakes all workers
```

For pipelines:

```text
stage 1 parses input
stage 2 transforms data
stage 3 writes output
bounded queues connect stages
each stage owns its local buffers
```

Each design keeps communication explicit.

#### The Main Rule

Concurrency is a structure, not a decoration.

You do not make a program fast by adding threads around random functions. You make it fast by dividing ownership, reducing communication, keeping hot loops local, bounding queues, and measuring the real bottlenecks.

A good concurrent Zig program is explicit about:

```text
who owns each piece of memory
which data is shared
which synchronization protects it
where work can wait
where work can run in parallel
how the program shuts down
```

That discipline gives you both speed and correctness.

