Chapter 0

The Parallel Mindset

Before diving into GPU hardware, we need to rewire our thinking. Sequential programming intuition will mislead you. This chapter builds the mental model for thinking in parallel.

What You'll Learn
  1. Explain why parallelism is fundamentally different from sequential programming
  2. Apply Amdahl's Law to estimate parallel speedup limits
  3. Identify embarrassingly parallel vs communication-bound problems
  4. Describe the mental model shift from "one fast thing" to "many slow things"
  5. Recognize SIMD patterns in CPU code as a bridge to GPU SIMT
01 - THE CHALLENGE

Why Parallelism is Hard

You've written sequential code for years. Your intuition says: "make the thing faster." But parallel programming requires a different question: "how do I divide this work?"

Sequential thinking leads to wrong parallel code. Consider summing an array:

# Sequential (correct, simple)
total = 0
for x in array:
    total += x  # Each step depends on the previous

# Naive parallel (WRONG - race condition!)
total = 0
parallel_for x in array:
    total += x  # Multiple threads read/write total

The naive parallel version has a race condition: multiple threads read total, add their value, and write back—overwriting each other's work.
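The standard fix is to remove the shared accumulator entirely: give each worker its own partial sum, then combine the partials in one final step. A minimal Python sketch of that idea (the worker count and chunking scheme here are illustrative choices, not prescribed by the chapter):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(array, num_workers=4):
    """Sum an array with per-worker partial sums, then one final combine."""
    if not array:
        return 0
    chunk = (len(array) + num_workers - 1) // num_workers
    # Each worker sums its own slice: no shared accumulator, no race.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(sum, (array[i:i + chunk]
                                  for i in range(0, len(array), chunk)))
    # Combining a handful of partials is cheap and sequential.
    return sum(partials)
```

This "partial results, then combine" shape is exactly the reduction pattern you will meet again on the GPU.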

The Core Difficulty

Parallel programming requires thinking about what can happen simultaneously. Your code must be correct regardless of execution order. This is fundamentally different from sequential programming where order is guaranteed.

Common parallel hazards include:

  • Race conditions: multiple threads access shared data, and at least one writes
  • Deadlock: threads wait on each other and none can proceed
  • Non-determinism: results vary from run to run depending on thread timing
  • Load imbalance: some threads finish early and sit idle while others work

The good news: GPU programming has patterns that avoid these hazards. The bad news: you need to learn to recognize which patterns apply to your problem.

A race condition occurs when:
  • Threads execute too quickly
  • Multiple threads access shared data with at least one write
  • The GPU runs out of memory

02 - AMDAHL'S LAW

The Speedup Limit

Amdahl's Law tells us the maximum speedup from parallelization. If part of your program is inherently sequential, it limits how fast the whole program can go—no matter how many processors you throw at it.

The Formula

Speedup = 1 / (S + P/N)

Where S = sequential fraction, P = parallel fraction (S + P = 1), N = number of processors

The devastating insight: as N → ∞, speedup approaches 1/S. If 10% of your code is sequential, maximum speedup is 10x. Period.
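To make the formula concrete, here is a one-line helper for plugging in numbers (the function name is just illustrative):

```python
def amdahl_speedup(s, n):
    """Theoretical speedup with sequential fraction s on n processors.

    Implements Speedup = 1 / (S + P/N) with P = 1 - S.
    """
    return 1.0 / (s + (1.0 - s) / n)
```

With S = 0.10 and N = 100 this gives about 9.17x, and pushing N toward infinity only approaches the 1/S ceiling of 10x.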

Worked example: with a sequential fraction S = 10% and N = 100 processors, the theoretical speedup is 1 / (0.10 + 0.90/100) ≈ 9.17x, against a maximum possible (1/S) of 10.00x.
What about Gustafson's Law?
Gustafson's Law offers a more optimistic view: as we add processors, we often scale the problem size too. Instead of fixed work divided among more processors, we do more work in the same time. For GPU workloads (training larger models, processing more data), Gustafson is often more relevant. But Amdahl reminds us: sequential bottlenecks always hurt.
If 5% of your code is sequential, the maximum speedup is:
  • 5x
  • 20x
  • 95x

03 - PARALLEL PATTERNS

Categories of Parallelism

Not all problems parallelize equally. Recognizing which category your problem falls into determines your implementation strategy.

■■■
Embarrassingly Parallel
Each element is independent. No communication needed between parallel units. This is the GPU sweet spot.
Examples: Element-wise ops, image processing, Monte Carlo, matrix-vector multiply
↓↓↓
Reduction
Combine many values into one (or few). Requires careful synchronization but well-understood patterns exist.
Examples: Sum, max, min, histogram, dot product, softmax denominator
▦▦▦
Stencil / Neighbor
Each output depends on nearby inputs. Predictable communication pattern, maps well to GPU shared memory.
Examples: Convolution, blur, edge detection, finite difference methods
❖❖❖
Irregular / Sparse
Unpredictable access patterns. Hardest to optimize—often memory-bound with poor cache utilization.
Examples: Graph traversal, sparse matrix ops, tree algorithms
Pattern Recognition Skill

When you see a problem, ask: "Is each output independent? Does it reduce? Does it access neighbors?" This determines which GPU optimization techniques apply.

Computing sigmoid(x) for every element in a tensor is:
  • Embarrassingly parallel
  • Reduction
  • Stencil

04 - SIMD TO SIMT

From CPU Vectors to GPU Warps

If you've used NumPy, you already understand parallel thinking at a basic level. a + b on arrays doesn't loop—it operates on all elements simultaneously. This is SIMD (Single Instruction, Multiple Data).

Modern CPUs have SIMD units like AVX-512 that process 8-16 elements at once. GPUs take this further with SIMT (Single Instruction, Multiple Threads): 32 threads execute in lockstep.

CPU SIMD (AVX-512): 16 lanes

16 single-precision floats processed per instruction

GPU SIMT (Warp): 32 threads

32 threads execute same instruction (+ thousands more warps)

The key difference: a GPU doesn't have 32 lanes—it has thousands of warps, each with 32 threads. A B200 can have ~200,000+ threads in flight.

NumPy as Training Wheels

# NumPy: implicit parallelism (CPU SIMD under the hood)
result = np.exp(x) + np.log(y)  # Operates on all elements

# GPU thinking: explicit parallelism
# "What does ONE thread do to ONE element?"
# Then run that 1 million times in parallel
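One way to practice this shift without a GPU is to write the per-element function explicitly and let a plain loop stand in for the hardware's thread grid (the kernel and launcher names here are illustrative, not a real API):

```python
def saxpy_kernel(thread_id, a, x, y, out):
    """What ONE thread does to ONE element: out[i] = a * x[i] + y[i]."""
    i = thread_id
    if i < len(x):              # bounds guard, just like a real GPU kernel
        out[i] = a * x[i] + y[i]

def launch(kernel, n_threads, *args):
    """Stand-in for the GPU: run the kernel once per thread id."""
    for tid in range(n_threads):    # the hardware would do these in parallel
        kernel(tid, *args)

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
out = [0.0] * 3
launch(saxpy_kernel, 3, 2.0, x, y, out)   # out is now [12.0, 24.0, 36.0]
```

Note that the kernel never mentions a loop: it describes one thread's work, and the launcher (or real hardware) supplies the parallelism.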

How many threads execute in lockstep in a GPU warp?
  • 8
  • 16
  • 32

05 - THE MENTAL SHIFT

Thinking in Parallel

The hardest part of GPU programming isn't the syntax—it's the mental model. You must stop thinking about the computation and start thinking about how to divide the computation.

The Thought Experiment

When approaching a problem, ask: "What if 10,000 copies of me worked on this?"

  • What would each copy need to know? (input data, index)
  • What would each copy produce? (one output element? a partial result?)
  • Would any copies need to talk to each other? (synchronization)
  • Would any copies fight over the same resource? (race conditions)

Sequential Thinking

  • "Process item 1, then item 2..."
  • "Loop through the array"
  • "Accumulate into a variable"
  • Focus on the algorithm

Parallel Thinking

  • "Each thread handles item[thread_id]"
  • "All elements processed simultaneously"
  • "Reduce partial results at the end"
  • Focus on data flow
The Data-Centric View

In GPU programming, you don't write "a loop that processes data." You write "what happens to one piece of data" and let the hardware run it everywhere. Think about the data, not the loop.
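The same contrast in runnable form: the loop-centric version accumulates in a fixed order, while the data-centric version describes one value's transformation and finishes with a reduction (function names are illustrative):

```python
# Loop-centric: order baked in, accumulate into a variable.
def sum_of_squares_loop(xs):
    total = 0.0
    for v in xs:
        total += v * v
    return total

# Data-centric: say what happens to ONE value, then reduce the results.
def square(v):                   # per-element work: independent, order-free
    return v * v

def sum_of_squares_parallel(xs):
    partials = map(square, xs)   # every element could be done at once
    return sum(partials)         # well-understood reduction at the end
```

Both compute the same answer, but only the second decomposition survives contact with 10,000 parallel workers.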

This mental shift takes practice. The good news: once it clicks, you'll see parallelization opportunities everywhere—even in code you thought was inherently sequential.

When writing a GPU kernel, you should think about:
  • The order elements are processed
  • What ONE thread does to ONE element
  • How to minimize the number of threads

REFERENCES

Further Reading

Key Concepts

  1. Amdahl's Law
    Wikipedia: Amdahl's Law — Original 1967 formulation and implications
  2. Gustafson's Law
    Wikipedia: Gustafson's Law — The scaled speedup alternative
  3. SIMT Architecture
    CUDA Programming Guide: SIMT — NVIDIA's official explanation of GPU execution model
  4. Race Conditions
    CUDA Programming Guide: Synchronization — How to avoid race conditions in GPU code
All material licensed under CC BY-NC-SA 4.0