Chapter 0

The Parallel Mindset

Before diving into GPU hardware, we need to rewire our thinking. Sequential programming intuition will mislead you. This chapter builds the mental model for thinking in parallel.

What You'll Learn
  1. Explain why parallelism is fundamentally different from sequential programming
  2. Apply Amdahl's Law to estimate parallel speedup limits
  3. Identify embarrassingly parallel vs communication-bound problems
  4. Describe the mental model shift from "one fast thing" to "many slow things"
  5. Recognize SIMD patterns in CPU code as a bridge to GPU SIMT
01 - THE CHALLENGE

Why Parallelism is Hard

You've written sequential code for years. Your intuition says: "make the thing faster." But parallel programming requires a different question: "how do I divide this work?"

Sequential thinking leads to wrong parallel code. Consider summing an array:

# Sequential (correct, simple)
total = 0
for x in array:
    total += x  # Each step depends on the previous

# Naive parallel (WRONG - race condition!)
total = 0
parallel_for x in array:
    total += x  # Multiple threads read/write total

The naive parallel version has a race condition: multiple threads read total, add their value, and write back—overwriting each other's work.
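The standard fix is to remove the shared accumulator entirely: give each worker its own partial sum, then combine the partials in one final step. A minimal Python sketch of that idea (the worker count and chunking scheme here are illustrative choices, not prescribed by the chapter):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(array, num_workers=4):
    """Sum an array with per-worker partial sums, then one final combine."""
    if not array:
        return 0
    chunk = (len(array) + num_workers - 1) // num_workers
    # Each worker sums its own slice: no shared accumulator, no race.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(sum, (array[i:i + chunk]
                                  for i in range(0, len(array), chunk)))
    # Combining a handful of partials is cheap and sequential.
    return sum(partials)
```

This "partial results, then combine" shape is exactly the reduction pattern you will meet again on the GPU.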

The Core Difficulty

Parallel programming requires thinking about what can happen simultaneously. Your code must be correct regardless of execution order. This is fundamentally different from sequential programming where order is guaranteed.

Common parallel hazards include:

  • Race conditions: multiple threads access shared data, and at least one writes
  • Deadlock: threads wait on each other and none can proceed
  • Non-determinism: results vary from run to run depending on thread timing
  • Load imbalance: some threads finish early and sit idle while others work

The good news: GPU programming has patterns that avoid these hazards. The bad news: you need to learn to recognize which patterns apply to your problem.

A race condition occurs when:
  • Threads execute too quickly
  • Multiple threads access shared data with at least one write
  • The GPU runs out of memory

02 - AMDAHL'S LAW

The Speedup Limit

Amdahl's Law tells us the maximum speedup from parallelization. If part of your program is inherently sequential, it limits how fast the whole program can go—no matter how many processors you throw at it.

The Formula

Speedup = 1 / (S + P/N)

Where S = sequential fraction, P = parallel fraction (S + P = 1), N = number of processors

The devastating insight: as N → ∞, speedup approaches 1/S. If 10% of your code is sequential, maximum speedup is 10x. Period.
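To make the formula concrete, here is a one-line helper for plugging in numbers (the function name is just illustrative):

```python
def amdahl_speedup(s, n):
    """Theoretical speedup with sequential fraction s on n processors.

    Implements Speedup = 1 / (S + P/N) with P = 1 - S.
    """
    return 1.0 / (s + (1.0 - s) / n)
```

With S = 0.10 and N = 100 this gives about 9.17x, and pushing N toward infinity only approaches the 1/S ceiling of 10x.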

Worked example: with a sequential fraction S = 10% and N = 100 processors, the theoretical speedup is 1 / (0.10 + 0.90/100) ≈ 9.17x, against a maximum possible (1/S) of 10.00x.
What about Gustafson's Law?
Gustafson's Law offers a more optimistic view: as we add processors, we often scale the problem size too. Instead of fixed work divided among more processors, we do more work in the same time. For GPU workloads (training larger models, processing more data), Gustafson is often more relevant. But Amdahl reminds us: sequential bottlenecks always hurt.
If 5% of your code is sequential, the maximum speedup is:
  • 5x
  • 20x
  • 95x

03 - PARALLEL PATTERNS

Categories of Parallelism

Not all problems parallelize equally. Recognizing which category your problem falls into determines your implementation strategy.

■■■
Embarrassingly Parallel
Each element is independent. No communication needed between parallel units. This is the GPU sweet spot.
Examples: Element-wise ops, image processing, Monte Carlo, matrix-vector multiply
↓↓↓
Reduction
Combine many values into one (or few). Requires careful synchronization but well-understood patterns exist.
Examples: Sum, max, min, histogram, dot product, softmax denominator
▦▦▦
Stencil / Neighbor
Each output depends on nearby inputs. Predictable communication pattern, maps well to GPU shared memory.
Examples: Convolution, blur, edge detection, finite difference methods
❖❖❖
Irregular / Sparse
Unpredictable access patterns. Hardest to optimize—often memory-bound with poor cache utilization.
Examples: Graph traversal, sparse matrix ops, tree algorithms
Pattern Recognition Skill

When you see a problem, ask: "Is each output independent? Does it reduce? Does it access neighbors?" This determines which GPU optimization techniques apply.

Computing sigmoid(x) for every element in a tensor is:
  • Embarrassingly parallel
  • Reduction
  • Stencil

04 - SIMD TO SIMT

From CPU Vectors to GPU Warps

If you've used NumPy, you already understand parallel thinking at a basic level. a + b on arrays doesn't loop—it operates on all elements simultaneously. This is SIMD (Single Instruction, Multiple Data).

Modern CPUs have SIMD units like AVX-512 that process 8-16 elements at once. GPUs take this further with SIMT (Single Instruction, Multiple Threads): 32 threads execute in lockstep.

CPU SIMD (AVX-512): 16 lanes

16 single-precision floats processed per instruction

GPU SIMT (Warp): 32 threads

32 threads execute same instruction (+ thousands more warps)

The key difference: a GPU doesn't have 32 lanes—it has thousands of warps, each with 32 threads. A B200 can have ~200,000+ threads in flight.

NumPy as Training Wheels

# NumPy: implicit parallelism (CPU SIMD under the hood)
result = np.exp(x) + np.log(y)  # Operates on all elements

# GPU thinking: explicit parallelism
# "What does ONE thread do to ONE element?"
# Then run that 1 million times in parallel
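One way to practice this shift without a GPU is to write the per-element function explicitly and let a plain loop stand in for the hardware's thread grid (the kernel and launcher names here are illustrative, not a real API):

```python
def saxpy_kernel(thread_id, a, x, y, out):
    """What ONE thread does to ONE element: out[i] = a * x[i] + y[i]."""
    i = thread_id
    if i < len(x):              # bounds guard, just like a real GPU kernel
        out[i] = a * x[i] + y[i]

def launch(kernel, n_threads, *args):
    """Stand-in for the GPU: run the kernel once per thread id."""
    for tid in range(n_threads):    # the hardware would do these in parallel
        kernel(tid, *args)

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
out = [0.0] * 3
launch(saxpy_kernel, 3, 2.0, x, y, out)   # out is now [12.0, 24.0, 36.0]
```

Note that the kernel never mentions a loop: it describes one thread's work, and the launcher (or real hardware) supplies the parallelism.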

How many threads execute in lockstep in a GPU warp?
  • 8
  • 16
  • 32

05 - THE MENTAL SHIFT

Thinking in Parallel

The hardest part of GPU programming isn't the syntax—it's the mental model. You must stop thinking about the computation and start thinking about how to divide the computation.

The Thought Experiment

When approaching a problem, ask: "What if 10,000 copies of me worked on this?"

  • What would each copy need to know? (input data, index)
  • What would each copy produce? (one output element? a partial result?)
  • Would any copies need to talk to each other? (synchronization)
  • Would any copies fight over the same resource? (race conditions)

Sequential Thinking

  • "Process item 1, then item 2..."
  • "Loop through the array"
  • "Accumulate into a variable"
  • Focus on the algorithm

Parallel Thinking

  • "Each thread handles item[thread_id]"
  • "All elements processed simultaneously"
  • "Reduce partial results at the end"
  • Focus on data flow
The Data-Centric View

In GPU programming, you don't write "a loop that processes data." You write "what happens to one piece of data" and let the hardware run it everywhere. Think about the data, not the loop.
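The same contrast in runnable form: the loop-centric version accumulates in a fixed order, while the data-centric version describes one value's transformation and finishes with a reduction (function names are illustrative):

```python
# Loop-centric: order baked in, accumulate into a variable.
def sum_of_squares_loop(xs):
    total = 0.0
    for v in xs:
        total += v * v
    return total

# Data-centric: say what happens to ONE value, then reduce the results.
def square(v):                   # per-element work: independent, order-free
    return v * v

def sum_of_squares_parallel(xs):
    partials = map(square, xs)   # every element could be done at once
    return sum(partials)         # well-understood reduction at the end
```

Both compute the same answer, but only the second decomposition survives contact with 10,000 parallel workers.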

This mental shift takes practice. The good news: once it clicks, you'll see parallelization opportunities everywhere—even in code you thought was inherently sequential.

When writing a GPU kernel, you should think about:
  • The order elements are processed
  • What ONE thread does to ONE element
  • How to minimize the number of threads

REFERENCES

Further Reading

Key Concepts

  1. Amdahl's Law
    Wikipedia: Amdahl's Law — Original 1967 formulation and implications
  2. Gustafson's Law
    Wikipedia: Gustafson's Law — The scaled speedup alternative
  3. SIMT Architecture
    CUDA Programming Guide: SIMT — NVIDIA's official explanation of GPU execution model
  4. Race Conditions
    CUDA Programming Guide: Synchronization — How to avoid race conditions in GPU code
All material licensed under CC BY-NC-SA 4.0