A Hardware-First Path to Understanding GPUs, TPUs, and LLMs
How matrix multiplication, kernels, and accelerator design explain modern machine learning systems
One of the most common mistakes people make when learning about GPUs, TPUs, and large language models is starting from models instead of machines.
You read about attention, transformers, or scaling laws — but without a solid mental model of the hardware underneath, everything feels like magic. Performance feels accidental. Bottlenecks feel mysterious. And “scaling” sounds like a slogan rather than an engineering constraint.
This post lays out a hardware-first learning path: how to understand GPUs and TPUs as machines, how matrix multiplication becomes the dominant abstraction, and how LLMs emerge naturally as a stress test for modern accelerators.
Start With Why Accelerators Exist
Before touching CUDA, kernels, or models, it helps to zoom out.
The GPU and TPU architecture article is an ideal starting point. It explains why accelerators exist at all: general-purpose CPUs optimize for latency and control flow, while GPUs and TPUs optimize for throughput, regularity, and data reuse.
This framing matters. GPUs aren’t “faster CPUs,” and TPUs aren’t “better GPUs.” They are machines designed around a very specific assumption: most of the work is dense linear algebra, and moving data is more expensive than computing on it.
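To make “moving data is more expensive than computing on it” concrete, here is a back-of-envelope sketch in host-side C++. The peak numbers are illustrative assumptions, not the spec of any particular chip; the point is the comparison between a matmul and an elementwise op.

```cpp
// Back-of-envelope: why accelerators bet on dense linear algebra.
// Peak numbers below are illustrative assumptions, not a specific chip's spec.
#include <cstdio>

int main() {
    const double peak_flops = 100e12;   // assumed: 100 TFLOP/s of BF16 compute
    const double peak_bw    = 2e12;     // assumed: 2 TB/s of HBM bandwidth

    // Square matmul C = A * B with N x N matrices (FP16, 2 bytes/element).
    const double N     = 4096;
    const double flops = 2.0 * N * N * N;        // one multiply + one add per term
    const double bytes = 3.0 * N * N * 2.0;      // read A, read B, write C (ideal reuse)

    // Arithmetic intensity: FLOPs performed per byte moved from memory.
    const double ai_matmul = flops / bytes;
    // Machine balance: FLOPs the chip can do per byte it can fetch.
    const double balance   = peak_flops / peak_bw;

    // An elementwise op (e.g. y = x + 1 on FP32) does 1 FLOP per 8 bytes moved.
    const double ai_elementwise = 1.0 / 8.0;

    printf("matmul arithmetic intensity:      %.0f FLOPs/byte\n", ai_matmul);
    printf("elementwise arithmetic intensity: %.3f FLOPs/byte\n", ai_elementwise);
    printf("machine balance:                  %.0f FLOPs/byte\n", balance);
    return 0;
}
```

Matmul does hundreds of FLOPs per byte it touches; elementwise ops do a fraction of one. Accelerators are built for the first kind of workload.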
Once you internalize that, everything else follows.
Learn the Language of Hardware
Next comes the unavoidable foundation: computer architecture.
Chapter 7 of Patterson & Hennessy (6th edition) gives you the vocabulary you’ll keep using:
memory hierarchies
SIMD vs SIMT
throughput vs latency
parallelism as a first-class design goal
This isn’t about memorizing diagrams. It’s about learning how hardware designers think. GPUs and TPUs stop feeling exotic once you see them as logical extensions of these ideas.
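If you want a small, concrete anchor for the SIMT idea, here is a sketch (not a tuned kernel): the CPU version walks the array with a single control flow, while the CUDA version assigns one element per thread and relies on the hardware swapping warps to hide memory latency.

```cpp
#include <cuda_runtime.h>

// CPU, latency-oriented: one control flow walks every element in order.
void scale_cpu(const float* x, float* y, float a, int n) {
    for (int i = 0; i < n; ++i) y[i] = a * x[i];
}

// GPU, throughput-oriented (SIMT): each thread owns one element, and
// warps of 32 threads execute the same instruction in lockstep.
__global__ void scale_gpu(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}

int main() {
    const int n = 1 << 20;
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    // Enough blocks of 256 threads to cover all n elements.
    scale_gpu<<<(n + 255) / 256, 256>>>(d_x, d_y, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```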
Understand GPUs as Machines, Not APIs
With that foundation, the GPU architecture deep dive lands much better.
You start to see GPUs as collections of streaming multiprocessors, warps, registers, shared memory, and very high-bandwidth global memory. Concepts like occupancy, latency hiding, and memory coalescing stop being jargon and start being intuitive consequences of the design.
At this stage, don’t worry about ML at all. Think like a hardware engineer asking: how do I keep this machine busy?
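Memory coalescing is a good example: it comes down to whether the 32 threads of a warp touch adjacent addresses on the same load. A sketch of the two access patterns (kernels only, host setup omitted; the row-major layout is an assumption):

```cpp
// Two ways for a warp to read a row-major N x N matrix.

// Coalesced: consecutive threads read consecutive addresses, so the
// hardware can serve the whole warp with a few wide memory transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int row = blockIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < n) out[row * n + col] = in[row * n + col];
}

// Strided: consecutive threads read addresses n floats apart (walking a
// column of a row-major matrix), so each load becomes its own transaction
// and effective bandwidth collapses.
__global__ void copy_strided(const float* in, float* out, int n) {
    int col = blockIdx.y;
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) out[row * n + col] = in[row * n + col];
}
```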
Matrix Multiplication Is the Rosetta Stone
This is where everything clicks.
Inside NVIDIA GPUs: Anatomy of high performance matmul kernels is the right next step. Matmul is not just an operation — it is the canonical accelerator workload. It forces you to think about:
tiling
data reuse
arithmetic intensity
registers vs shared memory vs HBM
If you understand how high-performance matmul kernels work, you understand why GPUs look the way they do.
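A minimal tiled matmul sketch makes the point. This is nowhere near what cuBLAS or CUTLASS do, and it assumes square matrices with N divisible by the tile size, but it shows the core trick: stage tiles in shared memory so every value loaded from HBM gets reused many times.

```cpp
#define TILE 32

// C = A * B for row-major N x N matrices, N assumed divisible by TILE.
// Each block computes one TILE x TILE tile of C; each A/B element loaded
// into shared memory is reused TILE times instead of re-read from HBM.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < N; k0 += TILE) {
        // Cooperative, coalesced loads of one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] = A[row * N + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();

        // Inner product over the tile, served entirely from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

// Host-side launch: dim3 grid(N / TILE, N / TILE), block(TILE, TILE);
// matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, N);
```

Each element of A and B comes out of HBM once per tile but gets used TILE times, which is exactly the data-reuse story the hardware is built around.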
This is also where a powerful idea emerges:
Matmul is the grammar of GPU programming.
From Matmul to Kernels
Once you understand matmul, the Hugging Face Ultra-Scale Playbook’s “A primer on GPUs” section on improving performance with kernels makes sense.
You realize that most high-performance kernels are just variations on the same theme:
matmul + reductions
matmul + elementwise ops
matmul + memory movement
This is why the statement is true:
Understanding matmul kernels gives you the toolkit to design nearly any other high-performance GPU kernel.
At this point, “kernel optimization” stops being mystical. It becomes a game of managing data movement and reuse.
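As one example, “matmul + elementwise ops” usually just means fusing the epilogue into the kernel so the output never makes an extra round trip through HBM. A sketch (kernel only, deliberately naive about the matmul itself; the bias and ReLU epilogue is the point):

```cpp
// Unfused: a matmul kernel writes C to HBM, then a second kernel reads C,
// adds a bias, applies ReLU, and writes it back -- two extra full passes
// over C's memory. Fused: the epilogue happens in registers, almost free.
__global__ void matmul_bias_relu(const float* A, const float* B,
                                 const float* bias, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < N; ++k)
        acc += A[row * N + k] * B[k * N + col];

    // Fused epilogue: bias add + ReLU while the value is still in a register.
    acc += bias[col];
    C[row * N + col] = acc > 0.0f ? acc : 0.0f;
}
```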
Here, resources like the JAX Scaling Book’s chapter on GPUs are especially valuable. They tie together hardware primitives, kernels, and performance models into a coherent framework for reasoning about GPU efficiency, serving as a bridge between architectural understanding and large-scale system design.
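A simplified version of that performance model fits in a few lines: a kernel can finish no faster than its compute time or its memory time, whichever is larger. A sketch with assumed peak numbers:

```cpp
#include <algorithm>
#include <cstdio>

// Roofline-style estimate: a kernel is limited by whichever is slower,
// doing the math or moving the bytes. Peak numbers are assumptions.
double kernel_time_s(double flops, double bytes,
                     double peak_flops = 100e12,   // assumed FLOP/s
                     double peak_bw    = 2e12) {   // assumed bytes/s
    double t_compute = flops / peak_flops;
    double t_memory  = bytes / peak_bw;
    return std::max(t_compute, t_memory);
}

int main() {
    const double N = 4096;
    // 4096 x 4096 x 4096 GEMM in FP16: compute-bound.
    printf("matmul:      %.3f ms\n", 1e3 * kernel_time_s(2 * N * N * N, 3 * N * N * 2));
    // Elementwise op over the same-sized FP32 output: memory-bound.
    printf("elementwise: %.3f ms\n", 1e3 * kernel_time_s(N * N, 2 * N * N * 4));
    return 0;
}
```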
TPUs: When Hardware Commits Fully
Only now is it worth reading the classic ACM paper, A Domain-Specific Supercomputer for Training Deep Neural Networks.
TPUs are what happens when you take the matmul insight seriously and remove almost everything else. Systolic arrays, explicit dataflow, and compiler-driven execution (XLA) are not quirks — they are honest acknowledgments of the workload.
This is accelerator design taken to its logical extreme.
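You can get a feel for the dataflow by simulating a toy output-stationary systolic array in plain C++. This is only a sketch of the idea, not how a real MXU is built: A values flow right, B values flow down, and each processing element performs one multiply-accumulate per cycle.

```cpp
#include <cstdio>
#include <vector>

// Toy output-stationary systolic array computing C = A * B for N x N matrices.
// Each PE(i, j) holds an accumulator; A streams in from the left (skewed by
// row), B streams in from the top (skewed by column), one MAC per PE per cycle.
int main() {
    const int N = 4;
    std::vector<std::vector<float>> A(N, std::vector<float>(N)),
                                    B(N, std::vector<float>(N)),
                                    acc(N, std::vector<float>(N, 0.0f)),
                                    a_reg(N, std::vector<float>(N, 0.0f)),
                                    b_reg(N, std::vector<float>(N, 0.0f));
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) { A[i][j] = i + 1; B[i][j] = j + 1; }

    for (int t = 0; t <= 3 * (N - 1); ++t) {              // enough cycles to drain
        auto a_next = a_reg, b_next = b_reg;
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) {
                int k = t - i;                             // skewed A feed for row i
                float a_in = (j == 0) ? ((k >= 0 && k < N) ? A[i][k] : 0.0f)
                                      : a_reg[i][j - 1];   // otherwise take from left PE
                k = t - j;                                 // skewed B feed for column j
                float b_in = (i == 0) ? ((k >= 0 && k < N) ? B[k][j] : 0.0f)
                                      : b_reg[i - 1][j];   // otherwise take from top PE
                acc[i][j] += a_in * b_in;                  // one MAC per PE per cycle
                a_next[i][j] = a_in;                       // value moves right next cycle
                b_next[i][j] = b_in;                       // value moves down next cycle
            }
        }
        a_reg = a_next; b_reg = b_next;
    }

    for (int i = 0; i < N; ++i, printf("\n"))
        for (int j = 0; j < N; ++j) printf("%6.0f ", acc[i][j]);
    return 0;
}
```

No caches, no instruction fetch per operand, no scheduling: just data marching through a grid of multiply-accumulate units.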
Scaling Is a Hardware Problem
With kernels and accelerators understood, articles like How to Scale your Model and the distributed sections of the UltraScale Playbook land differently.
Scaling is no longer “add more GPUs.” It is:
bandwidth
interconnect topology
synchronization
kernel efficiency at scale
This is where hardware, kernels, and systems fully merge.
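One back-of-envelope shows why. In data-parallel training every step ends with an all-reduce over the gradients, and a ring all-reduce pushes roughly 2(n-1)/n times the gradient bytes through each link. The numbers below are assumptions for illustration:

```cpp
#include <cstdio>

int main() {
    // Assumed setup, for illustration only.
    const double params      = 7e9;      // 7B-parameter model
    const double bytes_per_g = 2.0;      // BF16 gradients
    const double n_gpus      = 8.0;
    const double link_bw     = 400e9;    // assumed 400 GB/s effective per-GPU bandwidth

    // Ring all-reduce: each GPU sends/receives ~2 * (n - 1) / n of the buffer.
    const double grad_bytes = params * bytes_per_g;
    const double comm_bytes = 2.0 * (n_gpus - 1.0) / n_gpus * grad_bytes;
    const double comm_time  = comm_bytes / link_bw;

    printf("gradient buffer: %.1f GB\n", grad_bytes / 1e9);
    printf("all-reduce time: %.1f ms per step (if not overlapped with compute)\n",
           comm_time * 1e3);
    // If this is comparable to the step's compute time, interconnect topology
    // and compute/communication overlap decide your scaling efficiency.
    return 0;
}
```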
LLMs as a Consequence, Not a Mystery
Finally, Large Language Models - The Hardware Connection becomes the conclusion, not the introduction.
Attention stresses memory bandwidth. Inference is sequential. Batching matters. TPU vs GPU tradeoffs become obvious.
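A quick sanity check makes the bandwidth point concrete: at batch size 1, every decoded token has to stream the full set of weights (plus the KV cache) through the chip, so tokens per second is capped by bandwidth, not FLOPs. All numbers below are illustrative assumptions:

```cpp
#include <cstdio>

int main() {
    // Illustrative assumptions, not measurements.
    const double params       = 7e9;     // 7B-parameter model
    const double bytes_per_w  = 2.0;     // BF16 weights
    const double peak_bw      = 2e12;    // assumed 2 TB/s of HBM bandwidth

    // Decode with batch size 1: every new token reads every weight once.
    const double bytes_per_token  = params * bytes_per_w;
    const double max_tokens_per_s = peak_bw / bytes_per_token;

    printf("weight bytes per token: %.1f GB\n", bytes_per_token / 1e9);
    printf("bandwidth-bound decode: ~%.0f tokens/s at batch size 1\n",
           max_tokens_per_s);
    // Batching amortizes the same weight traffic across many sequences,
    // which is why serving throughput depends so heavily on batch size.
    return 0;
}
```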
LLMs stop being magical and start looking like what they really are:
extremely large, extremely regular linear algebra workloads that push hardware to its limits.
Closing Thought
If there’s one theme running through this path, it’s this:
Scale is not just about models.
Scale is about how algorithms align with machines.
Understanding that alignment — from architecture to matmul to kernels to systems — is the real education.
Once these core components are clear — hardware primitives, memory hierarchies, matrix multiplication, kernels, and scaling constraints — you’re no longer dependent on a single learning path. At this stage, articles like the SemiAnalysis deep dive on TPUs become especially valuable: not as introductions, but as reality checks. They connect architectural choices to utilization, cost, and performance at scale. With these mental models in place, most courses, papers, and deep dives become accessible, and you can go arbitrarily deep based on curiosity rather than confusion.
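One of the simplest reality checks those deep dives use is model FLOPs utilization (MFU): the fraction of the hardware’s peak that a training run actually spends on the model’s useful math. A sketch with assumed numbers (the 6-FLOPs-per-parameter-per-token rule is the usual approximation, ignoring attention):

```cpp
#include <cstdio>

int main() {
    // Illustrative assumptions, not benchmark results.
    const double params       = 7e9;     // 7B-parameter model
    const double tokens_per_s = 5e4;     // assumed measured training throughput
    const double n_chips      = 64.0;
    const double peak_flops   = 100e12;  // assumed per-chip peak FLOP/s

    // ~6 FLOPs per parameter per token for training (forward + backward).
    const double achieved = 6.0 * params * tokens_per_s;
    const double peak     = n_chips * peak_flops;

    printf("MFU: %.1f%%\n", 100.0 * achieved / peak);
    return 0;
}
```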
Other References
In-Datacenter Performance Analysis of a Tensor Processing Unit


