A Hardware-First Path to Understanding GPUs, TPUs, and LLMs
How matrix multiplication, kernels, and accelerator design explain modern machine learning systems
One of the most common mistakes people make when learning about GPUs, TPUs, and large language models is starting from models instead of machines.
You read about attention, transformers, or scaling laws — but without a solid mental model of the hardware underneath, everything feels like magic. Performance feels accidental. Bottlenecks feel mysterious. And “scaling” sounds like a slogan rather than an engineering constraint.
This post lays out a hardware-first learning path: how to understand GPUs and TPUs as machines, how matrix multiplication becomes the dominant abstraction, and how LLMs emerge naturally as a stress test for modern accelerators.
Start With Why Accelerators Exist
Before touching CUDA, kernels, or models, it helps to zoom out.
The GPU and TPU architecture article is an ideal starting point. It explains why accelerators exist at all: general-purpose CPUs optimize for latency and control flow, while GPUs and TPUs optimize for throughput, regularity, and data reuse.
This framing matters. GPUs aren’t “faster CPUs,” and TPUs aren’t “better GPUs.” They are machines designed around a very specific assumption: most of the work is dense linear algebra, and moving data is more expensive than computing on it.
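To make “moving data is more expensive than computing on it” concrete, here is a back-of-envelope sketch in host-side C++. The peak numbers are illustrative assumptions, not the spec of any particular chip; the point is the comparison between a matmul and an elementwise op.

```cpp
// Back-of-envelope: why accelerators bet on dense linear algebra.
// Peak numbers below are illustrative assumptions, not a specific chip's spec.
#include <cstdio>

int main() {
    const double peak_flops = 100e12;   // assumed: 100 TFLOP/s of BF16 compute
    const double peak_bw    = 2e12;     // assumed: 2 TB/s of HBM bandwidth

    // Square matmul C = A * B with N x N matrices (FP16, 2 bytes/element).
    const double N     = 4096;
    const double flops = 2.0 * N * N * N;        // one multiply + one add per term
    const double bytes = 3.0 * N * N * 2.0;      // read A, read B, write C (ideal reuse)

    // Arithmetic intensity: FLOPs performed per byte moved from memory.
    const double ai_matmul = flops / bytes;
    // Machine balance: FLOPs the chip can do per byte it can fetch.
    const double balance   = peak_flops / peak_bw;

    // An elementwise op (e.g. y = x + 1 on FP32) does 1 FLOP per 8 bytes moved.
    const double ai_elementwise = 1.0 / 8.0;

    printf("matmul arithmetic intensity:      %.0f FLOPs/byte\n", ai_matmul);
    printf("elementwise arithmetic intensity: %.3f FLOPs/byte\n", ai_elementwise);
    printf("machine balance:                  %.0f FLOPs/byte\n", balance);
    return 0;
}
```

Matmul does hundreds of FLOPs per byte it touches; elementwise ops do a fraction of one. Accelerators are built for the first kind of workload.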
Once you internalize that, everything else follows.
Learn the Language of Hardware
Next comes the unavoidable foundation: computer architecture.
Chapter 7 of Patterson & Hennessy (6th edition) gives you the vocabulary you’ll keep using:
memory hierarchies
SIMD vs SIMT
throughput vs latency
parallelism as a first-class design goal
This isn’t about memorizing diagrams. It’s about learning how hardware designers think. GPUs and TPUs stop feeling exotic once you see them as logical extensions of these ideas.
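If you want a small, concrete anchor for the SIMT idea, here is a sketch (not a tuned kernel): the CPU version walks the array with a single control flow, while the CUDA version assigns one element per thread and relies on the hardware swapping warps to hide memory latency.

```cpp
#include <cuda_runtime.h>

// CPU, latency-oriented: one control flow walks every element in order.
void scale_cpu(const float* x, float* y, float a, int n) {
    for (int i = 0; i < n; ++i) y[i] = a * x[i];
}

// GPU, throughput-oriented (SIMT): each thread owns one element, and
// warps of 32 threads execute the same instruction in lockstep.
__global__ void scale_gpu(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i];
}

int main() {
    const int n = 1 << 20;
    float *d_x, *d_y;
    cudaMalloc(&d_x, n * sizeof(float));
    cudaMalloc(&d_y, n * sizeof(float));
    // Enough blocks of 256 threads to cover all n elements.
    scale_gpu<<<(n + 255) / 256, 256>>>(d_x, d_y, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}
```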
Understand GPUs as Machines, Not APIs
With that foundation, the GPU architecture deep dive lands much better.
You start to see GPUs as collections of streaming multiprocessors, warps, registers, shared memory, and very high-bandwidth global memory. Concepts like occupancy, latency hiding, and memory coalescing stop being jargon and start being intuitive consequences of the design.
At this stage, don’t worry about ML at all. Think like a hardware engineer asking: how do I keep this machine busy?
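Memory coalescing is a good example: it comes down to whether the 32 threads of a warp touch adjacent addresses on the same load. A sketch of the two access patterns (kernels only, host setup omitted; the row-major layout is an assumption):

```cpp
// Two ways for a warp to read a row-major N x N matrix.

// Coalesced: consecutive threads read consecutive addresses, so the
// hardware can serve the whole warp with a few wide memory transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int row = blockIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < n) out[row * n + col] = in[row * n + col];
}

// Strided: consecutive threads read addresses n floats apart (walking a
// column of a row-major matrix), so each load becomes its own transaction
// and effective bandwidth collapses.
__global__ void copy_strided(const float* in, float* out, int n) {
    int col = blockIdx.y;
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) out[row * n + col] = in[row * n + col];
}
```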
Matrix Multiplication Is the Rosetta Stone
This is where everything clicks.
Inside NVIDIA GPUs: Anatomy of high performance matmul kernels is the right next step. Matmul is not just an operation — it is the canonical accelerator workload. It forces you to think about:
tiling
data reuse
arithmetic intensity
registers vs shared memory vs HBM
If you understand how high-performance matmul kernels work, you understand why GPUs look the way they do.
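A minimal tiled matmul sketch makes the point. This is nowhere near what cuBLAS or CUTLASS do, and it assumes square matrices with N divisible by the tile size, but it shows the core trick: stage tiles in shared memory so every value loaded from HBM gets reused many times.

```cpp
#define TILE 32

// C = A * B for row-major N x N matrices, N assumed divisible by TILE.
// Each block computes one TILE x TILE tile of C; each A/B element loaded
// into shared memory is reused TILE times instead of re-read from HBM.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int k0 = 0; k0 < N; k0 += TILE) {
        // Cooperative, coalesced loads of one tile of A and one tile of B.
        As[threadIdx.y][threadIdx.x] = A[row * N + (k0 + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(k0 + threadIdx.y) * N + col];
        __syncthreads();

        // Inner product over the tile, served entirely from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}

// Host-side launch: dim3 grid(N / TILE, N / TILE), block(TILE, TILE);
// matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, N);
```

Each element of A and B comes out of HBM once per tile but gets used TILE times, which is exactly the data-reuse story the hardware is built around.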
This is also where a powerful idea emerges:
Matmul is the grammar of GPU programming.
From Matmul to Kernels
Once you understand matmul, the Hugging Face Ultra-Scale Playbook’s “A primer on GPUs” section on improving performance with kernels makes sense.
You realize that most high-performance kernels are just variations on the same theme:
matmul + reductions
matmul + elementwise ops
matmul + memory movement
This is why the statement is true:
Understanding matmul kernels gives you the toolkit to design nearly any other high-performance GPU kernel.
At this point, “kernel optimization” stops being mystical. It becomes a game of managing data movement and reuse.
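As one example, “matmul + elementwise ops” usually just means fusing the epilogue into the kernel so the output never makes an extra round trip through HBM. A sketch (kernel only, deliberately naive about the matmul itself; the bias and ReLU epilogue is the point):

```cpp
// Unfused: a matmul kernel writes C to HBM, then a second kernel reads C,
// adds a bias, applies ReLU, and writes it back -- two extra full passes
// over C's memory. Fused: the epilogue happens in registers, almost free.
__global__ void matmul_bias_relu(const float* A, const float* B,
                                 const float* bias, float* C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= N || col >= N) return;

    float acc = 0.0f;
    for (int k = 0; k < N; ++k)
        acc += A[row * N + k] * B[k * N + col];

    // Fused epilogue: bias add + ReLU while the value is still in a register.
    acc += bias[col];
    C[row * N + col] = acc > 0.0f ? acc : 0.0f;
}
```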
Here, resources like the JAX Scaling Book’s chapter on GPUs are especially valuable. They tie together hardware primitives, kernels, and performance models into a coherent framework for reasoning about GPU efficiency, serving as a bridge between architectural understanding and large-scale system design.
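A simplified version of that performance model fits in a few lines: a kernel can finish no faster than its compute time or its memory time, whichever is larger. A sketch with assumed peak numbers:

```cpp
#include <algorithm>
#include <cstdio>

// Roofline-style estimate: a kernel is limited by whichever is slower,
// doing the math or moving the bytes. Peak numbers are assumptions.
double kernel_time_s(double flops, double bytes,
                     double peak_flops = 100e12,   // assumed FLOP/s
                     double peak_bw    = 2e12) {   // assumed bytes/s
    double t_compute = flops / peak_flops;
    double t_memory  = bytes / peak_bw;
    return std::max(t_compute, t_memory);
}

int main() {
    const double N = 4096;
    // 4096 x 4096 x 4096 GEMM in FP16: compute-bound.
    printf("matmul:      %.3f ms\n", 1e3 * kernel_time_s(2 * N * N * N, 3 * N * N * 2));
    // Elementwise op over the same-sized FP32 output: memory-bound.
    printf("elementwise: %.3f ms\n", 1e3 * kernel_time_s(N * N, 2 * N * N * 4));
    return 0;
}
```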
TPUs: When Hardware Commits Fully
Only now is it worth reading the classic ACM paper, A Domain-Specific Supercomputer for Training Deep Neural Networks.
TPUs are what happens when you take the matmul insight seriously and remove almost everything else. Systolic arrays, explicit dataflow, and compiler-driven execution (XLA) are not quirks — they are honest acknowledgments of the workload.
This is accelerator design taken to its logical extreme.
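You can get a feel for the dataflow by simulating a toy output-stationary systolic array in plain C++. This is only a sketch of the idea, not how a real MXU is built: A values flow right, B values flow down, and each processing element performs one multiply-accumulate per cycle.

```cpp
#include <cstdio>
#include <vector>

// Toy output-stationary systolic array computing C = A * B for N x N matrices.
// Each PE(i, j) holds an accumulator; A streams in from the left (skewed by
// row), B streams in from the top (skewed by column), one MAC per PE per cycle.
int main() {
    const int N = 4;
    std::vector<std::vector<float>> A(N, std::vector<float>(N)),
                                    B(N, std::vector<float>(N)),
                                    acc(N, std::vector<float>(N, 0.0f)),
                                    a_reg(N, std::vector<float>(N, 0.0f)),
                                    b_reg(N, std::vector<float>(N, 0.0f));
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) { A[i][j] = i + 1; B[i][j] = j + 1; }

    for (int t = 0; t <= 3 * (N - 1); ++t) {              // enough cycles to drain
        auto a_next = a_reg, b_next = b_reg;
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) {
                int k = t - i;                             // skewed A feed for row i
                float a_in = (j == 0) ? ((k >= 0 && k < N) ? A[i][k] : 0.0f)
                                      : a_reg[i][j - 1];   // otherwise take from left PE
                k = t - j;                                 // skewed B feed for column j
                float b_in = (i == 0) ? ((k >= 0 && k < N) ? B[k][j] : 0.0f)
                                      : b_reg[i - 1][j];   // otherwise take from top PE
                acc[i][j] += a_in * b_in;                  // one MAC per PE per cycle
                a_next[i][j] = a_in;                       // value moves right next cycle
                b_next[i][j] = b_in;                       // value moves down next cycle
            }
        }
        a_reg = a_next; b_reg = b_next;
    }

    for (int i = 0; i < N; ++i, printf("\n"))
        for (int j = 0; j < N; ++j) printf("%6.0f ", acc[i][j]);
    return 0;
}
```

No caches, no instruction fetch per operand, no scheduling: just data marching through a grid of multiply-accumulate units.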
Scaling Is a Hardware Problem
With kernels and accelerators understood, articles like How to Scale your Model and the distributed sections of the UltraScale Playbook land differently.
Scaling is no longer “add more GPUs.” It is:
bandwidth
interconnect topology
synchronization
kernel efficiency at scale
This is where hardware, kernels, and systems fully merge.
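One back-of-envelope shows why. In data-parallel training every step ends with an all-reduce over the gradients, and a ring all-reduce pushes roughly 2(n-1)/n times the gradient bytes through each link. The numbers below are assumptions for illustration:

```cpp
#include <cstdio>

int main() {
    // Assumed setup, for illustration only.
    const double params      = 7e9;      // 7B-parameter model
    const double bytes_per_g = 2.0;      // BF16 gradients
    const double n_gpus      = 8.0;
    const double link_bw     = 400e9;    // assumed 400 GB/s effective per-GPU bandwidth

    // Ring all-reduce: each GPU sends/receives ~2 * (n - 1) / n of the buffer.
    const double grad_bytes = params * bytes_per_g;
    const double comm_bytes = 2.0 * (n_gpus - 1.0) / n_gpus * grad_bytes;
    const double comm_time  = comm_bytes / link_bw;

    printf("gradient buffer: %.1f GB\n", grad_bytes / 1e9);
    printf("all-reduce time: %.1f ms per step (if not overlapped with compute)\n",
           comm_time * 1e3);
    // If this is comparable to the step's compute time, interconnect topology
    // and compute/communication overlap decide your scaling efficiency.
    return 0;
}
```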
LLMs as a Consequence, Not a Mystery
Finally, Large Language Models - The Hardware Connection becomes the conclusion, not the introduction.
Attention stresses memory bandwidth. Inference is sequential. Batching matters. TPU vs GPU tradeoffs become obvious.
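A quick sanity check makes the bandwidth point concrete: at batch size 1, every decoded token has to stream the full set of weights (plus the KV cache) through the chip, so tokens per second is capped by bandwidth, not FLOPs. All numbers below are illustrative assumptions:

```cpp
#include <cstdio>

int main() {
    // Illustrative assumptions, not measurements.
    const double params       = 7e9;     // 7B-parameter model
    const double bytes_per_w  = 2.0;     // BF16 weights
    const double peak_bw      = 2e12;    // assumed 2 TB/s of HBM bandwidth

    // Decode with batch size 1: every new token reads every weight once.
    const double bytes_per_token  = params * bytes_per_w;
    const double max_tokens_per_s = peak_bw / bytes_per_token;

    printf("weight bytes per token: %.1f GB\n", bytes_per_token / 1e9);
    printf("bandwidth-bound decode: ~%.0f tokens/s at batch size 1\n",
           max_tokens_per_s);
    // Batching amortizes the same weight traffic across many sequences,
    // which is why serving throughput depends so heavily on batch size.
    return 0;
}
```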
LLMs stop being magical and start looking like what they really are:
extremely large, extremely regular linear algebra workloads that push hardware to its limits.
Closing Thought
If there’s one theme running through this path, it’s this:
Scale is not just about models.
Scale is about how algorithms align with machines.
Understanding that alignment — from architecture to matmul to kernels to systems — is the real education.
Once these core components are clear — hardware primitives, memory hierarchies, matrix multiplication, kernels, and scaling constraints — you’re no longer dependent on a single learning path. At this stage, articles like the SemiAnalysis deep dive on TPUs become especially valuable: not as introductions, but as reality checks. They connect architectural choices to utilization, cost, and performance at scale. With these mental models in place, most courses, papers, and deep dives become accessible, and you can go arbitrarily deep based on curiosity rather than confusion.
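One of the simplest reality checks those deep dives use is model FLOPs utilization (MFU): the fraction of the hardware’s peak that a training run actually spends on the model’s useful math. A sketch with assumed numbers (the 6-FLOPs-per-parameter-per-token rule is the usual approximation, ignoring attention):

```cpp
#include <cstdio>

int main() {
    // Illustrative assumptions, not benchmark results.
    const double params       = 7e9;     // 7B-parameter model
    const double tokens_per_s = 5e4;     // assumed measured training throughput
    const double n_chips      = 64.0;
    const double peak_flops   = 100e12;  // assumed per-chip peak FLOP/s

    // ~6 FLOPs per parameter per token for training (forward + backward).
    const double achieved = 6.0 * params * tokens_per_s;
    const double peak     = n_chips * peak_flops;

    printf("MFU: %.1f%%\n", 100.0 * achieved / peak);
    return 0;
}
```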
Other References
In-Datacenter Performance Analysis of a Tensor Processing Unit


