GPUs: The Basics to Understand Their Performance

Today, I’m going to talk to you about GPUs (Graphics Processing Units). Of course, you’re all familiar with the term, and you probably have a GPU in your computer. But do you actually know how these processors are architected and why they are so valuable—not only for graphics but, more importantly, for AI nowadays?

My goal is to help you understand what defines the key characteristics of GPUs and how they differ from other types of processors, all without diving too deep into technical details.

GPUs haven’t always been the tech (and Nasdaq) darlings they are today. Back in the late 1990s, their primary use was limited to graphical rendering, particularly for smoother 3D visuals. It was NVidia that released what is often (wrongly) considered the first GPU: the GeForce 256, launched in October 1999. The key selling point of this card was actually its compatibility with Direct3D, but that’s another story entirely.

Over the years, GPUs made rapid progress, and in the early 2000s the concept of GPGPU (General-Purpose computing on GPUs) emerged: using GPUs for parallel computing outside the realm of graphical rendering. In 2010, NVidia introduced a new architecture (codenamed Fermi), first available in the GTX 480. This architecture marked a turning point for GPUs in artificial intelligence; their first major achievement was the training of AlexNet in 2012 by Alex Krizhevsky, with support from Geoffrey Hinton and Ilya Sutskever, renowned figures in modern AI.

Since then, GPUs have become an essential standard for AI (and not just for AI—they’re also critical for serious physical simulations, for example).

However, progress hasn’t stopped with GPUs. More recently, even more specialized processors have started to emerge. These are collectively referred to as NPUs (Neural Processing Units). Among them are Google’s TPUs, introduced in 2016, Graphcore’s IPUs (Intelligence Processing Units—I even authored a scientific article on this topic), and many others, as hardware has become a central challenge in modern AI. I’ll conclude by briefly mentioning the recent architectures highlighted by Apple (M3 and M4) and Qualcomm (Snapdragon X Elite), though I’ll say nothing more about them. 😉

In short, as you can see, the story of GPUs is not as recent as it might seem. Yet, you likely don’t have a clear intuition about how they work. Let’s fix that today!

The GPU: An Architecture Designed for Parallelism

To explain how a GPU works, let’s start with CPUs, the classic processors.

A CPU has an architecture designed for versatility: it can handle a wide variety of tasks, but typically in a sequential manner. To get the best possible performance, the CPU relies on branch prediction, which enables speculative execution: instructions are loaded into the processing pipeline ahead of time, based on how likely they are to be executed. This mechanism occasionally guesses wrong, and it occupies a significant share of the processor’s silicon; a GPU, which does without it, spends that silicon on additional compute units instead. 😉

A CPU also employs out-of-order execution, which rearranges the order of instructions from the running program to optimize processing time. This, too, occupies considerable space on the processor.

As for the rest, CPUs are composed of multiple cores—usually between 4 and 16 in modern processors. These cores are designed to handle varied tasks and execute complex calculations efficiently.

You can see in the diagram the high-level architecture of a CPU: a few computation units (the ALUs, or Arithmetic Logic Units) controlled by a CU (Control Unit), along with cache memory to minimize latencies (the moments when computation units are idle, waiting for data). Of course, there’s also standard memory at a higher level.

The GPU, however, is organized entirely differently and is built for completely different objectives. Its architecture consists of thousands of simpler cores, designed specifically to handle repetitive, basic tasks on a vast number of elements simultaneously. These cores are grouped into SMs (Streaming Multiprocessors), and the work itself is organized as a grid of blocks of threads that the SMs distribute across their cores. For example, NVidia’s Fermi architecture (2010) had 16 SMs, each controlling 32 cores, for a total of 512 cores.
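
To make the grid/block vocabulary concrete, here is a minimal CUDA sketch. It is only an illustration: the kernel name `scale`, the array size, and the choice of 256 threads per block are arbitrary. Each thread handles one element, threads are grouped into blocks, and the hardware schedules the blocks onto the available SMs.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: each thread scales one element of the array.
// A thread finds "its" element from its position inside the grid.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) {
        data[i] *= factor;
    }
}

int main() {
    const int n = 1 << 20;                          // 1M elements (arbitrary size)
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Launch configuration: a grid of blocks, each block holding 256 threads.
    // The SMs pick up the blocks and run their threads on the cores.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}
```

The triple angle brackets are CUDA’s launch syntax: the first value is the number of blocks in the grid, the second the number of threads per block.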

The diagram below illustrates the high-level architecture of a GPU. Here, the ALUs are much simpler compared to those of a CPU.

A Toy Example, but More Concrete

To make the concept clearer, let’s take the classic example of adding a series of numbers together.

How do you calculate the result of the operation: 2 + 34 + 65 + 12 + 91 + 3 + 77 + 43?

More importantly, how many computational cycles are needed to complete the calculation?

Let’s start with the CPU case. The CPU will work sequentially, performing the first addition, 2 + 34, then the second, then the next, and so on.

The diagram below illustrates this process.

It takes 7 cycles to complete the addition of 8 numbers.
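
For reference, here is roughly what the sequential version looks like in plain C++ (the values are the ones from the example above). The sum is built one addition at a time, and each addition needs the result of the previous one, so the seven additions cannot overlap.

```cpp
#include <cstdio>

int main() {
    int values[8] = {2, 34, 65, 12, 91, 3, 77, 43};

    // Start from the first value, then fold in the remaining seven, one per step.
    int sum = values[0];
    for (int i = 1; i < 8; ++i) {
        sum += values[i];        // each step depends on the previous result
    }

    printf("sum = %d\n", sum);   // 327, after 7 dependent additions
    return 0;
}
```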

Now, let’s look at the same calculation, but with Mr. GPU! Since the GPU has plenty of cores, it will use 4 of them for this operation. The diagram below illustrates how it works. Here, C1, C2, C3, and C4 represent the core names.

It takes just 3 cycles to complete the calculation in this case, and you can see that many of the cores are freed up well before the end of the process. The GPU is therefore far more efficient than the CPU for this type of task!
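
Here is a minimal CUDA sketch of the same pairwise idea, with one block of 4 threads standing in for the cores C1 through C4 in the diagram. A real reduction kernel would use shared memory and handle far more elements, but the 3-step structure is the same.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Pairwise ("tree") reduction of 8 values by 4 threads:
// 3 steps instead of 7 sequential additions.
__global__ void reduce8(int *data) {
    int tid = threadIdx.x;                        // 0..3
    for (int stride = 4; stride > 0; stride /= 2) {
        if (tid < stride) {
            data[tid] += data[tid + stride];      // step 1: 4 adds, step 2: 2 adds, step 3: 1 add
        }
        __syncthreads();                          // wait for every thread to finish the step
    }
    // After 3 steps, data[0] holds the total.
}

int main() {
    int h[8] = {2, 34, 65, 12, 91, 3, 77, 43};
    int *d;
    cudaMalloc(&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    reduce8<<<1, 4>>>(d);                         // one block of 4 threads

    cudaMemcpy(h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("sum = %d\n", h[0]);                   // 327 again, in 3 steps
    cudaFree(d);
    return 0;
}
```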

Of course, reality is more complex than this example, but you get the idea. When dealing with large datasets and needing to perform brute-force operations by repeating the same task many times, the GPU shines with its efficiency.

Neural AI: The GPUs’ Favourite Playground

Apologies in advance, but to explain why GPUs are so useful for AI, I first need to talk about AI and neural networks. AI is a broad field encompassing many different approaches. For several years, neural network-based methods have proven highly effective, particularly when computations are performed on GPUs.

To illustrate, let’s take a very simple example of a small neural network with two inputs, two outputs, and a hidden layer. The connection weight between x1 and h1 is denoted as w₁,₁, while the weight between h1 and y1 is v₁,₁, and so on. The diagram below represents this network.

There are many operations involved in this type of network, but one typical operation is the feed-forward step. This step processes input data to calculate the output values produced by the network (at y1 and y2, for example). This is exactly what happens during inference in a language model.

All of this boils down to a series of matrix multiplications and additions, as illustrated in the following diagram.

Of course, this might seem utterly incomprehensible if you don’t already have a solid foundation in math and some familiarity with neural networks. But to sum it up, we see that what the hidden layer sends to the outputs is a weighted sum of the input values, where the weights are defined by coefficients like w₁,₁, and so on.

Depending on certain choices about the network’s characteristics (e.g., linear or non-linear), the computation for the output values becomes more or less complex. However, it always boils down to matrix multiplications combined with vector additions.
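
To make those weighted sums explicit, here is a tiny sketch in plain C++ for the 2-2-2 network above. It assumes a purely linear layer (no activation function) and the convention that w[i][j] is the weight between input xᵢ and hidden unit hⱼ (the article’s w₁,₁, w₁,₂, and so on); the weight values themselves are made up for illustration.

```cpp
#include <cstdio>

int main() {
    float x[2] = {1.0f, 0.5f};                 // arbitrary input values
    float w[2][2] = {{0.2f, 0.8f},             // w[i][j]: weight from x_i to h_j
                     {0.4f, 0.6f}};
    float v[2][2] = {{0.5f, 0.1f},             // v[i][j]: weight from h_i to y_j
                     {0.9f, 0.3f}};

    float h[2], y[2];
    for (int j = 0; j < 2; ++j)                // hidden layer: each h_j is a weighted sum of the inputs
        h[j] = w[0][j] * x[0] + w[1][j] * x[1];
    for (int j = 0; j < 2; ++j)                // output layer: each y_j is a weighted sum of the h values
        y[j] = v[0][j] * h[0] + v[1][j] * h[1];

    printf("y1 = %f, y2 = %f\n", y[0], y[1]);
    return 0;
}
```

Each loop body is exactly one row of a matrix-vector product: scale things up to millions of weights and you get the matrix multiplications mentioned above.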

And this is where the “magic” of GPUs comes into play. Their architecture is specifically designed to handle matrix multiplication and addition efficiently. More precisely, the FMA operation, short for Fused Multiply-Add, has been hardwired into GPU architectures since Fermi (2010), and carried forward in Kepler (2012), Maxwell (2014), Pascal (2016), Volta (2017), and beyond.

I won’t go into detail, but what enables the GPU to perform matrix operations so quickly is the combination of its ability to handle multiply-add operations on floating-point numbers efficiently and the massively parallel nature of its architecture.
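
As a rough illustration, here is what an explicit FMA looks like in CUDA device code: a small dot-product kernel where each step is fmaf(a, b, acc), i.e. acc = a·b + acc computed with a single rounding. In practice the compiler usually emits FMA instructions on its own for expressions like a * b + acc; writing fmaf just makes the operation visible.

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: a dot product where every step is an explicit fused multiply-add.
// Single-threaded on purpose, to keep the sequence of FMAs readable.
__global__ void dot(const float *a, const float *b, float *out, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc = fmaf(a[i], b[i], acc);   // one FMA per element
    *out = acc;
}

int main() {
    const int n = 4;
    float ha[n] = {1, 2, 3, 4}, hb[n] = {10, 20, 30, 40}, result = 0;
    float *da, *db, *dout;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dout, sizeof(float));
    cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);

    dot<<<1, 1>>>(da, db, dout, n);
    cudaMemcpy(&result, dout, sizeof(float), cudaMemcpyDeviceToHost);
    printf("dot = %f\n", result);      // 1*10 + 2*20 + 3*30 + 4*40 = 300
    cudaFree(da); cudaFree(db); cudaFree(dout);
    return 0;
}
```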

The diagram below illustrates how splitting input matrices into different blocks and assigning them to multiple computation units allows for significantly faster calculations.

With 4 computation units working in parallel, this calculation can be completed in a single cycle—provided the data is correctly distributed to the right locations.
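
The classic way to exploit this on a GPU is a tiled matrix multiplication: each thread block computes one tile of the result and stages the matching tiles of the inputs in shared memory, the small cache local to the SM, so that every value fetched from main memory is reused by all the threads of the block. The sketch below is a simplified version (it assumes square matrices whose size is a multiple of the tile width and skips bounds checks).

```cpp
#include <cuda_runtime.h>

#define TILE 16   // tile width; a common choice, not the only one

// Sketch of a tiled matrix multiplication C = A x B (square N x N matrices).
__global__ void matmul_tiled(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE][TILE];             // tiles staged in the SM's shared memory
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;   // row of C this thread computes
    int col = blockIdx.x * TILE + threadIdx.x;   // column of C this thread computes
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {         // walk across A's columns / B's rows, tile by tile
        // Cooperative load: each thread brings in one element of each tile.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)           // multiply-add within the tiles
            acc = fmaf(As[threadIdx.y][k], Bs[k][threadIdx.x], acc);
        __syncthreads();
    }
    C[row * N + col] = acc;
}

// Typical launch, assuming N is a multiple of TILE:
//   dim3 block(TILE, TILE);
//   dim3 grid(N / TILE, N / TILE);
//   matmul_tiled<<<grid, block>>>(dA, dB, dC, N);
```

This reuse of data through memory that is “local” to a block of computation units is exactly the kind of distribution the diagram alludes to.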

It’s a wrap

The concept of a GPU is ultimately quite simple: on a piece of silicon, you pack in more computation units, each simpler in design, well coordinated, and with access to data through caches that are “local” to blocks of computation units.

This allows each core to perform simpler calculations, but at incredible speed. If a problem can be easily divided into blocks of sub-data that can be processed uniformly, then the GPU will work wonders!

The current miracle of AI is largely built on this, supported by processors that now go beyond GPUs and are increasingly specialized. Most NPUs, for instance, work with numerical data of very low precision. When benchmarking Apple’s NPUs (from the M4 to the A17 Pro in the iPhone 15 Pro), performance figures often focus on multiply-add operations involving INTs (integers), whereas GPUs are typically evaluated on FLOATs (floating-point numbers). These represent very different use cases. NPUs tend to excel at inference but are less effective for model training.

That’s it! I hope this article has given you some intuition about GPUs. See you soon for another post, right here on the same blog!