Apple’s M1 chip delivers the fastest single-core CPU benchmark scores ever seen on a Mac, and it beats many high-end Intel competitors in multi-core performance. Developer Erik Engheim recently shared an in-depth look at the M1 chip, exploring why Apple’s new processor is so much faster than the Intel chips it replaces.
M1 is not a CPU!
First, the M1 is not a simple CPU. As Apple explains, it’s a system-on-chip: a collection of chips housed in a single silicon package. The M1 contains an 8-core CPU, an 8-core GPU (7-core on some MacBook Air models), unified memory, an SSD controller, an image signal processor, the Secure Enclave, and many more modules. This is what we call a system on a chip (SoC).
The M1 is a system-on-a-chip, which means that all the parts that make up the computer are placed on a single silicon chip.
Intel and AMD also house multiple microprocessors in a single package, but as Engheim describes it, Apple has an advantage: instead of focusing on general-purpose CPU cores like its competitors, it dedicates silicon to specialized tasks.
An example of a computer motherboard. Many components such as memory, CPU, graphics card, IO controller, network card, etc. can be connected to the motherboard to communicate with each other.
Because we can put so many transistors on a single silicon chip today, companies like Intel and AMD have started putting multiple microprocessors on one chip. We call these CPU cores. A core is basically a complete, independent processor that reads instructions from memory and performs calculations.
Microchip with multiple CPU cores.
Adding more cores has been the main way to improve performance for a long time. But one player in the CPU market is now deviating from this trend: Apple.
Apple’s heterogeneous computing strategy is no mystery
Instead of adding more general-purpose CPU cores, Apple took a different tack: they started adding specialized chips to handle specialized tasks. The advantage is that specialized chips tend to solve specific tasks faster, and with far less power, than general-purpose CPU cores.
This is not entirely new. Specialized chips such as graphics processing units (GPUs) have sat on Nvidia and AMD graphics cards for years, performing graphics-related operations much faster than general-purpose CPUs.
All Apple has done is a more radical shift in that direction. Rather than just having general-purpose cores and memory, the M1 includes a wide variety of specialized chips:
Central Processing Unit (CPU) – The “brain” of the SoC. Runs most of the code for the operating system and applications.
Graphics Processing Unit (GPU) – Handles graphics-related tasks such as visualizing the user interface of applications and 2D/3D games.
Image Signal Processor (ISP) – Speeds up common tasks performed by image processing applications.
Digital Signal Processor (DSP) – Handles mathematically intensive work that would otherwise fall to the CPU, such as decompressing music files.
Neural Processing Unit (NPU) – Used in high-end smartphones to accelerate machine learning (AI) tasks such as speech recognition and camera processing.
Video Encoder/Decoder – Handles energy-efficient conversion of video files and formats.
Secure Enclave – Encryption, Authentication and Security.
Unified Memory – Allows the CPU, GPU and other cores to quickly exchange information.
This is part of the reason many people are seeing speed improvements when doing image and video editing on an M1 Mac. Many of the tasks they perform can run directly on specialized hardware. That’s why a cheap M1 Mac mini can encode a large video file effortlessly, while an expensive iMac runs all its fans at full speed and still can’t keep up.
Blue is multiple CPU cores accessing memory, green is a large number of GPU cores accessing memory
Unified memory can be confusing. How is it different from shared memory? And wasn’t sharing video memory with main memory a bad idea in the past, giving terrible performance? Yes, traditional shared memory really was bad, because the CPU and GPU had to take turns accessing it. Sharing meant fighting over the data bus: the GPU and CPU had to take turns pushing data through a narrow pipe.
Unified memory is different. With unified memory, GPU cores and CPU cores can access memory at the same time, so there is no turn-taking overhead. In addition, the CPU and GPU can simply tell each other where data lives. Previously, the CPU had to copy data from its region of main memory to the region used by the GPU. With unified memory, it’s more like saying, “Hey Mr GPU, I have 30MB of polygon data starting at memory location 2430,” and the GPU can start using that memory without doing any copying.
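To make the idea concrete, here is a minimal sketch in Swift using Metal. It assumes an Apple silicon Mac with a default Metal device, and the buffer size and the commented-out vertex-buffer binding are only illustrative. The key point is that a buffer allocated with .storageModeShared lives in the unified memory pool, so the CPU writes into it directly and the GPU can consume the very same bytes without an upload step.

import Metal

// Minimal sketch: allocate ~30 MB in the unified memory pool.
guard let device = MTLCreateSystemDefaultDevice(),
      let buffer = device.makeBuffer(length: 30 * 1024 * 1024,
                                     options: .storageModeShared) else {
    fatalError("No Metal device available")
}

// CPU side: write "polygon data" straight into the buffer.
let floats = buffer.contents().bindMemory(to: Float.self,
                                          capacity: buffer.length / MemoryLayout<Float>.stride)
floats[0] = 1.0  // ... fill in vertex data here ...

// GPU side: the same buffer is simply bound to a render or compute encoder.
// No 30 MB copy ever happens; the GPU reads the bytes the CPU just wrote.
// encoder.setVertexBuffer(buffer, offset: 0, index: 0)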
This means that by using the same memory pool, all the special coprocessors on the M1 can quickly exchange information with each other, which can significantly improve performance.
How Macs used GPUs before unified memory: there was even an option to house the graphics card outside the computer, connected with a Thunderbolt 3 cable.
Why don’t Intel and AMD do this?
If what Apple does is so smart, why isn’t everyone doing it? To some extent, they are. Other ARM chipmakers are also investing more and more in specialized hardware.
AMD has also started putting more powerful GPUs on some of its chips and is gradually moving toward a form of SoC: its Accelerated Processing Units (APUs) are basically CPU cores and GPU cores on the same chip.
The AMD Ryzen Accelerated Processing Unit (APU) combines a CPU and a GPU (Radeon Vega) on a single silicon chip, but includes no other coprocessors, IO controllers, or unified memory.
However, there are important reasons why they can’t simply follow. An SoC is essentially a whole computer on a chip, which makes it a more natural fit for an actual computer maker such as HP or Dell. To use an automotive analogy: if your business model is making and selling car engines, it would be an unusual leap to start making and selling complete cars.
In the ARM world, by contrast, this is not a problem. Computer makers like Dell or HP can simply license ARM CPU IP and buy IP for other chips to add whatever specialized hardware they think their SoC should have. Then they send the finished design to a semiconductor foundry such as GlobalFoundries or TSMC to be manufactured.
Here we run into a big problem with Intel’s and AMD’s business model, which is based on selling general-purpose CPUs that people simply plug into a PC motherboard. Computer manufacturers can buy motherboards, memory, CPUs, and graphics cards from different suppliers and combine them into one machine.
But in the new SoC world, you don’t assemble physical components from different suppliers. Instead, you assemble IP from different vendors: you buy designs for GPUs, CPUs, modems, IO controllers, and so on, use them to design an SoC in-house, and then find a foundry to manufacture it.
Now there is a big problem, because neither Intel nor AMD nor Nvidia is going to license their IP to Dell or HP so they can build SoCs for their own machines.
Of course, Intel and AMD could simply start selling entire finished SoCs. But what should they contain? PC makers may all have different ideas, and you could end up with a conflict between Intel, AMD, Microsoft, and the PC makers over which specialized chips to include, because those chips need software support.
For Apple, it’s simple: they control the whole product. They can give developers libraries such as Core ML for writing machine learning code, and whether Core ML runs on the CPU or the Neural Engine is an implementation detail developers don’t have to care about.
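As a rough sketch of what that looks like in practice (the model file name here is hypothetical), the developer just loads a Core ML model and states which compute units are allowed; the framework decides where the work actually runs:

import CoreML

// Minimal sketch: the app never says "run this on the Neural Engine".
let config = MLModelConfiguration()
config.computeUnits = .all   // CPU, GPU and Neural Engine are all fair game

// "SomeModel.mlmodelc" is a hypothetical compiled Core ML model.
let modelURL = URL(fileURLWithPath: "SomeModel.mlmodelc")
let model = try? MLModel(contentsOf: modelURL, configuration: config)
// Where inference executes is an implementation detail of the framework.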
The basic challenge of getting any CPU running fast
So heterogeneous computing is part of the reason, but not the whole story. The M1’s fast general-purpose CPU core, called Firestorm, is genuinely fast. This is a major departure from past ARM CPU cores, which tended to be very weak compared to AMD and Intel cores.
By comparison, Firestorm beats most Intel cores and nearly beats the fastest AMD Ryzen cores. Conventional wisdom holds that this cannot happen.
Before we discuss what makes Firestorm so fast, let’s look at what is really at the heart of designing a faster CPU.
In principle, you can do it with a combination of two strategies:
Execute a sequence of instructions faster, one after the other
Execute as many instructions as possible in parallel
In the ’80s, this was easy to achieve: just increase the clock frequency and instructions completed sooner. Still, a single instruction can take multiple clock cycles to complete, because it is made up of several smaller steps.
However, increasing the clock frequency is almost impossible today. This is the “end of Moore’s Law” that people have been chattering about for over a decade.
So today the focus is really on executing as many instructions in parallel as possible.
Multi-core processor or out-of-order processor?
There are two ways to attack this problem. One is to add more CPU cores. From a software developer’s perspective, this is like adding threads; each CPU core is a hardware thread. With two cores, a CPU can execute two independent tasks concurrently. The tasks can be two separate programs stored in memory, or the same program executed twice. Each thread needs some bookkeeping, such as where it currently is in the program’s instruction sequence, and each thread stores temporary results that must be kept separate.
In principle, a processor with only one core can run multiple threads: it simply pauses one thread, stores its current progress, switches to another, and switches back later. This doesn’t give much of a performance gain and is only useful when a thread would otherwise stall frequently while waiting for user input, data on a slow network connection, and so on. These are software threads. Hardware threads mean you have actual extra physical hardware, such as extra cores, to increase speed.
The problem is that developers have to write their code specifically to take advantage of this (a minimal sketch follows below). Some tasks, such as server software, are easy to write this way: each connected user can be handled individually, and those tasks are independent of each other. So for servers, especially cloud-based services, having a large number of cores is a good option.
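Here is a minimal sketch of that kind of code in Swift. The work items are made up purely for illustration; the point is that the developer must express the work as independent chunks before multiple cores can help.

import Foundation

// Minimal sketch: two independent chunks of work. On a multi-core CPU they
// run at the same time; on a single core they would just take turns.
func heavyWork(_ id: Int) {
    var sum = 0.0
    for i in 0..<5_000_000 { sum += Double(i).squareRoot() }
    print("task \(id) finished (sum = \(sum))")
}

// concurrentPerform spreads the iterations across the available cores.
DispatchQueue.concurrentPerform(iterations: 2, execute: heavyWork)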
The Ampere Altra Max ARM CPU has 128 cores and is designed for cloud computing, where having lots of hardware threads is an advantage.
That’s why you see ARM CPU makers such as Ampere shipping chips like the Altra Max with a crazy 128 cores. These chips are made specifically for cloud computing, where you don’t need crazy single-core performance: it’s all about having as many threads per watt as possible to handle as many concurrent users as possible.
Apple, by contrast, sits at the exact opposite end. Apple makes single-user devices, where a large number of threads is not an advantage. Their devices are used for gaming, video editing, development, and so on, and they want desktops with responsive graphics and animation.
Desktop software is generally not built to utilize a large number of cores. For example, a computer game might benefit from 8 cores, but something like 128 cores is a complete waste. Instead, you’ll want fewer but more powerful cores.
Interestingly, out-of-order execution is a way to execute more instructions in parallel without exposing that parallelism as multiple threads. Developers don’t need to write special code to take advantage of it; from a developer’s perspective, each core simply looks faster.
To understand how this works, you need to know a few things about memory. Requesting data at a specific memory location is slow, but there is little difference in latency between fetching 1 byte and fetching, say, 128 bytes. Data is sent over what we call a data bus; you can think of it as a road or pipe between memory and the different parts of the CPU through which data is pushed. If the data bus is wide enough, you can fetch many bytes at the same time.
So the CPU fetches a whole block of instructions at a time, but it is supposed to execute them one after the other. Modern microprocessors instead do what we call out-of-order (OoO) execution.
This means they can quickly analyze a buffer of instructions and see which instructions depend on which instructions. See the simple example below:
01: mul r1, r2, r3 // r1 ← r2 × r3
02: add r4, r1, 5 // r4 ← r1 + 5
03: add r6, r2, 1 // r6 ← r2 + 1
Multiplication tends to be slow; suppose it takes multiple clock cycles to complete. The second instruction has to wait, because its computation depends on the result placed in register r1.
The third instruction, however, does not depend on the result of any previous instruction, so an out-of-order processor can start computing it in parallel.
More realistically, however, we’re talking about hundreds of instructions. The CPU is able to figure out all the dependencies between these instructions.
It analyzes these instructions by observing the input of each instruction. Does the input depend on the output of one or more other instructions? When we say input and output, we mean registers that contain the results of previous computations.
For example, the instruction add r4, r1, 5 depends on input from r1, which is produced by mul r1, r2, r3. We can chain these relationships together into a long, complex graph that the CPU can work through: the nodes are instructions and the edges are the registers connecting them.
The CPU can analyze such a graph of nodes and determine which instructions can be executed in parallel, and where to wait for the results of multiple dependent computations before executing them.
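The sketch below (a toy model in Swift, nothing like real hardware) shows the essence of that check for the three instructions above: an instruction can issue as long as none of its source registers is still waiting to be produced by an unfinished instruction.

// Toy sketch of the dependency check, using the three instructions above.
struct Instruction {
    let text: String
    let writes: String      // destination register
    let reads: [String]     // source registers
}

let program = [
    Instruction(text: "mul r1, r2, r3", writes: "r1", reads: ["r2", "r3"]),
    Instruction(text: "add r4, r1, 5",  writes: "r4", reads: ["r1"]),
    Instruction(text: "add r6, r2, 1",  writes: "r6", reads: ["r2"]),
]

// The multiply is slow, so r1 is not ready yet.
let pendingWrites: Set<String> = [program[0].writes]

for inst in program.dropFirst() {
    let blocked = inst.reads.contains { pendingWrites.contains($0) }
    print("\(inst.text): \(blocked ? "must wait for r1" : "can issue in parallel with the mul")")
}
// add r4, r1, 5: must wait for r1
// add r6, r2, 1: can issue in parallel with the mul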
Many instructions will finish early, but their results cannot be made official right away; otherwise the results would become visible in the wrong order. To the outside world, instructions must appear to complete in the order they were issued.
The CPU keeps popping completed instructions off the front of this buffer until it hits one that has not yet finished.
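A tiny sketch of that retirement step, under the simplifying assumption that each entry just carries a “finished” flag: results commit strictly in program order, so an instruction that finished early still waits behind slower instructions ahead of it.

// Toy sketch: retire (commit) results in program order only.
struct ROBEntry { let name: String; let finished: Bool }

var reorderBuffer = [
    ROBEntry(name: "inst 1", finished: true),
    ROBEntry(name: "inst 2", finished: true),
    ROBEntry(name: "inst 3", finished: false),  // still executing
    ROBEntry(name: "inst 4", finished: true),   // done early, but must wait for inst 3
]

while let head = reorderBuffer.first, head.finished {
    print("retiring \(head.name)")
    reorderBuffer.removeFirst()
}
// Only "inst 1" and "inst 2" retire; "inst 4" waits behind "inst 3".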
We’re not quite done with this explanation, but it gives you the idea. Basically, you can either have parallelism that programmers have to manage explicitly, or the kind where the CPU pretends everything is single-threaded while “using black magic” behind the scenes.
It is this excellent out-of-order execution that makes the M1’s Firestorm core shine. Its out-of-order machinery is far stronger than anything from Intel or AMD, probably stronger than any other processor on the mainstream market.
Why is AMD and Intel’s out-of-order execution not as good as the M1’s?
In my explanation of out-of-order execution (OoO), I skipped some important details that need to be covered; otherwise it’s impossible to understand why Apple is ahead and why Intel and AMD may not be able to catch up.
That big buffer of instructions I talked about is called the Reorder Buffer (ROB), and it does not contain normal machine-code instructions, the ones the CPU fetches from memory to execute. Those belong to the CPU’s Instruction Set Architecture (ISA): what we call x86, ARM, PowerPC, and so on.
Inside the CPU, however, it works on a completely different instruction set, invisible to the programmer. We call these instructions micro-operations (micro-ops, or μops), and the ROB is filled with them.
These micro-operations are more practical for the CPU. The reason is that micro-ops are very wide (contain many bits) and can contain various meta-information.
You can’t add this kind of information to an ARM or x86 instruction because it would:
It would bloat program binaries.
It would expose details of how the CPU works internally: whether it has an out-of-order unit, register renaming, and many other details.
Much of the meta-information is only meaningful in the context of the current execution.
You can think of it like writing programs: you have a public API that needs to be stable and that everyone uses. That is the ARM, x86, PowerPC, or MIPS instruction set. Micro-ops are basically the private API used to implement the public API.
Micro-ops are also usually easier for the CPU to work with, because each one does only a simple, limited task. A regular ISA instruction can be complex enough to cause a whole series of things to happen, so it often translates into multiple micro-ops (see the toy illustration below).
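As a toy illustration (this is not a real decoder, and the instruction is a made-up CISC-style “add a register to a memory location”), one such complex instruction might be cracked into three simple micro-ops:

// Toy sketch: one complex instruction becomes several simple micro-ops.
enum MicroOp {
    case load(dest: String, address: Int)
    case add(dest: String, a: String, b: String)
    case store(src: String, address: Int)
}

// Hypothetical CISC-style instruction: "add [0x1000], r2"
// (add the value in r2 to the contents of memory location 0x1000)
let microOps: [MicroOp] = [
    .load(dest: "tmp", address: 0x1000),    // fetch the memory operand
    .add(dest: "tmp", a: "tmp", b: "r2"),   // do the arithmetic
    .store(src: "tmp", address: 0x1000),    // write the result back
]
print(microOps)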
For CISC CPUs, there is usually no choice but to use micro-ops, otherwise large complex CISC instructions would make pipelining and out-of-order nearly impossible.
RISC CPUs have a choice. Smaller ARM CPUs, for example, don’t use micro-ops at all, but that also means they can’t do things like out-of-order execution.
But why does any of this matter? Why are these details important for understanding why Apple has the upper hand over AMD and Intel?
It’s because the ability to run fast depends on how quickly you can fill the ROB with micro-ops and how large that buffer is. The faster you fill it, and the bigger it is, the more opportunities you have to pick instructions that can execute in parallel, which improves performance.
Machine-code instructions are chopped into micro-ops by what we call instruction decoders. With more decoders, we can decode more instructions in parallel and fill the ROB faster.
And this is where we see the huge difference. The most powerful Intel and AMD microprocessors have 4 decoders, meaning they can decode 4 instructions in parallel and output micro-ops.
But Apple has a crazy 8 decoders. Not only that, the M1’s ROB is about three times as large, so it can hold roughly three times as many instructions. No other mainstream chipmaker has that many decoders in their CPUs.
Why can’t Intel and AMD add more instruction decoders?
This is where RISC finally gets its revenge, and where the fact that the M1’s Firestorm cores use the ARM RISC architecture starts to matter.
On x86, an instruction can be anywhere from 1 to 15 bytes long. On RISC chips, instructions have a fixed size. Why does that matter?
Because if every instruction has the same length, it becomes much simpler to split a byte stream into instructions and feed them into 8 different decoders in parallel!
However, on x86 CPUs, the decoder doesn’t know where the next instruction starts. It has to actually analyze each instruction to see how long it is.
Intel and AMD deal with this by brute force: they simply try to decode instructions at every possible starting point. That means a lot of wrong guesses that have to be thrown away, which makes the decoder stage so convoluted that it is very hard to add more decoders. For Apple, by contrast, it’s trivial to keep adding them (see the sketch below).
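The contrast can be sketched like this (a toy model; lengthOf is a made-up stand-in for real x86 length decoding). With fixed 4-byte instructions, the byte stream splits at known offsets, so eight decoders could each grab their slice at once; with 1-to-15-byte instructions, instruction N+1 cannot even be located until instruction N has been measured.

// Toy byte stream standing in for machine code.
let byteStream: [UInt8] = (0..<64).map { UInt8($0) }

// RISC-style: every instruction is 4 bytes, so the split is trivial and
// every chunk could be handed to a separate decoder in parallel.
let fixedInstructions = stride(from: 0, to: byteStream.count, by: 4).map {
    Array(byteStream[$0 ..< $0 + 4])
}
print("fixed-width: \(fixedInstructions.count) instructions, found instantly")

// x86-style: a made-up lengthOf() must inspect each instruction before the
// next one can even be located, so the walk is inherently serial.
func lengthOf(_ bytes: ArraySlice<UInt8>) -> Int {
    1 + Int(bytes.first! % 15)   // stand-in for real 1-15 byte length decoding
}

var offset = 0
var count = 0
while offset < byteStream.count {
    offset += min(lengthOf(byteStream[offset...]), byteStream.count - offset)
    count += 1
}
print("variable-width: \(count) instructions, found one after another")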
In fact, adding more decoders creates so many other problems that, in AMD’s own words, 4 decoders is basically the upper limit of what they can manage.
That’s why the M1 Firestorm core handles essentially twice as many instructions as AMD and Intel CPUs at the same clock frequency.
One could argue that because CISC instructions turn into more micro-ops and are denser, decoding one x86 instruction is more akin to decoding two ARM instructions.
In reality, that isn’t the case: highly optimized x86 code rarely uses the complex CISC instructions, and in some ways it has a RISC-like flavor.
But that doesn’t help Intel or AMD, because even if 15-byte instructions are rare, the decoders still have to be built to handle them, and that complexity is what prevents AMD and Intel from adding more decoders.
But AMD’s Zen 3 cores are still faster, right?
As far as I can tell, the latest AMD CPU cores, called Zen 3, are slightly faster than the Firestorm cores in benchmarks. But that’s only because the Zen 3 cores are clocked at 5 GHz while the Firestorm cores run at 3.2 GHz. Despite a clock frequency roughly 56% higher, Zen 3 only barely beats Firestorm.
So why didn’t Apple raise the clock frequency? Because a higher clock makes the chip run hotter, and low heat is one of Apple’s key selling points: unlike products built on Intel and AMD chips, their computers barely need fans.
Essentially, one could say the Firestorm cores really are superior to the Zen 3 cores; Zen 3 just burns more power to reach its higher performance.
AMD and Intel seem to have cornered themselves in two ways:
They don’t have a business model that makes it easy to pursue heterogeneous computing and SoC designs.
Their legacy x86 CISC instruction set is coming back to haunt them, making it difficult to improve out-of-order performance.
This doesn’t mean the game is over. They can of course increase clock frequencies, use more cooling, add more cores, enlarge CPU caches, and so on. But each of these options has drawbacks. Intel is in the worst position: its cores have already been thoroughly beaten by Firestorm, and its GPUs are weak.
The problem with adding more cores is that, for typical desktop workloads, the gains from having many cores quickly stop being noticeable. Of course, a large number of cores is still very effective for servers.
Fortunately for AMD and Intel, Apple doesn’t sell its chips to other companies, so PC users will have to live with whatever Intel and AMD offer.