Those computer things. They work, somehow. Some say it runs on a proprietary mixture of unicorn meat and peanuts. I think that is ridiculous, personally – why would we use unicorn meat as fuel when we can be eating it instead?
Computers consist of several components, among them a CPU, memory, storage, and so on. The perceptible speed of a computer likewise depends on several components. One of these is the CPU, the brain of the computer. It is kind of like your brain, but much, much faster. Not all CPUs are created equal, though. What makes one CPU faster than another? To understand this, we first need to understand several concepts related to CPUs.
What Does A CPU Do?
First, CPU is an acronym. It stands for Central Processing Unit. Speaking of standing for things, apparently CPUs are going to unionize. Something about working all day long and overheated working conditions. Anyway, back to the acronym. From it we can derive the two main traits of a CPU: it is central, and it processes stuff. As I said, a CPU is the brain of the entire operation. All input goes into it, and all output comes from it. It is in the middle of all the action. That is the central part from the acronym. The other part is processing, which is the topic for today.
You can think of processing as doing work. Let’s suppose you are a cashier in a grocery store. I had higher hopes for you than that, but you said you wanted to be around vegetables, so there you go. Anyway, people pick out some food (once they are done racing carts) and get in line. You interact with the people in line, and check out their items. Once you finish checking out one person, you help the next person in line. You are given tasks, and you process through them.
The work a CPU does is very low-level. It primarily deals with basic arithmetic. You know, stuff like 5 + 7. It also works with memory – getting values from RAM, and later storing them back. These basic building blocks eventually form the bulk of most programs. Unlike you, a CPU is good at math. Very good – and very fast. A CPU can perform millions of these operations every second. This is related to the clock speed of a CPU.
A CPU does work in cycles. A cycle is one run through the circuit of the CPU. Think of a maze with multiple entrances and exits. Each cycle, a pulse goes through the maze, takes various branches, and exits. It then goes through again, taking a slightly different path, then exits. In a CPU, each run-through slightly modifies the state of the CPU. The reason it must be done in cycles is due to how circuits work in general. They flow a certain way. Much like on a freeway, you cannot simply turn around in the middle; you have to reach an exit.
Each operation (e.g. adding two numbers) takes several cycles. Something like addition will only take a few cycles, say 4. Something more complex like division requires many more cycles, say 36. These numbers are completely arbitrary and simplify reality, because a CPU is very complex. For example, there is a pipeline of operations to be done. The CPU gets the next operation, performs it, then repeats the process. In some cases, it can execute (as in run, not the guillotine version) two operations at the same time, or even execute them in a different order than it received them. For this reason, it is hard to pinpoint an exact cycle count for anything, and so these examples are more illustrative than factual. The point is that a CPU works in cycles.
Let’s take our hypothetical 36-cycle division. How long does that take? It depends how often a cycle occurs. For a CPU, this cycle-rate is measured in hertz. Hertz is a measurement of cycles per second. For example, you are likely using a 60 hertz (Hz) monitor. This means it can update the image 60 times per second. For monitors, this is known as the refresh rate. Conventional CPU speeds are measured in gigahertz (GHz). The prefix giga denotes one billion. A CPU with a clock speed of 1GHz performs one billion cycles per second. That is a lot of bikes, my friend.
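If you want to see the arithmetic, here is a quick back-of-the-envelope sketch in Python. Remember, the 36-cycle figure is the made-up one from above, not a measurement of any real chip:

```python
# Back-of-the-envelope: how long does a 36-cycle operation take
# on a 1 GHz CPU? (The cycle count is the hypothetical one from above.)

clock_hz = 1_000_000_000   # 1 GHz = one billion cycles per second
cycles_for_division = 36   # made up for illustration

seconds_per_cycle = 1 / clock_hz
division_time = cycles_for_division * seconds_per_cycle

print(f"One cycle lasts {seconds_per_cycle * 1e9:.0f} ns")
print(f"Our 36-cycle division takes {division_time * 1e9:.0f} ns")
```

That works out to 36 nanoseconds per division – roughly 27 million divisions every second, on a single core, without breaking a sweat.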
For a long time, the primary improvement in CPUs was an increase in clock speed. Back in the days of 256MB RAM sticks, CPU speeds were given in MHz (megahertz, one million hertz), with successive models having higher clock speeds. Speeds increased for years. Around 2005, 4GHz was reached. And then… it stopped climbing. Much like you, once you reached the fifth stair. Even today high-end processors still peak around 4GHz. Why is that? The problem here is, well… physics. Typical physics, always making stuff not work. The problem is heat and power consumption.
In normal use, a CPU generates heat as a byproduct of computation. A CPU can only operate within a certain range of temperatures. If a CPU is getting too hot, it throttles (reduces clock speed) in order to regulate its temperature. Cooling systems help this to an extent in normal conditions, which is why you likely hear a fan whirring away when doing CPU-intensive work on a laptop.
Why don’t we just get better cooling, you say. Then we could run CPUs at 10 GHz. Well, cooling helps, to an extent, but there is a massive scaling problem here. The relation between clock speed and heat output is non-linear, which is a fancy way of saying the difference in heat output between 3 and 4 GHz is more than the difference between 2 and 3 GHz. The relationship is more quadratic than linear. This means that as you increase the clock speed, you also increase the heat growth rate. At some point, cooling just cannot catch up, and we are constrained to an upper bound on clock speeds. Like they say, if you cannot take the heat, throttle. And something about a kitchen, I forget.
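To put toy numbers on that, here is a sketch assuming heat grows with the square of clock speed. Real power draw depends on voltage, transistor count, and plenty more, so treat this as an illustration of the shape of the curve, not physics homework:

```python
# A toy model of why clock speed scaling hit a wall. Assume heat output
# grows roughly with the square of clock speed (a big simplification --
# real power also depends on voltage, transistor count, and more).

def relative_heat(ghz):
    return ghz ** 2   # arbitrary "heat units"

for low, high in [(2, 3), (3, 4)]:
    jump = relative_heat(high) - relative_heat(low)
    print(f"{low} -> {high} GHz: heat grows by {jump} units")
```

The 3-to-4 GHz jump costs 7 heat units while 2-to-3 GHz costs only 5, even though both buy you the same 1 GHz of extra speed. Each step up the ladder gets hotter than the last.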
The other reason we cannot have higher clock rates is power consumption, which nearly mirrors heat output. Higher clock speeds require more power. Much more power. Growing upwards just isn’t sustainable at some point.
So clock speed has more or less remained constant for a decade. What has changed?
CPUs gained multiple cores. Dual-core CPUs are common even among lower-end CPUs, and higher-end/desktop CPUs are often quad-core. That is great, but what are cores?
A core is basically a CPU. It fulfills all the same functions as the CPUs of the past did. The difference is that there are multiple cores per chip. They effectively took the processing part of a CPU, put it in a core, and put multiple cores on a CPU. This happens to solve the power and heat problem. You cannot reasonably have an 8GHz CPU, but you can have four cores running smoothly at 2GHz. Why don’t multiple cores have the same problem? Because power and heat grow much faster than linearly with clock speed, but only about linearly with core count – four modest cores cost far less than one screaming-fast one. Anyway, they work. So then, we got 4 2GHz cores, that is like an 8GHz CPU, right?
Um… no. More like… 4 2GHz cores.
Cores only do one thing at once. If there are multiple things to be done, they are done in order. With our hypothetical 8GHz core, we can do 8GHz of work. With each of our 4 cores, we can do 2GHz of work. The problem is, they cannot work on the same thing. To understand why, let’s revisit the grocery store. People are in your line, and you are working your way through the line. If I add another cashier to your line, will you go through customers faster? No. There isn’t room for another cashier, and even if there somehow was, they would be doing the exact same thing, which isn’t useful. Double-scanning an item doesn’t help us here. Instead, we could help more customers at once if we had multiple lines.
Each cashier, in this case, is a core. You can get a sum total of their speed only if all the cores are working at the same time. With computers, each core is usually running a particular program (helping a customer). Much like how multiple cashiers in a single line aren’t helpful, most programs cannot be subdivided to be run on multiple cores. Because each core runs a program and you cannot have several cores work on the same program, only one core could be used for the program, limiting you to the 2GHz speed of a single core. The other cores could be used to run other programs, but they would largely be idle most of the time. Because of this, you cannot really “sum” cores.
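Here is a toy scheduler that makes the point in code. The numbers and the “perfectly parallel” assumption are both fictions for illustration – real parallel programs never split up quite this cleanly:

```python
# A toy model of why four 2 GHz cores don't act like one 8 GHz core.
# Work is measured in "billions of cycles"; each core chews through
# 2 of those per second.

CORE_SPEED_GHZ = 2
NUM_CORES = 4

def time_to_finish(program_work, threads):
    # A program can only spread across as many cores as it has threads.
    usable_cores = min(threads, NUM_CORES)
    return program_work / (usable_cores * CORE_SPEED_GHZ)

work = 16  # billions of cycles of work to do

print(time_to_finish(work, threads=1))  # single-threaded: one core, 8 seconds
print(time_to_finish(work, threads=4))  # perfectly parallel: 2 seconds
```

The single-threaded program takes 8 seconds no matter how many cores sit idle next to it; only the (hypothetically) perfectly parallel one gets the 4x speedup.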
I said cores can only run one thing at once. How, then, can you run so many programs at the same time?
Threads enable computers to run multiple programs at the same time. Let’s say we have a single-core CPU. It can only run one thing at once. Obviously, we want to run several programs at once, so that just won’t do. Let’s go back to the grocery store again. People, being normal, join the line with the fewest customers, spreading themselves out evenly. There is a slight problem, though: only you are working. Multiple lines are waiting. Despite your attempts, you cannot duplicate yourself, so there is only one of you. How can you manage this?
People don’t like waiting in lines, in case you haven’t noticed that before. Additionally, you want to help those that got in line first. If you work each line until there are no people left, some lines have to wait longer than others, even though they all filled at the same rate. Clearly that method will not work. The answer? Help one person in your line, then move to the next line, and repeat this step until the work is complete. This gets the job done evenly and is the fairest approach.
This approach is also how a single core can run multiple programs at once, using a method known as time slicing. Basically, it divides up its time among programs that have things to run, and keeps switching between them. Each separately schedulable part of a program is known as a thread. Programs are composed of at least one thread. Now, CPUs are fast – fast enough that they can switch between threads without you noticing at all. This is how multiple programs can be run at the same time. This is also why a program freezing up doesn’t make everything else stop, because other programs still get time on the CPU. So next time a program is frozen, instead of singing “Let It Go” you can use other programs.
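The grocery-line version of time slicing fits in a few lines of Python. This is a deliberately miniature sketch – real operating system schedulers juggle priorities, sleeping threads, and much more – but the round-robin idea is the same:

```python
# A miniature round-robin scheduler: one "core" slices its time between
# three "programs", giving each one unit of work per turn.

from collections import deque

def run(programs, slice_size=1):
    """programs: dict of name -> units of work left. Returns finish order."""
    queue = deque(programs.items())
    finished = []
    while queue:
        name, remaining = queue.popleft()
        remaining -= slice_size              # the program gets one time slice
        if remaining <= 0:
            finished.append(name)            # done; it leaves the queue
        else:
            queue.append((name, remaining))  # back of the line
    return finished

print(run({"browser": 2, "editor": 1, "game": 3}))
# Everyone makes progress each round; the shortest job finishes first.
```

Note that nobody gets stuck waiting for the game’s 3 units of work to finish before being helped – exactly the “frozen program doesn’t freeze everything” behavior described above.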
Threads also help explain why you generally cannot just sum core speeds. For that to happen, the cores all need enough work to do. However, most programs are single-threaded – they only have one thread. If a program only has a single thread, it can only be run on one core, limiting you to the speed of that core. If a way was found to make the program multi-threaded – known as parallelization – then it could enjoy the speed of multiple cores. For example, server CPUs have many cores – 16, say. This works well because many of the tasks servers do can be done independently of each other. 16 cores for a typical desktop workload, on the other hand, would very likely be underutilized, even if running CPU-intensive programs.
Back to engineering. Increasing clock speeds worked for a while, but that well dried up. There is a similarly incremental process that is ongoing, though: die shrinks. CPUs are composed of billions of transistors. As manufacturing processes improved, transistors could be made smaller and smaller, enabling more of them to fit on the CPU die (chip). This is what Moore’s Law predicts, named after the co-founder of Intel. I can tell you aren’t an interesting person because there isn’t a law named after you. Moore’s Law states that the number of transistors that can be fit into a certain area doubles every two years. Note that Moore’s Law is simply an observation, and does not have any particular scientific basis. Instead, it has been used as a progress goal for CPU manufacturing. It has been followed very well, though: only in recent years has progress begun to drop off.
The benefits of Moore’s Law are readily visible. Smaller transistors result in less power usage, which in turn results in less heat output. The smaller size also grants the ability to put more transistors in the same space. The result of this is quite incredible. Computers in the past that were the size of entire rooms are now easily outpaced by computers that fit in your pocket. The rate of improvement was so rapid that things began to look outdated only a year after purchase, simply due to the massive performance increases.
Lithography refers to the size of an individual transistor (strictly speaking, it names the manufacturing process, and the number describes the smallest features that process can produce). The most recent generation of CPUs has a lithography of 14 nanometers. A nanometer is one billionth of a meter. This is some seriously small stuff.
As I said, Moore’s Law has struggled in recent years. Dates are beginning to slip. As it turns out, new problems crop up when things get really small. The size of these transistors is approaching the size of atoms. Roadmaps for smaller manufacturing processes are still being followed, but the guarantee of Moore’s Law is largely gone at this point. Still, it has brought us far.
Another frequent change in CPUs is their microarchitecture. Microarchitecture, like normal architecture, has to do with designing the structure of something, deciding what should go where and why. The goal of these successive designs is simple: make it faster. Typically each new generation of CPUs has a different microarchitecture. These architecture designs consist of the logical circuit layout of the processor, which is quite complex.
Many innovations have been found through these microarchitecture iterations that improved performance. For example, in CPU design there is the concept of pipelines. Pipelines represent the flow of instructions through a CPU. One of the biggest innovations has been executing multiple instructions at once. Prior to this, one instruction would be executed for however many cycles it took until it was complete, then the next instruction would be executed. With pipelining, there are several stages instructions go through. It is like a drive-through at a fast-food restaurant. There are multiple windows, and multiple customers are helped at the same time, but only for the specific stage they are in.
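The drive-through payoff is easy to quantify with a toy formula. The stage count below is illustrative, like the cycle counts earlier – real pipelines range from a handful of stages to twenty-plus:

```python
# How much does pipelining buy? With S pipeline stages and N instructions:
# without pipelining, each instruction occupies the whole CPU for S cycles;
# with pipelining, a new instruction enters every cycle once the pipe fills.

def cycles_without_pipeline(instructions, stages):
    return instructions * stages

def cycles_with_pipeline(instructions, stages):
    return stages + (instructions - 1)  # fill the pipe, then one per cycle

n, s = 100, 5  # hypothetical: 100 instructions, 5 pipeline stages
print(cycles_without_pipeline(n, s))  # 500 cycles
print(cycles_with_pipeline(n, s))     # 104 cycles -- nearly 5x faster
```

Once the pipeline is full, one instruction finishes every cycle, so the speedup approaches the number of stages – at least in this idealized model where nothing ever stalls.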
Related to pipelining is out-of-order execution. Suppose you asked me to divide 100 by 5, then add 3 and 5. Presuming I were a computer, the addition is easier than the division, but you asked me to do the division first. But since these problems are independent, I can actually complete them in any order I want, because it will not change the results. This allows speedups in certain circumstances.
CPUs also use branch prediction. Branches are when a single path splits into two. Programs have branches all over the place, but no leaves, mostly because they aren’t trees. Programs are composed of many logical structures like “if x, then y.” For example, if somebody’s name is longer than 20 characters, tell them they have a long name. If this condition is true, then the following instructions are executed, otherwise they aren’t. Pipelining speeds things up, but CPUs still have to wait for certain instructions to be completed before executing others. Still, idle time is wasted time for CPUs. Instead of waiting to find out if a particular branch is taken, they predict which branch will be taken, execute the instructions for it, and use the results of them only if the branch ended up being taken. In a sense, CPUs ask for forgiveness, not permission, because it makes them faster.
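To make branch prediction less abstract, here is a simulation of one classic scheme, the two-bit saturating counter. This is a sketch of the textbook version, not the far fancier predictors in modern chips:

```python
# A two-bit saturating counter: one of the simplest real branch
# predictors. It takes two wrong guesses in a row to change its mind,
# so a single odd iteration doesn't retrain it.

def predict_run(outcomes):
    state = 2      # states 0-1 mean "predict not taken"; 2-3 mean "predict taken"
    correct = 0
    for taken in outcomes:
        prediction = state >= 2
        if prediction == taken:
            correct += 1
        # Nudge the counter toward what actually happened, staying in 0..3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct

# A loop that repeats 9 times then exits: taken, taken, ..., not taken.
loop = [True] * 9 + [False]
print(f"{predict_run(loop)}/{len(loop)} predictions correct")
```

For that loop the predictor guesses right 9 times out of 10 – it only trips on the final exit. Loops are everywhere in programs, which is why even this tiny amount of state pays off so well.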
There are numerous other additions similar to these, but not all microarchitecture changes are restructuring the processing circuitry. Many microarchitecture changes replaced a component on the motherboard with something on the CPU chip itself. For example, CPUs now have an FPU (floating-point unit), which is like a CPU but for decimal numbers. CPUs at the time only worked with integers. FPUs were once separate from the CPU chip like GPUs are now, but they are now integrated onto the chip. CPUs, being the brain of the system, are also connected to all input sources like USB ports and other hardware components like RAM. Previously, communication between these components and the CPU was routed through a number of different chips, known as the chipset. Now various components of the chipset have been added directly to the CPU to improve performance.
Speaking of RAM, RAM is slow. Like you. Ba-dum-tss. That joke is gonna cost you five dollars. Anyway. I said RAM is slow, but it isn’t – RAM is actually quite fast. But it is slow relative to how fast CPUs are. Oftentimes CPUs are simply waiting for data from RAM before continuing. As always, idle time is wasted time. The problem is how to avoid it, because the RAM isn’t going to go any faster. This problem is solved with a cache. That is pronounced like cash, the green stuff. Typical French people and their silly words. The cache is kind of like a mini-memory on the CPU itself. When the CPU needs data, it first checks the cache. If it finds the data there, a trip to the RAM is saved. If it doesn’t find it, it gets the data from RAM, and stores it (along with whatever was next to it in RAM) in the cache. When it is needed again (which happens fairly frequently), the CPU will likely find it in the cache and save time. The cache is very small – less than 10MB. Because of this, only recently used memory can be stored in there. Older contents are replaced by newer contents.
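Here is the cache idea as a toy in Python. The eviction policy shown (kick out the least recently used entry) is one common approach, heavily simplified – real caches are organized very differently under the hood:

```python
# A toy CPU cache: a tiny, fast store sitting in front of slow "RAM".
# When it fills up, the least recently used entry gets evicted.

from collections import OrderedDict

class TinyCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()   # address -> value, in recency order
        self.hits = self.misses = 0

    def read(self, address, ram):
        if address in self.data:
            self.hits += 1
            self.data.move_to_end(address)     # mark as recently used
        else:
            self.misses += 1                   # slow trip to RAM
            self.data[address] = ram[address]
            if len(self.data) > self.capacity:
                self.data.popitem(last=False)  # evict least recently used
        return self.data[address]

ram = {addr: addr * 10 for addr in range(100)}
cache = TinyCache(capacity=4)
for addr in [1, 2, 1, 1, 3, 2, 1]:  # programs reuse the same data a lot
    cache.read(addr, ram)
print(f"hits={cache.hits}, misses={cache.misses}")  # hits=4, misses=3
```

Notice how the repeats turn into hits: only the first touch of each address pays the RAM toll. That reuse pattern is exactly why such a small cache saves so much time.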
As implemented in modern CPUs, the cache is actually multi-level. Generally speaking, levels one and two are local to a certain core, and all cores share access to a final level-three cache. The level-two cache is bigger than the level-one cache, and likewise the level-three cache is bigger than the level-two cache. Conceptually, the further from the CPU, the bigger the memory. The opposite is true for speed: the closer to the CPU, the faster the memory. The less-than-10MB figure given earlier is the size of the level-three cache; the level-one cache is a handful of KB instead.
CPUs are designed with certain use cases in mind. For example, the design considerations behind CPUs used in servers are much different from those for consumer desktops. These design considerations affect power usage, heat output and performance.
For example, desktop CPUs typically run at higher clock speeds than laptop CPUs, and have more cores. As a result, they perform much better. In return, they use much more power, and produce much more heat. In the case of a desktop system, this tradeoff is acceptable because most desktop users don’t care too much about power consumption, and desktop computers are much easier to cool than laptops. For the same reason, such a processor would not work well in a laptop, as the battery life would be dismal and the resulting heat would throttle the processor anyway.
Similar tradeoffs occur with mobile devices. They go further in this same direction, trading performance for battery life. That being said, generational performance increases in mobile devices have been much greater in recent years than with non-mobile CPUs. The last few desktop generations have been around 10% improvement, whereas mobile phones have seen much greater improvements, often 50% or greater. Whether that will continue or not remains to be seen, but it has been a strong force in mobile computing. The reason they improve at different rates is partially because mobile CPUs are distinct in another way from desktop CPUs.
Instruction Set Architecture
Mobile devices use ARM CPUs. Desktop/laptop CPUs, on the other hand, use x86 or x64 CPUs. ARM, x86 and x64 are all instruction sets. An instruction set is the set of instructions you can send to a CPU to execute. These instructions, as mentioned earlier, are simple things like adding numbers, or storing something in memory. They are different languages “spoken” to the processor. Like with spoken languages, there are different words that mean the same thing (e.g. different words for “add these two numbers”). CPUs are not multilingual, however: speaking ARM to an x86 is like speaking French to somebody in Germany. Due to this, programs compiled for a specific architecture cannot run on a different architecture.
Most processors use one of a few standard instruction sets today. x86, an instruction set created by Intel, became the standard on non-mobile computers, largely due to Intel’s market share for decades. ARM is used on mobile devices, and is a more recent instruction set relative to x86. It also has nothing to do with actual arms, in case you were wondering.
ARM and x86 share a core set of features. They both have instructions for common tasks, like basic arithmetic or accessing memory. In other ways, though, they have a different design philosophy. ARM has a simple and reduced instruction set. x86, in contrast, has a much more complex instruction set. The ARM philosophy is to decompose complex operations into a number of simpler instructions. The x86 philosophy is to support more complex operations in hardware with specific instructions made for them. These philosophies are known as RISC (reduced instruction set computer) and CISC (complex instruction set computer) for ARM and x86 respectively. Technically speaking, modern x86 CPUs are RISC at the core with a CISC wrapper, where complex instructions are broken apart into simpler ones. In contrast, ARM instructions are simple in the first place.
This simple/complex distinction typically extends beyond just the instruction set as well. Because ARM chips stick to doing a limited number of tasks, they have less circuitry. As with lithography, this reduced circuitry results in less power usage and less heat output, making them very efficient. They are also cheaper to produce. On the other hand, x86 CPUs dedicate much more chip space to advanced techniques such as branch prediction and other complex circuitry. This makes them use more power and create more heat but also allows them to achieve greater speed.
This explains why these respective architecture types are seen in certain use cases. x86 rules on desktops and laptops, and ARM is on mobile devices. Mobile devices need much better battery life, and cannot generate much heat because there is nowhere to dissipate it. ARM CPUs fit the bill perfectly. In the same way, desktops favor performance over power usage and x86 is more suitable in this area.
Different CPUs can use the same instruction set. For example, CPUs made by Intel and AMD both use the x86 instruction set. Likewise, a number of different vendors make ARM CPUs, like Qualcomm or Nvidia. Note that ARM (a company) doesn’t itself manufacture CPUs; instead, it designs them, and these designs are used, possibly with modifications, by chip manufacturers.
32-bit And 64-bit
Earlier I mentioned x64, then never went into detail on what it is. x64 is much like x86, with a notable difference: it has the number 64, rather than 86. Bam! I just blew your mind.
Anyway, x64 is 64-bit. x86 is 32-bit. Call now and get twice the bits at no extra cost! Computers, as you know, are binary at heart. A bit is a binary digit, a 0 or 1. With decimal numbers, we can express larger quantities by stringing together more digits. For example, with four decimal digits, the biggest number we can represent is 9999. The same concept applies with bits. With 8 bits (known as a byte), we can represent up to 255.
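The same digit-stringing logic works for any number of bits, and Python is happy to do the counting for us:

```python
# With n digits you can only count so high. Four decimal digits max out
# at 9999; the same idea applies to binary digits (bits).

def max_value(bits):
    return 2 ** bits - 1   # every bit set to 1

print(max_value(8))    # 255 -- one byte
print(max_value(32))   # 4294967295 -- about 4 billion
print(max_value(64))   # 18446744073709551615 -- a truly absurd number
```

Each extra bit doubles the range, which is why the jump from 32 to 64 bits is so much bigger than it sounds.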
When processors do things like add numbers, the numbers are a specific size. For example, consider the problem 3 + 25. For the processor, this looks more like 0003 + 0025. The numbers have to be the same size, even if they do not use all the digit places. CPUs cannot add things like a 7-bit number to a 9-bit number. Instead, they work with numbers of the same size, such as all numbers being 8-bit. This is known as the word size of the processor.
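Fixed word sizes have a sharp edge worth seeing once: results that don’t fit get truncated. Here is a sketch of 8-bit addition (real CPUs also set an overflow flag when this happens, which is omitted here):

```python
# Fixed-width arithmetic: an 8-bit adder keeps only the low 8 bits,
# so results wrap around at 256.

WORD_BITS = 8

def add_8bit(a, b):
    return (a + b) % (2 ** WORD_BITS)  # discard anything above bit 8

print(add_8bit(3, 25))      # 28 -- plenty of room
print(add_8bit(200, 100))   # 44 -- 300 doesn't fit in 8 bits, so it wraps
```

This wrap-around is one of the practical reasons word sizes kept growing: bigger words mean you run out of digits far less often.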
Over time, the word size in processors has increased. Early CPUs were 8-bit, a term which is still used to refer to graphics often seen on these systems, like old game consoles. Over time, we got 16-bit CPUs, then 32-bit CPUs. Most recently, we got 64-bit CPUs.
One of the problems with 32-bit CPUs is that they can only address a certain amount of memory, roughly 4GB. This is because the largest 32-bit number is roughly 4 billion. Because of this, if you are using a 32-bit processor, you cannot use more than 4GB of RAM. This is one of the primary reasons for the move to 64-bit, which can address far more memory.
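The 4GB ceiling falls straight out of the arithmetic: every byte of memory needs its own address, and a 32-bit number can only name so many bytes.

```python
# Why 32-bit CPUs top out around 4GB of RAM: a 32-bit address can only
# name 2^32 distinct bytes.

addressable_bytes = 2 ** 32
gigabytes = addressable_bytes / (1024 ** 3)
print(f"{addressable_bytes} addresses = {gigabytes:.0f} GB")

# A 64-bit address space, for comparison, covers over 17 billion GB:
print(2 ** 64 // (1024 ** 3))
```

Nobody is going to bump into the 64-bit ceiling any time soon, which is the point.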
In general, 64-bit CPUs are a bit faster (a bit faster, you get it?), depending on the task – say, 10% – but programs that are 64-bit use more memory. Some might say… a bit more memory. Jokes aside, 64-bit programs do use more memory. It is like taking a list of 100 binary numbers, and doubling the size of each of them. It would take twice the memory space to store them. In the same way, 64-bit programs end up using bigger numbers in a lot of places. This is also true for operating systems. The recommended minimum for running Windows 7 32-bit is 1GB; the recommended minimum for Windows 7 64-bit is 2GB. Memory usage generally does not completely double from 32-bit to 64-bit, but it can be significant – say, 30%.
x64 is effectively a 64-bit version of x86. ARM has been, for the most part, 32-bit, but in recent years 64-bit ARM CPUs have begun to appear in the mobile space.
If you download software frequently, you have likely noticed download pages that offer both 32-bit and 64-bit versions of the software. If you are using a 32-bit processor, you cannot run 64-bit applications. However, if you are using a 64-bit processor, you can run 64-bit applications as well as 32-bit applications in a compatibility mode. Generally speaking, download the version that matches your processor, though for 64-bit processors both versions will work fine.
Wow. You just read a lot of words. Here, have a dollar.
So now you presumably know a bunch about CPUs, and you want to know how to compare them. Ok, here is the secret:
You have been had. I made you read all that, and for nothing. Well, not really. It is all useful information. It just isn’t very straightforward to compare CPUs.
For example, let’s compare your 4-core 2GHz desktop from earlier with a hypothetical 4-core 2GHz mobile phone. This configuration is fairly common. You would expect these to be roughly the same speed – same number of cores, same clock rate. Reality is quite different – the phone would be far slower. This is primarily due to architectural differences, as the phone uses an ARM CPU and the desktop uses an x86 or x64 CPU. This highlights a common no-no when making CPU comparisons:
Not all gigahertz are created equal. In fact, the majority of the time they aren’t equal at all. Recall from earlier that the overall work that gets done is the product of clock speed and how much gets done in a single cycle. But when how much gets done per cycle varies greatly, as it does across different architectures (both instruction set architectures and microarchitectures), comparing just the clock speed becomes meaningless. Even comparing clock speeds from an x86 CPU from Intel and an x86 CPU from AMD has this same problem – at the time of writing, AMD CPUs get less done, clock-per-clock, than Intel CPUs. However, AMD CPUs have more cores than similar Intel CPUs.
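A rough model makes the trap obvious. The instructions-per-cycle (IPC) figures below are invented for illustration, not measurements of any real chip:

```python
# "Not all gigahertz are created equal": rough performance is clock
# speed times instructions completed per cycle (IPC). The IPC numbers
# here are made up for illustration.

def relative_performance(clock_ghz, instructions_per_cycle):
    return clock_ghz * instructions_per_cycle

phone_core = relative_performance(2.0, 1.0)    # hypothetical mobile core
desktop_core = relative_performance(2.0, 3.0)  # hypothetical desktop core

print(desktop_core / phone_core)  # 3.0 -- same GHz, triple the work done
```

Both cores advertise the same 2GHz, yet one does triple the work. Compare clock speeds alone and you miss the entire story.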
Speaking of more cores, there are some mobile devices with 8 cores. An octa-core, they call it. Let’s be honest, that is a fun word. It is slightly deceptive, though. In every case I have seen, octa-core setups aren’t 8 uniform cores. Instead, it is 4 slower cores and 4 faster cores. The slower cores are used for typical light activities – browsing the web, for example – while the faster cores are used for things like games. This allows the device to be more efficient overall. Still, it isn’t anywhere near an “actual” octa-core, where you could use all the cores at once. If you were comparing two phones where one was a quad-core and one was an octa-core, you would likely overestimate the difference between them.
This theme continues when comparing across use cases. Laptop CPUs and desktop CPUs are just different. For example, laptop versions of the same brand as the desktop version (say, an Intel i5) have different specifications entirely, as the laptop likely has two cores and the desktop has four. Even if laptop CPUs were made with the same specifications as desktop CPUs, they most likely wouldn’t be very useful: the laptop would either have to live off the plug due to power consumption, or it would be unable to cool itself enough to avoid throttling.
Date of release is also important, as it effectively combines the microarchitecture and lithography factors. Let’s say you are trying to decide between a new Intel i5 and a three-year-old Intel i7. i7’s are better, right? Not quite. In this specific case, assuming other factors like use case are the same, the i5 is likely faster than the i7 due to improvements that have been made over time.
Thus, you have to be careful when comparing CPUs. As a general rule of thumb, the more similar things are, the better you can compare them. For example, imagine comparing a CPU to itself, where the difference is overclocking (setting the clock rate higher). The overclocked version will be faster, because all other factors are the same. Similarly, comparing two i7 models of the same generation can be done fairly accurately, because many factors like microarchitecture and lithography are shared between them. The further apart CPUs are, the more risky it is to compare them, like with the desktop vs phone example.
Clearly comparing CPUs that are far apart is difficult and not very useful. How, then, can you decide between them? The most objective answer is benchmarks. Benchmarks are automated tests that measure how fast things can be computed. For example, how fast can a CPU compute the 1000th digit of pi? The important point of benchmarks is that the tests don’t change; all CPUs are subject to the same conditions. There are a number of websites that publish various benchmark results that can provide you with useful information. Unfortunately, benchmarks are synthetic. They may or may not reflect real-world performance. Thus, it is best to use several different benchmarks to get a more conclusive comparison, and try to determine which particular benchmarks are more useful indicators for what you intend to do.
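In case you are curious what a benchmark boils down to, here is a bare-bones sketch. The workload (summing a million squares) is an arbitrary stand-in; real benchmark suites run many varied workloads for exactly the reasons just discussed:

```python
# A bare-bones benchmark: run the same fixed task a few times and keep
# the best time. Fixed task + timer = comparable numbers across CPUs.

import time

def benchmark(task, repeats=3):
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        best = min(best, time.perf_counter() - start)  # keep best-of-N
    return best

def workload():
    # Arbitrary stand-in for "real work": sum the first million squares.
    return sum(i * i for i in range(1_000_000))

print(f"fastest run: {benchmark(workload):.4f} seconds")
```

Run it on two different machines and you get directly comparable numbers – for this one workload, anyway, which is precisely the “synthetic” caveat from above.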