Does storing false bool values cost less electrical energy? - cpu-architecture

Going to sleep tonight, I was wondering: if a bool (in C++, for example) is set to false, that means all of its bits (8 or 16) are set to 0, or so it seems.
A zero bit, as far as I know, means no current is flowing in some transistor, so will a false bool use less energy than a true one on a battery-powered device?
If so, would it be better to, for example, make default boolean (or maybe even other) function parameters false?
Instead of:
void DrawImage(int x, int y, bool cached = true);
Do
void DrawImage(int x, int y, bool not_cached = false);

"A zero bit, as far as I know, means no current is flowing in some transistor, so will a false bool use less energy than a true one on a battery-powered device?"
No, not for that reason. CMOS logic has no current flowing in either static state, only in the transition between states (to charge / discharge the parasitic capacitance, and any shoot-through current that flows as the pull-up and pull-down transistors both partially conduct for a moment). Apart from leakage current, of course, which is somewhat significant at lower clock speeds.
CMOS is more or less symmetric, except for differences between N-channel and P-channel MOSFETs, so 1 isn't different from 0 in terms of voltage states and how transistors let charge flow.
You'd be right for the output of one gate in some other logic families like TTL (bipolar transistors with pull-up resistors), where a transistor would pull current to ground through a pull-up resistor or not. But only for one gate; usually logic involves multiple inversions, because an amplifier naturally inverts (in CMOS or TTL or RTL).
Also only for 1 bit out of 64 in the register used for arg-passing. The CPU's pipeline state, and the out-of-order execution machinery, take vastly more transistors (and gates) than just the actual architectural state (register values) and data being operated on. So the state of 1 bit is pretty negligible.
The large number of tiny transistors in a CPU is why CPUs have used CMOS logic for decades, otherwise those static currents through pull-up resistors in RTL or TTL would melt them.
Even with CMOS, power density has been a problem since the early 2000s (the "power wall" for frequency scaling, as described in Modern Microprocessors: A 90-Minute Guide!, which is pretty essential reading if you want to know more about CPU design considerations). In CMOS, it takes higher voltages to switch faster (about linearly), and the energy in a capacitor scales with V^2. And current only flows in CMOS when a gate switches from 0 to 1 or vice versa, and the rate of that happening is some factor of the CPU clock. So running at the minimum voltage for a given frequency, power scales with about f^3.
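To make the f^3 scaling concrete, here is a minimal sketch in arbitrary units. It assumes the required voltage is exactly proportional to frequency, which is a simplification (real chips have a voltage floor); the function name and parameters are made up for illustration.

```python
def relative_dynamic_power(freq, volts_per_unit_freq=1.0, capacitance=1.0):
    """Dynamic CMOS power ~ C * V^2 * f, in arbitrary units."""
    # Assumption: the minimum usable voltage rises linearly with frequency.
    voltage = volts_per_unit_freq * freq
    return capacitance * voltage ** 2 * freq


base = relative_dynamic_power(1.0)
for scale in (0.5, 1.0, 2.0):
    ratio = relative_dynamic_power(scale) / base
    print(f"{scale}x clock -> {ratio:g}x power")
```

Under this model, doubling the clock costs eight times the power, and halving it costs one eighth.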
Other factors that could make creating a 0 cheaper
On x86, xor edi, edi is a cheaper instruction than mov edi, 1. On Sandybridge-family CPUs, it doesn't even need an execution unit in the back-end, so that's definitely some transistors that didn't need to be switching. It's also a smaller instruction (2 bytes vs. 5, or 3 for mov dil, 1 to save code size at the cost of partial-register performance penalties). So passing a 0 can perhaps improve performance, letting the same number of instructions finish sooner, letting the CPU get back to sleep sooner (race to sleep). Or not: there might easily be no effect, or different code alignment of later instructions might happen to be better with the longer instruction.
Most other ISAs don't have as much difference between zeroing vs. setting 1 in a register. And even on x86, this is not generally a very valuable optimization.
But still, if you have a choice for one value to be special, 0 is a good choice, especially for non-bool integers since it's slightly more efficient to test for 0 vs. non-0 than for any other number. (So for example if you're using plain int, x != 0 is cheaper than x == 1. With a bool, a compiler can already just test for non-zero if you do b == true.)

While this would probably technically save energy, the amount saved is negligible, as only a single bit is being set to true or false, and the lengths you would have to go to in order to make this worthwhile are unreasonable considering the tradeoff.
Even then, making the reader jump through more mental hoops, making your code harder to read by having to think twice about a bool, is a bad idea. Interesting to think about, though.

Related

Why do we need to specify the number of flash wait cycles?

Especially when working with "faster" devices like the STM32F4xx/F7xx, we need to specify the number of flash wait cycles, based on the supply voltage and the sys-clock frequency.
When the CPU fetches instructions or constants, this is done over the FLITF. Am I right in assuming that the FLITF holds a CPU request until it can provide the requested data, making it impossible for other bus masters to access flash in the meantime?
If this is true, why should it matter to any interface how many flash wait cycles there are? The cache preloads instructions either way, regardless of whether it knows how long to wait, no?
Because the flash interface isn't magic.
It has to meet the necessary setup and hold times for addressing and reading out the flash cells, which will vary somewhat depending on voltage. Taking the STM32F411 as an example (because I have that TRM handy), doing some maths with the voltage/frequency/wait-state table implies that a read from flash on one of those takes in the order of ~30ns above 2.7V, down to ~60ns below 2.1V.
Since the flash interface doesn't have its own asynchronous nanosecond-precision timekeeping ability (because that would be needlessly complicated, power-hungry, and silly), that translates to asserting its signals for n clock cycles, after which it can assume the data signals from the cells are stable enough to read back*. How does it know what the clock frequency is, and therefore what n should be? Simple: you, as the programmer who set the clock, tell it. Some hardware things are just infinitely easier to let software deal with.
* and then going through the further shenanigans of extracting the relevant 8, 16 or 32 bits out of the 128-bit line it's read, to finally spit that out the other side onto the AHB bus to the waiting CPU, obviously.
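The relationship between access time, clock frequency and n can be sketched as follows. This is only an illustration: the 30 ns and 60 ns figures are the rough estimates from the text above, the function name is hypothetical, and on real hardware the value must be taken from the voltage/frequency table in the reference manual. It assumes the STM32 convention that 0 wait states means a one-cycle access.

```python
def flash_wait_states(access_time_ns, hclk_mhz):
    """Wait states needed so that (wait_states + 1) clock cycles cover the access time."""
    # Ceiling division in integer arithmetic: cycles = ceil(access_time * frequency).
    cycles = (access_time_ns * hclk_mhz + 999) // 1000
    # The first cycle is not a wait state, so subtract it.
    return max(cycles - 1, 0)


# Rough figures from the text: ~30 ns above 2.7 V, ~60 ns below 2.1 V.
for mhz in (16, 30, 60, 100):
    print(mhz, "MHz:", flash_wait_states(30, mhz), "WS at ~30 ns,",
          flash_wait_states(60, mhz), "WS at ~60 ns")
```

The hardware cannot do this calculation itself because it only sees clock edges, not nanoseconds, which is why the programmer supplies the result.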

Why are some Haswell AVX latencies advertised by Intel as 3x slower than Sandy Bridge?

In the Intel intrinsics webapp, several operations seem to have worsened from Sandy Bridge to Haswell. For example, many insert operations like _mm256_insertf128_si256 show a cost table like the following:
Performance
Architecture   Latency   Throughput
Haswell        3         -
Ivy Bridge     1         -
Sandy Bridge   1         -
I found this difference puzzling. Is this difference because there are new instructions that replace these ones or something that compensates for it (which ones)? Does anyone know if Skylake changes this model further?
TL;DR: all lane-crossing shuffles / inserts / extracts have 3c latency on Haswell/Skylake, but 2c latency on SnB/IvB, according to Agner Fog's testing.
This is probably 1c in the execution unit + an unavoidable bypass delay of some sort, because the actual execution units in SnB through Broadwell have standardized latencies of 1, 3, or 5 cycles, never 2 or 4 cycles. (SKL makes some uops 4c, including FMA/ADDPS/MULPS).
(Note that on AMD CPUs that do AVX1 with 128b ALUs (e.g. Bulldozer/Piledriver/Steamroller), insert128/extract128 are much faster than shuffles like VPERM2F128.)
The intrinsics guide has bogus data sometimes. I assume it's meant to be for the reg-reg form of instructions, except in the case of the load intrinsics. Even when it's correct, the intrinsics guide doesn't give a very detailed picture of performance; see below for discussion of Agner Fog's tables/guides.
(One of my pet peeves with intrinsics is that it's hard to use PMOVZX / PMOVSX as a load, because the only intrinsics provided take a __m128i source, even though pmovzxbd only loads 4B or 8B (ymm). It and/or broadcast-loads (_mm_set1_* with AVX1/2) are a great way to compress constants in memory. There should be intrinsics that take a const char* (because that's allowed to alias anything).)
In this case, Agner Fog's measurements show that SnB/IvB have 2c latency for reg-reg vinsertf128/vextractf128, while his measurements for Haswell (3c latency, one per 1c tput) agree with Intel's table. So it's another case where the numbers in Intel's intrinsics guide are wrong. It's great for finding the right intrinsic, but not a good source for reliable performance numbers. It doesn't tell you anything about execution ports or total uops, and often omits even the throughput numbers. Latency is often not the limiting factor in vector integer code anyway. This is probably why Intel let the latencies increase for Haswell.
The reg-mem form is significantly different. vinsertf128 y,y,m,i has lat/recip-tput of: IvB:4/1, Haswell/BDW:4/2, SKL:5/0.5. It's always a 2-uop instruction (fused domain), using one ALU uop. IDK why the throughput is so different. Maybe Agner tested slightly differently?
Interestingly, vextractf128 mem,reg,i doesn't use any ALU uops. It's a 2-fused-domain-uop instruction that only uses the store-data and store-address ports, not the shuffle unit. (Agner Fog's table lists it as using one p015 uop on SnB, 0 on IvB. But even on SnB, it doesn't have a mark in any specific column, so IDK which one is right.)
It's silly that vextractf128 wastes a byte on an immediate operand. I guess they didn't know they were going to use EVEX for the next vector length extension, and were preparing for the immediate to go from 0..3. But for AVX1/2, you should never use that instruction with the immediate = 0. Instead, just movups mem, xmm or movaps xmm,xmm. (I think compilers know this, and do that when you use the intrinsic with index = 0, like they do for _mm_extract_epi32 and so on (movd).)
Latency is more often a factor in FP code, and Skylake is a monster for FP ALUs. They managed to drop the latency for FMA to 4 cycles, so mulps/addps/fma...ps are all 4c latency with one per 0.5c throughput. (Broadwell was mulps/addps = 3c latency, fma = 5c latency. Haswell was addps=3c latency, mul/fma=5c). Skylake dropped the separate add unit, so addps actually worsened from 3c to 4c, but with twice the throughput. (Haswell/BDW only did addps with one per 1c throughput, half that of mul/fma.) So using many vector accumulators is essential in most FP algorithms for keeping 8 or 10 FMAs in flight at once to saturate the throughput, if there's a loop-carried dependency. Otherwise if the loop body is small enough, out-of-order execution will have multiple iterations in flight at once.
Integer in-lane ops are typically only 1c latency, so you need a much smaller amount of parallelism to max out the throughput (and not be limited by latency).
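The "8 or 10 FMAs in flight" figure is just latency multiplied by throughput. A minimal sketch using the latency and throughput numbers quoted above (the function name is made up for illustration):

```python
def ops_in_flight(latency_cycles, ops_per_cycle):
    """Independent operations needed to keep the execution units busy every cycle."""
    return latency_cycles * ops_per_cycle


# One per 0.5c throughput means 2 operations started per cycle.
print("Skylake FMA (4c latency):", ops_in_flight(4, 2))
print("Haswell FMA (5c latency):", ops_in_flight(5, 2))
```

With a loop-carried dependency through the accumulator, that product is also the number of separate vector accumulators needed to avoid being limited by latency.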
None of the other options for getting data into/out-of the high half of a ymm are any better
vperm2f128 or AVX2 vpermps are more expensive. Going through memory will cause a store-forwarding failure -> big latency for insert (2 narrow stores -> wide load), so it's obviously bad. Don't try to avoid vinsertf128 in cases where it's useful.
As always, try to use the cheapest instruction sequences possible. e.g. for a horizontal sum or other reduction, always reduce down to a 128b vector first, because cross-lane shuffles are slow. Usually it's just vextractf128 / addps xmm, then the usual horizontal 128b.
As Mysticial alluded to, Haswell and later have half the in-lane vector shuffle throughput of SnB/IvB for 128b vectors. SnB/IvB can pshufb / pshufd with one per 0.5c throughput, but only one per 1c for shufps (even the 128b version); same for other shuffles that have a ymm version in AVX1 (e.g. vpermilps, which apparently exists only so FP load-and-shuffle can be done in one instruction). Haswell got rid of the 128b shuffle unit on port1 altogether, instead of widening it for AVX2.
re: skylake
Agner Fog's guides/insn tables were updated in December to include Skylake. See also the x86 tag wiki for more links. The reg,reg form has the same performance as in Haswell/Broadwell.

Measure the electricity consumed by a browser to render a webpage

Is there a way to calculate the electricity consumed to load and render a webpage (frontend)? I was thinking of a 'test' made with phantomjs for example:
load a web page
scroll to the bottom
And measure how much electricity was needed. I can perhaps extrapolate from CPU cycles. But phantomjs is headless, and rendering in a real browser is certainly different. Perhaps it's impossible to do real measurements, but with an index it may be possible to compare websites.
Do you have other suggestions?
It's pretty much impossible to measure this internally in modern processors (anything more recent than 286). By internally, I mean by counting cycles. This is because different parts of the processor consume different levels of energy per cycle depending upon the instruction.
That said, you can make your measurements. Stick a power meter between the wall and the processor. Here's a procedure:
Measure the baseline energy usage, i.e. nothing running except the OS and the browser, and the browser completely static (i.e. not doing anything). You need to make sure that everything is in steady state (SS), meaning you start your measurements only after several minutes of idle.
Measure the usage doing the operation you want. Again, you want to avoid any start up and stopping work, so make sure you start measuring at least 15 seconds after you start the operation. Stopping isn't an issue since the browser will execute any termination code after you finish your measurement.
Sounds simple, right? Unfortunately, because of the nature of your measurements, there are some gotchas.
Do you recall your physics classes (or EE classes) that talked about signal to noise ratios? Well, a scroll down uses very little energy, so the signal (scrolling) is well in the noise (normal background processes). This means you have to take a LOT of samples to get anything useful.
Your browser startup energy usage, or anything else that uses a decent amount of processing, is much easier to measure (better signal to noise ratio).
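The arithmetic behind the procedure above can be sketched like this. It assumes a power meter that logs instantaneous power in watts; the function names and the sample readings are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def net_energy_joules(baseline_watts, active_watts, duration_s):
    """Energy attributable to the operation: extra average power times duration."""
    return (mean(active_watts) - mean(baseline_watts)) * duration_s


def standard_error(samples):
    """Uncertainty of the mean; it shrinks with the square root of the sample count."""
    return stdev(samples) / sqrt(len(samples))


# Hypothetical readings in watts, not real measurements.
baseline = [20.1, 19.8, 20.3, 19.9]
active = [20.6, 20.2, 20.9, 20.4]
print("net energy over 10 s:", net_energy_joules(baseline, active, 10.0), "J")
print("standard error of baseline:", standard_error(baseline), "W")
```

If the difference between the two means is not several times larger than the standard errors, the signal is still buried in the noise and more samples are needed.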
Also, make sure you understand the underlying electronics. For example, power is VA (voltage*amperage) where both V and A are in phase. I don't think this will be an issue since I'm pretty sure they are in phase for computers. Also, any decent power meter understands the difference.
I'm guessing you intend to do this for mobile devices. Your measurements will only be roughly the same from processor to processor. This is due to architectural differences from generation to generation, and from manufacturer to manufacturer.
Good luck.

Virtual Memory Page Replacement Algorithms

I have a project where I am asked to develop an application to simulate how different page replacement algorithms perform (with varying working set size and stability period). My results:
Vertical axis: page faults
Horizontal axis: working set size
Depth axis: stable period
Are my results reasonable? I expected LRU to have better results than FIFO. Here, they are approximately the same.
For random, stability period and working set size don't seem to affect the performance at all? I expected graphs similar to FIFO & LRU, just with worse performance. If the reference string is highly stable (few branches) and has a small working set size, shouldn't it still have fewer page faults than an application with many branches and a big working set size?
More Info
My Python Code | The Project Question
Length of reference string (RS): 200,000
Size of virtual memory (P): 1000
Size of main memory (F): 100
Number of times a page is referenced (m): 100
Size of working set (e): 2 - 100
Stability (t): 0 - 1
Working set size (e) & stable period (t) affect how reference strings are generated.
|-----------|--------|------------------------------------|
0           p        p+e                                P-1
So assume the above is the virtual memory of size P. To generate reference strings, the following algorithm is used:
Repeat until reference string generated
    pick m numbers in [p, p+e]. m simulates or refers to number of times page is referenced
    pick random number, 0 <= r < 1
    if r < t
        generate new p
    else (++p)%P
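A minimal Python sketch of the generation steps above (not the asker's actual code, which is not shown here). The function and parameter names are hypothetical, and wrapping page numbers modulo P when p+e runs past P-1 is an assumption, since the description does not say what happens there.

```python
import random


def generate_reference_string(length, P, e, m, t, seed=None):
    """Generate page references with locality: m picks from [p, p+e], then move p."""
    rng = random.Random(seed)
    refs = []
    p = rng.randrange(P)
    while len(refs) < length:
        # Pick m page numbers from the current working set [p, p+e].
        for _ in range(m):
            # Assumption: wrap around so every page number stays in [0, P-1].
            refs.append(rng.randint(p, p + e) % P)
        # With probability t jump to a new locality, otherwise slide by one page.
        if rng.random() < t:
            p = rng.randrange(P)
        else:
            p = (p + 1) % P
    return refs[:length]


refs = generate_reference_string(length=1000, P=100, e=10, m=20, t=0.1, seed=1)
print(refs[:20])
```

Note that page numbers here are integers in a small range, so the same page is referenced many times within one working set.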
UPDATE (In response to @MrGomez's answer)
"However, recall how you seeded your input data: using random.random, thus giving you a uniform distribution of data with your controllable level of entropy. Because of this, all values are equally likely to occur, and because you've constructed this in floating point space, recurrences are highly improbable."
I am using random, but it is not totally random either: aren't references generated with some locality through the use of the working set size and number-of-pages-referenced parameters?
I tried increasing numPageReferenced relative to numFrames in the hope that it would reference a page currently in memory more often, thus showing the performance benefit of LRU over FIFO, but that didn't give me a clear result. Just FYI, I tried the same app with the following parameters (the Pages/Frames ratio is still kept the same; I reduced the size of the data to make things faster).
--numReferences 1000 --numPages 100 --numFrames 10 --numPageReferenced 20
The result is
Still not such a big difference. Am I right to say if I increase numPageReferenced relative to numFrames, LRU should have a better performance as it is referencing pages in memory more? Or perhaps I am mis-understanding something?
For random, I am thinking along the lines of:
Suppose there's high stability and a small working set. It means that the pages referenced are very likely to be in memory. So the need for the page replacement algorithm to run is lower?
Hmm maybe I got to think about this more :)
UPDATE: Thrashing less obvious at lower stability
Here, I am trying to show the thrashing as the working set size exceeds the number of frames (100) in memory. However, notice that thrashing appears less obvious with lower stability (high t). Why might that be? Is the explanation that as stability becomes low, page faults approach the maximum, so it does not matter as much what the working set size is?
These results are reasonable given your current implementation. The rationale behind that, however, bears some discussion.
When considering algorithms in general, it's most important to consider the properties of the algorithms currently under inspection. Specifically, note their corner cases and best and worst case conditions. You're probably already familiar with this terse method of evaluation, so this is mostly for the benefit of those reading here who may not have an algorithmic background.
Let's break your question down by algorithm and explore their component properties in context:
FIFO shows an increase in page faults as the size of your working set (length axis) increases.
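This is correct behavior. With the number of frames fixed, a larger working set means more distinct pages compete for the same frames, so the number of page faults should also increase. (Bélády's anomaly, strictly speaking, is a different FIFO property: for some reference strings, adding frames can increase the number of faults.)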
FIFO shows an increase in page faults as system stability (1 - depth axis) decreases.
Noting your algorithm for seeding stability (if random.random() < stability), your results become less stable as stability (S) approaches 1. As you sharply increase the entropy in your data, the number of page faults, too, sharply increases and propagates the Bélády's anomaly.
So far, so good.
LRU shows consistency with FIFO. Why?
Note your seeding algorithm. Standard LRU is most optimal when you have paging requests that are structured to smaller operational frames. For ordered, predictable lookups, it improves upon FIFO by aging off results that no longer exist in the current execution frame, which is a very useful property for staged execution and encapsulated, modal operation. Again, so far, so good.
However, recall how you seeded your input data: using random.random, thus giving you a uniform distribution of data with your controllable level of entropy. Because of this, all values are equally likely to occur, and because you've constructed this in floating point space, recurrences are highly improbable.
As a result, your LRU is perceiving each element to occur a small number of times, then to be completely discarded when the next value was calculated. It thus correctly pages each value as it falls out of the window, giving you performance exactly comparable to FIFO. If your system properly accounted for recurrence or a compressed character space, you would see markedly different results.
For random, stability period and working set size doesn't seem to affect the performance at all. Why are we seeing this scribble all over the graph instead of giving us a relatively smooth manifold?
In the case of a random paging scheme, you age off each entry stochastically. Purportedly, this should give us some form of a manifold bound to the entropy and size of our working set... right?
Or should it? For each set of entries, you randomly assign a subset to page out as a function of time. This should give relatively even paging performance, regardless of stability and regardless of your working set, as long as your access profile is again uniformly random.
So, based on the conditions you are checking, this is entirely correct behavior consistent with what we'd expect. You get an even paging performance that doesn't degrade with other factors (but, conversely, isn't improved by them) that's suitable for high load, efficient operation. Not bad, just not what you might intuitively expect.
So, in a nutshell, that's the breakdown as your project is currently implemented.
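For comparison, here is a minimal sketch of FIFO and LRU fault counting, not the asker's implementation. The function names are hypothetical, and the reference string is the small textbook example commonly used to illustrate Bélády's anomaly rather than data from this project. On a string with real recurrence like this one, the two policies do give different fault counts.

```python
from collections import OrderedDict, deque


def fifo_faults(refs, num_frames):
    """Count page faults when evicting the page that was loaded earliest."""
    frames = set()
    load_order = deque()
    faults = 0
    for page in refs:
        if page in frames:
            continue  # a hit does not change FIFO order
        faults += 1
        if len(frames) == num_frames:
            frames.discard(load_order.popleft())
        frames.add(page)
        load_order.append(page)
    return faults


def lru_faults(refs, num_frames):
    """Count page faults when evicting the least recently used page."""
    frames = OrderedDict()  # ordered from least to most recently used
    faults = 0
    for page in refs:
        if page in frames:
            frames.move_to_end(page)  # a hit refreshes recency
            continue
        faults += 1
        if len(frames) == num_frames:
            frames.popitem(last=False)
        frames[page] = True
    return faults


refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
for n in (3, 4):
    print(n, "frames: FIFO", fifo_faults(refs, n), "faults, LRU", lru_faults(refs, n), "faults")
```

Feeding both functions the same generated reference string is a quick way to check whether near-identical curves come from the policies or from the input data.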
As an exercise in further exploring the properties of these algorithms in the context of different dispositions and distributions of input data, I highly recommend digging into scipy.stats to see what, for example, a Gaussian or logistic distribution might do to each graph. Then, I would come back to the documented expectations of each algorithm and draft cases where each is uniquely most and least appropriate.
All in all, I think your teacher will be proud. :)

Why are CPU registers fast to access?

Register variables are a well-known way to get fast access (register int i). But why are registers on the top of hierarchy (registers, cache, main memory, secondary memory)? What are all the things that make accessing registers so fast?
Registers are circuits which are literally wired directly to the ALU, which contains the circuits for arithmetic. Every clock cycle, the register unit of the CPU core can feed a half-dozen or so variables into the other circuits. Actually, the units within the datapath (ALU, etc.) can feed data to each other directly, via the bypass network, which in a way forms a hierarchy level above registers — but they still use register-numbers to address each other. (The control section of a fully pipelined CPU dynamically maps datapath units to register numbers.)
The register keyword in C does nothing useful and you shouldn't use it. The compiler decides what variables should be in registers and when.
Registers are a core part of the CPU, and much of the instruction set of a CPU will be tailored for working against registers rather than memory locations. Accessing a register's value will typically require very few clock cycles (likely just 1). As soon as memory is accessed, things get more complex: cache controllers / memory buses get involved and the operation is going to take considerably more time.
Several factors lead to registers being faster than cache.
Direct vs. Indirect Addressing
First, registers are directly addressed based on bits in the instruction. Many ISAs encode the source register addresses in a constant location, allowing them to be sent to the register file before the instruction has been decoded, speculating that one or both values will be used. The most common memory addressing modes indirect through a register. Because of the frequency of base+offset addressing, many implementations optimize the pipeline for this case. (Accessing the cache at different stages adds complexity.) Caches also use tagging and typically use set associativity, which tends to increase access latency. Not having to handle the possibility of a miss also reduces the complexity of register access.
Complicating Factors
Out-of-order implementations and ISAs with stacked or rotating registers (e.g., SPARC, Itanium, XTensa) do rename registers. Specialized caches such as Todd Austin's Knapsack Cache (which directly indexes the cache with the offset) and some stack cache designs (e.g., using a small stack frame number and directly indexing a chunk of the specialized stack cache using that frame number and the offset) avoid register read and addition. Signature caches associate a register name and offset with a small chunk of storage, providing lower latency for accesses to the lower members of a structure. Index prediction (e.g., XORing offset and base, avoiding carry propagation delay) can reduce latency (at the cost of handling mispredictions). One could also provide memory addresses earlier for simpler addressing modes like register indirect, but accessing the cache in two different pipeline stages adds complexity. (Itanium only provided register indirect addressing, with optional post-increment.) Way prediction (and hit speculation in the case of direct mapped caches) can reduce latency (again with misprediction handling costs). Scratchpad (a.k.a. tightly coupled) memories do not have tags or associativity and so can be slightly faster (as well as have lower access energy) and once an access is determined to be to that region a miss is impossible. The contents of a Knapsack Cache can be treated as part of the context and the context not be considered ready until that cache is filled. Registers could also be loaded lazily (particularly for Itanium stacked registers), theoretically, and so have to handle the possibility of a register miss.
Fixed vs. Variable Size
Registers are usually fixed size. This avoids the need to shift the data retrieved from aligned storage to place the actual least significant bit into its proper place for the execution unit. In addition, many load instructions sign extend the loaded value, which can add latency. (Zero extension is not dependent on the data value.)
Complicating Factors
Some ISAs do support sub-registers, notably x86 and zArchitecture (descended from S/360), which can require pre-shifting. One could also provide fully aligned loads at lower latency (likely at the cost of one cycle of extra latency for other loads); subword loads are common enough and the added latency small enough that special casing is not common. Sign extension latency could be hidden behind carry propagation latency; alternatively sign prediction could be used (likely just speculative zero extension) or sign extension treated as a slow case. (Support for unaligned loads can further complicate cache access.)
Small Capacity
A typical register file for an in-order 64-bit RISC will be only about 256 bytes (32 8-byte registers). 8KiB is considered small for a modern cache. This means that multiplying the physical size and static power to increase speed has a much smaller effect on the total area and static power. Larger transistors have higher drive strength and other area-increasing design factors can improve speed.
Complicating Factors
Some ISAs have a large number of architected registers and may have very wide SIMD registers. In addition, some implementations add additional registers for renaming or to support multithreading. GPUs, which use SIMD and support multithreading, can have especially high capacity register files; GPU register files are also different from CPU register files in typically being single ported, accessing four times as many vector elements of one operand/result per cycle as can be used in execution (e.g., with 512-bit wide multiply-accumulate execution, reading 2 Kibit (2048 bits) of each of three operands and writing 2 Kibit of the result).
Common Case Optimization
Because register access is intended to be the common case, area, power, and design effort is more profitably spent to improve performance of this function. If 5% of instructions use no source registers (direct jumps and calls, register clearing, etc.), 70% use one source register (simple loads, operations with an immediate, etc.), 25% use two source registers, and 75% use a destination register, while 50% access data memory (40% loads, 10% stores) — a rough approximation loosely based on data from SPEC CPU2000 for MIPS —, then more than three times as many of the (more timing-critical) reads are from registers than memory (1.3 per instruction vs. 0.4) and
Complicating Factors
Not all processors are designed for "general purpose" workloads. E.g., a processor using in-memory vectors and targeting dot product performance, using registers for vector start address, vector length, and an accumulator, might have little reason to optimize register latency (extreme parallelism simplifies hiding latency), and memory bandwidth would be more important than register bandwidth.
Small Address Space
A last, somewhat minor advantage of registers is that the address space is small. This reduces the latency for address decode when indexing a storage array. One can conceive of address decode as a sequence of binary decisions (this half of a chunk of storage or the other). A typical cache SRAM array has about 256 wordlines (columns, index addresses) — 8 bits to decode — and the selection of the SRAM array will typically also involve address decode. A simple in-order RISC will typically have 32 registers — 5 bits to decode.
Complicating Factors
Modern high-performance processors can easily have 8 bit register addresses (Itanium had more than 128 general purpose registers in a context and higher-end out-of-order processors can have even more registers). This is also a less important consideration relative to those above, but it should not be ignored.
Conclusion
Many of the above considerations overlap, which is to be expected for an optimized design. If a particular function is expected to be common, not only will the implementation be optimized but the interface as well. Limiting flexibility (direct addressing, fixed size) naturally aids optimization and smaller is easier to make faster.
Registers are essentially internal CPU memory. So accesses to registers are easier and quicker than any other kind of memory accesses.
Smaller memories are generally faster than larger ones; they can also require fewer bits to address. A 32-bit instruction word can hold three four-bit register addresses and have lots of room for the opcode and other things; one 32-bit memory address would completely fill up an instruction word leaving no room for anything else. Further, the time required to address a memory increases at a rate more than proportional to the log of the memory size. Accessing a word from a 4 gig memory space will take dozens if not hundreds of times longer than accessing one from a 16-word register file.
A machine that can handle most information requests from a small fast register file will be faster than one which uses a slower memory for everything.
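To illustrate the instruction-word point, here is a sketch of a made-up 32-bit encoding with three 4-bit register fields. The layout (20 bits of opcode and other fields, then rd, rs1, rs2) is hypothetical and does not correspond to any real ISA.

```python
REG_BITS = 4
REG_MASK = (1 << REG_BITS) - 1        # 0xF, enough to address 16 registers
OTHER_BITS = 32 - 3 * REG_BITS        # 20 bits left for the opcode and other things


def encode(opcode, rd, rs1, rs2):
    """Pack an opcode and three register numbers into one 32-bit word."""
    assert 0 <= opcode < (1 << OTHER_BITS)
    assert all(0 <= r <= REG_MASK for r in (rd, rs1, rs2))
    return (opcode << 12) | (rd << 8) | (rs1 << 4) | rs2


def decode(word):
    """Unpack the fields; each register number is just a 4-bit slice of the word."""
    return (word >> 12, (word >> 8) & REG_MASK, (word >> 4) & REG_MASK, word & REG_MASK)


word = encode(opcode=1, rd=2, rs1=3, rs2=4)
print(hex(word), decode(word), "bits left over:", OTHER_BITS)
```

Three register operands cost only 12 bits here, whereas a single full 32-bit memory address would consume the entire word.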
Every microcontroller has a CPU, as Bill mentioned, that has the basic components of an ALU, some RAM, and other forms of memory to assist with its operations. The RAM is what you are referring to as main memory.
The ALU handles all of the arithmetic and logical operations. To operate on any operands for these calculations, it loads the operands into registers, performs the operations on these, and then your program accesses the stored result in these registers directly or indirectly.
Since registers are closest to the heart of the CPU (a.k.a. the brain of your processor), they are higher up in the chain, and of course operations performed directly on registers take the fewest clock cycles.