Relation between CPI and number of execution units when looking at SIMD intrinsics [duplicate] - cpu-architecture

This question already has answers here:
latency vs throughput in intel intrinsics
(1 answer)
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
(1 answer)
How many CPU cycles are needed for each assembly instruction?
(5 answers)
Closed 10 days ago.
I understand that the term Cycle Per Instruction closely relates to the superscalarity of the processor, a term which I have not fully understood. According to Wikipedia, "...a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor". In the same article, there is a hint that superscalarity is not necessarily related to instruction pipelining, a concept with which I'm fairly familiar.
Now, let's get concrete by taking the example of _mm256_shuffle_ps, which, according to https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#avxnewtechs=AVX,AVX2,FMA, has a CPI of 0.5 for the Alder Lake micro-architecture.
Questions:
Can I assume that there are exactly 2 identical execution units which execute _mm256_shuffle_ps in all Alder Lake chips?
How can a programmer know which separate instructions involve the same executions units?
If there are different numbers of execution units for different instructions (such as _mm256_shuffle_ps), how does the statement "X is a 4-way superscalar processor" make sense, seeing as no one number could describe the distinct multiplicities of each execution unit?
Thanks in advance for the transfer of knowledge.

Superscalar is usually a term you'd apply to CPU's of old, e.g. the original pentium. Back in those days, you'd have two seperate pipes, the U (primary) and V (secondary) pipe, which would allow you to potentially dispatch two instructions at the same time (i.e. it had 2 execution units). It was effectively a way of getting slightly better performance from an in-order processor core (although that came with caveats - e.g. pipeline bubbles could be an issue)
These days processors tend to use Out of Order Execution (OOOE) backed by a larger number of execution units. Alder Lake CPU's have 12 execution units, however those execution units tend to be specialised to some extent - e.g. load/store, pointer arithmetic, SIMD FPU units, etc. That's why you won't see 12 execution units capable of performing a shuffle. It can dispatch 12 micro-ops per cycle, but those ops can't all be the same instruction.
Can I assume that there are exactly 2 identical execution units which execute _mm256_shuffle_ps in all Alder Lake chips?
No, you can't assume that. You can assume that there are two execution units which are capable of executing _mm256_shuffle_ps, but that doesn't mean those two units are identical. For example, we can see there are 3 execution units that can work on 256bit YMM registers, and we can see from the instruction timings that all 3 can perform _mm_add_epi32. However, only 2 can perform _mm_shuffle_ps, and only 1 can perform _mm_div_ps, so they are clearly not the same....
How can a programmer know which separate instructions involve the same executions units?
Unless the manufacturer explicitly states the capabilities of each execution port (sometimes you'll find that info in the technical manual for the CPU), you're pretty much limited to making educated guesses (e.g. the Apple M1)
If there are different numbers of execution units for different instructions (such as _mm256_shuffle_ps), how does the statement "X is a 4-way superscalar processor" make sense, seeing as no one number could describe the distinct multiplicities of each execution unit?
Modern Intel processors are not superscalar, therefore describing them as such makes no sense at all.
Alder Lake is able to dispatch 12 instructions per clock, using Out-Of-Order-Execution. The types of instruction the execution units can handle, is typically geared up to cover a range of common cases. For example, consider this code:
void func(float* r, float* a, float* b) {
// basic integer ops: increment and less-than
for(int i = 0; i < 128; ++i) {
// 2 address manipulation instructions
float* addr_a = a + i * 4;
float* addr_b = b + i * 4;
// 2 load instructions
__m128 A = _mm_load_ps(addr_a);
__m128 B = _mm_load_ps(addr_b);
// an addition
__m128 R = _mm_add_ps(A, B);
// another address manipulation op
float* addr_r = r + i * 4;
// a store instruction
_mm_store_ps(addr_r, R);
}
}
Providing 12 execution units that are all capable of executing an _mm_add_ps instruction doesn't really make any sense. It makes more sense to balance the number of SIMD execution units with all those other common tasks (e.g. address manipulation, looping, etc).

Related

Reciprocal cost allocation between units servicing each other (typical managerial accounting problem) in T-SQL

I am desperately searching for an efficient way - if there is one - to solve some kind of a recursive task in T-SQL (I could successfully model it in excel and on paper with an iterative solution - as many CMAs would for a small example, re-allocating shares of cost between pairs of support units serving each other in iterations and minimising the balancing unit's unallocated cost leftover to a reasonably small number to stop iterations/recursion).
Now I am trying to find a good scalable solution (or at least a feasible approach to it) how to achieve the same in T-SQL for this typical computational task in the managerial accounting area: when some internal support units service each other (and incur periodic costs, like salary etc) to produce at the end let's say 2 or 3 final products together as a firm, and as a result their respective shares of internally generated support overheads need to be reasonably (according to some physical base distribution, lets say - man hrs spent in each) allocated to these products' cost at the end of the costing exercise.
It would be quite simple if there was no reciprocal services: one support unit providing some service to other support units during the period (and a need to allocate respective costs too alongside this service qty flow) and the second and third support units doing the same thing to other support peers, before all their costs get properly berried into production costs and spread between respective products they jointly serviced (not equally for all support units, I'm using activity-based-costing approach here)... And in a real case there could be many more than just 2-3 units one could manually solve in excel or on paper. So, it really needs some dynamic parameters algorithm (X number of support units servicing X-1 peers and Y products in the period serviced based on some qty-measure/% square matrix allocation table) to spread their periodic cost to one unit of each product at the end. Preferably, somehow natively in SQL without using external .NET or other assembly references.
Some numeric example:
each of 3 support units A,B,C incurred $100, $200, $300 of expenses in the period and worked 50 man hrs each, respectively
A-unit serviced B-unit for 10 hrs and C-unit for 5 hrs, B-unit serviced A-unit for 5 hrs, C-unit serviced A-unit for 3 hrs and B-unit for 10 hrs
The rest of the support units' work time (A-unit 35 hrs: 30% for P1 and 70% for P2, B-unit 45 hrs: 35% for P1 and 65% for P2, C-unit 37 hrs for P2 for 100%) they spent servicing the output of two products (P1 and P2); this portion of their direct time/effort easily allocates to products - but due to reciprocal services to each other some share of support units' cost needs to be shifted to a respective product cost pool unequal to their direct time to product allocation (needs an adjusted mix coefficient for step 2 effects).
I could solve this in excel with iterating algorithm and use of VBA arrays:
(a) vector of period costs by each support unit (to finally reallocate to products and leave 0),
(b) 2dim array/matrix of coefficients of self-service between support units (based on man hrs - one to another),
(c) 2dim array/matrix of direct hrs service for each product by support units,
(d) minimal tolerable error of $1 (leftover of unallocated cost in a unit to stop iteration)
For just 2 or 3 elements (while still manually provable on paper) it is a feasible approach, but this becomes impossible to manually prove for a correct solution once I have 10-20+ support units and many products in a matrix; and I want to switch from excel and VBA to MS SQL server and t-sql for other reasons.
Since this business case as such is not new at all, I was hoping more experienced colleagues could throw an advise how to best solve this - I believed there must have been a solution to this task before (not in pure programming environment/external code).
I am thinking to combine CTE(recursive), table variables and aggregate window functions - but hesitate/struggle how to best/exactly put all puzzle elements together so it is truly scalable for my potentially growing unit/product matrix dimensions.
For my current level it's a little mind blowing, so I'd be grateful for an advice.

CPU clock cycle misunderstanding

I can't well understand about CPU clock such as 3.4Ghz. I know this is that 3.4 billions clock cycle per second.
So here if machine use single clock cycle instruction, then It can execute about 3.4 billions instructions per second.
But in pipeline, basically it needs more cycles per instruction, but each cycle length is shorter than single clock cycle.
But although pipeline has more throughput, anyway cpu can do 3.4 billions cycle per second. So, it can execute 3.4 billions/5 instructions(if one instruction needs 5 cycles), which means less than single cycle implementation(3.4 > 3.4/5). What am I missing?
Does CPU clock such as 3.4Ghz just means for based on pipeline cycle, not for based on single cycle implentation?
Pipelining
Pipelining doesn't involve cycles shorter than a single clock cycle. Here's how pipelining works:
We have a complicated task to do. We take that task and break it down into a number of stages, each of which is relatively simple to carry out. We study the amount of work in each stage to make sure each stage takes about the same amount of time as any other.
With a processor, we do roughly the same thing--but in this case, it's not "install these fourteen bolts", it's things like fetching and decoding instructions, reading operands, executing (often a couple of stages here), and writing back results.
Like the automotive production line, we provide each stage of the pipeline with a specialized set of tools for doing exactly (and only) what is needed at that stage. When we finish doing one stage of processing on a car/instruction, it moves along to the next stage, and this stage gets the next car/instruction to process.
In an ideal situation, the process works (roughly) like this:
It took Ford about 12 hours to build one N car (the predecessor to the model T). Thanks primarily to pipelining the production line, it took only about 2 and a half hours to build a Model T. More importantly, even though a model T took 2.5 hours start to finish, that time was broken down into no fewer than 84 discrete steps, so when everything ran smoothly the production line as a whole could produce another car (about) every two minutes.
That didn't always happen though. If one stage ran short of parts, the stages after it had to wait. If the pause lasted very long, it would back things up so the preceding stages had to wait too.
The same can happen in a processor pipeline. For example, when a branch happens, the processor may have to wait a while before the next instruction can be fetched. If an instruction needs an operand from memory, that can lead to a pause (a "pipeline bubble") as well.
To prevent pauses in his pipeline, Henry Ford hired people to study the stages, figure out how many of each kind of part would need to be on hand for each stage, and so on. I don't know for sure, but I think it's a fair guess that there were probably a few people designated to watch the supply of parts at different stations, and send somebody running to let a warehouse manager know if (for whatever reason) the supply of parts for a particular stage looked like it was running short so they'd need more soon.
Processors do a little of the same thing--they have things like branch predictors and prefetchers that attempt to figure out ahead of time what will be needed by the stream of instructions being executed, and trying to ensure that everything is on hand when its needed (with caches, for example, to temporarily store things that seem likely to be needed soon).
So, like the Model T, it takes some relatively long amount of time for each instruction to execute start to finish, but we get another product finished at much shorter intervals--ideally once a clock (but see my other answer--modern designs often execute more than one instruction per clock).
A typical modern CPU can execute a number of unrelated instructions (those that don't depend on the same resources) concurrently.
To do that, it typically ends up with a basic structure vaguely like this:
So, we have an instruction stream coming in on the left. We have three decoders, each of which can decode one instruction each clock cycle (but there may be limitations, so complex instructions all have to pass through one decoder, and the other two decoders can only do simple instructions).
From there, the instructions pass into a reorder buffer, which keeps a "scoreboard" of which resources are used by each instruction, and which resources are affected that instruction (where a "resource" would typically be something like a CPU register or a flag in the flags register).
The circuitry then compares those scoreboards to determine dependencies. For example, if one instruction writes to register 0, and a later one reads from register 0, then those instructions must execute serially. At each clock, it tries to find the N oldest instructions that don't have dependencies for execution.
There are then a number of independent execution units. Each of these is basically a "pure" function--it takes some inputs, carries out a specified transformation on it, and produces an output. This makes it easy to replicate them as needed, and have as many running in parallel as we want/can afford. Those are typically grouped, with one port going to each group. In each clock, we can send one instruction through that port to one of the execution units in that group. Once an instruction arrives at the execution unit, it may take more than one clock to finish execution.
Once those execute, we have a set of retirement units that take the results, and write them back to the registers in execution order. Again we have multiple units so we can retire multiple instructions per clock.
Note: this drawing tries to be semi-realistic about the rough number of decoders, retirement units, and ports that it depicts, but what it shows is a general idea--different CPUs will have more or fewer specific resources. For almost any of them, the number of decoded instructions in the scoreboard units is low though--a realistic number would be more like 50 instructions.
In any case, actual execution of instructions is one of the hardest parts of this to measure or reason about. The number of ports gives us a hard upper limit on the number of instructions that can start executing in any given clock. The number of decoders and retirement units give an upper limit on the number of instructions that can be started/finished per clock. The execution itself...well, there are a lot of execution units, and each one (at least potentially) takes a different number of clocks to execute an instruction.
With the design as shown above, you'd have a hard upper limit of three instructions per clock. That's the most you can decode or retire. With a different design, that could obviously go up or down (e.g., with 4 decoders, 4 ports and 4 retirement units, the upper limit could go up to 4).
Realistically, with that design you wouldn't normally expect to see three instructions execute in most clock cycles. There are enough dependencies between instructions that you'd probably expect closer to 2 as a long term average (and much more likely a little less than 2). Increasing the available resources (more decoders, more retirement units, etc.) will rarely help that a whole lot--you might get to an average of three instructions per clock, but hoping for four is probably unrealistic.
As others have noted the full details of how a modern CPU operates are complicated. But part of your question has a simple answer:
Does CPU clock such as 3.4Ghz just means for based on pipeline cycle,
not for based on single cycle implentation?
The clock frequency of a CPU refers to how many times per second the clock signal switches. The clock signal is not divided into smaller pipelined segments. The purpose of pipelining is to allow for faster clock switching speeds. So 3.4GHz refers to the number of times per second that a single pipeline stage can perform whatever work it needs to do when executing an instruction. The total work for executing an instruction is done over multiple cycles each of which could be in a different pipeline stage.
Your question also shows a some misconceptions about how pipelining works:
But although pipeline has more throughput, anyway cpu can do 3.4
billions cycle per second. So, it can execute 3.4 billions/5
instructions(if one instruction needs 5 cycles), which means less than
single cycle implementation(3.4 > 3.4/5). What am I missing?
In the simple case the throughput of a single cycle CPU and a pipelined CPU is the same. The latency of the pipelined CPU is higher because it requires more cycles (i.e. 5 in your example) to execute a single instruction. But after the pipeline is full the throughput could be the same as for a single cycle non-pipelined CPU. So in the simple case using your example a single-cycle CPU could execute 3.4 billion instructions in 1 seconds, while the pipelined CPU with 5 stages could execute 3.4 billion minus 5 instructions in 1 second. Subtracting 5 from 3.4 billion is a negligible difference, whereas dividing by 5 would be a very significant difference.
A couple of other things to note are that the simple case I described isn't really true because of dependencies between instructions that require pipeline stalls. And most modern CPUs can execute more than one instructions per cycle.

Calculate the performance of a multicore architecture?

Cal a multicore architecture with 10 computing cores: 2 processor cores and 8 coprocessors. Each processor core can deliver 2.0 GFlops, while each coprocessor can deliver 1.0 GFlops. All computing cores can perform calculation simultaneously. Any instruction can execute in either processor or coprocessor cores unless there are any explicit restrictions.
If 70% of dynamic instructions in an application are parallelizable, what is the maximum average performance (Flops) you can get in the optimal situation? Please note that the remaining 30% instructions can be executed only after the execution of the parallel 70% is over.
Consider another application where all the dynamic instructions can be partitioned into 6 groups (A, B, C, D, E, F) with the following dependency. For example, A --> C implies that all the instructions in A need to be completed before starting the execution of instructions in C. Each of the first four groups (A, B, C and D) contains 20% of the dynamic instructions whereas each of the remaining two groups (E and F) contains 10% of the dynamic instructions. All the instructions in each group must be executed sequentially on the same processor or coprocessor core. How to schedule them on the multicore architecture to achieve the best possible performance? What is the maximum average performance (Flops) now?
A(20%) --> C(20%) -->
E(10%)-->F(10%)
B(20%) --> d(20%) -->
For the first part, you need to use Amdahl's Law, which is:
max speed-up = 1/(1-p+p/n)
where p is the parallelizable part. n is the improvement factor in executing the parallel portion.
(Note that the Amdahl's Law formula can be used for first order estimates on other types of changes. E.g., given a factor of N reduction in ALU energy use and P fraction of energy used by the ALU, one can find the improvement in total energy use.)
In your case, since the serial portion would be executed on the higher performance (2 GFLOPS) processor core, n is 6 ([8 coprocessor cores * 1 GFLOPS/core + 2 processor cores * 2 GFLOPS/core]/ 2 GFLOPS/processor core).
A quick calculation shows the max speed-up you can get is 2.4 related to 1 processor core. The maximum FLOPS would therefore be the speed-up times the speed if the whole program was executed serially on one processor core, i.e., 2.4 * 2 GFLOPS = 4.8 GFLOPS.
For the second part, note that initially there are two independent instruction streams: A -> C and B -> C. Since the system has two processor cores, both can be executed in parallel on the higher performance processor cores. Furthermore, both have the same amount of work (40% of total for each stream), so one the same performance core they will complete at the same time.
Since E depends on results from both C and D, it must be started after both finish. E and F would execute on a processor core (which core is arbitrary since E must wait for the tasks running on both processor cores to complete).
As you can see 80% of the program (40% for A+C; 40% for B+D) can be parallelized by a factor of 2 and 20% of the program (E+F) is serial. You can then just plug the numbers into the Amdahl's Law formula (p=0.8, n=2).

Amdahl's law example

Can someone help me with this example please and show me how to work the second part?
the question is :
If one third of a weather prediction algorithm is inherently serial and the remainder
parallelizable, what is the minimum number of cores needed to guarantee a 150% speedup over a
single core implementation?
ii. Your boss revises the figure to 200%. What is your new answer?
Thanks very much in advance !!
Guess: If the algorithm is 1/3 serial and 2/3 parallel...I would think that each core you added would give you a 66% increase in performance...So for 150% increase, you'd need 3 more cores, and for a 200% increase, you'd need 4.
This is a guess. Your textbook might be more helpful :)
If the algorithm runs on a single core and takes 90 minutes then 30 minutes is for the serial part and 60 minutes for the parallel part.
Add a CPU:
30 is for the serial part and 30 for the parallel part(half of the 60 overlaps with the serial part).
90 / 60 = 150% increase.
I am a bit late, but here are the answers:
1) 150% increase -> 2 cores at least required as dbasnett said;
2) 200% increase -> 4 cores at least required basing on the Amahld's law:
Here, 90 minutes overall required to perform the calculation. P is the actually enhanced part of the algorithm (the parallelizable part) which is 2/3 of 90, N is the number of cores, so when there's a core only:
You get 1, which means 100%, which is how the algorithm performs the standard way (without multi-core acceleration and therefore no parallelization speedup).
Now, we must find N number of cores for which the previous equation equals 2, where 2 means that the algorithm performs in half time (45 minutes instead of 90 when there's no parallelization) and therefore with a 200% speedup:
Since:
We see that:
So with 4 cores computing in parallel the 2/3 of the algoritm you get 200% speedup. The same goes for 150%, you will get 2, as dbasnett already told you.
Pretty simple.
Note that a complex algorithm may imply further divisions of its parallelizable parts (and in theory you can have a different number of processing units per parallelizable part concurrently):
You can further look at Wikipedia (there's also an example):
http://en.wikipedia.org/wiki/Amdahl%27s_law#Description
Anyway, the principle is the same:
Let T be the time an algorithm needs to execute in order to complete, A be the serial part of it, B its parallelizable part and N the number of parallel CPUs, you can divide B in further small sections and perform calculations on each part:
You may for C, D, G e.g. adopt M CPUs instead of N (the speedup will of course differ if M != N).
And at the end, you will arrive at a point when having more CPUs doesn't matter anymore, since:
And your algorithm speedup will at most tend to total execution time (T) divided by the execution time of the Serial part only (A).
Therefore parallel calculation comes really handy only when you have low execution time for the serial part of your algorithm.

iPhone ARMv6 VFP asm latency, throughput and hazards

in this document:
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0301g/DDI0301G_arm1176jzfs_r0p7_trm.pdf
on page 21-25 (pdf page 875) the througput and latency timings are given for the assembly instructions of the VFP unit.
Are those numbers independant of vectorsize?
1:
let's take FMULS which has throughput of 1 and latency of 8. does it mean that i can start in each cycle a new FMULS operation if i don't use a register which is not currently calculated by a previous function?
for example:
FMULS s8, s16, s20
FMULS s12, s21, s25
will those exectue right after each other?
2:
what happens if I have two FMULS functions after each other where one argument depends upon the previous computation
FMULS s8, s16, s20
FMULS s12, s21, s8
will the VFP wait for 8 cycles before starting to process the second instruction?
3:
what if we are in vectormode with 4 elements and on the second FMULS instruction all inputregisters but one are available. what will happen?
4: sqrt and division:
will a sqrt or division operation prevent any subsequent operation from being started for 19 cycles?
thanks!
Your questions are all answered in the document that you linked. You should read it carefully.
Are those numbers independent of vectorsize?
No. See, for example, Table 21-15 in the document you linked. Note the latency of the short vector FADDS.
does it mean that I can start a new FMULS operation every cycle if it doesn't depend on an earlier result that isn't available yet?
Yes, that's the definition of throughput.
what happens if I have two FMULS functions after each other where one argument depends upon the previous computation
Execution will stall until the result of the first FMULS is available. See 21.6 "Operation of the scoreboards" for more detail.
what if we are in vectormode with 4 elements and on the second FMULS instruction all inputregisters but one are available. what will happen?
It will stall. Same reference.
sqrt and division: will a sqrt or division operation prevent any subsequent operation from being started for 19 cycles?
No. See section 21.10 "Parallel Execution". An example is given in Table 21-15, in which a non-dependent FADDS executes immediately following FDIVS.
Note that it can be a bit of a challenge (though not impossible) to write short-vector VFP code that performs substantially faster than scalar code for many types of computation. Even if you learn how to do it, it will be of questionable value since the NEON unit seems to be the new model for vector computation on ARM. You may be better served in the long run by ignoring the short-vector operation for now and focusing on learning NEON for the future.