Understanding stalls and branch delay slots - cpu-architecture

I am taking a course on Computer Architecture. I found this website from another University which has notes and videos which are helping me thus far: CS6810, Univ of Utah. I am working through some old homework assignments posted on that site, in particular this one. I am trying to understand pipelining and related concepts, specifically stalls and branch delay slots.
I am looking now at the first question from that old homework assignment and am unsure of how to do these problems.
The question is as follows:
Consider the following code segment, where the branch is taken 30% of the time and not
taken 70% of the time.
R1 = R2 + R3
R4 = R5 + R6
R7 = R8 + R9
if R10 = 0, branch to linex
R11 = R12 + R13
R14 = R11 + R15
R16 = R14 + R17
...
linex: R18 = R19 + R20
R21 = R18 + R22
R23 = R18 + R21
...
Consider a 10-stage in-order processor, where the instruction is fetched in the first
stage, and the branch outcome is known after three stages. Estimate the CPI of the
processor under the following scenarios (assume that all stalls in the processor are
branch-related and branches account for 15% of all executed instructions):
On every branch, fetch is stalled until the branch outcome is known.
Every branch is predicted not-taken and the mis-fetched instructions are squashed if the branch is taken.
The processor has two delay slots and the two instructions following the branch are always fetched and executed, and
3.1. You are unable to find any instructions to fill the delay slot.
3.2. You are able to move two instructions before the branch into the delay slot.
3.3. You are able to move two instructions after label "linex" into the delay slot.
3.4. You are able to move one (note: one, not two!) instruction immediately after the branch (in the original code) into the delay slot.
I am unsure of how to even begin to look at this question. I have read all the notes and watched the videos on that site and have read sections from the H&P book but am still confused on this problem. If anyone has the time, I would appreciate someone helping me step through this question. I just need to know how to begin to conceptualize the answers.

In the described pipeline the direction and target of a conditional branch is not available until the end of the third cycle, so the correct next instruction after the branch cannot be fetched (with certainty) until the beginning of the fourth cycle.
Design 1
An obvious way to handle the delayed availability of the address of the instruction after the branch is simply to wait. This is what the design 1 does by stalling for two cycles (which is equivalent to fetching two no-ops that are not part of the actual program). This means that for both taken and not taken paths two cycles will wasted, just as if two no-op instructions had been inserted by the compiler.
Here are diagrams of the pipeline (ST is a stall, NO is a no-op, XX is a canceled instruction, UU is a useless instruction, I1, I2, and I3 are the three instructions before the branch [in the original program order before filling any delay slots], BI is the branch instruction, I5, I6, and I7 are the fall-through instructions after the branch, I21, I22, and I23 are the instructions at the start of the taken path; IF is the instruction fetch stage, DE is decode, BR is branch resolve, S1 is the stage after BR):
Taken Not taken
IF DE BR S1 ... IF DE BR S1 ...
cycle 1 BI I3 I2 I1 BI I3 I2 I1
cycle 2 ST BI I3 I2 ST BI I3 I2
cycle 3 ST ST BI I3 ST ST BI I3
cycle 4 I21 ST ST BI I5 ST ST BI
cycle 5 I22 I21 ST ST I6 I5 ST ST
Design 2
To avoid having to detect the presence of a branch by the end of the IF stage and to allow some useful work to be done sometimes (in the not taken case), rather than having hardware effectively insert no-ops into the pipeline (i.e., stall fetch after the branch) the hardware can treat the branch as any other instruction until it is resolved in the third pipeline stage. This is predicting all branches as not taken. If the branch is taken, then the two instructions fetched after the branch are canceled (effectively turned into no-ops). This is the design 2:
Taken Not taken
IF DE BR S1 ... IF DE BR S1 ...
cycle 1 BI I3 I2 I1 BI I3 I2 I1
cycle 2 I5 BI I3 I2 I5 BI I3 I2
cycle 3 I6 I5 BI I3 I6 I5 BI I3
cycle 4 I21 XX XX BI I7 I6 I5 BI
cycle 5 I22 I21 XX XX I8 I7 I6 I5
Design 3
Always predicting a branch to be not taken will waste two cycles whenever a branch is taken, so a third mechanism was developed to avoid this waste--the delayed branch. In a delayed branch, the hardware always executes (does not cancel) the delay slot instructions after the branch (two instructions in the example). By always executing the delay slot instructions, the pipeline simplified. The compiler's job is to try to fill these delay slots with useful instructions.
Instructions taken from before the branch (in the program without delayed branches) will be useful regardless of which path is taken (but dependencies can prevent the compiler from scheduling any such instructions after the branch). The compiler can fill a delay slot with an instruction from the taken or not taken path, but such an instruction cannot be one that overwrites state used by the other path (or after the paths join) since delay slot instructions are not canceled (unlike with prediction). (If both paths join--as is common for if-then-else constructs--, then delay slots could potentially be filled from the join point; but such instructions are usually dependent on instructions from at least one of the paths before the join, which dependency would prevent them from being used in delay slots.) If the compiler cannot find a useful instruction, it must fill the delay slot with a no-op.
In case 3.1 (the worst case for a delayed branch design), the compiler could not find any useful instructions to fill the delay slots and so must fill them with no-ops:
Taken Not taken
IF DE BR S1 ... IF DE BR S1 ...
cycle 1 BI I3 I2 I1 BI I3 I2 I1
cycle 2 NO BI I3 I2 NO BI I3 I2
cycle 3 NO NO BI I3 NO NO BI I3
cycle 4 I21 NO NO BI I5 NO NO BI
cycle 5 I22 I21 NO NO I6 I5 NO NO
This is equivalent in performance to design 1 (stall two cycles).
In case 3.2 (the best case for a delayed branch design), the compiler found two instructions from before the branch to fill the delay slots:
Taken Not taken
IF DE BR S1 ... IF DE BR S1 ...
cycle 1 BI I1 ... BI I1 ...
cycle 2 I2 BI I1 ... I2 BI I1 ...
cycle 3 I3 I2 BI I1 I3 I2 BI I1
cycle 4 I21 I3 I2 BI I5 I3 I2 BI
cycle 5 I22 I21 I3 I2 I6 I5 I3 I2
In this case, all pipeline slots are filled with useful instructions regardless of whether the branch is taken or not taken. The performance (CPI) is the same as for an ideal pipeline without delayed resolution of branches.
In case 3.3, the compiler filled the delay slots with instructions from the taken path:
Taken Not taken
IF DE BR S1 ... IF DE BR S1 ...
cycle 1 BI I3 I2 I1 BI I3 I2 I1
cycle 2 I21 BI I3 I2 I21 BI I3 I2
cycle 3 I22 I21 BI I3 I22 I21 BI I3
cycle 4 I23 I22 I21 BI I5 UU UU BI
cycle 5 I24 I23 I22 I21 I6 I5 UU UU
In the not taken path I21 and I22 are useless. Although they are actually executed (and update state), this state is not used in the not taken path (or after any joining of the paths). For the not taken path, it is as if the delay slots had been filled with no-ops.
In case 3.4, the compiler could only find one safe instruction from the not taken path and must fill the other delay slot with a no-op:
Taken Not taken
IF DE BR S1 ... IF DE BR S1 ...
cycle 1 BI I3 I2 I1 BI I3 I2 I1
cycle 2 I5 BI I3 I2 I5 BI I3 I2
cycle 3 NO I5 BI I3 NO I5 BI I3
cycle 4 I21 NO UU BI I6 NO I5 BI
cycle 5 I22 I21 NO UU I7 I6 NO I5
For the taken path, one useless instruction and one no-op are executed, wasting two cycles. For the not taken path, one no-op is executed, wasting one cycle.
Calculating CPI
The formula for calculating CPI in this case is:
%non_branch * CPI_non_branch + %branch * CPI_branch
CPI_branch is calculated by accounting for the time taken for the branch itself (baseCPI_branch) and the percentage of times the branch is taken with the wasted cycles when it is taken and the percentage of times the branch is not taken with the wasted cycles when it is not taken. So the CPI_branch is:
baseCPI_branch + (%taken * wasted_cycles_taken) +
(%not_taken * wasted_cycles_not_taken)
In an ideal scalar pipeline, each instruction takes one cycle, i.e., the Cycles Per Instruction is 1. In this example, non-branch instructions behave as if the pipeline were ideal ("all stalls in the processor are branch-related"), so each non-branch instruction has a CPI of 1. Likewise, the baseCPI_branch (excluding wasted cycles from stalls, no-ops, et al.) is 1.
Based on the pipeline diagrams above, one can determine the number of cycles that are wasted in the taken and in the not taken paths. The example gives the percentage of branches and the percentages of branches that are taken and not taken.
For the design 1, both taken and not taken paths waste 2 cycles, so the CPI_branch is:
1 + (0.3 * 2) + (0.7 *2) = 3
and the total CPI is therefore:
(0.85 * 1) + (0.15 * 3) = 1.3

Related

Reverse CAN BTR value from register value of stm32

i'm trying to get the baud rate of a chip by reverse engineering it.
the register value for BTR is reading: 0x23000B
As per http://www.bittiming.can-wiki.info/ it seems that real values are "-1" in the register. So it seems that
SJW -> 0x0 -> becomes 1
TS2 -> 0x2 -> becomes 3
TS1 -> 0x3 -> becomes 4
preampl -> 0xB -> 11d -> becomes 12d
so if my decoding is correct (can't really find a reference of what the register should contain officially in any docs):
The chip in question has a 48MHz clock
So 48Mhz/(preampl) => 48MHz/12 => 4Mhz
4.000.000 / (SJW + TS1 + TS2) => 500kbps
does this make any sense? also if you can find reference to the register value in a pdf i would greatly appreciate that.
Besides the calculation i'm not sure about the 48Mhz clock.
A CAN bit is divided into time quanta (tq). The tq are clocked with your CAN prescaler clock which needs to be accurate enough (<1% inaccuracy). When setting up baudrate, you should strive to place the sample point close to 87.5% of the bit length, which comes from an industry standard (CANopen).
(In case you a reverse-engineering something, they did not necessarily follow industry standards though and the sample point could be anywhere...)
Ideally 87.5% sample point is achieved by having a total of 16 tq, 14 tq before the sample point and 2 tq behind it. The desired baudrate is then obtained by:
1 tq fixed sync segment (can't be configured)
x tq propagation segment
y tq phase segment 1 (before sample point)
2 tq phase segment 2 (after sample point)
Different CAN controllers might name propagation segment + phase segment 1 as a single "propagation segment". It doesn't matter, it's the number of tq between the sync segment and the sample point that matters. One ideal example would be:
1 tq sync + 13 tq prop seg/phase seg 1 + 2 tq phase seg 2.
For a CAN clock of 4MHz this would give a bit rate of 4*10^6 / 16 = 250kbps.
Note that some CAN controllers do indeed expect you to subtract 1 tq from each segment length when you write to the register.
SJW, (re)synchronization jump width doesn't play a part in the baudrate calculation. It is a setting which allows a receiving node some room to re-sync in case of inaccurate baudrates. A "hard sync" is performed at the sync segment (bit edge) and then a re-synch is performed at the sample point. SJW allows some inaccuracies to happen here. It is typically just set to 1 and that works fine for all common baudrates. If you go up to 1MHz, it is recommended to increase SJW some, to 2 or 3.

Determine the value of data paths from a given instruction

During a particular clock cycle, consider the CPU shown in the drawing.
Assume that the following initial data is present
(all values are shown in decimal, DM is Data Memory):x3=8, x14=40
During the cycle in question, assume that the following instruction is executed
(the first column is the instruction's address; all values are shown in decimal):
50788 beq x3,x14,80
How to determine the value of L1, L2 and L3
As per what I understand the L1 will have the program counter
But how do I determine the value of program counter
L2 will have 0 or 1 depending upon whether it uses MemtoReg
Now sure about L3. Although above is a guesswork.
Any hints or pointers how to procees on this ?
L1 has 50788, which is the address of the current branch instruction being executed — it is the address that was fed into the Instruction Memory that results in fetching the beq x3, x14, 80.
You can follow that after L1, there's an ADDer, which will add 4 to the PC value (offering that value to subsequent circuitry), adding 4 to skip past this 4 byte instruction, and thus referring to the next sequential memory address, which holds the next sequential instruction.
L2 is "don't care", which means it doesn't matter whether it is 0 or 1, so different implementations of the same basic design could use either value.  Why doesn't it matter whether this signal is 0 or 1?  Because this instruction is a conditional branch, and as such does not update a register, so RegWrite will be 0, thus the value of WriteData is unimportant and is ignored.
The hardware is a union of all the necessary components to execute any instruction in the instruction set, and as such, some circuitry here or there goes unused during execution of different instructions.  Rather than turning off the unused circuitry (which is an advance technique that takes work to design & implement, not employed here) the unused circuitry for any given instruction is allowed to execute — but (whether turned off or allowed to execute) the control signals further down the line of datapaths are set up to ignore the results of these unused circuits based on the current instruction.
L3 is the branch condition signal, that dynamically informs the PC update circuitry whether to take the branch or not.  Here that condition is effectively generated in the ALU from the expression x3 == x14 and determines the value of this control signal: if they are equal then that control signal needs to be 1 to make it take the branch (as per the definition of the conditional branch instruction) and that control signal needs to be 0 to make it not take the branch and instead continue with sequential execution.
Hopefully, you can see that for conditional branch instructions, the Branch control signal is asserted (1/true) — this signal combined with Zero goes into an AND gate, which results in 1 for take the branch vs. 0 for don't take the branch, by controlling that MUX after the AND gate.  So, the only condition in which the branch can be taken [pc := pc + sxt(imm)*2] is when both Branch and Zero are true.  If Branch is false, it is not a branch instruction, so Zero doesn't matter, and if Zero is false, the branch condition is false, so Branch is overridden [pc := pc + 4].
More explicitly, the PC update circuitry says:
PC := (Branch & Zero) ? PC + sxt(imm)*2 : PC + 4;
Using C ternary operator (could also be written using if-then-else).
Zero is a rather poor choice for the name of this dynamic control signal.  I would have chosen Take or Taken instead.  I believe the name Zero is historical from older RISC architectures.
This circuitry follows the RISC V standard of multiplying the branch target immediate field by 2 (instead of 4 as with MIPS), and this standard makes it so that regular 4 byte instructions are identical (unchanged) in the presence of compressed instructions — thus, on hardware that supports compressed instructions, no mode switching is needed, and, compressed instructions can be interleaved with uncompressed instructions (unlike with MIPS16 or ARM Thumb).  However, this block diagram does not provide the other features necessary to execute compressed instructions (for one, there is no increment by 2 option diagrammed in this PC update circuitry, for another there is no compressed instruction expander, which would go in between the Instruction Memory output and the Decode logic).
What you asking is very implementation depended. From the diagram I guess it is some MIPS or MIPS like uarch. Some of real RISC implementations have (curr_instr_addr+2instrsize) in the PC according to ISA. This has some historical reasons, because on old machines depth of the pipeline was 3 levels. So L1 has addr of the next or some of the next instructions. I can't say which exactly. If beq instruciton is in ALU, then L2 has MemToReg of the previous instruction to determine if writeback phase is needed. L3 keeping the zero flag to bypass the pipeline directly to the PC if the next instruction is branch.

When does the pipeline take 2 decode stages when there is a RAW dependency in 2 successive instructions

Consider a RISC pipeline having 5 stages, Find how many cycles are required for the instruction given below, assume operand forwarding, branch prediction is used in which the branch is not taken, ACS is the branch instruction and the five stages are Instruction fetch, Decode, Execute, Memory and Write back.
I1: ACS R0, R1,X
I2: LOAD R2, 0(R3)
I3: SUB R4 R2, R2
I4: X: ADD R5, R1, R2
I5: LOAD R1, 0(R5)
I6: SUB R1, R1, R4
I7: ADD R1, R1, R5
A. 11
B. 12
C. 13
D. 14
Solution:
In the solution, I coludn't understand why have they neglected 2 DECODE cycles in I6 and I7 although they have a RAW dependency?
Source of the question:
Question 41 of https://practice.geeksforgeeks.org/contest-quiz/sudo-gate-2020-mock-iii
I think the answer gives the right total (13 cycles) but put the stall in the wrong instruction.
I5 doesn't need to stall; I4 (ADD R5, R1, R2) produces R5 in time to forward it to the next instruction's EX for address calculation (LOAD R1, 0(R5)). (Your 5-stage classic RISC pipeline has bypass forwarding).
But I6 reads the result of a load instruction, and loads produce their result a cycle later than the ALU in EX. So like I3, I6 needs to stall, not I5.
(I7 depends on I6, but I6 is an ALU instruction so it can forward without stalling.)
They stalls in the D stage because the ID stage can't fetch registers that the I2 / I5 load hasn't produced yet.
Separately from that, your diagram shows I4 (and what should be I7) not even being fetched when the previous instruction stalls. That doesn't make sense to me. At the start of that cycle, the pipeline doesn't even know that it needs to stall because it hasn't yet decoded I3 (and I6) and detected that it reads a not-ready register so an interlock is needed.
Fetch doesn't wait until after decoding the previous instruction to see if it stalled or not; that would defeat the entire purpose of pipelining. It should look like
I3 IF D D EX MEM WB
I4 IF IF D EX MEM WB
BTW, load latency is the reason that classic MIPS has a load-delay slot (unpredictable behaviour if you try to use a register in the next instruction after loading into it). Later MIPS added interlocks to stall if you do that, instead of making it an error, so you can keep static code-size smaller (no NOP filler) in cases where you can't find any other instruction to put in that slot. (And some even later MIPS did out-of-order exec which can hide latency.)

What is the speedup? Can't understand the solution

I'm going through a Computer Architecture MOOC on my time. There is a problem I can't solve. The solution is provided but I can't understand the solution. Can someone help me out. Here is the problem and the solution to it:
Consider an unpipelined processor. Assume that it has 1-ns clock cycle
and that it uses 4 cycles for ALU operations and 5 cycles for branches
and 4 cycles for memory operations. Assume that the relative
frequencies of these operations are 50 %, 35 % and 15 % respectively.
Suppose that due to clock skew and set up, pipelining the processor
adds 0.15 ns of overhead to the clock. Ignoring any latency impact,
how much speed up in the instruction execution rate will we gain from
a pipeline?
Solution
The average instruction execution time on an unpipelined processor is
clockcycle * Avg:CP I = 1ns * ((0.5 * 4) + (0.35 * 5) + (0.15 * 4)) =
4.35ns The avg. instruction execution time on pipelined processor is = 1ns + 0.15ns = 1.15ns So speed up = 4.35 / 1.15 = 3.78
My question:
Where is 0.15 coming from in the average instruction execution time on a pipelines processor? Can anyone explain.
Any help is really appreciated.
As the question says those 0.15ns are due to clock skew and pipeline setup.
Forget about pipeline setup and imagine that all of the 0.15ns are from clock skew.
I think the solution implies the CPI (Cycle Per Instruction) is one (1) (w/o the overhead), i.e., 1-ns clock cycle which I'm assuming it's the CPU running clock (1 GHz).
However, I'm not seeing anywhere the CPI is clearly identified as one (1).
Did I misunderstand anything here?

Calculate the performance of a multicore architecture?

Cal a multicore architecture with 10 computing cores: 2 processor cores and 8 coprocessors. Each processor core can deliver 2.0 GFlops, while each coprocessor can deliver 1.0 GFlops. All computing cores can perform calculation simultaneously. Any instruction can execute in either processor or coprocessor cores unless there are any explicit restrictions.
If 70% of dynamic instructions in an application are parallelizable, what is the maximum average performance (Flops) you can get in the optimal situation? Please note that the remaining 30% instructions can be executed only after the execution of the parallel 70% is over.
Consider another application where all the dynamic instructions can be partitioned into 6 groups (A, B, C, D, E, F) with the following dependency. For example, A --> C implies that all the instructions in A need to be completed before starting the execution of instructions in C. Each of the first four groups (A, B, C and D) contains 20% of the dynamic instructions whereas each of the remaining two groups (E and F) contains 10% of the dynamic instructions. All the instructions in each group must be executed sequentially on the same processor or coprocessor core. How to schedule them on the multicore architecture to achieve the best possible performance? What is the maximum average performance (Flops) now?
A(20%) --> C(20%) -->
E(10%)-->F(10%)
B(20%) --> d(20%) -->
For the first part, you need to use Amdahl's Law, which is:
max speed-up = 1/(1-p+p/n)
where p is the parallelizable part. n is the improvement factor in executing the parallel portion.
(Note that the Amdahl's Law formula can be used for first order estimates on other types of changes. E.g., given a factor of N reduction in ALU energy use and P fraction of energy used by the ALU, one can find the improvement in total energy use.)
In your case, since the serial portion would be executed on the higher performance (2 GFLOPS) processor core, n is 6 ([8 coprocessor cores * 1 GFLOPS/core + 2 processor cores * 2 GFLOPS/core]/ 2 GFLOPS/processor core).
A quick calculation shows the max speed-up you can get is 2.4 related to 1 processor core. The maximum FLOPS would therefore be the speed-up times the speed if the whole program was executed serially on one processor core, i.e., 2.4 * 2 GFLOPS = 4.8 GFLOPS.
For the second part, note that initially there are two independent instruction streams: A -> C and B -> C. Since the system has two processor cores, both can be executed in parallel on the higher performance processor cores. Furthermore, both have the same amount of work (40% of total for each stream), so one the same performance core they will complete at the same time.
Since E depends on results from both C and D, it must be started after both finish. E and F would execute on a processor core (which core is arbitrary since E must wait for the tasks running on both processor cores to complete).
As you can see 80% of the program (40% for A+C; 40% for B+D) can be parallelized by a factor of 2 and 20% of the program (E+F) is serial. You can then just plug the numbers into the Amdahl's Law formula (p=0.8, n=2).