Branch folding example unclear

I was learning about branch folding from a book called "Computer Organization" by Carl Hamacher (5th edition) when I came across this example:
Additional details:
Queue length here denotes the number of instructions present in the instruction queue
F, D, E and W denote the fetch, decode, execute and write stages of a pipeline, respectively
The dotted lines at instructions 2, 3 and 4 (I2, I3 and I4) denote that the stage the instruction is currently in is idle (i.e. waiting for the next stage to complete)
I5 is a branch instruction with branch target Ik
The pipeline starts with an instruction fetch unit which is connected to an instruction queue, which is connected to a decode/dispatch unit, which is connected to an execution unit, and finally it ends with a write stage. There are intermediate buffers between the decode, execute and write stages.
My doubt here is how D3 and D5 (of I3 and I5) can be performed in the same clock cycle when there is only one decode unit (as given). Furthermore, shouldn't the instruction queue have length 2 at cycle 4? Why is it still 1, when both F3 and F4 appear to be in the instruction queue and neither has been dispatched at cycle 4?

reorder buffer problem (computer architecture Udacity course)

Can someone explain to me why the issue time for instruction I5 is cycle 6 and not cycle 5, according to the solution manual provided for this problem?
Notes: 1) the problem and its published solution are given below; 2) this problem is part of the problem set for the computer architecture course on Udacity.
problem:
Using Tomasulo's algorithm, for each instruction in the following
sequence determine when (in which cycle, counting from the start) it
issues, begins execution, and writes its result to the CDB. Assume
that the result of an instruction can be written in the cycle after it
finishes its execution, and that a dependent instruction can (if
selected) begin its execution in the cycle after that. The execution
time of all instructions is two cycles, except for multiplication
(which takes 4 cycles) and division (which takes 8 cycles). The
processor has one multiply/divide unit and one add/subtract unit. The
multiply/divide unit has two reservation stations and the add/subtract
unit has four reservation stations. None of the execution units is
pipelined – each can only be executing one instruction at a time. If
a conflict for the use of an execution unit occurs when selecting
which instruction should start to execute, the older instruction (the
one that appears earlier in program order) has priority. If a conflict
for use of the CDB occurs, the result of the add/subtract unit has
priority over the result of the multiply/divide unit. Assume that at
start all instructions are already in the instruction queue, but none
has yet been issued to any reservation stations. The processor can
issue only one instruction per cycle, and there is only one CDB for
writing results. A way of handling exceptions in the processor
described above would be to simply delete all instructions from
reservation stations and the instruction queue, set all RAT entries to
point to the register file, and jump to the exception handler as soon
as possible (i.e. in the cycle after the one in which divide-by-zero
is detected). 1) Find the cycle time of each instruction for the Issue,
Execution, and Write-back stages. 2) What would be printed in the
exception handler if exceptions are handled this way?
provided solution:
timing diagram
solution for second question
The exception occurs in cycle 20, so the cycle in which we start executing the exception handler
is cycle 21. At that time, the processor has completed instructions I1-I4, but it has also completed
instructions I6 and I10. As a result, register F4 in the register file would have the result of I10,
which is -1 (5-6). The exception handler would print 2, 0, -2, -1, which is incorrect.
Is there a limited ROB or RS (scheduler) size that would stop the front-end from issuing more instructions until some have dispatched to make more room (RS size), or until some have retired (ROB size)? It's common for the front-end's best case to be better throughput than the back-end, precisely so the back-end can get a look at possible independent instructions later on. But there has to be some limit to how many un-executed instructions can be tracked by the back-end.
In this case, yes:
The multiply/divide unit has two reservation stations and the add/subtract unit has four reservation stations
So I think that's the limiting factor there: the first two instructions are mul and div, and the first of those finishes on cycle 5. Apparently this CPU doesn't free the RS entry until the cycle after writeback. (And instead of one unified scheduler, it has queues (reservation stations) for each kind of execution unit separately.)
Some real CPUs may be more aggressive, e.g. I think Intel CPUs can free an RS entry sooner, even though they sometimes need to replay a uop if it was optimistically dispatched early in anticipation of a cache hit (when an input is the result of a load): Are load ops deallocated from the RS when they dispatch, complete or some other time?
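To make the issue-stall mechanism concrete, here is a minimal C sketch (my own, not the course's solution) that models only two of the rules above: an instruction can't issue until one of its unit's reservation stations is free, and an RS entry is freed only in the cycle after its writeback. The three-instruction mul/div sequence and its latencies are hypothetical, and operand dependencies and CDB conflicts are ignored.

#include <stdio.h>

#define NUM_RS 2                       /* mul/div unit: two reservation stations */

int main(void) {
    int rs_writeback[NUM_RS] = {0, 0}; /* writeback cycle of the instr occupying each RS (0 = free) */
    int exec_latency[] = {4, 8, 4};    /* hypothetical sequence: mul, div, mul */
    int n = sizeof exec_latency / sizeof exec_latency[0];
    int next_issue = 1;                /* front-end issues at most one instruction per cycle */
    int unit_free = 1;                 /* first cycle the single (unpipelined) mul/div unit is free */

    for (int i = 0; i < n; i++) {
        /* pick the RS entry that becomes free earliest */
        int best = 0;
        for (int j = 1; j < NUM_RS; j++)
            if (rs_writeback[j] < rs_writeback[best])
                best = j;

        /* an RS entry is usable again only in the cycle AFTER its writeback */
        int rs_free = rs_writeback[best] + 1;
        int issue = next_issue > rs_free ? next_issue : rs_free;

        /* simplification: operands are assumed ready, so execution starts the
         * cycle after issue, or whenever the execution unit becomes free */
        int exec_start = (issue + 1 > unit_free) ? issue + 1 : unit_free;
        int exec_end   = exec_start + exec_latency[i] - 1;
        int writeback  = exec_end + 1;   /* result written the cycle after exec finishes */

        rs_writeback[best] = writeback;
        unit_free = exec_end + 1;
        next_issue = issue + 1;

        printf("I%d: issue=%d exec=%d-%d write=%d\n",
               i + 1, issue, exec_start, exec_end, writeback);
    }
    return 0;
}

With these made-up latencies, the third mul/div instruction can only issue in cycle 7 (the cycle after the first one's writeback in cycle 6 frees an RS entry), even though the front-end could otherwise have issued it in cycle 3. That is the same kind of stall the question is asking about.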

Why do longer pipelines make a single delay slot insufficient?

I read the following statement in Patterson & Hennessy's Computer Organization and Design textbook:
As processors go to both longer pipelines and issuing multiple instructions per clock cycle, the branch delay becomes longer, and a single delay slot is insufficient.
I can understand why "issuing multiple instructions per clock cycle" can make a single delay slot insufficient, but I don't know why "longer pipelines" cause it.
Also, I do not understand why longer pipelines cause the branch delay to become longer. Even with a longer pipeline (more steps to finish one instruction), there's no guarantee that the cycle will increase, so why would the branch delay become longer?
If you add any stages before the stage that detects branches (and evaluates taken/not-taken for conditional branches), 1 delay slot no longer hides the "latency" between the branch entering the first stage of the pipeline and the correct program-counter address after the branch being known.
The first fetch stage needs info from later in the pipeline to know what to fetch next, because it doesn't itself detect branches. For example, in superscalar CPUs with branch prediction, they need to predict which block of instructions to fetch next, separately and earlier from predicting which way a branch goes after it's already decoded.
1 delay slot is only sufficient in MIPS I because branch conditions are evaluated in the first half of a clock cycle in EX, in time to forward to the 2nd half of IF which doesn't need a fetch address until then. (Original MIPS is a classic 5-stage RISC: IF ID EX MEM WB.) See Wikipedia's article on the classic RISC pipeline for much more details, specifically the control hazards section.
That's why MIPS is limited to simple conditions like beq (find any mismatches from an XOR), or bltz (sign bit check). It cannot do anything that requires an adder for carry propagation (so a general blt between two registers is only a pseudo-instruction).
This is very restrictive: a longer front-end can absorb the latency from a larger/more associative L1 instruction cache that takes more than half a cycle to respond on a hit. (MIPS I decode is very simple, though, with the instruction format intentionally designed so machine-code bits can be wired directly as internal control signals. So you can maybe make decode the "half cycle" stage, with fetch getting 1 full cycle, but even 1 cycle is still low with shorter cycle times at higher clock speeds.)
Raising the clock speed might require adding another fetch stage. Decode does have to detect data hazards and set up bypass forwarding; original MIPS kept that simpler by not detecting load-use hazards, so software had to respect a load-delay slot until MIPS II. A superscalar CPU has many more possible hazards, even with 1-cycle ALU latency, so detecting what has to forward to what requires more complex logic for matching destination registers in older instructions against sources in younger instructions.
A superscalar pipeline might even want some buffering in instruction fetch to avoid bubbles. A multi-ported register file might be slightly slower to read, maybe requiring an extra decode pipeline stage, although probably that can still be done in 1 cycle.
So, as well as making 1 branch delay slot insufficient by the very nature of superscalar execution, a longer pipeline also increases branch latency, if the extra stages are between fetch and branch resolution. e.g. an extra fetch stage and a 2-wide pipeline could have 4 instructions in flight after a branch instead of 1.
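As a rough back-of-envelope restatement of that last claim in code (the resolve_delay and width values are just the hypothetical ones from the paragraph above):

#include <stdio.h>

int main(void) {
    /* Hypothetical numbers: with an extra fetch stage, the redirect address
     * is known 2 cycles after the branch is fetched, and the front end
     * fetches 2 instructions per cycle. */
    int resolve_delay = 2;   /* cycles between fetching a branch and knowing the next PC */
    int width = 2;           /* instructions fetched per cycle */

    printf("fetch slots to fill or throw away per branch: %d\n",
           resolve_delay * width);   /* 4, versus 1 for classic scalar MIPS I */
    return 0;
}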
But instead of introducing more branch delay slots to hide this branch delay, the actual solution is branch prediction. (However some DSPs or high performance microcontrollers do have 2 or even 3 branch delay slots.)
Branch-delay slots complicate exception handling; you need a fault-return and a next-after-that address, in case the fault was in a delay slot of a taken branch.

How could the test_and_set() instruction still work on a multiprocessor?

Dear stack overflow community,
I'm reading Operating System Concepts (2012) by Silberschatz, Galvin and Gagne. It says "if two test_and_set() instructions are executed simultaneously (each on a different CPU), they will be executed sequentially in some arbitrary order." on page 210. I cannot understand why two such instructions will be executed sequentially even on a multiprocessor. What if each instruction is executed on a different processor? To the best of my knowledge, these two instructions could be executed simultaneously.
My understanding of the atomicity of instructions and the multiprocessor stays at a quite superficial level, so I may take the problem for granted. Could anyone help me out here?
The whole point of test-and-set is that one processor will execute it first, and then the other processor will execute it, and they will not do it simultaneously.
To achieve this, there will be some communication between both processors. Basically, one processor will load the cache line including the memory location from memory, and tell the other processor that it can't have that cache line until both test and set are finished.
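To see how software relies on this guarantee, here is a minimal C11 sketch of a test-and-set spinlock using atomic_flag_test_and_set (which typically compiles to an atomic read-modify-write such as x86 xchg). The worker/counter names and the iteration count are just for illustration, and it assumes a compiler with C11 <threads.h> support; pthreads would work the same way.

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

/* One shared flag: "is the critical section occupied?" */
static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter = 0;

static int worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        /* test_and_set: atomically read the old value and set the flag.
         * Only the CPU that saw "old == clear" enters; the other spins. */
        while (atomic_flag_test_and_set(&lock))
            ;                       /* spin until the flag was observed clear */
        counter++;                  /* critical section */
        atomic_flag_clear(&lock);   /* release the lock */
    }
    return 0;
}

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, worker, NULL);
    thrd_create(&t2, worker, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    printf("counter = %ld\n", counter);  /* 200000: no lost updates */
    return 0;
}

Because the hardware serializes the two test-and-set operations, exactly one thread at a time sees the flag clear, so the increments never interleave and the final count is 200000.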
The outcome depends upon the machine instruction. Let me use the VAX as an example as it is an easy to understand processor:
http://www.ece.lsu.edu/ee4720/doc/vax.pdf
The VAX has a BBSS (Branch on Bit Set and Set) instruction and a BBSSI (Branch on Bit Set and Set Interlocked) instruction.
If you have 2 processors doing a BBSS on the same clear bit you could get:
P1 Tests Bit (Clear)
P2 Tests Bit (Clear)
P1 Sets Bit and does not branch
P2 Sets Bit and does not branch
If you do a BBSSI on the same bit, the processor locks the memory. You get
P1 locks the memory
P1 Tests Bit (Clear)
P2 Tests Bit and is Blocked
P1 Sets Bit and does not branch
P1 unlocks the memory
P2 Tests Bit (SET)
P2 Branches
Instructions do not execute in a single step for the most part and the processors can operate independently of each other.
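For contrast with the BBSS interleaving above, here is a small C sketch where the test and the set are two separate, non-atomic steps, so both threads can observe the bit clear. The thrd_yield call just widens the race window, and the whole program is only an illustration of the failure mode, not how you would ever write a lock.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <threads.h>

static volatile bool bit = false;   /* deliberately NOT atomic: this is a data race on purpose */
static _Atomic int entered = 0;

static int worker(void *arg) {
    (void)arg;
    if (!bit) {            /* "Tests Bit (Clear)" */
        thrd_yield();      /* widen the window between test and set */
        bit = true;        /* "Sets Bit and does not branch" */
        entered++;         /* both threads can get here: mutual exclusion is broken */
    }
    return 0;
}

int main(void) {
    thrd_t t1, t2;
    thrd_create(&t1, worker, NULL);
    thrd_create(&t2, worker, NULL);
    thrd_join(t1, NULL);
    thrd_join(t2, NULL);
    printf("threads that entered: %d (can be 2)\n", atomic_load(&entered));
    return 0;
}

Run it enough times and both threads will sometimes enter: that is exactly the BBSS outcome, and why an interlocked operation like BBSSI (or any atomic test-and-set) is needed for a real lock.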

Understanding multilevel feedback queue scheduling

I'm trying to understand multilevel feedback queue scheduling and I came across the following example from William Stallings' Operating Systems: Internals and Design Principles (7th ed).
I got this process:
And the result in the book is this:
I believe I'm doing the first steps right, but when I get to process E's CPU time, my next process is B, not D as in the book's example.
I can't understand whether there are n RQs and, each time a process gets CPU time, it is demoted to a lower-priority RQ, or whether, for example, if process A is in RQ1 and there are no processes in the lower RQ, the process is moved to that ready queue (this is how I am doing it).
Can someone explain to me how, in the above example, after E is processed, D gets CPU time and then E (and not B) is served?
The multilevel feedback algorithm always selects the first job of the lowest non-empty queue (i.e., the queue with the highest priority).
When job E leaves RQ1 (time 9), job D is in queue RQ2 but job B is in RQ3. Thus, D is executed. Please consider the modified figure, where the red numbers give the queue in which the job is executed.
As you can see, job B has already left RQ2 at time 9 (more precisely, it leaves RQ2 at time 6), whereas job D has just entered.
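Here is a minimal C sketch of just that selection rule (nothing else of Stallings' scheduler; the queue contents below are the hypothetical time-9 situation described above): always take the head of the highest-priority non-empty ready queue.

#include <stdio.h>

#define NQUEUES 4          /* indices 1..3 used, to match RQ1..RQ3 above */
#define MAXJOBS 8

static const char *rq[NQUEUES][MAXJOBS];
static int head[NQUEUES], tail[NQUEUES];

static void enqueue(int q, const char *name) { rq[q][tail[q]++] = name; }

/* pick the lowest-numbered (highest-priority) non-empty queue */
static int pick_queue(void) {
    for (int q = 1; q < NQUEUES; q++)
        if (head[q] < tail[q])
            return q;
    return -1;             /* all queues empty */
}

int main(void) {
    /* situation at time 9: D has just been demoted into RQ2,
     * while B was demoted further, into RQ3, back at time 6 */
    enqueue(2, "D");
    enqueue(3, "B");

    int q = pick_queue();
    if (q > 0)
        printf("next job: %s (from RQ%d)\n", rq[q][head[q]++], q);  /* prints D, from RQ2 */
    return 0;
}

B has been waiting longer, but D wins because it sits in the higher-priority queue; B only runs once RQ1 and RQ2 are empty.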

Why is the speed of a variable-length pipeline determined by the slowest stage + what's the total execution time of a program?

I am new to pipelining and I need some help regarding the fact that
The speed of a pipeline is determined by the speed of its slowest stage
Not only this: if I am given a 5-stage pipeline with stage durations of 5 ns, 10 ns, 8 ns, 7 ns and 7 ns respectively, it is said that each instruction would take 10 ns.
Can I get a clear explanation for this?
Also, let my program have 3 instructions I1, I2, I3, and let me take the clock cycle duration to be 1 ns,
such that the above stages take 5, 10, 8, 7 and 7 clock cycles respectively.
Now, according to theory, a snapshot of the pipeline would be -
But that gives me a total time of (no. of clock cycles) * (clock cycle duration) = 62 * 1 = 62 ns.
But according to theory, the total time should be (slowest stage) * (no. of instructions) = 10 * 3 = 30 ns.
Though I have an idea why the slowest stage is important (each pipeline stage has to wait for it, hence 1 instruction is completed every 10 clock cycles), the result is inconsistent when I calculate it using clock cycles. Why this inconsistency? What am I missing?
Assume a car manufacturing process that uses two-stage pipelining. Say it takes 1 day to manufacture an engine and 2 days to manufacture the rest. You can do both stages in parallel. What is your car output rate? It should be one car per 2 days: although you finish the engine in 1 day, you have to wait another day for the rest.
In your case, although the other stages finish their job in less time, you have to wait 10 ns for the whole process to be done.
Staging splits an operation into "parts" so that the parts of different operations can be worked on at once.
I'll create a smaller example here, dropping the last 2 stages of your example: 5, 10, 8 ns
Let's take two operations:
5 10 8
   5 10 8
| The first operation starts here.
| At stage 2, the second operation can start its first stage.
| However, since the stages take different amounts of time,
| the longest one determines the runtime.
| The third stage can only start after the 2nd has completed: after 15 ns.
| This is also true for the 2nd stage of the 2nd operation.
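To put numbers on that diagram, here is a small C sketch (my own simplified model, not from the question) in which an instruction may enter a stage only once it has finished the previous stage and the older instruction has vacated that stage. With stage times 5, 10 and 8 ns it reproduces the 15 ns start of both op1's stage 3 and op2's stage 2.

#include <stdio.h>

int main(void) {
    const int stage_ns[] = {5, 10, 8};     /* the 3-stage example above */
    enum { NSTAGES = 3, NOPS = 2 };
    int start[NOPS][NSTAGES], finish[NOPS][NSTAGES];

    for (int i = 0; i < NOPS; i++) {
        for (int s = 0; s < NSTAGES; s++) {
            /* ready: this op has finished its previous stage */
            int ready = (s > 0) ? finish[i][s - 1] : 0;
            /* vacated: the older op has moved out of this stage,
             * i.e. entered the next stage or left the pipeline */
            int vacated = 0;
            if (i > 0)
                vacated = (s + 1 < NSTAGES) ? start[i - 1][s + 1]
                                            : finish[i - 1][s];
            start[i][s]  = ready > vacated ? ready : vacated;
            finish[i][s] = start[i][s] + stage_ns[s];
            printf("op%d stage%d: %2d -> %2d ns\n",
                   i + 1, s + 1, start[i][s], finish[i][s]);
        }
    }
    return 0;
}

Once the pipeline is full, one operation completes every 10 ns (op2 finishes 10 ns after op1), which is exactly the slowest-stage rate.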
I am not sure about the source of your confusion. If one unit of your pipeline is taking longer, the units behind it cannot push the pipeline ahead until that unit is finished, even though they themselves are done with their work. Like DPG said, try to look at it from the car manufacturing line example; it is one of the most common ones given to explain a pipeline. If the units AHEAD of the slowest unit finish quicker, it still doesn't matter, because they have to wait until the slower unit finishes its work. So yes, your pipeline is executing 3 instructions for a total execution time of 30 ns.
Thank you all for your answers
I think I have got it clear by now.
This is what I think the answer is -
Question 1: Why is pipeline execution dependent on the slowest stage?
Well, clearly from the diagram, each stage has to wait for the slowest stage to complete.
So the time after which each instruction completes is bounded by that wait
(in my example, one instruction completes after every gap of 10 ns).
Question 2: What's the total execution time of the program?
I wanted to know how long the particular program containing 3 instructions will take to execute,
NOT how long it takes for 3 instructions to complete in steady state, which is obviously 30 ns given
that one instruction is completed every 10 ns.
Now, before I1's result comes out, it has to pass through the first 4 stages of the pipeline, which takes 4 * 10 = 40 ns.
After that, I1, I2, I3 complete in order, one every 10 ns, i.e. 30 ns (assuming no pipeline stalls).
This gives a total of 40 + 30 = 70 ns.
In fact, for an n-instruction program on a k-stage pipeline,
I think the total time is (n + k - 1) * C * T,
where C = number of clock cycles in the slowest stage
and T = clock cycle duration.
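As a sanity check of that formula, here is my own quick sketch, using the usual textbook assumption that every stage is padded to the slowest stage's duration:

#include <stdio.h>

int main(void) {
    int n = 3, k = 5, C = 10, T = 1;       /* the example from the question */

    int formula = (n + k - 1) * C * T;     /* (3 + 5 - 1) * 10 * 1 = 70 ns */

    /* step-through view: instruction i finishes once it has gone through
     * all k padded stages, with one new completion every C*T after that */
    int last_done = 0;
    for (int i = 1; i <= n; i++)
        last_done = (k + (i - 1)) * C * T; /* I1: 50 ns, I2: 60 ns, I3: 70 ns */

    printf("formula: %d ns, step-through: %d ns\n", formula, last_done);
    return 0;
}

Both give 70 ns: 40 ns to fill the first four stages, plus one completion every 10 ns for the three instructions.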
Please review my understanding and let me know if I am thinking about anything wrong, so that I can accept my own answer!