What is the number of clock cycles required for the given sequence of instructions on a 5-stage pipelined CPU? - cpu-architecture

A 5-stage pipelined CPU has the following sequence of stages:
IF – Instruction fetch from instruction memory.
RD – Instruction decode and register read.
EX – Execute: ALU operation for data and address computation.
MA – Data memory access – for write access, the register read at the RD stage is used.
WB – Register write back.
Consider the following sequence of instructions:
I1: L R0, loc1  ; R0 <= M[loc1]
I2: A R0, R0    ; R0 <= R0 + R0
I3: S R2, R0    ; R2 <= R2 - R0
Let each stage take one clock cycle.
What is the number of clock cycles taken to complete the above sequence of
instructions starting from the fetch of I1?
So here's my solution.
Cycle:  1   2   3   4   5   6   7   8   9   10  11  12  13
I1:     IF  RD  EX  MA  WB
I2:         IF  -   -   -   RD  EX  MA  WB
I3:             IF  -   -   -   -   -   -   RD  EX  MA  WB
This way I get 13 cycles in total. Since operand forwarding is not explicitly mentioned in the question, I am assuming a register value becomes available only after the WB stage. But the options are the following:
A. 8
B. 10
C. 12
D. 15

"For write access, the register read at the RD stage is used" – this means we cannot forward an operand to the MA stage. So we can assume operand forwarding can be done for the other stages.
With data forwarding:
     T1  T2  T3  T4  T5  T6  T7  T8
I1:  IF  RD  EX  MA  WB
I2:      IF  RD  -   EX  MA  WB
I3:          IF  -   RD  EX  MA  WB
Hence, the answer will be 8.
http://www.cs.iastate.edu/~prabhu/Tutorial/PIPELINE/forward.html
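As a rough cross-check of both tables, here is a tiny Python tally (not part of the original question; the stall counts are read straight off the tables above rather than derived by the code). Each instruction after the first retires one cycle after its predecessor plus its stall cycles, so the last WB cycle is the total.

def total_cycles(stages, extra):
    # extra[i] = stall cycles instruction i adds beyond the usual one-cycle offset
    # from the previous instruction (extra[0] belongs to the first instruction and is unused)
    return stages + sum(1 + e for e in extra[1:])

print(total_cycles(5, [0, 3, 3]))  # no forwarding: WB of I1, I2, I3 at cycles 5, 9, 13
print(total_cycles(5, [0, 1, 0]))  # with forwarding: WB at cycles 5, 7, 8

Both results match the hand-drawn tables: 13 without forwarding, 8 with it.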

The given problem is based on a structural hazard because of the line
"MA – Data memory access – for write access, the register read at the RD stage is used"
and not on a data dependency, although it appears to have one. That is why nothing is mentioned about data forwarding in the question.
The structural hazard is caused by the load instruction. The execution of the next instruction cannot start until the first instruction has executed, because the effective address of the memory location referred to by M[loc1] is calculated only during the execution phase of the pipeline. Until then the bus is not freed, so the second instruction cannot be fetched. Thus the second instruction takes 2 extra clock cycles.
And the third instruction cannot start executing until the first instruction has successfully loaded the data into register R0, which causes the third instruction to take 3 extra clock cycles.
Hence, total clock cycles = (CC for I1) + (CC for I2) + (CC for I3)
= 5 + 2 + 3
= 10 clock cycles

Related

the usage of offset for storing (stw) in powerpc assembly

long long int i = 57745158985; // the C source line
0000000000100004: li r7,13
0000000000100008: lis r8,29153
000000000010000c: ori r8,r8,0x3349
0000000000100010: stw r7,24(rsp)
0000000000100014: stw r8,28(rsp)
0000000000100018: lfd fp0,24(rsp)
000000000010001c: stfd fp0,8(rsp)
Can anyone explain the part of after the ori instruction? Thanks in advance.
It looks like this is on a 32-bit big endian machine. I will assume i is a local variable.
Starting with these instructions...
li r7,13
lis r8,29153
ori r8,r8,0x3349
After these instructions:
r7 contains 13
r8 contains ((29153 << 16) | 0x3349)
The required value for i is 57745158985, which is equal to
(13<<32) | ((29153 << 16) | 0x3349)
Clearly this value is too big to fit in a single 32-bit register.
The next instructions are where the 64-bit local variable i is "created" on the stack.
stw r7,24(rsp)
stw r8,28(rsp)
rsp is the stack pointer for the function.
Here i is being initialized to its value of 57745158985.
stw r7,24(rsp) stores the four bytes of r7 starting at an offset of 24 bytes into the stack.
stw r8,28(rsp) stores the four bytes of r8 starting at an offset of 28 bytes into the stack.
So i is the 8 bytes starting from an offset of 24 on the stack.
As this is a big-endian architecture the most significant bytes are placed first in memory.
Placing the value of r7 at the lower address acts like the (13<<32) when the 8 bytes are read as one long long int.
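As a quick sanity check of that arithmetic and of the big-endian layout, here is a throwaway Python snippet (mine, not part of the original answer): rebuilding the 64-bit value from the two 32-bit halves, and packing the high word at the lower offset, reproduces 57745158985.

import struct

hi = 13                        # the value placed in r7
lo = (29153 << 16) | 0x3349    # the value placed in r8
print((hi << 32) | lo)         # 57745158985

# big-endian: the most significant word sits at the lower address (offset 24)
raw = struct.pack(">II", hi, lo)      # the 8 bytes stored at rsp+24 .. rsp+31
print(struct.unpack(">q", raw)[0])    # 57745158985 again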
These next instructions load the value of i into a floating point register and save it at a different location on the stack.
lfd fp0,24(rsp)
stfd fp0,8(rsp)
These three instructions load two 32-bit literal values into GPRs r7 and r8:
0000000000100004: li r7,13
0000000000100008: lis r8,29153
000000000010000c: ori r8,r8,0x3349
These two store the two 32-bit values to consecutive 32-bit memory locations starting at rsp (the stack pointer, r1) + 24:
0000000000100010: stw r7,24(rsp)
0000000000100014: stw r8,28(rsp)
This is a 64-bit load from the same location (i.e. rsp + 24) into floating-point register 0 (fp0). (You can't move a GPR to an FPR directly on this processor, so you go via memory.)
0000000000100018: lfd fp0,24(rsp)
This stores the same 64-bit fp0 value out to a different offset from the stack pointer.
000000000010001c: stfd fp0,8(rsp)

Determining number of cache misses for in-place reversal of an array, on a 2-way associative cache

Consider a 32-bit computer with a 2-way set-associative cache, where cache blocks are 8 words each and there are 512 sets.
Consider the following code block
int A[N];
for (int i = 0; i < N/2; i++) {
    int tmp = A[i];
    A[i] = A[N-i-1];
    A[N-i-1] = tmp;
}
Assume N is a multiple of 8. Determine a formula for the number of cache misses, with variable N.
My research :
Each block in memory will have 8 words (1 word = 32 bits = 4 bytes).
Block number b will map to set b mod 512.
There are 512 sets, where each set has two lines in the cache and each line holds one block.
So data will be transferred block-wise from memory to the cache.
Now, as per my understanding, memory will be organized in blocks of 8 words.
Suppose N = 8 * K.
Since ints are 32 bits, the array will be laid out like:
A[0],  A[1],  ..., A[7]
A[8],  A[9],  ..., A[15]
...
A[8*(K-1)], ..., A[N-1]
So the array occupies K blocks in memory.
When there is a cache miss, a complete block is transferred into the cache.
In the first iteration the code accesses A[0] and A[N-1],
so block 0 and block K-1 are brought into the cache.
In the second iteration it accesses A[1] and A[N-2], and both of them are already in the cache.
Following this logic, there are two cache misses for i = 0,
and none for i = 1, 2, ..., 7.
Then for i = 8, the code accesses A[8] and A[N-9]: again two cache misses.
For i = N/2-5 it accesses A[N/2-5] and A[N/2+4]; these two will be in different blocks, so again two cache misses.
I'm not sure if I am proceeding in the right direction. I need to come up with a formula for the number of cache misses in terms of N.
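One way to check the reasoning is to simulate the cache. Here is a short Python sketch (mine, not part of the question) of a 2-way set-associative cache with 512 sets, 8-word blocks and LRU replacement; it assumes A starts on a block boundary and counts every read and write the loop performs. The printed counts can then be compared with whatever closed-form expression you derive.

BLOCK_WORDS = 8   # 8 words of 4 bytes each = 32-byte blocks
NUM_SETS = 512
WAYS = 2

def count_misses(N):
    sets = [[] for _ in range(NUM_SETS)]   # each set: list of block tags, LRU first
    misses = 0

    def access(word_index):
        nonlocal misses
        block = word_index // BLOCK_WORDS  # assumes A starts on a block boundary
        ways = sets[block % NUM_SETS]
        if block in ways:
            ways.remove(block)             # hit: refresh its LRU position
        else:
            misses += 1                    # miss: fetch the block
            if len(ways) == WAYS:
                ways.pop(0)                # evict the least recently used block
        ways.append(block)                 # mark as most recently used

    for i in range(N // 2):
        access(i)              # tmp = A[i]
        access(N - i - 1)      # read A[N-i-1]
        access(i)              # write A[i]
        access(N - i - 1)      # write A[N-i-1]
    return misses

for N in (8, 64, 1024, 65536):
    print(N, count_misses(N))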

Calculating Q value in dqn with experience replay

Consider the Deep Q-Learning algorithm:
1 initialize replay memory D
2 initialize action-value function Q with random weights
3 observe initial state s
4 repeat
5 select an action a
6 with probability ε select a random action
7 otherwise select a = argmaxa’Q(s,a’)
8 carry out action a
9 observe reward r and new state s’
10 store experience <s, a, r, s’> in replay memory D
11
12 sample random transitions <ss, aa, rr, ss’> from replay memory D
13 calculate target for each minibatch transition
14 if ss’ is terminal state then tt = rr
15 otherwise tt = rr + γmaxa’Q(ss’, aa’)
16 train the Q network using (tt - Q(ss, aa))^2 as loss
17
18 s = s'
19 until terminated
In step 16 the value of Q(ss, aa) is used to calculate the loss. When is this Q value calculated? At the time the action was taken or during the training itself?
Since replay memory only stores < s,a,r,s' > and not the q-value, is it safe to assume the q value will be calculated during the time of training?
Yes. In step 16, when training the network, you use the loss function (tt - Q(ss, aa))^2 because you want to update the network weights to approximate the target values, computed as rr + γmaxa’Q(ss’, aa’). Therefore, Q(ss, aa) is the current estimate, computed with the current weights at training time.
Here you can find a Jupyter Notebook with a simple Deep Q-learning implementation that may be helpful.
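To make the timing concrete, here is a minimal Python sketch of steps 12-16 (illustrative names only, not tied to any particular library): q_net(state) stands for a forward pass through the current network, and replay_buffer holds (s, a, r, s_next, done) tuples. Note that Q(ss, aa) is evaluated inside the training step, with the current weights.

import random

GAMMA = 0.99

def minibatch_loss(replay_buffer, q_net, batch_size=32):
    batch = random.sample(replay_buffer, batch_size)
    loss = 0.0
    for s, a, r, s_next, done in batch:
        # target tt: stored reward plus a bootstrap term from the current network
        tt = r if done else r + GAMMA * max(q_net(s_next))
        # Q(ss, aa) is computed now, with the current weights; it was never
        # stored in the replay memory when the action was originally taken
        q_sa = q_net(s)[a]
        loss += (tt - q_sa) ** 2
    return loss / batch_size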

Why is this Mutex solution incorrect?

There are 2 processes P1 and P2 that can enter the Critical Section.
Mutex Solution Requirements:
A Mutex Zone - a critical section that can hold at most one process at a time.
No Mutual Blocking - a process outside the critical section cannot block a process inside it.
No Starvation - a process interested in entering the critical section must not have to wait forever.
Success without Contention - a process interested in entering the critical section must succeed in doing so if there are no other processes interested.
Why is the below code an incorrect solution to the Mutual Exclusion problem?
i.e. which requirement does it not satisfy?
C1 and C2 are initialised to 1.
P1: LOOP
        Non-Critical Section
        LOOP UNTIL C2 = 1 END_LOOP;
        C1 := 0;
        Critical Section
        C1 := 1;
    END
P2: LOOP
        Non-Critical Section
        LOOP UNTIL C1 = 1 END_LOOP;
        C2 := 0;
        Critical Section
        C2 := 1;
    END
To interpret the question with the best of intentions, I would have to assume that each read or write is deemed to be atomic and well-ordered. Then it makes more sense.
The problem you have here is that the inner loop in each process can independently complete. Those loops are the "wait for my turn" part of the mutex.
However, the terminating of that loop and the following assignment (which would stop the other loop from terminating) are not atomic. We therefore have the following possible scenario:
P1: exit wait loop because C2 is 1
P2: exit wait loop because C1 is 1
P2: set C2 to 0
P2: enter critical section
P1: set C1 to 0
P1: enter critical section
The above violates the first requirement, the Mutex Zone. The conditions that lead to such a violation are commonly known as a race condition.
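Here is a small Python replay of that interleaving (my own sketch; plain variables stand in for the shared flags, and the bad ordering is forced by hand rather than left to a real scheduler):

C1, C2 = 1, 1
in_critical_section = []

p1_passed_wait = (C2 == 1)      # P1: exit wait loop because C2 is 1
p2_passed_wait = (C1 == 1)      # P2: exit wait loop because C1 is 1

if p2_passed_wait:
    C2 = 0                              # P2: set C2 to 0
    in_critical_section.append("P2")    # P2: enter critical section
if p1_passed_wait:
    C1 = 0                              # P1: set C1 to 0
    in_critical_section.append("P1")    # P1: enter critical section

print(in_critical_section)      # ['P2', 'P1'] -- both inside at the same time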
You could also expect one process to starve the other. There is a possibility that P2 will always execute its critical section and acquire the lock again before P1 (which is waiting for the lock) gets a slice of CPU time. The control variable C2 would therefore always be 0 as seen by P1, or at least may be 0 for a disproportionate number of slices.
P2: exit wait loop because C1 is 1
P2: set C2 to 0
P1: spin on C2 != 1 for entire time slice
P2: enter critical section
P2: set C2 to 1
P2: exit wait loop because C1 is 1
P2: set C2 to 0
P1: spin on C2 != 1 for entire time slice
P2: enter critical section
P2: set C2 to 1
P2: exit wait loop because C1 is 1
P2: set C2 to 0
P1: spin on C2 != 1 for entire time slice
...
In most environments it's unlikely that P1 would be starved forever. But since P1 has asserted that it is waiting, it should not expect P2 to get multiple cracks at the critical section before it gets a turn.
This might also be a violation of the Success Without Contention requirement, though that's harder to argue. Still, if P2 is no longer running, consider what state C2 might be left in, and why P1 should need to know about P2 at all.

MIPS: Branch using indirect jump?

Suppose a label called L1. On MIPS, one can easily do:
beq $t1, $t2, L1
But is there a way to do the same using indirect addressing? By that, I mean using a register that holds the address where L1 is found. I know of the jr command, but I don't see how it could be used for this purpose.
beq takes a PC-relative immediate offset (a label) as its third operand, never a register or a memory address.
According to page 55 of this manual (page 63 in the PDF), the range of beq is -128 KB to +128 KB: the 16-bit signed offset counts instructions rather than bytes, and since instructions are 4 bytes long this gives 4 times the ±32 K range the field could otherwise cover.
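A quick arithmetic check of that range (a throwaway Python line, not from the manual):

print(-32768 * 4, 32767 * 4)   # -131072 .. 131068 bytes, i.e. roughly -128 KB .. +128 KB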
jr should be able to accomplish what you want, but jr is unconditional, so you invert the condition and branch around it. Load the address of L1 into a register (with la, or with lw if the address itself is stored in memory at some location XX) and jump through that register:
    la   $t0, L1          # $t0 now holds the address of L1
    bne  $t1, $t2, skip   # inverted condition: skip the jump when $t1 != $t2
    jr   $t0              # taken path: jump to the address held in $t0
skip: