Determine the value of data paths from a given instruction - cpu-architecture

During a particular clock cycle, consider the CPU shown in the drawing.
Assume that the following initial data is present
(all values are shown in decimal, DM is Data Memory): x3 = 8, x14 = 40
During the cycle in question, assume that the following instruction is executed
(the first column is the instruction's address; all values are shown in decimal):
50788 beq x3,x14,80
How do I determine the values of L1, L2 and L3?
As far as I understand, L1 will have the program counter value.
But how do I determine the value of the program counter?
L2 will have 0 or 1 depending upon whether it uses MemtoReg.
Not sure about L3; the above is guesswork.
Any hints or pointers on how to proceed?

L1 has 50788, which is the address of the current branch instruction being executed — it is the address that was fed into the Instruction Memory that results in fetching the beq x3, x14, 80.
You can follow that after L1 there's an adder, which adds 4 to the PC value and offers that result to subsequent circuitry; adding 4 skips past this 4-byte instruction, and thus refers to the next sequential memory address, which holds the next sequential instruction.
L2 is "don't care", which means it doesn't matter whether it is 0 or 1, so different implementations of the same basic design could use either value.  Why doesn't it matter whether this signal is 0 or 1?  Because this instruction is a conditional branch, and as such does not update a register, so RegWrite will be 0, thus the value of WriteData is unimportant and is ignored.
The hardware is a union of all the components necessary to execute any instruction in the instruction set, and as such, some circuitry here or there goes unused during execution of different instructions.  Rather than turning off the unused circuitry (an advanced technique that takes work to design and implement, not employed here), the circuitry unused by any given instruction is simply allowed to execute; either way, the control signals further down the datapath are set up, based on the current instruction, to ignore the results of these unused circuits.
L3 is the branch condition signal that dynamically informs the PC update circuitry whether to take the branch or not.  Here that condition is effectively generated in the ALU from the expression x3 == x14: if the operands are equal, this control signal needs to be 1 to make the branch taken (as per the definition of the conditional branch instruction); if not, it needs to be 0 so that execution instead continues sequentially.  With the values given in the question, x3 = 8 and x14 = 40 are unequal, so L3 is 0 and this branch is not taken.
Hopefully, you can see that for conditional branch instructions, the Branch control signal is asserted (1/true) — this signal combined with Zero goes into an AND gate, which results in 1 for take the branch vs. 0 for don't take the branch, by controlling that MUX after the AND gate.  So, the only condition in which the branch can be taken [pc := pc + sxt(imm)*2] is when both Branch and Zero are true.  If Branch is false, it is not a branch instruction, so Zero doesn't matter, and if Zero is false, the branch condition is false, so Branch is overridden [pc := pc + 4].
More explicitly, the PC update circuitry says:
PC := (Branch & Zero) ? PC + sxt(imm)*2 : PC + 4;
(Using the C ternary operator; this could also be written using if-then-else.)
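As a minimal sketch in C of that same selection logic (function and parameter names are mine, not from the diagram; it assumes the RISC-V-style scale-by-2 convention discussed below):

#include <stdint.h>
#include <stdbool.h>

/* Sketch of the PC-update mux: branch is the Branch control signal,
   zero is the ALU's Zero output, imm is the sign-extended immediate field. */
static uint32_t next_pc(uint32_t pc, int32_t imm, bool branch, bool zero)
{
    if (branch && zero)
        return pc + (uint32_t)imm * 2;  /* taken: PC + sxt(imm)*2 */
    return pc + 4;                      /* not taken: next sequential instruction */
}

Since this question's operands are unequal, next_pc(50788, imm, true, false) returns 50792 for any imm.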
Zero is a rather poor choice for the name of this dynamic control signal.  I would have chosen Take or Taken instead.  I believe the name Zero is historical from older RISC architectures.
This circuitry follows the RISC-V standard of multiplying the branch target immediate field by 2 (instead of 4 as with MIPS).  That standard makes regular 4-byte instructions identical (unchanged) in the presence of compressed instructions; thus, on hardware that supports compressed instructions, no mode switching is needed, and compressed instructions can be interleaved with uncompressed ones (unlike with MIPS16 or ARM Thumb).  However, this block diagram does not provide the other features necessary to execute compressed instructions (for one, there is no increment-by-2 option diagrammed in this PC update circuitry; for another, there is no compressed-instruction expander, which would go in between the Instruction Memory output and the Decode logic).

What you're asking is very implementation-dependent. From the diagram I guess it is some MIPS or MIPS-like microarchitecture. Some real RISC implementations have (curr_instr_addr + 2*instr_size) in the PC, according to the ISA. This has historical reasons: on old machines the pipeline was 3 levels deep. So L1 has the address of the next instruction, or one of the next instructions; I can't say which exactly. If the beq instruction is in the ALU stage, then L2 has the MemToReg of the previous instruction, to determine whether a writeback phase is needed. L3 keeps the zero flag, to bypass the pipeline directly to the PC if the next instruction is a branch.

Related

If the PC register is simultaneously read and written, does its read data contain the previous data or the newly-written data?

If the PC register is simultaneously read and written, does its read data contain the previous data or the newly-written data? Based on my understanding of sequential circuits, the effect of the write command does not instantly take effect in the PC register due to propagation delay, so at the rising edge of the clock the read command will get the old value. But a corollary to my question: if this is the case, shouldn't the read command also have a delay in some sense, and couldn't it possibly read the newly-written data?
A program counter is normally special enough that it's not part of a register file with other registers. You don't have a "read command", its output is just always wired up to other parts that read it when appropriate. (i.e. when its output is stable and has the value you want). e.g. see various block diagrams of MIPS pipelines, or non-pipelined single-cycle or multi-cycle designs.
You'd normally build such a physical register out of edge-triggered flip-flops, I think. (https://en.wikipedia.org/wiki/Flip-flop_(electronics)). Note that a D flip-flop does latch the previous input as the current output on a clock edge, and then the input is allowed to change after that.
There's a timing window before the clock edge where the input has to remain stable; it can start to change a couple of gate delays after. Note the example of a shift register built by chaining D flip-flops all with the same clock signal.
If you have a problem arranging to capture a value before it starts changing, you could design in some intentional clock skew so the flip-flop reliably latches its input before you trigger the thing providing the input to change it. (But normally whatever you're triggering will itself have at least a couple gate delays before its output actually changes, hence the shift-register made of chained D flip-flops.)
That wiki article also mentions the master-slave edge-triggered D Flip-Flop that chains 2 gated (not clocked) D latches with an inverted clock, so capturing the input happens on the opposite clock edge from updating the output with the previously-captured data.
By comparison and for example, in register files for general-purpose registers in classic RISC pipelines like MIPS, IIRC it's common to build them so write happens in the first half-cycle and read happens in the second half-cycle of the ID stage. (So write-back can "forward" to decode/fetch through the register file, keeping the window of bypass-forwarding or hazards shorter than if you did it in the other order.)
This means the write data has a chance to stabilize before you need to read it.
Overall, it depends how you design it!
If you want the same clock edge to update a register with new input while also latching the old value to the output, a master-slave flip-flop will do that (capture the old input into internal state, and latch the old internal state onto the outputs).
Or you could design it so the input is captured on the clock edge, and propagates to the output after a few gate delays and stays latched there for the rest of this clock cycle (or half cycle). That would be a single D flip-flop (per bit).
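To make the edge semantics concrete, here's a toy C model of that behavior (names and structure are my own sketch, not real HDL): reads during a cycle see the value captured at the previous clock edge, and the rising edge then captures whatever the input held just before it.

#include <stdint.h>

/* q models the flip-flop output, stable and readable all cycle long;
   d models the input driven by the PC-update logic. */
struct pc_reg { uint32_t q; };

/* Rising clock edge: capture d into q.  Anything that read the PC during
   the cycle that just ended saw the old q, not this new value. */
static void clock_edge(struct pc_reg *pc, uint32_t d)
{
    pc->q = d;
}

int main(void)
{
    struct pc_reg pc = { .q = 100 };
    uint32_t fetch_addr = pc.q;   /* a "read" during this cycle: sees 100 */
    clock_edge(&pc, pc.q + 4);    /* edge: latch the adder's output, 104 */
    return (fetch_addr == 100 && pc.q == 104) ? 0 : 1;
}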

Implementation of zero flag without zero flag in program status word

In a typical processor's PSW, a zero flag is not implemented; however, it has Carry, Sign, Parity, and Overflow flags. In this architecture, how would a programmer implement JZ (jump on zero)?
You can't implement JZ after any arbitrary instruction like add reg, reg if there is no zero flag, none of the other flags carry the same information. e.g. 8-bit -128 + -128 overflows and carries to 0, but you can't distinguish that from -128 + -1 that overflows / carries to 127. Or various other combinations that you can't distinguish even with the help of SF and PF.
That's why we have a Zero Flag in normal ISAs, including 8080 or x86 whose flags and mnemonics you're using.
Did you actually just want to emulate x86 test eax,eax / jz or ARM cbz reg, target (conditional-branch on a register being zero) to test a register and jump if it was zero?
Note that 0 is the only number unsigned-below 1, so you can cmp / jc. This looks like homework so I'm not going to spell it out more than that.
Or do what MIPS does, and provide an instruction like beq $reg, $reg, target that you can use to compare-and-branch on any pair of regs. (MIPS doesn't have a PSW / FLAGS at all). MIPS has an architectural zero register that always reads as zero, so you can always branch on any other register being zero with one machine instruction.
ARM Thumb, and AArch64, provide a limited version of that: cbz and cbnz that compare/branch on a single register being zero or non-zero, separate from the ARM flags register.
But really if you're going to have a FLAGS / PSW register at all, implement a zero flag. That's one of the most basic useful things. Although to be fair, a carry flag is even more critical. If you could only have one flag, it would probably be carry because you can still test for zero efficiently. Signed compares for greater or less are harder to emulate with SF and OF, though.

Why does instruction cache alignment improve performance in set associative cache implementations?

I have a question regarding instruction cache alignment. I've heard that for micro-optimizations, aligning loops so that they fit inside a cache line can slightly improve performance. I don't see why that would do anything.
I understand the concept of cache hits and their importance in computing speed.
But it seems that in set associative caches, adjacent blocks of code will not be mapped to the same cache set. So if the loop crosses a code block the CPU should still get a cache hit since that adjacent block has not been evicted by the execution of the previous block. Both blocks are likely to remain cached during the loop.
So all I can figure is if there is truth in the claim that alignment can help, it must be from some sort of other effect.
Is there a cost in switching cache lines?
Is there a difference between a hit on a different cache line and a hit on the same cache line you're currently reading from?
Keeping a whole function (or the hot parts of a function, i.e. the fast path through it) in fewer cache lines reduces I-cache footprint. So it can reduce the number of cache misses, including on startup when most of the cache is cold. Having a loop end before the end of a cache line could give HW prefetching time to fetch the next one.
Accessing any line that's present in L1i cache takes the same amount of time. (Unless your cache uses way-prediction: that introduces the possibility of a "slow hit". See these slides for a mention and brief description of the idea. Apparently MIPS r10k's L2 cache used it, and so did Alpha 21264's L1 instruction cache with "branch target" vs. "sequential" ways in its 2-way associative 64kiB L1i. Or see any of the academic papers that come up when you google cache way prediction like I did.)
Other than that, the effects aren't so much about cache-line boundaries but rather aligned instruction-fetch blocks in superscalar CPUs. You were correct that the effects are not from things you were considering.
See Modern Microprocessors: A 90-Minute Guide! for an intro to superscalar (and out-of-order) execution.
Many superscalar CPUs do their first stage of instruction fetch using aligned accesses to their I-cache. Let's simplify by considering a RISC ISA with 4-byte instruction width (see footnote 1) and 4-wide fetch/decode/exec. (e.g. MIPS r10k, although IDK if some of the other stuff I'm going to make up reflects that microarch exactly).
...
.top_of_loop:
insn1 ; at address 16*n + 12
; 16-byte boundary here
insn2 ; at address 16*n + 0
insn3 ; at address 16*n + 4
b .top_of_loop ; at address 16*n + 8
... after loop ; at address 16*n + 12
... after loop ; at address 16*n + 0
Without any kind of loop buffer, the fetch stage has to re-fetch the loop instructions from I-cache once per iteration. But this takes a minimum of 2 cycles per iteration because the loop spans two 16-byte aligned fetch blocks. It's not capable of fetching the 16 bytes of instructions in one unaligned fetch.
But if we align the top of the loop, it can be fetched in a single cycle, allowing the loop to run at 1 cycle / iteration if the loop body doesn't have other bottlenecks.
...
nop ; at address 16*n + 12 ; NOP padding for alignment
.top_of_loop: ; 16-byte boundary here
insn1 ; at address 16*(n+1) + 0
insn2 ; at address 16*(n+1) + 4
insn3 ; at address 16*(n+1) + 8
b .top_of_loop ; at address 16*(n+1) + 12
... after loop ; at address 16*(n+2) + 0
... after loop ; at address 16*(n+2) + 4
With a larger loop that's not a multiple of 4 instructions, there's still going to be a partially-wasted fetch somewhere. It's generally best that it's not at the top of the loop, though. Getting more instructions into the pipeline sooner rather than later helps the CPU find and exploit more instruction-level parallelism, for code that isn't purely bottlenecked on instruction-fetch.
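If it helps to check the arithmetic, here's a small C helper (my own sketch, not from any toolchain) that counts the aligned 16-byte fetch blocks a run of code spans, i.e. the minimum fetch cycles per pass on this hypothetical 4-wide CPU:

#include <stdint.h>

/* Number of aligned 16-byte fetch blocks covered by n_bytes of code
   starting at addr. */
static unsigned fetch_blocks(uint32_t addr, uint32_t n_bytes)
{
    uint32_t first = addr & ~15u;                  /* block of first byte */
    uint32_t last = (addr + n_bytes - 1) & ~15u;   /* block of last byte  */
    return (last - first) / 16 + 1;
}

/* fetch_blocks(12, 16) == 2: the misaligned loop above.
   fetch_blocks(16, 16) == 1: the aligned version. */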
In general, aligning branch targets (including function entry points) by 16 can be a win (at the cost of greater I-cache pressure from lower code density). A useful tradeoff can be padding to the next multiple of 16 if you're within 1 or 2 instructions. e.g. so in the worst case, a fetch block contains at least 2 or 3 useful instructions, not just 1.
This is why the GNU assembler supports .p2align 4,,8 : pad to the next 2^4 boundary if it's 8 bytes away or closer. GCC does in fact use that directive for some targets / architectures, depending on tuning options / defaults.
In the general case for non-loop branches, you also don't want to jump near the end of a cache line. Then you might have another I-cache miss right away.
Footnote 1:
The principle also applies to modern x86 with its variable-width instructions, at least when they have decoded-uop cache misses forcing them to actually fetch x86 machine code from L1I-cache. And applies to older superscalar x86 like Pentium III or K8 without uop caches or loopback buffers (that can make loops efficient regardless of alignment).
But x86 decoding is so hard that it takes multiple pipeline stages, e.g. some just to find instruction boundaries and then feed groups of instructions to the decoders. Only the initial fetch-blocks are aligned, and buffers between stages can hide bubbles from the decoders if pre-decode can catch up.
https://www.realworldtech.com/merom/4/ shows the details of Core2's front-end: 16-byte fetch blocks, same as PPro/PII/PIII, feeding a pre-decode stage that can scan up to 32 bytes and find boundaries between up to 6 instructions IIRC. That then feeds another buffer leading to the full decode stage which can decode up to 4 instructions (5 with macro-fusion of test or cmp + jcc) into up to 7 uops...
Agner Fog's microarch guide has some detailed info about optimizing x86 asm for fetch/decode bottlenecks on Pentium Pro/II vs. Core2 / Nehalem vs. Sandybridge-family, and AMD K8/K10 vs. Bulldozer vs. Ryzen.
Modern x86 doesn't always benefit from alignment. There are effects from code alignment but they're not usually simple and not always beneficial. Relative alignment of things can matter, but usually for things like which branches alias each other in branch predictor entries, or for how uops pack into the uop cache.

Indexed addressing mode and implied addressing mode

Indexed addressing mode is usually used for accessing arrays, as arrays are stored contiguously. We have an index register which gets incremented on every iteration and which, when added to the base address, gives the array element address.
I don't understand the actual need of this addressing mode. Why can't we do this with direct addressing? We have the base address and we can just add 1 to it every time when accessing. Why do we need indexed addressing mode, which has the overhead of an index register?
I am not sure about the instruction format for implied addressing mode. Suppose we have an instruction INC AC. Is the address of AC specified in the instruction, or is there a special opcode which means 'INC AC' and we don't include the address of AC (the accumulator)?
I don't understand the actual need of this addressing mode. Why can't we do this with direct addressing?
You can; MIPS only has one addressing mode and compilers can still generate code for it just fine. But sometimes it has to use an extra shift + add instruction to calculate an address (if it's not just looping through an array).
The point of addressing modes is to save instructions and save registers, especially in 2-operand instruction sets like x86, where add eax, ecx overwrites eax with the result (eax += ecx), unlike MIPS or other 3-operand ISAs where addu $t2, $t1, $t0 does t2 = t1 + t0. On x86, that would require a copy (mov) and an add. (Or in that special case, lea edx, [eax+ecx]: x86 can copy-and-add (and shift) using the same instruction-encoding it uses for memory operands.)
Consider a histogram problem: you generate array indices in unpredictable order, and have to index an array. On x86-64, add dword [rbx + rdi*4], 1 will increment a 32-bit counter in memory using a single 4-byte instruction, which decodes to only 2 uops for the front-end to issue into the out-of-order core on modern Intel CPUs. (http://agner.org/optimize/). (rbx is the base register, rdi is a scaled index). Having a scaled index is very powerful; x86 16-bit addressing modes support 2 registers, but not a scaled index.
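In C, that histogram inner loop looks something like this (a sketch; the exact instruction selection is up to the compiler):

#include <stddef.h>
#include <stdint.h>

/* counts[data[i]]++ is what can compile to  add dword [rbx + rdi*4], 1
   on x86-64: base register = counts, scaled index = the bucket number. */
void histogram(uint32_t *counts, const uint8_t *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        counts[data[i]]++;
}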
Classic MIPS only has separate shift and add instructions, although MIPS32 did add a scaled-add instruction for address calculation. That would save an instruction here. Being a load-store machine, the loads and stores always have to be separate instructions (unlike on x86 where that add decodes as a micro-fused load+add and a store. See INC instruction vs ADD 1: Does it matter?).
Probably ARM would be a better comparison for MIPS: It's also a load-store RISC machine. But it does have a selection of addressing modes, including scaled index using the barrel shifter. So instead of needing a separate shift / add for each array index, you'd use LDR R0, [R1, R2, LSL #2], add r0, r0, #1 / str with the same addressing mode.
Often when looping through an array, it is best to just increment pointers on x86. But it's also an option to use an index, especially for loops with multiple arrays using the same index, like C[i] = A[i] + B[i]. Indexed addressing mode can sometimes be slightly less efficient in hardware, though, so when a compiler is unrolling a loop it usually should use pointers, even though it has to increment all 3 pointers separately instead of one index.
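For the C[i] = A[i] + B[i] case, the two styles look like this in C (a sketch; an optimizing compiler may rewrite one form into the other):

#include <stddef.h>

/* Indexed form: one counter serves all three arrays via indexed addressing. */
void add_idx(int *C, const int *A, const int *B, size_t n)
{
    for (size_t i = 0; i < n; i++)
        C[i] = A[i] + B[i];
}

/* Pointer form: three separate pointer increments, but a simpler
   addressing mode for each load/store. */
void add_ptr(int *C, const int *A, const int *B, size_t n)
{
    for (size_t i = 0; i < n; i++)
        *C++ = *A++ + *B++;
}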
The point of instruction-set design is not merely to be Turing complete, it's to enable efficient code that gets more work done with fewer clock cycles and/or smaller code-size, or give programmers the option of aiming for either of those goals.
The minimum threshold for a computer to be programmable is extremely low; see for example various One instruction set computer architectures. (None implemented for real, just designed on paper to show that it's possible to write programs with nothing but a subtract-and-branch-if-less-than-zero instruction, with memory operands encoded in the instruction.)
There's a tradeoff between easy to decode (especially to decode in parallel) vs. compact. x86 is horrible because it evolved as a series of extensions, often without a lot of planning to leave room for future extensions. If you're interested in ISA design decisions, have a look at Agner Fog's blog for interesting discussion about designing an ISA for high-performance CPUs that combines the best of x86 (lots of work with one instruction, e.g. memory operand as part of an ALU instruction) with the best features of RISC (easy to decode, lots of registers): Proposal for an ideal extensible instruction set.
There's also a tradeoff in how you spend the bits in an instruction word, especially in a fixed instruction width ISA like most RISCs. Different ISAs made different choices.
PowerPC uses lots of the coding space for powerful bitfield instructions like rlwinm (rotate left and mask off a window of bits), and lots of opcodes. IDK if the generally unpronounceable and hard-to-remember mnemonics are related to that...
ARM uses the high 4 bits for predicated execution of any instruction based on condition codes. It uses more bits for the barrel shifter (the 2nd source operand is optionally shifted or rotated by an immediate or a count from another register).
MIPS has relatively large immediate operands, and is basically simple.
x86 32/64-bit addressing modes use a variable-length encoding, with an extra SIB (scale/index/base) byte when there's an index, and an optional disp8 or disp32 immediate displacement. (e.g. add esi, [rax + rdx + 12340] takes 2 + 1 + 4 bytes to encode, vs. 2 bytes for add esi, [rax].)
x86 16-bit addressing modes are much more limited, and pack everything except the optional disp8/disp16 displacement into the ModR/M byte.
Suppose we have a instruction INC AC. Is the address of AC specified in the instruction or is there a special opcode which means 'INC AC' and we don't include the address of AC (accumulator)?
Yes, the machine-code format for some instructions in some ISAs includes implicit operands. Many machines have push / pop instructions that implicitly use a specific register as the stack pointer. For example, in x86-64's push rax, RAX is an explicit register operand (encoded in the low 3 bits of the one-byte opcode using the push r64 short form), while RSP is an implicit operand.
Older 8-bit CPUs often had instructions like DECA (to decrement the accumulator, A). i.e. there was a specific opcode for that register. This could be the same thing as having a DEC instruction with some bits in the opcode byte specifying which register (like x86 does before x86-64 repurposed the short INC/DEC encodings as REX prefixes: note the "N.E" (Not Encodeable) in the 64-bit mode column for dec r32). But if there's no regular pattern then it can definitely be considered an implicit operand.
Sometimes putting things into neat categories breaks down, so don't worry too much about whether using bits with the opcode byte counts as implicit or explicit for x86. It's a way of spending more opcode space to save code-size for commonly used instructions while still allowing use with different registers.
Some ISAs only use a certain register as the stack pointer by convention, with no implicit uses. MIPS is like this.
ARM32 (in ARM, not Thumb mode) also uses explicit operands in push/pop. Its push/pop mnemonics are just aliases for store-multiple decrement-before / load-multiple increment-after (LDMIA / STMDB) to implement a full-descending stack. See ARM's docs for LDM/STM which explains this, and what you can do with the general case of these instructions, e.g. LDMDB to decrement a pointer and then load (in the opposite direction of POP).

Carry bit of addition in IJVM

The IADD instruction in IJVM adds two 1-word numbers. When I add EEEEEEEE to itself I get DDDDDDDC. What happens to the carry 1? How can I get it? Is it saved in a register?
It appears that the carry-out bit is lost.
No version of the IJVM Assembly Language Specification that I've come across says anything about a carry-out bit, or carry flag.
IADD Pop two words from stack; push their sum
downeyt adds:
The MIC1 that interprets IJVM only has two condition codes, N and Z. The carry out from the ALU is not stored. The microarchitecture could be modified to store the carry out, like it stores the N and Z bits.
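If a program running on such a machine does need the carry, it can recover it in software. Here's a sketch in C of the standard unsigned-wraparound test (the function name is mine):

#include <stdint.h>
#include <stdbool.h>

/* With 32-bit wraparound, the truncated sum is below either operand exactly
   when a carry-out was lost: 0xEEEEEEEE + 0xEEEEEEEE wraps to 0xDDDDDDDC. */
static bool add_carries_out(uint32_t a, uint32_t b, uint32_t *sum)
{
    *sum = a + b;
    return *sum < a;
}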