MIPS R4000 Latency and Initiation Intervals - cpu-architecture

Why does the MIPS R4000 have a latency of 112 cycles and an initiation interval of 111 cycles for its square root functional unit?

The MIPS R4000 Microprocessor User’s Manual provides a somewhat detailed description of the R4000 floating point pipeline (see section 6.7). For floating point operations, the R4000 FPU provides eight operation stages (mantissa add, divide pipeline, exception test, first multiplier, second multiplier, rounding, operand shift, unpack FP numbers). Double-precision square root uses the unpack stage for the first cycle, the exception test for the second, both mantissa add and rounding for the next 108 cycles, mantissa add alone for the next cycle, and rounding alone for the last cycle.
Since the unpack FP numbers and exception test (the first two cycles) are not used in later cycles, a following square root operation can start two cycles earlier than if square root was completely unpipelined. This can be diagrammed as follows:
cycle       1    2    3    4  ...  110  111  112  113  114  115
SQRT.D      U    E  A+R  A+R  ...  A+R    A    R
SQRT.D           <----- 110 stall cycles ----->    U    E  A+R  A+R
(You can see that the initiation interval counts the cycle when the first SQRT.D is issued, i.e., an initiation interval of zero would mean parallel issue and an initiation interval of one would support back-to-back issue.)
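As a sanity check, here is a minimal Python sketch of that reservation table (my own model, with the stage usage transcribed from the description above: U = unpack, E = exception test, A = mantissa add, R = rounding). It confirms that a second SQRT.D issued 111 cycles after the first never needs a stage the first is still using:

    def sqrt_stages(issue_cycle):
        # Map absolute cycle -> set of stages one SQRT.D uses in that cycle.
        stages = {issue_cycle: {"U"}, issue_cycle + 1: {"E"}}
        for c in range(issue_cycle + 2, issue_cycle + 110):
            stages[c] = {"A", "R"}          # 108 cycles of mantissa add + rounding
        stages[issue_cycle + 110] = {"A"}   # mantissa add alone
        stages[issue_cycle + 111] = {"R"}   # rounding alone; result after 112 cycles
        return stages

    first = sqrt_stages(1)     # issued in cycle 1, finishes in cycle 112
    second = sqrt_stages(112)  # issued 111 cycles later (the initiation interval)

    # The two operations coexist only in cycle 112, where the first needs R
    # and the second needs U, so there is no structural conflict.
    clashes = {c: first[c] & second[c]
               for c in first.keys() & second.keys() if first[c] & second[c]}
    print(clashes)  # {}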


Why is the number of registers restricted?

I wonder why there must be only 32 registers.
I vaguely know the reason, but I want to understand it more precisely.
Let's look at what we have with 32 general purpose integer-oriented registers, and what would happen if we went to 256 registers:
Diminishing returns
Normal compiled code demonstrates that with 32 registers, most functions leave some of the registers unused.  So, adding more registers than 32 doesn't help most code.
Encoding size
On a register machine, binary operators like addition, subtraction, and comparison require three operands: left source, right source, and target.  On a RISC machine, each of these is a register operand, which means 3 register operands in one instruction.  So 3 x 5 bits = 15 bits are used in such an instruction on a machine with 32 registers.
If we were to increase the number of registers to, say 256, then we would need 8 bits for each register operand.  That would mean 3 x 8 bits = 24 bits.  Instructions become larger, and this decreases the efficiency of the instruction cache — a critical component to performance.
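To make that concrete, here is a quick back-of-the-envelope calculation (it assumes a hypothetical fixed 32-bit RISC-style instruction word, which the question itself doesn't specify):

    INSTRUCTION_BITS = 32  # hypothetical fixed-width RISC encoding

    for num_regs in (32, 256):
        reg_field = (num_regs - 1).bit_length()   # bits per register operand
        operand_bits = 3 * reg_field              # three register operands
        print(f"{num_regs:3d} registers: 3 x {reg_field} = {operand_bits} bits "
              f"for operands, {INSTRUCTION_BITS - operand_bits} left for the opcode")
    #  32 registers: 3 x 5 = 15 bits for operands, 17 left for the opcode
    # 256 registers: 3 x 8 = 24 bits for operands,  8 left for the opcode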
Many instruction sets do have more than 32 registers
They add specialized registers, such as a whole second set for floating point, and also another set of extra wide registers for SIMD and vector operations.
In context, these additional register sets don't necessarily suffer the code expansion described above, because the register sets don't intermix: we can have 32 integer registers and also 32 floating point registers and still keep 5-bit register fields in the instructions, because each instruction knows which register set it is using and doesn't support mixing register sets in the same instruction.
Also, to be clear, many instruction sets have used different numbers of registers: many fewer than 32, and some more.

Shift 32 bit Numbers on a 16 bit Datapath

How to Shift 32 Bit Numbers on a 16 Bit Datapath
This is a computer architecture question.
My datapath is only 16 bit wide, meaning my ALU can only process 16 bit operands at a time.
My registers are 32 bits wide and are addressable in lower and upper 16 bit portions.
Every time I read the lower half of a register, I also read an extra bit telling me whether the upper half contains any 1s at all (ref 1).
So far I have implemented the shift left logical (sll rd, rs1, rs2):
Read the lower half of the rs1 register and shift it by the amount specified in the lower rs2 register.
The bits shifted out of these 16 bits are stored in a temporary 16-bit register inside the ALU.
The shifted value is written back to the lower rd register and the status bit is set (see ref 1).
Now, if no data is written in the higher rs1 register (see ref 1) and the bits in the temp ALU register are all 0, my shift operation is done.
Otherwise a second cycle for the upper half is needed:
Read the high rs1 register and shift it by the amount stored in the lower rs2 register.
But now fill the vacated low bits with the values stored in the ALU temp register (not with 0 as in the first cycle).
Bits shifted out of the 16-bit space are dropped.
The result is written back to the higher rd register, and the rd status bit is set (see ref 1).
Example 1: Let's say rs1 is 0x00001234 and rs2 is 0x00000002 (perform a left shift by 2).
First I read the lower 16 bits of rs1 and rs2, giving 0x1234 and 0x0002. By reading them I also get the status bits of both registers, in this case 0 for rs1 and 0 for rs2, since the upper 16 bits of both registers are all 0. With the data given I can perform a left shift by 2, resulting in 0x48D0. Since no 1s were shifted out of the sll, I can store the result in the lower rd and set its status bit to 0. (This is all done in one cycle.)
Example 2: Let's say rs1 is 0x0000D234 and rs2 is 0x00000005 (perform a left shift by 5).
First I read the lower 16 bits of rs1 and rs2, giving 0xD234 and 0x0005. The status bits are again 0 for rs1 and 0 for rs2, since the upper 16 bits of both registers are all 0. With the data given I can perform a left shift by 5, resulting in 0x4680. But now I have shifted 11010 (0x1A) out of the 16-bit space. This value is stored in the ALU temp register, and since it contains 1s I have to perform another cycle.
In the second cycle I read the upper rs1 and the lower rs2, giving 0x0000 and 0x0005. I perform another shift left by 5, but now the ALU temp register is used to fill in the shifted-in positions: 0x0000 -> 0x00__ -> 0x001A. This result is then written back to the upper rd, finishing my 32-bit sll in two 16-bit cycles.
Example 3: Let's say rs1 is 0x01231234 and rs2 is 0x00000002 (perform a left shift by 2).
First I read the lower 16 bits of rs1 and rs2, giving 0x1234 and 0x0002. The status bits are in this case 1 for rs1 and 0 for rs2, since the upper 16 bits of rs1 are non-zero. Because the status bit of rs1 was set, I have to perform a second cycle even if no 1s are shifted out of the lower 16 bits. From here on it follows Example 2: write back to rd and perform a second cycle for the upper bits.
I hope these examples gave a better insight.
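To cross-check the two-cycle flow, here is a small Python model of it (my own sketch, not part of the hardware design: the status bit is approximated by testing the upper half directly, and the shift amount is assumed to come from the lower rs2 half):

    MASK16 = 0xFFFF

    def sll32_two_cycles(rs1, shamt):
        # Model the two-cycle 32-bit logical left shift on a 16-bit datapath.
        lo, hi = rs1 & MASK16, (rs1 >> 16) & MASK16
        # Cycle 1: shift the low half; bits pushed out land in the ALU temp register.
        wide = lo << shamt
        rd_lo = wide & MASK16
        temp = (wide >> 16) & MASK16        # bits shifted out of the low half
        if hi == 0 and temp == 0:           # status bit clear, nothing carried out:
            return rd_lo                    # done in one cycle (upper result is 0)
        # Cycle 2: shift the high half and fill the vacated bits from temp.
        rd_hi = ((hi << shamt) | temp) & MASK16
        return (rd_hi << 16) | rd_lo

    print(hex(sll32_two_cycles(0x00001234, 2)))  # 0x48d0     (Example 1)
    print(hex(sll32_two_cycles(0x0000D234, 5)))  # 0x1a4680   (Example 2)
    print(hex(sll32_two_cycles(0x01231234, 2)))  # 0x48c48d0  (Example 3)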
Now I want to implement a right shift operation (arithmetic and logical). But how do I do that in at most 2 cycles, and do I have to read the lower rs1 register first (including the status bit)?
Thanks for reading; this is my first question here, so please don't be too harsh on me :D
Start by reading the higher part.
In the first cycle, read the high part and do the right shift; the shifted-out bits will be left in the high part of the temp register, and the shift result is written back.
In the second cycle, read the low part, do the shift, and OR in the value present in the temp register; then this part's result is written back.
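A matching sketch of this high-half-first scheme (again my own model; arith=True sign-extends the high half to get the arithmetic variant, and shamt is assumed to be less than 16):

    MASK16 = 0xFFFF

    def shr32_two_cycles(rs1, shamt, arith=False):
        # Model the two-cycle 32-bit right shift, reading the high half first.
        lo, hi = rs1 & MASK16, (rs1 >> 16) & MASK16
        # Cycle 1: shift the high half; the bits shifted out stay in the temp
        # register, already positioned at the top of a 16-bit word.
        hi_signed = hi - 0x10000 if arith and (hi & 0x8000) else hi
        rd_hi = (hi_signed >> shamt) & MASK16
        temp = (hi << (16 - shamt)) & MASK16    # shifted-out bits, top-aligned
        # Cycle 2: shift the low half and OR in the temp register.
        rd_lo = ((lo >> shamt) | temp) & MASK16
        return (rd_hi << 16) | rd_lo

    print(hex(shr32_two_cycles(0x80004000, 4)))              # 0x8000400  (logical)
    print(hex(shr32_two_cycles(0x80004000, 4, arith=True)))  # 0xf8000400 (arithmetic)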

Why does instruction cache alignment improve performance in set associative cache implementations?

I have a question regarding instruction cache alignment. I've heard that for micro-optimizations, aligning loops so that they fit inside a cache line can slightly improve performance. I don't see why that would do anything.
I understand the concept of cache hits and their importance in computing speed.
But it seems that in set associative caches, adjacent blocks of code will not be mapped to the same cache set. So if the loop crosses a block boundary, the CPU should still get a cache hit, since the adjacent block has not been evicted by the execution of the previous block. Both blocks are likely to remain cached during the loop.
So all I can figure is if there is truth in the claim that alignment can help, it must be from some sort of other effect.
Is there a cost in switching cache lines?
Is there a difference between a hit on a new cache line and a hit on the cache line you're currently reading from?
Keeping a whole function (or the hot parts of a function, i.e. the fast path through it) in fewer cache lines reduces I-cache footprint. So it can reduce the number of cache misses, including on startup when most of the cache is cold. Having a loop end before the end of a cache line could give HW prefetching time to fetch the next one.
Accessing any line that's present in L1i cache takes the same amount of time. (Unless your cache uses way-prediction: that introduces the possibility of a "slow hit". See these slides for a mention and brief description of the idea. Apparently MIPS r10k's L2 cache used it, and so did Alpha 21264's L1 instruction cache with "branch target" vs. "sequential" ways in its 2-way associative 64kiB L1i. Or see any of the academic papers that come up when you google cache way prediction like I did.)
Other than that, the effects aren't so much about cache-line boundaries but rather aligned instruction-fetch blocks in superscalar CPUs. You were correct that the effects are not from things you were considering.
See Modern Microprocessors: A 90-Minute Guide! for an intro to superscalar (and out-of-order) execution.
Many superscalar CPUs do their first stage of instruction fetch using aligned accesses to their I-cache. Let's simplify by considering a RISC ISA with 4-byte instruction width (see footnote 1) and 4-wide fetch/decode/exec. (e.g. MIPS r10k, although IDK if some of the other stuff I'm going to make up reflects that microarch exactly).
...
.top_of_loop:
    insn1            ; at address 16*n + 12
    ; 16-byte boundary here
    insn2            ; at address 16*n + 0
    insn3            ; at address 16*n + 4
    b .top_of_loop   ; at address 16*n + 8
    ... after loop   ; at address 16*n + 12
    ... after loop   ; at address 16*n + 0
Without any kind of loop buffer, the fetch stage has to fetch the loop instructions from the I-cache once for every time the loop executes. But this takes a minimum of 2 cycles per iteration because the loop spans two 16-byte aligned fetch blocks. The CPU isn't capable of fetching the 16 bytes of instructions in one unaligned fetch.
But if we align the top of the loop, it can be fetched in a single cycle, allowing the loop to run at 1 cycle / iteration if the loop body doesn't have other bottlenecks.
...
    nop              ; at address 16*n + 12 ; NOP padding for alignment
.top_of_loop:        ; 16-byte boundary here
    insn1            ; at address 16*n + 0
    insn2            ; at address 16*n + 4
    insn3            ; at address 16*n + 8
    b .top_of_loop   ; at address 16*n + 12
    ... after loop   ; at address 16*n + 0
    ... after loop   ; at address 16*n + 4
With a larger loop that's not a multiple of 4 instructions, there's still going to be a partially-wasted fetch somewhere. It's generally best that it's not at the top of the loop, though. Getting more instructions into the pipeline sooner rather than later helps the CPU find and exploit more instruction-level parallelism, for code that isn't purely bottlenecked on instruction fetch.
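As a toy illustration of the fetch math (my own model for the hypothetical 4-wide machine above, not any specific CPU):

    FETCH_BLOCK = 16        # bytes per aligned fetch block
    INSN_SIZE = 4           # fixed 4-byte instructions

    def fetch_cycles_per_iteration(start_addr, num_insns):
        # Count the aligned fetch blocks touched by one loop iteration.
        first_block = start_addr // FETCH_BLOCK
        last_block = (start_addr + num_insns * INSN_SIZE - 1) // FETCH_BLOCK
        return last_block - first_block + 1

    # A 4-instruction loop starting at offset 12 straddles two fetch blocks...
    print(fetch_cycles_per_iteration(12, 4))  # 2 fetch cycles per iteration
    # ...but aligned to a 16-byte boundary it fits in one.
    print(fetch_cycles_per_iteration(16, 4))  # 1 fetch cycle per iteration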
In general, aligning branch targets (including function entry points) by 16 can be a win (at the cost of greater I-cache pressure from lower code density). A useful tradeoff can be padding to the next multiple of 16 if you're within 1 or 2 instructions. e.g. so in the worst case, a fetch block contains at least 2 or 3 useful instructions, not just 1.
This is why the GNU assembler supports .p2align 4,,8: pad to the next 2^4 boundary if it's 8 bytes away or closer. GCC does in fact use that directive for some targets / architectures, depending on tuning options / defaults.
In the general case for non-loop branches, you also don't want to jump near the end of a cache line. Then you might have another I-cache miss right away.
Footnote 1:
The principle also applies to modern x86 with its variable-width instructions, at least when decoded-uop cache misses force it to actually fetch x86 machine code from the L1i cache. And it applies to older superscalar x86 like Pentium III or K8 without uop caches or loop buffers (which can make loops efficient regardless of alignment).
But x86 decoding is so hard that it takes multiple pipeline stages, e.g. some to simply find instruction boundaries and then feed groups of instructions to the decoders. Only the initial fetch blocks are aligned, and buffers between stages can hide bubbles from the decoders if pre-decode can catch up.
https://www.realworldtech.com/merom/4/ shows the details of Core2's front-end: 16-byte fetch blocks, same as PPro/PII/PIII, feeding a pre-decode stage that can scan up to 32 bytes and find boundaries between up to 6 instructions IIRC. That then feeds another buffer leading to the full decode stage which can decode up to 4 instructions (5 with macro-fusion of test or cmp + jcc) into up to 7 uops...
Agner Fog's microarch guide has some detailed info about optimizing x86 asm for fetch/decode bottlenecks on Pentium Pro/II vs. Core2 / Nehalem vs. Sandybridge-family, and AMD K8/K10 vs. Bulldozer vs. Ryzen.
Modern x86 doesn't always benefit from alignment. There are effects from code alignment but they're not usually simple and not always beneficial. Relative alignment of things can matter, but usually for things like which branches alias each other in branch predictor entries, or for how uops pack into the uop cache.

Amdahl's law example

Can someone help me with this example please and show me how to work the second part?
The question is:
i. If one third of a weather prediction algorithm is inherently serial and the remainder parallelizable, what is the minimum number of cores needed to guarantee a 150% speedup over a single-core implementation?
ii. Your boss revises the figure to 200%. What is your new answer?
Thanks very much in advance !!
Guess: If the algorithm is 1/3 serial and 2/3 parallel...I would think that each core you added would give you a 66% increase in performance...So for 150% increase, you'd need 3 more cores, and for a 200% increase, you'd need 4.
This is a guess. Your textbook might be more helpful :)
If the algorithm runs on a single core and takes 90 minutes then 30 minutes is for the serial part and 60 minutes for the parallel part.
Add a CPU:
30 minutes is for the serial part and 30 for the parallel part (the 60 minutes of parallel work are split across the two cores).
90 / 60 = 1.5, i.e. a 150% speedup.
I am a bit late, but here are the answers:
1) 150% speedup -> at least 2 cores required, as dbasnett said;
2) 200% speedup -> at least 4 cores required, based on Amdahl's law:

Speedup(N) = 1 / ((1 - P) + P/N)

Here, 90 minutes overall are required to perform the calculation. P is the enhanced part of the algorithm (the parallelizable part), which is 2/3, and N is the number of cores. With only one core:

Speedup(1) = 1 / ((1 - 2/3) + (2/3)/1) = 1

You get 1, which means 100%, which is how the algorithm performs the standard way (without multi-core acceleration and therefore no parallelization speedup).
Now, we must find the N for which the equation equals 2, where 2 means that the algorithm performs in half the time (45 minutes instead of 90 when there's no parallelization) and therefore with a 200% speedup:

2 = 1 / (1/3 + (2/3)/N)

Since:

1/3 + (2/3)/N = 1/2, i.e. (2/3)/N = 1/6

We see that:

N = 4

So with 4 cores computing the 2/3 of the algorithm in parallel you get a 200% speedup. The same goes for 150%: you will get N = 2, as dbasnett already told you.
Pretty simple.
Note that a complex algorithm may imply further divisions of its parallelizable part (and in theory you can have a different number of processing units per parallelizable part running concurrently).
You can further look at Wikipedia (there's also an example):
http://en.wikipedia.org/wiki/Amdahl%27s_law#Description
Anyway, the principle is the same:
Let T be the time an algorithm needs to execute in order to complete, A the serial part of it, B its parallelizable part, and N the number of parallel CPUs. You can divide B into further small sections and perform the calculation on each part:

T = A + B, with B = C + D + G

Speedup(N) = T / (A + C/N + D/N + G/N)

You may, for C, D, G, e.g. adopt M CPUs instead of N (the speedup will of course differ if M != N).
And at the end, you arrive at a point where having more CPUs doesn't matter anymore, since:

C/N + D/N + G/N -> 0 as N -> infinity

And your algorithm speedup will at most tend to the total execution time (T) divided by the execution time of the serial part only (A):

max Speedup = T / A
Therefore parallel calculation comes really handy only when you have low execution time for the serial part of your algorithm.
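If you'd rather compute this than derive it by hand, here is a small sketch using exact rational arithmetic (the function is mine, not from the answers above; only Amdahl's formula itself is taken from them):

    import math
    from fractions import Fraction

    def min_cores(parallel_fraction, target_speedup):
        # Smallest N such that 1 / ((1 - P) + P/N) >= target (Amdahl's law).
        p, s = Fraction(parallel_fraction), Fraction(target_speedup)
        serial = 1 - p
        if serial > 0 and s * serial >= 1:   # target beyond the 1/serial limit
            return None
        return math.ceil(p / (1 / s - serial))

    print(min_cores(Fraction(2, 3), Fraction(3, 2)))  # 2 cores for a 150% speedup
    print(min_cores(Fraction(2, 3), 2))               # 4 cores for a 200% speedup
    print(min_cores(Fraction(2, 3), 3))               # None: 3x is the hard limit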

Actual note duration from MIDI duration

I'm currently implementing an application to perform some tasks on MIDI files, and my current problem is to output the notes I've read to a LilyPond file.
I've merged note_on and note_off events into single note objects with an absolute start and an absolute duration, but I don't really see how to convert that duration into actual music notation. I've guessed that a duration of 376 is a quarter note in the file I'm reading because I know the song, and obviously 188 is then an eighth note, but this certainly does not generalise to all MIDI files.
Any ideas?
By default a MIDI file is set to a tempo of 120 bpm and the MThd chunk in the file will tell you the resolution in terms of "pulses per quarter note" (ppqn).
If the ppqn is, say, 96, then a delta of 96 ticks is a quarter note.
Should you be interested in the real duration (in seconds) of each sound, you should also consider the tempo, which can be changed by an event "FF 51 03 tt tt tt"; the three bytes are the microseconds per quarter note.
With these two values you should find what you need. Beware that durations in a MIDI file can be approximate, especially if the MIDI file is a recording of a human player.
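For example, a minimal conversion helper (my own sketch; 500000 microseconds per quarter note corresponds to the 120 bpm default mentioned above):

    def ticks_to_seconds(ticks, ppqn, usec_per_quarter=500_000):
        # ppqn comes from the MThd chunk; usec_per_quarter from the most
        # recent "FF 51 03" tempo event (500000 us = 120 bpm is the default).
        return ticks * usec_per_quarter / (ppqn * 1_000_000)

    print(ticks_to_seconds(96, 96))  # 0.5  -> a quarter note at 120 bpm
    print(ticks_to_seconds(48, 96))  # 0.25 -> an eighth note at 120 bpm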
I've put together a C library to read/write midifiles a long time ago: https://github.com/rdentato/middl in case it may be helpful (it's quite some time I don't look at the code, feel free to ask if there's anything unclear).
I would suggest the following approach:
Choose a "minimal note" that is compatible with your division (e.g. 1/128) and use it as a sort of grid.
Align each note to the closest grid line (i.e. to the closest integer multiple of the minimal note).
Convert it to standard notation (e.g. a quarter note, a dotted eighth note, etc.).
In your case, take 1/32 as the minimal note and 384 as the division (so the minimal note is 48 ticks). For your note of 376 ticks you'll have 376/48 = 7.83, which you round to 8 (the closest integer), and 8/32 = 1/4.
If you find a note whose duration is 193 ticks you can see it's a 1/8 note as 193/48 is 4.02 (which you can round to 4) and 4/32 = 1/8.
Continuing this reasoning you can see that a note of duration 671 ticks should be a double dotted quarter note.
In fact, 671 should be approximated to 672 (the closest multiple of 48) which is 14*48. So your note is a 14/32 -> 7/16 -> (1/16 + 2/16 + 4/16) -> 1/16 + 1/8 + 1/4.
If you are comfortable with binary numbers, you could notice that 14 is 1110 in binary and, from there, directly derive the presence of the 1/4, 1/8 and 1/16.
As a further example, a note of 480 ticks of duration is a quarter note tied with a 1/16 note since 480=48*10 and 10 is 1010 in binary.
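Putting the grid idea into code, a rough sketch (the function is mine; it assumes a 384-tick division, a 1/32 grid, no triplets, and durations shorter than a whole note):

    def duration_to_notation(ticks, division=384, grid=32):
        # Snap a tick duration to the 1/grid grid, then decompose it into
        # tied power-of-two note values, e.g. 671 -> ['1/4', '1/8', '1/16'].
        ticks_per_grid = division * 4 // grid   # division = ticks per quarter
        units = round(ticks / ticks_per_grid)   # closest grid line
        parts = []
        bit = grid // 2                         # largest value: a half note
        while bit:
            if units & bit:
                parts.append(f"1/{grid // bit}")
            bit //= 2
        return parts

    print(duration_to_notation(376))  # ['1/4']                  8/32
    print(duration_to_notation(193))  # ['1/8']                  4/32
    print(duration_to_notation(671))  # ['1/4', '1/8', '1/16']  14/32
    print(duration_to_notation(480))  # ['1/4', '1/16']         10/32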
Triplets and other groups would make things a little bit more complex. It's not by chance that the most common division values are 96 (3*2^5), 192 (3*2^6) and 384 (3*2^7); this way triplets can be represented with an integer number of ticks.
You might have to guess or simplify in some situations, that's why no "midi to standard notation" program can be 100% accurate.