reorder buffer problem (computer architecture Udacity course) - cpu-architecture

can someone explains to me why the issue time for instruction I5 is cycle 6 and not cycle 5 according to the solution manual provided to that problem.
Notes: 1) the problem and its published solution is mentioned below 2) this problem is part of the problem set for the computer architecture course on Udacity
problem:
Using Tomasulo's algorithm, for each instruction in the following
sequence determine when (in which cycle, counting from the start) it
issues, begins execution, and writes its result to the CDB. Assume
that the result of an instruction can be written in the cycle after it
finishes its execution, and that a dependent instruction can (if
selected) begin its execution in the cycle after that. The execution
time of all instructions is two cycles, except for multiplication
(which takes 4 cycles) and division (which takes 8 cycles). The
processor has one multiply/divide unit and one add/subtract unit. The
multiply/divide unit has two reservation stations and the add/subtract
unit has four reservation stations. None of the execution units is
pipelined – each can only be executing one instruction at a time. If
a conflict for the use of an execution unit occurs when selecting
which instruction should start to execute, the older instruction (the
one that appears earlier in program order) has priority. If a conflict
for use of the CBD occurs, the result of the add/subtract unit has
priority over the result of the multiply/divide unit. Assume that at
start all instructions are already in the instruction queue, but none
has yet been issued to any reservation stations. The processor can
issue only one instruction per cycle, and there is only one CDB for
writing results. A way of handling exceptions in the processor
described above would be to simply delete all instructions from
reservation stations and the instruction queue, set all RAT entries to
point to the register file, and jump to the exception handler as soon
as possible (i.e. in the cycle after the one in which divide-by-zero
is detected). 1)Find the cycle time of each instruction for Issue,
Exection, and Write back stages. 2)What would be printed in the
exception handler if exceptions are handled this way?
provided solution:
timing diagram
solution for second question
The exception occurs in cycle 20, so the cycle in which we start executing the exception handler
is cycle 21. At that time, the processor has completed instructions I1-I4, but it has also completed
instructions I6 and I10. As a result, register F4 in the register file would have the result of I10,
which is -1 (5-6). The exception handler would print 2,0, -2, -1, which is incorrect.

Is there a limited ROB or RS (scheduler) size that would stop the front-end from issuing more instructions until some have dispatched to make more room (RS size), or until some have retired (ROB size)? It's common for the front-end's best case to be better throughput than the back-end, precisely so the back-end can get a look at possible independent instructions later on. But there has to be some limit to how many un-executed instructions can be tracked by the back-end.
In this case, yes:
The multiply/divide unit has two reservation stations and the add/subtract unit has four reservation stations
So I think that's the limiting factor there: the first two instructions are mul and div, and the first of those finishes on cycle 5. Apparently this CPU doesn't free the RS entry until the cycle after writeback. (And instead of one unified scheduler, it has queues (reservation stations) for each kind of execution unit separately.)
Some real CPUs may be more aggressive, e.g. I think Intel CPUs can free an RS entry sooner, even though they sometimes need to replay a uop if it was optimistically dispatched early in anticipation of a cache hit (when an input is the result of a load): Are load ops deallocated from the RS when they dispatch, complete or some other time?

Related

Do memory instructions pass through the load-store queue and issue queue in the microarchitecture

What is the difference between the issue queue and lsq queue for
memory instructions? Do memory instructions pass through both queues, or do they only pass
through the lsq queue.
If they pass through both queues what is their order?
I'm assuming you use the arm-like nomenclature here so the issue queue is what Intel calls RS (reservation station) and by issue you mean sending a uop ready for execution.
The answer is that memory instructions need to pass both. All instructions need to be issued (except the ones that can be eliminated without execution, for example register moves, zero idioms, nops, etc..). Let's rephrase - all instructions that need to go through an ALU need to go through the issue process first. Memory instructions will simply use that step to calculate their addresses.
This is true for loads, for stores there is usually an internal split into store-address and store-data, so the store-address will behave like a load in that sense and calculate its address during that step.
There is usually a dedicated execution port for that and dedicated execution units because the address calculation usually follows one of few specific addressing modes (each architecture has a different set of these), but aside from that the execution needs to follow the same rules like any other operation in the CPU - it needs to have its sources ready and read from the register file or bypassed from an in flight operation, it needs to get arbitrated when the execution port is free and prioritized by the same aging rules, so it makes sense that it uses the common path.
Once the memory operation has finished execution, it will be sent to the LSU (load-store unit, or the DCU, data-cache unit on Intel) and perform the actual memory access using the generated address. The LSU pipe will take care of the address translation, TLB lookups, the page walk if needed (though this is sometimes done in a dedicated unit), the address range and property checks, the cache lookup (if cacheable) and sending a miss to the next cache level or memory if needed. It may also trigger prefetches as part of the process.
For a load, when the LSU pipe has completed (which may require multiple passes and wakeups if the data was not available in the L1), the LSU will signal the issue queue again in order to wakeup anyone who was depended on the result.
For a store, store-address may fetch the line to the cache in advance as an optimization but the actual next step is usually to wakeup after retirement (since stores may not be dispatched to memory while speculative, unless you have some tricks to handle that).
It's also worth to mention that some CPUs try to optimize loads that can forward the data directly from prior stores instead of fetching it from the cache/memory. This can include forwarding (very common) or memory renaming (less common). The former is usually handled by the LSU internally, but the latter can be done much earlier and without the LSU (though the LSU pipe is usually still activated to validate the result).

What happens with nested branches and speculative execution?

Alright, so I know that if a particular conditional branch has a condition that takes time to compute (memory access, for instance), the CPU assumes a condition result and speculatively executes along that path. However, what would happen if, along that path, yet another slow conditional branch pops up (assuming, of course, that the first condition hasn't been resolved yet and the CPU can't just commit the changes)? Does the CPU just speculate inside the speculation? What happens if the last condition is mispredicted but the first wasn't? Does it just rollback all the way?
I'm talking about something like this:
if (value_in_memory == y){
// computations
if (another_val_memory == x){
//computations
}
}
Speculative execution is the regular state of execution, not a special mode that an out of order CPU enters when it sees a branch and then leaves when the branch is no longer in flight.
This is easier to see if you consider that it's not just branches that can fault, but many instructions, including those that access memory, have restrictions on their input values, etc. So any substantial out of order execution implies constant speculation, and CPUs are built around that idea.
So "nested branches" doesn't end up being special in that sense.
Now, modern CPUs have a variety of methods for quick branch misprediction recovery, faster than recovery from other types of faults1. For example they may snapshot the state of the register mapping at some branches, to allow recovery to start before the branch is at the head of the reorder buffer. Since it is not always feasible to snapshot at all branches, there might be complicated heuristics involved to decide where to take snapshots.
I mention this last part because it is one way in which nested branches might matter: when there are lots of branches in flight, you might hit some microarchitectural limits related to the tracking of these branches for recovery purposes. For more details, you can look through patents for "branch order buffer" (for Intel techniques, but there are no doubt others).
1 The basic recovery method is keep executing until the faulting instruction is the next to retire, and then throw away all younger instructions. In the context of branch mispredictions, this means you could actually suffer two or more mispredictions only the oldest of which actually takes effect: e.g., a younger branch mispredicts, and while executing up to that branch (at which point recovery can occur), another mispredict occurs, so the younger one ends up getting discarded.
(Maybe not a complete answer, but I had some of this written when #BeeOnRope posted an answer. Posting this anyway for some more links and technical details in case anyone's curious.)
Everything is always speculative until it reaches retirement and becomes non-speculative, definitely happened, part of the architectural state.
e.g. any load might fault with a bad address, any div might trap on divide by zero. See also Out-of-order execution vs. speculative execution That and What exactly happens when a skylake CPU mispredicts a branch? mention that branch mispredicts are handled specially, because they're expected to be frequent. Fast-recovery can start before a mis-predicted branch reaches retirement, unlike the behaviour for a faulting load for example. (That's part of why Meltdown is exploitable.)
So even "regular" instructions are executed speculatively before being commited, and the only distinction between them is a human-made distinction, not computer-made? I presume, then, that the CPU stores multiple, possible rollback points? For instance if I have load instructions that may lead to page faults or simply use stale values, inside a conditional branch, the CPU identifies such instructions and scenarios and saves a state for each of them? I feel like I misunderstood because this may lead to a lot of storing register states and complicated dependencies.
The retirement state is always consistent so you can always roll back to there and discard all in-flight work, e.g. if an external interrupt arrives you want to handle it without waiting for a chain of a dozen cache miss loads to all execute. When an interrupt occurs, what happens to instructions in the pipeline?
This tracking basically happens for free or is something you need to do anyway to be able to detect which instruction faulted, not just that there was a problem somewhere. (This is called "precise exceptions")
The real distinction humans can usefully make is speculation that has a real chance of being wrong during execution of non-error cases. If your code gets a bad pointer, it doesn't really matter how it performs; it's going to page-fault and that's going to be very slow compared to local OoO exec details.
You're talking about a modern out-of-order (OoO) execution (not just fetch) CPU, like modern Intel or AMD x86, high-end ARM, MIPS r10000, etc.
The front-end is in-order (with speculation down predicted paths), and so is commit (aka retirement) from the out-of-order back-end into non-speculative retirement state. (aka known-good architectural state).
The CPU uses two major structures to track instructions (or on x86, uops = parts of instructions) in the back-end. The last stage of the front-end (after fetch / decode) allocates/renames instructions and adds them into both of these structures at once.
RS = Reservation Station = scheduler: not-yet-executed instructions, waiting for an execution unit. The RS tracks dependencies and sends the oldest-ready uops to execution units that are ready.
ROB = ReOrder Buffer: not-yet-retired instructions. Instructions enter and leave in-order so it can just be a circular buffer.
Includes a flag to mark each entry as executed or not, set once the RS has sent it to an execution unit which reports success. The oldest instructions in the ROB that all have their done-executing bit set can "retire".
Also includes a flag which indicates "fault if this reaches retirement". This avoids spending time handling page faults from load instruction on the wrong path of execution (that might well have pointers into an unmapped page), for example. Either in the shadow of a branch mispredict, or just after another instruction (in program order) that should have faulted first but OoO exec got to it later.
(I'm also leaving out register-renaming onto a large physical register file.
That's the "rename" part. Allocate includes choosing which execution port an instruction will use, and reserving a load or store buffer entry for memory instructions.)
(There's also a store-buffer; stores don't write directly to L1d cache, they write to the store buffer. This makes it possible to speculatively execute stores and still roll back without them becoming visible to other cores. It also decouples cache-miss stores from execution. Once a store instruction retires, the store-buffer entry "graduates" and is eligible to commit to L1d cache, once MESI gets exclusive access to the cache line, and once memory-ordering rules are satisfied.)
Execution units detect whether an instruction should fault, or was mis-speculated and should roll back, but don't necessarily act on that until the instruction reaches retirement.
In-order retirement is the step that recovers program-order after OoO exec, including the case of exceptions of mis-speculation.
Terminology: Intel calls it "issue" when instructions are sent from the front-end into the ROB + RS. Other computer architecture people often call that "dispatch".
Sending uops from the RS to execution units is called "dispatch" by Intel, "issue" by other people.

How the speculative load and store happen in modern Intel processor? [duplicate]

Given the small program shown below (handcrafted to look the same from a sequential consistency / TSO perspective), and assuming it's being run by a superscalar out-of-order x86 cpu:
Load A <-- A in main memory
Load B <-- B is in L2
Store C, 123 <-- C is L1
I have a few questions:
Assuming a big enough instruction-window, will the three instructions be fetched, decoded, executed at the same time? I assume not, as that would break execution in program order.
The 2nd load is going to take longer to fetch A from memory than B. Will the later have to wait until the first is fully executed? Will the fetching of B only start after Load A is fully executed? or until when does it have to wait?
Why would the store have to wait for the loads? If yes, will the instruction just wait to be committed in the store buffer until the loads finish or after decoding it will have to sit and wait for the loads?
Thanks
Terminology: "instruction-window" normally means out-of-order execution window, over which the CPU can find ILP. i.e. ROB or RS size. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths
The term for how many instructions can go through the pipeline in a single cycle is pipeline width. e.g. Skylake is 4-wide superscalar out-of-order. (Parts of its pipeline, like decode, uop-cache fetch, and retirement, are wider than 4 uops, but issue/rename is the narrowest point.)
Terminology: "wait to be committed in the store buffer" store data + address gets written into the store buffer when a store executes. It commits from the store buffer to L1d at any point after retirement, when it's known to be non-speculative.
(In program order, to maintain the TSO memory model of no store reordering. A store buffer allows stores to execute inside this core out of order but still commit to L1d (and become globally visible) in-order. Executing a store = writing address + data to the store buffer.)
Can a speculatively executed CPU branch contain opcodes that access RAM?
Also what is a store buffer? and
Size of store buffers on Intel hardware? What exactly is a store buffer?
The front-end is irrelevant. 3 consecutive instructions might well be fetched in the same 16-byte fetch block, and might go through pre-decode and decode in the same cycle as a group. And (also or instead) issue into the out-of-order back-end as part of a group of 3 or 4 uops. IDK why you think any of that would cause any potential problem.
The front end (from fetch to issue/rename) processes instructions in program order. Processing simultaneously doesn't put later instructions before earlier ones, it puts them at the same time. And more importantly, it preserves the information of what program order is; that's not lost or discarded because it matters for instructions that depend on the previous one1!
There are queues between most pipeline stages, so (for example on Intel Sandybridge) instructions that pre-decode as part of a group of up-to-6 instructions might not hit the decoders as part of the same group of up-to-4 (or more with macro-fusion). See https://www.realworldtech.com/sandy-bridge/3/ for fetch, and the next page for decode. (And the uop cache.)
Executing (dispatching uops to execution ports from the out-of-order scheduler) is where ordering matters. The out-of-order scheduler has to avoid breaking single threaded code.2
Usually issue/rename is far ahead of execution, unless you're bottlenecked on the front-end. So there's normally no reason to expect that uops that issued together will execute together. (For the sake of argument, let's assume that the 2 loads you show do get dispatched for execution in the same cycle, regardless of how they got there via the front-end.)
But anyway, there's no problem here starting both loads and the store the same time. The uop scheduler doesn't know whether a load will hit or miss in L1d. It just sends 2 load uops to the load execution units in a cycle, and a store-address + store-data uop to those ports.
[load ordering]
This is the tricky part.
As I explained in an answer + comments on your last question, modern x86 CPUs will speculatively use the L2 hit result from Load B for later instructions, even though the memory model requires that this load happens after Load A.
But if no other cores write to cache line B before Load A completes, then nothing can tell the difference. The Memory-Order Buffer takes care of detecting invalidations of cache lines that were loaded from before earlier loads complete, and doing a memory-order mis-speculation pipeline flush (rollback to retirement state) in the rare case that allowing load re-ordering could change the result.
Why would the store have to wait for the loads?
It won't, unless the store-address depends on a load value. The uop scheduler will dispatch the store-address and store-data uops to execution units when their inputs are ready.
It's after the loads in program order, and the store buffer will make it even farther after the loads as far as global memory order is concerned. The store buffer won't commit the store data to L1d (making it globally visible) until after the store has retired. Since it's after the loads, they'll have also retired.
(Retirement is in-order to allow precise exceptions, and to make sure no previous instructions took an exception or were a mispredicted branch. In-order retirement allows us to say for sure that an instruction is non-speculative after it retires.)
So yes, this mechanism does ensure that the store can't commit to L1d until after both loads have taken data from memory (via L1d cache which provides a coherent view of memory to all cores). So this prevents LoadStore reordering (of earlier loads with later stores).
I'm not sure if any weakly-ordered OoO CPUs do LoadStore reordering. It is possible on in-order CPUs when a cache-miss load comes before a cache-hit store, and the CPU uses scoreboarding to avoid stalling until the load data is actually read from a register, if it still isn't ready. (LoadStore is a weird one: see also Jeff Preshing's Memory Barriers Are Like Source Control Operations). Maybe some OoO exec CPUs can also track cache-miss stores post retirement when they're known to be definitely happening, but the data just still hasn't arrived yet. x86 doesn't do this because it would violate the TSO memory model.
Footnote 1: There are some architectures (typically VLIW) where bundles of simultaneous instructions are part of the architecture in a way that's visible to software. So if software can't fill all 3 slots with instructions that can execute simultaneously, it has to fill them with NOPs. It might even be allowed to swap 2 registers with a bundle that contained mov r0, r1 and mov r1, r0, depending on whether the ISA allows instructions in the same bundle to read and write the same registers.
But x86 is not like that: superscalar out-of-order execution must always preserve the illusion of running instructions one at a time in program order. The cardinal rule of OoO exec is: don't break single-threaded code.
Anything that would violate this can only be done with checking for hazards, or speculatively with rollback upon detection of mistakes.
Footnote 2: (continued from footnote 1)
You can fetch / decode / issue two back-to-back inc eax instructions, but they can't execute in the same cycle because register renaming + the OoO scheduler has to detect that the 2nd one reads the output of the first.

Micro scheduler for real-time kernel in embedded C applications?

I am working with time-critical applications where the microsecond counts. I am interested to a more convenient way to develop my applications using a non bare-metal approach (some kind of framework or base foundation common to all my projects).
A considered real-time operating system such as RTX, Xenomai, Micrium or VXWorks are not really real-time under my terms (or under the terms of electronic engineers). So I prefer to talk about soft-real-time and hard-real-time applications. An hard-real-time application has an acceptable jitter less than 100 ns and a heat-beat of 100..500 microseconds (tick timer).
After lots of readings about operating systems I realized that typical tick-time is 1 to 10 milliseconds and only one task can be executed each tick. Therefore the tasks take usually much more than one tick to complete and this is the case of most available operating systems or micro kernels.
For my applications a typical task has a duration of 10..100 microseconds, with few exceptions that can last for more than one tick. So any real-time operating system cannot not fulfill my requirements. That is the reason why other engineers still not consider operating system, micro or nano kernels because the way they work is too far from their needs. I still want to struggle a bit and in my case I now realize I have to consider a new category of operating system that I never heard about (and that may not exist yet). Let's call this category nano-kernel or subtick-scheduler
In such dreamed kernels I would find:
2 types of tasks:
Preemptive tasks (that run in their own memory space)
Non-preemptive tasks (that run in the kernel space and must complete in less than one tick.
Deterministic kernel scheduler (fixed duration after the ISR to reach the theoretical zero second jitter)
Ability to run multiple tasks per tick
For a better understanding of what I am looking for I made this figure below that represents the two types or kernels. The first representation is the traditional kernel. A task executes at each tick and it may interrupt the kernel with a system call that invoke a full context switch.
The second diagram shows a sub-tick kernel scheduler where multiple tasks may share the same tick interrupt. Task 1 was summoned with a maximum execution time value so it needs 2 ticks to complete. Task 2 is set with low priority, so it consumes the remaining time of each tick upon completion. Task 3 is non-preemptive so it operates on the kernel space which save some precious context switch time.
Available operating systems such as RTOS, RTAI, VxWorks or µC/OS are not fully real-time and are not suitable for embedded hard real-time applications such as motion-control where a typical cycle would last no more than 50 to 500 microseconds. By analyzing my needs I land on different topology for my scheduler were multiple tasks can be executed under the same tick interrupt. Obviously I am not the only one with this kind of need and my problem might simply be a kind of X-Y problem. So said differently I am not really looking at what I am really looking for.
After this (pretty) long introduction I can formulate my question:
What could be a good existing architecture or framework that can fulfill my requirements other than a naive bare-metal approach where everything is written sequentially around one master interrupt? If this kind of framework/design pattern exists what would it be called?
Sorry, but first of all, let me say that your entire post is completely wrong and shows complete lack of understanding how preemptive RTOS works.
After lots of readings about operating systems I realized that typical tick-time is 1 to 10 milliseconds and only one task can be executed each tick.
This is completely wrong.
In reality, a tick frequency in RTOS determines only two things:
resolution of timeouts, sleeps and so on,
context switch due to round-robin scheduling (where two or more threads with the same priority are "runnable" at the same time for a long period of time.
During a single tick - which typically lasts 1-10ms, but you can usually configure that to be whatever you like - scheduler can do hundreds or thousands of context switches. Or none. When an event arrives and wakes up a thread with sufficiently high priority, context switch will happen immediately, not with the next tick. An event can be originated by the thread (posting a semaphore, sending a message to another thread, ...), interrupt (posting a semaphore, sending a message to a queue, ...) or by the scheduler (expired timeout or things like that).
There are also RTOSes with no system ticks - these are called "tickless". There you can have resolution of timeouts in the range of nanoseconds.
That is the reason why other engineers still not consider operating system, micro or nano kernels because the way they work is too far from their needs.
Actually this is a reason why these "engineers" should read something instead of pretending to know everything and seeking "innovative" solutions to non-existing problems. This is completely wrong.
The first representation is the traditional kernel. A task executes at each tick and it may interrupt the kernel with a system call that invoke a full context switch.
This is not a feature of a RTOS, but the way you wrote your application - if a high priority task is constantly doing something, then lower priority tasks will NOT get any chance to run. But this is just because you assigned wrong priorities.
Unless you use cooperative RTOS, but if you have such high requirements, why would you do that?
The second diagram shows a sub-tick kernel scheduler where multiple tasks may share the same tick interrupt.
This is exactly how EVERY preemptive RTOS works.
Available operating systems such as RTOS, RTAI, VxWorks or µC/OS are not fully real-time and are not suitable for embedded hard real-time applications such as motion-control where a typical cycle would last no more than 50 to 500 microseconds.
Completely wrong. In every known RTOS it is not a problem to get a response time down to single microseconds (1-3us) with a chip that has clock in the range of 100MHz. So you actually can run "jobs" which are as short as 10us without too much overhead. You can even have "jobs" as short as 10ns, but then the overhead will be pretty high...
What could be a good existing architecture or framework that can fulfill my requirements other than a naive bare-metal approach where everything is written sequentially around one master interrupt? If this kind of framework/design pattern exists what would it be called?
This pattern is called preemptive RTOS. Do note that threads in RTOS are NOT executed in "tick interrupt". They are executed in standard "thread" context, and tick interrupt is only used to switch context of one thread to another.
What you described in your post is a "cooperative" RTOS, which does NOT preempt threads. You use that in systems with extremely limited resources and with low timing requirements. In every other case you use preemptive RTOS, which is capable of handling the events immediately.

Why is response time important in CPU scheduling?

I'm looking for an example of a job for which response time is important.
One definition of response time is:
The time taken in an interactive program from the issuance of a command to the commence of a response to that command.
I've read that response time is important for interactivity, but I can't understand why. If the job isn't fully completed, what output could be produced that would be of interest to a user?
Wouldn't the user only care about how soon a job finishes, as that's the first time any output is produced?
For example, consider these two possible schedulings of two jobs:
Case 1: |---B---|---A---|
Case 2: |-A-|---B---|-A-|
Suppose that job A and B are issued at the same time, A being a command typed in by the user and B being some background process.
The response time for job A as I understand it would be shorter in case 2. As job A finishes (and produces output) at the same time in the two cases, I don't understand how the user benefits (or even notices) the better response time in case 2.
When writing an operating system, one has to take into consideration what will the intended audience be. In some cases it matters most to finish jobs as quickly as possible (supercomputer systems), in some cases it matters most to be as responsive as possible (regular desktop systems), and in some cases it matters most to be as predictable as possible (real-time systems).
For finishing jobs as fast as possible, tasks should be interrupted the rarest possible (so big intervals between task switches are the best option). Here response time doesn't really matter much. It should be noted that task switches usually take some time (thousands of CPU cycles usually) due to having to save the state (including registers and paging structures) of the old task to memory and restore the state (including registers and paging structures) of the new task from memory. This also causes cache and TLB misses, since the cached information doesn't usually belong to the current process.
For being the most responsive possible, tasks should be interrupted as often as possible so the user doesn't experience the so-called lag. This is where response time is important. Note however that on interrupt-driven architectures (like x86) an interrupt from the keyboard or the mouse would automatically pause execution of the current task and call the interrupt handler, which processes the input and sends it to the appropriate program.
For being the most predictable possible, input should be processed neither too fast, neither too slow. This means that response time is constrained from both ways, thus being much more important than in "most responsive possible" designs. A misprediction can even be a fatal failure in mission-critical systems.
In a nutshell, importance of response time varies from design to design and can range from nearly unimportant to critical.
I think I have an answer to my own question. The problem was, I was just thinking about simple processes like ls that once issued runs for some amount of time and then, when they're finished, deliver their first and only output.
However, suppose job A in the example from the question is a program with multiple print statements. Output will in that case be produced before the process is complete (and some of the printouts may well occur during the first scheduled burst). It would thus make sense for interactivity to want to begin running such a process as soon as possible.