Why don't operating systems schedule a new task whenever an unknown branch occurs? - operating-system

This is a purely conceptual question. Why don't OSes switch tasks whenever a branch that has never been taken occurs? Dynamic branch prediction only works with branches that have been taken in the past, and static branch prediction is only correct in certain scenarios. If you have no data on a branch, it seems like the OS and the processor should start putting a separate task into the pipeline rather than blindly guessing the branch. Then you can compute the result of the branch, and execute that branch when the original task is scheduled again. The next time the branch is encountered, the processor can use dynamic prediction.
Is there a reason this method isn't used? Or is it used and I'm just unaware?

The overhead of context switching is far too high compared to just executing the branch.

This is why SMT (simultaneous multithreading) was invented: while one thread fumbles around with its branch mispredict, the other threads on the same core can advance.
The new Power processor even has 8 hardware threads on each core to ensure maximum throughput in the face of branch mispredictions, instruction cache misses, data cache misses, and data dependencies.
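To put rough numbers on the overhead argument, here is a back-of-the-envelope sketch. All three constants are illustrative assumptions (order-of-magnitude figures, not measurements of any particular CPU or OS), so only the rough ratio matters.

```python
# Illustrative, assumed figures; real values vary by CPU, OS and workload.
CLOCK_HZ = 3_000_000_000          # assumed 3 GHz core
MISPREDICT_PENALTY_CYCLES = 20    # assumed pipeline refill cost after a wrong guess
CONTEXT_SWITCH_SECONDS = 2e-6     # assumed direct cost of one OS context switch


def context_switch_cycles(clock_hz: float, switch_seconds: float) -> int:
    """Convert a context-switch time into core clock cycles."""
    return round(clock_hz * switch_seconds)


def expected_guess_cost(penalty_cycles: int, mispredict_rate: float) -> float:
    """Average cycles lost per branch when the CPU simply guesses."""
    return penalty_cycles * mispredict_rate


switch_cost = context_switch_cycles(CLOCK_HZ, CONTEXT_SWITCH_SECONDS)
# Even a coin-flip guess on a never-seen branch loses only half the penalty on average.
guess_cost = expected_guess_cost(MISPREDICT_PENALTY_CYCLES, 0.5)
print(switch_cost, guess_cost, switch_cost / guess_cost)
```

Under these assumptions, a blind guess costs hundreds of times less than handing the core to another task, which is why the switching is left to hardware threads (SMT) rather than the OS scheduler.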

How to configure channels and AMQ for spring-batch-integration where all steps are run as slaves on another cluster member

Followup to Configuration of MessageChannelPartitionHandler for assortment of remote steps
Even though the first question was answered (I think well), I think I'm confused enough that I'm not able to ask the right questions. Please direct me if you can.
Here is a sketch of the architecture I am attempting to build. Today, we have a job that runs a step across the cluster that works. We want to extend the architecture to run n (unbounded and different) jobs with n (unbounded and different) remote steps across the cluster.
I am not confusing job executions and job instances with jobs. We already run multiple job instances across the cluster. We need to be able to run other processes that are scalable in the same way as the one we have defined already.
The source data is all coming from databases that are known to the steps. The partitioner is defining the range of data for the "where" clause in the source database and putting that in the stepContext. All of the actual work happens in the stepContext. The jobContext simply serves to spawn steps, wait for completion, and provide the control API.
There will be 0 to n jobs running concurrently, with 0 to n steps from however many jobs running on the slave VMs concurrently.
Does each master job (or step?) require its own request and reply channel, and by extension its own OutboundChannelAdapter? Or are the request and reply channels shared?
Does each master job (or step?) require its own aggregator? By implication this means each job (or step) will have its own partition handler (which may be supported by the existing codebase)
The StepLocator on the slave appears to require a single shared replyChannel across all steps, but it appears to me that the MessageChannelPartitionHandler requires a separate reply channel per step.
What I think is unclear (but I can't tell since it's unclear) is how the single reply channel is picked up by the aggregatedReplyChannel and then returned to the correct step. Of course I could be so lost I'm asking the wrong questions.
Thank you in advance
Does each master job (or step?) require its own request and reply channel, and by extension its own OutboundChannelAdapter? Or are the request and reply channels shared?
No, there is no need for that. StepExecutionRequests are identified with a correlation ID that makes it possible to distinguish them.
Does each master job (or step?) require its own aggregator? By implication this means each job (or step) will have its own partition handler (which may be supported by the existing codebase)
That should not be the case, as requests are uniquely identified with a correlation ID (similar to the previous point).
The StepLocator on the slave appears to require a single shared replyChannel across all steps, but it appears to me that the MessageChannelPartitionHandler requires a separate reply channel per step.
The MessageChannelPartitionHandler should be step or job scoped, as mentioned in the Javadoc (see recommendation in the last note). As a side note, there was an issue with message crossing in a previous version due to the reply channel being instance based, but it was fixed here.
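To illustrate why a correlation ID lets several jobs share one reply channel, here is a minimal conceptual sketch in Python. It is not Spring Integration or Spring Batch code: `ReplyAggregator`, `register`, `on_reply`, and the ID strings are hypothetical names invented for this illustration.

```python
from collections import defaultdict


class ReplyAggregator:
    """Groups replies from one shared channel by correlation ID (hypothetical helper)."""

    def __init__(self):
        self.expected = {}                 # correlation_id -> number of partitions sent
        self.replies = defaultdict(list)   # correlation_id -> replies received so far

    def register(self, correlation_id, partition_count):
        self.expected[correlation_id] = partition_count

    def on_reply(self, correlation_id, payload):
        """Return the complete group once every partition has replied, else None."""
        self.replies[correlation_id].append(payload)
        if len(self.replies[correlation_id]) == self.expected[correlation_id]:
            return self.replies.pop(correlation_id)
        return None


agg = ReplyAggregator()
agg.register("jobA-step1", 2)
agg.register("jobB-step1", 1)

# Replies from different jobs interleave on the same shared reply channel.
first = agg.on_reply("jobA-step1", "A/partition0 done")
second = agg.on_reply("jobB-step1", "B/partition0 done")
third = agg.on_reply("jobA-step1", "A/partition1 done")
```

Each group is released only to the master step that registered its ID, so interleaved replies from other jobs never end up in the wrong group.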

What happens with nested branches and speculative execution?

Alright, so I know that if a particular conditional branch has a condition that takes time to compute (memory access, for instance), the CPU assumes a condition result and speculatively executes along that path. However, what would happen if, along that path, yet another slow conditional branch pops up (assuming, of course, that the first condition hasn't been resolved yet and the CPU can't just commit the changes)? Does the CPU just speculate inside the speculation? What happens if the last condition is mispredicted but the first wasn't? Does it just roll back all the way?
I'm talking about something like this:
if (value_in_memory == y) {
    // computations
    if (another_val_memory == x) {
        // computations
    }
}
Speculative execution is the regular state of execution, not a special mode that an out of order CPU enters when it sees a branch and then leaves when the branch is no longer in flight.
This is easier to see if you consider that it's not just branches that can go wrong: many instructions can fault, including those that access memory or have restrictions on their input values. So any substantial out-of-order execution implies constant speculation, and CPUs are built around that idea.
So "nested branches" doesn't end up being special in that sense.
Now, modern CPUs have a variety of methods for quick branch misprediction recovery, faster than recovery from other types of faults [1]. For example, they may snapshot the state of the register mapping at some branches, to allow recovery to start before the branch is at the head of the reorder buffer. Since it is not always feasible to snapshot at all branches, there might be complicated heuristics involved to decide where to take snapshots.
I mention this last part because it is one way in which nested branches might matter: when there are lots of branches in flight, you might hit some microarchitectural limits related to the tracking of these branches for recovery purposes. For more details, you can look through patents for "branch order buffer" (for Intel techniques, but there are no doubt others).
[1] The basic recovery method is to keep executing until the faulting instruction is the next to retire, and then throw away all younger instructions. In the context of branch mispredictions, this means you could actually suffer two or more mispredictions, only the oldest of which actually takes effect: e.g., a younger branch mispredicts, and while executing up to that branch (at which point recovery can occur), an older branch also turns out to be mispredicted, so the younger one ends up getting discarded.
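The basic retire-time recovery in the footnote can be sketched as a toy model. This is a simplification under stated assumptions, not how any real core is implemented: `InFlight` and `recover_at_retirement` are hypothetical names, and the window is just a list of in-flight instructions in program order.

```python
from dataclasses import dataclass


@dataclass
class InFlight:
    name: str
    mispredicted: bool = False


def recover_at_retirement(window):
    """Retire in program order; flush everything younger than the first bad branch.

    Returns (retired, recovery_branch, discarded).
    """
    retired = []
    for i, insn in enumerate(window):
        retired.append(insn.name)  # the mispredicted branch itself still retires
        if insn.mispredicted:
            discarded = [younger.name for younger in window[i + 1:]]
            return retired, insn.name, discarded
    return retired, None, []


# Oldest first. Both branches were mispredicted, but b2 sits on b1's wrong path.
window = [
    InFlight("load"),
    InFlight("b1", mispredicted=True),
    InFlight("add"),
    InFlight("b2", mispredicted=True),
    InFlight("store"),
]
retired, recovery_branch, discarded = recover_at_retirement(window)
```

Here only `b1` triggers recovery. The misprediction of `b2` never takes effect, because `b2` is thrown away along with everything else younger than `b1`.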
(Maybe not a complete answer, but I had some of this written when @BeeOnRope posted an answer. Posting this anyway for some more links and technical details in case anyone's curious.)
Everything is always speculative until it reaches retirement and becomes non-speculative, definitely happened, part of the architectural state.
For example, any load might fault with a bad address, and any div might trap on divide by zero. See also "Out-of-order execution vs. speculative execution". That question and "What exactly happens when a skylake CPU mispredicts a branch?" mention that branch mispredicts are handled specially, because they're expected to be frequent. Fast recovery can start before a mispredicted branch reaches retirement, unlike the behaviour for a faulting load, for example. (That's part of why Meltdown is exploitable.)
So even "regular" instructions are executed speculatively before being committed, and the only distinction between them is a human-made distinction, not computer-made? I presume, then, that the CPU stores multiple, possible rollback points? For instance if I have load instructions that may lead to page faults or simply use stale values, inside a conditional branch, the CPU identifies such instructions and scenarios and saves a state for each of them? I feel like I misunderstood because this may lead to a lot of storing register states and complicated dependencies.
The retirement state is always consistent, so you can always roll back to there and discard all in-flight work. For example, if an external interrupt arrives, you want to handle it without waiting for a chain of a dozen cache-miss loads to all execute. (See "When an interrupt occurs, what happens to instructions in the pipeline?")
This tracking basically happens for free or is something you need to do anyway to be able to detect which instruction faulted, not just that there was a problem somewhere. (This is called "precise exceptions")
The real distinction humans can usefully make is speculation that has a real chance of being wrong during execution of non-error cases. If your code gets a bad pointer, it doesn't really matter how it performs; it's going to page-fault and that's going to be very slow compared to local OoO exec details.
You're talking about a modern out-of-order (OoO) execution (not just fetch) CPU, like modern Intel or AMD x86, high-end ARM, MIPS r10000, etc.
The front-end is in-order (with speculation down predicted paths), and so is commit (aka retirement) from the out-of-order back-end into non-speculative retirement state (aka known-good architectural state).
The CPU uses two major structures to track instructions (or on x86, uops = parts of instructions) in the back-end. The last stage of the front-end (after fetch / decode) allocates/renames instructions and adds them into both of these structures at once.
RS = Reservation Station = scheduler: not-yet-executed instructions, waiting for an execution unit. The RS tracks dependencies and sends the oldest-ready uops to execution units that are ready.
ROB = ReOrder Buffer: not-yet-retired instructions. Instructions enter and leave in-order so it can just be a circular buffer.
Each ROB entry includes a flag to mark it as executed or not, set once the RS has sent it to an execution unit which reports success. The oldest instructions in the ROB that all have their done-executing bit set can "retire".
Each entry also includes a flag which indicates "fault if this reaches retirement". This avoids spending time handling page faults from load instructions on the wrong path of execution (that might well have pointers into an unmapped page), for example. That can happen either in the shadow of a branch mispredict, or just after another instruction (in program order) that should have faulted first but OoO exec got to it later.
(I'm also leaving out register-renaming onto a large physical register file. That's the "rename" part. Allocate includes choosing which execution port an instruction will use, and reserving a load or store buffer entry for memory instructions.)
(There's also a store-buffer; stores don't write directly to L1d cache, they write to the store buffer. This makes it possible to speculatively execute stores and still roll back without them becoming visible to other cores. It also decouples cache-miss stores from execution. Once a store instruction retires, the store-buffer entry "graduates" and is eligible to commit to L1d cache, once MESI gets exclusive access to the cache line, and once memory-ordering rules are satisfied.)
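The ROB behaviour described above (in-order allocation, a done-executing flag, and a fault flag that is only acted on at retirement) can be sketched as a toy model. The class and method names are hypothetical, and the sketch deliberately ignores the RS, register renaming, and the store buffer.

```python
from collections import deque


class ReorderBuffer:
    """Toy ROB: entries are allocated and retired strictly in program order."""

    def __init__(self):
        self.entries = deque()

    def allocate(self, name):
        entry = {"name": name, "done": False, "fault": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, fault=False):
        """Called when an execution unit finishes; a fault is only recorded here."""
        entry["done"] = True
        entry["fault"] = fault

    def retire(self):
        """Retire the oldest finished entries; stop and flush at a recorded fault."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            if entry["fault"]:
                self.entries.clear()  # discard all younger in-flight work
                return retired, entry["name"]
            retired.append(entry["name"])
        return retired, None


rob = ReorderBuffer()
a, b, c = rob.allocate("a"), rob.allocate("b"), rob.allocate("c")

rob.complete(c)                 # youngest finishes first (out of order)
step1 = rob.retire()            # nothing retires: "a" is still executing
rob.complete(a)
rob.complete(b, fault=True)     # fault is noted, not acted on yet
step2 = rob.retire()            # "a" retires, then "b" raises its fault
```

Note that `c` finished executing first but never retires: it is younger than the faulting `b`, so it is discarded when `b` reaches the head of the buffer.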
Execution units detect whether an instruction should fault, or was mis-speculated and should roll back, but don't necessarily act on that until the instruction reaches retirement.
In-order retirement is the step that recovers program order after OoO exec, including in the case of exceptions or mis-speculation.
Terminology: Intel calls it "issue" when instructions are sent from the front-end into the ROB + RS. Other computer architecture people often call that "dispatch".
Sending uops from the RS to execution units is called "dispatch" by Intel, "issue" by other people.

How far does the branch prediction go?

I understand there is a branch predictor in modern CPU designs that tries to guess which way a branch will go.
Assume there is a jump instruction that will transfer control flow to either basic block A or basic block B.
If the predictor decides to go to A, and the actual calculation at the jump instruction later finds that B was the right choice instead of A, how far has the execution of basic block A gone by that time?
Have all the instructions in basic block A been executed? Or just the first instruction?
How can we find out the actual result and know more about the branch prediction strategies?
The CPU assumes branch prediction was correct and continues unless/until it discovers it wasn't. (HW can't detect "basic blocks": it doesn't know when it reaches an address that some other instruction branches to. And you wouldn't want to stop anyway. Modern branch prediction good enough to be usable in an out-of-order CPU is usually correct like 95 to 99% of the time.)
Discovering a mispredict (or confirming a correct prediction) happens when the branch instruction itself is decoded (unconditional direct branch) or executed (conditional and/or indirect).
In case of a mispredict, the CPU has to re-steer the front-end (fetch/decode) to the correct path. On an in-order CPU, no instructions after a branch can execute until the branch itself executes, so it's always just a matter of restarting fetch/decode.
(In-order superscalar could actually execute an instruction after a branch, but an in-order pipeline makes it relatively easy to squash before it reaches write-back and actually changes architectural state. A store is probably the trickiest because you need to discard that store-buffer entry; its visible effect would be on memory, not write-back to registers. But anyway, that along with decoupling execution from cache-miss stores, and other reasons, is why even in-order pipelined CPUs have store buffers.)
Or for an out-of-order CPU with speculative execution that allows instructions from the wrong path to actually execute while a conditional or indirect branch is waiting for its input, it has to flush the back-end and restart issue from the correct path of execution.
(With fast-recovery and a branch-order-buffer, this can happen even if some of the instructions before the branch haven't finished executing yet. e.g. in a loop with a simple loop condition and a longer dependency chain in the loop body, so execution of the loop-condition dependency chain can run ahead and discover a mispredict in the last iteration when the loop branch falls through, without waiting until that instruction is ready to retire. i.e. without waiting for the loop body to execute that far.)
Multiple branches can be in flight at once. And of course any load or store can fault. An OoO exec CPU basically treats everything as speculative until retirement.
How can we ... know more about the branch prediction strategies?
https://danluu.com/branch-prediction/ is pretty good. See also the first chapter of Agner Fog's microarch guide (for x86) where he covers what real Intel and AMD CPUs do, as well as some background. https://agner.org/optimize/

Why do longer pipelines make a single delay slot insufficient?

I read the following statement in Patterson & Hennessy's Computer Organization and Design textbook:
As processors go to both longer pipelines and issuing multiple instructions per clock cycle, the branch delay becomes longer, and a single delay slot is insufficient.
I can understand why "issuing multiple instructions per clock cycle" can make a single delay slot insufficient, but I don't know why "longer pipelines" cause it.
Also, I do not understand why longer pipelines cause the branch delay to become longer. Even with longer pipelines (step to finish one instruction), there's no guarantee that the cycle will increase, so why will the branch delay increase?
If you add any stages before the stage that detects branches (and evaluates taken/not-taken for conditional branches), 1 delay slot no longer hides the "latency" between the branch entering the first stage of the pipeline and the correct program-counter address after the branch being known.
The first fetch stage needs info from later in the pipeline to know what to fetch next, because it doesn't itself detect branches. For example, in superscalar CPUs with branch prediction, they need to predict which block of instructions to fetch next, separately and earlier from predicting which way a branch goes after it's already decoded.
1 delay slot is only sufficient in MIPS I because branch conditions are evaluated in the first half of a clock cycle in EX, in time to forward to the 2nd half of IF which doesn't need a fetch address until then. (Original MIPS is a classic 5-stage RISC: IF ID EX MEM WB.) See Wikipedia's article on the classic RISC pipeline for many more details, specifically the control hazards section.
That's why MIPS is limited to simple conditions like beq (find any mismatches from an XOR), or bltz (sign bit check). It cannot do anything that requires an adder for carry propagation (so a general blt between two registers is only a pseudo-instruction).
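To make the "simple conditions" point concrete, here is a small Python model of the two checks named above. It only shows that `beq` and `bltz` need bitwise logic rather than a carry-propagating adder. The helper names are hypothetical and registers are modelled as 32-bit unsigned integers.

```python
MASK32 = 0xFFFFFFFF


def beq_taken(rs: int, rt: int) -> bool:
    """beq: taken when the XOR of the two 32-bit registers has no set bits."""
    return ((rs ^ rt) & MASK32) == 0


def bltz_taken(rs: int) -> bool:
    """bltz: taken when the sign bit (bit 31) of the register is set."""
    return ((rs & MASK32) >> 31) == 1


def to_reg(value: int) -> int:
    """Encode a Python int as a 32-bit two's-complement register value."""
    return value & MASK32
```

A general less-than between two registers would instead need a subtraction, which is the carry-propagation step that does not fit in the half cycle.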
This is very restrictive: a longer front-end can absorb the latency from a larger/more associative L1 instruction cache that takes more than half a cycle to respond on a hit. (MIPS I decode is very simple, though, with the instruction format intentionally designed so machine-code bits can be wired directly as internal control signals. So you can maybe make decode the "half cycle" stage, with fetch getting 1 full cycle, but even 1 cycle is still low with shorter cycle times at higher clock speeds.)
Raising the clock speed might require adding another fetch stage. Decode does have to detect data hazards and set up bypass forwarding; original MIPS kept that simpler by not detecting load-use hazards. Instead, software had to respect a load-delay slot until MIPS II. A superscalar CPU has many more possible hazards, even with 1-cycle ALU latency, so detecting what has to forward to what requires more complex logic for matching destination registers in old instructions against sources in younger instructions.
A superscalar pipeline might even want some buffering in instruction fetch to avoid bubbles. A multi-ported register file might be slightly slower to read, maybe requiring an extra decode pipeline stage, although probably that can still be done in 1 cycle.
So, as well as making 1 branch delay slot insufficient by the very nature of superscalar execution, a longer pipeline also increases branch latency, if the extra stages are between fetch and branch resolution. e.g. an extra fetch stage and a 2-wide pipeline could have 4 instructions in flight after a branch instead of 1.
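The arithmetic behind that last example can be written out as a tiny model. It is a deliberate simplification: it assumes every stage between fetch and branch resolution holds a full group of instructions, and the function name and parameters are made up for this sketch.

```python
def instructions_after_branch(stages_before_resolution: int, width: int) -> int:
    """Instructions fetched behind a branch before its outcome is known.

    Simplified model: each stage between fetch and branch resolution holds
    `width` instructions, and the slot next to the branch itself is ignored.
    """
    return stages_before_resolution * width


classic_mips = instructions_after_branch(stages_before_resolution=1, width=1)
longer_and_wider = instructions_after_branch(stages_before_resolution=2, width=2)
```

With one stage and a scalar pipeline the result is the single delay slot of MIPS I. Adding one fetch stage and going 2-wide gives the 4 instructions mentioned above.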
But instead of introducing more branch delay slots to hide this branch delay, the actual solution is branch prediction. (However some DSPs or high performance microcontrollers do have 2 or even 3 branch delay slots.)
Branch-delay slots complicate exception handling; you need a fault-return and a next-after-that address, in case the fault was in a delay slot of a taken branch.

Jobs with dependencies in a queue (pub-sub) in distributed systems?

How should I approach a problem where jobs are put in a queue (pub-sub) in a distributed system, and there are dependencies between them?
For example, the current state of the queue:
j3 -> j2 -> j1
rear        front
j3 depends on the completion of j1.
The queue processor is consuming these jobs and has started processing them in a distributed environment.
Based on some dependency resolution mechanism, the dependency between j1 and j3 was discovered.
Now, what I don't know is the best way to deal with this situation:
Should I put j3 back in the queue and pick it up again at a later stage, so that j1 will have completed by that time?
Should I have some other mechanism, such as a database, to check whether all of j3's dependencies have been met, and only then process j3?
Any help would be appreciated.
Thanks!
Having a job scheduler that's aware that these jobs are at the front of the queue, but are waiting on some dependencies, is the best way. That way, you can get other jobs done while waiting for the dependencies to finish, but still process them as much in order as possible.
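A dependency-aware pick from the queue might look like the following minimal sketch. It is a single-process illustration only: `next_runnable` is a hypothetical helper, `j4` is an extra job added for the example, and a real distributed system would read the `finished` set from shared state.

```python
from collections import deque


def next_runnable(queue, finished, depends_on):
    """Pop the job closest to the front whose dependencies have all finished."""
    for job in queue:
        if depends_on.get(job, set()) <= finished:
            queue.remove(job)   # safe: we return immediately after mutating
            return job
    return None  # everything still queued is blocked


queue = deque(["j1", "j2", "j3", "j4"])   # front of the queue first
depends_on = {"j3": {"j1"}}
finished = set()

a = next_runnable(queue, finished, depends_on)   # j1 starts running
b = next_runnable(queue, finished, depends_on)   # j2 starts running
c = next_runnable(queue, finished, depends_on)   # j3 is blocked, so j4 goes first
finished.add("j1")                               # j1 reports completion
d = next_runnable(queue, finished, depends_on)   # now j3 can run
```

The blocked job keeps its place near the front, so it runs as soon as its dependency completes instead of waiting for another full pass through the queue.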
Pushing items to the back of the queue is a good workaround if it's relatively cheap to do so, if the queue is relatively short, and if there are few dependencies. If the item you push to the back is also a dependency of other tasks, they too need to be pushed to the back of the queue when they arrive at the front (or at once, but that's unnecessarily hard). If the queue is long, you could see unexpected delays. For example, if the queue is a day long, you could end up waiting days for a task to finish. If that task is part of a chain of dependencies, the problem grows.
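The requeue workaround, and the way its cost grows with a chain of dependencies, can be sketched as follows. This is a toy single-process simulation under two assumptions: each job finishes as soon as it runs, and the dependency graph has no cycles or missing jobs (otherwise the loop would never end). The function name is hypothetical.

```python
from collections import deque


def drain_with_requeue(jobs, depends_on):
    """Run jobs from the front; push blocked jobs to the back.

    Returns (execution order, number of requeues).
    """
    queue = deque(jobs)
    finished, order, requeues = set(), [], 0
    while queue:
        job = queue.popleft()
        if depends_on.get(job, set()) <= finished:
            finished.add(job)
            order.append(job)
        else:
            queue.append(job)   # not ready yet: go to the back and wait a full pass
            requeues += 1
    return order, requeues


# Hypothetical chain: j1 depends on j2, which depends on j3 (front of queue first).
order, requeues = drain_with_requeue(
    ["j1", "j2", "j3"], {"j1": {"j2"}, "j2": {"j3"}}
)
```

In this example three jobs need three requeues, and `j1` travels through the queue twice before it runs. With a long queue, each of those trips is a full queue length of waiting.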
Either way, you're going to need to know whether a task is queued, running, or finished. You could store this information in your favourite database, use some gossip protocol, or whatever you like. If executing the same job twice is not a correctness problem, you can use an AP system (in the CAP sense, with eventual consistency, such as a gossip protocol). If running the same task twice is going to mess things up badly, you'll need some consensus mechanism, like a single source of truth, such as your favourite SQL database or maybe Couchbase.