How does operating system knows execution time of process - operating-system

I was revisiting Operating Systems CPU job scheduling and suddenly a question popped in my mind, How the hell the OS knows the execution time of process before its execution, I mean in the scheduling algorithms like SJF(shortest job first), how the execution time of process is calculated apriori ?

From Wikipedia:
Another disadvantage of using shortest job next is that the total execution time of a job must be known before execution. While it is not possible to perfectly predict execution time, several methods can be used to estimate the execution time for a job, such as a weighted average of previous execution times.[1]
More on http://en.wikipedia.org/wiki/Shortest_job_next

Also, O.S can compute the total needed time for each task, by means of first calculating its CPI.
(CPI: cycles per instruction)
There is a weighted average CPI for each job.
For example, floating point instructions weigh much more than fixed point instructions, meaning they take more time to perform. So a job dealing with fixed point operations: like add or increment is perceived to be shorter. Hence in a shortest job first, it shall be executed prior to the aforementioned job.

Related

Shortest Process Next Scheduling Algorithm

I am wondering which answer is correct?For answer 1, when P5 is finish executing, then we compared about P3,P6 and P4 ,if we compare them according to the arrival time then P3 will execute first. So,my question is about do we need to follow the arrival time? Which answer is correct? Thanks.
This is the question image
The first answer is correct.
A case could be made that both are correct, but the first is "more correct." The tiebreaker for this scheduling algorithm should be the arrival time. Your goal is to minimize the process wait time, and it makes more sense to first run the process that has been waiting longer.
Algorithms like this will often use a priority queue, which would sort the elements from shortest burst time to longest burst time. Queues use FIFO (first in, first out), which means if there are two elements with the same burst time, the one that was added first would be selected first.

What determines CPU cycle time

I'd like to find out whether there is a relationship between CPU cycle time and pipeline depth. I have always thought that CPU cycle time is entirely determined by CPU frequency (opposite to frequency). This video, however, mentions that with a larger number of pipeline stages the cycle time can be decreased since every cycle we would do less work per stage. So what actually determines the CPU cycle time: the frequency or the number of stages in the pipeline? Or can we say that pipeline depth affects the frequency?
cycle time is literally defined as the inverse of frequency. This is just basic physics: f = 1/t where t is the period. https://en.wikipedia.org/wiki/Frequency#Period_versus_frequency. Frequency has dimensions of 1/seconds.
Saying you can shorten the cycle time by lengthening the pipeline is just another way of stating the same thing as raising the frequency.
(And yes, chopping up one stage into two means you have two shorter critical paths instead of one long one that has to be ready before the end of a cycle to get latched for the next stage, removing that upper limit on cycle time. For a given gate delay propagation time, you can only fit a certain number of boolean operations into one clock cycle, and every stage has to have its output ready in time.)
See also Modern Microprocessors
A 90-Minute Guide!

SJF Scheduling: Selecting Process based on predicted CPU Burst Time

In SJF Algorithm, we predict the next CPU Burst time using the formula:
τ(n+1) = α*t(n) + (1-α)*τ(n). And then we select the process with the shortest predicted burst time.
Now my question is: do we already have an idea about the CPU burst times of the processes arriving?
If yes, then why predict the CPU burst time? We could rather just use the shortest time process for scheduling.
If no i.e., we do not have any idea about the burst times of the processes, how is the predicted burst time τ(n+1) helping us to pick a process?
Hope I am able to explain my confusion.
Thanks.
The answer is in the question itself. The later condition is true we don't have an idea about burst time of incoming processes this is the reason that we are predicting their burst time τ(n+1). Our prediction may not be 100% right all the time but it'll server the purpose of SJF to a very great extent !
I hope you would've coded this and saw the results if not then i recommend to do so, it'll help a lot understanding this.
This is the application I developed, for my teacher, on some scheduling techniques.enter image description here

CPU clock cycle misunderstanding

I can't well understand about CPU clock such as 3.4Ghz. I know this is that 3.4 billions clock cycle per second.
So here if machine use single clock cycle instruction, then It can execute about 3.4 billions instructions per second.
But in pipeline, basically it needs more cycles per instruction, but each cycle length is shorter than single clock cycle.
But although pipeline has more throughput, anyway cpu can do 3.4 billions cycle per second. So, it can execute 3.4 billions/5 instructions(if one instruction needs 5 cycles), which means less than single cycle implementation(3.4 > 3.4/5). What am I missing?
Does CPU clock such as 3.4Ghz just means for based on pipeline cycle, not for based on single cycle implentation?
Pipelining
Pipelining doesn't involve cycles shorter than a single clock cycle. Here's how pipelining works:
We have a complicated task to do. We take that task and break it down into a number of stages, each of which is relatively simple to carry out. We study the amount of work in each stage to make sure each stage takes about the same amount of time as any other.
With a processor, we do roughly the same thing--but in this case, it's not "install these fourteen bolts", it's things like fetching and decoding instructions, reading operands, executing (often a couple of stages here), and writing back results.
Like the automotive production line, we provide each stage of the pipeline with a specialized set of tools for doing exactly (and only) what is needed at that stage. When we finish doing one stage of processing on a car/instruction, it moves along to the next stage, and this stage gets the next car/instruction to process.
In an ideal situation, the process works (roughly) like this:
It took Ford about 12 hours to build one N car (the predecessor to the model T). Thanks primarily to pipelining the production line, it took only about 2 and a half hours to build a Model T. More importantly, even though a model T took 2.5 hours start to finish, that time was broken down into no fewer than 84 discrete steps, so when everything ran smoothly the production line as a whole could produce another car (about) every two minutes.
That didn't always happen though. If one stage ran short of parts, the stages after it had to wait. If the pause lasted very long, it would back things up so the preceding stages had to wait too.
The same can happen in a processor pipeline. For example, when a branch happens, the processor may have to wait a while before the next instruction can be fetched. If an instruction needs an operand from memory, that can lead to a pause (a "pipeline bubble") as well.
To prevent pauses in his pipeline, Henry Ford hired people to study the stages, figure out how many of each kind of part would need to be on hand for each stage, and so on. I don't know for sure, but I think it's a fair guess that there were probably a few people designated to watch the supply of parts at different stations, and send somebody running to let a warehouse manager know if (for whatever reason) the supply of parts for a particular stage looked like it was running short so they'd need more soon.
Processors do a little of the same thing--they have things like branch predictors and prefetchers that attempt to figure out ahead of time what will be needed by the stream of instructions being executed, and trying to ensure that everything is on hand when its needed (with caches, for example, to temporarily store things that seem likely to be needed soon).
So, like the Model T, it takes some relatively long amount of time for each instruction to execute start to finish, but we get another product finished at much shorter intervals--ideally once a clock (but see my other answer--modern designs often execute more than one instruction per clock).
A typical modern CPU can execute a number of unrelated instructions (those that don't depend on the same resources) concurrently.
To do that, it typically ends up with a basic structure vaguely like this:
So, we have an instruction stream coming in on the left. We have three decoders, each of which can decode one instruction each clock cycle (but there may be limitations, so complex instructions all have to pass through one decoder, and the other two decoders can only do simple instructions).
From there, the instructions pass into a reorder buffer, which keeps a "scoreboard" of which resources are used by each instruction, and which resources are affected that instruction (where a "resource" would typically be something like a CPU register or a flag in the flags register).
The circuitry then compares those scoreboards to determine dependencies. For example, if one instruction writes to register 0, and a later one reads from register 0, then those instructions must execute serially. At each clock, it tries to find the N oldest instructions that don't have dependencies for execution.
There are then a number of independent execution units. Each of these is basically a "pure" function--it takes some inputs, carries out a specified transformation on it, and produces an output. This makes it easy to replicate them as needed, and have as many running in parallel as we want/can afford. Those are typically grouped, with one port going to each group. In each clock, we can send one instruction through that port to one of the execution units in that group. Once an instruction arrives at the execution unit, it may take more than one clock to finish execution.
Once those execute, we have a set of retirement units that take the results, and write them back to the registers in execution order. Again we have multiple units so we can retire multiple instructions per clock.
Note: this drawing tries to be semi-realistic about the rough number of decoders, retirement units, and ports that it depicts, but what it shows is a general idea--different CPUs will have more or fewer specific resources. For almost any of them, the number of decoded instructions in the scoreboard units is low though--a realistic number would be more like 50 instructions.
In any case, actual execution of instructions is one of the hardest parts of this to measure or reason about. The number of ports gives us a hard upper limit on the number of instructions that can start executing in any given clock. The number of decoders and retirement units give an upper limit on the number of instructions that can be started/finished per clock. The execution itself...well, there are a lot of execution units, and each one (at least potentially) takes a different number of clocks to execute an instruction.
With the design as shown above, you'd have a hard upper limit of three instructions per clock. That's the most you can decode or retire. With a different design, that could obviously go up or down (e.g., with 4 decoders, 4 ports and 4 retirement units, the upper limit could go up to 4).
Realistically, with that design you wouldn't normally expect to see three instructions execute in most clock cycles. There are enough dependencies between instructions that you'd probably expect closer to 2 as a long term average (and much more likely a little less than 2). Increasing the available resources (more decoders, more retirement units, etc.) will rarely help that a whole lot--you might get to an average of three instructions per clock, but hoping for four is probably unrealistic.
As others have noted the full details of how a modern CPU operates are complicated. But part of your question has a simple answer:
Does CPU clock such as 3.4Ghz just means for based on pipeline cycle,
not for based on single cycle implentation?
The clock frequency of a CPU refers to how many times per second the clock signal switches. The clock signal is not divided into smaller pipelined segments. The purpose of pipelining is to allow for faster clock switching speeds. So 3.4GHz refers to the number of times per second that a single pipeline stage can perform whatever work it needs to do when executing an instruction. The total work for executing an instruction is done over multiple cycles each of which could be in a different pipeline stage.
Your question also shows a some misconceptions about how pipelining works:
But although pipeline has more throughput, anyway cpu can do 3.4
billions cycle per second. So, it can execute 3.4 billions/5
instructions(if one instruction needs 5 cycles), which means less than
single cycle implementation(3.4 > 3.4/5). What am I missing?
In the simple case the throughput of a single cycle CPU and a pipelined CPU is the same. The latency of the pipelined CPU is higher because it requires more cycles (i.e. 5 in your example) to execute a single instruction. But after the pipeline is full the throughput could be the same as for a single cycle non-pipelined CPU. So in the simple case using your example a single-cycle CPU could execute 3.4 billion instructions in 1 seconds, while the pipelined CPU with 5 stages could execute 3.4 billion minus 5 instructions in 1 second. Subtracting 5 from 3.4 billion is a negligible difference, whereas dividing by 5 would be a very significant difference.
A couple of other things to note are that the simple case I described isn't really true because of dependencies between instructions that require pipeline stalls. And most modern CPUs can execute more than one instructions per cycle.

Why does Matlab run faster after a script is "warmed up"?

I have noticed that the first time I run a script, it takes considerably more time than the second and third time1. The "warm-up" is mentioned in this question without an explanation.
Why does the code run faster after it is "warmed up"?
I don't clear all between calls2, but the input parameters change for every function call. Does anyone know why this is?
1. I have my license locally, so it's not a problem related to license checking.
2. Actually, the behavior doesn't change if I clear all.
One reason why it would run faster after the first time is that many things are initialized once, and their results are cached and reused the next time. For example in the M-side, variables can be defined as persistent in functions that can be locked. This can also occur on the MEX-side of things.
In addition many dependencies are loaded after the first time and remain so in memory to be re-used. This include M-functions, OOP classes, Java classes, MEX-functions, and so on. This applies to both builtin and user-defined ones.
For example issue the following command before and after running your script for the first run, then compare:
[M,X,C] = inmem('-completenames')
Note that clear all does not necessarily clear all of the above, not to mention locked functions...
Finally let us not forget the role of the accelerator. Instead of interpreting the M-code every time a function is invoked, it gets compiled into machine code instructions during runtime. JIT compilation occurs only for the first invocation, so ideally the efficiency of running object code the following times will overcome the overhead of re-interpreting the program every time it runs.
Matlab is interpreted. If you don't warm up the code, you will be losing a lot of time due to interpretation instead of the actual algorithm. This can skew results of timings significantly.
Running the code at least once will enable Matlab to actually compile appropriate code segments.
Besides Matlab-specific reasons like JIT-compilation, modern CPUs have large caches, branch predictors, and so on. Warming these up is an issue for benchmarking even in assembly language.
Also, more importantly, modern CPUs usually idle at low clock speed, and only jump to full speed after several milliseconds of sustained load.
Intel's Turbo feature gets even more funky: when power and thermal limits allow, the CPU can run faster than its sustainable max frequency. So the first ~20 seconds to 1 minute of your benchmark may run faster than the rest of it, if you aren't careful to control for these factors.
Another issue not mensioned by Amro and Marc is memory (pre)allocation.
If your script does not pre-allocate its memory it's first run would be very slow due to memory allocation. Once it completed its first iteration all memory is allocated, so consecutive invokations of the script would be more efficient.
An illustrative example
for ii = 1:1000
vec(ii) = ii; %// vec grows inside loop the first time this code is executed only
end