What is the speedup? Can't understand the solution

I'm going through a Computer Architecture MOOC in my own time. There is a problem I can't solve; the solution is provided, but I can't understand it. Can someone help me out? Here are the problem and its solution:
Consider an unpipelined processor. Assume that it has a 1 ns clock cycle
and that it uses 4 cycles for ALU operations, 5 cycles for branches,
and 4 cycles for memory operations. Assume that the relative
frequencies of these operations are 50%, 35%, and 15%, respectively.
Suppose that, due to clock skew and setup, pipelining the processor
adds 0.15 ns of overhead to the clock. Ignoring any latency impact,
how much speedup in the instruction execution rate will we gain from
a pipeline?
Solution
The average instruction execution time on the unpipelined processor is
Clock cycle * Average CPI = 1 ns * ((0.5 * 4) + (0.35 * 5) + (0.15 * 4)) = 4.35 ns
The average instruction execution time on the pipelined processor is
1 ns + 0.15 ns = 1.15 ns
So speedup = 4.35 / 1.15 = 3.78
My question:
Where is the 0.15 coming from in the average instruction execution time on a pipelined processor? Can anyone explain?
Any help is really appreciated.

As the question says, those 0.15 ns are due to clock skew and pipeline setup.
Forget about the pipeline setup and imagine that all of the 0.15 ns come from clock skew.

I think the solution implies that the CPI (cycles per instruction) is one (without the overhead), i.e., one instruction per 1 ns clock cycle, which I'm assuming is the CPU's running clock (1 GHz).
However, I don't see anywhere that the CPI is clearly identified as one.
Did I misunderstand anything here?
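
A quick way to check the arithmetic is to compute both averages directly. Here is a minimal Python sketch (assuming, as the solution does, an ideal CPI of 1 on the pipelined processor):

    # Unpipelined: average time = clock cycle * average CPI
    clock = 1.0  # ns
    freqs_cycles = [(0.50, 4),  # ALU operations
                    (0.35, 5),  # branches
                    (0.15, 4)]  # memory operations
    unpipelined = clock * sum(f * c for f, c in freqs_cycles)  # 4.35 ns

    # Pipelined: ideal CPI of 1, plus 0.15 ns of skew/setup overhead per cycle
    pipelined = clock * 1 + 0.15  # 1.15 ns

    print(unpipelined / pipelined)  # ~3.78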

How to calculate cache miss rate

I'm trying to answer a computer architecture past-paper question (NOT homework).
My question is how to calculate the miss rate (the complete question asks for the average memory access time). The complete question is:
For a given application, 30% of the instructions require memory access. The miss rate is 3%. An instruction can be executed in 1 clock cycle. The L1 cache access time is approximately 3 clock cycles, while the L1 miss penalty is 72 clock cycles. Calculate the average memory access time.
Needed equations:
Average memory access time = Hit time + Miss rate x Miss penalty
Miss rate = number of misses / total number of accesses (this was found on Stack Overflow)
As I mentioned above, I found how to calculate the miss rate on Stack Overflow (I checked that question, but it does not answer mine). The problem is that I cannot see how to find the miss rate from the values given in the question.
What I have done up to now:
Average memory access time = 30% * (1 + 3% * 72) + 100% * (1 + M * 72)
M - miss rate
What I need to find is M. (If I am correct up to now; if not, please tell me what I've messed up.)
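
No answer is attached to this question above, but as a sanity check, here is a minimal sketch of how the AMAT formula evaluates if the stated 3% is read directly as the L1 miss rate (an assumption; the question never labels it explicitly):

    # AMAT = Hit time + Miss rate * Miss penalty (all in clock cycles)
    hit_time = 3         # L1 access time, cycles
    miss_rate = 0.03     # assumption: the stated 3% is the L1 miss rate
    miss_penalty = 72    # cycles

    amat = hit_time + miss_rate * miss_penalty
    print(amat)  # 5.16 cycles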

Operating system - Time Slice

Suppose a multiprogramming operating system allocated time slices of 10 milliseconds and the machine executed an average of five instructions per nanosecond.
How many instructions could be executed in a single time slice?
Please help me: how do I do this?
This sort of question is about cancelling units out after finding the respective ratios.
There are 1,000,000 nanoseconds (ns) per 1 millisecond (ms) so we can write the ratio as (1,000,000ns / 1ms).
There are 5 instructions (ins) per 1 nanosecond (ns) so we can write the ratio as (5ins / 1ns).
We know that the program runs for 10ms.
Then we can write the equation like so such that the units cancel out:
instructions = (10ms) * (1,000,000ns/1ms) * (5ins/1ns)
instructions = (10 * 1,000,000ns)/1 * (5ins/1ns) -- milliseconds cancel
instructions = (10,000,000 * 5ins)/1 -- nanoseconds cancel
instructions = 50,000,000ins -- left with instructions
We can reason that this is at least the 'right kind' of ratio setup, even if the ratios themselves were wrong, because the units we are left with are instructions, which matches the type of unit expected in the answer.
In the above I started with the 1,000,000ns/1ms ratio, but I could also have used 1,000,000,000ns/1,000ms (= 1 second / 1 second) and ended up with the same result.
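
For completeness, here is the same cancellation as a small Python sketch:

    NS_PER_MS = 1_000_000        # 1,000,000 ns per 1 ms
    INSTRUCTIONS_PER_NS = 5      # five instructions per nanosecond
    time_slice_ms = 10           # one 10 ms time slice

    instructions = time_slice_ms * NS_PER_MS * INSTRUCTIONS_PER_NS
    print(instructions)  # 50,000,000 instructions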

Average memory access time

I would like to know whether I solved the problem below correctly:
Find the average memory access time for a processor with a 3 ns clock cycle time, a miss penalty of 40 clock cycles, a miss rate of 0.08 misses per instruction, and a cache access time of 1 clock cycle.
AMAT = Hit Time + Miss Rate * Miss Penalty
Hit Time = 3ns, Miss Penalty = 40ns, Miss Rate = 0.08
AMAT = 3 + 0.08 * 40 = 6.2ns
Check the "Miss Penalty". Be more careful to avoid trivial mistakes.
The question that you tried to answer cannot actually be answered, since you are given 0.08 misses per instruction but you don't know the average number of memory accesses per instruction. In an extreme case, if only 8 percent of instructions accessed memory, then every memory access would be a miss.
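
To make the first answer's hint concrete: the miss penalty is given in clock cycles, so in nanoseconds it is 40 * 3 ns = 120 ns, not 40 ns. A minimal sketch of the corrected arithmetic (still treating 0.08 as a per-access miss rate, which the second answer points out is itself questionable):

    cycle_time = 3.0                 # ns per clock cycle
    hit_time = 1 * cycle_time        # cache access time: 1 cycle = 3 ns
    miss_penalty = 40 * cycle_time   # 40 cycles = 120 ns, not 40 ns
    miss_rate = 0.08                 # assumption: treated as misses per access

    amat = hit_time + miss_rate * miss_penalty
    print(amat)  # 12.6 ns (i.e., 4.2 cycles)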

To find execution time on a multi-core machine

I'm preparing for a competitive exam and I have an operating systems question.
I'm not getting how to solve it. Please help me out.
Q-)
A program took 160 seconds to execute on a single processor but only 64 seconds on a
4 core multicore. What is the best estimate for the execution time on a 64 core machine?
I don't think this is strictly relevant to programming (you might find this more relevant on the Math Stack Exchange), but I'll attempt to answer it anyway.
The answer will depend entirely on how you model execution time vs. number of cores. You could model the execution time as a fixed overhead plus a part inversely proportional to the number of cores. For example, I used the following model:
t = k/n + c
where t is time in seconds, n is the number of cores, and c (which could represent overhead) and k (a factor) are constants.
Solve simultaneously:
160 = k/1 + c
64 = k/4 + c
to get k = 128 and c = 32.
Then just substitute n = 64:
t = 128/64 + 32 = 34
So, you get 34 seconds according to this model. Of course, since you don't know the exact model, this can only be a calculated guess.
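
The same fit can be written as a few lines of Python (the model t = k/n + c is this answer's assumption, not something given in the question):

    # Model: t = k / n + c, fitted to (n=1, t=160) and (n=4, t=64)
    t1, n1 = 160, 1
    t2, n2 = 64, 4

    k = (t1 - t2) / (1 / n1 - 1 / n2)  # 128.0
    c = t1 - k / n1                    # 32.0

    print(k / 64 + c)  # 34.0 seconds predicted for 64 cores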

Amdahl's law example

Can someone please help me with this example and show me how to work the second part?
The question is:
If one third of a weather prediction algorithm is inherently serial and the remainder
parallelizable, what is the minimum number of cores needed to guarantee a 150% speedup over a
single core implementation?
ii. Your boss revises the figure to 200%. What is your new answer?
Thanks very much in advance!
Guess: if the algorithm is 1/3 serial and 2/3 parallel... I would think that each core you added would give you a 66% increase in performance... So for a 150% increase, you'd need 3 more cores, and for a 200% increase, you'd need 4.
This is a guess. Your textbook might be more helpful :)
If the algorithm runs on a single core and takes 90 minutes, then 30 minutes is for the serial part and 60 minutes for the parallel part.
Add a CPU:
30 minutes for the serial part and 30 minutes for the parallel part (the 60 minutes of parallel work is split across the two cores).
90 / 60 = 1.5, i.e. a 150% speedup.
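
In code, that back-of-the-envelope reasoning looks like this (the 90-minute single-core runtime is just this answer's illustrative number):

    total = 90.0               # minutes on a single core (illustrative)
    serial = total / 3         # 30 minutes, inherently serial
    parallel = total - serial  # 60 minutes of parallelizable work

    cores = 2
    time_on_2 = serial + parallel / cores  # 30 + 30 = 60 minutes
    print(total / time_on_2)   # 1.5 -> 150% speedup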
I am a bit late, but here are the answers:
1) 150% speedup -> at least 2 cores are required, as dbasnett said;
2) 200% speedup -> at least 4 cores are required, based on Amdahl's law:
Speedup = 1 / ((1 - P) + P/N)
Here, 90 minutes overall are required to perform the calculation. P is the enhanced part of the algorithm (the parallelizable part), which is 2/3, and N is the number of cores. With only one core:
Speedup = 1 / ((1 - 2/3) + (2/3)/1) = 1
You get 1, which means 100%, which is how the algorithm performs the standard way (without multi-core acceleration and therefore no parallelization speedup).
Now, we must find the number of cores N for which the previous equation equals 2, where 2 means that the algorithm performs in half the time (45 minutes instead of 90) and therefore with a 200% speedup:
1 / ((1/3) + (2/3)/N) = 2
Since:
(1/3) + (2/3)/N = 1/2, i.e. (2/3)/N = 1/6
we see that:
N = 4
So with 4 cores computing the 2/3 of the algorithm in parallel, you get a 200% speedup. The same goes for 150%: you will get N = 2, as dbasnett already told you.
Pretty simple.
Pretty simple.
Note that a complex algorithm may imply further divisions of its parallelizable part (and in theory you can have a different number of processing units per parallelizable part working concurrently).
You can further look at Wikipedia (there's also an example):
http://en.wikipedia.org/wiki/Amdahl%27s_law#Description
Anyway, the principle is the same:
Let T be the time an algorithm needs to execute in order to complete, A the serial part of it, B its parallelizable part, and N the number of parallel CPUs:
T = A + B
Speedup(N) = T / (A + B/N)
You can divide B into further small sections, say C, D, and G, and perform the calculations on each part:
Speedup = T / (A + C/N + D/N + G/N)
You may, for C, D, G, e.g. adopt M CPUs instead of N (the speedup will of course differ if M != N).
And at the end, you will arrive at a point where having more CPUs doesn't matter anymore, since:
B/N -> 0 as N -> infinity
and your algorithm's speedup will at most tend to the total execution time (T) divided by the execution time of the serial part alone (A):
Maximum speedup = T / A
Therefore, parallel calculation only really comes in handy when the serial part of your algorithm has a low execution time.
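
A short Python sketch that checks both parts numerically using the formula above (the serial fraction s = 1/3 comes from the question):

    # Amdahl's law: speedup(N) = 1 / (s + (1 - s)/N), with serial fraction s
    s = 1 / 3

    def speedup(n_cores):
        return 1 / (s + (1 - s) / n_cores)

    # Minimum cores for a target speedup, from 1/(s + (1-s)/N) = target:
    # N = (1 - s) / (1/target - s)
    def min_cores(target):
        return (1 - s) / (1 / target - s)

    print(min_cores(1.5))   # 2.0 cores for a 150% speedup
    print(min_cores(2.0))   # 4.0 cores for a 200% speedup
    print(speedup(10**9))   # ~3.0: the ceiling T/A = 1/s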