Instructions per second equation - cpu-architecture

This should be a simple question.
Say we have a 3.0 GHz processor with a CPI of 1.5. How many instructions per second does it execute? Just thinking logically, it would be the number of cycles per second times the number of instructions per cycle, which is:
3×10^9 cycles/second × 1.5 instructions/cycle = 4.5×10^9 instructions/second
Makes sense. Okay, so this is a question from my book and I look up the solutions just to make sure I understand and got it right. Well the solution says that it's:
3×10^9 / 1.5 = 2×10^9 instructions/sec
This answer comes from the clock rate/CPI part, but I am really failing to grasp how. If you substitute into clock rate/CPI like this:
(clock cycles/sec)/(instructions/clock cycle), it's basically the opposite of the original equation, because you divide cycles by instructions instead of multiplying them. The units don't even cancel out: you end up with a unit of cycles^2/(instructions×seconds). I have to be missing something totally obvious here or botching basic math, but my pea brain is not getting it.

Some relatively basic maths here:
$$\text{IPS} = \frac{\text{Instructions}}{\text{Second}}$$
You can multiply something by 1 without changing the result, and since X / X = 1, we can do the following:
$$\text{IPS} = \frac{\text{Instructions}}{\text{Seconds}} \times 1 = \frac{\text{Instructions}}{\text{Seconds}} \times \frac{\text{Clock Cycles}}{\text{Clock Cycles}}$$
You can then rearrange the fractions as follows:
$$\text{IPS} = \frac{\text{Instructions}}{\text{Clock Cycles}} \times \frac{\text{Clock Cycles}}{\text{Seconds}}$$
This gives you the middle part of the provided formula.
Then, given:
$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instructions}} \quad \text{and} \quad \text{Clock Rate} = \frac{\text{Clock Cycles}}{\text{Seconds}}$$
And since 1 / (A/B) = B/A:
$$\frac{1}{\text{CPI}} = \frac{\text{Instructions}}{\text{Clock Cycles}}$$
Therefore:
$$\text{IPS} = \frac{1}{\text{CPI}} \times \text{Clock Rate} = \frac{\text{Clock Rate}}{\text{CPI}}$$
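As a quick numerical check of IPS = Clock Rate / CPI, here is a minimal Python sketch using the numbers from the question (3.0 GHz, CPI of 1.5). The function name `instructions_per_second` is made up for illustration.

```python
def instructions_per_second(clock_rate_hz: float, cpi: float) -> float:
    """Return IPS = clock rate / CPI.

    clock_rate_hz is in cycles per second, cpi is in cycles per instruction,
    so the cycles cancel and the result is in instructions per second.
    """
    if cpi <= 0:
        raise ValueError("CPI must be positive")
    return clock_rate_hz / cpi


# Numbers from the question: 3.0 GHz clock, CPI of 1.5.
ips = instructions_per_second(3.0e9, 1.5)
print(f"{ips:.1e} instructions/second")
```

Because CPI is cycles per instruction (not instructions per cycle), dividing is the operation that makes the units cancel.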

Related

Operating system- Time Slice

Suppose a multiprogramming operating system allocated time slices of 10 milliseconds and the machine executed an average of five instructions per nanosecond.
How many instructions could be executed in a single time slice?
Please help me understand how to do this.
This sort of question is about cancelling units out after finding the respective ratios.
There are 1,000,000 nanoseconds (ns) per 1 millisecond (ms) so we can write the ratio as (1,000,000ns / 1ms).
There are 5 instructions (ins) per 1 nanosecond (ns) so we can write the ratio as (5ins / 1ns).
We know that the program runs for 10ms.
Then we can write the equation like so such that the units cancel out:
instructions = (10ms) * (1,000,000ns/1ms) * (5ins/1ns)
instructions = (10 * 1,000,000ns)/1 * (5ins/1ns) -- milliseconds cancel
instructions = (10,000,000 * 5ins)/1 -- nanoseconds cancel
instructions = 50,000,000ins -- left with instructions
We can reason that it is at least the 'right kind' of ratio setup, even if the ratios themselves were wrong, because the only unit we are left with is instructions, which matches the type of unit expected in the answer.
In the above I started with the 1,000,000ns/1ms ratio, but I could also have used 1,000,000,000ns/1,000ms (= 1 second / 1 second) and ended with the same result.
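The same unit-cancelling steps can be sketched in a few lines of Python. The names `NS_PER_MS` and `instructions_in_slice` are made up for this example; the inputs are the 10 ms slice and 5 instructions per nanosecond from the question.

```python
NS_PER_MS = 1_000_000  # nanoseconds in one millisecond


def instructions_in_slice(slice_ms: int, instructions_per_ns: int) -> int:
    """Convert the slice to nanoseconds, then multiply by the execution rate."""
    slice_ns = slice_ms * NS_PER_MS          # ms * (ns / ms) -> ns
    return slice_ns * instructions_per_ns    # ns * (ins / ns) -> ins


print(instructions_in_slice(10, 5))  # 50000000
```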

What is the speedup? Can't understand the solution

I'm going through a Computer Architecture MOOC in my own time. There is a problem I can't solve. The solution is provided, but I can't understand it. Can someone help me out? Here is the problem and its solution:
Consider an unpipelined processor. Assume that it has 1-ns clock cycle
and that it uses 4 cycles for ALU operations and 5 cycles for branches
and 4 cycles for memory operations. Assume that the relative
frequencies of these operations are 50 %, 35 % and 15 % respectively.
Suppose that due to clock skew and set up, pipelining the processor
adds 0.15 ns of overhead to the clock. Ignoring any latency impact,
how much speed up in the instruction execution rate will we gain from
a pipeline?
Solution
The average instruction execution time on an unpipelined processor is
clock cycle × average CPI = 1 ns × ((0.5 × 4) + (0.35 × 5) + (0.15 × 4)) = 4.35 ns.
The average instruction execution time on the pipelined processor is 1 ns + 0.15 ns = 1.15 ns.
So speedup = 4.35 / 1.15 = 3.78.
My question:
Where is the 0.15 coming from in the average instruction execution time on a pipelined processor? Can anyone explain?
Any help is really appreciated.
As the question says those 0.15ns are due to clock skew and pipeline setup.
Forget about pipeline setup and imagine that all of the 0.15ns are from clock skew.
I think the solution implies that the pipelined CPI (Cycles Per Instruction) is one (1), so that without the overhead each instruction takes one 1-ns clock cycle, which I'm assuming is the CPU clock (1 GHz).
However, I'm not seeing anywhere the CPI is clearly identified as one (1).
Did I misunderstand anything here?
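The arithmetic in the quoted solution can be reproduced with a short Python sketch. It assumes, as discussed above, that the pipelined processor completes one instruction per cycle (CPI of 1), which the problem statement does not spell out. The names `average_cpi` and `mix` are made up for illustration.

```python
def average_cpi(mix):
    """Weighted average CPI for a list of (relative frequency, cycles) pairs."""
    return sum(freq * cycles for freq, cycles in mix)


# (relative frequency, cycles) for ALU, branch and memory operations.
mix = [(0.50, 4), (0.35, 5), (0.15, 4)]

clock_ns = 1.0       # unpipelined clock cycle
overhead_ns = 0.15   # clock skew and setup overhead added by pipelining

unpipelined_ns = clock_ns * average_cpi(mix)
# Assumption: the pipelined CPI is 1, so one instruction finishes every
# (clock cycle + overhead).
pipelined_ns = clock_ns + overhead_ns

speedup = unpipelined_ns / pipelined_ns
print(round(unpipelined_ns, 2), round(pipelined_ns, 2), round(speedup, 2))
```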

Interrupt time in DMA operation

I'm facing difficulty with the following question:
Consider a disk drive with the following specifications.
16 surfaces, 512 tracks/surface, 512 sectors/track, 1 KB/sector, rotation speed 3000 rpm. The disk is operated in cycle stealing mode whereby whenever 1 byte word is ready it is sent to memory; similarly for writing, the disk interface reads a 4 byte word from the memory in each DMA cycle. Memory Cycle time is 40 ns. The maximum percentage of time that the CPU gets blocked during DMA operation is?
The solution to this question provided on the only site is:
Revolutions Per Min = 3000 RPM
or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
No. of tracks read per second = (2^19/2^2)*50
= 6553600 ............. (1)
Interrupt = 6553600 takes 0.2621 sec
Percentage Gain = (0.2621/1)*100
= 26 %
I have understood till (1).
Can anybody explain to me where 0.2621 comes from? How is the interrupt time calculated? Please help.
Working backwards from the numbers you've given, that's 6553600 * 40ns, which gives 0.2621 sec.
One quite obvious problem is that the comments in the calculations are somewhat wrong. It's not
Revolutions Per Min = 3000 RPM ~ or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
No. of tracks read per second = (2^19/2^2)*50 <- WRONG
The numbers are 512K / 4 * 50, so it's in bytes. How could that be called the 'number of tracks'? Reading a full track takes 1 full rotation, so the number of tracks readable in 1 second is 50, as there are 50 RPS.
However, the total bytes readable in 1s is then just 512K * 50 since 512K is the amount of data on the track.
But then it is further divided by 4..
So, I guess, the actual comments should be:
Revolutions Per Min = 3000 RPM ~ or 3000/60 = 50 RPS
In 1 Round it can read = 512 KB
Interrupts per second = (2^19/2^2) * 50 = 6553600 (*)
Interrupt triggers one memory op, so then:
total wasted: 6553600 * 40ns = 0.2621 sec.
However, I don't really like how the 'number of interrupts per second' is calculated. I currently don't see/feel/guess how or why it's just Bytes/4.
The only VAGUE explanation of that "divide it by 4" I can think of is:
At each byte written to the controller's memory, an event is triggered. However the DMA controller can read only PACKETS of 4 bytes. So, the hardware DMA controller must WAIT until there are at least 4 bytes ready to be read. Only then the DMA kicks in and halts the bus (or part of) for a duration of one memory cycle needed to copy the data. As bus is frozen, the processor MAY have to wait. It doesn't NEED to, it can be doing its own ops and work on cache, but if it tries touching the memory, it will need to wait until DMA finishes.
However, I don't like a few things in this "explanation". I cannot guarantee you that it is valid. It really depends on what architecture you are analyzing and how the DMA/CPU/BUS are organized.
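The reconstructed calculation above can be checked with a short Python sketch. It follows the quoted solution's assumptions: one 4-byte transfer per DMA cycle and one 40 ns memory cycle stolen per transfer. The function name `blocked_fraction` and the constant names are made up for illustration.

```python
RPM = 3000
BYTES_PER_TRACK = 512 * 1024   # 512 sectors/track * 1 KB/sector = 2^19 bytes
WORD_BYTES = 4                 # bytes moved per DMA cycle (assumed, as in the solution)
MEMORY_CYCLE_NS = 40


def blocked_fraction(rpm, bytes_per_track, word_bytes, memory_cycle_ns):
    """Fraction of each second during which memory is busy with DMA transfers."""
    revs_per_sec = rpm / 60                                # 3000 RPM -> 50 RPS
    transfers_per_sec = bytes_per_track / word_bytes * revs_per_sec
    blocked_ns_per_sec = transfers_per_sec * memory_cycle_ns
    return blocked_ns_per_sec / 1e9                        # ns per second -> fraction


fraction = blocked_fraction(RPM, BYTES_PER_TRACK, WORD_BYTES, MEMORY_CYCLE_NS)
print(f"{fraction * 100:.2f} %")
```

The 50 in the formula is only there to express the transfer count per second, so that the stolen time can be compared against one second of CPU time.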
The only mistake is that it's not the
no. of tracks read.
It's actually the no. of interrupts that occurred (the no. of times the DMA came up with its data; the CPU will be blocked that many times).
But again, I don't know why it has been multiplied by 50, probably because of the 1 second, but I wish to solve this without multiplying by 50.
My Solution:-
Here, in 1 rotation the interface can read 512 KB of data. 1 rotation time = 0.02 sec. So, the preparation time for one byte of data = 39.1 nsec, and for 4 B it takes 156.4 nsec. Memory cycle time = 40 ns. So, the % of time the CPU gets blocked = 40/(40+156.4) = 0.2036 ~= 20 %. But in the answer booklet the options are given as A) 10 B) 25 C) 40 D) 50. Can you tell me if I'm doing something wrong?

Increasing/Altering Matlab-Arduino analogRead() sampling rate

I have been controlling an Arduino from Matlab using the ArduinoIO-Matlab interface. In my current setup, 3 EMG Muscle Sensors (from Advancer Technologies) are connected to the Arduino at analog pins 1, 2, and 3, and the Arduino is connected to Matlab. I am trying to collect data from these three pins simultaneously and store it in a 1000x3 matrix. My issue is the rate at which Matlab samples the analog pins: it takes about 25 seconds to collect 1000 readings from the 3 pins. I know the Arduino itself samples at a higher rate. Below is my code. How do I alter it to get a sampling rate of about 1000 samples in 10 seconds?
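ar = arduino('COM3');    % connect to the Arduino on COM3
ax = zeros(1000,3);      % preallocate 1000 samples x 3 channels
for ai = 1:1000
    ax(ai,:) = [ar.analogRead(1) ar.analogRead(2) ar.analogRead(3)];  % read pins 1-3 into row ai
end
delete(ar);              % close the connection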
This is the time taken by the above code (profile viewer):
  time   calls  line
< 0.01       1     3  ax = zeros(1000,3);
                   4
< 0.01       1     5  for ai = 1:1000
 25.07    1000     6  ax(ai,:) = [ar.analogRead(1) ar.analogRead(2) ar.analogRead(3)];
          1000     7  end
                   8
  1.24       1     9  delete(ar);
Please let me know if there is something else that I need to clarify.
Thanks :D
You need to modify the Arduino C++ code (.pde file).
In this code you should sample the signal as you prefer (1000 samples, for example) and then transfer the sampled data to Matlab using the Serial.println() method.
This will give you a sampling rate of ~3 kHz (depending on a lot of factors)...
The following very probably explains the result that you are seeing and why you need to do something like what Muhammad's answer suggests. While this reason was implied by his answer it was not spelt out so that others can avoid the 'trap'.
I do not have access to the underlying code and systems needed to check this answer with certainty. This answer is based on "typical methods" and has a modest chance of being sheer poppycock [tm], but the exact fit between observation and standard methods suggests this is what is happening. A very little delving by someone with the requisite system to hand will demonstrate if this is correct.
When data is sent one data sample at a time you incur a per-sample overhead significantly in excess of the time taken to just transfer the raw data.
You say it takes 25 seconds to transfer 3000 samples.
The time per sample = 25/3000 = 8.333 ms per sample.
Assume a 9600 baud data transfer rate.
The default communications speed is likely to be 9600 baud. This can be checked, but the result suggests that this may be correct, and making slightly different assumptions provides an equally good explanation.
Serial comms usually use the 8N1 format = 1 start bit, 8 data bits, no parity, 1 stop bit, i.e. 10 bits per 8-bit byte.
So 1 bit takes 1/9600 s
and 10 bits take 10/9600 s = 1.042 ms
And sample time / byte time
= 8.333 / 1.042 = 7.997 word times.
In fact if you do the calculations without rounding or truncation, ie
25 / 3000 x 9600/10 = 8.000.... .
ie your transfer is taking EXACTLY 8 x 9600 baud word times per sample.
Equally, this is exactly 4 x 4800 baud or 2 x 2400 baud transfer times.
I have not examined the format used but imagine that to work with the PC monitor program the basic serial routine may use
2 x data bytes + CR + LF = 4 bytes.
That assumes a 16 bit variable sent as 2 x 8 bit binary words.
More likely = either
- 16 bits sent as 4 x ASCII characters or
- 24 bits sent as 6 x ASCII characters.
In the absence of suitably deep delving, the use of 6 ASCII words and a CR + LF at 9600 baud provides such a good fit using typical parameters that Occam probably opines that this is the best starting point. Regardless of whether the total requirement is 8 or 4 or 2 bytes, the somewhat serendipitous exact match between your observed data rate and standard baud rates suggests that this provides the basic reason for what you see.
Looking at the code will rapidly show what baud rate, data length and packing is used.
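The "exactly 8 word times per sample" observation can be reproduced with a small Python sketch. It uses exact fractions to avoid rounding, and it assumes 10 bits on the wire per byte as described above. The function name `word_times_per_sample` is made up for illustration.

```python
from fractions import Fraction


def word_times_per_sample(total_seconds, samples, baud, bits_per_word=10):
    """How many serial word (byte) times fit into the time taken by one sample.

    bits_per_word=10 assumes 1 start bit + 8 data bits + 1 stop bit.
    """
    sample_time = Fraction(total_seconds, samples)   # seconds per sample
    word_time = Fraction(bits_per_word, baud)        # seconds per byte on the wire
    return sample_time / word_time


# 25 seconds for 3000 samples, checked against a few standard baud rates.
for baud in (9600, 4800, 2400):
    print(baud, word_times_per_sample(25, 3000, baud))
```

A whole-number result at a standard baud rate is what suggests that the per-sample time is dominated by serial transfer rather than by the analog conversion itself.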

Amdahl's law example

Can someone help me with this example please and show me how to work the second part?
The question is:
If one third of a weather prediction algorithm is inherently serial and the remainder
parallelizable, what is the minimum number of cores needed to guarantee a 150% speedup over a
single core implementation?
ii. Your boss revises the figure to 200%. What is your new answer?
Thanks very much in advance !!
Guess: If the algorithm is 1/3 serial and 2/3 parallel...I would think that each core you added would give you a 66% increase in performance...So for 150% increase, you'd need 3 more cores, and for a 200% increase, you'd need 4.
This is a guess. Your textbook might be more helpful :)
If the algorithm runs on a single core and takes 90 minutes then 30 minutes is for the serial part and 60 minutes for the parallel part.
Add a CPU:
30 is for the serial part and 30 for the parallel part (the 60 minutes are split between the two CPUs, which run at the same time).
90 / 60 = 1.5, i.e. a 150% speedup.
I am a bit late, but here are the answers:
1) 150% increase -> at least 2 cores are required, as dbasnett said;
2) 200% increase -> at least 4 cores are required, based on Amdahl's law:
$$S(N) = \frac{1}{(1 - P) + \frac{P}{N}}$$
Here, 90 minutes overall are required to perform the calculation. P is the fraction of the algorithm that is actually enhanced (the parallelizable part), which is 2/3 of the 90 minutes, and N is the number of cores, so when there is only one core:
$$S(1) = \frac{1}{\frac{1}{3} + \frac{2/3}{1}} = 1$$
You get 1, which means 100%, which is how the algorithm performs the standard way (without multi-core acceleration and therefore no parallelization speedup).
Now, we must find the number of cores N for which the previous equation equals 2, where 2 means that the algorithm runs in half the time (45 minutes instead of 90 when there's no parallelization) and therefore with a 200% speedup:
$$\frac{1}{\frac{1}{3} + \frac{2}{3N}} = 2$$
Since:
$$\frac{1}{3} + \frac{2}{3N} = \frac{1}{2} \quad\Rightarrow\quad \frac{2}{3N} = \frac{1}{6}$$
We see that:
$$N = 4$$
So with 4 cores computing the 2/3 of the algorithm in parallel, you get a 200% speedup. The same goes for 150%: solving for a speedup of 1.5 gives N = 2, as dbasnett already told you.
Pretty simple.
Note that a complex algorithm may imply further divisions of its parallelizable parts (and in theory you can have a different number of processing units per parallelizable part concurrently):
You can further look at Wikipedia (there's also an example):
http://en.wikipedia.org/wiki/Amdahl%27s_law#Description
Anyway, the principle is the same:
Let T be the time an algorithm needs to execute in order to complete, A be the serial part of it, B its parallelizable part and N the number of parallel CPUs, you can divide B in further small sections and perform calculations on each part:
You may for C, D, G e.g. adopt M CPUs instead of N (the speedup will of course differ if M != N).
And at the end, you will arrive at a point where having more CPUs doesn't matter anymore, since the time spent in the parallel part, B/N, tends to zero as N grows.
Your algorithm's speedup will therefore at most tend to the total execution time (T) divided by the execution time of the serial part only (A).
Therefore parallel calculation comes in really handy only when the serial part of your algorithm has a low execution time.
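To check both parts of the question numerically, here is a minimal Python sketch of Amdahl's law, speedup = 1 / ((1 - P) + P/N), with P the parallelizable fraction and N the number of cores. It reads "150%" and "200%" as speedups of 1.5x and 2x, as the answers above do. The function names `amdahl_speedup` and `min_cores` are made up for illustration, and exact fractions are used so the comparisons are not affected by rounding.

```python
from fractions import Fraction


def amdahl_speedup(parallel_fraction, cores):
    """Speedup over a single core when only parallel_fraction of the work scales."""
    return 1 / ((1 - parallel_fraction) + parallel_fraction / cores)


def min_cores(parallel_fraction, target_speedup):
    """Smallest core count whose speedup reaches target_speedup.

    Assumes 0 <= parallel_fraction < 1. The speedup can never reach
    1 / (1 - parallel_fraction), so such targets are rejected.
    """
    limit = 1 / (1 - parallel_fraction)
    if target_speedup >= limit:
        raise ValueError(f"speedup is bounded above by {limit}")
    cores = 1
    while amdahl_speedup(parallel_fraction, cores) < target_speedup:
        cores += 1
    return cores


P = Fraction(2, 3)                       # two thirds of the algorithm is parallelizable
print(min_cores(P, Fraction(3, 2)))      # 2
print(min_cores(P, 2))                   # 4
```

With one third of the work serial, the bound is a speedup of 3, so a target of 300% or more cannot be met with any number of cores.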