number of misses in spatial locality - cpu-architecture

I am confused about how the number of misses is calculated in the example below (from Computer Architecture: A Quantitative Approach).
Example:
For the code below, determine which accesses are likely to cause data cache
misses. Next, insert prefetch instructions to reduce misses. Finally, calculate the
number of prefetch instructions executed and the misses avoided by prefetching.
Let’s assume we have an 8 KB direct-mapped data cache with 16-byte blocks,
and it is a write-back cache that does write allocate. The elements of a and b are 8
bytes long since they are double-precision floating-point arrays. There are 3 rows
and 100 columns for a and 101 rows and 3 columns for b. Let’s also assume they
are not in the cache at the start of the program.
for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1)
        a[i][j] = b[j][0] * b[j+1][0];
Answer:
The compiler will first determine which accesses are likely to cause cache
misses; otherwise, we will waste time on issuing prefetch instructions for data
that would be hits. Elements of a are written in the order that they are stored in
memory, so a will benefit from spatial locality: The even values of j will miss
and the odd values will hit. Since a has 3 rows and 100 columns, its accesses will
lead to 3 × (100/2), or 150 misses.
How are the 150 misses calculated, or why is 100 divided by 2?
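For what it's worth, here is a tiny C sketch (just an illustration, not from the book) that counts how many distinct 16-byte blocks the writes to a touch, using the sizes from the example (row-major layout, 8-byte doubles). The first touch of each block is the miss at an even j; the write at the following odd j lands in the same block and hits:

#include <stdio.h>

int main(void) {
    /* Sizes taken from the example: a[3][100], 8-byte doubles, 16-byte blocks.
       Only the sequential writes to a are modeled; accesses to b and cache
       conflicts are ignored, as in the book's answer. */
    const int ROWS = 3, COLS = 100;
    const int ELEM = 8, BLOCK = 16;                 /* sizes in bytes */
    long last_block = -1, misses = 0;

    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++) {
            long addr  = (long)(i * COLS + j) * ELEM;  /* byte offset of a[i][j] */
            long block = addr / BLOCK;
            if (block != last_block) {              /* first touch of this block: miss */
                misses++;
                last_block = block;
            }
        }

    printf("misses to a: %ld\n", misses);           /* prints 150 */
    return 0;
}

In other words: 3 x 100 elements x 8 bytes = 2400 bytes, and 2400 / 16 = 150 blocks, which is the 3 × (100/2) in the book's answer - each block holds two consecutive elements of a, so only every second access misses.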

Related

What’s the performance difference between moving(sum, X, 10) and msum(X, 10) and what causes the difference?

For a vector of length 1 million, what’s the performance difference between moving(sum, X, 10) and msum(X, 10) and what causes the difference?
The calculation speed of msum will be 50 to 200 times higher than that of the moving function; it may vary with the data volume. The reasons are as follows:
The functions process the data differently: msum loads the data into memory once and does not need to allocate memory separately for each calculation, while moving generates a sub-object and reallocates memory for that sub-object on every calculation, with the memory reclaimed after each calculation completes.
The function msum also computes incrementally: each calculation takes the previous result, adds the value entering the window, and subtracts the value leaving the window, while moving adds up all the data in the window for every calculation.
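For illustration, here is a minimal C sketch of the incremental windowed-sum idea described above (this is not DolphinDB's actual msum implementation, just the technique):

#include <stdio.h>

/* Incremental sliding-window sum: each step reuses the previous sum, adding
   the element entering the window and subtracting the one leaving it, so the
   cost per output is O(1) instead of O(window size). */
void msum_like(const double *x, int n, int w, double *out) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        s += x[i];                        /* element entering the window */
        if (i >= w) s -= x[i - w];        /* element leaving the window */
        out[i] = (i >= w - 1) ? s : 0.0;  /* emit only for full windows */
    }
}

int main(void) {
    double x[] = {1, 2, 3, 4, 5, 6}, out[6];
    msum_like(x, 6, 3, out);
    for (int i = 0; i < 6; i++)
        printf("%.0f ", out[i]);          /* prints: 0 0 6 9 12 15 */
    printf("\n");
    return 0;
}

A naive implementation does w additions per output, while the incremental one does one addition and one subtraction regardless of w, which is where the large speedup comes from.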

What is the complexity of external merge sort?

On Wikipedia, the time complexity of external sort is given as
(N/B) · log_{M/B}(N/B)
where N is the total size of the data, M is the memory size, and B is the number of chunks in the memory. I can understand the log part, since we sort each chunk in RAM; however, I could not understand why the base of the log is M/B.
Any help would be appreciated!
After the sorting phase, the merge phase processes m runs in parallel; therefore you get the base m = M/B.
Source: wikipedia.org/wiki/External_memory_algorithm
The confusion is due to the statement "M is memory size and ... B is the number of chunks in the memory."
In the Wikipedia article, B is the block size per chunk, so the number of chunks in memory = M/B. The Wikipedia time complexity ignores the fact that one of the chunks is used for the merged output, and that the algorithm uses a k-way merge where k = (M/B) - 1.
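As a quick sanity check with made-up numbers (purely illustrative, not from the question): suppose the memory holds M/B = 100 blocks and the data occupies N/B = 1,000,000 blocks.
The sorting phase reads and writes all 1,000,000 blocks once and produces about N/M = 10,000 sorted runs.
The merge phase merges roughly 100 runs at a time: 10,000 runs -> 100 runs -> 1 run, i.e. 2 = log_100(10,000) passes, each again touching all 1,000,000 blocks.
Counting the sorting pass as well gives about 3 = log_100(1,000,000) = log_{M/B}(N/B) passes of N/B block transfers each, which is exactly the (N/B) · log_{M/B}(N/B) formula.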

Is there an efficient way to calculate the smallest N numbers from a set of numbers in hardware (HDL)?

I am trying to calculate the smallest N numbers from a set, and I've found software algorithms to do this. I'm wondering if there is an efficient way to do this in hardware (i.e. in HDL - SystemVerilog or Verilog)? I am specifically trying to calculate the smallest 2 numbers from a set.
I am trying to do this combinationally, optimizing with respect to area and speed (for a large set of signals), but I can only think of comparator trees to do this. Is there a more efficient way of doing this?
Thank you, any help is appreciated~
I don't think you can work around using comparator trees if you want to find the two smallest elements combinationally. However, if your goal isn't low latency, then a (possibly pipelined) sequential circuit could also be an option.
One approach that I can come up with on the spot would be to break the operation down into a kind of incomplete bubble sort in hardware using small sorting networks. Depending on the amount of area you are willing to spend, you can use a smaller or larger p-sorting network that combinationally sorts p elements at a time, where p >= 3. You then apply this network to your input set of size N, sorting p elements at a time. The two smallest elements of each group are stored in some sort of memory (e.g. an SRAM, if you want to process larger numbers of elements).
Here is an example for p=3 (the brackets indicate the grouping of elements the p-sorter is applied to):
(4 0 9) (8 6 7) (4 2 1) --> (0 4 9) (6 7 8) (1 2 4) --> 0 4 6 7 1 2
Now you start the next round:
You apply the p-sorter on the results of the first round.
Again you store the two smallest outputs of your p-sorter into the same memory overwriting values from the previous round.
Here the continuation of the example:
(0 4 6) (7 1 2) --> (0 4 6) (1 2 7) --> 0 4 1 2
In each round the number of elements still to look at shrinks to 2/p of what it was. E.g. with p == 4 you discard half the elements in each round, until the smallest two elements are stored at the first two memory locations. The number of rounds is therefore O(log N), and because the remaining set shrinks geometrically, the total number of sorting-network applications (and hence cycles) is roughly N/(p-2), i.e. linear in N. For an actual hardware implementation, you probably want to stick to powers of two for the size p of the sorting network.
Although the control logic of such a circuit is not trivial to implement, the area should be dominated mainly by the size of your sorting network and by the memory you need to hold the first 2/p*N intermediate results (assuming your input signals are not already stored in a memory that you can reuse for that purpose). If you want to tune your circuit towards throughput, you can increase p and pipeline the sorting network at the expense of additional area. Additional speedup could be gained by replacing the single memory with up to p two-port memories (1 read and 1 write port each), which would allow you to fetch and write back the data for the sorting network in a single cycle, thus increasing the utilization of the comparators in the sorting network.
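Not HDL, but here is a small behavioral C model of the round-based reduction sketched above (p = 3 to match the example; a tiny insertion sort stands in for the combinational p-sorting network), just to make the data flow concrete:

#include <stdio.h>

#define P 3                                   /* size of the p-sorter */

/* Stand-in for the combinational p-element sorting network. */
static void sort_p(int *v, int n) {
    for (int i = 1; i < n; i++)
        for (int j = i; j > 0 && v[j] < v[j-1]; j--) {
            int t = v[j]; v[j] = v[j-1]; v[j-1] = t;
        }
}

int main(void) {
    int buf[] = {4, 0, 9, 8, 6, 7, 4, 2, 1};  /* the example input */
    int n = 9;

    while (n > 2) {                           /* one iteration = one round */
        int m = 0;                            /* write index for survivors */
        for (int i = 0; i < n; i += P) {
            int g = (n - i < P) ? (n - i) : P;
            sort_p(&buf[i], g);               /* sort one group of <= P values */
            for (int k = 0; k < g && k < 2; k++)
                buf[m++] = buf[i + k];        /* keep the two smallest */
        }
        n = m;
    }
    printf("two smallest: %d %d\n", buf[0], buf[1]);   /* prints: 0 1 */
    return 0;
}

In hardware, each sort_p call corresponds to one application of the sorting network, and buf corresponds to the intermediate-result memory.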

Operating Systems Virtual Memory

I am a student taking an operating systems course for the first time. I have a doubt about the performance-degradation calculation for demand paging. In the Silberschatz book on operating systems, the following lines appear:
"If we take an average page-fault service time of 8 milliseconds and a
memory-access time of 200 nanoseconds, then the effective access time in
nanoseconds is
effective access time = (1 - p) x 200 + p x (8 milliseconds)
                      = (1 - p) x 200 + p x 8,000,000
                      = 200 + 7,999,800 x p.
We see, then, that the effective access time is directly proportional to the
page-fault rate. If one access out of 1,000 causes a page fault, the effective
access time is 8.2 microseconds. The computer will be slowed down by a factor
of 40 because of demand paging! "
How did they calculate the slowdown here? Are 'performance degradation' and slowdown the same thing?
This whole thing is nonsensical. It assumes a fixed page-fault rate p, which is not realistic in itself; that rate is the fraction of memory accesses that result in a page fault.
1 - p is the fraction of memory accesses that do not result in a page fault.
T = (1 - p) x 200 ns + p x 8 ms is then the average time of a memory access.
Expanded
T = 200 ns + p x (8 ms - 200 ns)
T = 200 ns + p x 7,999,800 ns
The whole thing is rather silly.
All you really need to know is that a nanosecond is a billionth of a second and a millisecond is a thousandth of a second.
Using these units, there is a factor of a million difference between a memory access time (nanoseconds) and a disk access time (milliseconds).
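For completeness, plugging the book's own numbers into the formula shows where the factor of 40 comes from (a tiny sketch using the 200 ns, 8 ms, and 1-in-1000 figures quoted above):

#include <stdio.h>

int main(void) {
    double mem_ns   = 200.0;          /* memory access time */
    double fault_ns = 8e6;            /* 8 ms page-fault service time, in ns */
    double p        = 1.0 / 1000.0;   /* one access in 1000 causes a fault */

    double eat = (1 - p) * mem_ns + p * fault_ns;   /* ~8199.8 ns = 8.2 us */
    printf("effective access time: %.1f ns\n", eat);
    printf("slowdown vs. 200 ns:   %.0f x\n", eat / mem_ns);  /* ~41 */
    return 0;
}

So "slowed down by a factor of 40" just means the effective access time (about 8.2 microseconds) divided by the fault-free access time (0.2 microseconds) is roughly 41, which the book rounds to 40.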

Main memory bandwidth measurement

I want to measure the main memory bandwidth, and while looking for a methodology I found that:
(1) Many use the 'bcopy' function to copy bytes from a source to a destination and then measure the time, which they report as the bandwidth.
(2) Another way is to allocate an array and walk through it (with some stride); this basically gives the time to read the entire array.
I tried doing (1) for a data size of 1 GB, and the bandwidth I got is 700 MB/sec (I used rdtsc to count the number of cycles elapsed for the copy). But I suspect that this is not correct, because my RAM config is as follows:
Speed: 1333 MHz
Bus width: 32bit
As per Wikipedia, the theoretical bandwidth is calculated as follows:
clock speed * bus width * # bits per clock cycle per line (2 for DDR3 RAM):
1333 MHz * 32 * 2 ~= 8 GB/sec.
So mine is completely different from the estimated bandwidth. Any idea what I am doing wrong?
=========
The other question is: bcopy involves both a read and a write, so does that mean I should divide the calculated bandwidth by two to get only the read or only the write bandwidth? I would also like to confirm whether the bandwidth is just the inverse of latency. Please suggest any other ways of measuring the bandwidth.
I can't comment on the effectiveness of bcopy, but the most straightforward approach is the second method you stated (with a stride of 1). Additionally, you are confusing bits with bytes in your memory bandwidth equation: 32 bits = 4 bytes. Modern computers use 64-bit-wide memory buses, so your effective transfer rate (assuming DDR3 tech) is
1333 MHz * 64 bit / (8 bits/byte) = 10666 MB/s (also classified as PC3-10666)
The 1333 MHz already has the 2 transfers/clock factored in.
Check out the wiki page for more info: http://en.wikipedia.org/wiki/DDR3_SDRAM
Regarding your results, try again with the array access. Malloc 1GB and traverse the entire thing. You can sum each element of the array and print it out so your compiler doesn't think it's dead code.
Something like this:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    size_t size = 1024UL * 1024 * 1024;       /* 1 GB */
    char *array = (char *)malloc(size);       /* first pass also faults the pages in */
    long long sum = 0;
    clock_t start = clock();                  /* start timer here */
    for (size_t i = 0; i < size; i++)
        sum += array[i];
    clock_t end = clock();                    /* end timer */
    double time = (double)(end - start) / CLOCKS_PER_SEC;
    printf("time taken: %f \tsum is %lld\n", time, sum);
    free(array);
    return 0;
}