Is the formula 2b(1 + ⌈log_dm(nr)⌉) for the total number of I/O accesses in merge sort correct? - mergesort

I am studying databases from the book Fundamentals of Database Systems by Elmasri and Navathe, 5th edition. Near the beginning of chapter 15 they briefly explain external sorting using merge sort. They divide the algorithm into two phases:
1) Sorting: They use the following notation:
b = Number of blocks of the data file we want to sort.
nb = Number of buffers (blocks) in memory available to do the sorting.
nr = Number of portions.
In this phase we load into memory as many blocks of the data file as fit, we sort them using any internal sorting algorithm, and we write them out as a temporary sorted subfile. We repeat this with the rest of the blocks of the file, so we will get more sorted subfiles. Those subfiles are what they call "portions", and the number of them is:
nr = ⌈ b / nb ⌉.
The symbols ⌈ ⌉ denote the ceiling function. The I/O cost of this phase is 2b, because we need to read each block once (b accesses), and then writing out all the portions takes another b accesses.
2) Merging: They say something similar to this (I rewrote it using my interpretation to make it clearer):
The resulting portions (sorted subfiles) are merged in one or more passes. For each pass, one block of memory is reserved as the output block, where the result of the merge is placed, and the rest are used as input blocks, of which there can be up to nb - 1; one block of each of the sorted portions is placed in them at a time, so that the portions can be merged. More than one pass is needed when there are fewer input blocks than portions. In addition, since each portion can have more than one block, each pass is subdivided into iterations, in each of which one block of each portion is loaded.
dm = Degree of merging, that is, the number of portions that can be merged in each pass.
The number dm must be equal to the minimum of (nb - 1) and nr. The number of passes is:
⌈log_dm(nr)⌉.
The part I am confused by is that they say that the cost of this phase is
2b · ⌈log_dm(nr)⌉,
so they are basically implying that in each pass we only need to read each block once and write it once, but I am not sure if that is correct. I suspect that more accesses may be necessary.
Therefore, the total cost of the algorithm is 2b + 2b · ⌈log_dm(nr)⌉
= 2b (1 + ⌈log_dm(nr)⌉).
Actually, they don't say it in exactly that way, but rather: "In general, the logarithm is taken in base dm and the expression indicating the number of blocks accessed is as follows:"
(2 · b) + (2 · (b · log_dm(nr))),
which is basically the same.
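To keep the arithmetic straight, here is a minimal MATLAB sketch (the variable names are mine, not the book's) that simply evaluates these formulas for a given b and nb; for the example that follows (b = 10, nb = 4) it gives nr = 3, dm = 3, one pass, and 2b(1 + 1) = 40 total block accesses.
b  = 10;                              % number of blocks in the data file
nb = 4;                               % number of buffer blocks in memory
nr = ceil(b / nb);                    % number of initial portions (sorted subfiles)
dm = min(nb - 1, nr);                 % degree of merging
passes   = ceil(log(nr) / log(dm));   % ceil(log_dm(nr)), assumes dm > 1
total_io = 2*b * (1 + passes);        % the book's 2b(1 + ceil(log_dm(nr)))
fprintf('nr = %d, dm = %d, passes = %d, total I/O = %d\n', nr, dm, passes, total_io)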
For example, suppose we have a file of 10 blocks, with 3 records per block. The available space in memory (buffer pool) is 4 blocks. Let's separate the blocks of the file with ||:
29,11,27 || 22,1,20 || 7,30,26 || 9,8,21 || 13,24,15 || 23,4,28 || 17,12,10||
5,3,6 || 16,19,2 || 25,14,18
The number of portions 'nr' that result in the sorting phase is ⌈10/4⌉ = 3.
p1 = 1,7,8 || 9,11,20 || 21,22,26 || 27,29,30
p2 = 3,4,5 || 6,10,12 || 13,15,17 || 23,24,28
p3 = 2,14,16 || 18,19,25
In the merging phase, dm = min{nb - 1, nr} = min{4 - 1, 3} = 3. Then, the number of passes is ⌈log_3(3)⌉ = 1. According to the formula, we should make 20 I/Os in this phase.
Iteration 1: We put these blocks in memory:
1,7,8 || 3,4,5 || 2,14,16
and they transform into this (one block at a time, each of which is written to disk):
1,2,3 || 4,5,7 || 8,14,16
6 accesses to disk.
Iteration 2:
9,11,20 || 6,10,12 || 18,19,25
and they transform into this:
6,9,10 || 11,12,18 || 19,20,25
6 accesses to disk (12 accumulated so far).
What am I doing wrong, and how do I continue?

I'm assuming the initial pass produces sorted runs of size {3,3,3,3,3,3,3,3,3,3} records (10 runs of one block each, 30 records in total). I'm not sure about dm, but the number of merge passes is ⌈log_3(10)⌉ = 3. The 1st merge pass results in sorted runs of size {9,9,9,3} (10 blocks). The 2nd merge pass results in sorted runs of size {27,3} (10 blocks). The 3rd merge pass results in a single sorted run {30} (10 blocks).
The initial pass and the 3 merge passes each involve 20 I/Os, for a total of 80 I/Os.
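For what it's worth, the pass counting above can be checked with a short MATLAB sketch (my own code, under the same assumption of 10 initial one-block runs and a merge fan-in of 3): it repeatedly merges up to dm runs at a time and charges one read and one write per block per pass.
b = 10;            % blocks in the file
runs = 10;         % initial sorted runs, one block each
dm = 3;            % merge fan-in (nb - 1)
io = 2*b;          % initial sorting pass: read and write every block once
passes = 0;
while runs > 1
    runs = ceil(runs / dm);   % each pass merges groups of up to dm runs
    io = io + 2*b;            % every block is read once and written once per pass
    passes = passes + 1;
end
fprintf('merge passes = %d, total I/O = %d\n', passes, io)   % 3 passes, 80 I/Os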

Related

Determining number of cache misses for in-place reversal of an array, on a 2-way associative cache

Consider a 32-bit computer with a 2-way set-associative cache. Cache blocks are 8 words each and the cache has 512 sets.
Consider the following code block
int A[N];
for (int i = 0; i < N/2; i++) {
    int tmp = A[i];
    A[i] = A[N-i-1];
    A[N-i-1] = tmp;
}
Assume N is a multiple of 8. Determine a formula for the number of cache misses, with variable N.
My research :
Each block in memory will have 8 words (1 word = 32 bits = 4 bytes).
A block number b will map to set b mod 512.
There are 512 sets, where each set has two lines in the cache and each line holds one block.
So things will be transferred block wise from memory to cache.
Now as per my understanding memory will be organized in blocks of 8 words.
Suppose N = 8 * K
Since ints are 32 bits, it will be laid out like:
A[0], A[1], A[2],A[3],..A[7]
A[8],...................A[15]
.
A[8*(K-1)],...............A[N-1]
So this is how the array will be laid out in memory, with K blocks.
And when there is a cache miss a complete line will be transferred into cache.
In the first iteration the code will access A[0] and A[N-1].
So Line 0 and Line K-1 will be put into cache.
In the second iteration it will access A[1] and A[N-2], and both of them are already in the cache.
Following this logic, there will be two cache misses for i = 0.
None for i = 1,2,3,4,5,6,7.
Then for i = 8, the code will access A[8] and A[N-9]: again two cache misses.
For i = N/2-5 it will access A[N/2-5] and A[N/2+4]; these two will be in different blocks, so again two cache misses.
Not sure if I am proceeding in the right direction. I need to come up with the formula of number of cache misses in terms of N.
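Since I'm also not sure of the closed form, one way to sanity-check a candidate formula is to simulate the cache directly. The sketch below is my own MATLAB code (not from the assignment): it models a 2-way set-associative cache with 8-word blocks, 512 sets and LRU replacement, replays the element accesses the loop makes, and counts misses. It assumes the array is block-aligned and starts uncached; the writes are not modelled separately because they touch the same blocks that were just read.
N = 4096;                          % any multiple of 8
numSets = 512; wordsPerBlock = 8;
sets = -ones(numSets, 2);          % block number held by each of the 2 ways (-1 = empty)
lru  = zeros(numSets, 2);          % last-use timestamps for LRU replacement
misses = 0; t = 0;
for i = 0:N/2-1
    for idx = [i, N-i-1]                   % the two array elements touched this iteration
        blk = floor(idx / wordsPerBlock);  % memory block holding A[idx]
        s   = mod(blk, numSets) + 1;       % set index (1-based for MATLAB)
        t = t + 1;
        w = find(sets(s,:) == blk, 1);
        if isempty(w)                      % miss: load the block into the LRU way
            misses = misses + 1;
            [~, w] = min(lru(s,:));
            sets(s, w) = blk;
        end
        lru(s, w) = t;
    end
end
fprintf('N = %d: %d misses\n', N, misses)  % two misses per group of 8 iterations here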

How to avoid nested for loops in this particular case?

I have written a BFS (Breadth First Search) path planning algorithm. It works perfectly fine with the small grids (e.g. 15 by 15) but it is a disaster when it comes to larger grids (e.g. 150 by 250). Using the tic and toc commands, I found where it takes the most time (the least efficient bit) and the issue is to do with a nested for loop. I know use of nested for loops is considered bad and I need help with replacing it with something else (by avoiding the use of loops if possible).
for j = 1:length(F)
    for k = 1:length(Closed)
        if (F(j) == Closed(k))
            F(j) = 0;
        end
    end
end
Closed = [Closed current];
The purpose of this section of the code is to zero out any element of F that is already in Closed (for example the parent node, where we came from), so that repetition can be ignored.
F is a 1 by n vector storing neighbor nodes that can be traveled from the current node.
Note: I am using 8-connected space, hence n is always less than 8.
Closed is initiated as an empty vector and it is used to store the list of visited nodes (by concatenating horizontally).
current is a number between 1 and 37901, representing the current node.
I know there is another question about nested loops but my question is different. Thanks for your help!
The reason that this bit of code is slow is not that you have nested loops, but that you implemented an O(n^2) algorithm.
You should use a logical array to indicate nodes that are "closed". This will make the lookup much more efficient.
Consider you have N = 37901 nodes. Initialize your Closed array thus:
Closed = false(1,N);
Now to check if node F(j) is closed, you can simply do Closed(F(j)), which will be true if it's closed.
Now your loop can be replaced with
F(Closed(F)) = [];
Closed(current) = true;
because Closed(F) is an array of the same size as F that is true for the closed nodes. You can use this array to index into F, and remove all the closed nodes. I'm deleting those nodes, rather than assigning 0 to them, so that F can always be used to index into Closed. If we write a 0 into F, it can no longer be used as an index array. If you need to not change the shape of F, you'd have to do some additional tests before indexing.
Note that Closed = [Closed current] is also a lot slower than Closed(current) = true, because it creates a new array that the old Closed array is copied into.
Note that you can remove the nested loop in your code, as below, but it will not necessarily be any faster, as the algorithm is still O(n^2) (you're comparing each of the elements in F to each of the elements of Closed).
for j = 1:length(F)
    if any(F(j) == Closed)
        F(j) = 0;
    end
end
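To see the logical-array version in action, here is a tiny self-contained example with made-up node numbers (the values are hypothetical; only the pattern matters):
N = 37901;
Closed = false(1, N);           % logical "visited" flags for all nodes
Closed([5 12 99]) = true;       % pretend these nodes were closed earlier
current = 42;
F = [5 7 12 100];               % neighbour nodes of the current node
F(Closed(F)) = [];              % remove closed nodes -> F is now [7 100]
Closed(current) = true;         % mark the current node as closed
disp(F)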

Why is more data being added to my array than should be?

I am writing some code for data processing. The requirement is to sort the data, which comes in lists called u_B (5000 values of speed data) and P_B (5000 pieces of the related power data) into "bins" by speed, so that it is possible to calculate the mean speed and power within each bin. The code below is just trying to get the "bin" for the range of speeds 24-25m/s. What I expect to happen is that the code cycles through the u_B list, checks if each speed is within the required range, and if it is, puts it in the "bin", along with the corresponding power value. I have altered it to output the speeds it considers to be in the right range, and they seem to be all as I expect them to be, but when the bin is outputted right at the end it contains not only the data within the right range, but also a whole load of other data that does not fit within the speed range. I cannot work out why this other data is being added to the bin. If anyone can spot what I am missing I would be grateful.
i = 25;
inc = 1;
for n = 1:5000
    if (u_B(n) >= (i-1)) && (u_B(n) < (i+1))
        disp(u_B(n))
        bin(inc,1) = u_B(n);
        disp(bin(inc,1))
        bin(inc,2) = P_B(n);
        inc = inc + 1
    end
end
disp(bin)
This shows the first set of outputs from within the if-statement: the 24.7s are the speed u_B(n) and the value that has been put into the bin, and they are the same as expected; the 0 for power and 2 for inc are both fine. The list from this goes on and only contains speed values in the right range.
screenshot of code and output
This shows the output of what is in the bin: the first 10 values are the ones I want to be in there, and all the rest have lower speeds and therefore shouldn't be in the bin.
screenshot of code and output
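One possible cause of symptoms like this is a bin variable left over from an earlier run: the loop only overwrites rows 1 to inc-1 and never clears anything beyond them, so disp(bin) would also show stale rows. For comparison, here is a vectorized sketch (not the original code; it assumes u_B and P_B are column vectors of equal length) that builds the bin in one assignment using logical indexing, so no leftover rows can survive:
i = 25;
mask = (u_B >= (i-1)) & (u_B < (i+1));   % same condition as in the loop above
bin = [u_B(mask), P_B(mask)];            % matching speeds and powers only
disp(bin)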

MATLAB spending an incredible amount of time writing a relatively small matrix

I have a small MATLAB script (included below) for handling data read from a CSV file with two columns and hundreds of thousands of rows. Each entry is a natural number, with zeros only occurring in the second column. This code is taking a truly incredible amount of time (hours) to run what should be achievable in a few seconds at most. The profiler identifies that approximately 100% of the run time is spent writing a matrix of zeros, whose size varies depending on input but in all my usage is smaller than 1000x1000.
The code is as follows
function [data] = DataHandler(D)
n = size(D,1);
s = max(D,1);
data = zeros(s,s);
for i = 1:n
    data(D(i,1),D(i,2)+1) = data(D(i,1),D(i,2)+1) + 1;
end
It's the data = zeros(s,s); line that takes around 100% of the runtime. I can make the code run quickly by just changing out the s's in this line for 1000, which is a sufficient upper bound to ensure it won't run into errors for any of the data I'm looking at.
Obviously there're better ways to do this, but being that I just bashed the code together to quickly format some data I wasn't too concerned. As I said, I fixed it by just replacing s with 1000 for my purposes, but I'm perplexed as to why writing that matrix would bog MATLAB down for several hours. New code runs instantaneously.
I'd be very interested if anyone has seen this kind of behaviour before, or knows why this would be happening. It's a little disconcerting, and it would be good to be able to be confident that I can initialize matrices freely without killing MATLAB.
Your call to zeros is incorrect. Looking at your code, D looks like an n x 2 array. However, your call of s = max(D,1) would actually generate another n x 2 array. Consulting the documentation for max, this is what happens when you call max the way you did:
C = max(A,B) returns an array the same size as A and B with the largest elements taken from A or B. Either the dimensions of A and B are the same, or one can be a scalar.
Therefore, because you used max(D,1), you are essentially comparing every value in D with the value 1, so what you're actually getting is just a copy of D in the end. Using this as input into zeros has rather undefined behaviour. What will actually happen is that for each row of s, it will allocate a temporary zeros matrix of that size and toss the temporary result; only the dimensions given by the last row of s end up being recorded. Because you have a very large matrix D, this is probably why the profiler hangs here at 100% utilization. Each parameter to zeros must be scalar, yet your call to produce s would produce a matrix.
What I believe you intended should have been:
s = max(D(:));
This finds the overall maximum of the matrix D by unrolling D into a single vector and finding the overall maximum. If you do this, your code should run faster.
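A quick illustration of the difference, with a small made-up D:
D = [3 1; 5 0; 2 4];
max(D, 1)      % elementwise max against the scalar 1 -> a 3x2 matrix, same shape as D
max(D(:))      % maximum over all elements -> the scalar 5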
As a side note, this post may interest you:
Faster way to initialize arrays via empty matrix multiplication? (Matlab)
It was shown in this post that doing zeros(n,n) is in fact slow and there are several neat tricks to initializing an array of zeros. One way is to accomplish this by empty matrix multiplication:
data = zeros(n,0)*zeros(0,n);
One of my personal favourites is that if you assume that data was not declared / initialized, you can do:
data(n,n) = 0;
If I can also comment, that for loop is quite inefficient. What you are doing is calculating a 2D histogram / accumulation of data. You can replace that for loop with a more efficient accumarray call. This also saves you from allocating an array of zeros yourself; accumarray will do that under the hood for you.
As such, your code would basically become this:
function [data] = DataHandler(D)
data = accumarray([D(:,1) D(:,2)+1], 1);
accumarray in this case takes all pairs of row and column coordinates, stored in D(i,1) and D(i,2)+1 for i = 1, 2, ..., size(D,1), and places all pairs that share the same row and column coordinates into the same 2D bin. It then adds up all of the occurrences, so the output at each 2D bin is the total tally of how many entries mapped to that row and column coordinate.
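As a small worked example of what that call builds (made-up data):
D = [1 0; 1 0; 2 3; 3 1];                 % columns: row coordinate, raw value
data = accumarray([D(:,1) D(:,2)+1], 1)   % counts occurrences of each (row, value+1) pair
% data(1,1) = 2 -> the pair (1, 0) appears twice in D
% data(2,4) = 1 -> the pair (2, 3) appears once
% data(3,2) = 1 -> the pair (3, 1) appears once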

recording 'bursts' of samples at 300 samples per sec

I am recording voltage changes over a small circuit; this records mouse feeding. When the mouse is eating, the circuit voltage changes. I convert that into ones and zeroes, and all is well.
BUT: I want to calculate the number and duration of 'bursts' of feeding, that is, instances of circuit closing that occur within 250 ms (75 samples) of one another. If the gap between closings is larger than 250 ms, I want to count it as a new 'burst'.
I guess I am looking for help in asking MATLAB to compare the sample number of each 1 in the digital file with the sample number of the next 1 down. If the difference is more than 75, call the first 1 the end of one bout and the second one the start of another bout, classifying the difference as a gap; but if it is NOT, keep the sample number of the first 1 and compare it against the next, and the next, and the next, until there is a 75-sample difference.
I can compare each 1 to the next 1 down:
n = 1; m = 2;
for i = 1:length(bouts4)-1
    if bouts4(i+1) - bouts4(i) >= 75  % 250 ms gap at a sample rate of 300
        boutend4(n) = bouts4(i);
        boutstart4(m) = bouts4(i+1);
        m = m + 1;
        n = n + 1;
    end
end
I don't really want to iterate through i for both variables though...
any ideas??
-DB
You can try the following code
time_diff = diff(bouts4);
new_feeding = time_diff > 75;
boutend4 = bouts4(new_feeding);
boutstart4 = [0; bouts4(find(new_feeding) + 1)];
That's actually not too bad. We can make this completely vectorized. First, let's start with two signals:
A version of your voltages untouched
A version of your voltages that is shifted in time by 1 step (i.e. it starts at time index = 2).
Now the basic algorithm is really:
Go through each element and see if the difference is above a threshold (in your case 75).
Enumerate the locations of each one in separate arrays
Now onto the code!
%// Make those signals
bout4a = bouts4(1:end-1);
bout4b = bouts4(2:end);
%// Ensure column vectors - you'll see why soon
bout4a = bout4a(:);
bout4b = bout4b(:);
%// Step #1
loc = find(bout4b - bout4a >= 75);
%// Step #2
boutend4 = [bouts4(loc); 0];
boutstart4 = [0; bouts4(loc + 1)];
Aside:
Thanks to tail.b.lo, you can also use diff. It basically performs the same difference operation as the shifted copies I made above; diff works the same way. However, I decided not to use it so you can see exactly how the code you wrote translates over into a vectorized form. Only way to learn, right?
Back to it!
Let's step through this slowly. The first two lines of code make those signals I was talking about: an original one (up to length(bouts4) - 1) and another one that is the same length but shifted over by one time index. Next, we use find to find those time slots where the difference was >= 75. After that, we use these locations to index into the bouts4 array. The ending array accesses the original locations, while the starting array accesses the same locations but moved over by one time index.
The reason why we need to make these two signals column vectors is the way I am appending information to the starting vector. I am not sure whether your data comes in rows or columns, so to make this completely independent of orientation, I'm going to make sure that your data is in columns. This is because when I append a 0, for a row vector I would have to use a space to move to the next column, while for a column vector I have to use a semicolon to move to the next row. To completely avoid checking whether it's a row or column vector, I'm going to make sure that it's a column vector no matter what.
Looking at your code, m = 2 initially. This means that when you start writing into this array, the first location is left as 0. As such, I've artificially placed a 0 at the beginning of this array and followed it with the rest of the values.
Hope this helps!
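As a quick sanity check of the diff-based idea, here is a made-up example (the sample numbers are invented, and I'm using the >= 75 threshold from the question's own code):
bouts4 = [10; 20; 30; 200; 210; 400; 405];      % sample numbers of circuit closings
gaps = diff(bouts4);                            % [10; 10; 170; 10; 190; 5]
big  = gaps >= 75;                              % gaps large enough to separate bursts
boutend4   = bouts4(big)                        % -> [30; 210]
boutstart4 = [bouts4(1); bouts4(find(big)+1)]   % -> [10; 200; 400]
% (the end of the last burst, sample 405, still has to be appended by hand)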