Random pivot selection for quicksort not working

I am trying to choose a random pivot index for quicksort, but for some reason the array is not being sorted. In fact, the algorithm sometimes returns a different array entirely (e.g. for input [2,1,4], the output is [1,1,4]). Any help would be much appreciated. The algorithm works if, instead of choosing a random index, I always choose the first element of the array as the pivot.
from random import randint

def quicksort(array):
    if len(array) < 2:
        return array
    else:
        random_pivot_index = randint(0, len(array) - 1)
        pivot = array[random_pivot_index]
        less = [i for i in array[1:] if i <= pivot]
        greater = [i for i in array[1:] if i > pivot]
        return quicksort(less) + [pivot] + quicksort(greater)

less = [i for i in array[1:] if i <= pivot]
You're including elements equal to the pivot value in less here.
But here, you also include the pivot value explicitly in the result:
return quicksort(less) + [pivot] + quicksort(greater)
Instead try it with just:
return quicksort(less) + quicksort(greater)
Incidentally, though this does divide-and-conquer in the same way QuickSort does, it's not really an implementation of that algorithm: actual QuickSort sorts the elements in place, whereas your version suffers the run-time overhead of allocating and concatenating the auxiliary lists.
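If it helps, here is a minimal corrected sketch in the same list-based style. Rather than slicing off the first element, it removes the chosen pivot from the list before partitioning, so the pivot is added back exactly once; this is one way to fix it and goes slightly beyond the suggestion above (randint comes from the random module):
from random import randint

def quicksort(array):
    if len(array) < 2:
        return array
    pivot_index = randint(0, len(array) - 1)
    pivot = array[pivot_index]
    # Partition everything except the one element chosen as pivot,
    # then put the pivot back exactly once between the two halves.
    rest = array[:pivot_index] + array[pivot_index + 1:]
    less = [i for i in rest if i <= pivot]
    greater = [i for i in rest if i > pivot]
    return quicksort(less) + [pivot] + quicksort(greater)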

Related

Mean of values before and after a specific element

I have a 1 x 400 array in which all element values should be above 1500. However, some elements have values < 50; these are incorrect measurements, and I would like to replace each of them with the mean of the elements before and after the wrongly measured data point in the main array.
For instance, element number 17 is below 50, so I want to take the mean of elements 16 and 18 and replace element 17 with that new mean.
Can someone help me, please? Many thanks in advance.
No language is specified in the question, but in Python you could use a list comprehension:
# array with 400 values, some of which are incorrect
arr = [...]
arr = [arr[i] if arr[i] >= 50 else (arr[i-1]+arr[i+1])/2 for i in range(len(arr))]
That is, if arr[i] is less than 50, it will be replaced by the average of the elements before and after it. There are two issues with this approach.
If i is the first or last index, one of the two neighbours does not exist, so no mean can be computed. This can be fixed by simply using the value of the available neighbour, as shown below.
If two values in a row are very low, the leftmost one will use the rightmost one to calculate its value, which will result in a very low value. This may not occur for you in practice, but it is an inherent result of the way you want to recalculate values, and you might want to keep it in mind.
Improved version, keeping in mind the edge cases:
# don't alter the first and last item, even if they're low
arr = [arr[i] if arr[i] >= 50 or i == 0 or i+1 == len(arr) else (arr[i-1]+arr[i+1])/2 for i in range(len(arr))]
# replace the first and last element if needed
if arr[0] < 50:
    arr[0] = arr[1]
if arr[len(arr)-1] < 50:
    arr[len(arr)-1] = arr[len(arr)-2]
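As a quick illustration with made-up numbers (not the asker's data), running the improved version on a short array gives:
# hypothetical example: 10 is a bad measurement between 1600 and 1700
arr = [1600, 10, 1700, 1800]
arr = [arr[i] if arr[i] >= 50 or i == 0 or i+1 == len(arr) else (arr[i-1]+arr[i+1])/2 for i in range(len(arr))]
if arr[0] < 50:
    arr[0] = arr[1]
if arr[len(arr)-1] < 50:
    arr[len(arr)-1] = arr[len(arr)-2]
print(arr)  # [1600, 1650.0, 1700, 1800]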
I hope this answer is useful to you, even if you intend to use a language or framework other than Python.

Is a bloom filter with only one input hash still a bloom filter?

If I implement a bloom filter where only one hashing algorithm is used (e.g. murmur), is this still considered a bloom filter?
For example, if a hashes to 5, then bit 5 of the filter will be set. If b hashes to 1, then bit 1 of the filter will be set, and so on...
For something to be considered a bloom filter, do at least two bits in the filter have to be set? If only one bit is set, is it called something else?
It is still a Bloom filter: one with k=1. Depending on the bits per element, it is probably not the most space-saving one. But there are various reasons why one might pick a k that is not round(bitsPerKey * log(2)); the main ones are:
To be able to better compress: here a Bloom filter with k=1 is best. See also the paper "Compressed Bloom Filters" from Michael Mitzenmacher.
To speed up lookup and update: using a lower k is faster.
By the way, you can still pick k to be the most space-saving one, even if you only use one "application hash function" (like Murmur hash with 64 bits). You just pick the "Bloom hash functions" to be a function of this "application hash function" (64-bit Murmur hash), like so (assuming int is 32 bit and long is 64 bit):
long m = murmur(x);
h(x, i) = (int) (m >> 32) + i * (int) m
And that's actually easier and faster than calculating multiple "application hash functions". In a loop, that looks like this:
long m = murmur(x);
int hash = (int) (m >> 32);
int add = (int) m;
for (int i = 0; i < k; i++) {
    // ... test / set the bit depending on "hash" ...
    hash += add;
}
Many Bloom filter libraries do it like this, for example the Bloom filter implementation in Guava.
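For illustration, here is a minimal Python sketch of the same double-hashing idea; hash64_value stands in for an application hash such as 64-bit Murmur, and the helper name bloom_indices is made up for this example:
def bloom_indices(hash64_value, k, m):
    # Split the 64-bit hash into a 32-bit starting hash and a 32-bit increment.
    h = (hash64_value >> 32) & 0xFFFFFFFF
    add = hash64_value & 0xFFFFFFFF
    # Derive k bit positions in a filter of m bits from the single hash value.
    positions = []
    for _ in range(k):
        positions.append(h % m)
        h = (h + add) & 0xFFFFFFFF  # emulate 32-bit wrap-around
    return positions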

Merge Sort algorithm efficiency

I am currently taking an online algorithms course in which the teacher doesn't give code for the algorithms, but rather rough pseudocode. So before taking to the internet for the answer, I decided to take a stab at it myself.
In this case, the algorithm we were looking at is merge sort. After being given the pseudocode, we also dove into analyzing the algorithm's running time against the number of items n in an array. After a quick analysis, the teacher arrived at 6n·log2(n) + 6n as an approximate running time for the algorithm.
The pseudocode given was for the merge portion of the algorithm only and reads as follows:
C = output [length = n]
A = 1st sorted array [n/2]
B = 2nd sorted array [n/2]
i = 1
j = 1
for k = 1 to n
    if A(i) < B(j)
        C(k) = A(i)
        i++
    else [B(j) < A(i)]
        C(k) = B(j)
        j++
    end
end
He basically did a breakdown of the above, arriving at 4n + 2 (2 for the declarations of i and j, and 4 for the operations performed per pass: the for, the if, the array-position assignment, and the increment). He simplified this, I believe for the sake of the class, to 6n.
This all makes sense to me; my question arises from the implementation I am writing, how it affects the algorithm, and some of the tradeoffs/inefficiencies it may add.
Below is my code in Swift, using a playground:
func mergeSort<T: Comparable>(_ array: [T]) -> [T] {
    guard array.count > 1 else { return array }
    let lowerHalfArray = array[0..<array.count / 2]
    let upperHalfArray = array[array.count / 2..<array.count]
    let lowerSortedArray = mergeSort(Array(lowerHalfArray))
    let upperSortedArray = mergeSort(Array(upperHalfArray))
    return merge(lhs: lowerSortedArray, rhs: upperSortedArray)
}

func merge<T: Comparable>(lhs: [T], rhs: [T]) -> [T] {
    guard lhs.count > 0 else { return rhs }
    guard rhs.count > 0 else { return lhs }
    var i = 0
    var j = 0
    var mergedArray = [T]()
    let loopCount = (lhs.count + rhs.count)
    for _ in 0..<loopCount {
        // Take from lhs when rhs is exhausted, or when lhs still has
        // elements and its current element is the smaller one.
        if j == rhs.count || (i < lhs.count && lhs[i] < rhs[j]) {
            mergedArray.append(lhs[i])
            i += 1
        } else {
            mergedArray.append(rhs[j])
            j += 1
        }
    }
    return mergedArray
}
let values = [5,4,8,7,6,3,1,2,9]
let sortedValues = mergeSort(values)
My questions for this are as follows:
Do the guard statements at the start of the merge<T:Comparable> function actually make it less efficient? Considering we are always halving the array, the only time they will hold true is for the base case and when there is an odd number of items in the array.
To me this seems like it would add more processing for minimal return, since the only time it triggers is when we have halved the array to the point where one side has no items.
Concerning the if statement in my merge: since it checks more than one condition, does this affect the overall efficiency of the algorithm I have written? If so, the effect seems to me like it would vary based on where it breaks out of the condition chain (e.g. at the first condition or the second).
Is this something that is considered heavily when analyzing algorithms, and if so, how do you account for that variance?
Any other analysis/tips you can give me on what I have written would be greatly appreciated.
You will very soon learn about Big-O and Big-Theta where you don't care about exact runtimes (believe me when I say very soon, like in a lecture or two). Until then, this is what you need to know:
Yes, the guards take some time, but it is the same amount of time in every call. So if each call takes X amount of time without the guard and you do n function calls, then it takes X*n time in total. Now add in the guards, which take Y amount of time in each call. You now need (X+Y)*n time in total. This is a constant factor, and when n becomes very large the (X+Y) factor becomes negligible compared to the n factor. That is, if you can reduce a function from X*n to (X+Y)*log(n), then it is worthwhile to add the Y amount of work, because you do fewer iterations in total.
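To make that concrete with made-up numbers: if X = 4, Y = 2 and n = 1,000,000, then X*n = 4,000,000, while (X+Y)*log2(n) is roughly 6*20 = 120. Paying the extra Y per call is clearly worth it if it cuts the number of iterations from n down to log n.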
The same reasoning applies to your second question. Yes, checking "if X or Y" takes more time than checking "if X" but it is a constant factor. The extra time does not vary with the size of n.
In some languages you only check the second condition if the first fails. How do we account for that? The simplest solution is to realize that the upper bound of the number of comparisons will be 3, while the number of iterations can be potentially millions with a large n. But 3 is a constant number, so it adds at most a constant amount of work per iteration. You can go into nitty-gritty details and try to reason about the distribution of how often the first, second and third condition will be true or false, but often you don't really want to go down that road. Pretend that you always do all the comparisons.
So yes, adding the guards might be bad for your runtime if you do the same number of iterations as before. But sometimes adding extra work in each iteration can decrease the number of iterations needed.

Most efficient way to store dictionaries in Matlab

I have a set of IDs, each associated with a cost, which is just a double value. IDs are integers and unique. Two IDs may have the same cost. I stored them as:
a=containers.Map('KeyType','uint32','ValueType','double');
a(1)=7.3
a(2)=8.4
a(3)=7.3
Now I want to find the minimum cost.
b=[];
c=values(a);
b=[b,c{:}];
cost_min=min(b);
Now I want to find all IDs associated with the minimum cost 7.3, i.e. 1 and 3. I can collect all the keys into an array and then do a for loop over it. Is there a better way to do this entire thing in MATLAB so that for loops are not required?
A sparse matrix can work as a hashmap; just do this:
a = sparse(1:3,1,[7.3 8.4 7.3])
find(a == min(nonzeros(a)))
There are methods that can be used on maps for this kind of operation:
http://se.mathworks.com/help/matlab/ref/containers.map-class.html
Finding the minimum value and the keys that map to it can be done something like this:
a=containers.Map('KeyType','uint32','ValueType','double');
a(1)=7.3;
a(3)=8.4;
a(4)=7.3;
minval = inf;
minkeys = -1;
for k = keys(a)
    val = a.values(k);
    val = val{1};
    if (val < minval(1))
        minkeys = k;
        minval = val;
    elseif (val == minval(1))
        minkeys = [minkeys,k];
    end
end
disp(minval);
disp(minkeys);
This is not efficient though, and value search is clumsy for maps; it is not what they are intended for. Maps are meant for efficient key lookup. If you are going to do a lot of key lookups and that is what takes the time, then use a map. If you need to do a lot of value searches, I would recommend using a matrix (or two arrays) instead:
idx = [1;3;4];
val = [7.3,8.3,7.3];
minval = min(val);
minidx = idx(val==minval);
disp(minval);
disp(minidx);
There is also another answer with an example showing how a sparse matrix can be used as a hashmap: let the index be the key. This takes about three times as much memory per non-zero element as an ordinary array, but a map uses more memory than an array as well.

Find value in vector "p" that corresponds to maximum value in vector "r = f(p)"

It's as simple as the title says. I have an n-by-1 vector p. I'm interested in the maximum value of r = p/foo - floor(p/foo), with foo being a scalar, so I just call:
max_value = max(p/foo-floor(p/foo))
How can I get the value of p that produced max_value?
I thought about calling:
[max_value, max_index] = max(p/foo-floor(p/foo))
but soon I realised that max_index seemed pretty useless to me. I'm sorry for asking this; real beginner here.
Having broken the issue down into pieces, I realized there's no unique correspondence between values in p and values in my related vector p/foo-floor(p/foo), so there's a logical issue rather than a language one.
However, given my input data, I know that the solution is unique. How can I fix this?
I ended up doing:
result = p(p/foo-floor(p/foo) == max(p/foo-floor(p/foo)))
Looks terrible, so if you know any other way...
Once you have the index, use it:
result = p(max_index)
You can create a new vector with your, let's say, "transformed" values:
p2 = (p/foo-floor(p/foo))
and then just use find to find the max values on p2:
max_index = find(p2 == max(p2))
That will return the index or indices of p2 holding the max value of that operation; finally, just look up the original value in p:
p(max_index)
In one line, this is:
p(find((p/foo-floor(p/foo) == max((p/foo-floor(p/foo))))))
which is basically the same thing you did in the end :)