Hash function to speed up union of subsets algorithm analysis

Is it possible to use a hash to speed up the union step of an algorithm that takes O(n) time? I'm trying to see if it's doable, or even practical. How would I go about that?
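For context on why hashing helps here: a hash set gives expected O(1) membership tests, so a union over n total elements runs in expected O(n) with no sorting step. A minimal sketch of the idea in MATLAB, using containers.Map as a stand-in hash set (the arrays A and B are made-up examples):

% Sketch: union of two integer arrays via a hash set (expected O(n)),
% rather than the sort-based O(n log n) approach of the built-in union().
A = [3 1 4 1 5];
B = [9 2 6 5 3];
seen = containers.Map('KeyType', 'double', 'ValueType', 'logical');
out = [];
for x = [A B]
    if ~isKey(seen, x)          % expected O(1) membership test
        seen(x) = true;
        out(end+1) = x;         %#ok<AGROW> collect each element once
    end
end
disp(out)                       % 3 1 4 5 9 2 6 (insertion order, not sorted)

Whether this beats the built-in union() in practice depends on the data; the hash approach mainly pays off when sorting dominates.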

Related

How can a function that computes Hamming distance be accelerated for larger data in PostgreSQL?

I have a PostgreSQL database with more than 10,000 entries, and each entry has a bit array of size 10,000. Is there any method to accelerate the Hamming distance calculation over the bit arrays for the whole table? Thanks.
I tried different data types (bytea, text, and numeric) for storing the bit array, and to calculate the Hamming distance I tried XOR operations, text comparison, and numeric addition, respectively, for each type. But I could not optimize the function enough to make it fast: it currently takes almost 2 seconds per operation, and the target is 200 milliseconds.
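For reference, the standard fast approach is XOR followed by a bit count (popcount) on machine words rather than per-bit operations; if I remember correctly, PostgreSQL 14 added a built-in bit_count() that does the counting natively. A minimal MATLAB sketch of the same idea, with the bit arrays packed into uint64 words (the demo data is made up; real rows would fill all 64 bits per word):

% Sketch: Hamming distance via XOR + popcount on packed uint64 words.
% ceil(10000/64) = 157 words cover a 10,000-bit array, so the distance
% becomes a few hundred word operations instead of 10,000 bit operations.
nbits  = 10000;
nwords = ceil(nbits / 64);
rng(0);                                        % reproducible demo data
a = uint64(randi([0 2^32-1], 1, nwords));      % demo words (32 random bits each)
b = uint64(randi([0 2^32-1], 1, nwords));

lut = uint64(sum(dec2bin(0:65535) - '0', 2));  % popcount table for 16-bit chunks

x = bitxor(a, b);                              % differing bits, word by word
d = uint64(0);
for shift = 0:16:48                            % four 16-bit chunks per 64-bit word
    chunk = bitand(bitshift(x, -shift), uint64(65535));
    d = d + sum(lut(double(chunk) + 1));       % look up popcount of each chunk
end
fprintf('Hamming distance: %d\n', d);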
There is no possibility of good performance for the Hamming distance, because it is a recursive process with high algorithmic complexity and a very high memory footprint.
https://www.cs.swarthmore.edu/~brody/papers/random14-hamming-distance.pdf
It is not practical to use it on big datasets such as those in an RDBMS.
Some other comparison techniques exist that have lower complexity, without recursion and with a minimal footprint... They are not as accurate as the Hamming distance, but they can do a good job, such as the one I wrote:
See "inférence basique" (basic inference).
You can combine the two... First use inférence basique to reduce the candidate set, then use the Hamming distance on the few results that remain...

MATLAB - How should I choose the FunctionTolerance of genetic algorithm optimisation?

I am optimising MRI experiment settings in order to get optimal precision of tissue property measurements using my data.
To optimise the settings (i.e. a vector of numbers) I am using the MATLAB genetic algorithm function (ga()). I want to compare the final results of different optimisation experiments that have different parameter upper bounds, but I do not know how to choose the FunctionTolerance.
My current implementation takes several days. I would like to increase FunctionTolerance so that it does not take as long, yet still allows me to make reliable comparisons of the final results of the two different optimisation experiments. In other words, I do not want to stop the optimisation too early: I want to stop it when it gets close to its best result, rather than waiting a long time for it to refine that result.
Is there a general rule for choosing FunctionTolerance or does it depend on what is being optimised?
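I cannot speak to a general rule, but for the mechanics: FunctionTolerance is set through optimoptions, and ga() stops when the average relative change in the best fitness value over MaxStallGenerations generations falls below it, so the two options work together. A minimal sketch (toy fitness function and bounds made up for illustration; requires the Global Optimization Toolbox):

% Sketch: loosening FunctionTolerance so ga() stops once the best
% fitness has nearly plateaued, instead of refining for days.
fitfun = @(x) sum(x.^2);            % toy fitness function
nvars  = 5;
lb = -10 * ones(1, nvars);
ub =  10 * ones(1, nvars);
opts = optimoptions('ga', ...
    'FunctionTolerance', 1e-4, ...  % looser than the 1e-6 default, stops sooner
    'MaxStallGenerations', 50, ...  % window over which the change is averaged
    'Display', 'iter');
[x, fval] = ga(fitfun, nvars, [], [], [], [], lb, ub, [], opts);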

When is it appropriate to use a simple modulus as a hashing function?

I need to create a 16 bit hash from a 32 bit number, and I'm trying to determine if a simple modulus 2^16 is appropriate.
The hash will be used in a 2^16 entry hash table for fast lookup of the 32 bit number.
My understanding is that if the data space has a fairly even distribution, a simple mod 2^16 is fine - it shouldn't result in too many collisions.
In this case, my 32 bit number is the result of a modified adler32 checksum, using 2^16 as M.
So, in a general sense, is my understanding correct, that it's fine to use a simple mod n (where n is hashtable size) as a hashing function if I have an even data distribution?
And specifically, will adler32 give a random enough distribution for this?
Yes, if your 32-bit numbers are uniformly distributed over all possible values, then a modulo n of those will also be uniformly distributed over the n possible values.
Whether the results of your modified checksum algorithm are uniformly distributed is an entirely different question. That will depend on whether the data you are applying the algorithm to has enough data to roll over the sums several times. If you are applying the algorithm to short strings that don't roll over the sums, then the result will not be uniformly distributed.
If you want a hash function, then you should use a hash function. Neither Adler-32 nor any CRC is a good hash function. There are many very fast and effective hash functions available in the public domain. You can look at CityHash.
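One way to sanity-check the "even distribution" assumption empirically is to hash a sample and look at the slot occupancy. A quick MATLAB sketch (uniform random 32-bit values stand in for the checksum outputs; substitute your actual Adler-32 results to test the real distribution):

% Sketch: how uniformly does mod 2^16 spread 32-bit values over a
% 2^16-entry table? Uniform inputs should give a statistic near 1.
N = 1e6;
keys32 = randi([0 2^32-1], N, 1);          % stand-in 32-bit values
slots  = mod(keys32, 2^16);                % the proposed 16-bit hash
counts = histcounts(slots, 0:2^16);        % occupancy of each table slot
expected = N / 2^16;                       % about 15.3 keys per slot
chi2 = sum((counts - expected).^2 ./ expected) / numel(counts);
fprintf('normalized chi-square: %.3f (close to 1 means uniform)\n', chi2);

A statistic much larger than 1 would indicate clustering, i.e. too many collisions for a mod-based hash.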

Vectorizing nested for loops and pointwise multiplication in matlab

Long time reader, first time inquisitor. So I am currently hitting a serious bottleneck from the following code:
for kk = 1:KT
    parfor jj = 1:KT
        umodes(kk,jj,:,:,:) = repmat(squeeze(psi(kk,jj,:,:)), [1, 1, no_bands]) .* squeeze(umodes(kk,jj,:,:,:));
    end
end
In plain language, I need to tile the multi-dimensional array 'psi' across another dimension of length 'no_bands' and then perform pointwise multiplication with the matrix 'umodes'. The issue is that each of the arrays I am working with is large, on the order of 20 GB or more.
What is happening then, I suspect, is that my processors grind to a halt due to cache limits or because data is being paged. I am reasonably convinced there is no practical way to reduce the size of my arrays, so at this point I am trying to reduce computational overhead to a bare minimum.
If that is not possible, it might be time to think of using a proper programming language where I can enforce pass by reference to avoid unnecessary replication of arrays.
Often bsxfun uses less memory than repmat. So you can try:
for kk = 1:KT
    parfor jj = 1:KT
        umodes(kk,jj,:,:,:) = bsxfun(@times, squeeze(psi(kk,jj,:,:)), squeeze(umodes(kk,jj,:,:,:)));
    end
end
Or you can vectorize the two loops. Vectorizing is usually faster, although not necessarily more memory-efficient, so I'm not sure it helps in your case. In any case, bsxfun benefits from multithreading:
umodes = bsxfun(@times, psi, umodes);
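If you are on MATLAB R2016b or newer, implicit expansion makes the bsxfun call unnecessary; the same broadcast over the trailing no_bands dimension can be written directly:

% Equivalent on R2016b+: .* implicitly expands psi's missing trailing
% (no_bands) dimension to match umodes.
umodes = psi .* umodes;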

Parallelizing a for loop to run simultaneously on multiple GPU cores?

I understand that you can use a matlabpool and parfor to run for loop iterations in parallel, however, I want to try and take advantage of using the high number of cores in my GPU to run a larger number of simultaneous iterations. I was wondering if there is any built in functionality to do this?
To my understanding, the method in which MATLAB runs code on the GPU is through a GPUarray, but that does not seem to parallelize a loop, only certain functions inside the loop.
For the loop that I am running, each iteration can run independently, and the only variables that need to exist outside of the loop are the data to be processed (a 3-D array, where the first index is time, and each iteration operates on a different time) and a 2-D output array where each iteration stores the result for a particular time. Each time is independent.
Thanks
With a GPUArray, you can run elementwise operations in parallel by structuring your algorithm in terms of MATLAB's arrayfun. Effectively, this implicitly loops over each element of your arrays, and can apply the body of a MATLAB function to each element. The doc is: here.
There's a simple demo: here.
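As an illustration of that pattern, a minimal sketch (the element function and array sizes are made up for the example):

% Sketch: elementwise GPU parallelism via gpuArray + arrayfun.
% arrayfun applies f to every element of t and x in parallel on the
% GPU; the loop over elements is implicit.
t = gpuArray(linspace(0, 1, 1e6));       % example inputs on the GPU
x = gpuArray(rand(1, 1e6));

f = @(t, x) x .* sin(2*pi*t) + t.^2;     % toy per-element computation

y = arrayfun(f, t, x);                   % compiled and run elementwise on the GPU
result = gather(y);                      % copy the result back to host memory

Note that inside a GPU arrayfun the function sees scalar elements, so for your case you would fold the per-time computation into f and let the time dimension be the array dimension that arrayfun iterates over.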