How to create empty buckets in extendible hashing

I know how to do extendible hashing on paper, but I don't know how it's possible for empty buckets to be created.
What would cause empty buckets to be created in extendible hashing? Can you show a simple example?

Assume the hash function h(x) is h(x) = x and each bucket can hold two things in it.
I will also use the least significant bits of the hash code as an index into the hash directory, as opposed to the most significant bits.
Basically, to get an empty bucket, we want to force a doubling of the directory by inserting into a bucket that has no space, while arranging the keys so that the rehash after the doubling moves nothing into the newly created bucket.
So, let's start inserting stuff.
First, insert 0. This should go in the first bucket, since h(0) = 0 and 0 % 2 = 0.
Then, insert 4. This should also go in the first bucket, since h(4) = 4 and 4 % 2 = 0.
Now, inserting 8 fails since the bucket can only hold two things, so the directory must be doubled. The global depth therefore increases from 1 to 2. Other changes include a new third bucket and the fourth directory entry now pointing to the second bucket.
Unfortunately, since the rehashing step now uses h(x) % 4 and all of our numbers are (deliberately) multiples of 4, the first bucket remains too full and the new third bucket is left empty. This resolves itself with yet another doubling of the directory.
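If you want to check the trace mechanically, here is a minimal Python sketch of the same process (the ExtendibleHash/Bucket names and structure are made up for illustration, not taken from any particular textbook); inserting 0, 4, 8 ends at global depth 3, with the bucket created by the first split still empty.
# h(x) = x, bucket capacity 2, directory indexed by least-significant bits.
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = []

class ExtendibleHash:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.global_depth = 1
        self.directory = [Bucket(1), Bucket(1)]

    def _index(self, key):
        return key & ((1 << self.global_depth) - 1)   # least-significant bits

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.items) < self.capacity:
            bucket.items.append(key)
            return
        # Full bucket: double the directory if its local depth equals the global depth.
        if bucket.local_depth == self.global_depth:
            self.directory *= 2          # new half of the directory mirrors the old half
            self.global_depth += 1
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth)
        # Directory entries whose new distinguishing bit is 1 now point at the sibling.
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> (bucket.local_depth - 1)) & 1:
                self.directory[i] = sibling
        # Rehash the old bucket's contents, then retry the insert.
        items, bucket.items = bucket.items, []
        for k in items:
            self.directory[self._index(k)].items.append(k)
        self.insert(key)

table = ExtendibleHash()
for k in (0, 4, 8):
    table.insert(k)
print(table.global_depth)                           # 3: the directory doubled twice
print([b.items for b in dict.fromkeys(table.directory)])
# one bucket holds [0, 8], one holds [4], and the other two are empty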

Perfect hash function for integer sequence

Given a set of integers (a sequence) 1…999_999 (for example), I need to map each individual integer to another integer in the same set, 1:1, randomly (the distribution depends on a seed). The hash function must be scalable to large sets, so shuffling and storing all values in memory is not an option. Is there any good way of doing this?
Some examples:
// 1..3 seq
lowerBound = 1;
upperBound = 3;
seed = 1
h1 = makeHashFn(lowerBound, upperBound, seed)
h1(1) // 2
h1(2) // 3
h1(3) // 1
newSeed = 2
h2 = makeHashFn(lowerBound, upperBound, newSeed)
h2(1) // 3
h2(2) // 1
h2(3) // 2
It's not possible to do this without any kind of memory usage.
If you're happy for number collisions to happen, it is possible, but otherwise, you can't really have it be random and stateless.
What you can do though, is shuffle a list of all indices randomly.
That would be only 4 or 8 bytes per list element, which is fairly reasonable for most applications.
If you use a deterministic seeded RNG to shuffle the indices, the result will be the same every time. In that case, you would not need to store the shuffled indices; you could regenerate them and discard them as needed to fit your memory requirements.
There aren't any silver bullets; every solution to this problem will have significant tradeoffs. If you have a supermassive database with billions of entries, it's probably better to step back and redefine the problem in a more efficient way.
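For concreteness, here is a minimal Python sketch of the shuffle idea (make_hash_fn mirrors the makeHashFn in the question; it stores the shuffled list, so the memory cost is one integer per element of the range, as noted above):
import random

def make_hash_fn(lower_bound, upper_bound, seed):
    # Shuffle the whole range once with a seeded RNG: same seed, same permutation.
    values = list(range(lower_bound, upper_bound + 1))
    random.Random(seed).shuffle(values)
    return lambda x: values[x - lower_bound]

h1 = make_hash_fn(1, 3, seed=1)
print(h1(1), h1(2), h1(3))   # some fixed permutation of 1 2 3 for seed=1

h2 = make_hash_fn(1, 3, seed=2)
print(h2(1), h2(2), h2(3))   # usually a different permutation for seed=2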

how to "explain" the following hash function is bad

We have a hash table of size 16, using the double hashing method.
h1(k) = k mod 16
h2(k) = 2*(k mod 8)
I know that the h2 hash function is bad, probably because of the mod 8 and the multiplication by 2, but I don't know how to explain it. Is there an explanation along the lines of "h2 should use a prime modulus or it will cause ____ problem"?
It is bad because it increases the number of collisions.
The (mod 8) means that you are only ever using 8 pigeonholes in your 16-pigeonhole table.
Multiplying it by 2 just spreads those 8 pigeonholes out so that you don’t have to search too many slots past the hashed index to find an empty hole...
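To make the collision problem concrete, here is a small Python sketch. It assumes the standard double-hashing probe sequence slot_i = (h1(k) + i*h2(k)) mod 16, which the question does not spell out, so treat that as an assumption: because h2 only ever returns even values, every probe sequence is confined to at most half of the table, and any key that is a multiple of 8 gets a step of 0 and never moves past its first slot.
# Probe coverage under double hashing with the h1/h2 from the question.
# Assumed probe rule: slot_i = (h1(k) + i*h2(k)) mod 16.
TABLE_SIZE = 16
h1 = lambda k: k % 16
h2 = lambda k: 2 * (k % 8)

for k in (3, 8, 13):
    step = h2(k)
    probed = sorted({(h1(k) + i * step) % TABLE_SIZE for i in range(TABLE_SIZE)})
    print(f"k={k:2d}  step={step:2d}  slots probed: {probed}")

# k= 3: step  6 -> only the 8 odd slots are ever probed
# k= 8: step  0 -> the probe sequence never leaves slot 8
# k=13: step 10 -> again only 8 of the 16 slots are reachable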
You should always compute modulo the size of your table.
h(x) ::= x (mod N) // where N is the table size
The purpose of making the table size a prime number mostly has to do with how common powers of two are in computer science. If your data is random, then the size of the table doesn't matter, as long as it is big enough for your expected load factor. A 16-element table is very small; you shouldn't expect to store more than 6-12 random values in it without a high probability of collisions.
A very good linked thread is What is a good Hash Function?, which is totally worth a read just for the links to further reading alone.

MATLAB spending an incredible amount of time writing a relatively small matrix

I have a small MATLAB script (included below) for handling data read from a CSV file with two columns and hundreds of thousands of rows. Each entry is a natural number, with zeros only occurring in the second column. This code takes a truly incredible amount of time (hours) to run what should be achievable in at most a few seconds. The profiler identifies that approximately 100% of the run time is spent writing a matrix of zeros, whose size varies depending on the input but in all cases is smaller than 1000x1000.
The code is as follows
function [data] = DataHandler(D)
    n = size(D,1);
    s = max(D,1);
    data = zeros(s,s);
    for i = 1:n
        data(D(i,1),D(i,2)+1) = data(D(i,1),D(i,2)+1) + 1;
    end
It's the data = zeros(s,s); line that takes around 100% of the runtime. I can make the code run quickly by just changing out the s's in this line for 1000, which is a sufficient upper bound to ensure it won't run into errors for any of the data I'm looking at.
Obviously there are better ways to do this, but given that I just bashed the code together to quickly format some data, I wasn't too concerned. As I said, I fixed it by just replacing s with 1000 for my purposes, but I'm perplexed as to why writing that matrix would bog MATLAB down for several hours. The new code runs instantaneously.
I'd be very interested if anyone has seen this kind of behaviour before, or knows why this would be happening. It's a little disconcerting, and it would be good to be confident that I can initialize matrices freely without killing MATLAB.
Your call to zeros is incorrect. Looking at your code, D looks like an n x 2 array. However, your call of s = max(D,1) actually generates another n x 2 array. Consulting the documentation for max, this is what happens when you call max the way you used it:
C = max(A,B) returns an array the same size as A and B with the largest elements taken from A or B. Either the dimensions of A and B are the same, or one can be a scalar.
Therefore, because you used max(D,1), you are essentially comparing every value in D with the value 1, so what you actually get back is just a copy of D. Using this as input to zeros has rather undefined behaviour. What will actually happen is that for each row of s, it will allocate a temporary matrix of zeros of that size and toss the temporary result; only the dimensions given by the last row of s are kept. Because you have a very large matrix D, this is probably why the profiler hangs here at 100% utilization. Each parameter to zeros must be a scalar, yet your call to produce s produces a matrix.
What I believe you intended should have been:
s = max(D(:));
This finds the overall maximum of the matrix D by unrolling D into a single vector and finding the overall maximum. If you do this, your code should run faster.
As a side note, this post may interest you:
Faster way to initialize arrays via empty matrix multiplication? (Matlab)
It was shown in this post that doing zeros(n,n) is in fact slow and there are several neat tricks for initializing an array of zeros. One way is to accomplish this with empty matrix multiplication:
data = zeros(n,0)*zeros(0,n);
One of my personal favourites is that if you assume that data was not declared / initialized, you can do:
data(n,n) = 0;
If I can also comment, that for loop is quite inefficient. What you are doing is calculating a 2D histogram / accumulation of data. You can replace that for loop with a more efficient accumarray call. This also avoids allocating an array of zeros and accumarray will do that under the hood for you.
As such, your code would basically become this:
function [data] = DataHandler(D)
data = accumarray([D(:,1) D(:,2)+1], 1);
accumarray in this case takes all pairs of row and column coordinates, stored in D(i,1) and D(i,2) + 1 for i = 1, 2, ..., size(D,1), and places those that share the same row and column coordinates into the same 2D bin. It then adds up all of the occurrences, so the output at each 2D bin gives you the total tally of how many values mapped to that row and column coordinate.

How to evaluate a hash generating algorithm

What ways do you know to evaluate the efficiency of a hash function besides generating a large set of values and looking at their distribution?
By efficiency I mean that the keys generated by your hash function are distributed evenly. Is there a way to prove this without actually testing on real values?
A hash function is only "even" (uniformly distributed) in the context of the data being hashed.
Consider two data sets:
Set 1
1, 3, 6, 2, 7, 9, 5, 8, 4
Set 2
65355, 96424664, 86463624, 133, 643564, 24232, 88677, 865747, 2224
A good hashing function for one set (i.e. mod 10 for set 1) gives no collisions and could be seen as the perfect hash for that data set.
However, apply it to the second set and there are collisions everywhere.
Hash = (x * 37) mod 256
is much better for the second set, but may not suit the first set quite so well... especially when partitioning the hash into, e.g., a small number of buckets.
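As a quick illustration, here is a Python sketch that counts collisions for the two example hashes on the two sets above (the collision counter is just for this demo):
from collections import Counter

set1 = [1, 3, 6, 2, 7, 9, 5, 8, 4]
set2 = [65355, 96424664, 86463624, 133, 643564, 24232, 88677, 865747, 2224]

def collisions(data, h):
    # Number of items that land in an already-occupied bucket.
    counts = Counter(h(x) for x in data)
    return sum(c - 1 for c in counts.values() if c > 1)

for name, data in (("set 1", set1), ("set 2", set2)):
    print(name,
          "| mod 10:", collisions(data, lambda x: x % 10),
          "| (x*37) mod 256:", collisions(data, lambda x: (x * 37) % 256))

# set 1 | mod 10: 0 | (x*37) mod 256: 0
# set 2 | mod 10: 4 | (x*37) mod 256: 0
Note that 256 buckets is far roomier than 10, which is why the caveat above about partitioning the hash into a small number of buckets still matters.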
What you can do is evaluate a hash against random data that you "expect" your function to have to handle... But that is making assumptions...
Premature optimisation is looking for the perfect hash function before you have enough real data to base your assessment on.
You should have enough real data well before the cost of rehashing becomes prohibitive, leaving you time to change your hash function.
Update
Let's suppose we are looking for a hash function that generates an 8-bit hash of the input data. Let's further suppose that the hash function is supposed to take byte streams of varying length.
If we assume that the bytes in the byte streams are uniformly distributed, we can make some assessment of different hash functions.
int hash = 0;
for (byte b in datastream) hash = hash xor b;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context. If you don't see why this is, then you might have other problems.
int hash = 37;
for (byte b in datastream) hash = (31 * hash + b) mod 256;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context.
Now let's change the data set from variable-length strings of random numbers in the range 0 to 255 to variable-length strings comprising English sentences encoded as US-ASCII.
The XOR is then a poor hash because the input data never has the 8th bit set, and as a result it only generates hashes in the range 0-127. There is also a higher likelihood of some "hot" values because of the letter frequency in English words and the cancelling effect of the XOR.
The pair of primes remains reasonably good as a hash function because it uses the full output range, and the prime initial offset coupled with a different prime multiplier tends to spread the values out. But it is still weak against collisions due to how the English language is structured... something that only testing with real data can show.
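Here is a rough Python sketch of that comparison (the "English" input is simulated with random lowercase letters and spaces rather than real sentences, so it only approximates the letter-frequency effect):
import random
import string

def xor_hash(data):                       # first function above
    h = 0
    for b in data:
        h ^= b
    return h

def mul_hash(data):                       # second function above
    h = 37
    for b in data:
        h = (31 * h + b) % 256
    return h

rng = random.Random(0)
ascii_streams = ["".join(rng.choice(string.ascii_lowercase + " ")
                         for _ in range(rng.randrange(5, 60))).encode("ascii")
                 for _ in range(10000)]

xor_values = {xor_hash(s) for s in ascii_streams}
mul_values = {mul_hash(s) for s in ascii_streams}
print(max(xor_values))                    # never exceeds 127: ASCII input never sets the top bit
print(len(xor_values), len(mul_values))   # the second hash typically covers far more of 0-255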

Extract a specific row from a combination matrix

Suppose I have 121 elements and want to get all combinations of 4 elements taken at a time, i.e. 121c4.
Since combnk(1:121, 4) takes a lot of time, I want to use only about 2% of those combinations, by taking:
z = 1:50:length(121c4(:, 1))
For example: the 1st row, the 5th row, the 100th row and so on, up to 121c4, picking only those rows from the 121c4 matrix without generating the complete set of combinations (it consumes too much for large numbers like 625c4).
If you haven't defined an ordering on the combinations, why not just use
randi(121,p,4)
where p is the number of combinations you want in your set? With this approach you may, or may not, want to replace duplicates.
If you have defined an ordering on the combinations, tell us what it is.