How do you generate hash values from a hash function, and how do you get integer values from those hash values? - bloom-filter

[Diagram: the string "SEAN" is split into bigrams ("SE", "EA", "AN"); each bigram is hashed and the resulting values set bits in the Bloom filter's bit array.]
Here the string is "SEAN". It is converted to bigrams, and each bigram produces a different hash value, but I don't understand which hash function is used here, or how it generates int values from those hash values to map into the Bloom filter.

The hash function can be, for example, MurmurHash; the diagram doesn't specify it. It doesn't matter which one is used exactly, as long as you always use the same algorithm when accessing the Bloom filter.
How to generate int values: for example, take the hash modulo the length of the Bloom filter's bit array. Multiply & shift is usually a little faster, but it is harder to understand.
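For example, a minimal Java sketch of the modulo approach (murmur64 stands in for whichever hash function you pick, and bigram and filterBits are illustrative names, not something the diagram specifies):

import java.util.BitSet;

int filterBits = 1024;                      // length of the Bloom filter's bit array
BitSet bitArray = new BitSet(filterBits);
long hash = murmur64(bigram);               // any fixed hash function works here
int index = (int) Long.remainderUnsigned(hash, filterBits);  // 0 .. filterBits-1
bitArray.set(index);                        // mark this bigram as present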

Related

how can I create a hash function in which different permutations of the digits of an integer form the same key?

for example 20986 and 96208 should generate the same key (but not 09862 or 9862, as a leading zero means it is not even a 5-digit number, so we ignore those).
One option is to take the smallest (or largest) sorted permutation of the digits and use that as the hash key, but sorting is too costly for my case: I need to generate the key in O(1) time.
The other idea I have is to traverse the number, get the frequency of each digit, and then build a hash function out of those frequencies. Now, what is the best function for combining the frequencies, given that 0 <= sum(f[i]) <= number_of_digits?
To create an order-insensitive hash, simply hash each value (in your case, each digit of the number) and then combine the results using a commutative function (e.g. addition, multiplication, or XOR). XOR is probably the most appropriate, as it retains a constant hash output size and is very fast.
Also, you will want to strip away any leading 0s before hashing the number.
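A minimal Java sketch of the idea (the mixing constant is arbitrary; addition is used as the commutative combiner here, since plain XOR would cancel out pairs of repeated digits):

// Order-insensitive digit hash: hash each digit, combine with addition.
static int digitSetHash(int n) {            // assumes n >= 0, no leading zeros
    int h = 0;
    do {
        int digit = n % 10;
        h += (digit + 1) * 0x9E3779B9;      // per-digit hash; constant is illustrative
        n /= 10;
    } while (n > 0);
    return h;
}
// digitSetHash(20986) == digitSetHash(96208): same digits, same sum.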

Universal hashing: should I get the same hash value for the same key?

I mean, I have implemented a universal hashing function using this expression:
h(k) = ((a*k + b) mod p) mod m (from Cormen)
where:
- p is a big prime number greater than k;
- a and b are two numbers chosen at random, the first from the range [1, p-1] and the second from [0, p-1].
Now, I implemented this, and for the random function I chose the seed equal to k. That's because, if I don't, then when I insert a value with key k it will generate a hash value that depends on the Random function's default seed (maybe the time). So if I later want to search for that key, I can't, because the universal hashing function now returns a different value. I would appreciate it if you could tell me whether my reasoning is correct or not.
My doubt is that, by doing so, if two elements have the same key, they will inevitably be stored in the same linked list (something I am not sure is correct).
Thanks in advance.
I think you have a slight misunderstanding about how universal hashing works. Rather than choosing a and b at random every time you compute the hash, instead, before you do any hashing at all, select a random a and b. Once you've done that, every time you need to compute the hash, go and compute it using the formula above based on the input value k and the values a and b that you chose initially.
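As a Java sketch (the prime and the 32-bit key range are assumptions for illustration; the point is that a and b are fixed once, at construction time):

import java.util.Random;

class UniversalHash {
    private static final long P = 2147483647L;  // prime (2^31 - 1), > any key used
    private final long a, b;                    // chosen ONCE, then reused
    private final int m;                        // number of buckets

    UniversalHash(int m, Random rnd) {
        this.m = m;
        this.a = 1 + rnd.nextInt((int) P - 1);  // a in [1, P-1]
        this.b = rnd.nextInt((int) P);          // b in [0, P-1]
    }

    int hash(int k) {                           // assumes 0 <= k < P
        return (int) (((a * k + b) % P) % m);   // same k always gives the same bucket
    }
}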

How to evaluate a hash generating algorithm

What ways do you know to evaluate the efficiency of a hash function, besides generating a large set of values and looking at their distribution?
By efficiency I mean that the keys generated by your hash function are distributed evenly. Is there a way to prove this without testing against actual values?
A hash function is only evenly distributed in the context of the data being hashed.
Consider two data sets:
Set 1
1, 3, 6, 2, 7, 9, 5, 8, 4
Set 2
65355, 96424664, 86463624, 133, 643564, 24232, 88677, 865747, 2224
A good hashing function for one set (e.g. mod 10 for Set 1) gives no collisions, and could be seen as the perfect hash for that data set.
However, apply it to the second set and there are collisions everywhere: under mod 10, for instance, 96424664, 86463624, 643564 and 2224 all land in the same bucket.
Hash = (x * 37) mod 256
is much better for the second set, but may not suit the first set quite so well, especially when partitioning the hash into e.g. a small number of buckets.
What you can do is evaluate a hash against random data that you "expect" your function to have to handle, but that is making assumptions...
Premature optimisation is looking for the perfect hash function before you have enough real data to base your assessment on.
You should have enough data well before the cost of rehashing becomes prohibitive, so you can still change your hash function at that point.
Update
Let's suppose we are looking for a hash function that generates an 8-bit hash of the input data, and let's further suppose that the hash function is supposed to take byte streams of varying length.
If we assume that the bytes in the byte streams are uniformly distributed, we can make some assessment of different hash functions.
int hash = 0;
for (byte b : datastream) hash ^= (b & 0xFF);
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context. If you don't see why this is, then you might have other problems.
int hash = 37;
for (byte b : datastream) hash = (31 * hash + (b & 0xFF)) % 256;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context.
Now let's change the data set from variable-length strings of random bytes in the range 0 to 255 to variable-length strings comprising English sentences encoded as US-ASCII.
XOR is then a poor hash, because the input data never has the 8th bit set, and as a result it only generates hashes in the range 0-127; there is also a higher likelihood of some "hot" values because of the letter frequencies in English words and the cancelling effect of XOR.
The pair of primes remains reasonably good as a hash function, because it uses the full output range, and the prime initial offset coupled with a different prime multiplier tends to spread the values out. But it is still weak on collisions because of how the English language is structured... something that only testing with real data can show.
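To see this concretely, here is a small (purely illustrative) Java harness that runs both functions above over a few English words and counts the distinct 8-bit outputs; with ASCII input, the XOR hash can never leave the range 0-127:

import java.util.HashSet;
import java.util.Set;

public class HashSpread {
    static int xorHash(byte[] data) {
        int h = 0;
        for (byte b : data) h ^= (b & 0xFF);
        return h;
    }

    static int primeHash(byte[] data) {
        int h = 37;
        for (byte b : data) h = (31 * h + (b & 0xFF)) % 256;
        return h;
    }

    public static void main(String[] args) {
        String[] words = { "the", "quick", "brown", "fox", "jumps",
                           "over", "lazy", "dog", "hash", "function" };
        Set<Integer> xor = new HashSet<>(), prime = new HashSet<>();
        for (String w : words) {
            xor.add(xorHash(w.getBytes()));
            prime.add(primeHash(w.getBytes()));
        }
        System.out.println("distinct xor hashes:   " + xor.size());
        System.out.println("distinct prime hashes: " + prime.size());
    }
}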

Fast associative arrays or maps in Matlab

I need to build a fast one-to-one mapping between two large arrays of integers in Matlab. The mapping should take as input an element from a pre-defined array, e.g.:
in_range = [-200 2 56 45 ... ];
and map it, by its index in the previous array, to the corresponding element from another pre-defined array, e.g.:
out_range = [-10000 0 97 600 ... ];
For example, in the case above, my_map(-200) should output -10000, and my_map(45) should output 600.
I need a solution that:
- can map very large arrays (~100K elements) relatively efficiently;
- scales well with the bounds of in_range and out_range (i.e. their min and max values).
So far, I have solved this problem using Matlab's external interface to Java with Java's HashMaps, but I was wondering if there was a Matlab-native alternative.
Thanks!
The latest versions of Matlab have hash maps (containers.Map). I'm using 2007b, where they aren't available, so I use structs whenever I need a hash: just convert the integers to valid field names with genvarname.
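A minimal Matlab sketch of the struct approach (the 'k' prefix is illustrative; it just makes the sanitized field name deterministic for negative keys):

% Build a struct keyed by sanitized integer strings.
map = struct();
for i = 1:numel(in_range)
    key = genvarname(['k' num2str(in_range(i))]);  % e.g. -200 -> a valid field name
    map.(key) = out_range(i);
end

% Lookup uses the same sanitization, so the field names match.
val = map.(genvarname(['k' num2str(-200)]));       % -> -10000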

Using hash functions with Bloom filters

A Bloom filter uses a hash function (or several) to generate a value between 0 and m given an input string X. My question is: how do you use a hash function to generate a value in this way? For example, an MD5 hash is typically represented as a 32-character hex string; how would I use the MD5 hashing algorithm to generate a value between 0 and m, where I can specify m? I'm using Java at the moment, so an example using the MessageDigest functionality it offers would be great, though a generic description of how to go about it would be fine too.
Thanks
You should first convert the hash output to an unsigned integer, then reduce it modulo m. This looks like this:
import java.math.BigInteger;
import java.security.MessageDigest;

MessageDigest md = MessageDigest.getInstance("MD5");
md.update(data);                              // hash data... (data = the input bytes)
byte[] hashValue = md.digest();
BigInteger n = new BigInteger(1, hashValue);  // signum 1: treat the bytes as non-negative
n = n.mod(m);
// at that point, n has a value between 0 and m-1 (inclusive)
I have assumed that m is a BigInteger instance; if necessary, use BigInteger.valueOf() to create it. Similarly, use n.intValue() or n.longValue() to get the value of n as one of Java's primitive types.
The modular reduction is somewhat biased, but the bias is very small as long as m is substantially smaller than 2^128.
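For instance, with a plain int bound (the value 1000 is just an example):

BigInteger m = BigInteger.valueOf(1000);  // m as a BigInteger
int index = n.mod(m).intValue();          // index in [0, 999]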
The simplest way would probably be to just convert the hash output (as a byte sequence) to a single binary number and take that modulo m.
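If m fits in a long, a sketch of that idea without BigInteger is to take, say, the first 8 bytes of the digest (the truncation to 8 bytes is an assumption for illustration):

long x = 0;
for (int i = 0; i < 8; i++) {
    x = (x << 8) | (hashValue[i] & 0xFF);      // first 8 digest bytes as one long
}
int index = (int) Math.floorMod(x, (long) m); // always in [0, m-1], even if x < 0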