I want to convert a string to a hash and divide it into n buckets - hash

Problem: I want to divide M strings into N buckets as uniformly as possible.
One solution I was thinking of is:
Create a hash of the string
Convert the hash to an integer by mapping each character of the hash to its ASCII value
Sum up those ASCII values
Divide the sum by N
I believe hashing will take care of the uniform distribution, but I am not sure whether converting to ASCII values will change anything.
Please suggest a better solution if you have one.
Thank you in advance
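A minimal sketch of one way to do this, assuming SHA-1 from Python's hashlib as the hash. Instead of summing the ASCII values of the hex digest, it reads the whole digest as one integer and takes the remainder modulo N. The sum of 40 hex characters can only take about two thousand distinct values, and those cluster around the middle, so it is less uniform than the digest itself. The name bucket_for is hypothetical.

```python
import hashlib


def bucket_for(s, n_buckets):
    """Map a string to a bucket index in range(n_buckets)."""
    if n_buckets <= 0:
        raise ValueError("n_buckets must be positive")
    # Hash the UTF-8 bytes of the string; the digest is 20 raw bytes.
    digest = hashlib.sha1(s.encode("utf-8")).digest()
    # Interpret the digest as one large unsigned integer, then reduce it.
    return int.from_bytes(digest, "big") % n_buckets
```

The same string always lands in the same bucket, and the bias from the modulo step is negligible as long as N is far smaller than 2^160.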

Related

How can I create a hash function in which different permutations of the digits of an integer form the same key?

For example, 20986 and 96208 should generate the same key (but not 09862 or 9862, as a leading zero means it is not even a 5-digit number, so we ignore those).
One option is to take the smallest or largest sorted permutation and use the sorted number as the hash key, but sorting is too costly in my case. I need to generate the key in O(1) time.
Another idea I have is to traverse the number, count the frequency of each digit, and then build a hash function out of those counts. What is the best function to combine the frequencies, given that 0 <= Summation(f[i]) <= no_of_digits?
To create an order-insensitive hash simply hash each value (in your case the digits of the number) and then combine them using a commutative function (e.g. addition/multiplication/XOR). XOR is probably the most appropriate as it retains a constant hash output size and is very fast.
Also, you will want to strip away any leading 0's before hashing the number.
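A rough sketch of the commutative-combination idea, with some assumptions of mine: the per-digit hash is a fixed table of ten 64-bit constants produced by a splitmix64-style mixer, and the combining function is addition modulo 2^64 rather than XOR. With XOR, a digit that occurs twice cancels itself out (1123 and 23 would get the same key), which addition avoids. The names _mix, DIGIT_HASH and permutation_key are hypothetical.

```python
MASK = (1 << 64) - 1  # keep every intermediate value within 64 bits


def _mix(d):
    """Scramble a small integer into a 64-bit value (splitmix64-style)."""
    x = (d + 1) * 0x9E3779B97F4A7C15 & MASK
    x ^= x >> 30
    x = x * 0xBF58476D1CE4E5B9 & MASK
    x ^= x >> 27
    x = x * 0x94D049BB133111EB & MASK
    x ^= x >> 31
    return x


# One fixed pseudo-random constant per decimal digit.
DIGIT_HASH = [_mix(d) for d in range(10)]


def permutation_key(n):
    """Key that is identical for all digit permutations of n."""
    key = 0
    # str() of an int never has leading zeros, so 09862 is treated as 9862.
    for ch in str(n):
        key = (key + DIGIT_HASH[int(ch)]) & MASK
    return key
```

The loop runs once per digit, so for fixed-width integers the cost is bounded by a small constant. Here permutation_key(20986) equals permutation_key(96208), while permutation_key(9862) differs because it lacks the contribution of the digit 0.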

Are SHA1 hashes distributed uniformly?

I have a string in Python. I calculate the SHA1 hash of that string with hashlib. I convert it to its hexadecimal representation and take the last 16 characters to use as an identifier:
import hashlib

hash_str = "foobarbazάλφαβήταγάμμα..."
hash_obj = hashlib.sha1(hash_str.encode('utf-8'))  # encode() is a method of str
hash_id = hash_obj.hexdigest()[-16:]  # last 16 hex characters of the digest
My goal is an identifier that provides reasonable length and is unlikely to yield the same hash_id value for a different hash_str input.
If the probability of a SHA1 collision is 1/(2^160), or 1/(16^40), then if I take the last sixteen characters of the hex representation, is the probability of a collision only 1/(16^16)? Or are the bytes (or their hex equivalent) not distributed evenly?
Yes. Any hash function that exhibits the property of uniformity has an equal chance of generating any value in its output range for a randomly chosen input value. Therefore, each value of the truncated hash is equally likely too. SHA-1 is a hash function that demonstrates uniformity, so your conjecture is true.
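To put the 1/(16^16) figure in context: it is the chance that one particular pair of inputs collides. With k identifiers in play, the chance that any two of them collide is roughly k^2 / 2^65 (the birthday approximation). A small sketch of that estimate, assuming uniformly distributed 64-bit identifiers; the function name is hypothetical.

```python
import math


def collision_probability(k, bits=64):
    """Approximate chance that at least two of k uniform ids collide."""
    # Birthday approximation: p ~= 1 - exp(-k * (k - 1) / (2 * 2**bits)).
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-k * (k - 1) / (2 * 2**bits))
```

For two identifiers this gives about 2^-64, which matches 1/(16^16); the estimate grows roughly with the square of the number of identifiers.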

How to get the reverse (not complement or inverse) of a binary number

I am implementing the Cooley-Tukey FFT algorithm (radix-2 DIF / DIT) in Matlab. For the bit-reversal step I need the reverse of a binary number. Can anyone suggest how to get the reverse of a binary number (like 100111 -> 111001)? Anyone who has worked on an FFT implementation is welcome to help me with the algorithm as well.
Topic: How to do bit reversal in Matlab?
If you're using double precision floating point ('double') numbers
which are integers, you can do this:
dr = bin2dec(fliplr(dec2bin(d,n))); % Bits in dr are in reverse order
where n is the number of bits to be reversed and where 0 <= d < 2^n.
You will experience no precision problems at all as long as the
integers are no more than 52 bits long.
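The same string-flip idea, sketched in Python rather than Matlab for illustration. The function names are hypothetical; n is the bit width, as in the Matlab line above.

```python
def reverse_bits(d, n):
    """Reverse the lowest n bits of d, where 0 <= d < 2**n."""
    if not 0 <= d < (1 << n):
        raise ValueError("d must satisfy 0 <= d < 2**n")
    # Zero-pad to n binary digits, flip the string, parse it back.
    return int(format(d, f"0{n}b")[::-1], 2)


def bit_reversal_order(n):
    """Index permutation used by a radix-2 FFT of length 2**n."""
    return [reverse_bits(i, n) for i in range(1 << n)]
```

For example, reverse_bits(0b100111, 6) returns 0b111001, and bit_reversal_order(3) returns [0, 4, 2, 6, 1, 5, 3, 7].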
And
Re: How to do bit reversal in Matlab?
How large will the numbers be that you need to reverse? May I ask what
is the purpose of it? Maybe there is a more efficient way to solve the
whole problem. If the numbers are large you can just store the bits as
a string. To reverse it just read the string backwards! Or use
fliplr().
(There may be better places to ask).
If it were VHDL I'd suggest an alias with 'REVERSE'RANGE.
Taken from the help section;
Y = swapbytes(X) reverses the byte ordering of each element in array X, converting little-endian values to big-endian (and vice versa). The input array must contain all full, noncomplex, numeric elements.

How to pick a modulus for an integer or string hash?

Typically, we hash an integer or string according to some rule, then return hash(int-or-str) % m as the index into the hash table. But how do we choose the modulus m? Is there any convention to follow?
There are two possible conventions. One is to use a prime number, which yields good performance with quadratic probing.
The other is to use a power of two, since n mod m where m = 2^k is a fast operation; it's a bitwise AND with m-1. Of course, the modulus must be equal to the size of the hash table, and powers of two mean your hash table must double in size whenever it's overcrowded. This gives you amortized O(1) insertion in a similar way that a dynamic array does.
Since [val modulo m] is used as an index into the table, m is the number of elements in that table. Are you free to choose it? Then use a big enough prime number. If you need to resize the table, you can either choose a bigger prime number or, if you resize by doubling the table, make sure that your hash function has enough entropy in the lower bits.
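A short sketch of both conventions, using hypothetical names. It shows that reduction modulo a power of two is the same as masking with m-1, and why that choice depends on the low bits of the hash: values that differ only in their high bits all land in the same slot, while a prime modulus spreads them out.

```python
def index_pow2(h, m):
    """Table index for a power-of-two table size m, using a bit mask."""
    if m <= 0 or m & (m - 1):
        raise ValueError("m must be a positive power of two")
    return h & (m - 1)  # same result as h % m


def index_prime(h, m):
    """Table index for any table size m (typically a prime)."""
    return h % m


# Hash values that are all multiples of 16: no entropy in the low 4 bits.
weak_hashes = [16 * i for i in range(10)]
slots_pow2 = {index_pow2(h, 16) for h in weak_hashes}    # every value maps to slot 0
slots_prime = {index_prime(h, 17) for h in weak_hashes}  # ten distinct slots
```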

Using hash functions with Bloom filters

A Bloom filter uses a hash function (or many) to generate a value between 0 and m given an input string X. My question is how do you use a hash function to generate a value in this way? For example, an MD5 hash is typically represented by a 32-character hex string, so how would I use an MD5 hashing algorithm to generate a value between 0 and m, where I can specify m? I'm using Java at the moment, so an example of how to do this with the MessageDigest functionality it offers would be great, though just a generic description of how to go about it would be fine too.
Thanks
You should first convert the hash output to an unsigned integer, then reduce it modulo m. It looks like this:
MessageDigest md = MessageDigest.getInstance("MD5");
// hash data...
byte[] hashValue = md.digest();
BigInteger n = new BigInteger(1, hashValue);
n = n.mod(m);
// at that point, n has a value between 0 and m-1 (inclusive)
I have assumed that m is a BigInteger instance. If necessary, use BigInteger.valueOf(). Similarly, use n.intValue() or n.longValue() to get the value of n as one of the primitive types of Java.
The modular reduction is somewhat biased, but the bias is very small if m is substantially smaller than 2^128.
The simplest way would probably be to just convert the hash output (as a byte sequence) to a single binary number and take that modulo m.
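For the "or many" part of the question, here is a hedged sketch in Python of deriving k positions from a single MD5 primitive. Prefixing a counter to the input to obtain k different hash functions is my assumption, not something the answers above prescribe, and bloom_indexes is a hypothetical name.

```python
import hashlib


def bloom_indexes(item, m, k):
    """Return k bit positions in range(m) for the given string."""
    if m <= 0 or k <= 0:
        raise ValueError("m and k must be positive")
    indexes = []
    for i in range(k):
        # Assumption: salting the input with the counter i stands in
        # for k independent hash functions.
        digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).digest()
        # Read the 16-byte digest as an unsigned integer, then reduce it.
        indexes.append(int.from_bytes(digest, "big") % m)
    return indexes
```

Each returned value lies between 0 and m-1 inclusive, matching the Java snippet above; the positions would then be set (on insert) or tested (on lookup) in the filter's bit array.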