Are SHA1 hashes distributed uniformly? - hash

I have a string in Python. I calculate the SHA1 hash of that string with hashlib. I convert it to its hexadecimal representation and take the last 16 characters to use as an identifier:
hash_str = "foobarbazάλφαβήταγάμμα..."
hash_obj = hashlib.sha1(hash_str, encode('utf-8'))
hash_id = hash_obj.hexdigest()[:16]
My goal is an identifier that provides reasonable length and is unlikely to yield the same hash_id value for a different hash_str input.
If the probability of a SHA1 collision is 1/(2^160), or 1/(16^40), then if I take the last sixteen characters of the hex representation, is the probability of a collision only 1/(16^16)? Or are the bytes (or their hex equivalent) not distributed evenly?

Yes. Any hash function which exhibits the property of uniformity has equal chance of any value in its output range being generated by a randomly chosen input value. Therefore, each value of the truncated hash is equally likely too. SHA-1 is is hash function that demonstrates uniformity, therefore your conjecture is true.

Related

how can I create a hash function in which different permutaions of digits of an integer form the same key?

for example 20986 and 96208 should generate the same key (but not 09862 or 9862 as leading zero means it not even a 5 digit number so we igore those).
One option is to get the least/max sorted permutation and then the sorted number is the hashkey, but sorting is too costly for my case. I need to generate key in O(1) time.
Other idea I have is to traverse the number and get frequency of each digits and the then get a hash function out of it. Now whats the best function to combine the frequencies given that 0<= Summation(f[i]) <= no_of_digits.
To create an order-insensitive hash simply hash each value (in your case the digits of the number) and then combine them using a commutative function (e.g. addition/multiplication/XOR). XOR is probably the most appropriate as it retains a constant hash output size and is very fast.
Also, you will want to strip away any leading 0's before hashing the number.

Maximum length of a string after performing unicode casefolding

I need to perform casefolding on a set of strings, and must ensure beforehand that they will not exceed a given length after this is done (to hard-code the needed buffer size). The problem is that a string length (in code points) may change after casefolding is applied. See, e.g., in Python3:
>>> "süß".casefold()
'süss'
Now, the maximum number of code points a string may contain after performing casefolding can be computed easily:
>>> max(len(chr(s).casefold()) for s in range(0x10FFFF + 1))
3
But is it valid in all cases? I mean, is it possible that the sequence of code points (the order in which they appear) might affect the final length of the string, due to some arcane property of Unicode? Or can I assume that the final string will always be at most 3 times longer than the original?
The Unicode standard defines casefolding as follows:
toCasefold(X): Map each character C in X to Case_Folding(C).
So every character in a string is casefolded regardless of context and the results are concatenated. This means that your assumption is correct: A casefolded string is guaranteed to have at most three times the number of code points of the original.

How to evaluate a hash generating algorithm

What ways do you know to evaluate the efficiency of a hash function besides generating a large set of values and see the distribution of values?
By efficiency I mean that the keys generated by your hash function distribute evenly. Is there a way to prove this without actually testing for actual values?
A hash function is only even in the context of the data being hashed
Consider two data sets:
Set 1
1, 3, 6, 2, 7, 9, 5, 8, 4
Set 2
65355, 96424664, 86463624, 133, 643564, 24232, 88677, 865747, 2224
A good hashing function for one set (ie mod 10 for set 1) gives no collisions and could be seen as the perfect hash for that data set
However apply it to the second set and there are collisions everywhere
Hash = (x * 37) mod 256
Is much better for the second set but may not suit the first set quite so well... Especially when partitioning the hash for eg a small number of buckets.
What you can do is evaluate a hash against random data that you "expect" your function to have to handle... But that is making assumptions...
Premature optimisation is looking for the perfect hash function before you have enough real data to base your assessment on.
You should get enough data well before the cost of rehashing becomes prohibitive to change your hash function
Update
Lets suppose we are looking for a hash function that generates an 8 bit hash of the input data. Lets further suppose that the hash function is supposed to take byte-streams of varying length.
If we assume that the bytes in the byte-streams are uniformly distributed, we can make some assessment of different hash functions.
int hash = 0;
for (byte b in datastream) hash = hash xor b;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context. If you don't see why this is, then you might have other problems.
int hash = 37;
for (byte b in datastream hash = (31 * hash + b) mod 256;
This function will produce uniformly distributed hash values for the specified data set, and would therefore be a good hash function in this context.
Now lets change the data set from being variable length strings of random numbers in the range 0 to 255 to being variable length strings comprising English sentences encoded as US-ASCII.
The XOR is then a poor hash because the input data never has the 8th bit set and as a result only generates hashes in the range 0-127, also there is a higher likelyhood of some "hot" values because of the letter frequency in english words and the cancelling affect of the XOR.
The pair of primes remains reasonably good as a hash function because it uses the full output range and the prime initial offset coupled with a different prime multiplier tends to spread the values out. But it is still weak for collisions due to how English language is structured... Something that only testing with real data can show.

Pearson perfect hashing

I'm trying to write a generator that produces Pearson perfect hashes. Note that I don't need a minimal perfect hash. Wikipedia says that a Pearson perfect hash can be found in O(|S|) time using a randomized algorithm (where S is the set of keys). However, I haven't been able to find such an algorithm online. Is this even possible?
Note: I don't want to use gperf/cmph/etc., I'd rather write my own implementation.
Pearson's original paper outlines an algorithm to construct a permutation table T for perfect hashing:
The table T at the heart of this new hashing function can sometimes be modified to produce a minimal, perfect hashing function over a modest list of words. In fact, one can usually choose the exact value of the function for a particular word. For example, Knuth [3] illustrates perfect hashing with an algorithm that maps a list of 31 common English words onto unique integers between −10 and 30. The table T presented in Table II maps these same 31 words onto the integers from 1 to 31 in alphabetic order.
Although the procedure for constructing the table in Table II is too involved to be detailed here, the following highlights will enable the interested reader to repeat the process:
A table T was constructed by pseudorandom permutation of the integers (0 ... 255).
One by one, the desired values were assigned to the words in the list. Each assignment was effected by exchanging two elements in the table.
For each word, the first candidate considered for exchange was T[h[n − 1] ⊕ C[n]], the last table element referenced in the computation of the hash function for that word.
A table element could not be exchanged if it was referenced during the hashing of a previously assigned word or if it was referenced earlier in the hashing of the same word.
If the necessary exchange was forbidden by Rule 4, attention was shifted to the previously referenced table element, T[h[n − 2] ⊕ C[n − 1]].
The procedure is not always successful. For example, using the ASCII character codes, if the word “a” hashes to 0 and the word “i” hashes to 15, it turns out that the word “in” must hash to 0. Initial attempts to map Knuth's 31 words onto the integers (0 ... 30) failed for exactly this reason. The shift to the range (1 ... 31) was an ad hoc tactic to circumvent this problem.
Does this tampering with T damage the statistical behavior of the hashing function? Not seriously. When the 26,662 dictionary entries are hashed into 256 bins, the resulting distribution is still not significantly different from uniform (χ² = 266.03, 255 d.f., p = 0.30). Hashing the 128 randomly selected dictionary words resulted in an average of 27.5 collisions versus 26.8 with the unmodified T. When this function is extended as described above to produce 16-bit hash indices, the same test produces a substantially greater number of collisions (4,870 versus 4,721 with the unmodified T), although the distribution still is not significantly different from uniform (χ² = 565.2, 532 d.f., p = 0.154).

Using hash functions with Bloom filters

A bloom filter uses a hash function (or many) to generate a value between 0 and m given an input string X. My question is how to you use a hash function to generate a value in this way, for example an MD5 hash is typically represented by a 32 length hex string, how would I use an MD5 hashing algorithm to generate a value between 0 and m where I can specify m? I'm using Java at the moment so an example of to do this with the MessageDigest functionality it offers would be great, though just a generic description of how to do about it would be fine too.
Thanks
You should first convert the hash output to an unsigned integer, then reduce it modulo m. This looks like this:
MessageDigest md = MessageDigest.getInstance("MD5");
// hash data...
byte[] hashValue = md.digest();
BigInteger n = new BigInteger(1, hashValue);
n = n.mod(m);
// at that point, n has a value between 0 and m-1 (inclusive)
I have assumed that m is a BigInteger instance. If necessary, use BigInteger.valueOf(). Similarly, use n.intValue() or n.longValue() to get the value of n as one of the primitive types of Java.
The modular reduction is somewhat biased, but the bias is very small if m is substantially smaller than 2^128.
Simplest way would probably be to just convert the hash output (as a byte sequence) to a single binary number and take that modulo m.