How are uniform hashing functions applied?

According to CLRS page 267, a class of uniform hash functions is defined, but I am wondering how these functions are applied when hashing a group of keys.
Do we choose a function at random every time we want to compute a hash value, or do we choose one function at random and use it to compute every hash value for the keys in this group?

If you were to randomly choose a hashing function every time you wanted to hash a key, then you'd end up with a mess because different hashing functions create different hash values for the same key. That is, if your key was "foobar", then hash function A would compute a different value for it than hash function B. That wouldn't be useful.
So you choose a hashing function and apply that to every key in that group. Typically, you'll use the same hash function for all keys in your system. In general, there's no particular advantage to having multiple hashing functions in your program. (Yes, I know there are special cases.)
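Here is a minimal sketch in Go of the "choose once, use for all keys" approach, drawing one member h_{a,b}(k) = ((a*k + b) mod p) mod m from the universal family described in CLRS; the prime p, the table size, and the sample keys below are illustrative assumptions, not values from the book:

```go
package main

import (
	"fmt"
	"math/rand"
)

// p must be a prime larger than any key; 2^31 - 1 is an illustrative choice.
const p = 2147483647

// universalHash is one member of the family h_{a,b}(k) = ((a*k + b) mod p) mod m.
type universalHash struct {
	a, b, m uint64
}

// newUniversalHash draws a and b at random ONCE, fixing the function.
func newUniversalHash(m uint64) universalHash {
	return universalHash{
		a: 1 + rand.Uint64()%(p-1), // a in [1, p-1]
		b: rand.Uint64() % p,       // b in [0, p-1]
		m: m,
	}
}

func (h universalHash) hash(k uint64) uint64 {
	return ((h.a*k + h.b) % p) % h.m // same a and b for every key
}

func main() {
	h := newUniversalHash(16) // choose the function once, at table creation
	for _, key := range []uint64{42, 1000, 99999} {
		fmt.Printf("h(%d) = %d\n", key, h.hash(key))
	}
}
```

The randomness happens once, inside newUniversalHash; after that the function is fixed, so equal keys always land in the same bucket.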

Related

Why isn't modulus sufficient within a hash function for hash tables?

I often see or hear of modulus being used as a last step of hashing or after hashing. e.g. h(input)%N where h is the hash function and % is the modulus operator. If I am designing a hash table, and want to map a large set of keys to a smaller space of indices for the hash table, doesn't the modulus operator achieve that? Furthermore, if I wanted to randomize the distribution across those locations within the hash table, is the remainder generated by modulus not sufficient? What does the hashing function h provide on top of the modulus operator?
I often see or hear of modulus being used as a last step of hashing or after hashing. e.g. h( input ) % N where h is the hash function and % is the modulus operator.
Indeed.
If I am designing a hash table, and want to map a large set of keys to a smaller space of indices for the hash table, doesn't the modulus operator achieve that?
That's precisely the purpose of the modulo operator: to restrict the range of array indexes, so yes.
But you cannot use the modulo operator by itself: it requires an integer operand, and you cannot take the "modulo of a string over N" or the "modulo of an object-graph over N"[1].
Furthermore, if I wanted to randomize the distribution across those locations within the hash table, is the remainder generated by modulus not sufficient?
No, it does not, because the modulo operator doesn't give you pseudorandom output, nor does it have any kind of avalanche effect. That means similar input values produce similar output hashes, which results in clustering in your hashtable bins and subpar performance due to a greatly increased likelihood of hash collisions (requiring slower recovery techniques like linear probing, which defeat the purpose of a hashtable by costing you O(1) lookups).
What does the hashing function h provide on top of the modulus operator?
The domain of h can be anything, especially non-integer values.
[1] Technically speaking, this is possible if you use the value of the memory address of an object (i.e. an object pointer), but that doesn't work if you have hashtable keys that don't use object identity, such as a stack-allocated object or custom struct.
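As a minimal sketch of that division of labor in Go (the key strings and bucket count here are arbitrary examples), the standard library's FNV-1a turns a string into an integer, and the modulus then restricts that integer to a table index:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numBuckets = 16 // size of the hypothetical hash table

// bucketFor shows the division of labor: FNV-1a maps the string to an
// integer, then the modulus restricts that integer to a table index.
func bucketFor(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32() % numBuckets
}

func main() {
	for _, key := range []string{"foo", "bar", "foobar"} {
		fmt.Printf("%-8s -> bucket %d\n", key, bucketFor(key))
	}
}
```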
First, the hash function's primary purpose is to turn something that's not a number into a number. Even if you just use modulus after that to get a number in your range, getting the number is still the first step and is the responsibility of the hash function. If you're hashing integers and you just use the integers as their own hashes, it isn't that there's no hash function, it's that you've chosen the identity function as your hash function. If you don't write out the function, that means you inlined it.
Second, the hash function can provide a more unpredictable distribution to reduce the likelihood of unintentional collisions. The data people work with often contain patterns, and if you're just using the identity function with modulus, those patterns may line up with the modulus and cause extra collisions. The hash function is an opportunity to break the patterns up, so it becomes unlikely that the modulus exposes patterns in the original data sequence.
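A small Go sketch of that failure mode, assuming keys that happen to be multiples of the bucket count: the identity "hash" piles every key into one bucket, while even a simple non-cryptographic hash such as FNV-1a breaks the pattern up:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

const numBuckets = 8

func main() {
	identity := make([]int, numBuckets)
	hashed := make([]int, numBuckets)

	// Patterned input: every key is a multiple of the bucket count.
	for k := uint32(0); k < 800; k += 8 {
		identity[k%numBuckets]++ // identity "hash" + modulus

		h := fnv.New32a()
		var buf [4]byte
		binary.LittleEndian.PutUint32(buf[:], k)
		h.Write(buf[:])
		hashed[h.Sum32()%numBuckets]++ // real hash + modulus
	}

	fmt.Println("identity % N:", identity) // every key lands in bucket 0
	fmt.Println("fnv1a % N:   ", hashed)   // roughly even spread
}
```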

Can we reverse second sha256 hash?

Can I reverse a SHA-256 hash, e.g. the 2nd hash back to the 1st hash?
ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb
da3811154d59c4267077ddd8bb768fa9b06399c486e1fc00485116b57c9872f5
The 2nd hash is generated by sha256(1st hash), so is it possible to reverse it to get the 1st hash?
In short, as of 2019, NO.
Cryptographic hash functions are, in short, deterministic but random-looking one-way functions. Deterministic means the same input always produces the same output; random in the sense that the output is unpredictable.
In cryptography, we consider the security of hash functions in terms of three properties:
Preimage-Resistance: for essentially all pre-specified outputs, it is computationally infeasible to find any input which hashes to that output, i.e., to find any preimage x' such that h(x') = y when given any y for which a corresponding input is not known.
2nd-preimage resistance (weak collision resistance): it is computationally infeasible to find any second input which has the same output as any specified input, i.e., given x, to find a 2nd-preimage x' != x such that h(x) = h(x').
Collision resistance: it is computationally infeasible to find any two distinct inputs x, x' which hash to the same output, i.e., such that h(x) = h(x').
What you are looking for is a preimage. There are cryptographic hash functions, like MD4 and SHA-1, for which collisions have been found, but all of them still have preimage and 2nd-preimage resistance.
For SHA-256 there are no known preimage, 2nd-preimage, or collision attacks. It is considered a secure hash function.
You may find rainbow tables for SHA-256 that include your hash values, but probably not, since the input space is far too big to cover.
Hashing is meant to be a one way process. If a hashing algorithm were easily reversible, then it would be insecure. To answer your question, no, it's not possible to "unhash" 2 and obtain 1. In order to "crack" the second hash, you would have to brute force it by computing the sha256 of other strings and comparing the result with 2. If they match, then you (probably) have the original string.
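To make the brute-force idea concrete, here is a sketch in Go; the candidate list is hypothetical, and it includes the first hash from the question only on the assumption that the second hash was computed over its hex-string form:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

func main() {
	// The 2nd hash from the question.
	target := "da3811154d59c4267077ddd8bb768fa9b06399c486e1fc00485116b57c9872f5"

	// SHA-256 cannot be run backwards; the only generic attack is to
	// guess candidate inputs, hash each one, and compare against the target.
	candidates := []string{
		"hello", // hypothetical guesses
		"ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb",
	}
	for _, c := range candidates {
		sum := sha256.Sum256([]byte(c))
		if hex.EncodeToString(sum[:]) == target {
			fmt.Printf("found a preimage: %q\n", c)
			return
		}
	}
	fmt.Println("no candidate matched")
}
```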
SHA-256 is a hash function, as defined on Wikipedia (https://en.wikipedia.org/wiki/Cryptographic_hash_function):
The ideal cryptographic hash function has five main properties:
it is deterministic so the same message always results in the same hash
it is quick to compute the hash value for any given message
it is infeasible to generate a message from its hash value except by trying all possible messages
a small change to a message should change the hash value so extensively that the new hash value appears uncorrelated with the old hash value
it is infeasible to find two different messages with the same hash value
By definition, a hash function is useful only so long as you cannot reverse it to recover the input.
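A quick Go sketch of the first and fourth properties (the message strings are arbitrary): hashing the same message twice gives identical digests, while changing one character gives an apparently uncorrelated digest:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	// Deterministic: the same message always yields the same digest.
	fmt.Printf("%x\n", sha256.Sum256([]byte("message")))
	fmt.Printf("%x\n", sha256.Sum256([]byte("message")))

	// Avalanche: a one-character change yields an unrelated-looking digest.
	fmt.Printf("%x\n", sha256.Sum256([]byte("messagf")))
}
```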

Hash UUIDs without requiring ordering

I have two UUIDs. I want to hash them perfectly to produce a single unique value, but with a constraint that f(m,n) and f(n,m) must generate the same hash.
UUIDs are 128-bit values
the hash function should have no collisions - all possible input pairings must generate unique hash values
f(m,n) and f(n,m) must generate the same hash - that is, ordering is not important
I'm working in Go, so the resulting value must fit in a 256-bit int
the hash does not need to be reversible
Can anyone help?
Concatenate them with the smaller one first.
To build on user2357112's brilliant solution and boil down the comment chain, let's consider your requirements one by one (and out of order):
No collisions
Technically, that's not a hash function. A hash function is about mapping heterogeneous, arbitrary-length data inputs into fixed-width, homogeneous outputs. The only way to accomplish that if the input is longer than the output is through some data loss. For most applications, this is tolerable because the hash function is only used as a fast lookup key and the code falls back onto the slower, complete comparison of the data. That's why many guides and languages insist that if you implement one, you must implement the other.
Fortunately, you say:
Two UUID inputs m and n
UUIDs are 128 bits each
Output of f(m,n) must be 256 bits or less
Combined, your two inputs are exactly 256 bits, which means you do not have to lose any data. If you needed a smaller output, then you would be out of luck. As it is, you can concatenate the two numbers together and generate a perfect, unique representation.
f(m,n) and f(n,m) must generate the same hash
To accomplish this final requirement, make a decision on the concatenation order by some intrinsic value of the two UUIDs. The suggested smaller-first works just great. However...
The hash does not need to be reversible
If you specifically need irreversible hashing, that's a different question entirely. You could still use the less-than comparison to ensure order independence when feeding into a cryptographic hash function, but you would be hard pressed to find something that guarantees no collisions, even with fixed-width inputs and a 256-bit output width.
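Putting that together, a minimal Go sketch (the two UUID values are placeholders): order the pair by value before concatenating, so that f(m,n) == f(n,m) and distinct unordered pairs can never collide:

```go
package main

import (
	"bytes"
	"fmt"
)

// combine concatenates two 128-bit UUIDs, smaller one first, so that
// combine(m, n) == combine(n, m) and distinct unordered pairs never collide.
func combine(m, n [16]byte) [32]byte {
	if bytes.Compare(m[:], n[:]) > 0 {
		m, n = n, m // order by value to make the pairing symmetric
	}
	var out [32]byte
	copy(out[:16], m[:])
	copy(out[16:], n[:])
	return out
}

func main() {
	a := [16]byte{0x01, 0x02} // placeholder UUIDs
	b := [16]byte{0xff, 0xee}
	fmt.Printf("f(a,b) = %x\n", combine(a, b))
	fmt.Printf("f(b,a) = %x\n", combine(b, a)) // identical output
}
```

If you need the result as a number rather than bytes, the 32 bytes fit exactly into a 256-bit integer (for example via math/big's SetBytes).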

what does jenkinshash in hadoop guarantee?

I know that JenkinsHash produces a 32-bit integer (2^32 possible values) for a given value. The documentation at this link:
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/util/JenkinsHash.html
says
Returns:
a 32-bit value. Every bit of the key affects every bit of the return value. Two keys differing by one or two bits will have totally different hash values.
JenkinsHash can return at most 2^32 different results.
What if I have more than 2^32 values?
Will it return same result for two different values?
Thanks
Like most hash functions, yes, it may return duplicate hash values for different input data. The guarantee, according to the documentation you linked to, is that keys differing by one or two bits produce different hash values. As soon as they differ by 3 or more bits, you have no uniqueness guarantee.
The input data to the hash function may be of a larger size (have more unique input values) than the output of the hash. This trivially makes it so that duplicates must exist in the output data. Consider a hashing function that outputs an integer in the range 1-10 but takes an input in the range 1-100: it is obvious that multiple values must hash to the same value because you cannot enumerate the values 1-100 using only ten different integers. This is called the pigeonhole principle.
Any good hashing function will, however, try to distribute the output values evenly. In the 1-10 example you can expect a good hashing function to produce a 2 approximately as often as a 6.
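As a Go sketch of the 1-100 into 1-10 example, using the standard library's FNV-1a as a stand-in for a well-distributing hash: the pigeonhole principle forces duplicates, but the bucket counts come out roughly even:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
)

func main() {
	buckets := make([]int, 10)
	for k := uint32(1); k <= 100; k++ {
		h := fnv.New32a()
		var buf [4]byte
		binary.LittleEndian.PutUint32(buf[:], k)
		h.Write(buf[:])
		buckets[h.Sum32()%10]++ // 100 inputs, 10 outputs: duplicates are forced
	}
	fmt.Println("bucket counts:", buckets) // expect roughly 10 per bucket
}
```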
Hashing functions that guarantee uniqueness are called perfect hash functions. They all provide an output space of at least the same cardinality as the input space. A perfect hash function for the input integers 1-100 must have at least 100 different output values.
Note that according to Wikipedia the Jenkins hash functions are not cryptographic. This means that you should avoid them for password security and the like, but you can use the hash for somewhat even work distribution and checksums.

Why Does a Bloom Filter Need Multiple Hash Functions?

I don't really understand why a bloom filter requires multiple hash functions (say, SHA and MD5).
Why not just make a bigger SHA hash, for example, and then break it up into multiple parts and treat them as separate hashes? Isn't that more efficient in terms of speed?
The idea is to use several different but simple hash functions. If you're going to use some cryptographic hash function like SHA or MD5, then you could just vary the input to it. Whether it's more efficient depends on how complex your hash functions are.
It's called double/triple hashing, and it reduces the chance of false positives: a lookup is wrong only when all of its derived bit positions collide with positions set by other elements, which is far less likely with five positions than with one.
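A sketch of both answers' suggestion in Go: compute one 64-bit FNV-1a digest, split it into two 32-bit halves, and derive all k bit positions as g_i(x) = h1(x) + i*h2(x) mod m (the double-hashing construction analyzed by Kirsch and Mitzenmacher; the filter size and k below are arbitrary choices):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const (
	mBits = 1024 // filter size in bits (arbitrary for the sketch)
	k     = 5    // number of derived hash functions
)

// indices derives k bit positions from a single 64-bit digest using
// g_i(x) = h1(x) + i*h2(x) mod m, where h1 and h2 are the two halves.
func indices(key string) []uint32 {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	h1 := uint32(sum)       // low 32 bits
	h2 := uint32(sum >> 32) // high 32 bits act as the second hash
	out := make([]uint32, k)
	for i := uint32(0); i < k; i++ {
		out[i] = (h1 + i*h2) % mBits
	}
	return out
}

func main() {
	var bits [mBits]bool

	// Insert: set all k derived positions.
	for _, i := range indices("foobar") {
		bits[i] = true
	}

	// Query: an element is possibly present only if all k positions are set.
	present := true
	for _, i := range indices("foobar") {
		present = present && bits[i]
	}
	fmt.Println("foobar possibly present:", present)
}
```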