Explanation about hashing and its use for data compression - hash

I am facing an application that uses hashing, but I cannot still figure out how it works. Here is my problem, hashing is used to generate some index, and with those indexes I access to different tables, and after I add the value of every table that I get using the indexes and with that I get my final value. This is done to reduce the memory requirements. The input to the hashing function is doing the XOR between a random constant number and some parameters from the application.
Is this a typical hashing application?. The thing that I do not understand is how using hashing can we reduce the memory requirements?. Can anyone clarify this?.
Thank you

Hashing alone doesn't have anything to do with memory.
What it is often used for is a hashtable. Hashtables work by computing the hash of what you are keying off of, which is then used as an index into a data structure.
Hashing allows you to reduce the key (string, etc.) into a more compact value like an integer or set of bits.
That might be the memory savings you're referring to--reducing a large key to a simple integer.
Note, though, that hashes are not unique! A good hashing algorithm minimizes collisions but they are not intended to reduce to a unique value--doing so isn't possible (e.g., if your hash outputs a 32bit integer, your hash would have only 2^32 unique values).

Is it a bloom filter you are talking about? This uses hash functions to get a space efficient way to test membership of a set. If so then see the link for an explanation.

Most good hash implementations are memory inefficient, otherwise there would be more computing involved - and that would exactly be missing the point of hashing.
Hash implementations are used for processing efficiency, as they'll provide you with constant running time for operations like insertion, removal and retrieval.
You can think about the quality of hashing in a way that all your data, no matter what type or size, is always represented in a single fixed-length form.

This could be explained if the hashing being done isn't to build a true hash table, but is to just create an index in a string/memory block table. If you had the same string (or memory sequence) 20 times in your data, and you then replaced all 20 instances of that string with just its hash/table index, you could achieve data compression in that way. If there's an actual collision chain contained in that table for each hash value, however, then what I just described is not what's going on; in that case, the reason for the hashing would most likely be to speed up execution (by providing quick access to stored values), rather than compression.

Related

Hashing a string of bounded size

Assuming I have a bounded input string of maximum length 64 characters [0-9,a-z,A-Z]. Given the following code using sha1 hash:
var hash = sha1(str).substring(0,n)
I want to minimize the integer n while still acceptably avoiding collisions.
How to do I calculate the probability of a collision given n and an input set size x?
There is no length that guarantees that there won't be any collision. Even the full 20-byte SHA-1 does not guarantee that there are no collisions: it is computationally expensive to craft collision, but it has been done). Even a 64-byte SHA-512 value does not give a mathematical guarantee that there are no collisions, but the best known ways to find a collision require more energy than is available in the solar system.
If you want a practical guarantee that there are no collisions (even in the face of hostile input), you can use a cryptographic hash that has not been broken, such as SHA-256.
But if this is for indexing rather than security, hashes are usually not a practical way to ensure the absence of collisions. Use a non-cryptographic hash instead. Non-cryptographic hashes make it easy to craft collisions, but they are faster to compute. If there is a collision, use a secondary hash, a binary search in a sorted data structure or a linear search to resolve the ambiguity. This is how data structures such as hash tables work.
There is one case where you can ensure that there are no collisions: when you're working with a fixed data set. In that case, you can calculate a perfect hash function from the data.
Alternatively, hashing may be the wrong tool for the job. Maybe you should keep a central database of indexes instead.

Hash UUIDs without requiring ordering

I have two UUIDs. I want to hash them perfectly to produce a single unique value, but with a constraint that f(m,n) and f(n,m) must generate the same hash.
UUIDs are 128-bit values
the hash function should have no collisions - all possible input pairings must generate unique hash values
f(m,n) and f(n,m) must generate the same hash - that is, ordering is not important
I'm working in Go, so the resulting value must fit in a 256-bit int
the hash does not need to be reversible
Can anyone help?
Concatenate them with the smaller one first.
To build on user2357112's brilliant solution and boil down the comment chain, let's consider your requirements one by one (and out of order):
No collisions
Technically, that's not a hash function. A hash function is about mapping heterogeneous, arbitrary length data inputs into fixed-width, homogenous outputs. The only way to accomplish that if the input is longer than the output is through some data loss. For most applications, this is tolerable because the hash function is only used as a fast lookup key and the code falls back onto the slower, complete comparison of the data. That's why many guides and languages insist that if you implement one, you must implement the other.
Fortunately, you say:
Two UUID inputs m and n
UUIDs are 128 bits each
Output of f(m,n) must be 256 bits or less
Combined your two inputs are exactly 256 bits, which means you do not have to lose any data. If you needed a smaller output, then you would be out of luck. As it is, you can concatenate the two numbers together and generate a perfect, unique representation.
f(m,n) and f(n,m) must generate the same hash
To accomplish this final requirement, make a decision on the concatenation order by some intrinsic value of the two UUIDs. The suggested smaller-first works just great. However...
The hash does not need to be reversible
If you specifically need irreversible hashing, that's a different question entirely. You could still use the less-than comparison to ensure order independence when feeding to a cryptographically hash function, but you would be hard pressed to find something that guaranteed no collisions even with fixed-width inputs a 256 bit output width.

Is there a solution to creating a perfect hash table for non-finite inputs?

So hash tables are really cool for constant-time lookups of data in sets, but as I understand they are limited by possible hashing collisions which leads to increased small amounts of time-complexity.
It seems to me like any hashing function that supports a non-finite range of inputs is really a heuristic for reducing collision. Are there any absolute limitations to creating a perfect hash table for any range of inputs, or is it just something that no one has figured out yet?
I think this depends on what you mean by "any range of inputs."
If your goal is to create a hash function that can take in anything and never produce a collision, then there's no way to do what you're asking. This is a consequence of the pigeonhole principle - if you have n objects that can be hashed, you need at least n distinct outputs for your hash function or you're forced to get at least one hash collision. If there are infinitely many possible input objects, then no finite hash table could be built that will always avoid collisions.
On the other hand, if your goal is to build a hash table where lookups are worst-case O(1) (that is, you only have to look at a fixed number of locations to find any element), then there are many different options available. You could use a dynamic perfect hash table or a cuckoo hash table, which supports worst-case O(1) lookups and expected O(1) insertions and deletions. These hash tables work by using a variety of different hash functions rather than any one fixed hash function, which helps circumvent the above restriction.
Hope this helps!

best way to resolve collisions in hashing strings

I got asked this question at an interview and said to use a second has function, but the interviewer kept probing me for other answers. Anyone have other solutions?
best way to resolve collisions in hashing strings
"with continuous inserts"
Assuming the inserts are of strings whose contents can't be predicted, then reasonable options are:
Use a displacement list, so you try a number of offsets from the
hashed-to bucket until you find a free bucket (modding by table
size). Displacement lists might look something like { 3, 5, 11,
19... } etc. - ideally you want to have the difference between
displacements not be the sum of a sequence of other displacements.
rehash using a different algorithm (but then you'd need yet another
algorithm if you happen to clash twice etc.)
root a container in the
buckets, such that colliding strings can be searched for. Typically
the number of buckets should be similar to or greater than the
number of elements, so elements per bucket will be fairly small and
a brute-force search through an array/vector is a reasonable
approach, but a linked list is also credible.
Comparing these, displacement lists tend to be fastest (because adding an offset is cheaper than calculating another hash or support separate heap & allocation, and in most cases the first one or two displacements (which can reasonably be by a small number of buckets) is enough to find an empty bucket so the locality of memory use is reasonable) though they're more collision prone than an alternative hashing algorithm (which should approach #elements/#buckets chance of further collisions). With both displacement lists and rehashing you have to provide enough retries that in practice you won't expect a complete failure, add some last-resort handling for failures, or accept that failures may happen.
Use a linked list as the hash bucket. So any collisions are handled gracefully.
Alternative approach: You might want to concider using a trie instead of a hash table for dictionaries of strings.
The up side of this approach is you get O(|S|) worst case complexity for seeking/inserting each string [where |S| is the length of that string]. Note that hash table allows you only average case of O(|S|), where the worst case is O(|S|*n) [where n is the size of the dictionary]. A trie also does not require rehashing when load balance is too high.
Assuming we are not using a perfect hash function (which you usually don't have) the hash tells you that:
if the hashes are different, the objects are distinct
if the hashes are the same, the objects are probably the same (if good hashing function is used), but may still be distinct.
So in a hashtable, the collision will be resolved with some additional checking if the objects are actually the same or not (this brings some performance penalty, but according to Amdahl's law, you still gained a lot, because collisions rarely happen for good hashing functions). In a dictionary you just need to resolve that rare collision cases and assure you get the right object out.
Using another non-perfect hash function will not resolve anything, it just reduces the chance of (another) collision.

How are hash functions like MD5 unique?

I'm aware that MD5 has had some collisions but this is more of a high-level question about hashing functions.
If MD5 hashes any arbitrary string into a 32-digit hex value, then according to the Pigeonhole Principle surely this can not be unique, as there are more unique arbitrary strings than there are unique 32-digit hex values.
You're correct that it cannot guarantee uniqueness, however there are approximately 3.402823669209387e+38 different values in a 32 digit hex value (16^32). That means that, assuming the math behind the algorithm gives a good distribution, your odds are phenomenally small that there will be a duplicate. You do have to keep in mind that it IS possible to duplicate when you're thinking about how it will be used. MD5 is generally used to determine if something has been changed (I.e. it's a checksum). It would be ridiculously unlikely that something could be modified and result in the same MD5 checksum.
Edit: (given recent news re: SHA1 hashes)
The answer above, still holds, but you shouldn't expect an MD5 hash to serve as any kind of security check against manipulation. SHA-1 Hashes as 2^32 (over 4 billion) times less likely to collide, and it has been demonstrated that it is possible to contrive an input to produce the same value. (This was demonstrated against MD5 quite some time ago). If you're looking to ensure nobody has maliciously modified something to produce the same hash value, these days, you need at SHA-2 to have a solid guarantee.
On the other hand, if it's not in a security check context, MD5 still has it's usefulness.
The argument could be made that an SHA-2 hash is cheap enough to compute, that you should just use it anyway.
You are absolutely correct. But hashes are not about "unique", they are about "unique enough".
As others have pointed out, the goal of a hash function like MD5 is to provide a way of easily checking whether two objects are equivalent, without knowing what they originally were (passwords) or comparing them in their entirety (big files).
Say you have an object O and its hash hO. You obtain another object P and wish to check whether it is equal to O. This could be a password, or a file you downloaded (in which case you won't have O but rather the hash of it hO that came with P, most likely). First, you hash P to get hP.
There are now 2 possibilities:
hO and hP are different. This must mean that O and P are different, because using the same hash on 2 values/objects must yield the same value. Hashes are deterministic. There are no false negatives.
hO and hP are equal. As you stated, because of the Pigeonhole Principle this could mean that different objects hashed to the same value, and further action may need to be taken.
a. Because the number of possibilities is so high, if you have faith in your hash function it may be enough to say "Well there was a 1 in 2128 chance of collision (ideal case), so we can assume O = P. This may work for passwords if you restrict the length and complexity of characters, for example. It is why you see hashes of passwords stored in databases rather than the passwords themselves.
b. You may decide that just because the hash came out equal doesn't mean the objects are equal, and do a direct comparison of O and P. You may have a false positive.
So while you may have false positive matches, you won't have false negatives. Depending on your application, and whether you expect the objects to always be equal or always be different, hashing may be a superfluous step.
Cryptographic one-way hash functions are, by nature of definition, not Injective.
In terms of hash functions, "unique" is pretty meaningless. These functions are measured by other attributes, which affects their strength by making it hard to create a pre-image of a given hash. For example, we may care about how many image bits are affected by changing a single bit in the pre-image. We may care about how hard it is to conduct a brute force attack (finding a prie-image for a given hash image). We may care about how hard it is to find a collision: finding two pre-images that have the same hash image, to be used in a birthday attack.
While it is likely that you get collisions if the values to be hashed are much longer than the resulting hash, the number of collisions is still sufficiently low for most purposes (there are 2128 possible hashes total so the chance of two random strings producing the same hash is theoretically close to 1 in 1038).
MD5 was primarily created to do integrity checks, so it is very sensitive to minimal changes. A minor modification in the input will result in a drastically different output. This is why it is hard to guess a password based on the hash value alone.
While the hash itself is not reversible, it is still possible to find a possible input value by pure brute force. This is why you should always make sure to add a salt if you are using MD5 to store password hashes: if you include a salt in the input string, a matching input string has to include exactly the same salt in order to result in the same output string because otherwise the raw input string that matches the output will fail to match after the automated salting (i.e. you can't just "reverse" the MD5 and use it to log in because the reversed MD5 hash will most likely not be the salted string that originally resulted in the creation of the hash).
So hashes are not unique, but the authentication mechanism can be made to make it sufficiently unique (which is one somewhat plausible argument for password restrictions in lieu of salting: the set of strings that results in the same hash will probably contain many strings that do not obey the password restrictions, so it's more difficult to reverse the hash by brute force -- obviously salts are still a good idea nevertheless).
Bigger hashes mean a larger set of possible hashes for the same input set, so a lower chance of overlap, but until processing power advances sufficiently to make brute-forcing MD5 trivial, it's still a decent choice for most purposes.
(It seems to be Hash Function Sunday.)
Cryptographic hash functions are designed to have very, very, very, low duplication rates. For the obvious reason you state, the rate can never be zero.
The Wikipedia page is informative.
As Mike (and basically every one else) said, its not perfect, but it does the job, and collision performance really depends on the algo (which is actually pretty good).
What is of real interest is automatic manipulation of files or data to keep the same hash with different data, see this Demo
As others have answered, hash functions are by definition not guaranteed to return unique values, since there are a fixed number of hashes for an infinite number of inputs. Their key quality is that their collisions are unpredictable.
In other words, they're not easily reversible -- so while there may be many distinct inputs that will produce the same hash result (a "collision"), finding any two of them is computationally infeasible.