Assume I have a bounded input string of maximum length 64 characters [0-9a-zA-Z], and the following code using a SHA-1 hash:
var hash = sha1(str).substring(0,n)
I want to minimize the integer n while still acceptably avoiding collisions.
How do I calculate the probability of a collision given n and an input set size x?
There is no length that guarantees that there won't be any collision. Even the full 20-byte SHA-1 does not guarantee that there are no collisions: it is computationally expensive to craft a collision, but it has been done. Even a 64-byte SHA-512 value does not give a mathematical guarantee that there are no collisions, but the best known ways to find a collision require more energy than is available in the solar system.
If you want a practical guarantee that there are no collisions (even in the face of hostile input), you can use a cryptographic hash that has not been broken, such as SHA-256.
But if this is for indexing rather than security, hashes are usually not a practical way to ensure the absence of collisions. Use a non-cryptographic hash instead. Non-cryptographic hashes make it easy to craft collisions, but they are faster to compute. If there is a collision, use a secondary hash, a binary search in a sorted data structure or a linear search to resolve the ambiguity. This is how data structures such as hash tables work.
There is one case where you can ensure that there are no collisions: when you're working with a fixed data set. In that case, you can calculate a perfect hash function from the data.
Alternatively, hashing may be the wrong tool for the job. Maybe you should keep a central database of indexes instead.
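For the probability calculation the question asks about, the usual estimate is the birthday bound: with x inputs hashed uniformly into 16^n truncated values, the collision probability is roughly 1 - exp(-x(x-1)/(2*16^n)). A minimal sketch in Python (the function name collision_probability is mine, and uniform hash output is assumed):

import math

def collision_probability(n_hex_chars, x):
    # Birthday-bound estimate of the chance that at least two of x uniformly
    # random values collide when each is truncated to n_hex_chars hex characters,
    # i.e. 16**n_hex_chars possible buckets.
    buckets = 16 ** n_hex_chars
    return 1 - math.exp(-x * (x - 1) / (2 * buckets))

# Example: 1 million inputs, keeping 12 hex characters (48 bits) of the digest
print(collision_probability(12, 1_000_000))   # roughly 0.0018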
Related
We're trying to settle an internal debate on our dev team:
We're looking for a 64-bit PHP hash function. We found a PHP implementation of MurmurHash3, but MurmurHash3 is either 32-bit or 128-bit, not 64-bit.
Co-worker #1 believes that to produce a 64-bit hash from MurmurHash3, we can simply slice the first (or last, or any) 64 bits of the 128-bit hash and that it will be as collision-proof as a native 64-bit hash function.
Co-worker #2 believes that we must find a native 64-bit hash function to reduce collisions and that 64-bit slices of a 128-bit hash will not be as collision proof as a native 64-bit hash.
Who's correct?
Does the answer change if we take the first (or last, or any) 64-bits of a cryptographic hash like SHA1 instead of Murmur3?
If you had real random, uniformly distributed values, then "slicing" would yield exactly the same results as if you had started with the smaller value right from the start. To see why, consider this very simple example: Let's say your random generator outputs 3 random bits, but you only need one random bit to work with. Let's assume the output is
b1 b2 b3
The possible values are
000, 001, 010, 011, 100, 101, 110, 111
and all are to occur with equal probability of 1/8. Now whatever bit you slice from those three for your purpose - the first, second or third - the probability of having a '1' is always going to be 1/2, regardless of the position - and the same is true for a '0'.
You can easily scale this experiment to the 64 out of 128 bit case: regardless of which bits you slice, the probability of ending up with a one or a zero in a certain position is going to be one half. What this means is that if you had a sample taken from a uniformly distributed random variable, then slicing wouldn't make the probability for collisions more or less likely.
Now a good question is whether random functions are really the best we can do to prevent collisions. But as it turns out, it can be shown that the probability of finding collisions increases whenever a function deviates from random.
Cryptographic hash functions: co-worker #1 wins
The problem in real life is that hash functions are not random at all; on the contrary, they are boringly deterministic. But a design goal of cryptographic hash functions is as follows: if we didn't know their initial state, then their output would be computationally indistinguishable from a real random function, that is, there's no computationally efficient way to tell the difference between the hash output and real random values. This is why a hash is already considered kind of broken if you can find a "distinguisher", a method to tell the hash apart from real random values with a probability higher than one half. Unfortunately, we can't really prove these properties for existing cryptographic hashes, but unless somebody breaks them, we may assume these properties hold with some confidence. Here is an example of a paper about a distinguisher for one of the SHA-3 submissions that illustrates the process.
To summarize, unless a distinguisher is found for a given cryptographic hash, slicing is perfectly fine and does not increase the probability of a collision.
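As a concrete illustration of slicing (a sketch in Python rather than the PHP from the question, using SHA-256 since the cryptographic case is the one where slicing is safe; the helper name hash64 is mine):

import hashlib

def hash64(data: bytes) -> int:
    # Take the first 64 bits (8 bytes) of a cryptographic digest and
    # interpret them as an unsigned 64-bit integer.
    digest = hashlib.sha256(data).digest()    # 32-byte digest
    return int.from_bytes(digest[:8], "big")  # any 8-byte slice works equally well

print(hash64(b"hello"))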
Non-cryptographic hash functions: co-worker #2 might win
Non-cryptographic hashes do not have to satisfy the same set of requirements as cryptographic hashes do. They are usually defined to be very fast and satisfy certain properties "under sane/benevolent conditions", but they might easily fall short if somebody tries to maliciously manipulate them. A good example for what this means in practice is the computational complexity attack on hash table implementations (hashDoS) presented earlier this year. Under normal conditions, non-crypto hashes work perfectly fine, but their collision resistance may be severely undermined by some clever inputs. This can't happen with cryptographic hash functions, because their very definition requires them to be immune to all sorts of clever inputs.
Because it is possible, sometimes even quite easy, to find a distinguisher like above for the output of non-cryptographic hashes, we can immediately say that they do not qualify as cryptographic hash functions. Being able to tell the difference means that somewhere there is a pattern or bias in the output.
And this fact alone implies that they deviate more or less from a random function, and thus (after what we said above) collisions are probably more likely than they would be for random functions. Finally, since collisions already occur with higher probability for the full 128 bits, this will not get better with shorter outputs; collisions will probably be even more likely in that case.
tl;dr You're safe with a cryptographic hash function when truncating it. But you're better off with a "native" 64-bit cryptographic hash function than with a non-cryptographic hash with a larger output truncated to 64 bits.
Due to the avalanche effect, a strong hash is one where a single bit of change in the source results in half the bits of the hash flipping on average. For a good hash, then, the "hashness" is evenly distributed, and so each section or slice is affected by an equal and evenly distributed amount of source bits, and therefore is just as strong as any other slice of the same bit length could be.
I would agree with co-worker 1 as long as the hash has good properties and even distribution.
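A quick way to observe the avalanche effect described above (a sketch in Python using SHA-256 via hashlib; the roughly-half figure is the expected average, and individual input pairs will vary):

import hashlib

def hamming_distance_of_digests(a: bytes, b: bytes) -> int:
    da = int.from_bytes(hashlib.sha256(a).digest(), "big")
    db = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(da ^ db).count("1")   # number of differing bits out of 256

# Flip a single bit of the input and count how many output bits change.
msg = bytearray(b"The quick brown fox")
flipped = bytearray(msg)
flipped[0] ^= 0x01
print(hamming_distance_of_digests(bytes(msg), bytes(flipped)))  # typically around 128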
This question seems incomplete without this being mentioned:
Some hashes are provably perfect hashes for a specific class of inputs (e.g., for inputs of length n for some reasonable value of n). If you truncate that hash then you are likely to destroy that property, in which case you are, by definition, increasing the rate of collisions from zero to non-zero and you have weakened the hash in that use case.
It's not the general case, but it's an example of a legitimate concern when truncating hashes.
Ok so here's the use case. I have lots of somewhat lengthy (200-500 character) strings that I'd like to have a smaller deterministic hash for. Since I can store the full 160-bit SHA1 value in a mere 20 bytes, this yields an order of magnitude space improvement per string.
But of course one has to worry about collisions when hashing strings, even with a crypto hash with a decent avalanche effect. I know the chances are infinitesimally small, but I'd like to be more conservative. If I do something like this:
hash(input) = CONCAT(HF1(input),HF2(input))
where HF1 is some suitably robust hash function and HF2 is another, distinct but equally robust hash function. Does this effectively make the chance of a collision near impossible (at the cost of 40 bytes now instead of 20)?
NOTE: I am not concerned with the security/crypto implications of SHA-1 for my use case.
CLARIFICATION: the original question asked about hashing the concatenated hash values, not concatenating the hashes; hashing the concatenation DOES NOT change the collision probability of the outer hash function.
Assuming "reasonable" hash functions, then by concatenating, all you're doing is creating a hash function with a larger output space. So yes, this reduces the probability of collision.
But either way, it's probably not worth worrying about. 2^320 is around 10^96, vastly more than the estimated number of particles in the observable universe (roughly 10^80). So you only need to worry if you're expecting attackers.
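For illustration, a minimal sketch of the CONCAT(HF1, HF2) construction from the question, assuming SHA-1 as HF1 and a truncated SHA-256 as HF2 purely to hit the 40-byte figure mentioned (any two robust, independent hash functions would do):

import hashlib

def concat_hash(data: bytes) -> bytes:
    # hash(input) = HF1(input) || HF2(input)
    h1 = hashlib.sha1(data).digest()          # 20 bytes
    h2 = hashlib.sha256(data).digest()[:20]   # 20 bytes (truncated for the 40-byte total)
    return h1 + h2

print(concat_hash(b"example").hex())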
I asked the wrong question initially. This was probably the question I was looking for:
Probability of SHA1 collisions
This was also illuminating
Understanding sha-1 collision weakness
I guess it's fair to ask whether, if I had two hash functions whose concatenated size was smaller than 20 bytes (say, two distinct 32-bit hash functions), concatenating them would produce a collision probability small enough to ignore in practice, since two (or even three) of those concatenated would still be smaller than SHA-1.
Say a known SHA-1 hash was calculated by concatenating several chunks of data, and that the order in which the chunks were concatenated is unknown. The straightforward way to find the order of the chunks that gives the known hash would be to calculate a SHA-1 hash for each possible ordering until the known hash is found.
Is it possible to speed this up by calculating an SHA1 hash separately for each chunk and then find the order of the chunks by only manipulating the hashes?
In short, No.
If you are using SHA-1, then due to the avalanche effect, any tiny change in the plaintext (in your case, your chunks) alters the corresponding SHA-1 hash significantly.
Say you have four chunks: A, B, C, and D.
The SHA-1 hash of A+B+C+D (concatenated) is supposed to be uncorrelated with the SHA-1 hashes of A, B, C, and D computed separately.
Since they are unrelated, you cannot draw any relationship between the concatenated input (A+B+C+D, B+C+A+D, etc.) and each individual chunk (A, B, C, or D).
If you could identify such a relationship, the SHA-1 hashing algorithm would be in trouble.
Practical answer: no. If the hash function you use is any good, then it is supposed to look like a random oracle, the output of which on any given input is totally unknown until that input is tried. So you cannot infer anything from the hashes you compute until you hit the exact input ordering that you are looking for. (Strictly speaking, there could exist a hash function which has the usual properties of a hash function, namely collision and preimage resistance, without being a random oracle, but departing from the RO model is still considered a hash function weakness.) (Also strictly speaking, it is slightly improper to talk about a random oracle for a single, unkeyed function.)
Theoretical answer: it depends. Assuming, for simplicity, that you have N chunks of 512 bits, then you can arrange for the cost not to exceed N*2^160 elementary evaluations of SHA-1, which is lower than N! when N >= 42. The idea is that the running state of SHA-1, between two successive blocks, is limited to 160 bits. Of course, that cost is ridiculously infeasible anyway. More generally, your problem is about finding a preimage to SHA-1 with inputs in a custom set S (the N! sequences of your N chunks), so the cost has a lower bound of the size of S and the preimage resistance of SHA-1, whichever is lower. The size of S is N!, which grows very fast when N is increased. SHA-1 has no known weakness with regards to preimages, so its resistance is still assumed to be about 2^160 (since it has a 160-bit output).
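A tiny sketch to check the N >= 42 crossover mentioned above (it just finds the smallest N for which N! exceeds N*2^160):

from math import factorial

def crossover():
    n = 2
    while factorial(n) <= n * 2**160:
        n += 1
    return n

print(crossover())   # 42: the smallest N for which N! exceeds N * 2**160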
Depending on your hashing library, something like this may work: Say you have blocks A, B, C, and D. You can process the hash for block A, and then clone that state and calculate A+B, A+C, and A+D without having to recalculate A each time. And then you can clone each of those to calculate A+B+C and A+B+D from A+B, A+C+B and A+C+D from A+C, and so on.
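For instance, Python's hashlib supports exactly this through copy() (a sketch of the state-cloning idea described above; whether your own hashing library offers an equivalent depends on the library):

import hashlib

A, B, C = b"chunk-A", b"chunk-B", b"chunk-C"

h_a = hashlib.sha1()
h_a.update(A)            # state after processing A, computed once

h_ab = h_a.copy()        # clone the state instead of re-hashing A
h_ab.update(B)           # state for A+B

h_ac = h_a.copy()
h_ac.update(C)           # state for A+C

h_abc = h_ab.copy()
h_abc.update(C)          # state for A+B+C
print(h_abc.hexdigest() == hashlib.sha1(A + B + C).hexdigest())  # True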
Nope. Calculating the complete SHA1 hash requires that the chunks be put in in order. The calculation of the next hash chunk requires the output of the current one. If that wasn't true then it would be much easier to manipulate documents so that you could reorder the chunks at will, which would greatly decrease the usefulness of the algorithm.
I'm aware that MD5 has had some collisions but this is more of a high-level question about hashing functions.
If MD5 hashes any arbitrary string into a 32-digit hex value, then according to the Pigeonhole Principle surely this cannot be unique, as there are more unique arbitrary strings than there are unique 32-digit hex values.
You're correct that it cannot guarantee uniqueness; however, there are approximately 3.402823669209387e+38 different values in a 32-digit hex value (16^32). That means that, assuming the math behind the algorithm gives a good distribution, your odds are phenomenally small that there will be a duplicate. You do have to keep in mind that it IS possible to duplicate when you're thinking about how it will be used. MD5 is generally used to determine if something has been changed (i.e., it's a checksum). It would be ridiculously unlikely that something could be accidentally modified and result in the same MD5 checksum.
Edit: (given recent news re: SHA1 hashes)
The answer above still holds, but you shouldn't expect an MD5 hash to serve as any kind of security check against manipulation. SHA-1 hashes are 2^32 (over 4 billion) times less likely to collide, and yet it has been demonstrated that it is possible to contrive two inputs that produce the same value. (This was demonstrated against MD5 quite some time ago.) If you're looking to ensure nobody has maliciously modified something to produce the same hash value, these days you need at least SHA-2 to have a solid guarantee.
On the other hand, if it's not in a security-check context, MD5 still has its uses.
The argument could be made that a SHA-2 hash is cheap enough to compute that you should just use it anyway.
You are absolutely correct. But hashes are not about "unique", they are about "unique enough".
As others have pointed out, the goal of a hash function like MD5 is to provide a way of easily checking whether two objects are equivalent, without knowing what they originally were (passwords) or comparing them in their entirety (big files).
Say you have an object O and its hash hO. You obtain another object P and wish to check whether it is equal to O. This could be a password, or a file you downloaded (in which case you won't have O but rather the hash of it hO that came with P, most likely). First, you hash P to get hP.
There are now 2 possibilities:
hO and hP are different. This must mean that O and P are different, because hashing the same value must always yield the same hash; hashes are deterministic. There are no false negatives.
hO and hP are equal. As you stated, because of the Pigeonhole Principle this could mean that different objects hashed to the same value, and further action may need to be taken.
a. Because the number of possibilities is so high, if you have faith in your hash function it may be enough to say "Well, there was a 1 in 2^128 chance of collision (ideal case), so we can assume O = P." This may work for passwords if you restrict the length and complexity of characters, for example. It is why you see hashes of passwords stored in databases rather than the passwords themselves.
b. You may decide that just because the hash came out equal doesn't mean the objects are equal, and do a direct comparison of O and P. You may have a false positive.
So while you may have false positive matches, you won't have false negatives. Depending on your application, and whether you expect the objects to always be equal or always be different, hashing may be a superfluous step.
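A compact sketch of the compare-by-hash workflow described above (Python; MD5 is used only because it is the hash under discussion, and the helper names are mine):

import hashlib

def probably_equal(o: bytes, p: bytes) -> bool:
    # Cheap check first: different digests prove the objects differ (no false negatives).
    return hashlib.md5(o).digest() == hashlib.md5(p).digest()

def definitely_equal(o: bytes, p: bytes) -> bool:
    # Option (b): fall back to a full comparison when the digests match,
    # ruling out the (rare) false positive.
    return probably_equal(o, p) and o == p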
Cryptographic one-way hash functions are, by definition, not injective.
In terms of hash functions, "unique" is pretty meaningless. These functions are measured by other attributes, which affect their strength by making it hard to create a preimage of a given hash. For example, we may care about how many image bits are affected by changing a single bit in the preimage. We may care about how hard it is to conduct a brute-force attack (finding a preimage for a given hash image). We may care about how hard it is to find a collision: finding two preimages that have the same hash image, to be used in a birthday attack.
While it is likely that you get collisions if the values to be hashed are much longer than the resulting hash, the number of collisions is still sufficiently low for most purposes (there are 2^128 possible hashes total, so the chance of two random strings producing the same hash is theoretically close to 1 in 10^38).
MD5 was primarily created to do integrity checks, so it is very sensitive to minimal changes. A minor modification in the input will result in a drastically different output. This is why it is hard to guess a password based on the hash value alone.
While the hash itself is not reversible, it is still possible to find a possible input value by pure brute force. This is why you should always add a salt if you are using MD5 to store password hashes: if you include a salt in the input string, a matching input string has to include exactly the same salt in order to produce the same output string. Otherwise, a raw input string that happens to match the output will fail to match once the salt is applied automatically (i.e., you can't just "reverse" the MD5 and use it to log in, because the reversed MD5 hash will most likely not be the salted string that originally produced the hash).
So hashes are not unique, but the authentication mechanism can be made sufficiently unique in practice (which is one somewhat plausible argument for password restrictions in lieu of salting: the set of strings that result in the same hash will probably contain many strings that do not obey the password restrictions, so it's harder to reverse the hash by brute force; obviously salts are still a good idea nevertheless).
Bigger hashes mean a larger set of possible hashes for the same input set, so a lower chance of overlap, but until processing power advances sufficiently to make brute-forcing MD5 trivial, it's still a decent choice for most purposes.
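A minimal sketch of the salting idea from this answer (Python; MD5 is kept only because it is what the answer discusses, and the salt-hex$digest storage format is my own illustration, not a standard):

import hashlib, os

def hash_password(password, salt=None):
    # Store the salt alongside the digest so the same salt is reused at login.
    salt = salt or os.urandom(16)
    digest = hashlib.md5(salt + password.encode()).hexdigest()
    return salt.hex() + "$" + digest

def verify_password(password, stored):
    salt_hex, digest = stored.split("$")
    return hashlib.md5(bytes.fromhex(salt_hex) + password.encode()).hexdigest() == digest

record = hash_password("hunter2")
print(verify_password("hunter2", record))      # True
print(verify_password("wrong guess", record))  # False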
(It seems to be Hash Function Sunday.)
Cryptographic hash functions are designed to have very, very, very, low duplication rates. For the obvious reason you state, the rate can never be zero.
The Wikipedia page is informative.
As Mike (and basically everyone else) said, it's not perfect, but it does the job, and collision performance really depends on the algorithm (which is actually pretty good).
What is of real interest is the automated manipulation of files or data to keep the same hash with different data; see this demo.
As others have answered, hash functions are by definition not guaranteed to return unique values, since there are a fixed number of hashes for an infinite number of inputs. Their key quality is that their collisions are unpredictable.
In other words, they're not easily reversible -- so while there may be many distinct inputs that will produce the same hash result (a "collision"), finding any two of them is computationally infeasible.
I am facing an application that uses hashing, but I still cannot figure out how it works. Here is my problem: hashing is used to generate some indexes, and with those indexes I access different tables; I then add up the values I get from those tables, and that gives me my final value. This is done to reduce the memory requirements. The input to the hashing function is the XOR of a random constant number and some parameters from the application.
Is this a typical hashing application? The thing that I do not understand is how using hashing can reduce the memory requirements. Can anyone clarify this?
Thank you
Hashing alone doesn't have anything to do with memory.
What it is often used for is a hashtable. Hashtables work by computing the hash of what you are keying off of, which is then used as an index into a data structure.
Hashing allows you to reduce the key (string, etc.) into a more compact value like an integer or set of bits.
That might be the memory savings you're referring to--reducing a large key to a simple integer.
Note, though, that hashes are not unique! A good hashing algorithm minimizes collisions, but hashes are not intended to reduce to a unique value; doing so isn't possible (e.g., if your hash outputs a 32-bit integer, it can take only 2^32 distinct values).
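A toy sketch of the idea of reducing a large key to a small index, with chaining to handle the collisions just mentioned (Python; the class name TinyHashTable is mine):

class TinyHashTable:
    # Toy hash table: the key is reduced to a small integer index,
    # and collisions are resolved by chaining within each bucket.
    def __init__(self, size=64):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)   # large key -> small index

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = TinyHashTable()
t.put("some very long key string", 42)
print(t.get("some very long key string"))   # 42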
Is it a Bloom filter you are talking about? Bloom filters use hash functions as a space-efficient way to test membership of a set. If so, see the link for an explanation.
Most good hash implementations are memory-inefficient; if they weren't, more computation would be involved, and that would be exactly missing the point of hashing.
Hash implementations are used for processing efficiency, as they'll provide you with (on average) constant running time for operations like insertion, removal and retrieval.
You can think about the quality of hashing in a way that all your data, no matter what type or size, is always represented in a single fixed-length form.
This could be explained if the hashing being done isn't to build a true hash table, but is to just create an index in a string/memory block table. If you had the same string (or memory sequence) 20 times in your data, and you then replaced all 20 instances of that string with just its hash/table index, you could achieve data compression in that way. If there's an actual collision chain contained in that table for each hash value, however, then what I just described is not what's going on; in that case, the reason for the hashing would most likely be to speed up execution (by providing quick access to stored values), rather than compression.
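A rough sketch of the compression-by-indexing scheme described in this answer (Python; it leans on a dict, which is itself hash-based, rather than spelling out the hash function, and the helper name compress is mine):

def compress(strings):
    # Replace repeated strings with indices into a shared string table.
    table, index_of, encoded = [], {}, []
    for s in strings:
        if s not in index_of:          # dict lookup is itself hash-based
            index_of[s] = len(table)
            table.append(s)
        encoded.append(index_of[s])
    return table, encoded

table, encoded = compress(["foo", "bar", "foo", "foo", "bar"])
print(table)    # ['foo', 'bar']
print(encoded)  # [0, 1, 0, 0, 1]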