I'm looking for an algorithm which given a messy value (like a url with a querystring) produces a value which can be used as a key.
Ideally, it should have a fairly low collision rate, produce a value which is shorter than the input, be made of alphanumerics (a-z 1-9), and create the same output given the same input (though not necessarily reversible).
Anything come to mind?
Several excellent examples exist as industry standards, such as SHA-2 and MD5. There will be a library for running either of these in any language you are using.
MD5 or the SHA functions are fine (though MD5 is weaker in terms of collision resistance; I'd suggest SHA-1, or, best but longest, SHA-512).
They're implemented in OpenSSL for C, or CommonCrypto in Objective-C, etc.
You might also consider using CRC64 as a hash function, because it has a really small footprint and a rather good collision rate -- compared to its length, of course.
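As a quick illustration of the suggestions above, here is a minimal Python sketch (the url_key name and the sample URL are mine, purely for illustration) that maps a messy input to a deterministic lowercase-hex key using the standard hashlib module:

import hashlib

def url_key(url):
    # Same input always yields the same 64-character lowercase hex string.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

print(url_key("https://example.com/search?q=hash&page=2"))

The output is alphanumeric (0-9, a-f), fixed-length, and shorter than most long query-string URLs.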
Quite a lot. MD5, SHA-1, SHA-256, SHA-512, ...
If, say, MD5's 32 characters are too long for you, take a substring of just the first x characters.
Of course there are many other solutions like CRC32 that you could use, but it's hard to say what will be good for your use case without knowing what your goal is...
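A hedged sketch of the truncation idea (the short_key helper and the length of 12 are my own choices for the example):

import hashlib

def short_key(value, length=12):
    # Truncating raises the collision rate: only 16**length keys remain,
    # so pick a length suited to how many distinct inputs you expect.
    return hashlib.md5(value.encode("utf-8")).hexdigest()[:length]

print(short_key("https://example.com/?a=1&b=2"))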
Let's say we have the following:
- String: str
- MD4 hash of the string: MD4(str)
- MD5 hash of the string: MD5(str)
MD4 and MD5 are cryptographically "broken" algorithms, meaning it is not difficult to:
1) find str_2 where MD4(str) = MD4(str_2) (i.e. attack on MD4)
2) find str_3 where MD5(str) = MD5(str_3) (i.e. attack on MD5)
But how hard would it be to:
3) find str_4 where MD4(str) = MD4(str_4) AND MD5(str) = MD5(str_4)
(i.e. attack on MD4 and MD5 simultaneously)?
The obvious (probably not very efficient) way would be to:
1) Find a string STR where MD4(STR) = MD4(str)
2) Check if MD5(STR) = MD5(str)
3) If so, we're done. If not, go back to step 1 and satisfy step 1 with a different string.
But the above algorithm doesn't seem fast to me (or is it?). So is it true that a string hashed by both MD4 and MD5 would be quite safe from a second preimage attack?
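To make that "obvious" search concrete, here is a purely illustrative Python sketch of it; brute force at this scale is astronomically infeasible, and it assumes your OpenSSL build still exposes MD4 through hashlib.new("md4"):

import hashlib, os

def find_simultaneous_second_preimage(target):
    # target is bytes; a real second preimage must of course differ from it.
    t4 = hashlib.new("md4", target).digest()
    t5 = hashlib.md5(target).digest()
    while True:
        candidate = os.urandom(16)                        # pick a new candidate string
        if hashlib.new("md4", candidate).digest() != t4:  # step 1: must collide under MD4
            continue
        if hashlib.md5(candidate).digest() == t5:         # step 2: must also collide under MD5
            return candidate                              # step 3: done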
EDIT:
(1) The main concern is enhancing second pre-image resistance
(2) The main motivation is not to use outdated hashes for today's applications. Rather, it is two-fold: first, I am anticipating the day that hashes considered secure today become broken. For example, if I use only SHA-2, then the day it becomes broken is the same day I will become very worried about my data. But if I use SHA-2 and BCrypt, then even if both become individually broken, it may still be infeasible to defeat the second pre-image resistance of concat(Sha2_hash, Bcrypt_Hash). Second, I want to reduce the chance of accidental collision (the server thinking two inputs are the same because their two hashes just so happen to be the same).
This sort of thing doesn't improve security as much as you think. The resulting (M+N) bit value is actually weaker than the output of a hash that natively generates (M+N) bits of output. This answer on crypto.stackexchange.com goes a little deeper if you want to know more details.
But the bottom line is that when constructing a hash function whose output is the concatenation of other hash functions, the output you get is, at best, as strong as the strongest constituent hash.
And I have to ask why even use MD4 or MD5 and go to this trouble to begin with? Use SHA-3. If you want to feel "extra safe" then calculate the margin of safety that you feel comfortable with, and increase it by some percentage. That is, if you feel that 384 bits are enough, then go for 512.
So, with some more information about what you are trying to do (use the file contents to generate both a "quick checksum" value and a unique locator/identifier for the file at the same time), I still think that choosing a single hash is the better approach.
If you insist on using two hash functions, then I would submit that, instead of concatenating two hashes, the better approach would be to use an HMAC with two different hash functions/algorithms. Please note that I do not have a rigorous proof that this works better, or that this construct won't generate horrible output, so take it with a grain of salt:
Let H1 and H2 be two cryptographically secure hash functions, and let P be your input data. Then, the hash & file identifier for your file is given by the construct:
HMAC(K, P) = H1((KGEN(P) ⊕ PAD1) ∥ H1((KGEN(P) ⊕ PAD2) ∥ P))
Where
KGEN(P) = H2(P)
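A rough Python sketch of this construct, with H1 = SHA-256 and H2 = SHA-512 chosen arbitrarily for illustration; Python's standard hmac module applies the PAD1/PAD2 (ipad/opad) steps of H1 internally:

import hashlib, hmac

def two_hash_identifier(data):
    key = hashlib.sha512(data).digest()                      # KGEN(P) = H2(P)
    return hmac.new(key, data, hashlib.sha256).hexdigest()   # HMAC-H1(KGEN(P), P)

print(two_hash_identifier(b"file contents here"))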
It is somewhat more difficult, because one would need to find a collision for MD4 and for MD5 simultaneously. But "somewhat" is a lame term in cryptography, and rolling your own security scheme is enemy #1. However, there are examples of people chaining algorithms: DES => 3DES, TrueCrypt allowing several encryption algorithms to be chained, or PBKDF2 key derivation running the same algorithm N times.
Seriously, if you need a strong hash, use SHA-2 or newer.
The problem with finding MD4 and MD5 hash collisions is that it's possible to build a chain of devices that would allow an attacker to linearly scale the number of attack attempts, and given a large enough budget this sounds plausible.
How hard is it to find x where sha1(x) = x?
where x is of the form 'c999303647068a6abaca25717850c26c9cd0d89c'.
I think the fact that there are SHA-1 collisions makes this possible, but how easy (or hard) is it to find an example?
Read Cryptanalysis of SHA-1 on Wikipedia. There's more information than you need on that article and its references combined.
Edit:
how hard is it to find x where sha1(x) = x?
Such an attack is known as a preimage attack, and finding such an x is usually much harder than a general collision attack, i.e. finding arbitrary x1 and x2 such that sha1(x1) = sha1(x2).
SHA-1 collisions can be found in 2^63 operations, so I would say it's rather hard. You could go about brute-forcing it. Get the book Applied Cryptography and sit down for a read. Look into the Birthday Paradox, which can be used to find collisions.
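For what a naive brute-force search would even look like, here is a small illustrative Python sketch (my own example); with 16^40 possible 40-character hex strings, the expected number of tries is far beyond reach, so this only makes the problem concrete:

import hashlib, os

def find_sha1_hex_fixed_point(max_tries=1_000_000):
    for _ in range(max_tries):
        x = os.urandom(20).hex()     # random 40-character lowercase hex string
        if hashlib.sha1(x.encode("ascii")).hexdigest() == x:
            return x
    return None

print(find_sha1_hex_fixed_point())   # almost certainly prints None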
The single most important reason for the existence of cryptographic hash functions (of which the SHA family functions are examples) is to make finding inputs corresponding to a given digest difficult. A cryptographic hash function producing N-bit digests is considered good if, to find a matching input, one must on average perform about 2^(N-1) operations (half of the 2^N possibilities), that is, no way other than brute force is reliably possible.
So you are searching for a mathematical invariant of the SHA-1 transformation -- the invariant subspace problem. :-)
I'm aware that MD5 has had some collisions but this is more of a high-level question about hashing functions.
If MD5 hashes any arbitrary string into a 32-digit hex value, then according to the Pigeonhole Principle surely this can not be unique, as there are more unique arbitrary strings than there are unique 32-digit hex values.
You're correct that it cannot guarantee uniqueness; however, there are approximately 3.402823669209387e+38 different values in a 32-digit hex value (16^32). That means that, assuming the math behind the algorithm gives a good distribution, your odds of a duplicate are phenomenally small. You do have to keep in mind that it IS possible to duplicate when you're thinking about how it will be used. MD5 is generally used to determine if something has been changed (i.e. it's a checksum). It would be ridiculously unlikely that something could be modified and result in the same MD5 checksum.
Edit: (given recent news re: SHA1 hashes)
The answer above still holds, but you shouldn't expect an MD5 hash to serve as any kind of security check against manipulation. SHA-1 hashes are 2^32 (over 4 billion) times less likely to collide, and it has been demonstrated that it is possible to contrive an input to produce the same value. (This was demonstrated against MD5 quite some time ago.) If you're looking to ensure nobody has maliciously modified something to produce the same hash value, these days you need at least SHA-2 to have a solid guarantee.
On the other hand, if it's not in a security-check context, MD5 still has its usefulness.
The argument could be made that an SHA-2 hash is cheap enough to compute, that you should just use it anyway.
You are absolutely correct. But hashes are not about "unique", they are about "unique enough".
As others have pointed out, the goal of a hash function like MD5 is to provide a way of easily checking whether two objects are equivalent, without knowing what they originally were (passwords) or comparing them in their entirety (big files).
Say you have an object O and its hash hO. You obtain another object P and wish to check whether it is equal to O. This could be a password, or a file you downloaded (in which case you won't have O but rather the hash of it hO that came with P, most likely). First, you hash P to get hP.
There are now 2 possibilities:
hO and hP are different. This must mean that O and P are different, because hashing the same value always yields the same hash (hashes are deterministic). There are no false negatives.
hO and hP are equal. As you stated, because of the Pigeonhole Principle this could mean that different objects hashed to the same value, and further action may need to be taken.
a. Because the number of possibilities is so high, if you have faith in your hash function it may be enough to say "Well, there was a 1 in 2^128 chance of collision (ideal case), so we can assume O = P." This may work for passwords if you restrict the length and complexity of characters, for example. It is why you see hashes of passwords stored in databases rather than the passwords themselves.
b. You may decide that just because the hash came out equal doesn't mean the objects are equal, and do a direct comparison of O and P. You may have a false positive.
So while you may have false positive matches, you won't have false negatives. Depending on your application, and whether you expect the objects to always be equal or always be different, hashing may be a superfluous step.
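A minimal sketch of option 2b in Python (SHA-256 and the function name are my own choices for the example): reject on differing hashes, and fall back to a direct comparison when the hashes agree:

import hashlib

def probably_equal(blob_o, blob_p):
    # Different hashes prove the objects differ (no false negatives).
    if hashlib.sha256(blob_o).digest() != hashlib.sha256(blob_p).digest():
        return False
    # Equal hashes only suggest equality, so compare directly to rule out a false positive.
    return blob_o == blob_p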
Cryptographic one-way hash functions are, by the nature of their definition, not injective.
In terms of hash functions, "unique" is pretty meaningless. These functions are measured by other attributes, which affect their strength by making it hard to create a pre-image of a given hash. For example, we may care about how many image bits are affected by changing a single bit in the pre-image. We may care about how hard it is to conduct a brute force attack (finding a pre-image for a given hash image). We may care about how hard it is to find a collision: finding two pre-images that have the same hash image, to be used in a birthday attack.
While it is likely that you get collisions if the values to be hashed are much longer than the resulting hash, the number of collisions is still sufficiently low for most purposes (there are 2^128 possible hashes total so the chance of two random strings producing the same hash is theoretically close to 1 in 10^38).
MD5 was primarily created to do integrity checks, so it is very sensitive to minimal changes. A minor modification in the input will result in a drastically different output. This is why it is hard to guess a password based on the hash value alone.
While the hash itself is not reversible, it is still possible to find a possible input value by pure brute force. This is why you should always add a salt if you are using MD5 to store password hashes: if you include a salt in the input string, a matching input string has to include exactly the same salt to produce the same output. Otherwise, a raw input string that happens to match the output will fail to match once the salt is applied automatically (i.e. you can't just "reverse" the MD5 and use it to log in, because the reversed MD5 hash will most likely not be the salted string that originally produced the hash).
So hashes are not unique, but the authentication mechanism can be designed to make them sufficiently unique (which is one somewhat plausible argument for password restrictions in lieu of salting: the set of strings that results in the same hash will probably contain many strings that do not obey the password restrictions, so it's more difficult to reverse the hash by brute force -- obviously salts are still a good idea nevertheless).
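A minimal sketch of the salting idea (MD5 only because it is the hash under discussion; for real password storage a dedicated scheme such as bcrypt, scrypt, or PBKDF2 is a better fit, and the function names here are mine):

import hashlib, os

def hash_password(password, salt=None):
    salt = salt or os.urandom(16)                # random per-user salt
    digest = hashlib.md5(salt + password.encode("utf-8")).hexdigest()
    return salt, digest                          # store both alongside the account

def verify_password(password, salt, expected_digest):
    return hashlib.md5(salt + password.encode("utf-8")).hexdigest() == expected_digest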
Bigger hashes mean a larger set of possible hashes for the same input set, so a lower chance of overlap, but until processing power advances sufficiently to make brute-forcing MD5 trivial, it's still a decent choice for most purposes.
(It seems to be Hash Function Sunday.)
Cryptographic hash functions are designed to have very, very, very, low duplication rates. For the obvious reason you state, the rate can never be zero.
The Wikipedia page is informative.
As Mike (and basically everyone else) said, it's not perfect, but it does the job, and collision performance really depends on the algorithm (which in this case is actually pretty good).
What is of real interest is the automatic manipulation of files or data to keep the same hash with different data; see this Demo.
As others have answered, hash functions are by definition not guaranteed to return unique values, since there are a fixed number of hashes for an infinite number of inputs. Their key quality is that their collisions are unpredictable.
In other words, they're not easily reversible -- so while there may be many distinct inputs that will produce the same hash result (a "collision"), finding any two of them is computationally infeasible.
This is more of a cryptography theory question, but is it possible that the result of a hash algorithm will ever be the same value as the source? For example, say I have a string:
baf34551fecb48acc3da868eb85e1b6dac9de356
If I get the SHA1 hash on it, the result is:
4d2f72adbafddfe49a726990a1bcb8d34d3da162
In theory, is there ever a case where these two values would match? I'm not asking about SHA1 specifically here - it's just my example. I'm just wondering if hashing algorithms are built in such a way as to prevent this.
Well, it would depend on the hashing algorithm - but I'd be surprised to see anything explicitly prevent this. After all, it really shouldn't matter.
I suspect it's very unlikely to happen, of course (for cryptographic hashes)... but even if it does, that shouldn't cause a problem.
For non-crypto hashes (used in hash tables etc) it would be perfectly reasonable to return the source value in some cases. For example, in Java, Integer.hashCode() just returns the embedded value.
Sure, the Python hashing algorithm for integers returns the value of the integer. So hash(1) == 1.
Given a good hashing algorithm, one that returns a seemingly random output, I believe there should be on average one input that gives itself as the output. Let's say the hash can give N possible outputs. That means there are N possible inputs for which this is possible. For each of those, the odds of the output matching the input are 1/N, so the expected number of fixed points is N * (1/N), or 1.
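You can check that argument empirically with a toy hash whose whole input space is enumerable; in this sketch (my own example) SHA-256 is truncated to 16 bits purely for illustration:

import hashlib

def toy_hash(n):
    # 16-bit "hash": the first two bytes of SHA-256 over a 2-byte encoding of n.
    return int.from_bytes(hashlib.sha256(n.to_bytes(2, "big")).digest()[:2], "big")

fixed_points = [n for n in range(2**16) if toy_hash(n) == n]
print(len(fixed_points))   # typically a small count near 1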
A hash function might be defined to avoid ‘fixed points’ where hash(x)==x, but your hash-quine differs a little in that you're taking the string representation in hex of the hash rather than the raw binary. It would, I think, be infeasible to design a hash that could frustrate that, and it's mathematically less interesting since it depends on the arbitrary mapping of 0-F to ASCII character codes.
See Is there an MD5 Fixed Point where md5(x) == x? for a discussion about fixed points in MD5. The probability calculation would be equally true for hex hash-quines and any other hash function with 128 bits of output.
Does anyone know if there's a real benefit regarding decreasing collision probability by combining hash functions? I especially need to know this regarding 32 bit hashing, namely combining Adler32 and CRC32.
Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)?
The last comment here gives some test results in favor of combining, but no source is mentioned.
For my purpose, collision is not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible.
PS: I'm just starting in the wonderful world of hashing, doing a lot of reading about it. Sorry if I asked a silly question, I haven't even acquired the proper "hash dialect" yet, probably my Google searches regarding this were also poorly formed.
Thanks.
It doesn't make sense to combine them in series like that. You are hashing one 32-bit space to another 32-bit space.
In the case of a crc32 collision in the first step, the final result is still a collision. Then you add on any potential collisions in the adler32 step. So it can not get any better, and can only be the same or worse.
To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:
adler32(data) << 32 | crc32(data)
Whether there is significant benefit in doing that, I'm not sure.
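A hedged sketch of that side-by-side combination using Python's zlib module (the combined64 name and sample input are mine):

import zlib

def combined64(data):
    # Store Adler-32 in the high 32 bits and CRC-32 in the low 32 bits.
    return (zlib.adler32(data) << 32) | zlib.crc32(data)

print(hex(combined64(b"http://example.com/some/url")))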
Note that the original comment you referred to was storing the hashes independently:
Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Adler32 for each url, the odds of a simultaneous collision for both hashes for any given pair of urls is vastly reduced.

Of course that means storing twice as much information, which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10kb or so) whilst giving almost the same lookup performance (15 microsecs/lookup compared to 5 microsecs) as Perl's hashes.