I want to create a hash or checksum for each of millions of URLs, such that identical URLs (after sanitizing) have the same hash/checksum.
If I generate SHA-1 (20 bytes) or SHA-256 (32 bytes) hashes of the URLs and store them as big integers (8 bytes) by XORing each 8-byte chunk of the hash (C# code example here), is the result still safe from collisions? I've read some people say that it should be fine, but I haven't found any credible source.
As I understand it, an XOR of [1, 5] and [5, 1] will be the same, despite them being different sequences, so the hash XOR technique might result in collisions. In that case, are any of the non-crypto hash algorithms like MurmurHash, FNV or xxHash better for my use case, which requires the lowest chance of collisions at decent performance (not necessarily the fastest)?
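For concreteness, here is a minimal Python sketch of the two ways of squeezing a SHA-256 digest into 64 bits that the question is weighing: XOR-folding the four 8-byte chunks, and simply keeping the first 8 bytes. The function names and the sample URL are made up for illustration. Note that the [1, 5] vs [5, 1] concern applies to the chunks of the digest, not to the URL itself, and that either way the result has only 64 bits, so (assuming SHA-256 output behaves like random bits) both variants should be subject to the same 64-bit birthday bound.

```python
import hashlib
import struct


def xor_fold_64(url: str) -> int:
    """XOR the four 8-byte chunks of a SHA-256 digest into one 64-bit integer."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    folded = 0
    # ">Q" reads each 8-byte chunk as a big-endian unsigned 64-bit integer
    for (chunk,) in struct.iter_unpack(">Q", digest):
        folded ^= chunk
    return folded


def truncate_64(url: str) -> int:
    """Keep only the first 8 bytes of the SHA-256 digest."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


sample = "http://example.com/"  # hypothetical, already-sanitized URL
print(hex(xor_fold_64(sample)), hex(truncate_64(sample)))
```

Truncation is the simpler of the two and avoids the extra pass over the digest.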
Related
Let's say I have strings that need not be reversible and let's say I use SHA224 to hash it.
The hash of hello world is 2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b and its length is 56 bytes (as a hex string).
What if I convert every two chars to their numerical representations and make a single byte out of them?
In Python I'd do something like this:
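shalist = list("2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b")
result = []
for first_byte, next_byte in zip(shalist[0::2], shalist[1::2]):
    # add the code points of each pair of hex characters to get one character
    result.append(chr(ord(first_byte) + ord(next_byte)))
print("".join(result))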
The result will be \x98ek\x9d\x95\x96\x96\xc7\xcb\x9ckhf\x9a\xc7\xc9\xc8\x97\x97\x99\x97\xc9gd\x96im\x94. 28 bytes. Effectively halved the input.
Now, is there a higher hash collision risk by doing so?
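One thing that can be checked directly is how much information the pair-sum step keeps. The sketch below (an illustration, not part of the original question) counts how many distinct byte values the sum of two hex characters can produce, and compares that with decoding the hex string losslessly using bytes.fromhex. Different pairs such as "29" and "47" add up to the same value, so the pair-sum output is not equivalent to the raw 28-byte digest.

```python
from itertools import product

HEX_DIGITS = "0123456789abcdef"

# Every value that ord(first) + ord(second) can take for a pair of hex characters
pair_sums = {ord(a) + ord(b) for a, b in product(HEX_DIGITS, repeat=2)}
print(len(pair_sums), "distinct sums out of", len(HEX_DIGITS) ** 2, "possible pairs")

# Two different pairs that collapse to the same output byte
print(ord("2") + ord("9") == ord("4") + ord("7"))

# Lossless alternative: decode the hex string into the raw digest bytes
digest_hex = "2f05477fc24bb4faefd86517156dafdecec45b8ad3cf2522a563582b"
raw = bytes.fromhex(digest_hex)
print(len(raw), "bytes, round-trips:", raw.hex() == digest_hex)
```

The raw digest is already 28 bytes, so storing the decoded bytes halves the storage without discarding anything.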
The simple answer is pretty obvious: yes, it increases the chance of collision by as many powers of 2 as there are bits missing. For 56 bytes halved to 28 bytes, the chance of collision increases by a factor of 2^(28*8). That still leaves the chance of collision at 1:2^(28*8).
Your use of that truncation can still be perfectly legitimate, depending on what it is for. Git, for example, shows only the first few characters of a commit hash, and for most practical purposes the short one works fine.
A "perfect" hash should retain a proportional amount of "effective" bits if you truncate it. For example 32 bits of SHA256 result should have the same "strength" as a 32-bit CRC, although there may be some special properties of CRC that make it more suitable for some purposes while the truncated SHA may be better for others.
If you're doing any kind of security with this, it will be difficult to prove your system secure; you're probably better off using a shorter but complete hash.
Let's shrink the sizes to make sense of it and use a 2-byte hash instead of 56. The original hash will have 65536 possible values, so if you hash more than that many strings you will surely get a collision. Halve that to 1 byte and you are guaranteed a collision once you hash more than 256 strings, regardless of whether you take the first or the second byte. So your chance of collision is 256 times greater (2^(1 byte * 8 bits)) and is 1:256.
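The one-byte case is small enough to try out. The following sketch is only an illustration: it keeps the first byte of a SHA-224 digest and feeds in made-up strings until two of them share a value. With 257 distinct inputs and only 256 possible outputs, the pigeonhole principle guarantees that the loop finds a collision; in practice it usually turns up much sooner.

```python
import hashlib


def one_byte_hash(text: str) -> int:
    """Truncate a SHA-224 digest to its first byte (256 possible values)."""
    return hashlib.sha224(text.encode("utf-8")).digest()[0]


def first_collision(limit: int = 257):
    """Return (earlier_string, later_string, shared_hash) for the first collision."""
    seen = {}
    for i in range(limit):
        text = f"string-{i}"  # hypothetical inputs, all distinct
        h = one_byte_hash(text)
        if h in seen:
            return seen[h], text, h
        seen[h] = text
    return None  # unreachable when limit > 256


print(first_collision())
```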
Long hashes are used to make it truly impractical to brute-force them, even after long years of cryptanalysis. When MD5 was introduced in 1991 it was considered secure enough to use for certificate signing, in 2008 it was considered "broken" and not suitable for security-related use. Various cryptanalysis techniques can be developed to reduce the "effective" strength of hash and encryption algorithms, so the more spare bits there are (in an otherwise strong algorithm) the more effective bits should remain to keep the hash secure for all practical purposes.
I would like to store hashes for approximately 2 billion strings. For that purpose I would like to use as little storage as possible.
Consider an ideal hashing algorithm which returns hash as series of hexadecimal digits (like an md5 hash).
As far as I understand the idea, this means that I need the hash to be exactly 8 symbols in length, because such a hash would be capable of hashing 4+ billion (16 * 16 * 16 * 16 * 16 * 16 * 16 * 16) distinct strings.
So I'd like to know whether it is safe to cut a hash to a certain length to save space.
(hashes, of course, should not collide)
Yes/No/Maybe - I would appreciate answers with explanations or links to related studies.
P.S. - I know I can test whether an 8-character hash would be OK to store 2 billion strings. But I would need to compare 2 billion hashes with their 2 billion truncated versions. It doesn't seem trivial to me, so I'd better ask before I do that.
The hash is a number, not a string of hexadecimal digits (characters). In the case of MD5, it is 128 bits or 16 bytes when saved in efficient form. If your problem still applies, you can certainly consider truncating the number (by either coercing it into a word or first bit-shifting it). Good hash algorithms distribute evenly across all bits.
Addendum:
Generally, whenever you deal with hashes, you want to check whether the strings really match. This takes care of the possibility of colliding hashes. The more you cut the hash, the more collisions you're going to get. But it's good to plan for that happening at this phase.
Whether or not it's safe to store x values in a hash domain only capable of representing 2x distinct hash values depends entirely on whether you can tolerate collisions.
Hash functions are effectively random number generators, so your 2 billion calculated hash values will be distributed evenly about the 4 billion possible results. This means that you are subject to the Birthday Problem.
In your case, if you calculate 2^31 (2 billion) hashes with only 2^32 (4 billion) possible hash values, the chance of at least two having the same hash (a collision) is very, very nearly 100%. (And the chance of three being the same is also very, very nearly 100%. And so on.) I can't find the formula for calculating the probable number of collisions based on these numbers, but I suspect it is a huge number.
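For a rough idea of the scale, the standard expectations for n values thrown uniformly at random into m slots can be evaluated directly. The sketch below assumes an ideal (uniformly random) hash, which is an assumption and not a property of any particular algorithm: the expected number of colliding pairs is n(n-1)/(2m), and the expected number of distinct hash values actually used is m(1 - (1 - 1/m)^n).

```python
import math

n = 2**31  # number of strings hashed (about 2 billion)
m = 2**32  # number of possible hash values (about 4 billion)

# Expected number of pairs of strings that share a hash value
expected_pairs = n * (n - 1) / (2 * m)

# Expected number of distinct hash values in use: m * (1 - (1 - 1/m)**n),
# written with log1p/expm1 to keep floating-point precision
expected_distinct = -m * math.expm1(n * math.log1p(-1 / m))

# Strings whose hash value was already taken by an earlier string
expected_redundant = n - expected_distinct

print(f"colliding pairs:   {expected_pairs:,.0f}")
print(f"distinct values:   {expected_distinct:,.0f}")
print(f"redundant strings: {expected_redundant:,.0f}")
```

Under these assumptions the count of colliding pairs comes out in the hundreds of millions, which agrees with the suspicion that it is a huge number.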
If in your case hash collisions are not a disaster (such as in Java's HashMap implementation which deals with collisions by turning the hash target into a list of objects which share the same hash key, albeit at the cost of reduced performance) then maybe you can live with the certainty of a high number of collisions. But if you need uniqueness then you need either a far, far larger hash domain, or you need to assign each record a guaranteed-unique serial ID number, depending on your purposes.
Finally, note that Keccak is capable of generating any desired output length, so it makes little sense to spend CPU resources generating a long hash output only to trim it down afterwards. You should be able to tell your Keccak function to give only the number of bits you require. (Also note that a change in Keccak output length does not affect the initial output bits, so the result will be exactly the same as if you did a manual bitwise trim afterwards.)
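This prefix behaviour can be checked with the SHAKE functions in Python's hashlib, which are the standardized extendable-output variants built on Keccak. The sketch below uses SHAKE128 as a stand-in (the original answer does not say which Keccak variant is meant) and a made-up input. It applies to the extendable-output functions only; the fixed-length SHA3-224 and SHA3-256 use different internal parameters, so one is not a prefix of the other.

```python
import hashlib

data = b"http://example.com/"  # hypothetical input

# Ask the extendable-output function for 8 bytes and for 32 bytes
short_out = hashlib.shake_128(data).digest(8)
long_out = hashlib.shake_128(data).digest(32)

# The short output is exactly the beginning of the long one
print(short_out.hex())
print(long_out.hex())
print(long_out[:8] == short_out)
```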
I'm trying to hash a large number of files with binary data inside of them in order to:
(1) check for corruption in the future, and
(2) eliminate duplicate files (which might have completely different names and other metadata).
I know about md5 and sha1 and their relatives, but my understanding is that these are designed for security and therefore are deliberately slow in order to reduce the efficacy of brute force attacks. In contrast, I want algorithms that run as fast as possible, while reducing collisions as much as possible.
Any suggestions?
You are mostly right. If your system does not have any adversary, using cryptographic hash functions is overkill given their security properties.
Collisions depend on the number of bits, b, of your hash function and the number of hash values, N, you expect to compute. The academic literature argues that this collision probability must be below the hardware error probability, so that a collision in the hash function is less likely than an error when comparing data byte-by-byte [ref1,ref2,ref3,ref4,ref5]. Hardware error probability is in the range of 2^-12 to 2^-15 [ref6]. If you expect to generate N=2^q hash values then your collision probability may be given by this equation, which already takes into account the birthday paradox:
P = 2^(2q-b+1)
The number of bits of your hash function is directly proportional to its computational complexity. So you are interested in finding a hash function with the minimum number of bits possible, while keeping the collision probability at an acceptable value.
Here's an example on how to make that analysis:
Let's say you have f=2^15 files;
The average size of each file lf is 2^20 bytes;
You intend to divide each file into chunks of average size lc equal to 2^10 bytes;
Each file will be divided into c=lf/lc=2^10 chunks;
You will then hash N = f*c = 2^25 objects, so q = 25.
From that equation the collision probability for several hash sizes is the following:
P(hash=64 bits) = 2^(2*25-64+1) = 2^-13 (less than 2^-12)
P(hash=128 bits) = 2^(2*25-128+1) = 2^-77 (far less than 2^-12)
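The arithmetic of this example can be reproduced in a few lines. The sketch below recomputes the object count and the exponent 2q - b + 1 used in the two lines above, and sets it next to the usual birthday approximation 1 - exp(-N(N-1)/2^(b+1)), which is roughly 2^(2q-b-1). The function names are made up for illustration; the comparison suggests that the figures above are conservative by about a factor of 4.

```python
import math


def exponent_used_above(q: int, bits: int) -> int:
    """Exponent of the collision estimate used in this example: 2q - b + 1."""
    return 2 * q - bits + 1


def birthday_probability(n: int, bits: int) -> float:
    """Usual birthday approximation for n uniformly random hashes of the given width."""
    return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))


files = 2**15                     # f
chunks_per_file = 2**20 // 2**10  # c = lf / lc
n = files * chunks_per_file       # N = f * c
q = n.bit_length() - 1            # N = 2^q

for bits in (64, 128):
    approx = math.log2(birthday_probability(n, bits))
    print(f"{bits} bits: 2^{exponent_used_above(q, bits)} above, "
          f"about 2^{approx:.1f} by the birthday approximation")
```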
Now you just need to decide which non-cryptographic hash function of 64 or 128 bits you will use, knowing that 64 bits is pretty close to the hardware error probability (but will be faster) and 128 bits is a much safer option (though slower).
Below you can find a small list, taken from Wikipedia, of non-cryptographic hash functions. I know MurmurHash3 and it is much faster than any cryptographic hash function:
Fowler–Noll–Vo : 32, 64, 128, 256, 512 and 1024 bits
Jenkins : 64 and 128 bits
MurmurHash : 32, 64, 128, and 160 bits
CityHash : 64, 128 and 256 bits
MD5 and SHA-1 are not deliberately slow, no; that is a property of password-hashing functions. They were designed to be fast, and neither is considered particularly secure any more. I've used MD5 for deduplication myself (with Python), and performance was just fine.
This article claims machines today can compute the MD5 hash of 330 MB of data per second.
SHA-1 was developed as a safer alternative to MD5 when it was discovered that you could craft inputs that would hash to the same value with MD5, but I think for your purposes MD5 will work fine. It certainly did for me.
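A minimal sketch of what such MD5-based deduplication might look like in Python follows. It is not the code referred to above; the function names and the chunk size are assumptions. Files are read in blocks so that large files do not have to fit in memory, and paths are grouped by digest.

```python
import hashlib
from collections import defaultdict


def file_md5(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB blocks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(paths):
    """Group paths by content digest and return only groups with more than one file."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_md5(path)].append(path)
    return [group for group in groups.values() if len(group) > 1]
```

Calling group_duplicates on a list of file paths returns lists of paths whose contents hash to the same value; comparing the files in each group byte-by-byte before deleting anything guards against the unlikely case of a collision.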
If security is not a concern for you, you can take one of the secure hash functions and reduce the number of rounds. This makes them cryptographically unsound but still perfect for equality testing.
Skein is very strong. It has 80 rounds. Try reducing to 10 or so.
Or encrypt with AES and XOR the output blocks together. AES is hardware-accelerated on modern CPUs and insanely fast.
Interestingly, I haven't found enough information about any test or experiment comparing the collision chances of a single 512-bit hash like Whirlpool versus a concatenation of four 128-bit hashes like md5, sha1 etc.
Four 128-bit hashes all coming out the same seems less probable than a single 512-bit hash coming out the same when the data being hashed is small, on average only about 100 characters.
But it's just a guess with no basis, because I haven't performed any test. What do you think about it?
Edit its like
512bit hash vs 128bit hash . 128bit hash . 128bit hash . 128bit hash (4 128bit hash concatenated)
Edit2
I want to use a hash for this index on url or hashing considering RAM
and the purpose is to minimize the possibility of collision, because I want to set the hash column as unique instead of the url column.
Edit3
Please note that the purpose of this question is to find a way to minimize the possibility of collision. Having said that, why do I need to focus more on minimizing the possibility of collision? That is where my Edit2 description comes in, which leads to finding a solution that uses less RAM. So the interests are both minimizing collisions and lowering RAM usage, but the prime focus of this question is lowering the possibility of collision.
It sounds like you want to compare the collision behaviour of:
hash512(x)
with the collision behaviour of:
hash128_a(x) . hash128_b(x) . hash128_c(x) . hash128_d(x)
where "." denotes concatenation, and hash128_a, hash128_b, etc. are four different 128-bit hash algorithms.
The answer is: it depends entirely on the properties of the individual hashes involved.
Consider, for instance that the 128-bit hash functions could be implemented as:
uint128_t hash128_a(T x) { return hash512(x)[ 0:127]; }
uint128_t hash128_b(T x) { return hash512(x)[128:255]; }
uint128_t hash128_c(T x) { return hash512(x)[256:383]; }
uint128_t hash128_d(T x) { return hash512(x)[384:511]; }
In which case, the performance would be identical.
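The same construction can be written as a runnable Python sketch, using SHA-512 as a stand-in for hash512 (an assumption; the answer above does not name a specific algorithm). Concatenating the four 128-bit slices gives back the 512-bit digest exactly, so the two schemes collide on precisely the same inputs.

```python
import hashlib


def hash512(x: bytes) -> bytes:
    """Stand-in 512-bit hash: SHA-512 (64 bytes)."""
    return hashlib.sha512(x).digest()


# Four "different" 128-bit hashes, each just a 16-byte slice of hash512
def hash128_a(x: bytes) -> bytes:
    return hash512(x)[0:16]


def hash128_b(x: bytes) -> bytes:
    return hash512(x)[16:32]


def hash128_c(x: bytes) -> bytes:
    return hash512(x)[32:48]


def hash128_d(x: bytes) -> bytes:
    return hash512(x)[48:64]


x = b"http://example.com/"  # hypothetical input
concatenated = hash128_a(x) + hash128_b(x) + hash128_c(x) + hash128_d(x)
print(concatenated == hash512(x))
```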
The classical article to read on that question is due to Hoch and Shamir. It builds on previous discoveries, especially by Joux. The bottom line is the following: if you take four hash functions with a 128-bit output, and the four hash functions use the Merkle-Damgård construction, then finding a collision for the whole 512-bit output is no more difficult than finding a collision for one of the hash functions. MD5, SHA-1... use the MD construction.
On the other hand, if some of your hash functions use a distinct structure, in particular with a wider running state, the concatenation could yield a stronger function. See the example from @Oli: if all four functions are SHA-512 with some surgery on the output, then the concatenated hash function could be plain SHA-512.
The only sure thing about the concatenation of four hash functions is that the result will be no less collision-resistant than the strongest of the four hash functions. This has been used within SSL/TLS, which, up to version 1.1, internally uses concurrently both MD5 and SHA-1 in an attempt to resist breaks on either.
512 bits is 512 bits. The only difference is in the difference in imperfections in the hashes. The best overall hash would be a 512 using the best algorithm available.
Edit to add clarification, because it's too long for a comment:
An ideal hash maps content uniformly onto x bits. If you have 4 (completely independent) x-bit hashes, that maps the file uniformly onto 4x bits; a 4x-bit hash still maps the same file uniformly onto 4x bits. 4x bits is 4x bits; as long as it's perfectly uniform, it doesn't matter whether it comes from one (4x) hash function or 4 (x). However, no hash can be completely ideal, so you want the most uniform obtainable distribution, and if you use 4 different functions, only 1 can be the closest to optimal so you have x optimal bits and 3x suboptimal, whereas a single algorithm can cover the entire 4x space with the most optimal distribution.
I suppose it is possible that enough larger algorithms could have subsets of bits that are more uniformly distributed than a single 512, and could be combined to get more uniformity, but that seems like it would be a great deal extra research and implementation for little potential benefit.
If you are comparing the concatenation of four different 'ideal' 128-bit hashing algorithms with one ideal 512-bit hashing algorithm, then yes, both methods will get you the same probability of a collision. Using md5 would make it easier to crack a hash though. If an attacker knew, for example, that you were doing md5 + md5 with a salt + md5 with another salt, then that would be much easier to crack using an md5 collision attack. Look here for more information about hash functions that have known attacks.
Here is a little conundrum for you: if you use a hash algorithm like CRC-64, how many bytes of a string is it necessary to read to calculate a good hash? Let's say all your strings are at least 2 KB long; then it seems a waste of resources to use the whole string to calculate the hash, but just how many characters do you think is enough? Would just 8 ASCII characters be enough, since that equals 64 bits? Won't using more than 8 ASCII characters just be pointless? I want to know your thoughts on this.
Update:
With a 'good hash' I mean the point where the likelihood of hash collisions can not get any less by using even more bytes to calculate it.
If you use CRC-64 over 8 bytes or less then there is no point in using CRC-64: just use the 8 bytes "as is". A CRC does not have any added value unless the input is longer than the intended output.
As a general rule, if your hash function has an output of n bits then collisions begin to appear once you have accumulated about 2^(n/2) strings. In other words, if you use 64 bits, then it is unlikely (on the order of a 10% chance) that you encounter a collision in the first 2 billion strings. If you get a 160-bit or larger output, then collisions are virtually infeasible (you will encounter far fewer collisions than hardware failures such as the CPU catching fire). This assumes that the hash function is "perfect". If your hash function begins by selecting a few data bytes, then, necessarily, the bytes that you do not select cannot have any influence on the hash output, so you'd better use the "good" bytes -- which utterly depends on the kind of strings that you are hashing. There is no general rule here.
My advice would be to first try using a generic hash function over the whole string; I usually recommend MD4. MD4 is a cryptographic hash function, which has been utterly broken, but for a problem with no security involved, it is still very good at mixing data elements (cryptographically speaking, a CRC is so much more broken than MD4). MD4 has been reported to actually be faster than CRC-32 on some platforms, so you could give it a shot. On a basic PC (my 2.4 GHz Core2), an MD4 implementation works at about 700 MBytes/s, so we are talking about 350000 hashed 2 kB strings per second, which is not bad.
What are the chances that the first 8 letters of two different strings are the same? Depending on what these strings are, it could be very high, in which case you'll definitely get hash collisions.
Hash the whole thing. A few kilobytes is nothing. Unless you actually have a need to save nanoseconds in your program, not hashing the full strings would be premature optimization.
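The difference is easy to see with strings that share a prefix. The sketch below uses zlib.crc32 as a stand-in, since Python's standard library has no CRC-64, and two made-up URLs that differ only in their last character. Hashing just the first 8 bytes cannot tell them apart, while hashing the whole string can.

```python
import zlib


def prefix_hash(text: str, length: int = 8) -> int:
    """CRC-32 over only the first `length` bytes of the string."""
    return zlib.crc32(text.encode("utf-8")[:length])


def full_hash(text: str) -> int:
    """CRC-32 over the whole string."""
    return zlib.crc32(text.encode("utf-8"))


a = "http://example.com/page/1"  # hypothetical strings with a shared prefix
b = "http://example.com/page/2"

print(prefix_hash(a) == prefix_hash(b))  # the bytes that differ are never read
print(full_hash(a) == full_hash(b))
```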