Checksumming: CRC or hash?

Performance and security considerations aside, and assuming a hash function with a perfect avalanche effect, which should I use for checksumming blocks of data: CRC32 or hash truncated to N bytes? I.e. which will have a smaller probability to miss an error? Specifically:
CRC32 vs. 4-byte hash
CRC32 vs. 8-byte hash
CRC64 vs. 8-byte hash
Data blocks are to be transferred over network and stored on disk, repeatedly. Blocks can be 1KB to 1GB in size.
As far as I understand, CRC32 can detect up to 32 bit flips with 100% reliability, but after that its reliability approaches 1-2^(-32) and for some patterns is much worse. A perfect 4-byte hash reliability is always 1-2^(-32), so go figure.
8-byte hash should have a much better overall reliability (2^(-64) chance to miss an error), so should it be preferred over CRC32? What about CRC64?
I guess the answer depends on type of errors that might be expected in such sort of operation. Are we likely to see sparse 1-bit flips or massive block corruptions? Also, given that most storage and networking hardware implements some sort of CRC, should not accidental bit flips be taken care of already?

Only you can say whether 1-2^(-32) is good enough or not for your application. The error detection performance between a CRC-n and n bits from a good hash function will be very close to the same, so pick whichever one is faster. That is likely to be the CRC-n.
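As a rough illustration of the two options being compared, the sketch below computes a CRC-32 with Python's zlib and an n-byte check by truncating a hash digest. SHA-256 is only a stand-in for "a good hash function" (an assumption, since the answer does not name one), and crc32_check and truncated_hash are hypothetical helper names.

```python
import hashlib
import zlib


def crc32_check(block: bytes) -> int:
    """32-bit CRC of the block (zlib's CRC-32, not CRC-32C)."""
    return zlib.crc32(block) & 0xFFFFFFFF


def truncated_hash(block: bytes, nbytes: int) -> int:
    """First nbytes of a SHA-256 digest, as an integer.

    SHA-256 is a stand-in for any well-mixed hash; truncating it to
    nbytes leaves a check value of 8 * nbytes bits.
    """
    digest = hashlib.sha256(block).digest()
    return int.from_bytes(digest[:nbytes], "big")


block = b"example data block" * 1000
corrupted = bytearray(block)
corrupted[123] ^= 0x01  # flip a single bit
corrupted = bytes(corrupted)

print(hex(crc32_check(block)), hex(crc32_check(corrupted)))
print(hex(truncated_hash(block, 8)), hex(truncated_hash(corrupted, 8)))
```

Either value can be stored next to the block and recomputed on read; which one is faster on a given machine has to be measured.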
Update:
The above "That is likely to be the CRC-n" is only somewhat likely. It is not so likely if very high performance hash functions are used. In particular, CityHash appears to be very nearly as fast as a CRC-32 calculated using the Intel crc32 hardware instruction! I tested three CityHash routines and the Intel crc32 instruction on a 434 MB file. The crc32 instruction version (which computes a CRC-32C) took 24 ms of CPU time. CityHash64 took 55 ms, CityHash128 60 ms, and CityHashCrc128 50 ms. CityHashCrc128 makes use of the same hardware instruction, though it does not compute a CRC.
In order to get the CRC-32C calculation that fast, I had to get fancy with three crc32 instructions on three separate buffers in order to make use of the three arithmetic logic units in parallel in a single core, and then write the inner loop in assembler. CityHash is pretty damned fast. If you don't have the crc32 instruction, then you would be hard-pressed to compute a 32-bit CRC as fast as a CityHash64 or CityHash128.
Note however that the CityHash functions would need to be modified for this purpose, or an arbitrary choice would need to be made in order to define a consistent meaning for the CityHash value on large streams of data. The reason is that those functions are not set up to accept buffered data, i.e. feeding the functions a chunk at a time and expecting to get the same result as if the entire set of data were fed to the function at once. The CityHash functions would need to be modified to update an intermediate state.
The alternative, and what I did for the quick and dirty testing, is to use the Seed versions of the functions where I would use the CityHash from the previous buffer as the seed for the next buffer. The problem with that is that the result is then dependent on the buffer size. If you feed CityHash different size buffers with this approach, you get different hash values.
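The buffer-size dependence described above can be reproduced with a small sketch. It does not use CityHash: SHA-256 over (previous hash + buffer) is a stand-in for the seeded CityHash call, and crc32_streamed and chained_hash are hypothetical names. zlib's CRC-32 is shown for contrast because it carries a running state between buffers.

```python
import hashlib
import zlib


def crc32_streamed(data: bytes, buf_size: int) -> int:
    """CRC-32 fed one buffer at a time; zlib carries the running state."""
    crc = 0
    for i in range(0, len(data), buf_size):
        crc = zlib.crc32(data[i:i + buf_size], crc)
    return crc & 0xFFFFFFFF


def chained_hash(data: bytes, buf_size: int) -> bytes:
    """Hash each buffer with the previous buffer's hash acting as a seed.

    SHA-256 over (seed + buffer) stands in for a seeded CityHash call.
    """
    seed = b""
    for i in range(0, len(data), buf_size):
        seed = hashlib.sha256(seed + data[i:i + buf_size]).digest()
    return seed


data = bytes(range(256)) * 64  # 16 KiB of sample data

# The streamed CRC matches the one-shot CRC for any buffer size.
print(crc32_streamed(data, 1000) == zlib.crc32(data))
# The chained hash changes when the buffer size changes.
print(chained_hash(data, 1000) == chained_hash(data, 4096))
```

With the chained approach the buffer size effectively becomes part of the checksum definition, so both sides have to agree on it.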
Another Update four years later:
Even faster is the xxhash family. I would now recommend that over a CRC for a non-cryptographic hash.

Putting aside "performance" issues; you might want to consider using one of the SHA-2 functions (say SHA-256).

Related

How well do Non-cryptographic hashes detect errors in data vs. CRC-32 etc.?

Non-cryptographic hashes such as MurmurHash3 and xxHash are almost exclusively designed for hash tables, but they appear to function comparably (and even favorably) to CRC-32, Adler-32 and Fletcher-32. Non-crypto hashes are often faster than CRC-32 and produce more "random" output similar to slow cryptographic hashes (MD5, SHA). Despite this, I only ever see CRC-32 or MD5 recommended for data integrity/checksum purposes.
In the table below, I tested 32-bit checksum/CRC/hash functions to determine how well they detect small differences in data:
The result in each cell means: A) the number of collisions found, and B) the minimum and maximum probability (in percent) that any of the 32 output bits is set to 1. To pass test B, the max and min should be as close as possible to 50. Anything under 45 or over 55 indicates bias.
Looking at the table, MurmurHash3 and Jenkins lookup2 compare favorably to CRC-32 (which actually fails one test). They are also well-distributed. DJB2 and FNV1a pass collision tests but aren't well distributed. Fletcher32 and Adler32 struggle with the NullBytes and 8RandBytes tests.
So then my question is, compared to other checksums, how suitable are 'non-cryptographic hashes' for detecting errors or differences in files? Is there any reason a CRC-32/Adler-32/CRC-64 might outperform any decent 32-bit/64-bit hash?
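For readers who want to repeat this kind of measurement, here is a minimal sketch of the two tests as described (collision count and per-bit bias). The input set is hypothetical and is not one of the test sets from the question; only zlib's CRC-32 and Adler-32 are included because they ship with Python.

```python
import zlib


def bit_bias(check, inputs):
    """Return (min, max) percentage of inputs for which an output bit
    is set, taken over all 32 bit positions."""
    counts = [0] * 32
    total = 0
    for item in inputs:
        value = check(item) & 0xFFFFFFFF
        total += 1
        for bit in range(32):
            counts[bit] += (value >> bit) & 1
    percents = [100.0 * c / total for c in counts]
    return min(percents), max(percents)


def count_collisions(check, inputs):
    """Number of inputs whose check value was already produced earlier."""
    seen = set()
    collisions = 0
    for item in inputs:
        value = check(item) & 0xFFFFFFFF
        if value in seen:
            collisions += 1
        seen.add(value)
    return collisions


# Hypothetical test set: 8-byte inputs that differ only slightly.
inputs = [i.to_bytes(8, "big") for i in range(20000)]

for name, check in (("crc32", zlib.crc32), ("adler32", zlib.adler32)):
    print(name, count_collisions(check, inputs), bit_bias(check, inputs))
```

On inputs this short, some Adler-32 output bits can never be set (its low half is at most 1 + 8*255), so its minimum comes out as 0.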
Is there any reason this function would be inferior to CRC-32 or Adler-32 for detecting errors in data?
Yes, for certain kinds of error characteristics. A CRC can be designed to very effectively detect small numbers of bit errors in a packet, as you might expect on an actual communications or storage channel. That's what it's designed for.
For large numbers of errors, any 32-bit check that fills the 32 bits and does a reasonably good job of being sensitive to all of the bits of the packet will work about as well as any other. So yours would be as good as a CRC-32, and a smidge better than an Adler-32. (The Adler-32 deliberately does not use all possible 32-bit values, so has a slightly higher false positive rate than 32-bit checks that use all possible values.)
By the way, looking a little more at your algorithm, it does not distribute over all 32-bit values until you have many bytes of input. So your check would not be as good as any other 32-bit check on a large number of errors until you have covered the possible 32-bit values of the check.
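The "small numbers of bit errors" case can be checked directly. The sketch below flips every single bit of a short message, then applies random error bursts confined to a 32-bit window, and counts how often zlib's CRC-32 fails to notice. The helper flip_bits and the message size are illustrative choices; the guarantee itself follows from the CRC having a degree-32 generator polynomial with a nonzero constant term.

```python
import random
import zlib


def flip_bits(data: bytes, pattern: int, start: int) -> bytes:
    """XOR an error pattern into data, starting at bit offset start.

    Bits are numbered least-significant-bit first within each byte,
    which is the order in which zlib's CRC-32 consumes them.
    """
    as_int = int.from_bytes(data, "little") ^ (pattern << start)
    return as_int.to_bytes(len(data), "little")


rng = random.Random(0)
message = bytes(rng.randrange(256) for _ in range(64))
good_crc = zlib.crc32(message)
nbits = len(message) * 8

# Every single-bit error changes the CRC.
single_bit_misses = sum(
    zlib.crc32(flip_bits(message, 1, pos)) == good_crc
    for pos in range(nbits)
)

# Random error bursts confined to a 32-bit window also change it.
burst_misses = 0
for _ in range(10000):
    pattern = rng.randrange(1, 1 << 32)    # nonzero, at most 32 bits wide
    start = rng.randrange(nbits - 32 + 1)  # keep the burst inside the message
    if zlib.crc32(flip_bits(message, pattern, start)) == good_crc:
        burst_misses += 1

print(single_bit_misses, burst_misses)
```

A generic 32-bit hash gives no such guarantee for these error classes; it misses each of them with probability of about 2^(-32), like any other error.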

efficient hash function for uris

I am looking for a hash function to build a (global) fixed-size ID for strings, most of them URIs.
It should be:
fast
low chance of collision
~64 bits
exploiting the structure of a URI, if that is possible?
Would http://murmurhash.googlepages.com/ be a good choice, or is there anything better suited?
Try MD4. As far as cryptography is concerned, it is "broken", but since you do not have any security concern (you want a 64-bit output size, which is too small to yield any decent security against collisions), that should not be a problem. MD4 yields a 128-bit value, which you just have to truncate to the size you wish.
Cryptographic hash functions are designed for resilience to explicit attempts at building collisions. Conceivably, one can build a faster function by relaxing that condition (it is easier to beat random collisions than a determined attacker). There are a few such functions, e.g. MurmurHash. However it may take a quite specific setup to actually notice the speed difference. With my home PC (a 2.4 GHz Core2), I can hash about 10 million short strings per second with MD4, using a single CPU core (I have four cores). For MurmurHash to be faster than MD4 in a non-negligible way, it would have to be used in a context involving at least one million hash invocations per second. That does not happen very often...
I'd wait a little longer for MurmurHash3 to be finalized, then use that. The 128-bit version should give you adequate collision protection against the birthday paradox.
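A minimal sketch of the "truncate a 128-bit digest to 64 bits" approach, together with a birthday-bound estimate of the collision risk. MD5 is used as a stand-in because MD4 may not be available in a given hashlib build (an assumption about the reader's environment); uri_id64 and collision_probability are hypothetical names.

```python
import hashlib
import math


def uri_id64(uri: str) -> int:
    """64-bit identifier: the first 8 bytes of an MD5 digest of the URI.

    MD5 is a stand-in for the MD4 suggested above; the truncation step
    is the same for any 128-bit digest.
    """
    digest = hashlib.md5(uri.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def collision_probability(n_items: int, bits: int = 64) -> float:
    """Birthday-bound estimate of at least one collision among n_items
    uniformly random values of the given width."""
    exponent = n_items * (n_items - 1) / (2.0 * 2.0 ** bits)
    return -math.expm1(-exponent)


print(hex(uri_id64("http://example.com/some/path")))
print(collision_probability(1_000_000))      # a million URIs
print(collision_probability(1_000_000_000))  # a billion URIs
```

The estimate assumes the truncated hash behaves like a uniformly random 64-bit value, which is the property a decent hash is supposed to provide.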

Can one construct a "good" hash function using CRC32C as a base?

Given that SSE 4.2 (Intel Core i7 & i5 parts) includes a CRC32 instruction, it seems reasonable to investigate whether one could build a faster general-purpose hash function. According to this only 16 bits of a CRC32 are evenly distributed. So what other transformation would one apply to overcome that?
Update
How about this? Only 16 bits are suitable for a hash value. Fine. If your table is 65535 or less then great. If not, run the CRC value through the Nehalem POPCNT (population count) instruction to get the number of bits set. Then, use that as an index into an array of tables. This works if your table is south of 1mm entries. I'd bet that's cheaper/faster than the best-performing hash functions. Now that GCC 4.5 has a CRC32 intrinsic it should be easy to test...if only I had the copious spare time to work on it.
David
Revisited, August 2014
Prompted by Arnaud Bouchez in a recent comment, and in view of other answers and comments, I acknowledge that the original answer needs to be altered or at the least qualified. I left the original as-is, at the end, for reference.
First, and maybe most important, a fair answer to the question depends on the intended use of the hash code: What does one mean by "good" [hash function...]? Where/how will the hash be used? (e.g. is it for hashing a relatively short input key? Is it for indexing / lookup purposes, to produce message digests or yet other uses? How long is the desired hash code itself: all 32 bits [of CRC32 or derivatives thereof], more bits, fewer, etc.?)
The OP's question calls for "a faster general-purpose hash function", so the focus is on SPEED (something less CPU intensive and/or something which can make use of parallel processing of various nature). We may note here that the computation time for the hash code itself is often only part of the problem in an application of hash (for example if the size of the hash code or its intrinsic characteristics result in many collisions which require extra cycles to be dealt with). Also the requirement for "general purpose" leaves many questions as to the possible uses.
With this in mind, a short and better answer is, maybe:
Yes, the hardware implementations of CRC32C on newer Intel processors can be used to build faster hash codes; beware however that depending on the specific implementation of the hash and on its application the overall results may be sub-optimal because of the frequency of collisions, or of the need to use longer codes. Also, for sure, cryptographic uses of the hash should be carefully vetted because the CRC32 algorithm itself is very weak in this regard.
The original answer cited an article on Evaluating Hash functions by Bret Mulvey and, as pointed out in Mdlg's answer, the conclusions of this article are erroneous in regards to CRC32, as the implementation of CRC32 it was based on was buggy/flawed. Despite this major error in regards to CRC32, the article provides useful guidance as to the properties of hash algorithms in general. The URL to this article is now defunct; I found it on archive.today but I don't know if the author has it at another location and also whether he updated it.
Other answers here cite CityHash 1.0 as an example of a hash library that uses CRC32C. Apparently, this is used in the context of some longer (than 32 bits) hash codes but not for the CityHash32() function itself. Also, the use of CRC32 by CityHash functions is relatively small, compared with all the shifting and shuffling and other operations that are performed to produce the hash code. (This is not a critique of CityHash, for which I have no hands-on experience. I'll go out on a limb, from a cursory review of the source code, that CityHash functions produce good, e.g. well distributed, codes, but are not significantly faster than various other hash functions.)
Finally, you may also find insight on this issue in a quasi duplicate question on SO.
Original answer and edit (April 2010)
A priori, this sounds like a bad idea!
CRC32 was not designed for hashing purposes, and its distribution is likely to not be uniform, hence making it a relatively poor hash-code. Furthermore, its "scrambling" power is relatively weak, making for a very poor one-way hash, as would be used in cryptographic applications.
[BRB: I'm looking for online references to that effect...]
Google's first [keywords = CRC32 distribution] hit seems to confirm this :
Evaluating CRC32 for hash tables
Edit: The page cited above, and indeed the complete article provides a good basis of what to look for in Hash functions.
A [quick] reading of this article confirmed the blanket statement that, in general, CRC32 should not be used as a hash; however, depending on the specific purpose of the hash, it may be possible to use, at least in part, a CRC32 as a hash code.
For example the lower (or higher, depending on implementation) 16 bits of the CRC32 code have a relatively even distribution, and, provided that one isn't concerned about the cryptographic properties of the hash code (i.e. for example the fact that similar keys produce very similar codes), it may be possible to build a hash code which uses, say, a concatenation of the lower [or higher] 16 bits for two CRC32 codes produced with the two halves (or whatever division) of the original key.
One would need to run tests to see if the efficiency of the built-in CRC32 instruction, relative to alternative hash functions, would be such that the overhead of calling the instruction twice and splicing the codes together etc. wouldn't result in an overall slower function.
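The splicing idea can be sketched as follows. This uses zlib's software CRC-32 rather than the hardware CRC32C instruction, and the split point (the middle of the key) and the name spliced_crc_hash are arbitrary choices for illustration. Note that the premise that only 16 bits are well distributed is disputed elsewhere in this thread.

```python
import zlib


def spliced_crc_hash(key: bytes) -> int:
    """32-bit hash built from the low 16 bits of two CRC-32 values,
    one per half of the key."""
    mid = len(key) // 2
    first = zlib.crc32(key[:mid]) & 0xFFFF
    second = zlib.crc32(key[mid:]) & 0xFFFF
    return (first << 16) | second


print(hex(spliced_crc_hash(b"http://example.com/index.html")))
```

Timing this against a single full CRC-32 (or any other hash) on realistic keys is the test the paragraph above calls for.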
The article referred to in other answers draws incorrect conclusions based on buggy crc32 code. Google's ranking algorithm does not rank based on scientific accuracy yet.
Contrary to the conclusions of the referred article "Evaluating CRC32 for hash tables", CRC32 and CRC32C are acceptable for hash table use. The author's sample code has a bug in the crc32 table generation. Fixing the crc32 table gives satisfactory results using the same methodology. Also the speed of the CRC32 instruction makes it the best choice in many contexts. Code using the CRC32 instruction is 16x faster at peak than an optimal software implementation. (Note that CRC32 is not exactly the same as CRC32C, which the Intel instruction implements.)
CRC32 is obviously not suitable for crypto use. (32 bit is a joke to brute force).
Yes. CityHash 1.0.1 includes some new "good hash functions" that use CRC32 instructions.
Just so long as you're not after crypto hash it just might work.
For cryptographic purposes, CRC32 is a bad foundation because it is linear (over the vector space GF(2)^32) and that is hard to correct. It may work for non-cryptographic purposes.
However, recent Intel cores have the AES-NI instructions, which basically perform 1/10th of an AES block encryption in two clock cycles. They are available on the most recent i5 and i7 processors (see the Wikipedia page for some details). This looks like a good start for building a cryptographic hash function (and a hash function which is good for cryptography will also be good for about anything else).
Indeed, at least one of the SHA-3 "round 2" candidates (the ECHO hash function) is built around the AES elements so that the AES-NI opcodes provide a very substantial performance boost. (Unfortunately, in the absence of AES-NI instruction, ECHO performance somewhat sucks.)
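The linearity mentioned at the start of this answer is easy to observe. With zlib's CRC-32 (which adds a fixed initial value and final XOR, so it is strictly speaking affine), the XOR of the CRCs of three equal-length messages equals the CRC of their XOR. The helper xor_bytes is a hypothetical name.

```python
import os
import zlib


def xor_bytes(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


a, b, c = (os.urandom(32) for _ in range(3))

# For equal-length inputs, the CRC of the XOR equals the XOR of the CRCs.
lhs = zlib.crc32(xor_bytes(a, b, c))
rhs = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
print(lhs == rhs)
```

An attacker can use this relation to predict how a chosen change to the data changes the CRC, which is why a CRC offers no protection against deliberate modification.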

What hash algorithms are parallelizable? Optimizing the hashing of large files on multi-core CPUs

I'm interested in optimizing the hashing of some large files (optimizing wall clock time). The I/O has been optimized well enough already and the I/O device (local SSD) is only tapped at about 25% of capacity, while one of the CPU cores is completely maxed-out.
I have more cores available, and in the future will likely have even more cores. So far I've only been able to tap into more cores if I happen to need multiple hashes of the same file, say an MD5 AND a SHA256 at the same time. I can use the same I/O stream to feed two or more hash algorithms, and I get the faster algorithms done for free (as far as wall clock time). As I understand most hash algorithms, each new bit changes the entire result, and it is inherently challenging/impossible to do in parallel.
Are any of the mainstream hash algorithms parallelizable?
Are there any non-mainstream hashes that are parallelizable (and that have at least a sample implementation available)?
As future CPUs will trend toward more cores and a leveling off in clock speed, is there any way to improve the performance of file hashing? (other than liquid nitrogen cooled overclocking?) or is it inherently non-parallelizable?
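The "one I/O stream, several hashes" setup described in the question might look roughly like this. It is a sketch, not the asker's code: multi_hash is a hypothetical name, an in-memory stream stands in for the file, and whether the worker threads really run on separate cores depends on the hash implementation releasing the interpreter lock during update().

```python
import hashlib
import io
from concurrent.futures import ThreadPoolExecutor


def multi_hash(stream, algorithms=("md5", "sha256"), chunk_size=1 << 20):
    """Read the stream once, feeding each chunk to every hash object.

    Each hash gets its own worker thread; all updates for one chunk
    finish before the next chunk is read, so every hash sees the data
    in order.
    """
    hashers = [hashlib.new(name) for name in algorithms]
    with ThreadPoolExecutor(max_workers=len(hashers)) as pool:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            # list() waits for every update on this chunk to complete.
            list(pool.map(lambda h: h.update(chunk), hashers))
    return {name: h.hexdigest() for name, h in zip(algorithms, hashers)}


# An in-memory stream stands in for a large file opened in binary mode.
payload = b"large file contents " * 100000
digests = multi_hash(io.BytesIO(payload))
print(digests)
```

This only spreads different algorithms over cores; a single hash of a single file still runs on one core, which is the limitation the question is about.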
There is actually a lot of research going on in this area. The US National Institute of Standards and Technology is currently holding a competition to design the next-generation of government-grade hash function. Most of the proposals for that are parallelizable.
One example: http://www.schneier.com/skein1.2.pdf
Wikipedia's description of current status of the contest: http://en.wikipedia.org/wiki/SHA-3
What kind of SSD do you have? My C implementation of MD5 runs at 400 MB/s on a single Intel Core2 core (2.4 GHz, not the latest Intel). Do you really have an SSD which supports a bandwidth of 1.6 GB/s? I want the same!
Tree hashing can be applied on any hash function. There are a few subtleties and the Skein specification tries to deal with them, integrating some metadata in the function itself (this does not change much for performance), but the "tree mode" of Skein is not "the" Skein as submitted to SHA-3. Even if Skein is selected as SHA-3, the output of a tree-mode hash would not be the same as the output of "plain Skein".
Hopefully, a standard will be defined at some point, to describe generic tree hashing. Right now there is none. However, some protocols have been defined with support for a custom tree hashing with the Tiger hash function, under the name "TTH" (Tiger Tree Hash) or "THEX" (Tree Hash Exchange Format). The specification for TTH appears to be a bit elusive; I find some references to drafts which have either moved or disappeared for good.
Still, I am a bit dubious about the concept. It is kind of neat, but provides a performance boost only if you can read data faster than what a single core can process, and, given the right function and the right implementation, a single core can hash quite a lot of data per second. A tree hash spread over several cores requires having the data sent to the proper cores, and 1.6 GB/s is not the smallest bandwidth ever.
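To make the idea concrete, here is a minimal two-level tree hash built on SHA-256. It is not TTH, THEX or Skein's tree mode; the leaf size and the 0x00/0x01 prefixes are illustrative choices made for this sketch, and tree_hash is a hypothetical name.

```python
import hashlib


def tree_hash(data: bytes, leaf_size: int = 1 << 20) -> bytes:
    """Two-level tree hash: hash each leaf, then hash the leaf digests.

    The 0x00/0x01 prefixes keep leaf inputs and the root input distinct;
    this layout is an illustrative choice, not a standard.
    """
    leaves = [
        hashlib.sha256(b"\x00" + data[i:i + leaf_size]).digest()
        for i in range(0, max(len(data), 1), leaf_size)
    ]
    return hashlib.sha256(b"\x01" + b"".join(leaves)).digest()


data = b"0123456789abcdef" * 4096  # 64 KiB of sample data
print(tree_hash(data, 16384).hex())
print(hashlib.sha256(data).hexdigest())  # plain hash, a different value
```

The leaf digests are independent of each other, so they could be computed on separate cores. The result depends on the leaf size and differs from the plain hash, which is the interoperability problem noted above.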
SHA-256 and SHA-512 are not very fast. Among the SHA-3 candidates, assuming an x86 processor in 64-bit mode, some of them achieve high speed (more than 300 MB/s on my 2.4 GHz Intel Core2 Q6600, with a single core -- that's what I can get out of SHA-1, too), e.g. BMW, SHABAL or Skein. Cryptographically speaking, these designs are a bit too new, but MD5 and SHA-1 are already cryptographically "broken" (quite effectively in the case of MD5, rather theoretically for SHA-1) so any of the round-2 SHA-3 candidates should be fine.
When I put on my "seer" cap, I foresee that processors will keep on becoming faster than RAM, to the point that hashing cost will be dwarfed by memory bandwidth: the CPU will have clock cycles to spare while it waits for the data from the main RAM. At some point, the whole threading model (one big RAM for many cores) will have to be amended.
You didn't say what you need your hash for.
If you're not gonna exchange it with the outside world but just for internal use, simply divide each file in chunks, compute and store all the checksums. You can then use many cores just by throwing a chunk to each one.
Two solutions that come to mind are dividing files in fixed-size chunks (simpler, but will use fewer cores for smaller files where you're not supposed to need all that power) or in a fixed number of chunks (will use all the cores for every file). Really depends on what you want to achieve and what your file size distribution looks like.
If, on the other hand, you need hashes for the outside world, as you can read from the other replies it's not possible with "standard" hashes (eg. if you want to send out SHA1 hashes for others to check with different tools) so you must look somewhere else. Like computing the hash when you store the file, for later retrieval, or compute hashes in background with the 'free' cores and store for later retrieval.
The better solution depends on what your constraints are and where you can invest space, time or cpu power.
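A sketch of the internal-use variant: split the data by fixed size or by fixed count, then checksum the chunks on a pool of workers. CRC-32 is used only because it ships with Python, and the function names are hypothetical. Whether threads give a real speedup depends on the checksum routine releasing the interpreter lock; a process pool is the usual alternative.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def split_fixed_size(data: bytes, chunk_size: int):
    """Chunks of chunk_size bytes (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def split_fixed_count(data: bytes, count: int):
    """At most count chunks of (nearly) equal size."""
    size = max(1, -(-len(data) // count))  # ceiling division
    return split_fixed_size(data, size)


def chunk_checksums(chunks, workers: int = 4):
    """CRC-32 of every chunk, computed on a thread pool, in chunk order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.crc32, chunks))


data = bytes(range(256)) * 4000  # about 1 MB of sample data
print(chunk_checksums(split_fixed_size(data, 1 << 18)))
print(chunk_checksums(split_fixed_count(data, 4)))
```

The list of per-chunk checksums is what gets stored; it is only meaningful to software that uses the same chunking rule.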

Hash function combining - is there a significant decrease in collision risk?

Does anyone know if there's a real benefit regarding decreasing collision probability by combining hash functions? I especially need to know this regarding 32 bit hashing, namely combining Adler32 and CRC32.
Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)?
The last comment here gives some test results in favor of combining, but no source is mentioned.
For my purpose, collision is not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible.
PS: I'm just starting in the wonderful world of hashing, doing a lot of reading about it. Sorry if I asked a silly question, I haven't even acquired the proper "hash dialect" yet, probably my Google searches regarding this were also poorly formed.
Thanks.
It doesn't make sense to combine them in series like that. You are hashing one 32-bit space to another 32-bit space.
In the case of a crc32 collision in the first step, the final result is still a collision. Then you add on any potential collisions in the adler32 step. So it can not get any better, and can only be the same or worse.
To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:
adler32(data) << 32 | crc32(data)
Whether there is significant benefit in doing that, I'm not sure.
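Both variants can be compared with a short sketch. The encoding of the CRC as 4 big-endian bytes before feeding it to Adler-32 is an assumption (the question does not say how the value is passed on), and combined64 and in_series are hypothetical names.

```python
import zlib


def combined64(data: bytes) -> int:
    """64-bit value with Adler-32 in the high half and CRC-32 in the low half."""
    return (zlib.adler32(data) << 32) | zlib.crc32(data)


def in_series(data: bytes) -> int:
    """adler32(crc32(data)), feeding the CRC in as 4 big-endian bytes
    (the byte encoding is an assumption of this sketch)."""
    return zlib.adler32(zlib.crc32(data).to_bytes(4, "big"))


inputs = [str(i).encode("ascii") for i in range(50000)]
distinct_crc = len({zlib.crc32(item) for item in inputs})
distinct_series = len({in_series(item) for item in inputs})
distinct_combined = len({combined64(item) for item in inputs})

# Series composition can only merge values, never separate them.
print(distinct_crc, distinct_series, distinct_combined)
```

The series version is in fact noticeably worse than a plain CRC-32: over a 4-byte input, the low half of Adler-32 is at most 1 + 4*255 = 1021, so most 32-bit values can never be produced.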
Note that the original comment you referred to was storing the hashes independently:
Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Adler32 for each url, the odds of a simultaneous collision for both hashes for any given pair of urls is vastly reduced.
Of course that means storing twice as much information which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10kb or so) whilst giving almost the same lookup performance (15 microsecs/lookup compared to 5 microsecs) as Perl's hashes.