Can one construct a "good" hash function using CRC32C as a base?

Given that SSE 4.2 (Intel Core i7 & i5 parts) includes a CRC32 instruction, it seems reasonable to investigate whether one could build a faster general-purpose hash function. According to this, only 16 bits of a CRC32 are evenly distributed. So what other transformation would one apply to overcome that?
Update
How about this? Only 16 bits are suitable for a hash value. Fine. If your table has 65,535 entries or fewer, then great. If not, run the CRC value through the Nehalem POPCNT (population count) instruction to get the number of bits set, then use that as an index into an array of tables. This works if your table is south of 1 million entries. I'd bet that's cheaper/faster than the best-performing hash functions. Now that GCC 4.5 has a CRC32 intrinsic it should be easy to test... if only I had the copious spare time to work on it.
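A rough, untested sketch of the idea, assuming SSE 4.2 intrinsics and a GCC/Clang-style __builtin_popcount (the pick_bucket name and the table layout are purely illustrative):

    /* Compile with e.g. gcc -O2 -msse4.2 */
    #include <stdint.h>
    #include <stddef.h>
    #include <nmmintrin.h>          /* _mm_crc32_* intrinsics */

    uint32_t crc32c(const uint8_t *p, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        while (len--)
            crc = _mm_crc32_u8(crc, *p++);
        return crc ^ 0xFFFFFFFFu;
    }

    /* The low 16 bits pick a slot; the population count of the full
       CRC (0..32) picks one of 33 sub-tables. */
    size_t pick_bucket(const uint8_t *key, size_t len)
    {
        uint32_t crc = crc32c(key, len);
        unsigned sub = __builtin_popcount(crc);
        return (size_t)sub * 65536u + (crc & 0xFFFFu);
    }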
David

Revisited, August 2014
Prompted by Arnaud Bouchez in a recent comment, and in view of other answers and comments, I acknowledge that the original answer needs to be altered or, at the very least, qualified. I left the original as-is, at the end, for reference.
First, and maybe most important, a fair answer to the question depends on the intended use of the hash code: what does one mean by a "good" hash function? Where and how will the hash be used? (E.g., is it for hashing a relatively short input key? Is it for indexing/lookup purposes, for producing message digests, or for yet other uses? How long is the desired hash code itself: all 32 bits [of CRC32 or derivatives thereof], more bits, fewer, etc.?)
The OP's question calls for "a faster general-purpose hash function", so the focus is on SPEED (something less CPU-intensive and/or something which can make use of parallel processing of various kinds). We may note here that the computation time for the hash code itself is often only part of the problem in an application of hashing (for example if the size of the hash code or its intrinsic characteristics result in many collisions which require extra cycles to deal with). Also, the requirement for "general purpose" leaves many questions as to the possible uses.
With this in mind, a short and better answer is, maybe:
Yes, the hardware implementations of CRC32C on newer Intel processors can be used to build faster hash codes; beware, however, that depending on the specific implementation of the hash and on its application, the overall results may be sub-optimal because of the frequency of collisions or the need to use longer codes. Also, of course, cryptographic uses of the hash should be carefully vetted, because the CRC32 algorithm itself is very weak in this regard.
The original answer cited an article on evaluating hash functions by Bret Mulvey, and, as pointed out in Mdlg's answer, the conclusions of that article are erroneous in regard to CRC32, because the implementation of CRC32 it was based on was buggy/flawed. Despite this major error in regard to CRC32, the article provides useful guidance as to the properties of hash algorithms in general. The URL to this article is now defunct; I found it on archive.today, but I don't know whether the author has it at another location or whether he has updated it.
Other answers here cite CityHash 1.0 as an example of a hash library that uses CRC32C. Apparently, this is used in the context of some longer (than 32 bits) hash codes, but not for the CityHash32() function itself. Also, the use of CRC32 by the CityHash functions is relatively small compared with all the shifting, shuffling, and other operations that are performed to produce the hash code. (This is not a critique of CityHash, with which I have no hands-on experience. I'll go out on a limb, from a cursory review of the source code, and say that the CityHash functions produce good, i.e. well-distributed, codes, but are not significantly faster than various other hash functions.)
Finally, you may also find insight on this issue in a quasi-duplicate question on SO.
Original answer and edit (April 2010)
A priori, this sounds like a bad idea!
CRC32 was not designed for hashing purposes, and its distribution is likely to not be uniform, hence making it a relatively poor hash-code. Furthermore, its "scrambling" power is relatively weak, making for a very poor one-way hash, as would be used in cryptographic applications.
[BRB: I'm looking for online references to that effect...]
Google's first [keywords = CRC32 distribution] hit seems to confirm this:
Evaluating CRC32 for hash tables
Edit: The page cited above, and indeed the complete article provides a good basis of what to look for in Hash functions.
Reading [quickly] through this article confirmed the blanket statement that in general CRC32 should not be used as a hash; however, depending on the specific purpose of the hash, it may be possible to use, at least in part, a CRC32 as a hash code.
For example the lower (or higher, depending on implementation) 16 bits of the CRC32 code have a relatively even distribution, and, provided that one isn't concerned about the cryptographic properties of the hash code (i.e. for example the fact that similar keys produce very similar codes), it may be possible to build a hash code which uses, say, a concatenation of the lower [or higher] 16 bits for two CRC32 codes produced with the two halves (or whatever division) of the original key.
One would need to run tests to see whether the efficiency of the built-in CRC32 instruction, relative to alternative hash functions, is such that the overhead of calling the instruction twice, splicing the code together, etc. doesn't result in an overall slower function.
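A rough sketch of that splicing idea, reusing the crc32c() helper from the sketch near the top of the page (names are illustrative only, and this is untested):

    #include <stdint.h>
    #include <stddef.h>

    uint32_t crc32c(const uint8_t *p, size_t len);   /* see the earlier sketch */

    /* Concatenate the low 16 bits of two CRC32C values, one per half of the key. */
    uint32_t spliced_crc_hash(const uint8_t *key, size_t len)
    {
        size_t   half = len / 2;
        uint32_t lo   = crc32c(key, half) & 0xFFFFu;
        uint32_t hi   = crc32c(key + half, len - half) & 0xFFFFu;
        return (hi << 16) | lo;
    }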

The article referred to in other answers draws incorrect conclusions based on buggy crc32 code. Google's ranking algorithm does not rank based on scientific accuracy yet.
Contrary to the conclusions of the referred article "Evaluating CRC32 for hash tables", CRC32 and CRC32C are acceptable for hash table use. The author's sample code has a bug in the crc32 table generation. Fixing the crc32 table gives satisfactory results using the same methodology. Also, the speed of the CRC32 instruction makes it the best choice in many contexts. Code using the CRC32 instruction is 16x faster at peak than an optimal software implementation. (Note that CRC32 is not exactly the same as CRC32C, which the Intel instruction implements.)
CRC32 is obviously not suitable for crypto use. (32 bits is a joke to brute-force.)

Yes. CityHash 1.0.1 includes some new "good hash functions" that use CRC32 instructions.

Just so long as you're not after a crypto hash, it just might work.

For cryptographic purposes, CRC32 is a bad foundation because it is linear (over the vector space GF(2)^32), and that is hard to correct. It may work for non-cryptographic purposes.
However, recent Intel cores have the AES-NI instructions, which basically perform 1/10th of an AES block encryption in two clock cycles. They are available on the most recent i5 and i7 processors (see the Wikipedia page for some details). This looks like a good start for building a cryptographic hash function (and a hash function which is good for cryptography will also be good for about anything else).
Indeed, at least one of the SHA-3 "round 2" candidates (the ECHO hash function) is built around the AES elements so that the AES-NI opcodes provide a very substantial performance boost. (Unfortunately, in the absence of AES-NI instruction, ECHO performance somewhat sucks.)

Related

CRC32 vs CRC32C?

What is the difference between CRC32 and CRC32C? I have known about CRC32 for a long time, but just heard of CRC32C today. Are they basically the same method (i.e. do both produce the same hash for a given piece of data)?
The CRC32 found in zip and a lot of other places uses the polynomial 0x04C11DB7; its reversed form 0xEDB88320 is perhaps better known, being often found in little-endian implementations.
CRC32C uses a different polynomial (0x1EDC6F41, reversed 0x82F63B78) but otherwise the computation is the same. The results are different, naturally. This is also known as the Castagnoli CRC32 and most conspicuously found in newer Intel CPUs which can compute a full 32-bit CRC step in 3 cycles. That is the reason why the CRC32C is becoming more popular, since it allows advanced implementations that effectively process one 32-bit word per cycle despite the three-cycle latency (by processing 3 streams of data in parallel and using linear algebra to combine the results).
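To make the "otherwise the computation is the same" point concrete, here is a minimal bit-at-a-time sketch in C; only the (reflected) polynomial constant differs between the two variants:

    #include <stdint.h>
    #include <stddef.h>

    #define POLY_CRC32   0xEDB88320u   /* reversed 0x04C11DB7 (zip/zlib) */
    #define POLY_CRC32C  0x82F63B78u   /* reversed 0x1EDC6F41 (Castagnoli) */

    uint32_t crc_reflected(const uint8_t *p, size_t len, uint32_t poly)
    {
        uint32_t crc = 0xFFFFFFFFu;
        while (len--) {
            crc ^= *p++;
            for (int k = 0; k < 8; k++)
                crc = (crc & 1) ? (crc >> 1) ^ poly : (crc >> 1);
        }
        return crc ^ 0xFFFFFFFFu;
    }

    /* crc_reflected(buf, n, POLY_CRC32)  -> the familiar zip CRC-32
       crc_reflected(buf, n, POLY_CRC32C) -> CRC-32C, what the SSE 4.2
                                             instruction computes */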

Checksumming: CRC or hash?

Performance and security considerations aside, and assuming a hash function with a perfect avalanche effect, which should I use for checksumming blocks of data: CRC32 or a hash truncated to N bytes? I.e. which will have a smaller probability of missing an error? Specifically:
CRC32 vs. 4-byte hash
CRC32 vs. 8-byte hash
CRC64 vs. 8-byte hash
Data blocks are to be transferred over network and stored on disk, repeatedly. Blocks can be 1KB to 1GB in size.
As far as I understand, CRC32 can detect any error burst of up to 32 bits with 100% reliability, but beyond that its reliability approaches 1-2^(-32), and for some patterns is much worse. A perfect 4-byte hash's reliability is always 1-2^(-32), so go figure.
An 8-byte hash should have a much better overall reliability (a 2^(-64) chance of missing an error), so should it be preferred over CRC32? What about CRC64?
I guess the answer depends on the type of errors that might be expected in this sort of operation. Are we likely to see sparse 1-bit flips or massive block corruptions? Also, given that most storage and networking hardware implements some sort of CRC, shouldn't accidental bit flips be taken care of already?
Only you can say whether 1-2^(-32) is good enough or not for your application. The error detection performance between a CRC-n and n bits from a good hash function will be very close to the same, so pick whichever one is faster. That is likely to be the CRC-n.
Update:
The above "That is likely to be the CRC-n" is only somewhat likely. It is not so likely if very high performance hash functions are used. In particular, CityHash appears to be very nearly as fast as a CRC-32 calculated using the Intel crc32 hardware instruction! I tested three CityHash routines and the Intel crc32 instruction on a 434 MB file. The crc32 instruction version (which computes a CRC-32C) took 24 ms of CPU time. CityHash64 took 55 ms, CityHash128 60 ms, and CityHashCrc128 50 ms. CityHashCrc128 makes use of the same hardware instruction, though it does not compute a CRC.
In order to get the CRC-32C calculation that fast, I had to get fancy with three crc32 instructions on three separate buffers in order to make use of the three arithmetic logic units in parallel in a single core, and then write the inner loop in assembler. CityHash is pretty damned fast. If you don't have the crc32 instruction, then you would be hard-pressed to compute a 32-bit CRC as fast as a CityHash64 or CityHash128.
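For reference, a plain single-stream version of the hardware loop might look something like the sketch below (untested here; compile with -msse4.2). The three-stream version described above interleaves three such loops over separate buffers and then merges the partial CRCs with a GF(2) matrix step, which is not shown:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <nmmintrin.h>

    uint32_t crc32c_hw(const uint8_t *p, size_t len)
    {
        uint64_t crc = 0xFFFFFFFFu;
        while (len >= 8) {                 /* 8 bytes per crc32q */
            uint64_t word;
            memcpy(&word, p, sizeof word);
            crc = _mm_crc32_u64(crc, word);
            p += 8;
            len -= 8;
        }
        while (len--)                      /* byte-at-a-time tail */
            crc = _mm_crc32_u8((uint32_t)crc, *p++);
        return (uint32_t)crc ^ 0xFFFFFFFFu;
    }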
Note however that the CityHash functions would need to be modified for this purpose, or an arbitrary choice would need to be made in order to define a consistent meaning for the CityHash value on large streams of data. The reason is that those functions are not set up to accept buffered data, i.e. feeding the functions a chunk at a time and expecting to get the same result as if the entire set of data were fed to the function at once. The CityHash functions would need to be modified to update an intermediate state.
The alternative, and what I did for the quick and dirty testing, is to use the Seed versions of the functions where I would use the CityHash from the previous buffer as the seed for the next buffer. The problem with that is that the result is then dependent on the buffer size. If you feed CityHash different size buffers with this approach, you get different hash values.
Another Update four years later:
Even faster is the xxhash family. I would now recommend that over a CRC for a non-cryptographic hash.
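For what it's worth, xxHash exposes exactly the kind of incremental interface the CityHash discussion above was missing: the result does not depend on how the input is chunked. A minimal sketch, assuming the xxhash.h header is available:

    #include <stdint.h>
    #include <stddef.h>
    #include "xxhash.h"

    uint64_t stream_xxh64(const uint8_t *data, size_t len, size_t chunk)
    {
        XXH64_state_t *st = XXH64_createState();
        XXH64_reset(st, 0);                        /* seed = 0 */
        while (len > 0) {
            size_t n = len < chunk ? len : chunk;  /* feed chunk by chunk */
            XXH64_update(st, data, n);
            data += n;
            len  -= n;
        }
        uint64_t h = XXH64_digest(st);
        XXH64_freeState(st);
        return h;
    }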
Putting aside "performance" issues, you might want to consider using one of the SHA-2 functions (say SHA-256).

efficient hash function for uris

I am looking for a hash function to build a (global) fixed-size ID for
strings, most of them URIs.
It should be:
fast
low chance of collision
~64 bits
exploiting the structure of a URI, if that is possible?
Would http://murmurhash.googlepages.com/ be a good choice, or is there anything better suited?
Try MD4. As far as cryptography is concerned, it is "broken", but since you do not have any security concern (you want a 64-bit output size, which is too small to yield any decent security against collisions), that should not be a problem. MD4 yields a 128-bit value, which you just have to truncate to the size you wish.
Cryptographic hash functions are designed for resilience to explicit attempts at building collisions. Conceivably, one can build a faster function by relaxing that condition (it is easier to beat random collisions than a determined attacker). There are a few such functions, e.g. MurmurHash. However, it may take a quite specific setup to actually notice the speed difference. With my home PC (a 2.4 GHz Core2), I can hash about 10 million short strings per second with MD4, using a single CPU core (I have four cores). For MurmurHash to be faster than MD4 in a non-negligible way, it would have to be used in a context involving at least one million hash invocations per second. That does not happen very often...
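A minimal sketch of the truncation idea above, assuming OpenSSL's (legacy) MD4() is available; the uri_id64 name is just for illustration:

    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <openssl/md4.h>   /* legacy; may be disabled in newer OpenSSL builds */

    uint64_t uri_id64(const char *uri, size_t len)
    {
        unsigned char digest[MD4_DIGEST_LENGTH];   /* 16 bytes */
        MD4((const unsigned char *)uri, len, digest);

        uint64_t id;
        memcpy(&id, digest, sizeof id);            /* keep the first 64 bits */
        return id;
    }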
I'd wait a little longer for MurmurHash3 to be finalized, then use that. The 128-bit version should give you adequate collision protection against the birthday paradox.

How much more likely are hash collisions if I hash a bunch of hashes?

Say I'm using a hash to identify files, so I don't need it to be secure, I just need to minimize collisions. I was thinking that I could speed the hash up by running four hashes in parallel using SIMD and then hashing the final result. If the hash is designed to take a 512-bit block, I just step through the file taking 4x512 bit blocks at one go and generate four hashes out of that; then at the end of the file I hash the four resulting hashes together.
I'm pretty sure that this method would produce poorer hashes... but how much poorer? Any back of the envelope calculations?
The idea that you can read blocks of the file from disk quicker than you can hash them is, well, an untested assumption? Disk IO - even SSD - is many orders of magnitude slower than the RAM that the hashing is going through.
Ensuring low collisions is a design criteria for all hashes, and all mainstream hashes do a good job of it - just use a mainstream hash e.g. MD5.
Specific to the solution the poster is considering, it's not a given that parallel hashing weakens the hash. There are hashes specifically designed for parallel hashing of blocks and combining the results as the poster describes, although perhaps not yet in widespread adoption (e.g. MD6, which was withdrawn unbroken from the SHA-3 competition).
More generally, there are mainstream implementations of hashing functions that do use SIMD. Hashing implementers are very performance-aware, and do take time to optimise their implementations; you'd have a hard job equaling their effort. The best software for strong hashing is around 6 to 10 cycles / byte. Hardware accelerated hashing is also available if hashing is the real bottleneck.
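To make the scheme from the question concrete, here is a rough single-threaded sketch (using OpenSSL's legacy SHA-256 interface purely for illustration; the lane count, block size, and function name are arbitrary choices, and a real implementation would run the four lanes with SIMD in one pass):

    #include <stdint.h>
    #include <stddef.h>
    #include <openssl/sha.h>

    void lane_split_sha256(const uint8_t *data, size_t len, uint8_t out[32])
    {
        SHA256_CTX lane[4];
        for (int i = 0; i < 4; i++)
            SHA256_Init(&lane[i]);

        /* Deal the input round-robin into four lanes, 64 bytes at a time. */
        size_t off = 0;
        int which = 0;
        while (off < len) {
            size_t n = (len - off < 64) ? len - off : 64;
            SHA256_Update(&lane[which], data + off, n);
            off += n;
            which = (which + 1) & 3;
        }

        /* Finalize the four lanes, then hash the four digests together. */
        uint8_t digests[4 * 32];
        for (int i = 0; i < 4; i++)
            SHA256_Final(digests + 32 * i, &lane[i]);

        SHA256(digests, sizeof digests, out);
    }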

Hash function combining - is there a significant decrease in collision risk?

Does anyone know if there's a real benefit regarding decreasing collision probability by combining hash functions? I especially need to know this regarding 32 bit hashing, namely combining Adler32 and CRC32.
Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)?
The last comment here gives some test results in favor of combining, but no source is mentioned.
For my purpose, collision is not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible.
PS: I'm just starting out in the wonderful world of hashing, doing a lot of reading about it. Sorry if I asked a silly question; I haven't even acquired the proper "hash dialect" yet, and probably my Google searches regarding this were also poorly formed.
Thanks.
It doesn't make sense to combine them in series like that. You are hashing one 32-bit space to another 32-bit space.
In the case of a crc32 collision in the first step, the final result is still a collision. Then you add on any potential collisions in the adler32 step. So it cannot get any better; it can only be the same or worse.
To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:
adler32(data) << 32 | crc32(data)
Whether there is significant benefit in doing that, I'm not sure.
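For what it's worth, with zlib's adler32() and crc32() the 64-bit combination above is only a few lines; a minimal sketch:

    #include <stdint.h>
    #include <stddef.h>
    #include <zlib.h>   /* adler32(), crc32() */

    uint64_t adler_crc64(const unsigned char *data, size_t len)
    {
        /* adler32(0, Z_NULL, 0) and crc32(0, Z_NULL, 0) give the initial values. */
        uint32_t a = (uint32_t)adler32(adler32(0L, Z_NULL, 0), data, (uInt)len);
        uint32_t c = (uint32_t)crc32(crc32(0L, Z_NULL, 0), data, (uInt)len);
        return ((uint64_t)a << 32) | c;
    }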
Note that the original comment you referred to was storing the hashes independently:
Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Alder32 for each url, the odds of a simultaneous collision for both hashes for any given pair of urls is vastly reduced.
Of course that means storing twice as much information which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10kb or so) whilst giving almost the same lookup performance (15 microsecs/lookup compared to 5 microsecs) as Perl's hashes.