How well do Non-cryptographic hashes detect errors in data vs. CRC-32 etc.?

How well do Non-cryptographic hashes detect errors in data vs. CRC-32 etc.? - hash

Non-cryptographic hashes such as MurmurHash3 and xxHash are almost exclusively designed for hash tables, but they appear to function comparably (and even favorably) to CRC-32, Adler-32 and Fletcher-32. Non-crypto hashes are often faster than CRC-32 and produce more "random" output similar to slow cryptographic hashes (MD5, SHA). Despite this, I only ever see CRC-32 or MD5 recommended for data integrity/checksum purposes.
In the table below, I tested 32-bit checksum/CRC/hash functions to determine how well they detect small differences in data:
The results in each cell means: A) number of collisions found, and B) minimum and maximum probability that any of the 32 output bits are set to 1. To pass test B, the max and min should be as close as possible to 50. Anything under 45 or over 55 indicates bias.
Looking at the table, MurmurHash3 and Jenkins lookup2 compare favorably to CRC-32 (which actually fails one test). They are also well-distributed. DJB2 and FNV1a pass collision tests but aren't well distributed. Fletcher32 and Adler32 struggle with the NullBytes and 8RandBytes tests.
So then my question is, compared to other checksums, how suitable are 'non-cryptographic hashes' for detecting errors or differences in files? Is there any reason a CRC-32/Adler-32/CRC-64 might outperform any decent 32-bit/64-bit hash?

Is there any reason this function would be inferior to CRC-32 or
Adler-32 for detecting errors in data?
Yes, for certain kinds of error characteristics. A CRC can be designed to very effectively detect small numbers of bit errors in a packet, as you might expect on an actual communications or storage channel. That's what it's designed for.
For large numbers of errors, any 32-bit check that fills the 32 bits and does a reasonably good job of being sensitive to all of the bits of the packet will work about as well as any other. So your's would be as good as a CRC-32, and a smidge better than an Adler-32. (The Adler-32 deliberately does not use all possible 32-bit values, so has a slightly higher false positive rate than 32-bit checks that use all possible values.)
By the way, looking a little more at your algorithm, it does not distribute over all 32-bit values until you have many bytes of input. So your check would not be as good as any other 32-bit check on a large number of errors until you have covered the possible 32-bit values of the check.

Related

Can a CRC32 engine be used for computing CRC16 hashes?

I'm working with a microcontroller with native HW functions to calculate CRC32 hashes from chunks of memory, where the polynomial can be freely defined. It turns out that the system has different data-links with different bit-lengths for CRC, like 16 and 8 bit, and I intend to use the hardware engine for it.
In simple tests with online tools I've concluded that it is possible to find a 32-bit polynomial that has the same result of a 8-bit CRC, example:
hashing "a sample string" with 8-bit engine and poly 0xb7 yelds a result 0x97
hashing "a sample string" with 16-bit engine and poly 0xb700 yelds a result 0x9700
...32-bit engine and poly 0xb7000000 yelds a result 0x97000000
(with zero initial value and zero final xor, no reflections)
So, padding the poly with zeros and right-shifting the results seems to work.
But is it 'always' possible to find a set of parameters that make 32-bit engines to work as 16 or 8 bit ones? (including poly, final xor, init val and inversions)
To provide more context and prevent 'bypass answers' like 'dont't use the native engine': I have a scenario in a safety critical system where it's necessary to prevent a common design error from propagating to redundant processing nodes. One solution for that is having software-based CRC calculation in one node, and hardware-based in its pair.

Yes, what you're doing will work in general for CRCs that are not reflected. The pre and post conditioning can be done very simply with code around the hardware instructions loop.
Assuming that the hardware CRC doesn't have an option for this, to do a reflected CRC you would need to reflect each input byte, and then reflect the final result. That may defeat the purpose of using a hardware CRC. (Though if your purpose is just to have a different implementation, then maybe it wouldn't.)

You don't have to guess. You can calculate it. Because CRC is a remainder of a division by an irreducible polynomial, it's a 1-to-1 function on its domain.
So, CRC16, for example, has to produce 65536 (64k) unique results if you run it over 0 through 65536.
To see if you get the same outcome by taking parts of CRC32, run it over 0 through 65535, keep the 2 bytes that you want to keep, and then see if there is any collision.
If your data has 32 bits in it, then it should not be an issue. The issue arises if you have less than 32 bit numbers and you shuffle them around in a 32-bit space. Their 1st and last byte are not guaranteed to be uniformly distributed.

How do I truncate a 64-bit hash into a 32-bit hash? [duplicate]

We're trying to settle an internal debate on our dev team:
We're looking for a 64-bit PHP hash function. We found a PHP implementation of MurmurHash3, but MurmurHash3 is either 32-bit or 128-bit, not 64-bit.
Co-worker #1 believes that to produce a 64-bit hash from MurmurHash3, we can simply slice the first (or last, or any) 64 bits of the 128-bit hash and that it will be as collision-proof as a native 64-bit hash function.
Co-worker #2 believes that we must find a native 64-bit hash function to reduce collisions and that 64-bit slices of a 128-bit hash will not be as collision proof as a native 64-bit hash.
Who's correct?
Does the answer change if we take the first (or last, or any) 64-bits of a cryptographic hash like SHA1 instead of Murmur3?

If you had real random, uniformly distributed values, then "slicing" would yield exactly the same results as if you had started with the smaller value right from the start. To see why, consider this very simple example: Let's say your random generator outputs 3 random bits, but you only need one random bit to work with. Let's assume the output is
b1 b2 b3
The possible values are
000, 001, 010, 011, 100, 101, 110, 111
and all are to occur with equal probability of 1/8. Now whatever bit you slice from those three for your purpose - the first, second or third - the probability of having a '1' is always going to be 1/2, regardless of the position - and the same is true for a '0'.
You can easily scale this experiment to the 64 out of 128 bit case: regardless of which bits you slice, the probability of ending up with a one or a zero in a certain position is going to be one half. What this means is that if you had a sample taken from a uniformly distributed random variable, then slicing wouldn't make the probability for collisions more or less likely.
Now a good question is whether random functions are really the best we can do to prevent collisions. But as it turns out, it can be shown that the probability of finding collisions increases whenever a function deviates from random.
Cryptographic hash functions: co-worker #1 wins
The problem in real life is that hash functions are not random at all, on the contrary, they are boringly deterministic. But a design goal of cryptographic hash functions is as follows: if we didn't know their initial state, then their output would be computationally indistinguishable from a real random function, that is there's no computationally efficient way to tell the difference between the hash output and real random values. This is why you'd consider a hash already as kind of broken if you can find a "distinguisher", a method to tell the hash from real random values with a probability higher than one half. Unfortunately, we can't really prove these properties for existing cryptographic hashes, but unless somebody breaks them, we may assume these properties hold with some confidence. Here is an example of a paper about a distinguisher for one of the SHA-3 submissions that illustrates the process.
To summarize, unless a distinguisher is found for a given cryptographic hash, slicing is perfectly fine and does not increase the probability of a collision.
Non-cryptographic hash functions: co-worker #2 might win
Non-cryptographic hashes do not have to satisfy the same set of requirements as cryptographic hashes do. They are usually defined to be very fast and satisfy certain properties "under sane/benevolent conditions", but they might easily fall short if somebody tries to maliciously manipulate them. A good example for what this means in practice is the computational complexity attack on hash table implementations (hashDoS) presented earlier this year. Under normal conditions, non-crypto hashes work perfectly fine, but their collision resistance may be severely undermined by some clever inputs. This can't happen with cryptographic hash functions, because their very definition requires them to be immune to all sorts of clever inputs.
Because it is possible, sometimes even quite easy, to find a distinguisher like above for the output of non-cryptographic hashes, we can immediately say that they do not qualify as cryptographic hash functions. Being able to tell the difference means that somewhere there is a pattern or bias in the output.
And this fact alone implies that they deviate more or less from a random function, and thus (after what we said above) collisions are probably more likely than they would be for random functions. Finally, since collisions occur with higher probability for the full 128 bits already, this will not get better with shorter ouptputs, collisions will probably be even more likely in that case.
tl;dr You're safe with a cryptographic hash function when truncating it. But you're better off with a "native" 64 bit cryptographic hash function compared to truncating a non-cryptographic hash with a larger output to 64 bits.

Due to the avalanche effect, a strong hash is one where a single bit of change in the source results in half the bits of the hash flipping on average. For a good hash, then, the "hashness" is evenly distributed, and so each section or slice is affected by an equal and evenly distributed amount of source bits, and therefore is just as strong as any other slice of the same bit length could be.
I would agree with co-worker 1 as long as the hash has good properties and even distribution.

This question seems incomplete without this being mentioned:
Some hashes are provably perfect hashes for a specific class of inputs (eg., for input of length n for some reasonable value of n). If you truncate that hash then you are likely to destroy that property, in which case you are, by definition, increasing the rate of collisions from zero to non-zero and you have weakened the hash in that use case.
It's not the general case, but it's an example of a legitimate concern when truncating hashes.

Checksumming: CRC or hash?

Performance and security considerations aside, and assuming a hash function with a perfect avalanche effect, which should I use for checksumming blocks of data: CRC32 or hash truncated to N bytes? I.e. which will have a smaller probability to miss an error? Specifically:
CRC32 vs. 4-byte hash
CRC32 vs. 8-byte hash
CRC64 vs. 8-byte hash
Data blocks are to be transferred over network and stored on disk, repeatedly. Blocks can be 1KB to 1GB in size.
As far as I understand, CRC32 can detect up to 32 bit flips with 100% reliability, but after that its reliability approaches 1-2^(-32) and for some patterns is much worse. A perfect 4-byte hash reliability is always 1-2^(-32), so go figure.
8-byte hash should have a much better overall reliability (2^(-64) chance to miss an error), so should it be preferred over CRC32? What about CRC64?
I guess the answer depends on type of errors that might be expected in such sort of operation. Are we likely to see sparse 1-bit flips or massive block corruptions? Also, given that most storage and networking hardware implements some sort of CRC, should not accidental bit flips be taken care of already?

Only you can say whether 1-2-32 is good enough or not for your application. The error detection performance between a CRC-n and n bits from a good hash function will be very close to the same, so pick whichever one is faster. That is likely to be the CRC-n.
Update:
The above "That is likely to be the CRC-n" is only somewhat likely. It is not so likely if very high performance hash functions are used. In particular, CityHash appears to be very nearly as fast as a CRC-32 calculated using the Intel crc32 hardware instruction! I tested three CityHash routines and the Intel crc32 instruction on a 434 MB file. The crc32 instruction version (which computes a CRC-32C) took 24 ms of CPU time. CityHash64 took 55 ms, CityHash128 60 ms, and CityHashCrc128 50 ms. CityHashCrc128 makes use of the same hardware instruction, though it does not compute a CRC.
In order to get the CRC-32C calculation that fast, I had to get fancy with three crc32 instructions on three separate buffers in order to make use of the three arithmetic logic units in parallel in a single core, and then writing the inner loop in assembler. CityHash is pretty damned fast. If you don't have the crc32 instruction, then you would be hard-pressed to compute a 32-bit CRC as fast as a CityHash64 or CityHash128.
Note however that the CityHash functions would need to be modified for this purpose, or an arbitrary choice would need to be made in order to define a consistent meaning for the CityHash value on large streams of data. The reason is that those functions are not set up to accept buffered data, i.e. feeding the functions a chunk at a time and expecting to get the same result as if the entire set of data were fed to the function at once. The CityHash functions would need to modified to update an intermediate state.
The alternative, and what I did for the quick and dirty testing, is to use the Seed versions of the functions where I would use the CityHash from the previous buffer as the seed for the next buffer. The problem with that is that the result is then dependent on the buffer size. If you feed CityHash different size buffers with this approach, you get different hash values.
Another Update four years later:
Even faster is the xxhash family. I would now recommend that over a CRC for a non-cryptographic hash.

Putting aside "performance" issues; you might want to consider using one of the SHA-2 functions (say SHA-256).

choosing a hash function

I was wondering: what are maximum number of bytes that can safely be hashed while maintaining the expected collision count of a hash function?
For md5, sha-*, maybe even crc32 or adler32.

Your question isn't clear. By "maximum number of bytes" you mean "maximum number of items"? The size of the files being hashed has no relation with the number of collisions (assuming that all files are different, of course).
And what do you mean by "maintaining the expected collision count"? Taken literally, the answer is "infinite", but after a certain number you will aways have collisions, as expected.
As for the answer to the question "How many items I can hash while maintaining the probability of a collision under x%?", take a look at the following table:
http://en.wikipedia.org/wiki/Birthday_problem#Probability_table
From the link:
For comparison, 10^-18 to 10^-15 is the uncorrectable bit error rate of a typical hard disk [2]. In theory, MD5, 128 bits, should stay within that range until about 820 billion documents, even if its possible outputs are many more.
This assumes a hash function that outputs a uniform distribution. You may assume that, given enough items to be hashed and cryptographic hash functions (like md5 and sha) or good hashes (like Murmur3, Jenkins, City, and Spooky Hash).
And also assumes no malevolent adversary actively fabricating collisions. Then you really need a secure cryptographic hash function, like SHA-2.
And be careful: CRC and Adler are checksums, designed to detect data corruption, NOT minimizing expected collisions. They have proprieties like "detect all bit zeroing of sizes < X or > Y for inputs up to Z kbytes", but not as good statistical proprieties.
EDIT: Don't forget this is all about probabilities. It is entirely possible to hash only two files smaller than 0.5kb and get the same SHA-512, though it is extremely unlikely (no collision has ever been found for SHA hashes till this date, for example).

You are basically looking at the Birthday paradox, only looking at really big numbers.
Given a normal 'distribution' of your data, I think you could go to about 5-10% of the amount of possibilities before running into issues, though nothing is guaranteed.
Just go with a long enough hash to not run into problems ;)

Uniquely identifying URLs with one 64-bit number

This is basically a math problem, but very programing related: if I have 1 billion strings containing URLs, and I take the first 64 bits of the MD5 hash of each of them, what kind of collision frequency should I expect?
How does the answer change if I only have 100 million URLs?
It seems to me that collisions will be extremely rare, but these things tend to be confusing.
Would I be better off using something other than MD5? Mind you, I'm not looking for security, just a good fast hash function. Also, native support in MySQL is nice.
EDIT: not quite a duplicate

If the first 64 bits of the MD5 constituted a hash with ideal distribution, the birthday paradox would still mean you'd get collisions for every 2^32 URL's. In other words, the probability of a collision is the number of URL's divided by 4,294,967,296. See http://en.wikipedia.org/wiki/Birthday_paradox#Cast_as_a_collision_problem for details.
I wouldn't feel comfortable just throwing away half the bits in MD5; it would be better to XOR the high and low 64-bit words to give them a chance to mix. Then again, MD5 is by no means fast or secure, so I wouldn't bother with it at all. If you want blinding speed with good distribution, but no pretence of security, you could try the 64-bit versions of MurmurHash. See http://en.wikipedia.org/wiki/MurmurHash for details and code.

You have tagged this as "birthday-paradox", I think you know the answer already.
P(Collision) = 1 - (2^64)!/((2^64)^n (1 - n)!)
where n is 1 billion in your case.
You will be a bit better using something other then MD5, because MD5 have pratical collusion problem.

From what I see, you need a hash function with the following requirements,
Hash arbitrary length strings to a 64-bit value
Be good -- Avoid collisions
Not necessarily one-way (security not required)
Preferably fast -- which is a necessary characteristic for a non-security application
This hash function survey may be useful for drilling down to the function most suitable for you.
I will suggest trying out multiple functions from here and characterizing them for your likely input set (pick a few billion URL that you think you will see).
You can actually generate another column like this test survey for your test URL list to characterize and select from the existing or any new hash functions (more rows in that table) that you might want to check. They have MSVC++ source code to start with (reference to ZIP link).
Changing the hash functions to suit your output width (64-bit) will give you a more accurate characterization for your application.

If you have 2^n hash possibilities, there's over a 50% chance of collision when you have 2^(n/2) items.
E.G. if your hash is 64 bits, you have 2^64 hash possibilities, you'd have a 50% chance of collision if you have 2^32 items in a collection.

Just by using a hash, there is always a chance of collisions. And you don't know beforehand wether collisions will happen once or twice, or even hundreds or thousands of times in your list of urls.
The probability is still just a probability. Its like throwing a dice 10 or 100 times, what are the chances of getting all sixes? The probability says it is low, but it still can happen. Maybe even many times in a row...
So while the birthday paradox shows you how to calculate the probabilities, you still need to decide if collisions are acceptable or not.
...and collisions are acceptable, and hashes are still the right way to go; find a 64 bit hashing algorithm instead of relying on "half-a-MD5" having a good distribution. (Though it probably has...)