i am looking for a hash function to build a (global) fixed size id for
strings, most of them URIs.
it should be:
fast
low chance of collision
~ 64bit
exploiting the structure of an uri if that is possible?
would http://murmurhash.googlepages.com/ be a good choice or is there anything better suited?
Try MD4. As far as cryptography is concerned, it is "broken", but since you do not have any security concern (you want a 64-bit output size, which is too small to yield any decent security against collisions), that should not be a problem. MD4 yields a 128-bit value, which you just have to truncate to the size you wish.
Cryptographic hash functions are designed for resilience to explicit attempts at building collisions. Conceivably, one can build a faster function by relaxing that condition (it is easier to beat random collisions than a determinate attacker). There are a few such functions, e.g. MurmurHash. However it may take a quite specific setup to actually notice the speed difference. With my home PC (a 2.4 GHz Core2), I can hash about 10 millions of short strings per second with MD4, using a single CPU core (I have four cores). For MurmurHash to be faster than MD4 in a non-negligible way, it would have to be used in a context involving at least one million hash invocations per second. That does not happen very often...
I'd wait a little longer for MurmurHash3 to be finalized, then use that. The 128-bit version should give you adequate collision protection against the birthday paradox.
Related
I wondered a time ago why no technology exists to equalize the hash creation speed across different cpu's/gpu's. I have no idea if this is feasible or not, that's why I ask this question here. The idea behind this is to make the proof of work just between two parties with each a 50% chance to create the winning hash (equal hashing speed!). In combination with an easier to find nonce, this solution is energy friendlier than existing proof of work technologies, while the desired goal is still met.
This is more or less impossible for the simple reason that a faster machine is just … faster. If one of the two parties buys a faster machine, then they will compute the hash faster. That's just the way it is.
However, there is something we can do. Bitcoin, for example, is based on SHA-256 (the 256 bit long version of SHA-2). SHA-2 is specifically designed to be fast, and to be easy to speed up by specialized hardware. And that is exactly what we see happen in the Bitcoin mining space with the move from pure software-based mining to CPUs with built-in acceleration for SHA-2 to GPUs to FPGAs to ASICs.
The reason for this is that SHA-2 is designed as a general cryptographic hash function, and one of the main usage of cryptographic hashes is as the basis for TLS/SSL and digital signatures where large amounts of data need to be hashed in a short amount of time.
But, there are other use cases for cryptographic hash functions, in particular, password hashing. For password hashing, we want the hash function to be slow and hard to speed up, since a legitimate user only needs to hash a very small amount of data (the password) once (when logging in), whereas an attacker needs to hash large numbers of passwords over and over again, for a brute force attack.
Examples for such hash functions are PBKDF2, bcrypt, scrypt, Catena, Lyra2, yescrypt, Makwa, and Argon2 (the latter being the winner of the 2013 Password Hashing Competition). Scrypt in particular is designed to be hard to speed up using GPUs, FPGAs, and ASICs as well as through space-time or time-space tradeoffs. Scrypt uses a cryptographically secure pseudo-random number generator to initialize a huge array of pseudo-random numbers in memory, and then uses another CSPRNG to generate indices for accesses into this array, thus making both the memory contents as well as the memory access patterns pseudo-random.
Theoretically, of course, it would be possible to pre-compute the result, after all, accessing an array in some specific order is the same as accessing a much larger array in linear order, however, scrypt is designed in such a way that this pre-computed array would be prohibitively large. Plus, scrypt has a simple work-factor parameter that can be used to exponentially increase the size of this array, if memory capacity increases. So, trading space for time is not possible.
Likewise, it would be possible to create a PRNG which combines the two pseudo-random processes into one process and computes the results on the fly. However, scrypt is designed in such a way that the computing time for this would be prohibitively long, and again, there is the exponential work-factor which can be used to drastically increase the computing time without changes to the algorithm. So, trading time for space is not possible.
The pseudo-random access pattern to the memory also defeats any sort of branch-prediction, memory prefetching or caching scheme of the CPU.
And lastly, since the large array is a shared global mutable state, and there is no way to sensibly divide the work into independent units, the algorithm is not sensibly parallelizable, which means you can't speed it up using GPUs.
And in fact, some newer cryptocurrencies, smart contracts, blockchains etc. use an scrypt-based proof-of-work scheme.
Note, however, that running scrypt on a faster machine is still faster than running scrypt on a slower machine. There is no way around that. It just means that we cannot get the ridiculous amounts of speedup we get from using specialized hardware for SHA-2, for example. But, designing cryptographic algorithms is hard, and there actually are ASIC-based scrypt miners for Litecoin out there, that do get a significant speedup, however still less than the impressive ones we see for SHA-2 / Bitcoin.
I'm interested in optimizing the hashing of some large files (optimizing wall clock time). The I/O has been optimized well enough already and the I/O device (local SSD) is only tapped at about 25% of capacity, while one of the CPU cores is completely maxed-out.
I have more cores available, and in the future will likely have even more cores. So far I've only been able to tap into more cores if I happen to need multiple hashes of the same file, say an MD5 AND a SHA256 at the same time. I can use the same I/O stream to feed two or more hash algorithms, and I get the faster algorithms done for free (as far as wall clock time). As I understand most hash algorithms, each new bit changes the entire result, and it is inherently challenging/impossible to do in parallel.
Are any of the mainstream hash algorithms parallelizable?
Are there any non-mainstream hashes that are parallelizable (and that have at least a sample implementation available)?
As future CPUs will trend toward more cores and a leveling off in clock speed, is there any way to improve the performance of file hashing? (other than liquid nitrogen cooled overclocking?) or is it inherently non-parallelizable?
There is actually a lot of research going on in this area. The US National Institute of Standards and Technology is currently holding a competition to design the next-generation of government-grade hash function. Most of the proposals for that are parallelizable.
One example: http://www.schneier.com/skein1.2.pdf
Wikipedia's description of current status of the contest: http://en.wikipedia.org/wiki/SHA-3
What kind of SSD do you have ? My C implementation of MD5 runs at 400 MB/s on a single Intel Core2 core (2.4 GHz, not the latest Intel). Do you really have SSD which support a bandwidth of 1.6 GB/s ? I want the same !
Tree hashing can be applied on any hash function. There are a few subtleties and the Skein specification tries to deal with them, integrating some metadata in the function itself (this does not change much things for performance), but the "tree mode" of Skein is not "the" Skein as submitted to SHA-3. Even if Skein is selected as SHA-3, the output of a tree-mode hash would not be the same as the output of "plain Skein".
Hopefully, a standard will be defined at some point, to describe generic tree hashing. Right now there is none. However, some protocols have been defined with support for a custom tree hashing with the Tiger hash function, under the name "TTH" (Tiger Tree Hash) or "THEX" (Tree Hash Exchange Format). The specification for TTH appears to be a bit elusive; I find some references to drafts which have either moved or disappeared for good.
Still, I am a bit dubious about the concept. It is kind of neat, but provides a performance boost only if you can read data faster than what a single core can process, and, given the right function and the right implementation, a single core can hash quite a lot of data per second. A tree hash spread over several cores requires having the data sent to the proper cores, and 1.6 GB/s is not the smallest bandwidth ever.
SHA-256 and SHA-512 are not very fast. Among the SHA-3 candidates, assuming an x86 processor in 64-bit mode, some of them achieve high speed (more than 300 MB/s on my 2.4 GHz Intel Core2 Q6600, with a single core -- that's what I can get out of SHA-1, too), e.g. BMW, SHABAL or Skein. Cryptographically speaking, these designs are a bit too new, but MD5 and SHA-1 are already cryptographically "broken" (quite effectively in the case of MD5, rather theoretically for SHA-1) so any of the round-2 SHA-3 candidates should be fine.
When I put my "seer" cap, I foresee that processors will keep on becoming faster than RAM, to the point that hashing cost will be dwarfed out by memory bandwidth: the CPU will have clock cycles to spare while it waits for the data from the main RAM. At some point, the whole threading model (one big RAM for many cores) will have to be amended.
You didn't say what you need your hash for.
If you're not gonna exchange it with the outside world but just for internal use, simply divide each file in chunks, compute and store all the checksums. You can then use many cores just by throwing a chunk to each one.
Two solutions that comes to mind is dividing files in fixed-size chunks (simpler, but will use less cores for smaller files where you're not supposed to need all that power) or in a fixed-number of chunks (will use all the cores for every file). Really depends on what you want to achieve and what your file size distribution looks like.
If, on the other hand, you need hashes for the outside world, as you can read from the other replies it's not possible with "standard" hashes (eg. if you want to send out SHA1 hashes for others to check with different tools) so you must look somewhere else. Like computing the hash when you store the file, for later retrieval, or compute hashes in background with the 'free' cores and store for later retrieval.
The better solution depends on what your constraints are and where you can invest space, time or cpu power.
Say I'm using a hash to identify files, so I don't need it to be secure, I just need to minimize collisions. I was thinking that I could speed the hash up by running four hashes in parallel using SIMD and then hashing the final result. If the hash is designed to take a 512-bit block, I just step through the file taking 4x512 bit blocks at one go and generate four hashes out of that; then at the end of the file I hash the four resulting hashes together.
I'm pretty sure that this method would produce poorer hashes... but how much poorer? Any back of the envelope calculations?
The idea that you can read blocks of the file from disk quicker than you can hash them is, well, an untested assumption? Disk IO - even SSD - is many orders of magnitude slower than the RAM that the hashing is going though.
Ensuring low collisions is a design criteria for all hashes, and all mainstream hashes do a good job of it - just use a mainstream hash e.g. MD5.
Specific to the solution the poster is considering, its not a given that parallel hashing weakens the hash. There are hashes specifically designed for parallel hashing of blocks and combining the results as the poster said, although perhaps not yet in widespread adoption (e.g. MD6, which withdrew unbroken from SHA3)
More generally, there are mainstream implementations of hashing functions that do use SIMD. Hashing implementers are very performance-aware, and do take time to optimise their implementations; you'd have a hard job equaling their effort. The best software for strong hashing is around 6 to 10 cycles / byte. Hardware accelerated hashing is also available if hashing is the real bottleneck.
Does anyone know if there's a real benefit regarding decreasing collision probability by combining hash functions? I especially need to know this regarding 32 bit hashing, namely combining Adler32 and CRC32.
Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)?
The last comment here gives some test results in favor of combining, but no source is mentioned.
For my purpose, collision is not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible.
PS: I'm just starting in the wonderful world of hashing, doing a lot of reading about it. Sorry if I asked a silly question, I haven't even acquired the proper "hash dialect" yet, probably my Google searches regarding this were also poorly formed.
Thanks.
This doesn't make sense combining them in series like that. You are hashing one 32-bit space to another 32-bit space.
In the case of a crc32 collision in the first step, the final result is still a collision. Then you add on any potential collisions in the adler32 step. So it can not get any better, and can only be the same or worse.
To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:
adler32(data) << 32 | crc32(data)
Whether there is significant benefit in doing that, I'm not sure.
Note that the original comment you referred to was storing the hashes independently:
Whichever algorithm you use there is
going to be some chance of false
positives. However, you can reduce
these chances by a considerable margin
by using two different hashing
algorithms. If you were to calculate
and store both the CRC32 and the
Alder32 for each url, the odds of a
simultaneous collision for both hashes
for any given pair of urls is vastly
reduced.
Of course that means storing twice as
much information which is a part of
your original problem. However, there
is a way of storing both sets of hash
data such that it requires minimal
memory (10kb or so) whilst giving
almost the same lookup performance (15
microsecs/lookup compared to 5
microsecs) as Perl's hashes.
This is basically a math problem, but very programing related: if I have 1 billion strings containing URLs, and I take the first 64 bits of the MD5 hash of each of them, what kind of collision frequency should I expect?
How does the answer change if I only have 100 million URLs?
It seems to me that collisions will be extremely rare, but these things tend to be confusing.
Would I be better off using something other than MD5? Mind you, I'm not looking for security, just a good fast hash function. Also, native support in MySQL is nice.
EDIT: not quite a duplicate
If the first 64 bits of the MD5 constituted a hash with ideal distribution, the birthday paradox would still mean you'd get collisions for every 2^32 URL's. In other words, the probability of a collision is the number of URL's divided by 4,294,967,296. See http://en.wikipedia.org/wiki/Birthday_paradox#Cast_as_a_collision_problem for details.
I wouldn't feel comfortable just throwing away half the bits in MD5; it would be better to XOR the high and low 64-bit words to give them a chance to mix. Then again, MD5 is by no means fast or secure, so I wouldn't bother with it at all. If you want blinding speed with good distribution, but no pretence of security, you could try the 64-bit versions of MurmurHash. See http://en.wikipedia.org/wiki/MurmurHash for details and code.
You have tagged this as "birthday-paradox", I think you know the answer already.
P(Collision) = 1 - (2^64)!/((2^64)^n (1 - n)!)
where n is 1 billion in your case.
You will be a bit better using something other then MD5, because MD5 have pratical collusion problem.
From what I see, you need a hash function with the following requirements,
Hash arbitrary length strings to a 64-bit value
Be good -- Avoid collisions
Not necessarily one-way (security not required)
Preferably fast -- which is a necessary characteristic for a non-security application
This hash function survey may be useful for drilling down to the function most suitable for you.
I will suggest trying out multiple functions from here and characterizing them for your likely input set (pick a few billion URL that you think you will see).
You can actually generate another column like this test survey for your test URL list to characterize and select from the existing or any new hash functions (more rows in that table) that you might want to check. They have MSVC++ source code to start with (reference to ZIP link).
Changing the hash functions to suit your output width (64-bit) will give you a more accurate characterization for your application.
If you have 2^n hash possibilities, there's over a 50% chance of collision when you have 2^(n/2) items.
E.G. if your hash is 64 bits, you have 2^64 hash possibilities, you'd have a 50% chance of collision if you have 2^32 items in a collection.
Just by using a hash, there is always a chance of collisions. And you don't know beforehand wether collisions will happen once or twice, or even hundreds or thousands of times in your list of urls.
The probability is still just a probability. Its like throwing a dice 10 or 100 times, what are the chances of getting all sixes? The probability says it is low, but it still can happen. Maybe even many times in a row...
So while the birthday paradox shows you how to calculate the probabilities, you still need to decide if collisions are acceptable or not.
...and collisions are acceptable, and hashes are still the right way to go; find a 64 bit hashing algorithm instead of relying on "half-a-MD5" having a good distribution. (Though it probably has...)