This is a hypothetical question, but please hear me out. I know that audio fingerprinting systems such as Shazam use perceptual hashing instead of cryptographic hashing, because a single bit flip caused by how the audio was encoded, or by noise when the recording took place, would mean it no longer matched the clean hashed fingerprints on the database side. But would it be possible to use a perceptual hash to find the features of the audio you wanted to record, and then run those frequency peaks (sub-fingerprints) through a cryptographic hash? You would do the same at the database end on the clean version of the song, and then surely some of the hashes would match when compared? Or am I missing something obvious here? I know this would make it computationally more expensive and slower, but I was just wondering whether it would be possible.
You definitely could. As long as whatever features you're extracting are fuzzy and noise-resistant, any hash function will do.
Obviously you'd prefer a faster hash with fewer collisions!
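To make that concrete, here is a minimal sketch in Python of the scheme described in the question. The feature extractor extract_peak_pairs() is purely hypothetical (a real system would derive quantized peak pairs from a spectrogram); the point is only that once the fuzzy features are extracted and quantized, any exact hash, cryptographic or otherwise, can index them:

    import hashlib

    def extract_peak_pairs(audio):
        # Hypothetical perceptual front end: return a list of quantized
        # (freq1, freq2, time_delta) tuples derived from spectrogram peaks.
        raise NotImplementedError

    def fingerprint(audio):
        # Cryptographically hash each quantized sub-fingerprint; matching then
        # becomes an exact lookup, just as with integer-packed fingerprints.
        return {hashlib.sha256(repr(sub).encode()).hexdigest()
                for sub in extract_peak_pairs(audio)}

    def best_match(query_audio, database):
        # database: dict mapping sub-fingerprint hash -> song id (built from clean songs)
        hits = {}
        for h in fingerprint(query_audio):
            if h in database:
                hits[database[h]] = hits.get(database[h], 0) + 1
        return max(hits, key=hits.get) if hits else None

The cryptographic hashing step adds cost but changes nothing about matching accuracy; all the noise tolerance has to live in how the sub-fingerprints are quantized before hashing.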
Related
If you take a 32-bit sequence and compute its CRC32, you get another 32-bit sequence as the result; take the CRC32 of that, and you get another, and so on. It is easy to show that if you keep doing this, you cycle through all 2^32 possible 32-bit sequences in a single loop before starting over.
Simple question: does anyone know if the same holds true (or not) for SHA256, starting with a 256-bit sequence? Would a similar process cycle through a loop of all 2^256 possible 256-bit sequences before starting over? Or are there known (or likely) shorter loops within this hash?
Brian
SHA256 was not designed to have a single 2^256-long cycle. However, as far as I know, nobody has proven that no such cycle exists. Nor are any shorter cycles known, because if anybody found one they would also have found a collision, and by the nature of a cryptographic hash function that must be difficult.
So, since nobody has proven it either way, yes, there is some probability that the 2^256 cycle exists. However, it's extremely unlikely, and I'm willing to bet my left testicle on it. :-)
Let me also note that, IMO, designing a cryptographic hash function which does have a 2^256 cycle would be extremely difficult even for the best crypto experts.
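For what it's worth, the iteration being discussed is trivial to write down. Here is a rough Python sketch; it only illustrates the process, since actually walking a cycle anywhere near 2^256 states long is hopeless:

    import hashlib
    import os

    def iterate_sha256(seed: bytes, steps: int) -> bytes:
        # Repeatedly feed SHA-256 its own 256-bit (32-byte) output.
        x = seed
        for _ in range(steps):
            x = hashlib.sha256(x).digest()
        return x

    start = os.urandom(32)                  # a random 256-bit starting point
    print(iterate_sha256(start, 1_000_000).hex())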
I'm looking for an algorithm which given a messy value (like a url with a querystring) produces a value which can be used as a key.
Ideally, it should have a fairly low collision rate, produce a value which is shorter than the input, be made of alphanumerics (a-z 1-9), and create the same output given the same input (though not necessarily reversible).
Anything come to mind?
Several excellent examples exist as industry standards, such as SHA-2 and MD5. There will be a library for computing either of these in any language you are using.
MD5 or the SHA functions are fine (though MD5 is weaker in terms of collision resistance; I'd suggest SHA-1, or, best but longest, SHA-512).
They're implemented in OpenSSL for C, or CommonCrypto in Objective-C, etc.
You might also consider using CRC64 as a hash function, because it has a really small footprint and a rather good collision rate -- compared to its length, of course.
Quite a lot: MD5, SHA-1, SHA-256, SHA-512, ...
If, say, MD5's 32 hex characters are too long for you, take a substring and just keep the first x characters.
Of course there are many other solutions like CRC32 that you could use, but it's hard to say what will be good for your use case without knowing what your goal is...
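As a concrete (and hedged) example of the suggestions above, here is one way to do it in Python: hash the messy input with SHA-256 and truncate the hex digest. The 16-character length is an arbitrary choice, trading key size against collision probability, and hex output (0-9, a-f) is a subset of the alphanumeric alphabet asked for:

    import hashlib

    def url_key(url: str, length: int = 16) -> str:
        # Deterministic, non-reversible, shorter-than-input key for a messy value.
        digest = hashlib.sha256(url.encode("utf-8")).hexdigest()   # 64 hex chars
        return digest[:length]        # truncating raises the collision odds slightly

    print(url_key("https://example.com/search?q=hash&page=2&sort=desc"))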
I am looking for a hash function to build a (global) fixed-size ID for strings, most of them URIs.
it should be:
fast
low chance of collision
~ 64bit
exploiting the structure of a URI, if that is possible?
would http://murmurhash.googlepages.com/ be a good choice or is there anything better suited?
Try MD4. As far as cryptography is concerned, it is "broken", but since you do not have any security concern (you want a 64-bit output size, which is too small to yield any decent security against collisions), that should not be a problem. MD4 yields a 128-bit value, which you just have to truncate to the size you wish.
Cryptographic hash functions are designed to resist explicit attempts at constructing collisions. Conceivably, one can build a faster function by relaxing that condition (it is easier to avoid random collisions than to resist a determined attacker). There are a few such functions, e.g. MurmurHash. However, it may take a quite specific setup to actually notice the speed difference. On my home PC (a 2.4 GHz Core2), I can hash about 10 million short strings per second with MD4, using a single CPU core (I have four). For MurmurHash to be faster than MD4 in a non-negligible way, it would have to be used in a context involving at least a million hash invocations per second. That does not happen very often...
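A sketch of the truncated-MD4 idea in Python. Note that hashlib only exposes MD4 through OpenSSL, and recent OpenSSL builds may not ship it any more, so this sketch falls back to MD5; any 128-bit hash truncates to 64 bits the same way:

    import hashlib

    def uri_id(uri: str) -> int:
        # 64-bit ID for a URI: take the first 8 bytes of a 128-bit digest.
        try:
            digest = hashlib.new("md4", uri.encode("utf-8")).digest()
        except ValueError:
            # MD4 may be unavailable in modern OpenSSL; MD5 truncates just as well here.
            digest = hashlib.md5(uri.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    print(hex(uri_id("http://example.org/resource/42")))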
I'd wait a little longer for MurmurHash3 to be finalized, then use that. The 128-bit version should give you adequate collision protection against the birthday paradox.
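If you do go the MurmurHash route, the third-party mmh3 package (pip install mmh3, assumed here) exposes MurmurHash3 from Python; hash64 and hash128 give the 64- and 128-bit variants:

    import mmh3   # third-party MurmurHash3 binding

    uri = "http://example.org/resource/42"
    h64, _ = mmh3.hash64(uri)     # two signed 64-bit halves of the 128-bit hash
    h128 = mmh3.hash128(uri)      # full 128-bit value as a Python int
    print(h64, h128)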
Say I'm using a hash to identify files, so I don't need it to be secure, I just need to minimize collisions. I was thinking that I could speed the hash up by running four hashes in parallel using SIMD and then hashing the final result. If the hash is designed to take a 512-bit block, I just step through the file taking 4x512 bit blocks at one go and generate four hashes out of that; then at the end of the file I hash the four resulting hashes together.
I'm pretty sure that this method would produce poorer hashes... but how much poorer? Any back-of-the-envelope calculations?
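For concreteness, here is a rough Python sketch of the scheme the question describes: four lanes fed interleaved 512-bit blocks, with the four digests hashed together at the end. There is no actual SIMD here (the lanes run sequentially), so it only shows the structure whose collision behaviour is being asked about:

    import hashlib

    LANES = 4
    BLOCK = 64                              # 512-bit blocks

    def four_lane_digest(path: str) -> str:
        lanes = [hashlib.sha256() for _ in range(LANES)]
        with open(path, "rb") as f:
            while chunk := f.read(LANES * BLOCK):
                # deal consecutive 512-bit blocks out to the four lanes
                for i in range(0, len(chunk), BLOCK):
                    lanes[(i // BLOCK) % LANES].update(chunk[i:i + BLOCK])
        # combine by hashing the concatenation of the four lane digests
        return hashlib.sha256(b"".join(l.digest() for l in lanes)).hexdigest()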
The idea that you can read blocks of the file from disk quicker than you can hash them is, well, an untested assumption. Disk I/O - even on an SSD - is many orders of magnitude slower than the RAM the hashing is going through.
Ensuring low collisions is a design criterion for all hashes, and all mainstream hashes do a good job of it - just use a mainstream hash, e.g. MD5.
Specific to the solution the poster is considering, it's not a given that parallel hashing weakens the hash. There are hashes specifically designed for parallel hashing of blocks and combining of the results as the poster described, although perhaps not yet in widespread adoption (e.g. MD6, which was withdrawn unbroken from the SHA-3 competition).
More generally, there are mainstream implementations of hash functions that do use SIMD. Hashing implementers are very performance-aware and take the time to optimise their implementations; you would have a hard job equalling their effort. The best software implementations of strong hashes run at around 6 to 10 cycles per byte. Hardware-accelerated hashing is also available if hashing is the real bottleneck.
Does anyone know if there's a real benefit regarding decreasing collision probability by combining hash functions? I especially need to know this regarding 32 bit hashing, namely combining Adler32 and CRC32.
Basically, will adler32(crc32(data)) yield a smaller collision probability than crc32(data)?
The last comment here gives some test results in favor of combining, but no source is mentioned.
For my purpose, collision is not critical (i.e. the task does not involve security), but I'd rather minimize the probability anyway, if possible.
PS: I'm just starting in the wonderful world of hashing, doing a lot of reading about it. Sorry if I asked a silly question, I haven't even acquired the proper "hash dialect" yet, probably my Google searches regarding this were also poorly formed.
Thanks.
Combining them in series like that doesn't make sense: you are just mapping one 32-bit space onto another 32-bit space.
Any CRC32 collision in the first step is still a collision in the final result, and on top of that you add whatever collisions the Adler32 step introduces. So it cannot get any better; it can only stay the same or get worse.
To reduce collisions, you might try something like using the two hashes independently to create a 64-bit output space:
adler32(data) << 32 | crc32(data)
Whether there is significant benefit in doing that, I'm not sure.
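In Python, for instance, that combination could look like the following; zlib provides both checksums (zlib.adler32 and zlib.crc32 return unsigned 32-bit ints in Python 3):

    import zlib

    def combined64(data: bytes) -> int:
        # Concatenate Adler-32 (high half) and CRC-32 (low half) into 64 bits.
        return (zlib.adler32(data) << 32) | zlib.crc32(data)

    print(hex(combined64(b"http://example.org/some/url?id=7")))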
Note that the original comment you referred to was storing the hashes independently:
    Whichever algorithm you use there is going to be some chance of false positives. However, you can reduce these chances by a considerable margin by using two different hashing algorithms. If you were to calculate and store both the CRC32 and the Adler32 for each URL, the odds of a simultaneous collision for both hashes for any given pair of URLs is vastly reduced.

    Of course that means storing twice as much information, which is a part of your original problem. However, there is a way of storing both sets of hash data such that it requires minimal memory (10kb or so) whilst giving almost the same lookup performance (15 microsecs/lookup compared to 5 microsecs) as Perl's hashes.