CRC32+Size vs MD5/SHA1 - hash

We have a storage of files and the storage uniquely identifies a file on the basis of size appended to crc32.
I wanted to know if this checksum ( crc32 + size ) would be good enough for identifying files or should we consider some other hashing technique like MD5/SHA1?

CRC is most an error detection method than a serious hash function. It helps in identify corrupting files rather than uniquely identify them.
So your choice should be between MD5 and SHA1.
If you don't have strong security needings you can choose MD5 that should be faster.
(remember that MD5 is vulnerable to collision attacks).
If you need more security you better use SHA1 or even SHA2 .

CRC-32 is not good enough; it is trivial to build collisions, i.e. two files (of the same length if you wish it so) which have the same CRC-32. Even in the absence of a malicious attacker, collisions will happen randomly once you have about 65000 distinct files with the same length.
A hash function is designed to avoid collisions. With MD5 or SHA-1, you will not get random collisions. If your setup is security-related (i.e. there is someone, somewhere, who may actively try to create collisions), then you need a secure hash function. MD5 is not secure anymore (creating collisions with MD5 is easy) and SHA-1 is somewhat weak in that respect (no actual collisions were computed, but a method for creating one is known and, while expensive, it is much less expensive than what it ought to be). The usual recommendation is to use SHA-256 or SHA-512 (SHA-256 is enough for security; SHA-512 may be a tad faster on big, 64-bit systems, but file reading bandwidth will be more limitating than hashing speed).
Note: when using a cryptographic hash function, there is no need to store and compare the file lengths; the hash is sufficient to disambiguate files.
In a non-security setup (i.e. you only fear random collisions), then MD4 can be used. It is thoroughly "broken" as a cryptographic hash function, but it still is a very good checksum, and it is really fast (on some ARM-based platforms, it is even faster than CRC-32, for a much better resistance to random collisions). Basically, you should not use MD5: if you have security issues, then MD5 must not be used (it is broken; use SHA-256); and if you do not have security issues then MD4 is faster than MD5.

The space that would be used by a CRC32+size gives you enough room for a bigger CRC which would be a much better choice. If you are not worried about malicious collision that's it in which case Thomas' answer applies.
You didn't specify a language but for example in C++ you got Boost CRC giving you CRC of the size you want (or you can afford to store).

As others have said, CRC doesn't guarantee absence of collisions. However, your problem is be solved simply by giving the files incrementing 64-bit numbers. This is guaranteed to never collide (unless you want to keep gazillion of files in one directory which is not a good idea anyway).

Related

Is there a fast hashing algorithm which is resistant to deliberate collision, and if so, which one?

I am developing an "open distributed cloud storage system".
By open I mean that anyone can participate in hosting of files.
My current design uses a sha1 hash of the files content as global file id.
It is given that the client already knows this hash value and receives the file from a "bandwidth donor".
The client now needs to verify that the file indeed is the correct one, by generating the hash and comparing it to the expected value.
However my concern is that someone could deliberately modify a file to produce the same hash. As far as I know this is doable easily for hashes of the CRC family. Some "googling" around revealed a lot of claims that the same would be easy for MD5.
Now my question is: Is there a hashing algorithm which satisfies the criteria of beeing
fast for big amounts of data
well distributed in the hashing range (aka "unique")
has a sufficient target range ("bit length")
is resistant to deliberate collision attacks
All other means that I can think of achieving a setup that serves my needs involve a secret component, for example a secret openssl key or a shared secret salt for a hash function.
Unfortunately I cannot work with that.
What you are asking for is a one-way function, whose existence is a major open problem.
With cryptographic hash functions, the specific attack you wanted to avoid is called the "second pre-image attack".
That should help you Googling what you want, but as far as I know there is actually no known practical second pre-image attack for MD5.
First of all, you probably found that it is easy to find two arbitrary files that have the same hash, and to find two different such pairs every time you try.
But it is difficult to generate a file to disguise as some specific file - in other words, it is unlikely that one of the prementioned "two arbitrary files" actually belongs to a non-malicious agent in your storage.
If you're still not satisfied, you might want to try something like SHA-1 or SHA-2 or GOST.
First of all, a hash value can never identify a file, as there will always be collisions.
Having said that, what you are looking for is called a cryptographic hash. These are designed to not (easily, i.e. other than brute force) allow modifications of the data while keeping the hash, or producing new data with a given hash.
As such, the SHA family is ok.
For the moment, SHA1 is adequate. No collisions are known.
It would help a lot to know the average size of the thing you are hashing. But most likely, if your platforms are predominantly 64-bit, SHA512 is your best choice. You can truncate the hash and use only 256-bits of it. If your platforms are predominantly 32-bit, SHA256 is your best choice.

As of 2011, what hash algo is the most suitable for message digest?

I'm a bit conflicted with an answer when I google for this, as these algos are constantly improving and new exploits are being found and new issues come up all the time... a lot of advice on what algo to use is simply old, or keeping ideas from an older time when they were the best way.
I want to be very clear here: I'm not talking about passwords. I'm talking about message digests, not cryptographic hashes.
I could go ahead and use md5 as my first inkling for message digest (it's right in the name), but then I remembered there's more collisions than more modern algos out there. But then, what makes these newer algos more suitable for the message digest of a file or short string?
So that's my question, what's the modern message digest algo that should be used?
From that perspective, depending on the amount of data you are working with, SHA1 should do fine - if you will be working with larger amounts of data, a SHA-2 algorithm, such as SHA-256 might be more suitable as the fear of collisions in SHA1 is rising due to a flaw in its algorithm, but it isn't extremely serious when working with smallish amounts of data.
MD5 has been shown to be too vulnerable to collisions, as there have been attacks on SSL certificates that used MD5 to create a forged SSL certificate, so I'd stay away from there. Also depending on your application, MD5 is not FIPS 140 compliant, if that is of any importance to you.
SHA1 is ideal over MD5 because it is safer as MD5 is risky to use, and SHA1 has better performance in most common circumstances than SHA-2. The SHA-2 algorithms are by no means slow - but it has an edge. However, SHA1 is slightly riskier because you've probably locked yourself into using it - if collisions start to be found, it might be hard for you to change, so it might be better to invest in a SHA-2 algorithm up-front. The penalty for using SHA-256 over SHA-1 is very little, depending on how you will be using the SHA algorithm. SHA-2 algorithms produce a much larger output than SHA1, but at the benefit of the reduced chances of a collision.
So which one is right? It depends on what you are looking for and what your use case it. Hopefully now you can make a decision.
When in doubt, use SHA-256. The other SHA-2 functions are fine too; however, SHA-384 and SHA-512 may suffer from a non-negligible performance degradation on small (32-bit only) platforms. This may matter for some specific applications.
For non-security related usages (e.g. first pass of indexing in a hash table, or detection of accidental, non-malicious data alteration -- the kind of job where you could use a CRC), consider MD4, a predecessor to MD5. MD4 is even more broken than MD5, but also simpler to implement (with shorter code) and faster (actually, it has been measured to be faster than CRC32 on some ARM platforms).

is perfect hashing without buckets possible?

I've been asked to look for a perfect hash/one way function to be able to hash 10^11 numbers.
However as we'll be using a embedded device it wont have the memory to store the relevant buckets so I was wondering if it's possible to have a decent (minimal) perfect hash without them?
The plan is to use the device to hash the number(s) and we use a rainbow table or a file using the hash as the offset.
Cheers
Edit:
I'll try to provide some more info :)
1) 10^11 is actually now 10^10 so that makes it easer.This number is the possible combinations. So we could get a number between 0000000001 and 10000000000 (10^10).
2) The plan is to us it as part of a one way function to make the number secure so we can send it by insecure means.
We will then look up the original number at the other end using a rainbow table.
The problem is that the source the devices generally have 512k-4Meg of memory to use.
3) it must be perfect - we 100% cannot have a collision .
Edit2:
4) We cant use encryption as we've been told it's not really possable on the devices and keymanigment would be a nightmare if we could.
Edit3:
As this is not sensible, Its purely academic question now (I promise)
Okay, since you've clarified what you're trying to do, I rewrote my answer.
To summarize: Use a real encryption algorithm.
First, let me go over why your hashing system is a bad idea.
What is your hashing system, anyway?
As I understand it, your proposed system is something like this:
Your embedded system (which I will call C) is sending some sort of data with a value space of 10^11. This data needs to be kept confidential in transit to some sort of server (which I will call S).
Your proposal is to send the value hash(salt + data) to S. S will then use a rainbow table to reverse this hash and recover the data. salt is a shared value known to both C and S.
This is an encryption algorithm
An encryption algorithm, when you boil it down, is any algorithm that gives you confidentiality. Since your goal is confidentiality, any algorithm that satisfies your goals is an encryption algorithm, including this one.
This is a very poor encryption algorithm
First, there is an unavoidable chance of collision. Moreover, the set of colliding values differs each day.
Second, decryption is extremely CPU- and memory-intensive even for the legitimate server S. Changing the salt is even more expensive.
Third, although your stated goal is avoiding key management, your salt is a key! You haven't solved key management at all; anyone with the salt will be able to crack the message just as well as you can.
Fourth, it's only usable from C to S. Your embedded system C will not have enough computational resources to reverse hashes, and can only send data.
This isn't any faster than a real encryption algorithm on the embedded device
Most secure hashing algorithm are just as computationally expensive as a reasonable block cipher, if not worse. For example, SHA-1 requires doing the following for each 512-bit block:
Allocate 12 32-bit variables.
Allocate 80 32-bit words for the expanded message
64 times: Perform three array lookups, three 32-bit xors, and a rotate operation
80 times: Perform up to five 32-bit binary operations (some combination of xor, and, or, not, and and depending on the round); then a rotate, array lookup, four adds, another rotate, and several memory loads/stores.
Perform five 32-bit twos-complement adds
There is one chunk per 512-bits of the message, plus a possible extra chunk at the end. This is 1136 binary operations per chunk (not counting memory operations), or about 16 operations per byte.
For comparison, the RC4 encryption algorithm requires four operations (three additions, plus an xor on the message) per byte, plus two array reads and two array writes. It also requires only 258 bytes of working memory, vs a peak of 368 bytes for SHA-1.
Key management is fundamental
With any confidentiality system, you must have some sort of secret. If you have no secrets, then anyone else can implement the same decoding algorithm, and your data is exposed to the world.
So, you have two choices as to where to put the secrecy. One option is to make the encipherpent/decipherment algorithms secret. However, if the code (or binaries) for the algorithm is ever leaked, you lose - it's quite hard to replace such an algorithm.
Thus, secrets are generally made easy to replace - this is what we call a key.
Your proposed usage of hash algorithms would require a salt - this is the only secret in the system and is therefore a key. Whether you like it or not, you will have to manage this key carefully. And it's a lot harder to replace if leaked than other keys - you have to spend many CPU-hours generating a new rainbow table every time it's changed!
What should you do?
Use a real encryption algorithm, and spend some time actually thinking about key management. These issues have been solved before.
First, use a real encryption algorithm. AES has been designed for high performance and low RAM requirements. You could also use a stream cipher like RC4 as I mentioned before - the thing to watch out for with RC4, however, is that you must discard the first 4 kilobytes or so of output from the cipher, or you will be vulnerable to the same attacks that plauged WEP.
Second, think about key management. One option is to simply burn a key into each client, and physically go out and replace it if the client is compromised. This is reasonable if you have easy physical access to all of the clients.
Otherwise, if you don't care about man-in-the-middle attacks, you can simply use Diffie-Hellman key exchange to negotiate a shared key between S and C. If you are concerned about MitMs, then you'll need to start looking at ECDSA or something to authenticate the key obtained from the D-H exchange - beware that when you start going down that road, it's easy to get things wrong, however. I would recommend implementing TLS at that point. It's not beyond the capabilities of an embedded system - indeed, there are a number of embedded commercial (and open source) libraries available already. If you don't implement TLS, then at least have a professional cryptographer look over your algorithm before implementing it.
There is obviously no such thing as a "perfect" hash unless you have at least as many hash buckets as inputs; if you don't, then inevitably it will be possible for two of your inputs to share the same hash bucket.
However, it's unlikely you'll be storing all the numbers between 0 and 10^11. So what's the pattern? If there's a pattern, there may be a perfect hash function for your actual data set.
It's really not that important to find a "perfect" hash function anyway, though. Hash tables are very fast. A function with a very low collision rate - and when hashing integers, that means nearly any simple function, like modulus - is fine and you'll get O(1) average performance.

Why use SHA1 for hashing secrets when SHA-512 is more secure?

I don't mean for this to be a debate, but I'm trying to understand the technical rationale behind why so many apps use SHA1 for hashing secrets, when SHA512 is more secure. Perhaps it's simply for backwards compatibility.
Besides the obvious larger size (128 chars vs 40), or slight speed differences, is there any other reason why folks use the former?
Also, SHA-1 I believe was first cracked by a VCR's processor years ago. Has anyone cracked 512 yet (perhaps with a leaf blower), or is it still safe to use without salting?
Most uses of SHA-1 are for interoperability: we use SHA-1 when we implement protocols where SHA-1 is mandated. Ease of development also comes into account: SHA-1 implementations in various languages and programming environment are more common than SHA-512 implementations.
Also, even so most usages of hash functions do not have performance issues (at least, no performance issue where the hash function is the bottleneck), there are some architectures where SHA-1 is vastly more efficient than SHA-512. Consider a basic Linksys router: it uses a Mips-derivative CPU, clocked at 200 MHz. Such a machine can be reprogrammed, e.g. with OpenWRT (a small Linux for embedded systems). As a router, it has fast network (100Mbit/s). Suppose that you want to hash some data (e.g. as part of some VPN software -- a router looks like a good candidate for running a VPN). With SHA-1, you will get about 6 MB/s, using the full CPU. That's already quite lower than the network bandwidth. SHA-512 will give you no more than 1.5 MB/s on the same machine. On such a system, the difference in performance is not negligible. Also, if I use SHA-1 on my Linksys router for some communication protocol, then the machine at the other end of the link will also have to use SHA-1.
The good news is that there is an ongoing competition to select a new standard hash function, code-named SHA-3. Some of the competing candidates provide performance similar to SHA-1, or even somewhat better, while still yielding a 512-bit output and be (probably) as secure as SHA-512.
Both SHA1 and SHA512 are hash functions. If you are using them as a cryptographic hash, then perhaps that is good reason to use SHA512; however, there are applications that use these function simply to identify objects. For example, Git uses SHA1 to cheaply distinguish between objects. In that case, because the possibility of collision between two documents is incredibly small with SHA1, there really is no justification for the additional space requirement of SHA512 when SHA1 is more than suitable for the task.
In terms of cryptographic hashes and the choice to use a salt or not, you may be interested in reading Don't Hash Secrets. Even with SHA512, using a salt is a good idea (and it's cheap to do, too, so why not do it?), because you can guess the top passwords and see if they have the same hash, but the author points out that HMAC is a more secure mechanism. In any case, you will have to determine the costs associated with the extra time+space and the costs associated with the possibility of a breach, and determine how paranoid you want to be. As was recently discovered by Microsoft, constantly changing passwords is a waste of money and doesn't pay off, so while paranoia is usually good when it comes to security, you really have to do the math to determine if it makes sense.... do the gains in security outweigh time and storage costs?
If you need something to be hashed quickly, or only need a 160 bit hash, you'd use SHA-1.
For comparing database entries to one another quickly, you might take 100 fields and make a SHA-1 hash from them, yielding 160 bits. Those 160 bits are 10^50ish values.
If I'm unlikely to ever have more than a tiny fraction of 10^50th values, it's quicker to just hash what I have with the simpler and faster algorithm.

Purposely create two files to have the same hash?

If someone is purposely trying to modify two files to have the same hash, what are ways to stop them? Can md5 and sha1 prevent the majority case?
I was thinking of writing my own and I figure even if I don't do a good job if the user doesn't know my hash he may not be able to fool mine.
What's the best way to prevent this?
MD5 is generally considered insecure if hash collisions are a major concern. SHA1 is likewise no longer considered acceptable by the US government. There is was a competition under way to find a replacement hash algorithm, but the recommendation at the moment is to use the SHA2 family - SHA-256, SHA-384 or SHA-512. [Update: 2012-10-02 NIST has chosen SHA-3 to be the algorithm Keccak.]
You can try to create your own hash — it would probably not be as good as MD5, and 'security through obscurity' is likewise not advisable.
If you want security, hash with multiple hash algorithms. Being able to simultaneously create files that have hash collisions using a number of algorithms is excessively improbable. [And, in the light of comments, let me make it clear: I mean publish both the SHA-256 and the Whirlpool values for the file — not combining hash algorithms to create a single value, but using separate algorithms to create separate values. Generally, a corrupted file will fail to match any of the algorithms; if, perchance, someone has managed to create a collision value using one algorithm, the chance of also producing a second collision in one of the other algorithms is negligible.]
The Public TimeStamp uses an array of algorithms. See, for example, sqlcmd-86.00.tgz for an illustration.
If the user doesn't know your hashing algorithm he also can't verify your signature on a document that you actually signed.
The best option is to use public-key one-way hashing algorithms that generate the longest hash. SHA-256 creates a 256-bit hash, so a forger would have to try 2255 different documents (on average) before they created one that matched a given document, which is pretty secure. If that's still not secure enough for you, there's SHA-512.
Also, I think it's worth mentioning that a good low-tech way to protect yourself against forged digitally-signed documents is to simply keep a copy of anything you sign. That way, if it comes down to a dispute, you can show that the original document you signed was altered.
There is a hierarchy of difficulty (for an attacker) here. It is easier to find two files with the same hash than to generate one to match a given hash, and easier to do the later if you don't have to respect form/content/lengths restrictions.
Thus, if it is possible to use a well defined document structure and lengths, you can make an attackers life a bit harder no matter what underling hash you use.
Why are you trying to create your own hash algorithm? What's wrong with SHA1HMAC?
Yes, there are repeats for hashes.
Any hash that is shorter than the plaintext is necessarily less information. That means there will be some repeats. The key for hashes is that the repeats are hard to reverse-engineer.
Consider CRC32 - commonly used as a hash. It's a 32-bit quantity. Because there are more than 2^32 messages in the universe, then there will be repeats with CRC32.
The same idea applies to other hashes.
This is called a "hash collision", and the best way to avoid it is to use a strong hash function. MD5 is relatively easy to artificially build colliding files, as seen here. Similarly, it's known there is a relatively efficient method for computing colliding SH1 files, although in this case "relatively efficient" still takes hunreds of hours of compute time.
Generally, MD5 and SHA1 are still expensive to crack, but not impossible. If you're really worried about it, use a stronger hash function, like SHA256.
Writing your own isn't actually a good idea unless you're a pretty expert cryptographer. most of the simple ideas have been tried and there are well-known attacks against them.
If you really want to learn more about it, have a look at Schneier's Applied Cryptography.
I don't think coming up with your own hash algorithm is a good choice.
Another good option is used Salted MD5. For example, the input to your MD5 hash function is appended with string "acidzom!##" before passing to MD5 function.
There is also a good reading at Slashdot.