Length of SHA1 hash to identify an object?

Can I create a really short SHA1 hash to uniquely identify an object that would typically have an id like 1300992607?
This is a fairly theoretical question, but how short can a SHA1 hash be and still be unique for an object's id? Please tell me if I'm not asking the right question here.

No, you can't; a hash doesn't work that way.
You can create a hash of the id and take as many bits from it as you like. The more bits you use, the less likely it is that two different ids get the same hash, but no matter how many bits of the hash you keep, there is no guarantee that a collision will never occur.
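As a rough sketch of that trade-off (the function name and the 8/16-character prefix lengths here are illustrative, not a recommendation), you could hash the id with SHA-1 and keep only a prefix of the hex digest; the shorter the prefix, the higher the collision risk:

```python
import hashlib

def short_id(object_id, hex_chars=8):
    """Return the first hex_chars characters of the SHA-1 of the id (illustrative only)."""
    digest = hashlib.sha1(str(object_id).encode("utf-8")).hexdigest()
    return digest[:hex_chars]

print(short_id(1300992607))       # 8 hex chars = 32 bits: collisions become likely quickly
print(short_id(1300992607, 16))   # 16 hex chars = 64 bits: collisions rarer, but never impossible
```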

Related

Any way to get possible messages according to the sha512 hash signature?

I have the hash (signature) of a message. I know that the message is always a hex number with 64 digits. Is there a way to get the set of possible 64-digit messages? Or is there a way to deduce any assumptions/features about the message from the signature?
No
If SHA-512 is a secure hash function, which it is generally believed to be, then knowing the hash value provides no useful information about the input. In particular, no method for guessing the possible input(s) will be significantly more efficient than just brute-force guessing all possible inputs, which will take an average of 2^255 tries; far more than you can possibly hope to do, unless you have other independent knowledge about the message that drastically reduces the number of possibilities.
This property of secure hash functions is key to modern cryptography. If you were hoping to use this to recover passwords from a hash or something like that, think again.
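Purely as an illustration of why (the helper name and the tiny search bound are made up for this sketch), the only generic approach is to guess candidate 64-hex-digit messages, hash each one with SHA-512, and compare. The real search space is 16^64 = 2^256 candidates, so such a loop can never realistically finish:

```python
import hashlib
import itertools

def find_preimage(target_hex, max_tries=1_000_000):
    # Brute-force guessing: enumerate 64-hex-digit messages and re-hash them.
    # The real space is 2**256, so this only "works" for toy targets.
    for n in itertools.count():
        if n >= max_tries:
            return None
        candidate = format(n, "064x")           # a 64-hex-digit message
        if hashlib.sha512(candidate.encode()).hexdigest() == target_hex:
            return candidate

# Only succeeds because the example input happens to be tiny.
target = hashlib.sha512(format(12345, "064x").encode()).hexdigest()
print(find_preimage(target))
```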

How do Hash-functions encode an infinite amount of data into a finite amount?

Hash functions always create an output with a fixed length, even though the input can be arbitrarily large.
So how is it possible that no information is lost here? Shouldn't some inputs result in the same output, then?
Yes. Two inputs can produce the same output; this is called a hash collision.
Hashes are designed so that computing the hash of some text is very easy, but reversing the process is difficult. The point of hashing isn't to store information. Instead, hashes are commonly used in security (and also in data structures).
For instance, websites will hash a user's password and store the hash instead of the actual password. This way, if the website's security is breached, the attacker only obtains the hashes, which still doesn't let the attacker log in, as it is very difficult to reverse-engineer a password from its hash.
The hash set is another application of hashing. By hashing an object, you can jump straight to the bucket where it would be stored, so you can check whether an object is present in the set in (expected) constant time: you only have to compare against the objects in the set whose hash matches the hash of the object you are checking. As the size of the hash set grows, so does the chance of a hash collision.
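A minimal sketch of the password scenario above (the function names are mine, and the salted PBKDF2 call goes slightly beyond a bare hash, which is what you would want in practice anyway): the site stores only the salt and digest, never the password, and checks a login attempt by re-hashing:

```python
import hashlib
import hmac
import os

def make_record(password: str):
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest                     # this pair is what the site stores

def check_password(password: str, salt: bytes, stored_digest: bytes) -> bool:
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return hmac.compare_digest(candidate, stored_digest)

salt, digest = make_record("hunter2")
print(check_password("hunter2", salt, digest))   # True
print(check_password("wrong", salt, digest))     # False
```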
So how is it possible, that no information is lost here?
It's not possible, and lots of information is lost.
In the case of a perfect hash there are no collisions, and we could even argue that no information is really lost (it's just not contained in the system alone), because we know all possible inputs in advance and know that the hashes produced never collide. The hashes can then be used as an index in a way that isn't possible, or isn't as efficient, with the input data itself, so they are useful.
In the case of a hash-based collection we use a hash code in the hope of having few collisions, so that lookups come close to O(1), but we still have some means of handling a collision if one does happen.
In the case of a cryptographic hash collisions exist, but it is extremely hard to find one deliberately, for (roughly speaking) similar reasons to why it is hard to break modern cryptography. So while there are certainly two passwords somewhere with the same hash, you couldn't find such a pair easily (especially if you aren't willing to use, say, a password several thousand pages long).
In the case of a checksum hash collisions are also possible, but because they are unlikely, corrupted data will almost certainly fail to match the stored hash.
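A small sketch of the checksum case (the payload and variable names are invented for the example): flip a single bit and the digest no longer matches, which is how corruption is detected with overwhelming probability:

```python
import hashlib

original = b"some payload worth protecting"
checksum = hashlib.sha256(original).hexdigest()   # stored alongside the data

corrupted = bytearray(original)
corrupted[3] ^= 0x01                              # simulate a single-bit flip

print(hashlib.sha256(original).hexdigest() == checksum)           # True
print(hashlib.sha256(bytes(corrupted)).hexdigest() == checksum)   # False
```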

Is there a fast hashing algorithm which is resistant to deliberate collision, and if so, which one?

I am developing an "open distributed cloud storage system".
By open I mean that anyone can participate in hosting of files.
My current design uses a SHA-1 hash of the file's content as the global file id.
It is given that the client already knows this hash value and receives the file from a "bandwidth donor".
The client now needs to verify that the file indeed is the correct one, by generating the hash and comparing it to the expected value.
However, my concern is that someone could deliberately modify a file to produce the same hash. As far as I know, this is easily doable for hashes of the CRC family. Some "googling" around revealed a lot of claims that the same would be easy for MD5.
Now my question is: is there a hashing algorithm which satisfies the criteria of being
fast for big amounts of data
well distributed in the hashing range (aka "unique")
has a sufficient target range ("bit length")
is resistant to deliberate collision attacks
All other means that I can think of achieving a setup that serves my needs involve a secret component, for example a secret openssl key or a shared secret salt for a hash function.
Unfortunately I cannot work with that.
What you are asking for is a one-way function, whose existence is a major open problem.
With cryptographic hash functions, the specific attack you wanted to avoid is called the "second pre-image attack".
That should help you Google what you want, but as far as I know there is actually no known practical second pre-image attack on MD5.
First of all, you probably found that it is easy to find two arbitrary files that have the same hash, and to find two different such pairs every time you try.
But it is difficult to generate a file to disguise as some specific file; in other words, it is unlikely that one of the aforementioned "two arbitrary files" actually belongs to a non-malicious agent in your storage.
If you're still not satisfied, you might want to try something like SHA-1 or SHA-2 or GOST.
First of all, a hash value can never identify a file, as there will always be collisions.
Having said that, what you are looking for is called a cryptographic hash. These are designed so that it is not feasible (other than by brute force) to modify data while keeping its hash, or to produce new data with a given hash.
As such, the SHA family is ok.
For the moment, SHA1 is adequate. No collisions are known.
It would help a lot to know the average size of the thing you are hashing. But most likely, if your platforms are predominantly 64-bit, SHA-512 is your best choice; you can truncate the hash and use only 256 bits of it. If your platforms are predominantly 32-bit, SHA-256 is your best choice.
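A hedged sketch of the verification step the question describes (the function names, SHA-256, and the chunk size are my choices, not part of the original design): re-hash the received bytes and compare against the expected digest before trusting the file:

```python
import hashlib
import hmac

def file_digest(path, algorithm="sha256", chunk_size=1 << 20):
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(path, expected_hex):
    # Constant-time comparison of the recomputed digest against the published one.
    return hmac.compare_digest(file_digest(path), expected_hex)

# verify("received.bin", "<expected hex digest published with the file id>")
```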

Purposely create two files to have the same hash?

If someone is purposely trying to modify two files to have the same hash, what are ways to stop them? Can md5 and sha1 prevent the majority case?
I was thinking of writing my own, and I figure that even if I don't do a good job, the user may not be able to fool mine if they don't know my hash.
What's the best way to prevent this?
MD5 is generally considered insecure if hash collisions are a major concern. SHA-1 is likewise no longer considered acceptable by the US government. There was a competition under way to find a replacement hash algorithm, but the recommendation at the moment is to use the SHA-2 family: SHA-256, SHA-384 or SHA-512. [Update 2012-10-02: NIST has chosen Keccak as the SHA-3 algorithm.]
You can try to create your own hash — it would probably not be as good as MD5, and 'security through obscurity' is likewise not advisable.
If you want security, hash with multiple hash algorithms. Being able to simultaneously create files that have hash collisions using a number of algorithms is excessively improbable. [And, in the light of comments, let me make it clear: I mean publish both the SHA-256 and the Whirlpool values for the file — not combining hash algorithms to create a single value, but using separate algorithms to create separate values. Generally, a corrupted file will fail to match any of the algorithms; if, perchance, someone has managed to create a collision value using one algorithm, the chance of also producing a second collision in one of the other algorithms is negligible.]
The Public TimeStamp uses an array of algorithms. See, for example, sqlcmd-86.00.tgz for an illustration.
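A minimal sketch of the "publish several digests" idea above (the algorithm choices and helper names are illustrative): compute independent digests with separate algorithms and require every one of them to match:

```python
import hashlib

ALGORITHMS = ("sha256", "sha512")   # could also include e.g. "blake2b"

def digests(data: bytes) -> dict:
    return {name: hashlib.new(name, data).hexdigest() for name in ALGORITHMS}

def verify_all(data: bytes, published: dict) -> bool:
    actual = digests(data)
    return all(actual[name] == value for name, value in published.items())

published = digests(b"original release")
print(verify_all(b"original release", published))   # True
print(verify_all(b"tampered release", published))   # False
```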
If the user doesn't know your hashing algorithm he also can't verify your signature on a document that you actually signed.
The best option is to use a standard cryptographic one-way hash algorithm that generates a long hash. SHA-256 creates a 256-bit hash, so a forger would have to try 2^255 different documents (on average) before they created one that matched a given document, which is pretty secure. If that's still not secure enough for you, there's SHA-512.
Also, I think it's worth mentioning that a good low-tech way to protect yourself against forged digitally-signed documents is to simply keep a copy of anything you sign. That way, if it comes down to a dispute, you can show that the original document you signed was altered.
There is a hierarchy of difficulty (for an attacker) here. It is easier to find two files with the same hash than to generate a file to match a given hash, and easier to do the latter if you don't have to respect form/content/length restrictions.
Thus, if you can require a well-defined document structure and length, you can make an attacker's life a bit harder no matter what underlying hash you use.
Why are you trying to create your own hash algorithm? What's wrong with SHA1HMAC?
Yes, there are repeats for hashes.
Any hash that is shorter than the plaintext necessarily carries less information. That means there will be some repeats. The key property of hashes is that the repeats are hard to reverse-engineer.
Consider CRC32, commonly used as a hash. It's a 32-bit quantity. Because there are far more than 2^32 possible messages, there will be repeats with CRC32.
The same idea applies to other hashes.
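An illustrative sketch (the message format is arbitrary): because CRC32 is only 32 bits, distinct inputs with the same value must exist, and a plain scan over a couple of hundred thousand short strings will usually stumble on a repeat thanks to the birthday paradox:

```python
import zlib

seen = {}
for n in range(200_000):
    data = f"message-{n}".encode()
    crc = zlib.crc32(data)
    if crc in seen:
        print(f"CRC32 repeat: {seen[crc]!r} and {data!r} -> {crc:#010x}")
        break
    seen[crc] = data
else:
    print("no repeat in this small sample, but one is still guaranteed to exist overall")
```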
This is called a "hash collision", and the best way to avoid it is to use a strong hash function. MD5 is relatively easy to artificially build colliding files, as seen here. Similarly, it's known there is a relatively efficient method for computing colliding SH1 files, although in this case "relatively efficient" still takes hunreds of hours of compute time.
Generally, MD5 and SHA1 are still expensive to crack, but not impossible. If you're really worried about it, use a stronger hash function, like SHA256.
Writing your own isn't actually a good idea unless you're a pretty expert cryptographer. Most of the simple ideas have been tried and there are well-known attacks against them.
If you really want to learn more about it, have a look at Schneier's Applied Cryptography.
I don't think coming up with your own hash algorithm is a good choice.
Another option is to use a salted MD5. For example, the input to your MD5 hash function has the string "acidzom!##" appended to it before being passed to the MD5 function.
There is also a good read on this at Slashdot.
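For the keyed variant mentioned above (HMAC rather than the ad-hoc "append a secret string" scheme; the key handling here is just a sketch), an attacker who doesn't hold the key can't even compute valid digests, let alone craft collisions against them:

```python
import hashlib
import hmac
import os

key = os.urandom(32)                 # kept secret on the server

def tag(message: bytes) -> str:
    return hmac.new(key, message, hashlib.sha256).hexdigest()

def verify(message: bytes, expected_tag: str) -> bool:
    return hmac.compare_digest(tag(message), expected_tag)

t = tag(b"file contents")
print(verify(b"file contents", t))   # True
print(verify(b"tampered", t))        # False
```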

Is the hash of a GUID unique?

I create a GUID (as a string) and get the hash of it. Can I consider this hash to be unique?
Not as reliably unique as the GUID itself, no.
Just to expand: you are drastically reducing the space of possible values if, say, you go from the GUID's 16 bytes down to a 4-byte hash.
As pointed out in the comments, the hash size makes a difference. The 4-byte figure was an assumption (a poor one, I know) that the hash might be .NET's, where the default hash size is 4 bytes (an int). So you can replace what I said above with whatever byte size your hash actually is.
Nope.
See here, if you want a mini GUID: https://devblogs.microsoft.com/oldnewthing/20080627-00/?p=21823
In a word, no.
Let's assume that your hash has fewer bits than the GUID. By the pigeonhole principle, more than one GUID must map to the same hash value, simply because there are fewer possible hashes than GUIDs.
If we assume that the hash has a larger number of bits than the GUID, there is a very small--but finite--chance of a collision, assuming you're using a good hash function.
No hash function that reduces an arbitrary sized data block to a fixed size number of bits will produce a 1-to-1 mapping between the two. There will always exist a chance of having two different data blocks be reduced to the same sequence of bits in the hash.
Good hash algorithms minimize the likelihood of this happening, and generally, the more bits in the hash, the lower the chance of a collision.
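To make the pigeonhole point concrete (the 4-byte truncation mirrors the 32-bit hash discussed above; the loop bound is arbitrary): GUIDs are 128 bits and effectively never repeat, but squeeze them through a 32-bit hash and the birthday paradox produces a collision after only tens of thousands of values:

```python
import hashlib
import uuid

seen = {}
for _ in range(300_000):
    guid = str(uuid.uuid4())
    short = hashlib.sha1(guid.encode()).digest()[:4]   # keep only 32 bits
    if short in seen:
        print(f"two distinct GUIDs share the 32-bit hash {short.hex()}:")
        print(seen[short])
        print(guid)
        break
    seen[short] = guid
```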
It's not guaranteed to be, due to hash collisions. The GUID itself is almost guaranteed to be.
For practical reasons you probably can assume that a hash is unique, but why not use the GUID itself?
No, and I wouldn't assume uniqueness of any hash value. That shouldn't matter, because hash values don't need to be unique; they just need to be evenly distributed across their range. The more even the distribution, the fewer collisions occur (in the hashtable), and fewer collisions mean better hashtable performance.
FYI: for a good description of how hash tables work, read the accepted answer to What are hashtables and hashmaps and their typical use cases?
If you use cryptographic hash (MD5, SHA1, RIPEMD160), the hash will be unique (modulo collisions which are very improbable -- SHA1 is used e.g. for digital signatures, and MD5 is also collision-resistant on random inputs). Though, why do you want to hash a GUID?