How can I make sure that a hash function won't produce the same cypher for 2+ different entries? - hash

Edit: some people flagged this question as a potential duplicate of this other one. While I agree that knowing how the birthday paradox applies to hashing functions, the 2 questions (and respective answers) address 2 different, albeit related, subjects.
The other question is asking "what are the odds of collision", whereas this question main focus is "how can I make sure that collision never happens".
I have a data lake stored in S3 where each day an ETL script dumps additional data from the day before.
Due to how the pipeline is built, it is possible for a very inconsiderate user that has admin access to produce duplicates in said data lake by manually interacting with the dump files coming from our OLTP database, and triggering the ETL script when it's not supposed to.
I thought that a good idea to prevent data duplication was to insert a form of security measure in my ETL script:
Produce a hash for each entry.
Store said hashes somewhere else (like a dynamodb table).
Whenever new data comes in, hash that as well and compare it with the already existing hashes.
If any of new hash is in the existing hashes, reject the associated entry entirely.
However, I know very little about hashing and I was reading that, although unlikely, 2 different sources can produce the same hash.
I understand it's really hard for it to happen in this situation, but I was wondering if there is a way to be 100% sure about it.
Any idea is much appreciated.

Long answer: what you want to study and explore is called "perfect hashing" (ie hashing guaranteed not to have collisions. https://en.wikipedia.org/wiki/Perfect_hash_function
Short answer: A cryptographic collision resistant algorithm like sha-1 is probably safe to use for all but the largest (PBs a day) datasets and even then its probably all right. Git uses sha-1 internally and code repositories probably deal with the most files on the planet and rarely have collisions.
See for details: https://ericsink.com/vcbe/html/cryptographic_hashes.html#:~:text=Git%20uses%20hashes%20in%20two,computed%20when%20it%20was%20stored.
Medium answer: this is actually a pretty hard problem overall and a frequent area of study for computer science and a lot depends on your particular use case and the context you're operating in. Cuckoo hashing, collision resistant algorithms, and hashing in general are probably all good terms to research. There's also a lot of art and science behind space (memory) and time (computer power needed) when picking these methods. A good rule of thumb is that perfect hashing will generally take up more space and time than a collision resistant cryptographic hash like sha-1.

Related

What are the downsides of using my own hashing algorithm instead of popular ones available?

I am a noob in algorithms and not really so smart. But I have a question in my mind. There are a lot of hashing algorithms available and those might be 10 times more complex than what I wrote, but almost all of them are predictable these days. Recently, I read that writing my own hashing function is not a good idea. But why? I was wondering how a program/programmer can break my logic that (for example) creates a unique hash for each string in 5+ steps. Suppose someone successfully injected a SQL query in my server and got all the hashes stored. How a program (like hashcat) may help him to decrypt those hashes? I can see a strong side of my own algorithm in this case, that it is known by no one and the hacker has no idea how it was implemented. On the other hand, well-known algorithms (like sha-1) are not unpredictable anymore. There are websites available that are highly eligible to efficiently break those hashes. So, my simple question is, why smart people do not recommend to use self-written hashing algorithms?
Security by obscurity can be an advantage, but you should never rely on it. You rely on the fact that your code stays secret, as soon as it becomes known (shared hosting, backups, source-control, ...) the stored passwords are propably not safe anymore.
Inventing a new safe algorithm is extremely difficult, even for cryptographers. There are many points to consider like correct salting or key-stretching, making sure that similar output does not allow to draw conclusions about the similarity of the input, and so on... Only algorithms withstanding years of attacks by other cryptographers are regarded as safe.
There is a better alternative to inventing your own scheme. With inventing an algorithm you actually add a secret to the hashing (your code), only with the knowledge of this code an attacker can start brute-forcing the passwords. A better way to add a secret is:
Hash the passwords with a known proven algorithm (BCrypt, SCrypt, PBKDF2).
Encrypt the resulting hash with a secret server-side key (two-way encryption).
This way you can also add a secret (the server side key). Only if the attacker has privileges on the server he can know the key, in this case (s)he would also know your algorithm. This scheme also allows to exchange the key when necessary, exchanging the hash algorithm would be much more difficult.

Ensuring a hash function is well-mixed with slicing

Forgive me if this question is silly, but I'm starting to learn about consistent hashing and after reading Tom White blog post on it here and realizing that most default hash functions are NOT well mixed I had a thought on ensuring that an arbitrary hash function is minimally well-mixed.
My thought is best explained using an example like this:
Bucket 1: 11000110
Bucket 2: 11001110
Bucket 3: 11010110
Bucket 4: 11011110
Under a standard hash ring implementation for consistent caching across these buckets, you would be get terribly performance, and nearly every entry would be lumped into Bucket 1. However, if we use bits 4&5 as the MSBs in each case then these buckets are suddenly excellently mixed, and assigning a new object to a cache becomes trivial and only requires examining 2 bits.
In my mind this concept could very easily be extended when building distributed networks across multiple nodes. In my particular case I would be using this to determine which cache to place a given piece of data into. The increased placement speed isn't a real concern, but ensuring that my caches are well-mixed is and I was considering just choosing a few bits that are optimally mixed for my given caches. Any information later indexed would be indexed on the basis of the same bits.
In my naive mind this is a much simpler solution than introducing virtual nodes or building a better hash function. That said, I can't see any mention of an approach like this and I'm concerned that in my hashing ignorance I'm doing something wrong here and I might be introducing unintended consequences.
Is this approach safe? Should I use it? Has this approach been used before and are there any established algorithms for determining the minimum unique group of bits?

Is there a fast hashing algorithm which is resistant to deliberate collision, and if so, which one?

I am developing an "open distributed cloud storage system".
By open I mean that anyone can participate in hosting of files.
My current design uses a sha1 hash of the files content as global file id.
It is given that the client already knows this hash value and receives the file from a "bandwidth donor".
The client now needs to verify that the file indeed is the correct one, by generating the hash and comparing it to the expected value.
However my concern is that someone could deliberately modify a file to produce the same hash. As far as I know this is doable easily for hashes of the CRC family. Some "googling" around revealed a lot of claims that the same would be easy for MD5.
Now my question is: Is there a hashing algorithm which satisfies the criteria of beeing
fast for big amounts of data
well distributed in the hashing range (aka "unique")
has a sufficient target range ("bit length")
is resistant to deliberate collision attacks
All other means that I can think of achieving a setup that serves my needs involve a secret component, for example a secret openssl key or a shared secret salt for a hash function.
Unfortunately I cannot work with that.
What you are asking for is a one-way function, whose existence is a major open problem.
With cryptographic hash functions, the specific attack you wanted to avoid is called the "second pre-image attack".
That should help you Googling what you want, but as far as I know there is actually no known practical second pre-image attack for MD5.
First of all, you probably found that it is easy to find two arbitrary files that have the same hash, and to find two different such pairs every time you try.
But it is difficult to generate a file to disguise as some specific file - in other words, it is unlikely that one of the prementioned "two arbitrary files" actually belongs to a non-malicious agent in your storage.
If you're still not satisfied, you might want to try something like SHA-1 or SHA-2 or GOST.
First of all, a hash value can never identify a file, as there will always be collisions.
Having said that, what you are looking for is called a cryptographic hash. These are designed to not (easily, i.e. other than brute force) allow modifications of the data while keeping the hash, or producing new data with a given hash.
As such, the SHA family is ok.
For the moment, SHA1 is adequate. No collisions are known.
It would help a lot to know the average size of the thing you are hashing. But most likely, if your platforms are predominantly 64-bit, SHA512 is your best choice. You can truncate the hash and use only 256-bits of it. If your platforms are predominantly 32-bit, SHA256 is your best choice.

How safe is it to rely on hashes for file identification?

I am designing a storage cloud software on top of a LAMP stack.
Files could have an internal ID, but it would have many advantages to store them not with an incrementing id as filename in the servers filesystems, but using an hash as filename.
Also hashes as identifier in the database would have a lot of advantages if the currently centralized database should be sharded or decentralized or some sort of master-master high availability environment should be set up. But I am not sure about that yet.
Clients can store files under any string (usually some sort of path and filename).
This string is guaranteed to be unique, because on the first level is something like "buckets" that users have go register like in Amazon S3 and Google storage.
My plan is to store files as hash of the client side defined path.
This way the storage server can directly serve the file without needing the database to ask which ID it is because it can calculate the hash and thus the filename on the fly.
But I am afraid of collisions. I currently think about using SHA1 hashes.
I heard that GIT uses hashes also revision identifiers as well.
I know that the chances of collisions are really really low, but possible.
I just cannot judge this. Would you or would you not rely on hash for this purpose?
I could also us some normalization of encoding of the path. Maybe base64 as filename, but i really do not want that because it could get messy and paths could get too long and possibly other complications.
Assuming you have a hash function with "perfect" properties and assuming cryptographic hash functions approach that the theory that applies is the same that applies to birthday attacks . What this says is that given a maximum number of files you can make the collision probability as small as you want by using a larger hash digest size. SHA has 160 bits so for any practical number of files the probability of collision is going to be just about zero. If you look at the table in the link you'll see that a 128 bit hash with 10^10 files has a collision probability of 10^-18 .
As long as the probability is low enough I think the solution is good. Compare with the probability of the planet being hit by an asteroid, undetectable errors in the disk drive, bits flipping in your memory etc. - as long as those probabilities are low enough we don't worry about them because they'll "never" happen. Just take enough margin and make sure this isn't the weakest link.
One thing to be concerned about is the choice of the hash function and it's possible vulnerabilities. Is there any other authentication in place or does the user simply present a path and retrieve a file?
If you think about an attacker trying to brute force the scenario above they would need to request 2^18 files before they can get some other random file stored in the system (again assuming 128 bit hash and 10^10 files, you'll have a lot less files and a longer hash). 2^18 is a pretty big number and the speed you can brute force this is limited by the network and the server. A simple lock the user out after x attempts policy can completely close this hole (which is why many systems implement this sort of policy). Building a secure system is complicated and there will be many points to consider but this sort of scheme can be perfectly secure.
Hope this is useful...
EDIT: another way to think about this is that practically every encryption or authentication system relies on certain events having very low probability for its security. e.g. I can be lucky and guess the prime factor on a 512 bit RSA key but it is so unlikely that the system is considered very secure.
Whilst the probability of a collision might be vanishingly small, imagine serving a highly confidential file from one customer to their competitor just because there happens to be a hash collision.
= end of business
I'd rather use hashing for things that were less critical when collisions DO occur ;-)
If you have a database, store the files under GUIDs - so not an incrementing index, but a proper globally unique identifier. They work nicely when it comes to distributed shards / high availability etc.
Imagine the worst case scenario and assume it will happen the week after you are featured in wired magazine as an amazing startup ... that's a good stress test for the algorithm.

Purposely create two files to have the same hash?

If someone is purposely trying to modify two files to have the same hash, what are ways to stop them? Can md5 and sha1 prevent the majority case?
I was thinking of writing my own and I figure even if I don't do a good job if the user doesn't know my hash he may not be able to fool mine.
What's the best way to prevent this?
MD5 is generally considered insecure if hash collisions are a major concern. SHA1 is likewise no longer considered acceptable by the US government. There is was a competition under way to find a replacement hash algorithm, but the recommendation at the moment is to use the SHA2 family - SHA-256, SHA-384 or SHA-512. [Update: 2012-10-02 NIST has chosen SHA-3 to be the algorithm Keccak.]
You can try to create your own hash — it would probably not be as good as MD5, and 'security through obscurity' is likewise not advisable.
If you want security, hash with multiple hash algorithms. Being able to simultaneously create files that have hash collisions using a number of algorithms is excessively improbable. [And, in the light of comments, let me make it clear: I mean publish both the SHA-256 and the Whirlpool values for the file — not combining hash algorithms to create a single value, but using separate algorithms to create separate values. Generally, a corrupted file will fail to match any of the algorithms; if, perchance, someone has managed to create a collision value using one algorithm, the chance of also producing a second collision in one of the other algorithms is negligible.]
The Public TimeStamp uses an array of algorithms. See, for example, sqlcmd-86.00.tgz for an illustration.
If the user doesn't know your hashing algorithm he also can't verify your signature on a document that you actually signed.
The best option is to use public-key one-way hashing algorithms that generate the longest hash. SHA-256 creates a 256-bit hash, so a forger would have to try 2255 different documents (on average) before they created one that matched a given document, which is pretty secure. If that's still not secure enough for you, there's SHA-512.
Also, I think it's worth mentioning that a good low-tech way to protect yourself against forged digitally-signed documents is to simply keep a copy of anything you sign. That way, if it comes down to a dispute, you can show that the original document you signed was altered.
There is a hierarchy of difficulty (for an attacker) here. It is easier to find two files with the same hash than to generate one to match a given hash, and easier to do the later if you don't have to respect form/content/lengths restrictions.
Thus, if it is possible to use a well defined document structure and lengths, you can make an attackers life a bit harder no matter what underling hash you use.
Why are you trying to create your own hash algorithm? What's wrong with SHA1HMAC?
Yes, there are repeats for hashes.
Any hash that is shorter than the plaintext is necessarily less information. That means there will be some repeats. The key for hashes is that the repeats are hard to reverse-engineer.
Consider CRC32 - commonly used as a hash. It's a 32-bit quantity. Because there are more than 2^32 messages in the universe, then there will be repeats with CRC32.
The same idea applies to other hashes.
This is called a "hash collision", and the best way to avoid it is to use a strong hash function. MD5 is relatively easy to artificially build colliding files, as seen here. Similarly, it's known there is a relatively efficient method for computing colliding SH1 files, although in this case "relatively efficient" still takes hunreds of hours of compute time.
Generally, MD5 and SHA1 are still expensive to crack, but not impossible. If you're really worried about it, use a stronger hash function, like SHA256.
Writing your own isn't actually a good idea unless you're a pretty expert cryptographer. most of the simple ideas have been tried and there are well-known attacks against them.
If you really want to learn more about it, have a look at Schneier's Applied Cryptography.
I don't think coming up with your own hash algorithm is a good choice.
Another good option is used Salted MD5. For example, the input to your MD5 hash function is appended with string "acidzom!##" before passing to MD5 function.
There is also a good reading at Slashdot.