Data in a blockchain is hashed with SHA-256, so when bad actors modify the data, the hash changes too. This is what secures a blockchain, so why is POW necessary?
I understand the double-spending problem, but couldn't you just include timestamps in that hash? Where would POW come into play?
POW is a consensus algorithm: it helps multiple actors/nodes/parties agree on what the actual state of the system should be, so that everyone considers the same state valid. It simply solves a different problem than linking blocks to each other in a blockchain. There are many other consensus algorithms as well, Proof of Stake for example.
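To make the distinction concrete, here is a toy sketch (not how any real chain is implemented; the block layout, the `mine` helper and the difficulty value are all assumptions made up for illustration). Hash-linking alone only detects tampering; anyone can recompute the hashes after editing a block. Requiring a nonce search makes rewriting history expensive, which is what lets nodes agree on one chain.

```python
import hashlib

def block_hash(prev_hash: str, data: str, nonce: int) -> str:
    """SHA-256 over the previous block's hash, the payload, and a nonce."""
    payload = f"{prev_hash}|{data}|{nonce}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def mine(prev_hash: str, data: str, difficulty: int = 3) -> tuple:
    """Try nonces until the hash starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while True:
        digest = block_hash(prev_hash, data, nonce)
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1

genesis = "0" * 64  # placeholder "previous hash" for the first block
nonce, digest = mine(genesis, "alice pays bob 1", difficulty=3)

# Checking the work costs one hash; forging a block means repeating the search
# for that block and for every block that comes after it.
print(nonce, digest)
```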
Edit: some people flagged this question as a potential duplicate of this other one. While I agree that it helps to know how the birthday paradox applies to hashing functions, the two questions (and their respective answers) address two different, albeit related, subjects.
The other question is asking "what are the odds of collision", whereas this question's main focus is "how can I make sure that a collision never happens".
I have a data lake stored in S3 where each day an ETL script dumps additional data from the day before.
Due to how the pipeline is built, it is possible for a very inconsiderate user that has admin access to produce duplicates in said data lake by manually interacting with the dump files coming from our OLTP database, and triggering the ETL script when it's not supposed to.
I thought that a good idea to prevent data duplication was to insert a form of security measure in my ETL script:
Produce a hash for each entry.
Store said hashes somewhere else (like a dynamodb table).
Whenever new data comes in, hash that as well and compare it with the already existing hashes.
If any of the new hashes is already among the existing hashes, reject the associated entry entirely.
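The steps above could be sketched roughly as follows. This is only an illustration: an in-memory `set` stands in for the DynamoDB table, entries are assumed to be JSON-serializable dicts, and `entry_hash` / `filter_new` are hypothetical names.

```python
import hashlib
import json

def entry_hash(entry: dict) -> str:
    """Hash a canonical JSON form so key order does not change the result."""
    canonical = json.dumps(entry, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def filter_new(entries, seen_hashes: set) -> list:
    """Return only entries whose hash has not been seen; record the new hashes."""
    accepted = []
    for entry in entries:
        h = entry_hash(entry)
        if h in seen_hashes:
            continue  # duplicate: reject the entry entirely
        seen_hashes.add(h)
        accepted.append(entry)
    return accepted

seen = set()  # stand-in for the external hash store
day1 = [{"id": 1, "amount": 10}, {"id": 2, "amount": 20}]
rerun = [{"amount": 10, "id": 1}, {"id": 3, "amount": 30}]  # first entry repeats

first = filter_new(day1, seen)
second = filter_new(rerun, seen)
print(len(first), second)
```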
However, I know very little about hashing and I was reading that, although unlikely, 2 different sources can produce the same hash.
I understand it's really hard for it to happen in this situation, but I was wondering if there is a way to be 100% sure about it.
Any idea is much appreciated.
Long answer: what you want to study and explore is called "perfect hashing" (i.e. hashing guaranteed not to have collisions). https://en.wikipedia.org/wiki/Perfect_hash_function
Short answer: A cryptographic collision-resistant algorithm like sha-1 is probably safe to use for all but the largest (PBs a day) datasets, and even then it's probably all right. Git uses sha-1 internally, and code repositories probably deal with the most files on the planet and rarely have collisions.
See for details: https://ericsink.com/vcbe/html/cryptographic_hashes.html#:~:text=Git%20uses%20hashes%20in%20two,computed%20when%20it%20was%20stored.
Medium answer: this is actually a pretty hard problem overall and a frequent area of study for computer science and a lot depends on your particular use case and the context you're operating in. Cuckoo hashing, collision resistant algorithms, and hashing in general are probably all good terms to research. There's also a lot of art and science behind space (memory) and time (computer power needed) when picking these methods. A good rule of thumb is that perfect hashing will generally take up more space and time than a collision resistant cryptographic hash like sha-1.
I am a noob in algorithms and not really so smart, but I have a question in mind. There are a lot of hashing algorithms available, and they might be 10 times more complex than what I wrote, but almost all of them are predictable these days. Recently I read that writing my own hashing function is not a good idea. But why? I was wondering how a program or programmer could break my logic, which (for example) creates a unique hash for each string in 5+ steps. Suppose someone successfully injected a SQL query into my server and got all the stored hashes. How might a program (like hashcat) help him decrypt those hashes? I can see a strength of my own algorithm in this case: it is known by no one, and the hacker has no idea how it was implemented. On the other hand, well-known algorithms (like sha-1) are not unpredictable anymore. There are websites that can break those hashes efficiently. So my simple question is: why do smart people recommend against using self-written hashing algorithms?
Security by obscurity can be an advantage, but you should never rely on it. You are relying on your code staying secret; as soon as it becomes known (shared hosting, backups, source control, ...) the stored passwords are probably not safe anymore.
Inventing a new safe algorithm is extremely difficult, even for cryptographers. There are many points to consider, like correct salting or key-stretching, making sure that similar outputs do not allow conclusions to be drawn about the similarity of the inputs, and so on... Only algorithms that have withstood years of attacks by other cryptographers are regarded as safe.
There is a better alternative to inventing your own scheme. By inventing an algorithm you are really adding a secret to the hashing (your code): only with knowledge of this code can an attacker start brute-forcing the passwords. A better way to add a secret is:
Hash the passwords with a known proven algorithm (BCrypt, SCrypt, PBKDF2).
Encrypt the resulting hash with a secret server-side key (two-way encryption).
This way you can also add a secret (the server-side key). Only an attacker with privileges on the server can know the key, and in that case (s)he would also know your algorithm. This scheme also allows you to exchange the key when necessary, whereas exchanging the hash algorithm would be much more difficult.
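A minimal sketch of the first step only, using PBKDF2 from Python's standard library. The iteration count and the helper names are assumptions for illustration. The second step (encrypting the stored hash with a server-side key) is deliberately left out here, because the standard library has no suitable cipher; it would need a vetted third-party crypto library.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor; tune it for your own hardware

def hash_password(password: str, salt: bytes = None) -> tuple:
    """Derive a PBKDF2-HMAC-SHA256 hash; a fresh random salt is made if none is given."""
    if salt is None:
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest

def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the hash with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected)

salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))
print(verify_password("wrong guess", salt, stored))
```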
I am developing an "open distributed cloud storage system".
By open I mean that anyone can participate in hosting of files.
My current design uses a sha1 hash of the files content as global file id.
It is given that the client already knows this hash value and receives the file from a "bandwidth donor".
The client now needs to verify that the file indeed is the correct one, by generating the hash and comparing it to the expected value.
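For illustration, that client-side check might look something like this (a sketch only; `content_id` and `verify` are hypothetical names, and the file is read in chunks so large files do not need to fit in memory):

```python
import hashlib
import io

def content_id(stream, algorithm: str = "sha1", chunk_size: int = 1 << 20) -> str:
    """Hash a binary stream chunk by chunk and return the hex digest."""
    h = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        h.update(chunk)
    return h.hexdigest()

def verify(stream, expected_id: str, algorithm: str = "sha1") -> bool:
    """True if the received content hashes to the ID the client already knows."""
    return content_id(stream, algorithm) == expected_id

original = b"hello distributed storage"
file_id = content_id(io.BytesIO(original))          # the global file ID
ok = verify(io.BytesIO(original), file_id)          # honest donor
tampered = verify(io.BytesIO(original + b"!"), file_id)  # modified content
print(ok, tampered)
```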
However my concern is that someone could deliberately modify a file to produce the same hash. As far as I know this is doable easily for hashes of the CRC family. Some "googling" around revealed a lot of claims that the same would be easy for MD5.
Now my question is: Is there a hashing algorithm which satisfies the following criteria?
is fast for big amounts of data
is well distributed in the hashing range (aka "unique")
has a sufficient target range ("bit length")
is resistant to deliberate collision attacks
All other means that I can think of achieving a setup that serves my needs involve a secret component, for example a secret openssl key or a shared secret salt for a hash function.
Unfortunately I cannot work with that.
What you are asking for is a one-way function, whose existence is a major open problem.
With cryptographic hash functions, the specific attack you wanted to avoid is called the "second pre-image attack".
That should help you Google what you want, but as far as I know there is actually no known practical second pre-image attack for MD5.
First of all, you probably found that it is easy to find two arbitrary files that have the same hash, and to find two different such pairs every time you try.
But it is difficult to generate a file that disguises itself as some specific file. In other words, it is unlikely that one of the aforementioned "two arbitrary files" actually belongs to a non-malicious agent in your storage.
If you're still not satisfied, you might want to try something like SHA-1 or SHA-2 or GOST.
First of all, a hash value can never identify a file, as there will always be collisions.
Having said that, what you are looking for is called a cryptographic hash. These are designed to not (easily, i.e. other than brute force) allow modifications of the data while keeping the hash, or producing new data with a given hash.
As such, the SHA family is ok.
For the moment, SHA1 is adequate. No collisions are known.
It would help a lot to know the average size of the thing you are hashing. But most likely, if your platforms are predominantly 64-bit, SHA512 is your best choice. You can truncate the hash and use only 256 bits of it. If your platforms are predominantly 32-bit, SHA256 is your best choice.
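Truncating is just a matter of keeping the leading bytes of the digest. A small sketch (the function name is made up). Note that this plain truncation is not the same thing as the standardized SHA-512/256 variant, which uses different initial values and therefore gives different output.

```python
import hashlib

def sha512_truncated(data: bytes, bits: int = 256) -> str:
    """Return the first `bits` bits of the SHA-512 digest as a hex string."""
    if bits % 8 or not 0 < bits <= 512:
        raise ValueError("bits must be a multiple of 8 between 8 and 512")
    return hashlib.sha512(data).digest()[: bits // 8].hex()

file_id = sha512_truncated(b"example file contents")
print(file_id)
```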
I am designing a storage cloud software on top of a LAMP stack.
Files could have an internal ID, but it would have many advantages to store them in the servers' filesystems not with an incrementing ID as the filename, but with a hash as the filename.
Also hashes as identifier in the database would have a lot of advantages if the currently centralized database should be sharded or decentralized or some sort of master-master high availability environment should be set up. But I am not sure about that yet.
Clients can store files under any string (usually some sort of path and filename).
This string is guaranteed to be unique, because the first level is something like "buckets" that users have to register, as in Amazon S3 and Google Storage.
My plan is to store files as hash of the client side defined path.
This way the storage server can directly serve the file without needing the database to ask which ID it is because it can calculate the hash and thus the filename on the fly.
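Something along these lines is what I have in mind, shown here in Python rather than PHP for brevity. The Unicode normalization step and the two-level directory fan-out are just assumptions of this sketch, and `storage_name` is a made-up name.

```python
import hashlib
import unicodedata

def storage_name(client_path: str) -> str:
    """Map a client-defined path (bucket included) to an on-disk filename."""
    # Normalize first so visually identical paths produce the same hash.
    key = unicodedata.normalize("NFC", client_path)
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    # Fan out into subdirectories to avoid huge flat directories.
    return f"{digest[:2]}/{digest[2:4]}/{digest}"

name = storage_name("mybucket/photos/2024/cat.jpg")
print(name)
```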
But I am afraid of collisions. I currently think about using SHA1 hashes.
I heard that Git also uses hashes as revision identifiers.
I know that the chances of collisions are really really low, but possible.
I just cannot judge this. Would you or would you not rely on hash for this purpose?
I could also use some normalized encoding of the path, maybe base64, as the filename, but I really do not want that because it could get messy, paths could get too long, and there may be other complications.
Assuming you have a hash function with "perfect" properties, and assuming cryptographic hash functions approach that, the theory that applies is the same one that applies to birthday attacks. What this says is that, given a maximum number of files, you can make the collision probability as small as you want by using a larger hash digest size. SHA-1 has 160 bits, so for any practical number of files the probability of collision is going to be just about zero. If you look at the table in the link you'll see that a 128-bit hash with 10^10 files has a collision probability of 10^-18.
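If you want to plug in your own numbers, the usual birthday approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) is easy to evaluate. The sketch below assumes an ideal, uniformly distributed hash; `collision_probability` is just an illustrative name. For 10^10 files and 128 bits it gives roughly 1.5 * 10^-19, i.e. at or below the order of magnitude quoted above.

```python
import math

def collision_probability(num_items: float, hash_bits: int) -> float:
    """Birthday approximation for at least one collision among num_items hashes."""
    exponent = -num_items * (num_items - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)

p128 = collision_probability(1e10, 128)
p160 = collision_probability(1e10, 160)
print(p128, p160)
```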
As long as the probability is low enough I think the solution is good. Compare with the probability of the planet being hit by an asteroid, undetectable errors in the disk drive, bits flipping in your memory etc. - as long as those probabilities are low enough we don't worry about them because they'll "never" happen. Just take enough margin and make sure this isn't the weakest link.
One thing to be concerned about is the choice of the hash function and its possible vulnerabilities. Is there any other authentication in place, or does the user simply present a path and retrieve a file?
If you think about an attacker trying to brute-force the scenario above, a blind guess hits an existing file with a probability of about 10^10 / 2^128, so they would need on the order of 2^95 requests before they get some other random file stored in the system (again assuming a 128-bit hash and 10^10 files; you'll have far fewer files and a longer hash). That is an enormous number, and the speed at which you can brute-force this is limited by the network and the server. A simple "lock the user out after x attempts" policy can completely close this hole (which is why many systems implement this sort of policy). Building a secure system is complicated and there will be many points to consider, but this sort of scheme can be perfectly secure.
Hope this is useful...
EDIT: another way to think about this is that practically every encryption or authentication system relies on certain events having very low probability for its security. e.g. I can be lucky and guess the prime factor on a 512 bit RSA key but it is so unlikely that the system is considered very secure.
Whilst the probability of a collision might be vanishingly small, imagine serving a highly confidential file from one customer to their competitor just because there happens to be a hash collision.
= end of business
I'd rather use hashing for things that were less critical when collisions DO occur ;-)
If you have a database, store the files under GUIDs - so not an incrementing index, but a proper globally unique identifier. They work nicely when it comes to distributed shards / high availability etc.
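A rough sketch of that idea, with a plain dict standing in for the database table (the names are made up for illustration). The trade-off compared with hashing the path is that every lookup has to go through the index.

```python
import uuid

index = {}  # stand-in for a database table: client path -> file ID

def store(client_path: str) -> str:
    """Return the existing ID for a path, or assign a fresh random GUID."""
    file_id = index.get(client_path)
    if file_id is None:
        file_id = str(uuid.uuid4())
        index[client_path] = file_id
    return file_id

a = store("bucket/a.txt")
b = store("bucket/b.txt")
print(a, b)
```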
Imagine the worst case scenario and assume it will happen the week after you are featured in wired magazine as an amazing startup ... that's a good stress test for the algorithm.
I've been asked to look for a perfect hash/one way function to be able to hash 10^11 numbers.
However, as we'll be using an embedded device, it won't have the memory to store the relevant buckets, so I was wondering if it's possible to have a decent (minimal) perfect hash without them?
The plan is to use the device to hash the number(s) and we use a rainbow table or a file using the hash as the offset.
Cheers
Edit:
I'll try to provide some more info :)
1) 10^11 is actually now 10^10, so that makes it easier. This number is the count of possible combinations, so we could get a number between 0000000001 and 10000000000 (10^10).
2) The plan is to use it as part of a one-way function to make the number secure so we can send it by insecure means.
We will then look up the original number at the other end using a rainbow table.
The problem is that the source devices generally have 512k-4Meg of memory to use.
3) It must be perfect - we 100% cannot have a collision.
Edit2:
4) We can't use encryption, as we've been told it's not really possible on the devices, and key management would be a nightmare if we could.
Edit3:
As this is not sensible, it's a purely academic question now (I promise).
Okay, since you've clarified what you're trying to do, I rewrote my answer.
To summarize: Use a real encryption algorithm.
First, let me go over why your hashing system is a bad idea.
What is your hashing system, anyway?
As I understand it, your proposed system is something like this:
Your embedded system (which I will call C) is sending some sort of data with a value space of 10^11. This data needs to be kept confidential in transit to some sort of server (which I will call S).
Your proposal is to send the value hash(salt + data) to S. S will then use a rainbow table to reverse this hash and recover the data. salt is a shared value known to both C and S.
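To make the discussion concrete, here is a scaled-down sketch of that scheme as I understand it. The salt value, the use of SHA-256 and the tiny value space are placeholders, and a plain lookup dict stands in for the rainbow table (a real rainbow table trades lookup time for space, but the role it plays is the same).

```python
import hashlib

SALT = b"shared-secret-salt"  # hypothetical value known to both C and S
VALUE_SPACE = 10_000          # scaled down from 10^10 so the table builds quickly

def encode(value: int, salt: bytes = SALT) -> str:
    """What C would send: hash(salt + data)."""
    return hashlib.sha256(salt + str(value).encode("ascii")).hexdigest()

def build_table(salt: bytes = SALT, space: int = VALUE_SPACE) -> dict:
    """What S must precompute: every possible hash mapped back to its input."""
    return {encode(v, salt): v for v in range(1, space + 1)}

table = build_table()      # must be rebuilt in full whenever the salt changes
sent = encode(4242)        # computed on the embedded device
recovered = table[sent]    # "decryption" on the server
print(recovered)
```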
This is an encryption algorithm
An encryption algorithm, when you boil it down, is any algorithm that gives you confidentiality. Since your goal is confidentiality, any algorithm that satisfies your goals is an encryption algorithm, including this one.
This is a very poor encryption algorithm
First, there is an unavoidable chance of collision. Moreover, the set of colliding values differs each day.
Second, decryption is extremely CPU- and memory-intensive even for the legitimate server S. Changing the salt is even more expensive.
Third, although your stated goal is avoiding key management, your salt is a key! You haven't solved key management at all; anyone with the salt will be able to crack the message just as well as you can.
Fourth, it's only usable from C to S. Your embedded system C will not have enough computational resources to reverse hashes, and can only send data.
This isn't any faster than a real encryption algorithm on the embedded device
Most secure hashing algorithms are just as computationally expensive as a reasonable block cipher, if not worse. For example, SHA-1 requires doing the following for each 512-bit block:
Allocate 12 32-bit variables.
Allocate 80 32-bit words for the expanded message
64 times: Perform three array lookups, three 32-bit xors, and a rotate operation
80 times: Perform up to five 32-bit binary operations (some combination of xor, and, or, and not, depending on the round); then a rotate, array lookup, four adds, another rotate, and several memory loads/stores.
Perform five 32-bit twos-complement adds
There is one chunk per 512 bits of the message, plus a possible extra chunk at the end. This is 1136 binary operations per chunk (not counting memory operations), or about 18 operations per byte.
For comparison, the RC4 encryption algorithm requires four operations (three additions, plus an xor on the message) per byte, plus two array reads and two array writes. It also requires only 258 bytes of working memory, vs a peak of 368 bytes for SHA-1.
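For reference, the whole of RC4 fits in a few lines, which is where the per-byte operation count comes from. This is a sketch written only to show the size of the inner loop, not a recommendation to deploy this exact code; the function name is arbitrary.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt `data` with RC4 (the operation is symmetric)."""
    # Key scheduling: permute the 256-byte state using the key.
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]

    # Keystream generation: per byte, three additions, one swap, one xor.
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)

ciphertext = rc4(b"Key", b"Plaintext")
print(ciphertext.hex())
print(rc4(b"Key", ciphertext))
```

The working state is the 256-byte array plus the two indices `i` and `j`, which matches the 258 bytes mentioned above.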
Key management is fundamental
With any confidentiality system, you must have some sort of secret. If you have no secrets, then anyone else can implement the same decoding algorithm, and your data is exposed to the world.
So, you have two choices as to where to put the secrecy. One option is to make the encipherment/decipherment algorithms secret. However, if the code (or binaries) for the algorithm is ever leaked, you lose - it's quite hard to replace such an algorithm.
Thus, secrets are generally made easy to replace - this is what we call a key.
Your proposed usage of hash algorithms would require a salt - this is the only secret in the system and is therefore a key. Whether you like it or not, you will have to manage this key carefully. And it's a lot harder to replace if leaked than other keys - you have to spend many CPU-hours generating a new rainbow table every time it's changed!
What should you do?
Use a real encryption algorithm, and spend some time actually thinking about key management. These issues have been solved before.
First, use a real encryption algorithm. AES has been designed for high performance and low RAM requirements. You could also use a stream cipher like RC4 as I mentioned before - the thing to watch out for with RC4, however, is that you must discard the first 4 kilobytes or so of output from the cipher, or you will be vulnerable to the same attacks that plagued WEP.
Second, think about key management. One option is to simply burn a key into each client, and physically go out and replace it if the client is compromised. This is reasonable if you have easy physical access to all of the clients.
Otherwise, if you don't care about man-in-the-middle attacks, you can simply use Diffie-Hellman key exchange to negotiate a shared key between S and C. If you are concerned about MitMs, then you'll need to start looking at ECDSA or something to authenticate the key obtained from the D-H exchange - beware that when you start going down that road, it's easy to get things wrong, however. I would recommend implementing TLS at that point. It's not beyond the capabilities of an embedded system - indeed, there are a number of embedded commercial (and open source) libraries available already. If you don't implement TLS, then at least have a professional cryptographer look over your algorithm before implementing it.
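To give a feel for how little computation the unauthenticated Diffie-Hellman exchange involves, here is a toy sketch. The modulus and base are deliberately small, arbitrary illustration values (far too weak for real use), and the helper names are made up; a real deployment should use a vetted library and standardized parameters.

```python
import secrets

P = 2**127 - 1  # toy prime modulus, far too small for real security
G = 3           # arbitrary base chosen for illustration

def keypair() -> tuple:
    """Pick a random private exponent and derive the public value G^private mod P."""
    private = secrets.randbelow(P - 3) + 2  # uniformly in [2, P - 2]
    public = pow(G, private, P)
    return private, public

def shared_secret(own_private: int, peer_public: int) -> int:
    """Both sides arrive at G^(a*b) mod P without ever sending a or b."""
    return pow(peer_public, own_private, P)

c_priv, c_pub = keypair()  # embedded client C
s_priv, s_pub = keypair()  # server S
c_key = shared_secret(c_priv, s_pub)
s_key = shared_secret(s_priv, c_pub)
print(c_key == s_key)
```

Only the public values cross the wire. Nothing here authenticates them, which is exactly the man-in-the-middle gap described above.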
There is obviously no such thing as a "perfect" hash unless you have at least as many hash buckets as inputs; if you don't, then inevitably it will be possible for two of your inputs to share the same hash bucket.
However, it's unlikely you'll be storing all the numbers between 0 and 10^11. So what's the pattern? If there's a pattern, there may be a perfect hash function for your actual data set.
It's really not that important to find a "perfect" hash function anyway, though. Hash tables are very fast. A function with a very low collision rate - and when hashing integers, that means nearly any simple function, like modulus - is fine and you'll get O(1) average performance.