I am working on a project where I have a tree of objects. This tree can be quite large and is subject to very frequent modifications (e.g. adding or removing a node, changing some properties of a node, and so on) by multiple users. Every time an update is published by a user, I need to be able to get some hash of the tree as it is after the user modified it, so that the user can sign the update with his private RSA key. Therefore I obviously need the hash to be cryptographically secure. However, hashing a linear representation of the whole tree over and over, every time a user changes just one node, is infeasible.
I thought about the following strategy, but I am not sure whether it works out properly:
I add to each node a new field: the SHA-256 hash of its children nodes.
The hash of a node is then computed over all of the node's fields, including the hash of its children.
Now, updating the tree should be easy: every time I update a node, I recompute the hash field of its parent, then its grandparent, and so on until the root is reached, and I use the hash of the root as the hash value. This reduces the complexity of the operation to O(log N) rather than O(N).
However, I know that it is never safe to trust one's own intuition about cryptography. So is this procedure secure?
This is called a hash tree or Merkle tree. It's nothing new and it is secure. It is often used to parallelize hashing, as the hash functions themselves are strictly sequential in nature.
Don't concatenate data and hashes, though, unless you explicitly include the size of the data; it is better to concatenate only hashes.
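For illustration, here is a minimal sketch of that per-node hashing in C++ (the Node layout, the payload serialization, and the OpenSSL-based sha256() wrapper are assumptions made for the example, not part of the question):

#include <openssl/sha.h>   // link with -lcrypto
#include <string>
#include <vector>

// SHA-256 of a byte string, returned as the raw 32-byte digest.
static std::string sha256(const std::string& data) {
    unsigned char md[SHA256_DIGEST_LENGTH];
    SHA256(reinterpret_cast<const unsigned char*>(data.data()), data.size(), md);
    return std::string(reinterpret_cast<const char*>(md), SHA256_DIGEST_LENGTH);
}

struct Node {
    std::string payload;             // the node's own fields, already serialized
    std::vector<Node*> children;
    std::string hash;                // cached hash of the subtree rooted here
};

// Recompute the hash of a node from its own (hashed) payload and the cached
// hashes of its children. Because every piece being concatenated is a
// fixed-length 32-byte digest, the concatenation is unambiguous.
std::string computeHash(Node& n) {
    std::string input = sha256(n.payload);
    for (Node* child : n.children)
        input += child->hash;        // children hashes are assumed to be up to date
    n.hash = sha256(input);
    return n.hash;
}

// After modifying a node, re-hash it and then each ancestor up to the root:
// O(depth) hash computations instead of re-hashing the whole tree.
void updatePath(std::vector<Node*>& nodeThenAncestorsUpToRoot) {
    for (Node* n : nodeThenAncestorsUpToRoot)
        computeHash(*n);
}

The root's hash is then the value that gets signed.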
In my opinion your algorithm is already good enough.
Assuming that SHA-256 is secure (well, at least, its name is "Secure Hash Algorithm"), one can prove, by induction on the depth of the tree, that your algorithm is secure as well.
I have a bunch of large objects, and structures of those, and vectors of those. It's sometimes important to check the integrity of the composite objects; for that I'm using a SHA-256 "signature" of the objects.
There are at least two ways to define the signature of a composite object: by computing the SHA of the concatenation of the components, or by computing the SHA of the concatenation of the SHAs of the components.
That is, with the first method the signature of a vector Object0, Object1, Object2 would be sha(Object0 Object1 Object2), and with the second method it would be sha(sha(Object0) sha(Object1) sha(Object2)).
It's a lot faster in what I'm doing to sign composite objects with the second method. The question is, does this method, computing SHAs of SHAs, increase the chances of collisions? Do I sacrifice any security because I'm hashing not the objects but hashes of the objects?
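Concretely, the two methods would look roughly like this (a sketch; the sha256() wrapper around OpenSSL's SHA256() is an assumption made for illustration, and the objects are treated as byte strings):

#include <openssl/sha.h>   // link with -lcrypto
#include <string>

static std::string sha256(const std::string& d) {
    unsigned char md[SHA256_DIGEST_LENGTH];
    SHA256(reinterpret_cast<const unsigned char*>(d.data()), d.size(), md);
    return std::string(reinterpret_cast<const char*>(md), SHA256_DIGEST_LENGTH);
}

// Method 1: one hash over the concatenated objects.
std::string method1(const std::string& o0, const std::string& o1, const std::string& o2) {
    return sha256(o0 + o1 + o2);
}

// Method 2: hash of the concatenation of the objects' hashes.
std::string method2(const std::string& o0, const std::string& o1, const std::string& o2) {
    return sha256(sha256(o0) + sha256(o1) + sha256(o2));
}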
What you have described is the well-known structure of a Merkle tree or hash tree. A Git repository is basically a giant Merkle tree.
The security of such a structure is as strong as the preimage resistance of the hash function of your choice.
Although I can't provide a mathematical proof for this, I'd say: No, it doesn't matter.
When you have a Merkle tree, what is the minimal number of hashes needed to verify a change to one leaf node?
Am I correct in my understanding that, at first, only the top hash (the Merkle tree root or hash of the Merkle tree root) is needed? And then once a leaf is modified, you need to obtain the hashes of each row "visited" while descending to the leaf node that got modified?
So if the root has, say, ten children and one grandchild is modified and I want to verify that particular grandchild, I need to obtain the new Merkle root hash, the hashes of the ten children, and the hashes of the children of the grandchild's parent.
So at every modification you always need to obtain, at least, all the hashes from the first row? (Otherwise how do you reconstruct and verify the Merkle root hash?)
In general, Merkle trees have not been designed to indicate which hash value is actually incorrect. Instead, they make it possible to obtain an efficient hash over large data structures. The hash of each leaf node can be calculated separately (and, of course, each branch as well, although that's just hashes).
If you want to determine which node is invalid, you should keep the entire Merkle tree. If you have another party doing the calculations, you can indeed descend into a branch of the tree to find the altered leaf node.
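For a binary Merkle tree, verifying one leaf against a known root hash needs only the sibling hash at each level of the path, O(log N) hashes in total. A sketch (the ProofStep structure and the OpenSSL-based sha256() wrapper are assumptions made for illustration):

#include <openssl/sha.h>   // link with -lcrypto
#include <string>
#include <vector>

static std::string sha256(const std::string& d) {
    unsigned char md[SHA256_DIGEST_LENGTH];
    SHA256(reinterpret_cast<const unsigned char*>(d.data()), d.size(), md);
    return std::string(reinterpret_cast<const char*>(md), SHA256_DIGEST_LENGTH);
}

// One step of the authentication path: the sibling's hash at this level and
// whether that sibling sits to the left of the node being checked.
struct ProofStep {
    std::string siblingHash;
    bool siblingIsLeft;
};

// Recompute the root from a leaf and its authentication path, then compare
// it with the root hash we already trust.
bool verifyLeaf(const std::string& leafData,
                const std::vector<ProofStep>& path,
                const std::string& expectedRoot) {
    std::string h = sha256(leafData);
    for (const ProofStep& step : path)
        h = step.siblingIsLeft ? sha256(step.siblingHash + h)
                               : sha256(h + step.siblingHash);
    return h == expectedRoot;
}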
Wikipedia says:
preimage resistance: for essentially all pre-specified outputs, it is computationally infeasible to find any input which hashes to that output, i.e., it is difficult to find any preimage x given a "y" such that h(x) = y.
second-preimage resistance: it is computationally infeasible to find any second input which has the same output as a specified input, i.e., given x, it is difficult to find a second preimage x' ≠ x such that h(x) = h(x′).
Yet, I don't understand it. Doesn't h(x′) (where x' is input) generate that y (the output), which is then compared to the same h(x)?
Say I have the string "example". It generates the MD5 hash "1a79a4d60de6718e8e5b326e338ae533". Why is it different to just use the MD5 hash, compared to computing MD5("example")?
Ideal hashing is like taking the fingerprint of a person: it is unique, it is non-reversible (you can't get the whole person back just from the fingerprint), and it can serve as a short and simple identifier for the given person.
If we bring some of the terminology you introduced into our analogy, we see that preimage resistance refers to the hash function's ability to be non-reversible. Imagine if you could generate the likeness of a whole person from their fingerprint; aside from being really cool, this would also be very dangerous. For the same reason, hash functions must be made so that an attacker cannot find the original message that generated the hash. In that sense, hash functions are one-way: the message generates the hash and not the other way round.
Second-preimage resistance refers to a given hash function's ability to be unique. Forensic fingerprinting would be a gross waste of time if any number of individuals could share the same fingerprint (let's exclude identical twins for now). If a given hash were used for verification against data corruption, it would be quite pointless if there were a good chance that corrupt data could generate the same hash.
To provide both preimage resistance and second-preimage resistance, hash functions adopt several traits to help them. One very common trait is that the input has no apparent correspondence to the output: a single-bit change can produce a hash that shares no bytes at all with the hash of the original input. For this reason, a good hash function is commonly used in message authentication.
Whilst you are right that comparing the original messages directly would be functionally equivalent to comparing the hashes, it is simply not feasible in the majority of cases. For example:
If party A wanted to reliably send a message to party B, party A/B would need to agree upon a scheme to detect data corruption during transfer. Note: party B does not have the original message until party A sends it.
A possible scheme would be to transfer the message twice, so that party B can verify that the second copy equals the first. The problem with this is that there is a chance that corruption occurs twice in the same place (as well as the significantly higher bandwidth). This can only be mitigated by sending the message even more times, incurring severe bandwidth costs.
As an alternative, party A can pass his/her long message into a hash function and generate a short hash, which he/she sends to party B, followed by the original message. Party B can then pass the received message into the hash function and compare the hashes. If either the message or the hash got corrupted, even by a single bit, during transfer, the resulting hashes will not match, thanks to second-preimage resistance (no two plaintexts should have the same hash).
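A minimal sketch of the receiver-side check (the sha256() wrapper around OpenSSL's SHA256() is an assumption made for illustration; how the message and the hash are transported is left out):

#include <openssl/sha.h>   // link with -lcrypto
#include <string>

static std::string sha256(const std::string& d) {
    unsigned char md[SHA256_DIGEST_LENGTH];
    SHA256(reinterpret_cast<const unsigned char*>(d.data()), d.size(), md);
    return std::string(reinterpret_cast<const char*>(md), SHA256_DIGEST_LENGTH);
}

// Party B's side: recompute the hash of the received message and compare it
// with the hash that was sent alongside it. Any single-bit corruption of
// either part makes the comparison fail.
bool messageIntact(const std::string& receivedMessage, const std::string& receivedHash) {
    return sha256(receivedMessage) == receivedHash;
}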
Preimage resistance would be useful in this case if the message is encrypted during transfer but the hash was taken prior to encryption (whether this is appropriate is another discussion). If the hash were reversible, an eavesdropper could intercept the hash and reverse it to find the original message.
Not all hash functions are equal; that's why it's important to consider their preimage resistance and second-preimage resistance when choosing which ones to use, which ones are secure, and which ones should be deprecated and replaced.
Have you understood preimage and second-preimage resistance? They say that the output of a hash function is unique, at least in theory, and that obtaining the original string from a hash is "computationally" infeasible. It is possible (by brute force), but it takes a lot of time and resources.
Now, the output of a hash function and the string itself are different. For example, consider a website with a dashboard. You provide your user_id and password at the time of signing up. If the website stores your password as-is in its database, it is accessible to a hacker, who can then access your account. But if a hash of your password is stored, even if he manages to hack the server, that hash is of no use to him: he cannot access your account without your password, and it is computationally infeasible to obtain your password from the hash (preimage resistance). Comparing md5(yourpassword) with the hash stored in the DB is a different matter: each time you enter your password, it is hashed with the same hash function and compared to the existing hash. By second-preimage resistance, if you enter an incorrect password, the hashes won't match.
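A sketch of that sign-up/login flow (the sha256() function is assumed to be the same OpenSSL-backed wrapper as in the previous sketch; a real system would also add a per-user salt, use a dedicated password hash such as bcrypt, and compare in constant time):

#include <string>

std::string sha256(const std::string& data);   // OpenSSL-backed wrapper, as in the previous sketch

// At sign-up: store only the hash of the password, never the password itself.
std::string hashForStorage(const std::string& password) {
    return sha256(password);
}

// At login: hash the entered password with the same function and compare it
// with the stored hash; a wrong password yields a different hash.
bool passwordMatches(const std::string& enteredPassword, const std::string& storedHash) {
    return sha256(enteredPassword) == storedHash;
}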
Another example of hashing is in version control or source control systems. To track changes in a file, hashing is used: the entire file is hashed and the hash is kept. If the file is modified, its hash changes accordingly.
These are all examples explaining what you asked.
Open Hashing (Separate Chaining):
In open hashing, keys are stored in linked lists attached to cells of a hash table.
Closed Hashing (Open Addressing):
In closed hashing, all keys are stored in the hash table itself without the use of linked lists.
I am unable to understand why they are called open, closed and separate. Can someone explain it?
The use of "closed" vs. "open" reflects whether or not we are locked in to using a certain position or data structure (this is an extremely vague description, but hopefully the rest helps).
For instance, the "open" in "open addressing" tells us the index (aka. address) at which an object will be stored in the hash table is not completely determined by its hash code. Instead, the index may vary depending on what's already in the hash table.
The "closed" in "closed hashing" refers to the fact that we never leave the hash table; every object is stored directly at an index in the hash table's internal array. Note that this is only possible by using some sort of open addressing strategy. This explains why "closed hashing" and "open addressing" are synonyms.
Contrast this with open hashing - in this strategy, none of the objects are actually stored in the hash table's array; instead once an object is hashed, it is stored in a list which is separate from the hash table's internal array. "open" refers to the freedom we get by leaving the hash table, and using a separate list. By the way, "separate list" hints at why open hashing is also known as "separate chaining".
In short, "closed" always refers to some sort of strict guarantee, like when we guarantee that objects are always stored directly within the hash table (closed hashing). Then, the opposite of "closed" is "open", so if you don't have such guarantees, the strategy is considered "open".
You have an array that is the "hash table".
In Open Hashing, each cell in the array points to a list containing the collisions. The hashing has produced the same index for all items in that linked list.
In Closed Hashing you use only one array for everything: you store the collisions in the same array. The trick is to use some smart way to jump from collision to collision until you find what you want, and to do this in a reproducible, deterministic way.
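Here is a minimal sketch of that idea with linear probing as the "smart way to jump" (integer keys, string values, and a table that never fills up are simplifying assumptions):

#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Closed hashing / open addressing with linear probing: every entry lives
// directly in the one array, and a collision is resolved by deterministically
// trying the next slot until the key or a free slot is found.
class ProbingTable {
    struct Slot { bool used = false; int key = 0; std::string value; };
    std::vector<Slot> slots;

public:
    explicit ProbingTable(std::size_t capacity) : slots(capacity) {}

    void insert(int key, const std::string& value) {
        std::size_t i = static_cast<std::size_t>(key) % slots.size();
        while (slots[i].used && slots[i].key != key)   // jump to the next slot on a collision
            i = (i + 1) % slots.size();                // (assumes the table never becomes full)
        slots[i].used = true;
        slots[i].key = key;
        slots[i].value = value;
    }

    std::optional<std::string> find(int key) const {
        std::size_t i = static_cast<std::size_t>(key) % slots.size();
        for (std::size_t probes = 0; probes < slots.size() && slots[i].used; ++probes) {
            if (slots[i].key == key) return slots[i].value;
            i = (i + 1) % slots.size();                // retrace the same deterministic jumps
        }
        return std::nullopt;
    }
};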
The name open addressing refers to the fact that the location ("address") of the element is not determined by its hash value. (This method is also called closed hashing).
In separate chaining, each bucket is independent, and has some sort of ADT (list, binary search trees, etc) of entries with the same index.
In a good hash table, each bucket has zero or one entries, because we need operations of order O(1) for insert, search, etc.
This is an example of separate chaining using C++, with a simple hash function using the mod operator (clearly, a bad hash function).
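A minimal sketch along those lines (integer keys and string values are assumptions made for illustration):

#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Open hashing / separate chaining: each bucket of the table holds a linked
// list of the entries whose hash value -- here simply key % bucket count,
// the deliberately bad mod hash -- lands on that bucket.
class ChainedTable {
    std::vector<std::list<std::pair<int, std::string>>> buckets;

    std::size_t bucketFor(int key) const {
        return static_cast<std::size_t>(key) % buckets.size();   // the "bad" mod hash
    }

public:
    explicit ChainedTable(std::size_t bucketCount) : buckets(bucketCount) {}

    void insert(int key, const std::string& value) {
        for (auto& entry : buckets[bucketFor(key)])
            if (entry.first == key) { entry.second = value; return; }   // key already present
        buckets[bucketFor(key)].emplace_back(key, value);                // chain the collision
    }

    const std::string* find(int key) const {
        for (const auto& entry : buckets[bucketFor(key)])
            if (entry.first == key) return &entry.second;
        return nullptr;
    }
};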
I am working on adding hash digest generating functionality to our code base. I wanted to use a String as a hash salt so that a pre-known key/passphrase could be prepended to whatever it was that needed to be hashed. Am I misunderstanding this concept?
A salt is a random element which is added to the input of a cryptographic function, with the goal of impacting the processing and output in a distinct way upon each invocation. The salt, as opposed to a "key", is not meant to be confidential.
One century ago, cryptographic methods for encryption or authentication were "secret". Then, with the advent of computers, people realized that keeping a method completely secret was difficult, because this meant keeping software itself confidential. Something which is regularly written to a disk, or incarnated as some dedicated hardware, has trouble being kept confidential. So the researchers split the "method" into two distinct concepts: the algorithm (which is public and becomes software and hardware) and the key (a parameter to the algorithm, present in volatile RAM only during processing). The key concentrates the secret and is pure data. When the key is stored in the brain of a human being, it is often called a "password" because humans are better at memorizing words than bits.
Then the key itself was split later on. It turned out that, for proper cryptographic security, we needed two things: a confidential parameter, and a variable parameter. Basically, reusing the same key for distinct usages tends to create trouble; it often leaks information. In some cases (especially stream ciphers, but also for hashing passwords), it leaks too much and leads to successful attacks. So there is often a need for variability, something which changes every time the cryptographic method runs. Now the good part is that most of the time, variability and secret need not be merged. That is, we can separate the confidential from the variable. So the key was split into:
the secret key, often called "the key";
a variable element, usually chosen at random, and called "salt" or "IV" (as "Initial Value") depending on the algorithm type.
Only the key needs to be secret. The variable element needs to be known by all involved parties but it can be public. This is a blessing because sharing a secret key is difficult; systems used to distribute such a secret would find it expensive to accommodate a variable part which changes every time the algorithm runs.
In the context of storing hashed passwords, the explanation above becomes the following:
"Reusing the key" means that two users happen to choose the same password. If passwords are simply hashed, then both users will get the same hash value, and this will show. Here is the leakage.
Similarly, without a salt, an attacker could use precomputed tables for fast lookup; he could also attack thousands of passwords in parallel. This still uses the same leak, only in a way which demonstrates why this leak is bad.
Salting means adding some variable data to the hash function input. That variable data is the salt. The point of the salt is that two distinct users should use, as much as possible, distinct salts. But password verifiers need to be able to recompute the same hash from the password, hence they must have access to the salt.
Since the salt must be accessible to verifiers but need not be secret, it is customary to store the salt value along with the hash value. For instance, on a Linux system, I may use this command:
openssl passwd -1 -salt "zap" "blah"
This computes a hashed password, with the hash function MD5, suitable for usage in the /etc/passwd or /etc/shadow file, for the password "blah" and the salt "zap" (here, I choose the salt explicitly, but under practical conditions it should be selected randomly). The output is then:
$1$zap$t3KZajBWMA7dVxwut6y921
in which the dollar signs serve as separators. The initial "1" identifies the hashing method (MD5). The salt is in there, in cleartext notation. The last part is the hash function output.
There is a specification (somewhere) on how the salt and password are sent as input to the hash function (at least in the glibc source code, possibly elsewhere).
Edit: in a "login-and-password" user authentication system, the "login" could act as a passable salt (two distinct users will have distinct logins) but this does not capture the situation of a given user changing his password (whether the new password is identical to an older password will leak).
You are understanding the concept perfectly. Just make sure the prepended salt is repeatable each and every time.
If I'm understanding you correctly, it sounds like you've got it right. The pseudocode for the process looks something like:
string saltedValue = plainTextValue + saltString;
// or: string saltedValue = saltString + plainTextValue;
string digest = Hash(saltedValue);
The Salt just adds another level of complexity for people trying to get at your information.
And it's even better if the salt is different for each hashed phrase, since each distinct salt requires its own rainbow table.
It's worth mentioning that even though the salt should be different for each password usage, your salt should in NO WAY be computed FROM the password itself! Doing so has the practical upshot of completely invalidating your security.