Which hashes are needed in a Merkle tree?

When you have a Merkle tree, what is the minimal number of hashes needed to verify a change to one leaf node?
Am I correct in my understanding that, at first, only the top hash (the Merkle root, or a hash of it) is needed? And then, once a leaf is modified, you need to obtain the hashes of each row "visited" while descending to the modified leaf node?
So if a root has, say, ten children, and one grandchild is modified and I want to verify that particular grandchild, I need to obtain the new Merkle root hash, the hashes of the ten children, and the hashes of the children of the grandchild's parent.
So at every modification you always need to obtain, at the least, all the hashes from the first row? (Otherwise, how do you reconstruct and verify the Merkle root hash?)

In general, Merkle trees have not been designed to indicate which hash value is actually incorrect. Instead, they make it possible to compute an efficient hash over a large data structure: the hash of each leaf node can be calculated separately (and each branch node as well, although those are just hashes of hashes).
If you want to determine which node is invalid, you should keep the entire Merkle tree. If you have another party doing the calculations, you can indeed descend into a branch of the tree to find the altered leaf node.
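To verify a single leaf against a trusted root you do not need every hash in the tree, only the sibling hashes along the path from that leaf to the root (the "audit path"). A minimal sketch for a binary Merkle tree, with hypothetical helper names (real deployments also domain-separate leaf and interior hashes, which is omitted here for brevity):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    """Build the root of a binary Merkle tree (duplicates the last node on odd levels)."""
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def audit_path(leaves, index):
    """Sibling hashes (with position flags) from leaf `index` up to the root."""
    level = [h(leaf) for leaf in leaves]
    path = []
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        sibling = index ^ 1                        # the paired node on this level
        path.append((level[sibling], index % 2))   # flag: 1 if our node is on the right
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return path

def verify(leaf, path, root):
    """Recompute the root from one leaf plus its log2(n) sibling hashes."""
    node = h(leaf)
    for sibling, node_is_right in path:
        node = h(sibling + node) if node_is_right else h(node + sibling)
    return node == root

leaves = [b"a", b"b", b"c", b"d"]
root = merkle_root(leaves)
proof = audit_path(leaves, 2)
assert verify(b"c", proof, root)       # only 2 sibling hashes needed for 4 leaves
assert not verify(b"x", proof, root)   # a modified leaf fails verification
```

So for a tree of n leaves, verifying one leaf takes about log2(n) hashes, not a whole row per level.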

Related

SHA of SHA as signatures of composite objects

I have a bunch of large objects, and structures of those, and vectors of those. It's important to check the integrity of the composite objects sometimes; for that I'm using a SHA-256 "signature" of the objects.
There are at least two ways to define a signature of a composite object: by computing the SHA of the concatenation of the components, or by computing the SHA of the concatenation of the SHAs of the components.
That is, with the 1st method the signature of a vector Object0, Object1, Object2 would be sha(Object0 Object1 Object2), and with the 2nd method it would be sha(sha(Object0) sha(Object1) sha(Object2)).
It's a lot faster in what I'm doing to sign composite objects with the 2nd method. The question is, does this method, computing shas of shas, increase the chances of collisions? Do I sacrifice any security because I'm hashing not the objects but hashes of the objects?
What you have described is the well-known structure of a Merkle tree, or hash tree. A Git repository is basically a giant Merkle tree.
The security of such a structure is as strong as the preimage resistance of the hash function of your choice.
Although I can't provide a mathematical proof for this, I'd say: no, it doesn't matter.
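The two methods from the question can be put side by side in a short sketch (the object contents are placeholders; production Merkle constructions additionally prefix leaves and interior nodes differently to rule out second-preimage tricks):

```python
import hashlib

def sha(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

objects = [b"Object0", b"Object1", b"Object2"]

# Method 1: hash the concatenated objects.
# Any change forces re-reading and re-hashing every component.
sig1 = sha(b"".join(objects))

# Method 2: hash the concatenated per-object hashes (a one-level Merkle tree).
# Only the changed object's hash must be recomputed; the outer hash is cheap
# because it runs over fixed-size 32-byte digests.
sig2 = sha(b"".join(sha(o) for o in objects))

assert len(sig1) == len(sig2) == 32   # both are ordinary 256-bit digests
```

Because every digest in method 2 has a fixed length, the concatenation is unambiguous, which is why hashing hashes does not open up extra collision opportunities beyond those of SHA-256 itself.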

Cryptographically hashing a tree of elements

I am working on a project where I have a tree of objects. This tree can be quite large and subject to very frequent modifications (e.g. adding or removing a node, changing some properties of a node, and so on) by multiple users. Now, every time an update is published by a user, I need to be able to get some hash of the tree as it is after the user modified it, so that the user can sign the update with their private RSA key. Therefore I obviously need the hash to be cryptographically secure. However, hashing a linear representation of the whole tree over and over, every time a user changes just one node, is infeasible.
I thought about this strategy, but I am not sure if that will work out properly:
I add to each node a new field, which is the SHA-256 hash of all its children nodes.
The hash of a node is now the hash of all of the fields of the node, therefore including the hashes of its children.
Now, updating the tree should be easy: every time I update a node, I recompute the hash field of its parent, then its grandparent, and so on until the root is reached, and use the hash of the root as the hash value. This reduces the complexity of the operation to O(log N) rather than O(N).
However, I know that it is never safe to trust one's own intuition about cryptography. So is this procedure secure?
This is called a hash tree or Merkle tree. It's nothing new and it is secure. It is often used to parallelize hashing as the hash methods themselves are strictly sequential in nature.
Don't concatenate data and hashes though unless you explicitly include the size of the data. It's better to only concatenate hashes.
In my opinion your algorithm is already good enough.
Assuming that SHA-256 is secure (well, at least its name is "Secure Hash Algorithm"), one can prove, by induction on the depth of the tree, that your algorithm is secure as well.
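The O(log N) update the question describes can be sketched with explicit parent links. The `Node` class and field layout here are hypothetical; the one real constraint, echoing the answer above, is that children's contributions are fixed-size digests, so concatenating them is unambiguous:

```python
import hashlib

class Node:
    def __init__(self, value, parent=None):
        self.value = value
        self.parent = parent
        self.children = []
        self.hash = b""

    def recompute(self):
        """Hash this node's own fields plus its children's hashes."""
        m = hashlib.sha256()
        m.update(self.value.encode())
        for child in self.children:
            m.update(child.hash)   # fixed-size digests: no length ambiguity
        self.hash = m.digest()

def update(node, new_value):
    """Change one node, then rehash only the path up to the root: O(depth)."""
    node.value = new_value
    while node is not None:
        node.recompute()
        node = node.parent

# Tiny tree: root -> leaf.
root = Node("root")
leaf = Node("leaf", parent=root)
root.children.append(leaf)
leaf.recompute()
root.recompute()

before = root.hash
update(leaf, "leaf-modified")
assert root.hash != before   # the root hash reflects the single-node change
```

The signed value is then just `root.hash`, and each published update touches only the nodes on one root-to-leaf path.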

Riak, merkle tree, implementation

Could somebody explain to me how the Merkle tree implementation in riak_core works, please?
https://github.com/basho/riak_core/blob/develop/src/merkerl.erl
I don't understand what offset is, for example.
Thanks!
The tree is both a K/V lookup tree and a Merkle tree in one, more or less. The tree is defined by looking at a 160-bit SHA-1 hash. The 160 bits give 20 bytes. At the first level of the tree, we store up to 256 subtrees according to the 0th byte of the hash. At the next level, it is the 1st byte, then the 2nd, and so on.
This is called a digital tree scheme, where the digits of the hash encode the path to take in the tree. This allows us to replace data in the tree. Alternatively, look up the concept of a trie. At the same time, we sign each node's children with SHA-1 to track changes in the given subtree. When computing a diff, we can thus ignore subtrees with the same signature, as they must be equivalent by construction.
The value offset encodes how far into the 160-bit key we currently are. The offset_key/1 function offsets to the right byte in the key to look at.
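A rough sketch of the digit-tree idea in Python (this is an illustration of the scheme, not a translation of merkerl.erl; the assumption here is that each level consumes one byte of the 20-byte SHA-1 key, with `offset` counting the bytes already used):

```python
import hashlib

def insert(tree, key_hash: bytes, value, offset=0):
    """Store value under its 20-byte SHA-1 hash, one byte of the key per level."""
    if offset == len(key_hash):
        return value                     # all 20 digits consumed: this is the leaf
    digit = key_hash[offset]             # the byte at `offset` picks 1 of 256 subtrees
    subtree = tree.get(digit, {})
    tree[digit] = insert(subtree, key_hash, value, offset + 1)
    return tree

t = {}
k = hashlib.sha1(b"mykey").digest()      # 160 bits = 20 bytes
insert(t, k, "myvalue")

# Looking the value up means walking the same 20 digits back down.
node = t
for b in k:
    node = node[b]
assert node == "myvalue"
```

Replacing a value under the same key simply overwrites the same path, which is what makes the structure usable as a K/V tree; the Merkle part (hashing each node's children) is layered on top of this digit-indexed shape.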

Check if a given order is a legal postorder traversal

If you have a binary search tree with ten nodes, storing integers 0 through 9, how do we decide whether a sequence cannot represent a postorder traversal of the tree? I understand that the root has to be the last one in the sequence, but I could not arrive at any pattern. Pseudocode would be great too! (It's not homework; I'm practicing for interviews.)
As you said, you know what the root is. So you know the range of values in each subtree. If the sequence, less the root, doesn't split into two sequences, one entirely less than and one entirely greater than the root, then it isn't valid. If it does, you recursively check the two sub-traversals. If everything works out, the sequence is valid.
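The recursive check described above, written out as the requested pseudocode (runnable Python, assuming distinct keys as in the question):

```python
def is_valid_postorder(seq):
    """True iff seq could be the postorder traversal of some BST."""
    if len(seq) <= 1:
        return True
    root = seq[-1]
    # The prefix smaller than root is the left subtree's traversal...
    i = 0
    while i < len(seq) - 1 and seq[i] < root:
        i += 1
    left, right = seq[:i], seq[i:-1]
    # ...and everything after it must be strictly greater (right subtree).
    if any(x < root for x in right):
        return False                     # the split is not clean: invalid
    return is_valid_postorder(left) and is_valid_postorder(right)

assert is_valid_postorder([0, 2, 1, 4, 3])    # postorder of a BST rooted at 3
assert not is_valid_postorder([2, 4, 1, 3])   # 1 < 3 appears after 4 > 3
```

This runs in O(n^2) worst case (splitting at each level); a stack-based variant gets it down to O(n), but the recursive form mirrors the argument above most directly.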

Meaning of Open hashing and Closed hashing

Open Hashing (Separate Chaining):
In open hashing, keys are stored in linked lists attached to cells of a hash table.
Closed Hashing (Open Addressing):
In closed hashing, all keys are stored in the hash table itself without the use of linked lists.
I am unable to understand why they are called open, closed, and separate. Can someone explain it?
The use of "closed" vs. "open" reflects whether or not we are locked in to using a certain position or data structure (this is an extremely vague description, but hopefully the rest helps).
For instance, the "open" in "open addressing" tells us the index (aka. address) at which an object will be stored in the hash table is not completely determined by its hash code. Instead, the index may vary depending on what's already in the hash table.
The "closed" in "closed hashing" refers to the fact that we never leave the hash table; every object is stored directly at an index in the hash table's internal array. Note that this is only possible by using some sort of open addressing strategy. This explains why "closed hashing" and "open addressing" are synonyms.
Contrast this with open hashing - in this strategy, none of the objects are actually stored in the hash table's array; instead once an object is hashed, it is stored in a list which is separate from the hash table's internal array. "open" refers to the freedom we get by leaving the hash table, and using a separate list. By the way, "separate list" hints at why open hashing is also known as "separate chaining".
In short, "closed" always refers to some sort of strict guarantee, like when we guarantee that objects are always stored directly within the hash table (closed hashing). Then, the opposite of "closed" is "open", so if you don't have such guarantees, the strategy is considered "open".
You have an array that is the "hash table".
In open hashing, each cell in the array points to a list containing the collisions: the hash function has produced the same index for every item in that linked list.
In closed hashing you use only one array for everything. You store the collisions in the same array. The trick is to use some smart way to jump from collision to collision until you find what you want, and to do this in a reproducible, deterministic way.
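That "smart, deterministic jump" can be as simple as trying the next slot. A minimal sketch of closed hashing (open addressing) with linear probing; the class name and fixed table size are illustrative, and resizing/deletion are omitted:

```python
class ClosedHashTable:
    """Open addressing with linear probing: everything lives in one array."""
    def __init__(self, size=8):
        self.slots = [None] * size       # each slot holds None or a (key, value) pair

    def _probe(self, key):
        """Walk from the home slot until we find `key` or an empty slot."""
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)   # deterministic jump to the next slot
        return i

    def put(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def get(self, key):
        slot = self.slots[self._probe(key)]
        return slot[1] if slot else None

t = ClosedHashTable()
t.put("a", 1)
t.put("b", 2)
assert t.get("a") == 1 and t.get("b") == 2
assert t.get("missing") is None          # probe stops at the first empty slot
```

The probe sequence is reproducible because it depends only on the key and the table size, so lookups retrace exactly the path insertions took.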
The name open addressing refers to the fact that the location ("address") of the element is not determined by its hash value. (This method is also called closed hashing).
In separate chaining, each bucket is independent, and has some sort of ADT (list, binary search trees, etc) of entries with the same index.
In a good hash table, each bucket has zero or one entries, because we need O(1) operations for insert, search, etc.
This is an example of separate chaining, with a simple hash function using the mod operator (clearly, a bad hash function).
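Such a separate-chaining table can be sketched as follows (shown here in Python with the same deliberately simple mod hash; the class name is illustrative):

```python
class ChainedHashTable:
    """Open hashing: each bucket is a separate list of colliding (key, value) pairs."""
    def __init__(self, size=10):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key: int) -> int:
        return key % len(self.buckets)   # deliberately simple (bad) hash function

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))       # otherwise chain onto this bucket's list

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

t = ChainedHashTable()
t.put(7, "a")
t.put(17, "b")                            # 7 and 17 both hash to bucket 7
assert t.get(7) == "a" and t.get(17) == "b"
assert len(t.buckets[7]) == 2             # the collision chain holds both entries
```

With mod 10, keys 7 and 17 collide, and the chain in bucket 7 shows exactly the "list attached to a cell" described above.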