Riak, merkle tree, implementation - hash

Could somebody explain to me how merkle tree implementation works in riak-core, please?
https://github.com/basho/riak_core/blob/develop/src/merkerl.erl
I don't understand what offset is, for example.
Thanks!

The tree is both a K/V lookup tree and a Merkle tree in one, more or less. The tree is defined by looking at a 160-bit SHA-1 hash. The 160 bits give 20 bytes. At the first level of the tree, we store up to 256 subtrees according to the 0th byte of the hash. At the next level, it is the 1st byte, then the 2nd and so on.
This is called a digital tree scheme, where the digits in the hash encode the path to take in the tree. This allows us to replace data in the tree. Alternatively, look up the concept of a trie. At the same time, we sign each node's children with SHA-1 to track a change in the given subtree. When running to find the diff, we can thus ignore subtrees with the same signature, as they must be equivalent by construction.
The value offset encodes how far into the 160-bit key we currently are. The offset_key/1 function offsets to the right byte in the key to look at.
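The byte-per-level scheme described above can be sketched in Python. This is not the Erlang code from merkerl.erl: the names `offset_key` and `tree_path` are illustrative, and the sketch assumes the offset counts bits in steps of 8, so that offset 0 selects byte 0 and offset 8 selects byte 1.

```python
import hashlib

def offset_key(offset_bits: int, key: bytes) -> int:
    """Return the byte of `key` that starts at bit position `offset_bits`."""
    # Assumption: offsets are multiples of 8 that stay inside the key.
    if offset_bits % 8 != 0 or not 0 <= offset_bits < len(key) * 8:
        raise ValueError("offset must be a multiple of 8 inside the key")
    return key[offset_bits // 8]

def tree_path(key: bytes) -> list:
    """Child index (0..255) taken at each level, from the root downwards."""
    return [offset_key(level * 8, key) for level in range(len(key))]

key = hashlib.sha1(b"some user key").digest()  # 20 bytes = 160 bits
path = tree_path(key)
print(len(path), path[:3])  # 20 levels at most; first three branch choices
```

Each level of the tree consumes one byte of the hash, so a lookup never needs more than 20 steps for a SHA-1 key.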

Related

Which hashes are needed in a Merkle tree?

When you have a Merkle tree, what is the minimal number of hashes needed to verify a change to one leaf node?
Am I correct in my understanding that, at first, only the top hash (the Merkle tree root or hash of the Merkle tree root) is needed? And then once a leaf is modified, you need to obtain the hashes of each row "visited" while descending to the leaf node that got modified?
So if a root has, say, ten children and one grandchild is modified and I want to verify that particular grandchild, I need to obtain the new Merkle root hash, the hashes of the ten children, and the hashes of the children of the grandchild's parent.
So at every modification you always need to obtain, at least, all the hashes from the first row? (otherwise how do you reconstruct and verify the merkle root hash?)
In general, Merkle trees have not been designed to indicate which hash value is actually incorrect. Instead, they make it possible to obtain an efficient hash over large data structures. The hash of each leaf node can be calculated separately (and, of course, each branch as well, although that's just hashes).
If you want to determine which node is invalid, you should keep the entire Merkle tree. If you have another party doing the calculations, you can indeed descend into a branch of the tree to find the altered leaf node.
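To make the "which hashes are needed" question concrete, here is a minimal sketch for a binary Merkle tree, where verifying one leaf needs the root plus one sibling hash per level. In a tree with k children per node, the same idea needs the k-1 sibling hashes at each level on the path. The function names, the use of SHA-256, and the choice to duplicate the last hash on odd-sized levels are all assumptions made for illustration, not part of any particular library.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(leaves):
    """Return every level of the tree: leaf hashes first, root level last."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Pad odd-sized levels by repeating the last hash (a simplification).
            level = level + [level[-1]]
            levels[-1] = level
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def proof_for(levels, index):
    """Collect the sibling hash at each level on the path from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))  # (hash, sibling is on the left)
        index //= 2
    return proof

def verify(leaf, proof, root):
    """Recompute the root from one leaf and its sibling hashes."""
    acc = h(leaf)
    for sibling, sibling_is_left in proof:
        acc = h(sibling + acc) if sibling_is_left else h(acc + sibling)
    return acc == root

leaves = [bytes([i]) * 4 for i in range(8)]
levels = build_levels(leaves)
root = levels[-1][0]
proof = proof_for(levels, 5)
print(len(proof), verify(leaves[5], proof, root))
```

For 8 leaves the proof holds 3 hashes, one per level below the root, rather than every hash in the first row.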

Cryptographically hashing a tree of elements

I am working on a project where I have a tree of objects. This tree of objects can be quite large, and can be subject to very frequent modifications (e.g. adding or removing a node, changing some properties of a node, and so on) by several users. Now, every time an update is published by a user, I need to be able to get some hash of the tree as it is after the user modified it, so that the user can sign the update with his private RSA key. Therefore I obviously need the hash to be cryptographically secure. However, hashing a linear representation of the whole tree over and over every time a user changes just one node is infeasible.
I thought about this strategy, but I am not sure if that will work out properly:
I add to each node a new field, which is the SHA-256 hash of all its child nodes.
The hash of a node is now the hash of all the fields of the node, and therefore includes the hash of its children.
Now, updating the tree should be easy: every time I update a node, I change the hash field of its parent, then its grandparent and so on until the root is reached, and use the hash of the root as the hash value. For a reasonably balanced tree, this would reduce the complexity of this operation to O(log N) rather than O(N).
However, I know that it is never safe to trust one's own intuition about cryptography. So is this procedure secure?
This is called a hash tree or Merkle tree. It's nothing new and it is secure. It is often used to parallelize hashing as the hash methods themselves are strictly sequential in nature.
Don't concatenate data and hashes though unless you explicitly include the size of the data. It's better to only concatenate hashes.
In my opinion your algorithm is already good enough.
Assuming that SHA-256 is secure (well, at least, its name is "Secure Hash Algorithm"), one can prove, by induction on the depth of the tree, that your algorithm is secure as well.
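A minimal sketch of the update strategy discussed in this thread, assuming a simple in-memory tree with parent pointers. The `Node` class and its methods are hypothetical names. Following the advice above, the node's own data is hashed first, so only fixed-size hashes are concatenated.

```python
import hashlib

class Node:
    def __init__(self, data: bytes, parent=None):
        self.data = data
        self.parent = parent
        self.children = []
        self.digest = b""
        if parent is not None:
            parent.children.append(self)
        self._rehash_upwards()

    def _compute(self) -> bytes:
        # Hash the data on its own, then combine only fixed-size digests.
        parts = [hashlib.sha256(self.data).digest()]
        parts += [child.digest for child in self.children]
        return hashlib.sha256(b"".join(parts)).digest()

    def _rehash_upwards(self) -> None:
        # Only this node and its ancestors are recomputed.
        node = self
        while node is not None:
            node.digest = node._compute()
            node = node.parent

    def set_data(self, data: bytes) -> None:
        self.data = data
        self._rehash_upwards()

def full_hash(node: Node) -> bytes:
    """Recompute a subtree hash from scratch, for comparison only."""
    parts = [hashlib.sha256(node.data).digest()]
    parts += [full_hash(child) for child in node.children]
    return hashlib.sha256(b"".join(parts)).digest()

root = Node(b"root")
a = Node(b"a", root)
b = Node(b"b", root)
c = Node(b"c", a)
before = root.digest
c.set_data(b"c, edited")
print(root.digest != before, root.digest == full_hash(root))
```

Changing `c` touches three digests (`c`, `a`, `root`) and leaves `b` alone, which is where the saving over rehashing the whole tree comes from.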

What to use (best/good practice) for the secret key in HMAC solution?

I am implementing a HMAC-like solution based upon specifications provided to me by another company. The hashing parameters and use of the secret key is not an issue, and neither is the distribution of the key itself, since we are in close contact and close geographical location.
However - what is best practice for the actual secret key value?
Since both companies are working together, it would seem that
c9ac56dd392a3206fc80145406517d02 generated with a Rijndael algorithm and
Daisy Daisy give me your answer do
are both pretty much equally secure (in this context) as a secret key used to add to the hash?
Citing Wikipedia page on HMAC:
The cryptographic strength of the HMAC depends upon the cryptographic strength of the underlying hash function, the size of its hash output, and on the size and quality of the key.
This means that a completely random key, where every bit is randomly generated, is far better than a string of characters.
A key should be at least as long as the hash output; making it longer than that adds little strength. If the key is shorter than the hash's block size, it is padded with zeroes (which are not random). If the key is longer than the block size, it is hashed first, which shortens it to the size of the hash output.
Using visible characters as a key makes the key easier to guess, as there are far fewer combinations of visible characters than if we allow every possible combination of bits. For example:
There are 95 visible characters in ASCII (out of 256 possible byte values). If the key is 16 bytes long (the output size of MD5, as used in HMAC-MD5), then there are 95^16 ~= 4.4*10^31 combinations. But 16 arbitrary bytes allow 256^16 ~= 3.4*10^38 possibilities. An attacker who knows that the key consists only of visible ASCII characters needs around 10 000 000 times less time than if he had to consider every possible combination of bits.
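The arithmetic in this example can be checked with a few lines of Python. The variable names are illustrative; the only inputs are the 95 visible ASCII characters and the 16-byte key length used above.

```python
import math

KEY_LENGTH = 16  # bytes, as in the example above

printable_keys = 95 ** KEY_LENGTH   # every byte is one of 95 visible ASCII characters
all_keys = 256 ** KEY_LENGTH        # every byte is arbitrary
ratio = all_keys / printable_keys

print(f"printable: {printable_keys:.1e}")
print(f"arbitrary: {all_keys:.1e}")
print(f"ratio:     {ratio:.1e}")

# The same comparison expressed as bits of key space.
print(f"bits (printable): {KEY_LENGTH * math.log2(95):.1f}")
print(f"bits (arbitrary): {KEY_LENGTH * 8}")
```

The ratio comes out a little under 10^7, matching the "around 10 000 000 times" estimate.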
To summarize, I recommend using a cryptographically secure pseudo-random number generator to generate secret keys instead of coming up with your own keys.
Edit:
As martinstoeckli suggested, if you have to, you can use a key derivation function to generate a key of a specified byte length from a text password. This is much safer than converting plain text to bytes and using these bytes as a key directly. Nevertheless, there is nothing more secure than a key consisting of random bytes.
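Both recommendations can be sketched with Python's standard library. The function names, the 32-byte key length, the choice of SHA-256, and the PBKDF2 iteration count are assumptions for illustration; the specification agreed between the two companies may dictate different values.

```python
import hashlib
import hmac
import secrets

def new_random_key(n_bytes: int = 32) -> bytes:
    """Preferred option: a key drawn from the OS's secure random source."""
    return secrets.token_bytes(n_bytes)

def key_from_passphrase(passphrase: str, salt: bytes, iterations: int = 200_000) -> bytes:
    """Fallback: stretch a text passphrase into a 32-byte key with PBKDF2."""
    return hashlib.pbkdf2_hmac("sha256", passphrase.encode("utf-8"), salt, iterations)

def sign(key: bytes, message: bytes) -> bytes:
    return hmac.new(key, message, hashlib.sha256).digest()

def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    # compare_digest avoids leaking where the first mismatch occurs.
    return hmac.compare_digest(sign(key, message), tag)

key = new_random_key()
tag = sign(key, b"order=42")
print(verify(key, b"order=42", tag), verify(key, b"order=43", tag))

salt = secrets.token_bytes(16)  # both parties must use the same salt
derived = key_from_passphrase("Daisy Daisy give me your answer do", salt)
print(len(derived))
```

A derived key is only as unpredictable as the passphrase behind it, so the random key remains the stronger choice.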

Check if a given order is a legal postorder traversal

If you have a binary search tree with ten nodes, storing integers 0 through 9, how do we decide if a sequence cannot represent a postorder traversal of the tree? I understand that the root has to be the last one in the sequence, but I could not arrive at any pattern. A pseudocode would be great too! (It's not homework, practicing for interviews)
As you said, you know what the root is. So you know the range of values in each sub tree. If the sequence, less the root, doesn't split into two sequences, one less than and one greater than the root then it isn't valid. If it does, you need to recursively check the two sub-traversals. If everything works, then it is valid.
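The recursive check described in this answer might look like the following sketch. It assumes all values are distinct, and the function name is illustrative.

```python
def is_valid_postorder(seq) -> bool:
    """True if `seq` could be the postorder traversal of some binary search tree."""
    if len(seq) <= 1:
        return True
    root = seq[-1]
    rest = seq[:-1]
    # The left subtree comes first: the leading run of values below the root.
    split = 0
    while split < len(rest) and rest[split] < root:
        split += 1
    left, right = rest[:split], rest[split:]
    # Everything after that run must belong to the right subtree.
    if any(value < root for value in right):
        return False
    return is_valid_postorder(left) and is_valid_postorder(right)

print(is_valid_postorder([0, 1, 4, 3, 2, 6, 7, 9, 8, 5]))  # True
print(is_valid_postorder([0, 1, 6, 3, 2, 4, 7, 9, 8, 5]))  # False
```

In the second sequence the root is 5, but 3 appears after 6, so the values cannot be split into a "less than 5" part followed by a "greater than 5" part.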

What is hash exactly?

I am learning MD5. I found the term 'hash' in most descriptions of MD5. I googled 'hash', but I could not find an exact definition of 'hash' in computer programming.
Why are we using 'hash' in computer programming? What is the origin of the word?
I would say any answer must be guesswork, so I will make this a community wiki.
Hash, or hash browns, is breakfast food made from cutting potatoes into long thin strips (smaller than french fries, and shorter, but proportionally similar), then frying the mass of strips in animal or vegetable fat until browned, stuck together, and cooked. By analogy, 'hashing' a number meant turning it into another, usually smaller, number using a method which still depends deterministically on the input number.
I believe the term "hash" was first used in the context of "hash table", which was commonly used in the 1960's on mainframe-type machines. In these cases, usually an integer value with a large range is converted to a "hash table index" which is a small integer. It is important for an efficient hash table that the "hash function" be evenly distributed, or "flat."
I don't have a citation, that is how I have understood the analogy since I heard it in the 80's. Someone must have been there when the term was first applied, though.
A hash value (or simply hash), also called a message digest, is a number generated from a string of text. The hash is substantially smaller than the text itself, and is generated by a formula in such a way that it is extremely unlikely that some other text will produce the same hash value.
You're referring to the "hash function". It is used to generate a value that is, for practical purposes, unique for a given set of parameters.
One great use of a hash is password security. Instead of saving a password in a database, you save a hash of the password.
A hash is supposed to be a unique combination of values from 00 to FF (hexadecimal) that represents a certain piece of data, be it a file or a string of bytes. It is used primarily for password storage and verification, and to test if a file is the same as another file (i.e. you hash two files; if the hashes match, they are almost certainly the same file).
Generally, any of the SHA algorithms are preferred over MD5, due to hash collisions that can occur when using it. See this Wikipedia article.
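As a small illustration of comparing data by hash, here is a sketch using SHA-256 from Python's `hashlib`. The helper name `digest_hex` is hypothetical, and the byte strings stand in for file contents.

```python
import hashlib

def digest_hex(data: bytes) -> str:
    """Hex string of the SHA-256 hash: pairs of characters from 00 to ff."""
    return hashlib.sha256(data).hexdigest()

first = digest_hex(b"hello world")
second = digest_hex(b"hello world")
third = digest_hex(b"hello world!")

print(len(first))       # always 64 hex characters, whatever the input size
print(first == second)  # same input, same hash
print(first == third)   # a one-character change gives a different hash
```

Because the output has a fixed length, comparing two hashes is cheap even when the underlying files are large.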
According to the Wikipedia article on hash functions, Donald Knuth in the Art of Computer Programming was able to trace the concept of hash functions back to an internal IBM memo by Hans Peter Luhn in 1953.
And just for fun, here's a scrap of overheard conversation quoted in Two Women in the Klondike: the Story of a Journey to the Gold Fields of Alaska (1899):
They'll have to keep the hash table going all day long to feed us. 'T will be a short order affair.
A hash function maps its input to a value. It can also take a salt value, and I have seen no proof that the salt needs to be stored; still, everybody says we must store the salt alongside the hash so that a later match against a new input still works. A mathematically related concept is a bijection.
Adding to gabriel1836's answer, one of the important properties of a hash function is that it is a one-way function, which means you cannot feasibly recover the original string from its hash value.