I have a bunch of large objects, structures of those, and vectors of those. I sometimes need to check the integrity of the composite objects; for that I'm using a SHA-256 "signature" of the objects.
There are at least two ways to define the signature of a composite object: compute the SHA of the concatenation of the components, or compute the SHA of the concatenation of the SHAs of the components.
That is, with the 1st method the signature of a vector Object0, Object1, Object2 would be sha(Object0 Object1 Object2), and with the 2nd method it would be sha(sha(Object0) sha(Object1) sha(Object2)).
It's a lot faster in what I'm doing to sign composite objects with the 2nd method. The question is: does this method, computing SHAs of SHAs, increase the chances of collisions? Do I sacrifice any security by hashing not the objects themselves but hashes of the objects?
What you have described is the well-known structure of a Merkle tree, or hash tree. A Git repository is basically one giant Merkle tree.
The security of such a structure is as strong as the collision resistance of the hash function of your choice.
Although I can't provide a mathematical proof for this, I'd say: No, it doesn't matter.
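
For reference, a minimal Python sketch of the two methods from the question, using hashlib (the object contents are illustrative byte strings):

import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

objects = [b"Object0", b"Object1", b"Object2"]

# Method 1: hash of the concatenation of the components.
sig1 = sha256(b"".join(objects))

# Method 2: hash of the concatenation of the components' hashes.
# Each inner digest is a fixed 32 bytes, so the outer input is unambiguous.
sig2 = sha256(b"".join(sha256(obj) for obj in objects))

print(sig1.hex())
print(sig2.hex())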
print(hash('hello world'))
Result:
6266945022561323786
What mathematical function is used by the Python hash() function?
The Python documentation makes no guarantee about the particular algorithm that is used by hash() (or more precisely, object.__hash__(self)).
The documentation only says this:
object.__hash__(self)
Called by built-in function hash() and for operations on members of hashed collections including set, frozenset, and dict. The __hash__() method should return an integer. The only required property is that objects which compare equal have the same hash value; it is advised to mix together the hash values of the components of the object that also play a part in comparison of objects by packing them into a tuple and hashing the tuple. Example:
def __hash__(self):
    return hash((self.name, self.nick, self.color))
The only thing that is guaranteed is this: "The only required property is that objects which compare equal have the same hash value".
There are a couple of desirable properties related to security, safety, and performance, but they are not required.
Every object can implement its own __hash__() however it wants, as long as it satisfies the property that two equal objects have the same hash value. And, in fact, many objects do provide their own implementations.
Even for built-in core objects such as strings, different implementations (and even different versions of the same implementation) use different algorithms. CPython even uses a different seed value every time you run it (again, for security reasons).
So, the answer is: you can't know what the algorithm is. All you know is that if a.__eq__(b) is True, then a.__hash__().__eq__(b.__hash__()) is also True.
Most importantly, there is no guarantee that a.__hash__().__eq__(b.__hash__()) being True implies a and b are equal, nor does a.__eq__(b) being False imply that a.__hash__().__eq__(b.__hash__()) is False.
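
A small sketch of that contract, assuming a made-up class Member whose fields follow the docs' example:

class Member:
    def __init__(self, name, nick, color):
        self.name, self.nick, self.color = name, nick, color

    def __eq__(self, other):
        return (isinstance(other, Member) and
                (self.name, self.nick, self.color) ==
                (other.name, other.nick, other.color))

    def __hash__(self):
        # Mix the fields that take part in comparison, as the docs advise.
        return hash((self.name, self.nick, self.color))

a = Member("Ann", "ann", "red")
b = Member("Ann", "ann", "red")
assert a == b and hash(a) == hash(b)  # equal objects, equal hashes
# The converse is not guaranteed: unequal objects may share a hash value,
# and hash("Ann") itself changes between CPython runs (hash randomization).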
I am working on a project where I have a tree of objects. This tree can be quite large and is subject to very frequent modifications (e.g. adding or removing a node, changing some properties of a node, and so on) by multiple users. Every time a user publishes an update, I need to get a hash of the tree as it is after the modification, so that the user can sign the update with their private RSA key. Therefore I obviously need the hash to be cryptographically secure. However, hashing a linear representation of the whole tree over and over, every time a user changes just one node, is unfeasible.
I thought about this strategy, but I am not sure if that will work out properly:
I add to each node a new field, which is the SHA-256 hash of all its child nodes.
The hash of a node is now computed over all of the node's fields, therefore including the hashes of its children.
Updating the tree should then be easy: every time I update a node, I recompute the hash field of its parent, then its grandparent, and so on until the root is reached, and use the hash of the root as the hash value. This reduces the complexity of the operation to O(log N) rather than O(N).
However, I know that it is never safe to trust one's own intuition about cryptography. So is this procedure secure?
This is called a hash tree or Merkle tree. It's nothing new and it is secure. It is often used to parallelize hashing as the hash methods themselves are strictly sequential in nature.
Don't concatenate data and hashes though unless you explicitly include the size of the data. It's better to only concatenate hashes.
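
That caveat is easy to demonstrate in Python: without length framing, two different child lists produce the same digest (a sketch; the 8-byte length prefix is just one possible framing):

import hashlib

def h(*parts):
    return hashlib.sha256(b"".join(parts)).hexdigest()

# Two different splits of the same bytes collide when concatenated raw:
assert h(b"ab", b"c") == h(b"a", b"bc")

# Prefixing each part with its length removes the ambiguity:
def framed(*parts):
    data = b"".join(len(p).to_bytes(8, "big") + p for p in parts)
    return hashlib.sha256(data).hexdigest()

assert framed(b"ab", b"c") != framed(b"a", b"bc")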
In my opinion your algorithm is already good enough.
Assuming that SHA-256 is secure (well, at least, its name is "Secure Hash Algorithm"), one can prove, by induction on the depth of the tree, that your algorithm is secure as well.
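
A rough Python sketch of the scheme from the question: each node hashes its own value together with its children's hashes, and an update rehashes only the path back to the root (all names are illustrative; a real version should also frame the value with its length, per the caveat above):

import hashlib

class Node:
    def __init__(self, value, parent=None):
        self.value = value            # the node's own data
        self.parent = parent
        self.children = []
        self.rehash()

    def rehash(self):
        h = hashlib.sha256()
        h.update(self.value.encode())
        for child in self.children:   # concatenate only the child hashes
            h.update(child.hash)
        self.hash = h.digest()

    def add_child(self, value):
        child = Node(value, parent=self)
        self.children.append(child)
        self.bubble_up()
        return child

    def update(self, value):
        self.value = value
        self.bubble_up()

    def bubble_up(self):
        # Rehash this node and each ancestor: O(depth), not O(N).
        node = self
        while node is not None:
            node.rehash()
            node = node.parent

root = Node("root")
leaf = root.add_child("a")
before = root.hash
leaf.update("b")                      # only the leaf-to-root path is rehashed
assert root.hash != before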
Open Hashing (Separate Chaining):
In open hashing, keys are stored in linked lists attached to cells of a hash table.
Closed Hashing (Open Addressing):
In closed hashing, all keys are stored in the hash table itself without the use of linked lists.
I am unable to understand why they are called open, closed, and separate. Can someone explain it?
The use of "closed" vs. "open" reflects whether or not we are locked in to using a certain position or data structure (this is an extremely vague description, but hopefully the rest helps).
For instance, the "open" in "open addressing" tells us the index (aka. address) at which an object will be stored in the hash table is not completely determined by its hash code. Instead, the index may vary depending on what's already in the hash table.
The "closed" in "closed hashing" refers to the fact that we never leave the hash table; every object is stored directly at an index in the hash table's internal array. Note that this is only possible by using some sort of open addressing strategy. This explains why "closed hashing" and "open addressing" are synonyms.
Contrast this with open hashing - in this strategy, none of the objects are actually stored in the hash table's array; instead once an object is hashed, it is stored in a list which is separate from the hash table's internal array. "open" refers to the freedom we get by leaving the hash table, and using a separate list. By the way, "separate list" hints at why open hashing is also known as "separate chaining".
In short, "closed" always refers to some sort of strict guarantee, like when we guarantee that objects are always stored directly within the hash table (closed hashing). Then, the opposite of "closed" is "open", so if you don't have such guarantees, the strategy is considered "open".
You have an array that is the "hash table".
In Open Hashing each cell in the array points to a list containing the collisions: the hashing has produced the same index for every item in that linked list.
In Closed Hashing you use only one array for everything. You store the collisions in the same array. The trick is to use some smart way to jump from collision to collision until you find what you want. And do this in a reproducible / deterministic way.
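
A toy Python sketch of that closed-hashing (open-addressing) idea using linear probing, where probing is the deterministic "jump from collision to collision" (names are illustrative; deletion and resizing are omitted):

class LinearProbingTable:
    EMPTY = object()

    def __init__(self, size=8):
        self.slots = [self.EMPTY] * size

    def _probe(self, key):
        # Start at the home index, then scan forward deterministically.
        i = hash(key) % len(self.slots)
        while self.slots[i] is not self.EMPTY and self.slots[i] != key:
            i = (i + 1) % len(self.slots)  # jump to the next slot
        return i

    def insert(self, key):
        self.slots[self._probe(key)] = key  # assumes the table never fills

    def contains(self, key):
        return self.slots[self._probe(key)] == key

t = LinearProbingTable()
t.insert("a"); t.insert("b")
assert t.contains("a") and not t.contains("c")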
The name open addressing refers to the fact that the location ("address") of the element is not determined by its hash value. (This method is also called closed hashing).
In separate chaining, each bucket is independent, and has some sort of ADT (list, binary search trees, etc) of entries with the same index.
In a good hash table, each bucket has zero or one entries, because we need operations of order O(1) for insert, search, etc.
This is an example of separate chaining using C++, with a simple hash function using the mod operator (clearly, a bad hash function):
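
(A minimal sketch along those lines, as an illustration rather than the original example:)

#include <iostream>
#include <list>
#include <vector>

// Separate chaining: each bucket holds a linked list of the keys that
// hash to that index. The hash is a plain modulus over non-negative
// keys, which, as noted above, is a bad hash function for real use.
class ChainedHashTable {
    std::vector<std::list<int>> buckets;

public:
    explicit ChainedHashTable(std::size_t size) : buckets(size) {}

    std::size_t index(int key) const {
        return static_cast<std::size_t>(key) % buckets.size();
    }

    void insert(int key) { buckets[index(key)].push_back(key); }

    bool contains(int key) const {
        for (int k : buckets[index(key)])
            if (k == key) return true;
        return false;
    }
};

int main() {
    ChainedHashTable table(7);
    table.insert(10);
    table.insert(17);  // 10 and 17 both map to bucket 3: a collision
    std::cout << table.contains(17) << '\n';  // prints 1
}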
At first I assumed that every collection class would receive an additional par method which would convert the collection to a fitting parallel data structure (like map returns the best collection for the element type in Scala 2.8).
Now it seems that some collection classes support a par method (e.g. Array) but others have toParSeq or toParIterable methods (e.g. List). This is a bit weird, since Array isn't used or recommended that often.
What is the reason for that? Wouldn't it be better to just have a par available on all collection classes doing the "right thing"?
If I have data which might be processed in parallel, what types should I use? The traits in scala.collection or the type of the implementation directly?
Or should I prefer Arrays now, because they seem to be cheaper to parallelize?
Lists aren't that well suited for parallel processing. The reason is that to get to the end of the list, you have to walk through every single element. Thus you may as well treat the list as an iterator, and therefore use something more generic like toParIterable.
Any collection that has a fast index is a good candidate for parallel processing. This includes anything implementing IndexedSeq, plus trees and hash tables. Array has as fast an index as you can get, so it's a fairly natural choice. You can also use things like ArrayBuffer (which has a par method returning a ParArray).
I am learning MD5. I found the term 'hash' in most descriptions of MD5. I googled 'hash', but I could not find an exact definition of 'hash' in computer programming.
Why do we use 'hash' in computer programming? What is the origin of the word?
I would say any answer must be guesswork, so I will make this a community wiki.
Hash, or hash browns, is a breakfast food made by cutting potatoes into long thin strips (smaller than french fries, and shorter, but proportionally similar), then frying the mass of strips in animal or vegetable fat until browned, stuck together, and cooked. By analogy, 'hashing' a number meant turning it into another, usually smaller, number by a method that still depends deterministically on the input number.
I believe the term "hash" was first used in the context of "hash table", which was commonly used in the 1960's on mainframe-type machines. In these cases, usually an integer value with a large range is converted to a "hash table index" which is a small integer. It is important for an efficient hash table that the "hash function" be evenly distributed, or "flat."
I don't have a citation, that is how I have understood the analogy since I heard it in the 80's. Someone must have been there when the term was first applied, though.
A hash value (or simply hash), also called a message digest, is a number generated from a string of text. The hash is substantially smaller than the text itself, and is generated by a formula in such a way that it is extremely unlikely that some other text will produce the same hash value.
You're referring to the "hash function". It is used to generate a fixed-size value for a given set of parameters, such that distinct inputs are extremely unlikely to produce the same value.
One great use of a hash is password security. Instead of saving a password in a database, you save a hash of the password.
A hash is a fixed-length sequence of byte values (00 to FF in hexadecimal) that represents a certain piece of data, be it a file or a string of bytes. It is used primarily for password storage and verification, and to test whether one file is the same as another (i.e. you hash two files, and if the hashes match, the files are almost certainly the same).
Generally, any of the SHA algorithms is preferred over MD5, due to the hash collisions that can occur when using MD5. See this Wikipedia article.
According to the Wikipedia article on hash functions, Donald Knuth in the Art of Computer Programming was able to trace the concept of hash functions back to an internal IBM memo by Hans Peter Luhn in 1953.
And just for fun, here's a scrap of overheard conversation quoted in Two Women in the Klondike: the Story of a Journey to the Gold Fields of Alaska (1899):
They'll have to keep the hash table going all day long to feed us. 'T will be a short order affair.
The hash function hashes its input together with a salt value. There is no need to keep the salt secret, but it must be stored alongside the hash so that a later attempt can be hashed the same way and still match. (Note that a hash function is not a bijection: many inputs map to the same output.)
Adding to gabriel1836's answer: one of the important properties of a hash function is that it is a one-way function, which means you cannot recover the original string from its hash value.
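
Pulling the last few answers together, a minimal Python sketch of salted password hashing: the salt need not be secret but is stored with the hash, and verification re-applies the one-way function (the iteration count and salt size are illustrative):

import hashlib, hmac, os

def hash_password(password: str) -> tuple[bytes, bytes]:
    salt = os.urandom(16)  # random salt, stored alongside the hash
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt, digest

def verify(password: str, salt: bytes, digest: bytes) -> bool:
    attempt = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(attempt, digest)  # constant-time comparison

salt, digest = hash_password("hunter2")
assert verify("hunter2", salt, digest)
assert not verify("wrong", salt, digest)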