My code assigns UUIDv4 IDs to various entities (e.g. user accounts) in my system. While UUIDv4 collisions are considered practically impossible, I'd rather be 100% sure, so I've created an index of type UNIQUE_HASH_INDEX in my OrientDB instance.
However, I realized that the hash OrientDB generates might be smaller than my UUIDv4 value (128 bits), so I might get collision errors even though the IDs are actually different.
I am not very familiar with the underlying hash algorithm OrientDB uses, and I was wondering whether anyone can tell me if it is possible to get false-positive collisions on my UNIQUE constraint.
In the unique hash index we use a concept similar to that of a HashMap: if the hash codes of two keys are the same, we put the data inside a list of values associated with that hash code. Two different keys that happen to share a hash code are therefore stored side by side rather than rejected, so the UNIQUE constraint only fails on genuinely equal keys.
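A minimal Python sketch of the chaining idea the answer describes (the class and method names are illustrative, not OrientDB's API): colliding hash codes share a bucket, but the full keys are still stored and compared, so only a genuinely duplicate key trips the constraint.

class UniqueHashIndex:
    """Toy unique index with chaining; not OrientDB's implementation."""
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]  # one list per hash slot

    def put(self, key, value):
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for existing_key, _ in bucket:
            if existing_key == key:  # full-key comparison, not hash comparison
                raise ValueError(f"duplicate key: {key!r}")
        bucket.append((key, value))

index = UniqueHashIndex()
index.put("9b2e...uuid-a", "record 1")
index.put("4c1d...uuid-b", "record 2")  # may share a slot; still accepted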
This is the controversial line from Cracking the Coding Interview on hash tables.
Another common implementation (besides a linked list) for a hash table is to use a BST as the underlying data structure.
I know this question has been asked before... it's confusing because people give two different answers. For example:
Why implement a Hashtable with a Binary Search Tree?
The highest-voted answer in that post says that the quoted statement is talking about a hash table implemented with a binary search tree, without an underlying array. I understood that since each inserted element gets a hash value (an integer), the elements form a total order (every pair can be compared with < and >). Therefore, we can simply use a binary search tree to hold the elements of the hash table.
On the other hand, others say
Hash table - implementing with Binary Search Tree
the book is saying that we should handle collisions with a binary search tree. So there is an underlying array, and when collisions occur because multiple elements get the same hash value and are placed in the same slot in the array, that's where the BST comes in.
So each slot in the array will be a pointer to a BST, which holds elements with the same hash value.
I'm leaning towards the second post's argument because the first post does not really explain how such an implementation of a hash table can handle collisions. And I don't think it can achieve the expected O(1) insert/delete/lookup time.
But for the second post, if we have multiple elements that get the same hash value and are placed in a BST, I'm not sure how these elements are ordered (how can they be compared against each other?)
Please, help me put an end to this question once and for all!
the first post does not really explain how such implementation of a hash table can handle collisions
With a BST, you can use a hashing function that produces no duplicate keys, so there would be no collisions. The advantage here isn't speed but reduced memory consumption and better worst-case guarantees. If you're writing software for a critical real-time system, you might not be able to tolerate an O(n) resizing of your hash table.
if we have multiple elements that get the same hash value and placed in a BST, I'm not sure how these elements are ordered (how can they be compared against each other?)
Rehash with another function and order the colliding elements by that second hash value.
In the end, it all depends on what your data structure is used for (Is memory vs. speed more important? Is amortized performance vs worst-case performance more important? etc.)
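To make this concrete, here is a hedged Python sketch of a BST-backed table: nodes are ordered by a primary hash code, and a second hash (the "rehash with another function" above) breaks ties. It assumes the two hashes never collide simultaneously, which a real implementation would still have to handle.

import hashlib

def h1(key):  # primary hash: orders nodes in the tree
    return hash(key)

def h2(key):  # secondary hash: breaks ties ("rehash with another function")
    return int.from_bytes(hashlib.sha256(repr(key).encode()).digest()[:8], "big")

class Node:
    def __init__(self, code, key, value):
        self.code, self.key, self.value = code, key, value
        self.left = self.right = None

class BSTHashTable:
    """Hash 'table' backed by a BST instead of an array: O(log n) worst case."""
    def __init__(self):
        self.root = None

    def put(self, key, value):
        code = (h1(key), h2(key))  # two hashes give a total order over keys
        if self.root is None:
            self.root = Node(code, key, value)
            return
        node = self.root
        while True:
            if code < node.code:
                if node.left is None:
                    node.left = Node(code, key, value)
                    return
                node = node.left
            elif code > node.code:
                if node.right is None:
                    node.right = Node(code, key, value)
                    return
                node = node.right
            else:
                node.value = value  # same key, assuming no double collision
                return

    def get(self, key):
        code, node = (h1(key), h2(key)), self.root
        while node is not None:
            if code < node.code:
                node = node.left
            elif code > node.code:
                node = node.right
            else:
                return node.value
        raise KeyError(key)

t = BSTHashTable()
t.put("alice", 1)
t.put("bob", 2)
print(t.get("alice"))  # 1, found in O(log n) worst case, no resizing ever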
Let's say we have many data tables structured as timestamp(hash) - value pairs, where values could be for example temperatures or other kinds of varied measurement data.
To get timestamps of certain values we can build a secondary index with value(hash) - timestamp(range), but what if we want to query the value with comparison operations like GT, LT, BETWEEN to get timestamps of a range of values?
Obviously, I want to avoid using scan. The only thing I've come up with is using a dummy hash key and putting the values+timestamps into range attribute, but I'm guessing this has its own problems (better or worse compared to scan?).
Is there a better solution or can this be done with DynamoDB at all?
You need to know the HASH key before you can perform a query on the RANGE key. To get around this you would need to denormalize your table, i.e. create a duplicate with the keys reversed. Although that seems like a bit of a pain in the butt, it is one of the tradeoffs that is at times required for all the performance benefits of a key-value store.
Example for this case:
With both keys completely random, you are out of luck. Rather than setting your HASH to a dummy value, you could try using a monthly timestamp instead; that way you should always be able to work out programmatically what the hash should be. You can then set the range to a combination of both values separated by a hyphen, i.e. timestamp-value, and in the denormalized table, value-timestamp; that way you should be able to use the comparison operators with no performance hit.
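A hedged boto3 sketch of querying such a denormalized table; the table name, attribute names, and the zero-padded value encoding are assumptions for illustration, not part of the original answer.

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

# Hypothetical denormalized table: HASH = monthly bucket, RANGE = "value-timestamp".
# Values are zero-padded so lexicographic order matches numeric order.
by_value = dynamodb.Table("measurements_by_value")

response = by_value.query(
    KeyConditionExpression=Key("month").eq("2024-01")
    & Key("value_ts").between("020.5", "025.0~"),  # '~' sorts after any digit
)
for item in response["Items"]:
    print(item["value_ts"])  # e.g. "022.3-2024-01-15T12:00:00Z"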
I am creating a service for which I will use MongoDB as a storage backend.
The service will produce a hash of the user input and then see if that same hash (+ input) already exists in our dataset.
The hash will be unique yet random ( = non-incremental/sequential), so my question is:
Is it -legitimate- to use a random value for an Object ID? Example:
$object_id = new MongoId(HEX-OF-96BIT-HASH);
Or will MongoDB treat the ObjectID differently from other server-produced ones, since a "real" ObjectID also contains timestamps, machine_id, etc?
What are the pros and cons of using a 'random' value? I guess it would be statistically slower for the engine to update the index on inserts when the new _id's are not in any way incremental - am I correct on that?
Yes, it is perfectly fine to use a random value for an object id. If some value is present in the _id field of a document being stored, it is treated as the document's id.
Since the _id field is always indexed and serves as the primary key, you need to make sure that a different id is generated for each object.
There are some guidelines for optimizing user-defined object ids:
https://docs.mongodb.com/manual/core/document/#the-id-field.
While any values, including hashes, can be used for the _id field, I would recommend against using random values for two reasons:
You may need to develop a collision-management strategy in case you produce identical random values for two different objects. In the question, you imply that you'll generate IDs using some type of hash algorithm. I would not consider these values "random", as they are based on the content you are digesting with the hash. The probability of a collision is then a function of the diversity of the content and the hash algorithm. If you are using something like MD5 or SHA-1, I wouldn't worry about the algorithm, just the content you are hashing. If you need to develop a collision-management strategy then you definitely should not use random or hash-based IDs, as collision management in a clustered environment is complicated and requires additional queries.
Random values as well as hash values are purposefully meant to be dispersed on the number line. That (a) will require more of the B-tree index to be kept in memory at all times and (b) may cause variable insert performance due to B-tree rebalancing. MongoDB is optimized to handle ObjectIDs, which come in ascending order (with one-second time granularity). You're likely better off sticking with them.
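A short pymongo sketch contrasting the two approaches, as a Python equivalent of the question's PHP snippet; the database and collection names are hypothetical, and the hash-derived _id is exactly the kind of value this answer advises against.

import hashlib
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError
from bson import ObjectId

users = MongoClient().mydb.users  # hypothetical database/collection

# Default: the driver generates an ascending, timestamp-prefixed ObjectId.
users.insert_one({"name": "alice"})

# Custom: first 12 bytes (96 bits) of a content hash, wrapped in an ObjectId.
digest = hashlib.sha256(b"some user input").digest()[:12]
try:
    users.insert_one({"_id": ObjectId(digest.hex()), "name": "bob"})
except DuplicateKeyError:
    pass  # _id is unique, so a hash collision surfaces here and must be handled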
I just found out an answer to one of my questions, regarding indexing performance:
If the _id's are in a somewhat well defined order, on inserts the entire b-tree for the _id index need not be loaded. BSON ObjectIds have this property.
Source: http://www.mongodb.org/display/DOCS/Optimizing+Object+IDs
Whether it is good or bad depends upon its uniqueness. Of course the ObjectId provided by MongoDB is quite unique, so this is a good thing. So long as you can replicate that uniqueness, you should be fine.
There are no inherent risks/performance losses in using your own ID. I guess using it in string form might use up more index/storage/querying power, but here you are using it in MongoId (ObjectId) form, which should preserve the strengths of not storing it as a plain string.
How do hashes work in programming? How I think of a hash is something that allows me the ability to use some unique value to retrieve some data. Like if we have an array and I start to put things in the array, if I have another variable that keeps track of what item is in slot 0,1,2... then I have that instant ability to find an item. Is that hashing?
What is the purpose of a hash?
When should a hash be implemented? What's a hash similar to in terms of data structure?
What I think I know about hashes is that it allows us the ability to retrieve the item within O(1). Is that correct?
A hash is like a person's first name -- it's a short way of remembering a person, even though it doesn't have to be unique. If you need to find some information about someone, you can just look them up by their name, and you only need to perform other checks if two or more people have the same name.
That's the power of hashing, and just as remembering people is much easier by name than by Social Security Number, finding an object by its hash code is much easier than actually comparing the object to everything already in your collection.
Now, in this example, if you're looking someone up in a phone book by name, you'd probably find them in O(log n) time, because the names are sorted alphabetically, and because you need to do a binary search. If, however, you instead "hashed" 100 people born in the 1900s by their years of birth, then you'd only need at most 4 comparisons in the hashtable/phonebook (one per digit) to find any one year by hash, which is constant time. Then, if two people are born in the same year, you can use other information to find the person you need, and on average, if your table isn't too full (say, if you have at most 50 people for 100 different years of birth), your lookups will be constant-time.
(If your table gets more than, say, 50% full, you can always double its size, to keep the number of collisions low and hence to keep your lookups fast.)
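A toy Python sketch of that birth-year example (the names and years are invented for illustration): bucketing by year gives near-constant lookup, and same-year "collisions" are resolved with a second piece of information.

# Bucket people by year of birth; same-year collisions share a short list.
people = [("Ada", 1815), ("Alan", 1912), ("Grace", 1906), ("Edsger", 1930)]

buckets = {}
for name, year in people:
    buckets.setdefault(year, []).append(name)

def find(year, name):
    for candidate in buckets.get(year, []):  # usually a very short list
        if candidate == name:
            return candidate
    return None

print(find(1912, "Alan"))  # constant time on average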
More information:
If you've ever heard of MD5, SHA-1, or SHA-2 hashes for files, they're like the "fingerprints" of the file. While it's possible to have two files with the same hash, this is made so unlikely that, for practical purposes, it's impossible; hence, if you have the hashes of two files, you can compare the files by their fingerprints rather than by their data, which is immensely faster.
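A small Python sketch of that fingerprint idea, assuming two hypothetical files a.bin and b.bin on disk:

import hashlib

def fingerprint(path, chunk_size=1 << 20):
    """SHA-256 digest of a file, read in chunks to keep memory flat."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the fingerprints instead of the files' contents.
print(fingerprint("a.bin") == fingerprint("b.bin"))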
A hash map / dictionary is a key/value data structure that stores objects in buckets based on the value of a hash function. These keys must be unique but the hash function values (sometimes called hashcodes) aren't necessarily unique.
Like if we have an array and I start to put things in the array, if I have another variable that keeps track of what item is in slot 0,1,2... then I have that instant ability to find an item. Is that hashing?
No. A hash function is a deterministic function that always gives the same value for an object. The hash code does not change depending on where the object is stored.
What I think I know about hashes is that it allows us the ability to retrieve the item within O(1). Is that correct?
Nearly. A dictionary has O(1) complexity for lookups if there are not too many hash code collisions. However, if the hash function is poor and every object has the same hash value, then a dictionary could have O(n) performance instead.
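A quick Python demonstration of that worst case; BadKey is a contrived class whose constant __hash__ forces every key into the same bucket:

class BadKey:
    """Contrived key whose hash is constant, so every instance collides."""
    def __init__(self, n):
        self.n = n
    def __hash__(self):
        return 42  # every key lands in the same bucket
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.n == other.n

table = {BadKey(i): i for i in range(1000)}  # each insert scans the collision chain
print(table[BadKey(999)])                    # lookup degrades toward O(n)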
A hash makes lookups fast, instead of iterating over an array or tree. It makes it possible to search in O(1) time with little use of memory.
After being taught how to create a hash table in class, I don't understand when hashing data would be useful. It seems to me that all hashing does is store information in semi-random positions in an array. I want to know how any of the data can be made useful after it's stored.
My question is this: what are some examples where hashing information is beneficial? How is data retrieved in any organized manner? It seems to be placed in arbitrary positions where it would be difficult to retrieve.
Hashing can be used for many purposes:
It can be used to compare large amounts of data. You create the hashes for the data, store them, and later if you want to compare the data, you just compare the hashes (see the sketch after this list).
Hashes can be used to index data. They can be used in hash tables to point to the correct row. If you want to quickly find a record, you calculate the hash of the data and go directly to the record that the corresponding hash entry points to. (This assumes that you have a sorted list of hashes pointing to the actual records.)
They can be used in cryptographic applications like digital signatures.
Hashing can be used to generate seemingly random strings.
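As promised above, a minimal Python sketch of the first point, comparing large payloads by their digests rather than byte by byte (the payload here is a stand-in):

import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

stored = digest(b"...some large payload...")          # keep only the hash
print(digest(b"...some large payload...") == stored)  # compare hashes, not data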
Here are the applications of hash functions that Wikipedia lists:
Finding duplicate records
Finding similar records
Finding similar substrings
Geometric hashing
Now regarding hash table, here are some points to note:
If you are using a hash table, the hashes in the table should be kept in sorted order. If not, you will have to create an index on the hash column. Some implementations store the hashes separately in sorted order and point to the original records.
If somebody is storing hashes in a semi-random order, it must be either for the above reasons or because they just want to store the message digest of the information for comparing, finding duplicates, etc., and not as an index into the data.
One of the major uses of the hash tables you created in class is when you need fast O(1) lookup times. You'll have two components: keys and values.
The hash function transforms the key into a hash. That hash is a number, and specifically, it is the index of the data in the array.
So, when you need to look up Agscala's reputation in a hash table and you have used your username as the key, it takes almost no time to locate the relevant value. It simply re-hashes your username and voilà, there is the index of the data you were looking for. You didn't have to iterate over the entire array looking for that specific value.
For some reference, the Wikipedia page on hash tables is pretty good.
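A bare-bones Python sketch of that key -> hash -> array-index pipeline (it deliberately ignores collisions, which a real table would have to handle):

class TinyHashTable:
    def __init__(self, size=16):
        self.slots = [None] * size

    def _index(self, key):
        return hash(key) % len(self.slots)  # the hash becomes an array index

    def put(self, key, value):
        self.slots[self._index(key)] = (key, value)  # collisions overwrite, for brevity

    def get(self, key):
        entry = self.slots[self._index(key)]
        if entry is not None and entry[0] == key:
            return entry[1]
        raise KeyError(key)

reputations = TinyHashTable()
reputations.put("Agscala", 505)
print(reputations.get("Agscala"))  # found without iterating over the whole array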
There are a couple of typical reasons to hash data. In the example you reference, you would hash the data and use that as a key to extract the actual value of the hashed item. The hashed data is often referred to as a key and it references a bucket where the actual, non-hashed value can be found.
The other typical reason is to create a signature of the hashed value so that you can check if the value has been changed by someone else. Since it's usually rare, depending on the algorithm used, to have two items hash to the same value, you can rehash a value and compare it to the saved hash value to check if the item is still the same.
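A tiny Python sketch of that second use, saving a digest and later recomputing it to detect changes:

import hashlib

value = b"important configuration"
saved = hashlib.sha256(value).hexdigest()  # stored alongside the data

# ... later: recompute and compare to check whether the value changed.
changed = hashlib.sha256(value).hexdigest() != saved
print(changed)  # False while the value is untouched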
Hashing is a technique useful for fast key lookup. It allows one to more efficiently find a value rather than scanning a list from beginning to end.
Have you ever used a dictionary or a set? They're typically implemented in terms of a hashtable because the value associated with a key can be found quickly.
{
'WA': 'Washington',
'WV': 'West Virginia',
'WY': 'Wyoming'
}
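Assuming the literal above is bound to a name such as states (a name chosen here for illustration), a lookup hashes the key once instead of scanning the entries:

states = {
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WY': 'Wyoming'
}
print(states['WA'])    # 'Washington', found by hashing 'WA'
print('OR' in states)  # membership testing is also a hashed lookup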