Both hashing and indexing are used to partition data according to some pre-defined formula, but I am unable to understand the key difference between the two.
In hashing we divide the data on the basis of some key-value pair, and similarly in indexing we divide the data on some pre-defined values.
Can anyone please explain the difference between hashing and indexing, and how to decide whether to use hashing or indexing?
Hashing is a specific case of indexing:
Indexing is a general name for a process of partitioning intended to speed up data look-ups. Indexing can partition the data set based on the value of a field or a combination of fields. It can also partition the data set based on the value of a function, called a hash function, computed from the data in a field or a combination of fields. In this specific case, indexing is called data hashing.
I did some research on the web:
What is indexing?
Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a table creates another data structure which holds the field value and a pointer to the record it relates to. This index structure is then sorted, allowing binary searches to be performed on it.
What is hashing?
Hashing is the transformation of a string of characters into a usually shorter fixed-length value or key that represents the original string. Hashing is used to index and retrieve items in a database because it is faster to find the item using the shorter hashed key than to find it using the original value.
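To make the contrast concrete, here is a minimal, self-contained Python sketch (the record data and names are mine, purely illustrative): an index is a separate sorted structure searched with binary search, while hashing jumps straight to a bucket.

import bisect

records = [("banana", 7), ("apple", 3), ("cherry", 9)]  # (key, value) rows

# Index: a separate, sorted structure of (field value, row position)
index = sorted((key, pos) for pos, (key, _) in enumerate(records))
index_keys = [k for k, _ in index]

def find_by_index(key):
    i = bisect.bisect_left(index_keys, key)  # binary search over the sorted index
    if i < len(index_keys) and index_keys[i] == key:
        return records[index[i][1]]
    return None

# Hashing: the key is transformed into a bucket position; dict does this internally
by_hash = {key: pos for pos, (key, _) in enumerate(records)}

def find_by_hash(key):
    pos = by_hash.get(key)  # average O(1), no sorting involved
    return records[pos] if pos is not None else None

print(find_by_index("cherry"))  # ('cherry', 9) via binary search
print(find_by_hash("cherry"))   # ('cherry', 9) via hashing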
Related
From the MongoDB docs: MongoDB indexes use a B-tree data structure.
But does this also apply to compound indexes? If so, how are they actually implemented?
PS: The only way I can imagine it is as a B-tree in which each node holds not a single value but as many values as there are indexed fields, stored, for example, in an array (i.e., as if two or more binary trees, one per field, had been merged).
I cannot say anything with 100% certainty about the actual implementation, but I do think of compound indices as being implemented as B-trees as well, with, as you describe, nodes that hold sequences of values (respecting the index key order) and, very importantly, using a lexicographic order on the key values.
Having said that, in order to save some space, I would also consider a B-tree that only uses the values associated with the first key near the top of the tree, then the values of the second key a bit further down, and so on, with the values of the last key close to the leaves. This is nothing more than an optimization that makes the boundaries of the nodes coincide with the individual keys.
The advantage of implementing a compound index this way (with or without the said optimization) is that, if you have an index on several keys, let's say A then B then C, then you get an index on A alone for free, and an index on A then B for free: indeed, all values for any prefix of the sequence of keys are always grouped together in such a B-tree, because of the lexicographic order.
Since MongoDB documents that such is the case, I think of the implementation of compound indices this way when I use MongoDB.
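Concretely, the prefix property looks like this with pymongo (the collection and field names are invented for illustration):

from pymongo import MongoClient, ASCENDING

coll = MongoClient()["shop"]["orders"]  # hypothetical database and collection

# Compound index on A, then B, then C (keys compared in lexicographic order)
coll.create_index([("A", ASCENDING), ("B", ASCENDING), ("C", ASCENDING)])

# All of these can use the same index, thanks to the prefix property:
coll.find({"A": 1})                           # prefix: A alone
coll.find({"A": 1, "B": 2})                   # prefix: A then B
coll.find({"A": 1, "B": 2, "C": {"$gt": 5}})  # full key, with a range on C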
Furthermore, the documentation specifies that hashed index fields are forbidden on compound indices. This is one more clue, as B-trees implement range indices.
Also, I would expect MongoDB's hashed indices to be implemented with a hash table rather than a B-tree, as it would be less efficient to use a B-tree for point queries only (logarithmic lookup vs. O(1)).
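For comparison, this is how a hashed index is declared with pymongo (same hypothetical collection); it serves equality matches but not range queries:

import pymongo
from pymongo import MongoClient

coll = MongoClient()["shop"]["orders"]

coll.create_index([("A", pymongo.HASHED)])  # hashed index on a single field
coll.find({"A": 1})                         # equality match: the hashed index applies
coll.find({"A": {"$gt": 1}})                # range query: the hashed index cannot help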
Which is more likely to collide: a Mongo ObjectId or an MD5 hash? I am building a website and looking for a way to index my products.
Thanks in advance.
MongoDB's ObjectId is unlikely to collide. It contains a counter, a random number, a process id, etc. MD5 hashes depend on the value of the input: if you pass two inputs that have the same value, the hashes will be the same.
I would need to know more about how you hash your products. If you are sure your product values won't be the same, then you can use either. But I would use ObjectId, because then you won't need to worry about product values and hashing at all. The size of an ObjectId is also smaller than the size of an MD5 hash, which is better for indexing.
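As a quick illustration of the two (using the bson package that ships with pymongo, and hashlib; the product bytes are made up):

import hashlib
from bson import ObjectId

oid = ObjectId()                        # 12 bytes: timestamp, random value, counter
print(oid, len(oid.binary))             # e.g. 57d1a9...  12

product = b"ACME anvil, 10 kg, black"
digest = hashlib.md5(product).digest()  # 16 bytes; identical input -> identical hash
print(digest.hex(), len(digest))        # ...  16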
If your product model has some relation to real-world products, then there are existing ways to index them: pick article numbers or EAN/IAN codes. They can be an addition to the natural auto-increment ids of MySQL. UUIDs have pros (mainly for distributed databases) and cons - read https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439
Let's say we have many data tables structured as timestamp(hash) - value pairs, where values could be for example temperatures or other kinds of varied measurement data.
To get timestamps of certain values we can build a secondary index with value(hash) - timestamp(range), but what if we want to query the value with comparison operations like GT, LT, BETWEEN to get timestamps of a range of values?
Obviously, I want to avoid using scan. The only thing I've come up with is using a dummy hash key and putting the values+timestamps into range attribute, but I'm guessing this has its own problems (better or worse compared to scan?).
Is there a better solution or can this be done with DynamoDB at all?
You need to know the HASH key before you can perform a query on the RANGE key. To get around this you would need to denormalize your table, i.e. create a duplicate with the keys reversed. Although that seems like a bit of a pain in the butt, it is one of the trade-offs that is at times required for all the performance benefits of a key-value store.
Example for this case:
With both keys completely random, you are out of luck. Rather than setting your HASH to a dummy value, you could try using a monthly timestamp instead; that way you should always be able to work out programmatically what the hash should be. You can then also set the range as a combination of both values separated by a hyphen, i.e. timestamp-value, and in the denormalized table, value-timestamp. That way you should be able to use the comparison operators with no performance hit.
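Here is a sketch of that layout with boto3 (table and attribute names are invented, both tables are assumed to already exist, and values are zero-padded strings so that lexicographic order matches numeric order):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb")

# Main table: HASH = month bucket, RANGE = "timestamp-value"
by_time = dynamodb.Table("MeasurementsByTime")
# Denormalized copy: HASH = month bucket, RANGE = "value-timestamp"
by_value = dynamodb.Table("MeasurementsByValue")

# All timestamps in July 2016 whose value lies between 0020 and 0030
resp = by_value.query(
    KeyConditionExpression=Key("month").eq("2016-07")
    & Key("value_ts").between("0020", "0031")  # "0031" caps every "0030-..." entry
)
for item in resp["Items"]:
    print(item["value_ts"])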
I am creating a service for which I will use MongoDB as a storage backend.
The service will produce a hash of the user input and then see if that same hash (+ input) already exists in our dataset.
The hash will be unique yet random ( = non-incremental/sequential), so my question is:
Is it -legitimate- to use a random value for an Object ID? Example:
$object_id = new MongoId(HEX-OF-96BIT-HASH);
Or will MongoDB treat the ObjectID differently from other server-produced ones, since a "real" ObjectID also contains timestamps, machine_id, etc?
What are the pros and cons of using a 'random' value? I guess it would be statistically slower for the engine to update the index on inserts when the new _id's are not in any way incremental - am I correct on that?
Yes, it is perfectly fine to use a random value for an object id. Whatever value is present in the _id field of a document being stored is used as the primary key.
Since the _id field is always indexed and serves as the primary key, you need to make sure that a different id is generated for each object.
There are some guidelines for optimizing user-defined object ids:
https://docs.mongodb.com/manual/core/document/#the-id-field.
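For example, with pymongo (a sketch; the SHA-256-truncation scheme is just one way to obtain the 24 hex characters an ObjectId expects):

import hashlib
from pymongo import MongoClient
from bson import ObjectId

coll = MongoClient()["mydb"]["items"]  # hypothetical database and collection

user_input = b"some user input"
hex24 = hashlib.sha256(user_input).hexdigest()[:24]  # 96 bits as 24 hex chars

coll.insert_one({"_id": ObjectId(hex24), "payload": "..."})
# Inserting the same input again raises DuplicateKeyError, because _id is
# always unique -- one way to detect that the input already exists.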
While any values, including hashes, can be used for the _id field, I would recommend against using random values for two reasons:
You may need to develop a collision-management strategy in the case where you produce identical values for two different objects. In the question, you imply that you'll generate IDs using some type of hash algorithm. I would not consider these values "random", as they are based on the content you are digesting with the hash. The probability of a collision is then a function of the diversity of the content and the hash algorithm. If you are using something like MD5 or SHA-1, I wouldn't worry about the algorithm, just the content you are hashing. If you need to develop a collision-management strategy then you definitely should not use random or hash-based IDs, as collision management in a clustered environment is complicated and requires additional queries.
Random values as well as hash values are purposefully meant to be dispersed on the number line. That (a) will require more of the B-tree index to be kept in memory at all times and (b) may cause variable insert performance due to B-tree rebalancing. MongoDB is optimized to handle ObjectIDs, which come in ascending order (with one second time granularity). You're likely better off sticking with them.
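You can observe the ascending property directly (bson ships with pymongo; within a single process the leading timestamp plus the counter keep ids monotonic):

from bson import ObjectId

ids = [ObjectId() for _ in range(5)]
print(ids == sorted(ids))  # True: ids generated in one process come out ascending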
I just found out an answer to one of my questions, regarding indexing performance:
If the _id's are in a somewhat well defined order, on inserts the entire b-tree for the _id index need not be loaded. BSON ObjectIds have this property.
Source: http://www.mongodb.org/display/DOCS/Optimizing+Object+IDs
Whether it is good or bad depends upon its uniqueness. Of course the ObjectId provided by MongoDB is quite unique, so this is a good thing. As long as you can replicate that uniqueness, you should be fine.
There are no inherent risks or performance losses in using your own ID. Using it in string form might consume more index, storage, and query capacity, but here you are using it in MongoId (ObjectId) form, which preserves those advantages over storing it as a plain string.
After being taught how to create a hash table in class, I don't understand when hashing data would be useful. It seems to me that all hashing does is store information in semi-random positions in an array. I want to know how any of the data can be made useful after it's stored.
My question is this: what are some examples where hashing information is beneficial? How is data retrieved in any organized manner? It seems to be placed in arbitrary positions where it would be difficult to retrieve.
Hashing can be used for many purposes:
It can be used to compare large amounts of data. You create the hashes for the data, store them, and later when you want to compare the data you just compare the hashes (see the sketch after this list).
Hashes can be used to index data. They can be used in hash tables to point to the correct row. If you want to quickly find a record, you calculate the hash of the data and go directly to the record that the corresponding hash entry points to. (This assumes that you have a sorted list of hashes that point to the actual records.)
They can be used in cryptographic applications like digital signatures.
Hashing can be used to generate seemingly random strings.
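Here is the comparison sketch referred to above, using hashlib (the file names are hypothetical): instead of comparing two large files byte by byte, compare their digests.

import hashlib

def digest(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):  # stream, don't load it all
            h.update(chunk)
    return h.hexdigest()

# Equal digests mean equal files, up to SHA-256's negligible collision probability
print(digest("a.bin") == digest("b.bin"))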
Here are the applications of hash functions that Wikipedia lists:
Finding duplicate records
Finding similar records
Finding similar substrings
Geometric hashing
Now regarding hash table, here are some points to note:
If you are storing hashes in a database table, keep them in sorted order; otherwise you will have to create an index on the hash column. Some implementations store the hashes separately, in sorted order, pointing back to the original records.
If somebody is storing hashes in a semi-random order, it must be either for the above reasons or because they just want to store the message digest of the information for comparison, duplicate detection, etc., and not as an index into the data.
One of the major uses of the hash tables you created in class is when you need fast O(1) lookup times. You'll have two components: keys and values.
The hash function transforms the key into a hash. That hash is a number, and specifically, it is the index of the data in the array.
So, when you need to look Agscala's reputation up in a hash table and you have used the username as the key, it takes almost no time to locate the relevant value. The table simply re-hashes the username and voilà, there is the index of the data you were looking for. You didn't have to iterate over the entire array looking for that specific value.
For reference, the Wikipedia page on hash tables is pretty good.
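A toy version of the class exercise makes the mechanics explicit (modulo bucketing with chaining is the simplest possible scheme; the reputation number is made up):

class HashTable:
    def __init__(self, size=16):
        self.buckets = [[] for _ in range(size)]  # chaining handles collisions

    def _index(self, key):
        return hash(key) % len(self.buckets)      # key -> array position

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)          # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

table = HashTable()
table.put("Agscala", 1234)
print(table.get("Agscala"))  # re-hash the key and jump straight to its bucket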
There are a couple of typical reasons to hash data. In the example you reference, you would hash the data and use that as a key to extract the actual value of the hashed item. The hashed data is often referred to as a key and it references a bucket where the actual, non-hashed value can be found.
The other typical reason is to create a signature of the hashed value so that you can check if the value has been changed by someone else. Since it's usually rare, depending on the algorithm used, to have two items hash to the same value, you can rehash a value and compare it to the saved hash value to check if the item is still the same.
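For instance, a minimal tamper check with hashlib (the messages are arbitrary):

import hashlib

message = b"transfer 100 credits to alice"
saved = hashlib.sha256(message).hexdigest()  # signature stored alongside the data

tampered = b"transfer 900 credits to alice"
print(hashlib.sha256(tampered).hexdigest() == saved)  # False: the data changed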
Hashing is a technique useful for fast key lookup. It allows one to more efficiently find a value rather than scanning a list from beginning to end.
Have you ever used a dictionary or a set? They're typically implemented in terms of a hashtable because the value associated with a key can be found quickly.
states = {
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WY': 'Wyoming'
}
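Looking a value up is then a single hashed lookup rather than a scan:

print(states['WV'])  # 'West Virginia': hash 'WV', jump to its bucket, done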