Mongo ObjectId collision possibility vs MD5 - mongodb

Which is more likely to collide: a Mongo ObjectId or an MD5 hash? I am building a website and looking for a way to index my products.
Thanks in advance.

MongoDB's ObjectId is unlikely to collide. It contains a counter, a random number, a process id, and so on. MD5 hashes depend only on the value of the input: if you pass in two identical inputs, the hashes will be identical.
I would need to know more about how you hash your products. If you are sure your product values will never be the same, then you can use either. But I would use ObjectId, because then you won't need to worry about product values and hashing at all. An ObjectId (12 bytes) is also smaller than an MD5 hash (16 bytes), which is better for indexing.
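For illustration, here is a minimal sketch of the difference (assuming Python; the bson package ships with the pymongo driver). Identical inputs always produce the same MD5 digest, while every freshly generated ObjectId is distinct:

import hashlib
from bson import ObjectId

oid = ObjectId()                                 # 12 bytes, generated client-side
digest = hashlib.md5(b"product-name").digest()   # 16 bytes, derived from the input

print(len(oid.binary))                           # 12 -- smaller index keys
print(len(digest))                               # 16
print(ObjectId() == oid)                         # False: each ObjectId is new
print(hashlib.md5(b"product-name").digest() == digest)  # True: same input, same hash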

If your product model has some relation to real-world products, then there are existing ways to index them: pick article numbers or EANs (International Article Numbers). They can be an addition to the natural auto-increment ids of MySQL. UUIDs have pros (mainly for distributed databases) and cons; see https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439

Related

Text equality operator performance in PostgreSQL

How does this query work in terms of string comparison performance (assume there is a standard B-tree index on last_name)?
select * from employee where last_name = 'Wolfeschlegelsteinhausenbergerdorff';
So as it walks the B-tree, I am assuming it doesn't do a linear search on each character in the last_name field, e.g. it doesn't start by checking whether the first letter is a W. Assuming it doesn't do a linear comparison, what does it do?
I ask because I am considering writing my own duplicate-prevention mechanism, but I want the performance to be sound. I was originally thinking of hashing each string coming in through an API (into some primitive datatype, probably a long) and storing the hash codes in a set/cache where each entry expires after 5 minutes. Any collision would prompt a true duplicate check, where the already-processed strings are stored in PostgreSQL. But I'm wondering: would it be better to simply query PostgreSQL instead of maintaining my own memory-based set of hashes that flushes old entries after 5-10 minutes? I would probably use Redis for scalability, since multiple nodes will be reading different streams. Is my set of memory-cached hash codes going to be faster than just querying indexed Postgres string columns (full text matching, not text searching)?
When strings are compared for equality, the function texteq is called.
If you look up the function in src/backend/utils/adt/varlena.c, you will find that the comparison is made using the C library function memcmp. I doubt that you can get faster than that.
When you look up the value in a B-tree index, it is compared to the values stored in each index page on the path from the root page to the leaf page, which is at most 5 or 6 pages.
Frankly, I doubt that you can manage to be faster than that, but I wish you luck trying.
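For reference, a minimal sketch of the in-memory hash cache the question describes (assuming Python; the 5-minute TTL and the fall-through to a real database check are taken from the question itself):

import time

TTL_SECONDS = 300   # entries expire after 5 minutes, per the question
seen = {}           # hash code -> insertion timestamp

def probably_duplicate(s):
    # A hit here is only a candidate duplicate: different strings can
    # share a hash code, so a true check against the database follows.
    now = time.time()
    for h, t in list(seen.items()):   # evict expired entries
        if now - t > TTL_SECONDS:
            del seen[h]
    h = hash(s)
    if h in seen:
        return True                   # collision -> run the real duplicate check
    seen[h] = now
    return False

Note that Python's built-in hash() is randomized per process, so a cache shared across nodes (e.g. in Redis, as the question suggests) would need a stable digest such as one from hashlib instead.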

Mongo _id as string indexed key. Good or bad?

I'm developing an API where the only way to get a resource is by providing a string key like my_resource.
Is it good practice to override _id (this makes some MongoDB drivers easier to use), or is it bad? What about the long term?
Thank you
If there is a more natural primary key to use than an ObjectID (for example, a string value) feel free to use it.
The purpose of ObjectIDs is to allow distributed clients to quickly and independently generate unique identifiers according to a standard formula. The 12-byte ObjectID formula includes a 4-byte timestamp, 3-byte machine identifier, 2 byte process ID, and a 3-byte counter starting with a random value.
If you generate/assign your own unique identifiers (i.e. using strings), the potential performance consideration is that you won't know whether a value is unique until you try to insert a document using it as the _id. In that case you would need to handle the duplicate key exception by retrying the insert with a new _id.
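A minimal sketch of that retry loop (assuming Python with pymongo; the database/collection names and the generate_id callable are hypothetical):

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

collection = MongoClient().mydb.resources   # hypothetical collection

def insert_with_custom_id(doc, generate_id):
    # Insert doc with a caller-supplied _id, retrying on duplicates.
    while True:
        doc["_id"] = generate_id()
        try:
            collection.insert_one(doc)
            return doc["_id"]
        except DuplicateKeyError:
            continue                        # _id already taken; try a new one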
In my experience, overriding _id is not the best idea. Only if your data has a value field that is naturally unique and can easily be used to replace _id should _id be overridden. But it wouldn't make a whole lot of sense to override _id only to replace it with a contrived value.
I would recommend against it for a few reasons:
First of all, doing so requires an additional implementation to handle the inevitable instances when your "unique" values will conflict. And this will almost certainly arise in a database of any significant size. This can be a problem, since MongoDB can be unforgiving when it comes to overwriting values and generally handling conflicts. In other words, you're almost certain to overwrite values or meet unhandled exceptions unless you design your database structure very carefully from the beginning.
Second, and equally important: ObjectIDs are generated in roughly ascending order, so each new ObjectID sorts close to the previous one. This keeps inserts near one edge of the _id index, which is good for memory use and index maintenance. It might be more trouble than it's worth to recreate this very handy property yourself.
And lastly, although this isn't as significant, overriding _id makes it that much harder to use the standard ObjectID methods (for example, extracting the embedded creation timestamp).
Now, there is at least one positive that I can think of for overriding the ObjectID:
If there is an instance when _id will certainly never be used in your database, then it can save you a good amount of memory, as indexes are pretty costly in MongoDB.

Creating custom Object ID in MongoDB

I am creating a service for which I will use MongoDB as a storage backend.
The service will produce a hash of the user input and then see if that same hash (+ input) already exists in our dataset.
The hash will be unique yet random ( = non-incremental/sequential), so my question is:
Is it -legitimate- to use a random value for an Object ID? Example:
$object_id = new MongoId(HEX-OF-96BIT-HASH);
Or will MongoDB treat the ObjectID differently from other server-produced ones, since a "real" ObjectID also contains timestamps, machine_id, etc?
What are the pros and cons of using a 'random' value? I guess it would be statistically slower for the engine to update the index on inserts when the new _id's are not in any way incremental - am I correct on that?
Yes, it is perfectly fine to use a random value for an object id. If a value is present in the _id field of a document being stored, it is treated as the document's id, whatever its form.
Since the _id field is always indexed and is the primary key, you need to make sure that a different id is generated for each object.
There are some guidelines for optimizing user-defined object ids:
https://docs.mongodb.com/manual/core/document/#the-id-field.
While any values, including hashes, can be used for the _id field, I would recommend against using random values for two reasons:
You may need to develop a collision-management strategy in case you produce identical random values for two different objects. In the question, you imply that you'll generate IDs using some type of hash algorithm. I would not consider these values "random", as they are based on the content you are digesting with the hash. The probability of a collision is then a function of the diversity of the content and the hash algorithm. If you are using something like MD5 or SHA-1, I wouldn't worry about the algorithm, just the content you are hashing. If you do need a collision-management strategy, then you definitely should not use random or hash-based IDs, as collision management in a clustered environment is complicated and requires additional queries.
Random values, as well as hash values, are purposely dispersed across the number line. That (a) requires more of the B-tree index to be kept in memory at all times and (b) may cause variable insert performance due to B-tree rebalancing. MongoDB is optimized to handle ObjectIDs, which arrive in ascending order (with one-second time granularity). You're likely better off sticking with them.
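For what it's worth, here is a minimal sketch of a hash-derived _id in practice (assuming Python with pymongo; the collection name is hypothetical, and the _id is stored as a hex string rather than the 12-byte binary value the question asks about):

import hashlib
from pymongo import MongoClient

collection = MongoClient().mydb.items       # hypothetical collection

user_input = "some user input"
doc_id = hashlib.sha1(user_input.encode()).hexdigest()

# Existence check: has this exact input been stored before?
if collection.count_documents({"_id": doc_id}, limit=1):
    print("already seen")
else:
    collection.insert_one({"_id": doc_id, "input": user_input})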
I just found out an answer to one of my questions, regarding indexing performance:
If the _id's are in a somewhat well defined order, on inserts the entire b-tree for the _id index need not be loaded. BSON ObjectIds have this property.
Source: http://www.mongodb.org/display/DOCS/Optimizing+Object+IDs
Whether it is good or bad depends upon its uniqueness. Of course, the ObjectId provided by MongoDB is quite unique, so this is a good thing. So long as you can replicate that uniqueness, you should be fine.
There are no inherent risks or performance losses in using your own ID. Using it in plain string form might cost more index, storage, and querying power, but here you are using it in MongoId (ObjectId) form, which should preserve those advantages over storing it as a simple string.

How Do Hashes Work in Programming?

How do hashes work in programming? The way I think of a hash is as something that lets me use some unique value to retrieve some data. Like if we have an array and I start to put things in the array, and I have another variable that keeps track of what item is in slot 0, 1, 2..., then I have that instant ability to find an item. Is that hashing?
What is the purpose of a hash?
When should a hash be implemented? What's a hash similar to in terms of data structure?
What I think I know about hashes is that they allow us to retrieve an item in O(1). Is that correct?
A hash is like a person's first name -- it's a short way of remembering a person, even though it doesn't have to be unique. If you need to find some information about someone, you can just look them up by their name, and you only need to perform other checks if two or more people have the same name.
That's the power of hashing, and just as remembering people is much easier by name than by Social Security Number, finding an object by its hash code is much easier than actually comparing the object to everything already in your collection.
Now, in this example, if you're looking someone up in a phone book by name, you'd find them in O(log n) time, because the names are sorted alphabetically and you can do a binary search. If, however, you instead "hashed" 100 people born in the 1900s by their year of birth, then you'd need at most 4 comparisons in the hashtable/phonebook (one per digit) to find any one year by hash, which is constant time. Then, if two people were born in the same year, you can use other information to find the person you need, and on average, if your table isn't too full (say, at most 50 people for 100 different years of birth), your lookups will be constant-time.
(If your table gets more than, say, 50% full, you can always double its size, to keep the number of collisions low and hence to keep your lookups fast.)
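To make the analogy concrete, here is a toy hash table sketch (assuming Python; real implementations such as Python's dict are far more refined):

class TinyHashTable:
    # A toy hash table: a list of buckets plus a simple hash function.

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)    # map key to a bucket

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                        # key already present: replace
                bucket[i] = (key, value)
                return
        bucket.append((key, value))             # collisions share a bucket list

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = TinyHashTable()
table.put("alice", 1200)
print(table.get("alice"))    # 1200, found without scanning every entry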
More information:
If you've ever heard of MD5, SHA-1, or SHA-2 hashes for files, they're like the "fingerprints" of the file. While it's possible to have two files with the same hash, this is so unlikely that, for practical purposes, it's impossible; hence, if you have the hashes of two files, you can compare the files by their fingerprints rather than by their data, which is immensely faster.
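A short sketch of file fingerprinting (assuming Python; the file names are placeholders):

import hashlib

def fingerprint(path, chunk_size=65536):
    # Return the SHA-256 hex digest of a file, read in chunks.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

# Compare two files by their short digests instead of byte-by-byte.
same = fingerprint("a.bin") == fingerprint("b.bin")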
A hash map / dictionary is a key/value data structure that stores objects in buckets based on the value of a hash function. These keys must be unique but the hash function values (sometimes called hashcodes) aren't necessarily unique.
Like if we have an array and I start to put things in the array, and I have another variable that keeps track of what item is in slot 0, 1, 2..., then I have that instant ability to find an item. Is that hashing?
No. A hash function is a deterministic function that always gives the same value for an object. The hash code does not change depending on where the object is stored.
What I think I know about hashes is that they allow us to retrieve an item in O(1). Is that correct?
Nearly. A dictionary has O(1) complexity for lookups if there are not too many hash-code collisions. However, if the hash function is poor and every object has the same hash value, then a dictionary could have O(n) performance instead.
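A quick sketch of that degenerate case (assuming Python, where a class can supply its own __hash__):

class BadKey:
    # Every instance hashes to the same value, forcing collisions.
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return 42                       # constant hash: the worst case
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.name == other.name

d = {BadKey(str(i)): i for i in range(1000)}
# Every lookup now degenerates into a linear scan of one bucket chain.
print(d[BadKey("500")])                 # still correct, just O(n)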
A hash makes lookups fast compared to iterating over an array or a tree. It makes O(1) search time possible with little memory overhead.

What's the purpose in hashing information?

After being taught how to create a hash table in class, I don't understand when hashing data would be useful. It seems to me that all hashing does is store information at semi-random positions in an array. I want to know how any of the data can be made useful after it's stored.
My question is this: what are some examples where hashing information is beneficial? How is data retrieved in any organized manner? It seems to be placed in arbitrary positions where it would be difficult to retrieve.
Hashing can be used for many purposes:
It can be used to compare large amounts of data. You create the hashes for the data, store them, and later, if you want to compare the data, you just compare the hashes (a sketch of this follows the lists below).
Hashes can be used to index data. They can be used in hash tables to point to the correct row. If you want to quickly find a record, you calculate the hash of the data and directly go to the record where the corresponding hash record is pointing to. (This assumes that you have a sorted list of hashes that point to the actual records)
They can be used in cryptographic applications like digital signatures.
Hashing can be used to generate seemingly random strings.
Here are some applications of hash functions that Wikipedia lists:
Finding duplicate records
Finding similar records
Finding similar substrings
Geometric hashing
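A minimal sketch of the duplicate-finding use case (assuming Python; the records are invented for illustration):

import hashlib

records = [
    ("Alice", "alice@example.com"),
    ("Bob", "bob@example.com"),
    ("Alice", "alice@example.com"),   # a duplicate
]

seen = set()
for record in records:
    # Hash a canonical encoding of each record instead of comparing
    # whole records against one another.
    digest = hashlib.md5(repr(record).encode()).hexdigest()
    if digest in seen:
        print("duplicate:", record)
    else:
        seen.add(digest)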
Now regarding hash table, here are some points to note:
If you are using hashes as an index, the stored hashes have to be organized for fast lookup: in a hash table they serve directly as positions in the array, while some implementations keep the hashes sorted separately and point to the original records.
If somebody is storing hashes in a semi-random order, it is either for the above reasons or because they just want to store the message digest of the information for comparing, finding duplicates, etc., and not as an index to the data.
One of the major uses of the hash tables you created in class is when you need fast O(1) lookup times. You'll have two components: keys and values.
The hash function transforms the key into a hash. That hash is a number, and specifically, it is the index of the data in the array.
So, when you need to look up Agscala's reputation in a hash table and you have used your username as the key, it takes almost no time to locate the relevant value. The table simply re-hashes your username and voilà, there is the index of the data you were looking for. You didn't have to iterate over the entire array looking for that specific value.
For reference, the Wikipedia page on hash tables is pretty good.
There are a couple of typical reasons to hash data. In the example you reference, you would hash the data and use that as a key to extract the actual value of the hashed item. The hashed data is often referred to as a key and it references a bucket where the actual, non-hashed value can be found.
The other typical reason is to create a signature of the hashed value so that you can check whether the value has been changed by someone else. Since it's rare, depending on the algorithm used, for two items to hash to the same value, you can rehash a value and compare the result to the saved hash to check whether the item is still the same.
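A short sketch of that change-detection idea (assuming Python; for real tamper detection you would use a keyed MAC such as HMAC, but a plain digest shows the principle):

import hashlib

def digest(data):
    return hashlib.sha256(data).hexdigest()

original = b"important configuration"
saved_signature = digest(original)

# Later: has the value been modified?
current = b"important configuration (edited)"
if digest(current) != saved_signature:
    print("value has changed since it was signed")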
Hashing is a technique useful for fast key lookup. It allows one to more efficiently find a value rather than scanning a list from beginning to end.
Have you ever used a dictionary or a set? They're typically implemented in terms of a hashtable because the value associated with a key can be found quickly.
states = {
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WY': 'Wyoming',
}
print(states['WA'])   # 'Washington' -- found by hashing the key, not by scanning