How does this query work in terms of string comparison performance (assume there is a standard B-tree index on last_name)?
select * from employee where last_name = 'Wolfeschlegelsteinhausenbergerdorff';
So as it walks the B-tree, I am assuming it doesn't do a linear search on each character in the last_name field. E.g., it doesn't start by checking that the first letter starts with a W... Assuming it doesn't do a linear comparison, what does it do?
I ask because I am considering writing my own duplicate prevention mechanism, but I want the performance to be sound. I was originally thinking of hashing each string (into some primitive datatype, probably a Long) that is coming in through an API, and storing the hash codes in a set/cache (each entry expires after 5 minutes). Any collision would prompt a true duplicate check, where the already processed strings are stored in PostgreSQL. But I'm wondering: would it be better to simply query PostgreSQL instead of maintaining my own memory-based set of hashes that flushes old entries after 5-10 minutes? I would probably use Redis for scalability, since multiple nodes will be reading different streams. Is my set of memory-cached hash codes going to be faster than just querying indexed Postgres string columns (full text matching, not text searching)?
When strings are compared for equality, the function texteq is called.
If you look up the function in src/backend/utils/adt/varlena.c, you will find that the comparison is made using the C library function memcmp. I doubt that you can get faster than that.
When you look up the value in a B-tree index, it will be compared to values stored in each index page on the path from the root page to the leaf page, which is at most 5 or 6 pages.
Frankly, I doubt that you can manage to be faster than that, but I wish you luck trying.
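For the duplicate-prevention half of the question, the usual answer is to let Postgres enforce uniqueness itself. A minimal sketch using node-postgres (the processed_strings table and its unique index are assumptions for illustration, not something from the thread):

const { Pool } = require('pg')
const pool = new Pool() // connection settings come from the environment

// Assumes one-time setup like:
//   CREATE TABLE processed_strings (value text);
//   CREATE UNIQUE INDEX ON processed_strings (value);
async function insertIfNew(value) {
  // ON CONFLICT DO NOTHING folds the duplicate check and the insert into a
  // single indexed round trip; rowCount is 0 when the value already existed
  const res = await pool.query(
    'INSERT INTO processed_strings (value) VALUES ($1) ON CONFLICT (value) DO NOTHING',
    [value]
  )
  return res.rowCount === 1 // true if this string was new
}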
I have a very simple Mongo database for a personal nodejs project. It's basically just records of registered users.
My most important field is an alpha-numeric string (let's call it user_id and assume it can't be only numeric) of about 15 to 20 characters.
Now the most important operation is checking whether the user exists or not. I do this by querying db.collection.find({"user_id": "testuser-123"}).
If no record returns, I save the user along with some other, less important data like first name, last name and signup date.
Now I obviously want to make user_id an index.
I read the Indexing Tutorials on the official MongoDB Manual.
First I tried setting a text index because I thought that would fit the alpha-numeric field. I also tried setting language:none. But it turned out that my query took ~12ms with the text index, instead of the 6ms it took without any indexing.
Then I tried just setting an ordered index like {user_id: 1}, but I haven't seen any difference (does it only work for numeric values?).
Can anyone recommend the best type of index for this case, or the quickest query to check if the user exists? Or is MongoDB maybe not the best match for this?
Some random thoughts first:
A text index is used to help full text search. Given your description this is not what is needed here, as, if I understand it well, you need to use an exact match of the whole field.
Without any index, MongoDB will use a linear search. Using big-O notation, this is an O(n) operation. With an (ordered) index, the search is performed in O(log(n)). That means that an index will dramatically speed up queries when you have many documents, but you will not necessarily see any improvement with a small number of documents. In that case, the linear scan can even beat the index lookup. Some database management systems don't even bother using the index if the optimizer estimates that it will not provide enough benefit. I don't know if MongoDB does that, though.
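You can see which plan MongoDB actually picks with explain(). A quick sketch, using the field from the question (the collection name is a placeholder):

// "COLLSCAN" means a linear scan; "IXSCAN" means the {user_id: 1} index was used
db.users.find({ user_id: "testuser-123" }).explain("executionStats")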
Given your use case, I think the proper index is a unique index. This is an ordered index that would also prevent insertion of two documents with the same user_id.
In your application, do not test before insert. In a real application, this could lead to a race condition when you have concurrent inserts. If you use a unique index, just try to insert -- and be prepared to gracefully handle an error caused by a duplicate key, as in the sketch below.
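A minimal sketch of that insert-first pattern (mongosh syntax; the users collection name is an assumption):

// One-time setup: the unique index doubles as the duplicate check
db.users.createIndex({ user_id: 1 }, { unique: true })

// Then just insert; a duplicate key raises error code 11000,
// with no find()-then-insert race window
try {
  db.users.insertOne({ user_id: "testuser-123", first_name: "Test" })
} catch (e) {
  if (e.code !== 11000) throw e
  // duplicate key: the user already exists
}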
We are migrating a database from MySQL to MongoDB for performance reasons and considering what to use for IDs of the MongoDB documents. We are debating between using ObjectIDs, which is the MongoDB default, or using UUIDs instead (which is what we have been using up until now in MySQL). So far, the arguments we have to support any of these options are the following:
ObjectIDs:
ObjectIDs are the MongoDB default and I assume (although I'm not sure) that this is for a reason, meaning that I expect that MongoDB can handle them more efficiently than UUIDs, or has another reason for preferring them. I also found this stackoverflow answer that mentions that usage of ObjectIDs makes indexing more efficient; it would be nice, however, to have some metrics on how much more efficient that is.
UUIDs:
Our basic argument in favour of using UUIDs (and it is quite an important one) is that they are supported, one way or another, by virtually every database. This means that if, some way down the road, we decide to switch from MongoDB to something else for whatever reason, and we already have an API that retrieves documents from the DB based on their IDs, nothing changes for the clients of this API, since the IDs can continue to be exactly the same. If we were to use ObjectIDs, I'm not really sure how we would go about migrating them to another DB.
Does anyone have any insights on whether one of these options may be better than the other and why? Have you ever used UUIDs in MongoDB instead of ObjectIDs and if yes what were the advantages / problems you came across?
Using UUIDs in Mongo is certainly possible and reasonably well supported. For example, the Mongo docs list UUIDs as one of the common options for the _id field.
Considerations
Performance – As other answers mention, benchmarks show UUIDs cause a performance drop for inserts. In the worst case measured (going from 10M to 20M docs in a collection) they're about 2-3x slower – the difference between inserting 2,000 (UUID) and 7,500 (ObjectID) docs per second. This is a large difference, but its significance depends entirely on your use case. Will you be batch inserting millions of docs at a time? For most apps I've built, the common case is inserting individual documents. The same benchmarks show that, for that usage pattern, the difference is much smaller (6,250 vs 7,500; ~20%). Not insignificant, but not earth-shattering either.
Portability – Many other DB platforms have good UUID support so portability would be improved. Alternatively, since UUIDs are larger (more bits) it is possible to repack an ObjectID into the "shape" of a UUID. This approach isn't as nice as direct portability but it does give you a way to "map" between existing ObjectIDs and UUIDs.
Decentralisation – One of the big selling points of UUIDs is that they're universally unique. This makes it practical to generate them anywhere, in a decentralised fashion (in contrast to, for example, an auto-incrementing value, which requires a centralised source of truth to determine the "next" value). Of course, Mongo ObjectIDs offer this benefit too. The difference is that UUIDs are based on a 15+ year old standard and supported on (nearly?) all platforms, languages, etc. This makes them very useful if you ever need to create entities (or specifically, sets of related entities) in disconnected systems, without interacting with the database. You can create a dataset with IDs and foreign keys in place, then write the whole graph into the database at some point in the future without conflict. Although this is also possible with Mongo ObjectIDs, finding code to generate them/work with the format will often be harder.
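As a rough sketch of that decentralised workflow (using Node's built-in crypto.randomUUID(); the entity names are invented):

const { randomUUID } = require('crypto') // built into Node 14.17+

// Build a small graph of related records entirely offline,
// with the "foreign keys" already in place
const orderId = randomUUID()
const order = { _id: orderId, status: 'new' }
const lines = [
  { _id: randomUUID(), order_id: orderId, sku: 'A-1' },
  { _id: randomUUID(), order_id: orderId, sku: 'B-2' },
]

// ...then at some later point, write the whole graph without conflicts:
// await db.collection('orders').insertOne(order)
// await db.collection('order_lines').insertMany(lines)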
Corrections
Contrary to some of the other answers:
UUIDs do have native Mongo support – You can use the UUID() function in the Mongo Shell exactly the same way you'd use ObjectID(), to convert a UUID string into the equivalent BSON object (see the shell sketch after these corrections).
UUIDs are not especially large – When encoded using binary subtype 0x04 they're 128 bits, compared to 96 bits for ObjectIDs. (If encoded as strings they will be pretty wasteful, taking around 288 bits.)
UUIDs can include a timestamp – Specifically, UUIDv1 encodes a timestamp with 60 bits of precision, compared to 32 bits in ObjectIDs. In decimal, this is over 6 orders of magnitude more precision – so nanoseconds instead of seconds. It can actually be a decent way of storing create timestamps with more accuracy than Mongo/JS Date objects support, however...
The built-in UUID() function only generates v4 (random) UUIDs, so to leverage this, you'd need to lean on your app or Mongo driver for ID creation.
Unlike ObjectIDs, because of the way UUIDs are chunked, the timestamp doesn't give you a natural order. This can be good or bad depending on your use case. (New standards may change this; see 2021 update below.)
Including timestamps in your IDs is sometimes a Bad Idea. You end up leaking the creation time of documents anywhere an ID is exposed. (Of course, ObjectIDs also encode a timestamp, so this is partly true for them too.)
If you do this with (spec-compliant) v1 UUIDs, you're also encoding part of the server's MAC address, which can potentially be used to identify the machine. Probably not an issue for most systems, but also not ideal. (New standards may change this; see 2021 update below.)
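A quick shell sketch of the native-support point above (mongosh syntax; in the legacy mongo shell, UUID() requires the hex string argument; the things collection is invented):

// UUID() with no argument generates a random v4 value stored as BinData
// subtype 4; with a string argument it parses an existing UUID
db.things.insertOne({ _id: UUID() })
db.things.insertOne({ _id: UUID("35992974-21ea-4f61-b715-2dfaed663b73") })

// ObjectId() works the same way for the default ID type
db.things.insertOne({ _id: ObjectId() })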
Conclusion
If you think about your Mongo DB in isolation, ObjectIDs are the obvious choice. They work well out of the box and are a perfectly capable default. Using UUIDs instead does add some friction, both when working with the values (needing to convert to binary types, etc.) and in terms of performance. Whether this slight inconvenience is worth having a standardised ID format really depends on the importance you place on portability and your architectural choices.
Will you be syncing data between different database platforms? Will you migrate your data to a different platform in the future? Do you need to generate IDs outside the database, in other systems or in the browser? If not now, then at some point in the future? UUIDs might be worth the hassle.
Aug 2021 Update
The IETF recently published a draft update to the UUID spec that would introduce some new versions of the format.
Specifically, UUIDv6 and UUIDv7 are based on UUIDv1 but flip the timestamp chunks so the bits are arranged from most significant to least significant. This gives the resulting values a natural order that (more or less) reflects the order in which they were created. The new versions also exclude data derived from the server's MAC address, addressing a long-standing criticism of v1 UUIDs.
It'll take time for these changes to flow through to implementations but (IMHO) they significantly modernise and improve the format.
The _id field of MongoDB can have any value you want as long as you can guarantee that it is unique for the collection. When your data already has a natural key, there is no reason not to use this in place of the auto-generated ObjectIDs.
ObjectIDs are provided as a reasonable default solution to save time generating your own unique key (and to discourage beginners from trying to copy SQL's AUTO_INCREMENT, which is a bad idea in a distributed database).
By not using ObjectIDs you also miss out on another convenience feature: an ObjectID also includes a Unix timestamp of when it was generated, and many drivers provide a function to extract it and convert it to a date. This can sometimes make a separate create-date field redundant.
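For instance (shell syntax, reusing the ObjectId from the example earlier in this thread):

// getTimestamp() recovers the creation time embedded in the first 4 bytes
ObjectId("5e0e6a804888946fa61a1976").getTimestamp()
// -> ISODate("2020-01-02T22:11:12Z")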
But when neither is a concern for you, you are free to use your UUIDs as _id field.
Consider the amount of data you would store in each case.
A MongoDB ObjectID is 12 bytes in size, is packed for storage, and its parts are organized for performance (i.e. the timestamp is stored first, which is a logical ordering criterion).
Conversely, a standard UUID is 36 bytes, contains dashes, and is typically stored as a string. Further, even if you strip the non-hex characters and intend to store it numerically, you must still contend with the fact that its "indexy" portion (the timestamp-based part of a UUID v1) sits in the middle of the UUID and doesn't lend itself well to sorting. There are studies which allow for performant UUID storage, and I even wrote a Node.js library to assist in its management.
If you intend to use a UUID, consider reorganizing it for optimal indexing and sorting; otherwise you'll likely hit a performance wall.
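A minimal sketch of that reorganization, assuming UUID v1 input (this is the widely described "ordered UUID" trick, not code from the library mentioned above):

// v1 layout: time_low - time_mid - time_hi_and_version - clock_seq - node.
// Moving the most significant timestamp chunk to the front makes values
// sort roughly by creation time, keeping inserts near one end of the index.
function rearrangeUuidV1(uuid) {
  const [timeLow, timeMid, timeHi, clockSeq, node] = uuid.split('-')
  return [timeHi, timeMid, timeLow, clockSeq, node].join('-')
}

// rearrangeUuidV1('9575edcc-cb70-1d63-97ed-ee5d624de87b')
// -> '1d63-cb70-9575edcc-97ed-ee5d624de87b'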
We must be careful to distinguish the cost of MongoDB inserting a thing from the cost of generating that thing in the first place, and to consider that cost relative to the size of the payload. Below is a little matrix that shows the method of generating the _id crossed against the size of an optional extra payload. The tests use javascript only, conducted on a MacBook Pro against localhost, for 100,000 inserts using insertMany in batches of 100 without transactions, to try to remove network, chattiness, and other factors. Two runs with batch = 1 were also done just to highlight the dramatic difference.
Method
A : Simple int: _id:0, _id:1, ...
B : ObjectId: _id:ObjectId("5e0e6a804888946fa61a1976"), ...
C : Simple string: _id:"A0", _id:"A1", ...
D : UUID-length string (but not actually generated by UUID()): _id:"9575edcc-cb70-4d63-97ed-ee5d624de87b0", ...
E : Real generated UUID, stored as the UUID() object: _id:UUID("35992974-21ea-4f61-b715-2dfaed663b73"), ...
F : Real generated UUID, stored as a string, e.g. UUID().toString().substr(6,36): _id:"6b16f733-ff24-4172-83f9-e4f96ace6775", ...
Time in milliseconds to perform 100,000 inserts on fresh (empty) collection.
Extra M E T H O D (Batch = 100)
Payload A B C D E F % drop A to F
-------- ---- ---- ---- ---- ---- ---- ------------
None 2379 2386 2418 2492 3472 4267 80%
512 2934 2928 3048 3128 4151 4870 66%
1024 3249 3309 3375 3390 4847 5237 61%
2048 3953 3832 3987 4342 5448 5888 49%
4096 6299 6343 6199 6449 7634 8640 37%
8192 9716 9292 9397 10816 11212 11321 16%
Extra M E T H O D (Batch = 1)
Payload A B C D E F % drop A to F
-------- ----- ----- ----- ----- ----- ----- ------------
None 48006 48419 49136 48757 50649 51280 6.8%
1024 50986 50894 49383 49373 51200 51821 1.2%
This was a quick test, but it seems clear that basic strings and ints as _id are roughly the same speed, whereas actually generating a UUID adds time -- especially if you take the string version of the UUID() object, e.g. UUID().toString().substr(6,36). It is also worth noting that constructing an ObjectId appears to be just as quick.
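For anyone who wants to rerun something similar, a stripped-down sketch of the harness shape described above (Node driver; database/collection names are placeholders):

const { MongoClient, ObjectId } = require('mongodb')

async function bench(makeId, total = 100000, batch = 100) {
  const client = await MongoClient.connect('mongodb://localhost:27017')
  const coll = client.db('bench').collection('t')
  await coll.drop().catch(() => {}) // start from a fresh (empty) collection

  const start = Date.now()
  for (let i = 0; i < total; i += batch) {
    const docs = []
    for (let j = 0; j < batch; j++) docs.push({ _id: makeId(i + j) })
    await coll.insertMany(docs) // batches of 100, no transactions
  }
  console.log(`${makeId.name}: ${Date.now() - start} ms`)
  await client.close()
}

// e.g. method A (simple int) vs method B (ObjectId):
// await bench(function intId(n) { return n })
// await bench(function objId() { return new ObjectId() })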
I have been thinking about this for the last several weeks. What I found is that ObjectId and UUID are both unique; in fact, at the collection level you cannot have a duplicate _id whatever type you use. Some of the answers talked about insertion performance, but the important thing is not insertion performance so much as indexing performance, which comes down to how much RAM you will use to index the _ids. ObjectId is 12 bytes, whereas a UUID stored as a string is 36 bytes, so for the same number of index entries you will need roughly three times as much RAM if you use UUIDs instead of ObjectIds.
So from that point of view it is better to use ObjectId over UUID in mongodb.
UUIDs are 128 bits (16 bytes) and are globally unique. See RFC 4122.
ObjectIds are a MongoDB-specific construct and are 96 bits (12 bytes). Although that is generally enough to provide global uniqueness, there are some edge conditions. MongoDB has this official document comparing the two.
We prefer to not be tied down with MongoDB specific ID generation and prefer to do it on the client side. We also use multiple kinds of databases. The bottom line is that choosing UUID over ObjectId is a decision one can take based on their specific use cases.
I found these benchmarks some time ago, when I had the same question.
They basically show that using a Guid instead of an ObjectId causes index performance to drop.
I would anyway recommend that you customize the benchmarks to imitate your specific real-life scenario and see how the numbers look; one cannot rely 100% on generic benchmarks.
Try this:

const uuid = require('uuid')
const mongoose = require('mongoose')

const YourSchema = new mongoose.Schema({
  _id: {
    type: String,
    // generate a v4 UUID and strip the dashes
    default: () => uuid.v4().replace(/-/g, ''),
  },
})
Let's say we have many data tables structured as timestamp(hash) - value pairs, where values could be for example temperatures or other kinds of varied measurement data.
To get timestamps of certain values we can build a secondary index with value(hash) - timestamp(range), but what if we want to query the value with comparison operations like GT, LT, BETWEEN to get timestamps of a range of values?
Obviously, I want to avoid using scan. The only thing I've come up with is using a dummy hash key and putting the values+timestamps into range attribute, but I'm guessing this has its own problems (better or worse compared to scan?).
Is there a better solution or can this be done with DynamoDB at all?
You need to know the HASH then you can perform a query on the RANGE. To get around this you would need to denormalize your table, i.e. create a duplicate with the keys reversed. Although that seems like a bit of a pain in the butt, it is one of the tradeoffs that is at times required for all the performance benefits of a key value store.
Example for this case:
With both keys completely random, you are out of luck. But rather than setting your HASH to a dummy value, you could try using a monthly timestamp instead; that way you should always be able to work out programmatically what the hash should be. You can then also set the range to a combination of both values separated by a hyphen, i.e. timestamp-value, and in the denormalized table, value-timestamp; that way you should be able to use the comparison operators with no performance hit.
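A hedged sketch of querying that denormalized layout (AWS SDK for JavaScript v3; every table and attribute name here is invented for illustration):

const { DynamoDBClient } = require('@aws-sdk/client-dynamodb')
const { DynamoDBDocumentClient, QueryCommand } = require('@aws-sdk/lib-dynamodb')

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}))

// HASH = monthly bucket (computable from the date you care about),
// RANGE = "value-timestamp" string, so BETWEEN/GT/LT work on the value part.
// Note the values need a fixed-width, zero-padded encoding so that string
// comparison matches numeric comparison.
async function timestampsForValueRange(monthBucket, lo, hi) {
  const out = await ddb.send(new QueryCommand({
    TableName: 'measurements_by_value', // the denormalized copy
    KeyConditionExpression: '#pk = :m AND #sk BETWEEN :lo AND :hi',
    ExpressionAttributeNames: { '#pk': 'month_bucket', '#sk': 'value_ts' },
    ExpressionAttributeValues: { ':m': monthBucket, ':lo': lo, ':hi': hi },
  }))
  return out.Items
}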
I am creating a service for which I will use MongoDB as a storage backend.
The service will produce a hash of the user input and then see if that same hash (+ input) already exists in our dataset.
The hash will be unique yet random ( = non-incremental/sequential), so my question is:
Is it -legitimate- to use a random value for an Object ID? Example:
$object_id = new MongoId(HEX-OF-96BIT-HASH);
Or will MongoDB treat the ObjectID differently from other server-produced ones, since a "real" ObjectID also contains timestamps, machine_id, etc?
What are the pros and cons of using a 'random' value? I guess it would be statistically slower for the engine to update the index on inserts when the new _id's are not in any way incremental - am I correct on that?
Yes, it is perfectly fine to use a random value for an object id. If some value is present in the _id field of a document being stored, it is treated as the ObjectId.
Since the _id field is always indexed and is the primary key, you need to make sure that a different ObjectId is generated for each object.
There are some guidelines to optimize user-defined object ids:
https://docs.mongodb.com/manual/core/document/#the-id-field.
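As a concrete illustration of the question's approach, a minimal Node sketch of deriving a 12-byte _id from a content hash (collision trade-offs are covered in the next answer):

const crypto = require('crypto')
const { ObjectId } = require('mongodb')

// Deterministically derive a 12-byte ObjectId from the input. Note this
// throws away the timestamp/machine-id semantics of a real ObjectId.
function idFromHash(input) {
  const hex = crypto.createHash('sha256').update(input).digest('hex')
  return ObjectId.createFromHexString(hex.slice(0, 24)) // 24 hex chars = 12 bytes
}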
While any values, including hashes, can be used for the _id field, I would recommend against using random values for two reasons:
You may need to develop a collision-management strategy in the case you produce identical random values for two different objects. In the question, you imply that you'll generate IDs using a some type of a hash algorithm. I would not consider these values "random" as they are based on the content you are digesting with the hash. The probability of a collision then is a function of the diversity of content and the hash algorithm. If you are using something like MD5 or SHA-1, I wouldn't worry about the algorithm, just the content you are hashing. If you need to develop a collision-management strategy then you definitely should not use random or hash-based IDs as collision management in a clustered environment is complicated and requires additional queries.
Random values as well as hash values are purposefully meant to be dispersed on the number line. That (a) will require more of the B-tree index to be kept in memory at all times and (b) may cause variable insert performance due to B-tree rebalancing. MongoDB is optimized to handle ObjectIDs, which come in ascending order (with one second time granularity). You're likely better off sticking with them.
I just found out an answer to one of my questions, regarding indexing performance:
If the _id's are in a somewhat well defined order, on inserts the entire b-tree for the _id index need not be loaded. BSON ObjectIds have this property.
Source: http://www.mongodb.org/display/DOCS/Optimizing+Object+IDs
Whether it is good or bad depends upon its uniqueness. Of course, the ObjectId provided by MongoDB is quite unique, so this is a good thing. As long as you can replicate that uniqueness, you should be fine.
There are no inherent risks or performance losses in using your own ID. I guess using it in string form might use up more index/storage/querying power, but here you are using it in MongoID (ObjectId) form, which should preserve the strengths of not storing it as a simple string.
Is there a way to create an index on only part of a field in MongoDB, for example on the first 10 characters? I couldn't find it documented (or asked about on here).
The MySQL equivalent would be CREATE INDEX part_of_name ON customer (name(10));.
Reason: I have a collection with a single field that varies in length from a few characters up to over 1000 characters, average 50 characters. As there are a hundred million or so documents it's going to be hard to fit the full index in memory (testing with 8% of the data the index is already 400MB, according to stats). Indexing just the first part of the field would reduce the index size by about 75%. In most cases the search term is quite short, it's not a full-text search.
A work-around would be to add a second field of 10 (lowercased) characters for each item, index that, then add logic to filter the results if the search term is over ten characters (that extra field is probably needed anyway for case-insensitive searches, unless anybody has a better way). It seems like an ugly way to do it, though.
[added later]
I tried adding the second field, containing the first 12 characters from the main field, lowercased. It wasn't a big success.
Previously, the average object size was 50 bytes, but I forgot that includes the _id and other overheads, so my main field length (there was only one) averaged nearer to 30 bytes than 50. Then, the second field index contains the _id and other overheads.
Net result (for my 8% sample) is the index on the main field is 415MB and on the 12 byte field is 330MB - only a 20% saving in space, not worthwhile. I could duplicate the entire field (to work around the case insensitive search problem) but realistically it looks like I should reconsider whether MongoDB is the right tool for the job (or just buy more memory and use twice as much disk space).
[added even later]
This is a typical document, with the source field, and the short lowercased field:
{ "_id" : ObjectId("505d0e89f56588f20f000041"), "q" : "Continental Airlines", "f" : "continental " }
Indexes:
db.test.ensureIndex({q:1});
db.test.ensureIndex({f:1});
The "f" index, working on a shorter field, is 80% of the size of the "q" index. I didn't mean to imply I included the _id in the index, just that the index needs to record where each entry points, so there is per-entry overhead that probably helps explain why a shorter key makes so little difference.
Access to the index will be essentially random; no part of it is more likely to be accessed than any other. The total index size for the full file will likely be 5GB, so it's not extreme for that one index. But adding some other fields for other search cases, with their associated indexes, and copies of data for lower case, does start to add up, and makes paging and swapping more likely (it's an 8GB server), which is why I started looking into a more concise index.
MongoDB has no way to create an index on a portion of a field's value. Your best approach is to create the second field, as you've suggested.
Since you'll need the second field for efficient case-insensitive searching anyway, there's really no reason to not create it.
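A small shell sketch of that work-around, using the field names from the example document above (q holds the full value, f the lowercased prefix):

// At write time, store the lowercased 12-character prefix alongside the value
db.test.insert({ q: name, f: name.toLowerCase().substr(0, 12) })

// At query time, hit the compact "f" index first; the extra "q" condition
// filters exactly when the search term is longer than the prefix
db.test.find({ f: term.toLowerCase().substr(0, 12), q: term })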
The indexes don't store the '_id' field of the document; they store a DiskLoc structure, which is a much lower-level structure. See here for details:
http://www.10gen.com/presentations/MongoNYC-2012/storage-engine-internals
Also, note that the "ugly" is really an artifact of "relational thinking". (As a long-time SQL user myself, I often find that the hardest part about learning MongoDB is un-learning my relational thinking.) In a document-oriented database, denormalizing and duplicating data are actually Best Practices.