Mongodb id on bulk insert performance

Mongodb id on bulk insert performance - mongodb

I have a class/object that have a guid and i want to use that field as the _id object when it is saved to Mongodb. Is it possible to use other value instead of the ObjectId?
Is there any performance consideration when doing bulk insert when there is an _id field? Is _id an index? If i set the _id to different field, would it slow down the bulk insert? I'm inserting about 10 million records.

1) Yes you can use that field as the id. There is no mention of what API (if any) you are using for inserting the documents. So if you would do the insertion at the command line, the command would be:
db.collection.insert({_id : <BSONString_version_of_your_guid_value>, field1 : value1, ...});
It doesn't have to be BsonString. Change it to whatever Bson value is closest matching to your guid's original type (except the array type. Arrays aren't allowed as the value of _id field).
2) As far as i know, there IS effect on performance when db.collection.insert when you provide your own ids, especially in bulk, BUT if the id's are sorted etc., there shouldn't be a performance loss. The reason, i am quoting:
The structure of index is a B-tree. ObjectIds have an excellent
insertion order as far as the index tree is concerned: they are always
increasing, meaning they are always inserted at the right edge of
B-tree. This, in turn, means that MongoDB only has to keep the right
edge of the B-Tree in memory.
Conversely, a random value in the _id field means that _ids will be
inserted all over the tree. Then the machine must move a page of the
index into memory, update a tiny piece of it, then probably ignore it
until it slides out of memory again. This is less efficient.
:from the book `50 Tips and Tricks for MongoDB Developers`
The tip's title says - "Override _id when you have your own simple, unique id." Clearly it is better to use your id if you have one and you don't need the properties of an ObjectId. And it is best if your ids are increasing for the reason stated above.
3) There is a default index on _id field by MongoDB.

So...
Yes. It is possible to use other types than ObjectId, including GUID that will be saved as BinData.
Yes, there are considerations. It's better if your _id is always increasing (like a growing number, or ObjectId) otherwise the index needs to rebuild itself more often. If you plan on using sharding, the _id should also be hashed evenly.
_id indeed has an index automatically.
It depends on the type you choose. See section 2.
Conclusion: It's better to keep using ObjectId unless you have a good reason not to.

Related

DB Compound indexing best practices Mongo DB

How costly is it to index some fields in MongoDB,
I have a table where i want uniqueness combining two fields, Every where i search they suggested compound index with unique set to true. But what i was doing is " Appending both field1_field2 and making it a key, so that field2 will be always unique for field1.(and add Application logic) As i thought indexing is costly.
And also as MongoDB documentation advices us not to use Custom Object ID like auto incrementing number, I end up giving big numbers to Models like Classes, Students etc, (where i could have used easily used 1,2,3 in sql lite), I didn't think to add a new field for numbering and index that field for querying.
What are the best practices advice for production

The advantage of using compound indexes vs your own indexed field system is that compound indexes allows sorting quicker than regular indexed fields. It also lowers the size of every documents.
In your case, if you want to get the documents sorted with values in field1 ascending and in field2 descending, it is better to use a compound index. If you only want to get the documents that have some specific value contained in field1_field2, it does not really matter if you use compound indexes or a regular indexed field.
However, if you already have field1 and field2 in seperate fields in the documents, and you also have a field containing field1_field2, it could be better to use a compound index on field1 and field2, and simply delete the field containing field1_field2. This could lower the size of every document and ultimately reduce the size of your database.
Regarding the cost of the indexing, you almost have to index field1_field2 if you want to go down that route anyways. Queries based on unindexed fields in MongoDB are really slow. And it does not take much more time adding a document to a database when the document has an indexed field (we're talking 1 millisecond or so). Note that adding an index on many existing documents can take a few minutes. This is why you usually plan the indexing strategy before adding any documents.
TL;DR:
If you have limited disk space or need to sort the results, go with a compound index and delete field1_field2. Otherwise, use field1_field2, but it has to be indexed!

How does MongoDB order their docs in one collection? [duplicate]

This question already has answers here:
How does MongoDB sort records when no sort order is specified?
(2 answers)
Closed 7 years ago.
In my User collection, MongoDB usually orders each new doc in the same order I create them: the last one created is the last one in the collection. But I have detected another collection where the last one I created has the 6 position between 27 docs.
Why is that?
Which order follows each doc in MongoDB collection?

It's called natural order:
natural order
The order in which the database refers to documents on disk. This is the default sort order. See $natural and Return in Natural Order.
This confirms that in general you get them in the same order you inserted, but that's not guaranteed–as you noticed.
Return in Natural Order
The $natural parameter returns items according to their natural order within the database. This ordering is an internal implementation feature, and you should not rely on any particular structure within it.
Index Use
Queries that include a sort by $natural order do not use indexes to fulfill the query predicate with the following exception: If the query predicate is an equality condition on the _id field { _id: <value> }, then the query with the sort by $natural order can use the _id index.
MMAPv1
Typically, the natural order reflects insertion order with the following exception for the MMAPv1 storage engine. For the MMAPv1 storage engine, the natural order does not reflect insertion order if the documents relocate because of document growth or remove operations free up space which are then taken up by newly inserted documents.
Obviously, like the docs mentioned, you should not rely on this default order (This ordering is an internal implementation feature, and you should not rely on any particular structure within it.).
If you need to sort the things, use the sort solutions.
Basically, the following two calls should return documents in the same order (since the default order is $natural):
db.mycollection.find().sort({ "$natural": 1 })
db.mycollection.find()
If you want to sort by another field (e.g. name) you can do that:
db.mycollection.find().sort({ "name": 1 })

For performance reasons, MongoDB never splits a document on the hard drive.
When you start with an empty collection and start inserting document after document into it, mongoDB will place them consecutively on the disk.
But what happens when you update a document and it now takes more space and doesn't fit into its old position anymore without overlapping the next? In that case MongoDB will delete it and re-append it as a new one at the end of the collection file.
Your collection file now has a hole of unused space. This is quite a waste, isn't it? That's why the next document which is inserted and small enough to fit into that hole will be inserted in that hole. That's likely what happened in the case of your second collection.
Bottom line: Never rely on documents being returned in insertion order. When you care about the order, always sort your results.

MongoDB does not "order" the documents at all, unless you ask it to.
The basic insertion will create an ObjectId in the _id primary key value unless you tell it to do otherwise. This ObjectId value is a special value with "monotonic" or "ever increasing" properties, which means each value created is guaranteed to be larger than the last.
If you want "sorted" then do an explicit "sort":
db.collection.find().sort({ "_id": 1 })
Or a "natural" sort means in the order stored on disk:
db.collection.find().sort({ "$natural": 1 })
Which is pretty much the standard unless stated otherwise or an "index" is selected by the query criteria that will determine the sort order. But you can use that to "force" that order if query criteria selected an index that sorted otherwise.
MongoDB documents "move" when grown, and therefore the _id order is not always explicitly the same order as documents are retrieved.

I could find out more about it thanks to the link Return in Natural Order provided by Ionică Bizău.
"The $natural parameter returns items according to their natural order within the database.This ordering is an internal implementation feature, and you should not rely on any particular structure within it.
Typically, the natural order reflects insertion order with the following exception for the MMAPv1 storage engine. For the MMAPv1 storage engine, the natural order does not reflect insertion order if the documents relocate because of document growth or remove operations free up space which are then taken up by newly inserted documents."

Mongo _id Insert Uniqueness Check

I have a medium to large Mongo collection containing image metadata for >100k images. I am generating a UUID for each image generated and using it as the _id field in the imageMeta.insert() call.
I know for a fact that these _id's are unique, or at least as unique as I can expect from boost's UUID implementation, but as the collection grows larger, the time to insert a record has grown as well.
I feel like to ensure uniqueness of the _id field Mongo must be double-checking these against the other _ids in the database. How is this implemented, and how should I expect the insert time to grow wrt. to the collection size?

The _id field in mongo is required to be unique and indexed. When an insert is performed, all indexes in the collection are updated, so it's expected to see insert time increase with the number of indexes and/or documents. Namely, all collections have at least one index (on the _id field), but you've likely created indexes on fields that you frequently query, and those indexes also get updated on every insert (adding to the latency).
One way to reduce perceived database latency is to specify a write concern to your driver. Note that the default write concern prior to November 2012 was 'unacknowledged', but it has since been changed to 'acknowledged'.

Change size of Objectid

In MongoDb ObjectId is a 12-byte BSON type.
Is there any way to reduce the size of objectID?

No. It's a BSON data type. It's like asking a 32-bit integer to shrink itself.
Every object must have _id property, but you are not restricted to ObjectId.

Every document in a MongoDB collection needs to have a unique _id but the value does not have to be an ObjectId. Therefore, if you are looking to reduce the size of documents in your collection you have two choices:
Pick one of the unique properties of your documents and use it as the _id field. For example, if you have an accounts collection where the account ID--provided externally--is part of your data model, you could store the account ID in the _id field.
Manage primary keys for the collection yourself. Many drivers support custom primary key factories. As #assylias suggests, going with an int will give you good space savings but, still, you will use more space than if you can use one of the fields in your model as the _id.
BTW, the value of an _id field can be composite: you can use an Object/hash/map/dictionary. See, for example, this SO question.
If you are using some type of object/model framework on top of Mongo, I'd be careful with (1). Some frameworks have a hard time with developers overriding id generation. For example, I've had bad experience with Mongoid in Ruby. In that case, (2) may be the safer way to go as the generation happens at the driver layer.

Mongodb store and select order

Basic question. Does mongodb find command will always return documents in the order they where added to collection? If no how is it possible to implement selection docs in the right order?
Sort? But what if docs where added simultaneously and say created date is the same, but there was an order still.

Well, yes and ... not exactly.
Documents are default sorted by natural order. Which is initially the order the documents are stored on disk, which is indeed the order in which the documents had been added to a collection.
This order however, is not deterministic, as document may be moved on disk once these documents grow after update operations, and can't be fit into current space anymore. This way the initial (insert) order may change.
The way to guarantee insert order sort is sort by {_id : 1} as long as the _id is of type ObjectId. This will return your documents sorted in ascending order.
Write operations do not take place simultaneously. Write locks are imposed in database level (V 2.4 and on). The first four bytes of _id is insert timestamp, and 3 last digits is a random counter used to distinguish (and sort) between ObjectId instances with same timestamp.
_id field is indexed by default

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse