MongoDB bulk update efficiency using forEach

How would you approach bulk / batch updating documents (up to 10k docs) coupled with forEach?
(There are no specific criteria to update by; forEach is used for random document selection.)
I'm looking at two options:
Collect all document _ids in the forEach closure into an array and afterwards update once using
collection.update({_id : {$in : idsArray}}, ...)
Add update queries in the forEach closure to a bulk operation and execute once done, something along the lines of
bulk.find({_id: doc._id}).updateOne({...});
bulk.execute();
I'm going to benchmark this soon, but I would like to know which is more I/O efficient and considered 'smart' with Mongo.

OK, so I've benchmarked the two options.
TL;DR option one is twice as fast, so collect ids and update once.
For future reference, some more details about the benchmark:
Total number of documents in db is around 500k.
Documents contain around 20-25 fields each.
Did an update on 10-30k documents.
Results (times are machine specific, but the relative difference is what matters):
One update with ids array: 200-500ms.
Bulk update: 600-1000ms.
Looking back, I assumed bulk might be faster because of some hidden optimization, but the simpler logic holds: fewer operations generally means faster, hence the single update wins and bulk is slower.
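For reference, a minimal mongo shell sketch of the two options compared above (the selection criteria, the 10k limit, and the { processed: true } payload are illustrative assumptions, not from the question):

// Option 1: collect the _ids, then issue a single multi-update
var idsArray = [];
db.collection.find(/* random selection criteria */).limit(10000).forEach(function (doc) {
    idsArray.push(doc._id);
});
db.collection.update(
    { _id: { $in: idsArray } },
    { $set: { processed: true } },   // illustrative payload
    { multi: true }
);

// Option 2: queue one update per document in a bulk operation and execute once
var bulk = db.collection.initializeUnorderedBulkOp();
db.collection.find(/* random selection criteria */).limit(10000).forEach(function (doc) {
    bulk.find({ _id: doc._id }).updateOne({ $set: { processed: true } });
});
bulk.execute();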

Related

Mongo DB update query performance

I would like to understand which of the queries below would be faster when doing updates in MongoDB. I want to update a few thousand records in one go.
Accumulating the object ids of those records and firing the update using $in, or using a bulk update?
Using one or two fields in the collection which are common to those few thousand records - akin to a "where" clause in SQL - and firing an update using those fields. These fields might or might not be indexed.
I know that the query will be much smaller in the second case, as every single "_id" (oid) is not accumulated. Does accumulating _ids and using those to update documents offer any practical performance advantages?
Does accumulating _ids and using those to update documents offer any practical performance advantages?
Yes because MongoDB will certainly use the _id index (idhack).
In the second method - as you observed - you can't tell whether or not an index will be used for a certain field.
So the answer will be: it depends.
If your collection has millions of documents or more, and/or the number of search fields is quite large, you should prefer the first method, especially if the id list is not small and/or the id values are adjacent.
If your collection is pretty small and you can tolerate a full scan you may prefer the second approach.
In any case, you should verify both methods using explain().
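As a rough way to make that comparison, here is a mongo shell sketch using the explainable-collection syntax available since MongoDB 3.0 (the collection name, field names and update payload are illustrative assumptions):

var idsArray = [ /* accumulated ObjectIds */ ];

// Shape 1: update by accumulated _ids (should report an _id index / IDHACK plan)
db.records.explain("executionStats").update(
    { _id: { $in: idsArray } },
    { $set: { status: "processed" } },
    { multi: true }
);

// Shape 2: update by shared fields (the plan depends on whether those fields are indexed)
db.records.explain("executionStats").update(
    { field1: "x", field2: "y" },
    { $set: { status: "processed" } },
    { multi: true }
);
// Compare the winning plan and totalDocsExamined in the two outputs.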

What is the preferred way to add many fields to all documents in a MongoDB collection?

I have a Python application that is iteratively going through every document in a MongoDB (3.0.2) collection (typically between 10K and 1M documents) and adding new fields (probably doubling or tripling the number of fields in the document).
My initial thought was that I would upsert the entirety of the revised documents (using PyMongo) - now I'm questioning that:
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
This is actually a great question that can be solved a few different ways, depending on how you are managing your data.
If you are upserting additional fields, does this mean your data is appended with additional fields at a later point in time, with the only change being the addition of those fields? If so, you could set a TTL on your documents so that the old ones drop off over time. Keep in mind that if you do this, you will want an index that sorts your results by descending _id so that the most recent additions are selected before the older ones.
The benefit of doing it this way is that you are continually writing data as opposed to seeking and updating data, so it is faster.
In regards to upserts vs bulk inserts: bulk inserts are always faster than upserts, since upserting requires you to find the original document first.
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
You really need to understand your data fully to determine what is best, but if the only change to the data is additional fields, or changes that only need to be considered from that point forward, then bulk inserting and setting a TTL on your older data is the better method from the standpoint of write operations, as opposed to seek, find and update (see the sketch after this answer). When using this method you will want to use db.document.find_one() as opposed to db.document.find() so that only your current record is returned.
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
Bulk inserts will be faster than inserting each document sequentially.
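A rough mongo shell sketch of the append-and-expire pattern described above (the collection name, field names and TTL window are illustrative assumptions, not from the question):

// TTL index: documents expire some time after their createdAt timestamp
db.documents.createIndex({ createdAt: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 7 });

// Instead of updating in place, append a new, enriched version of the document
db.documents.insert({ entityId: 42, createdAt: new Date() /* , original fields + new fields */ });

// Read only the most recent version for an entity; sorting by descending _id puts the newest first
db.documents.find({ entityId: 42 }).sort({ _id: -1 }).limit(1);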

Slow Upserts with PyMongoDB

I'm trying to insert ~800 million records into MongoDB using PyMongo on a MacBook Air (1.7GHz i7) with no multi-threading. The documents are structured as below:
Records I'm reading are the following tuple:
(user_id,imp_date,imp_creative,imp_pid,geo_id)
I'm creating my own _id field based on the user_id in the file I'm reading from.
{_id:user_id,
'imp_date':[array of dates],
'imp_creative':[array of numeric ids],
'imp_pid':[array of numeric ids],
'geo_id':numeric id}
I'm using an upsert with $push to append date, creative id, and pid for the corresponding arrays
self.collection.update({'_id':uid},
{"$push":{'imp_date':<datevalue>,
'imp_creative':<creative_id>,
'imp_pid':<pid>}},safe=True,upsert=True)
I'm using an upsert with $set to overwrite the geographic location (I only care about the most recent).
self.collection.update({'_id':uid},
{"$set":{'geo_id':<geo id>}},safe=True,upsert=True)
I'm only writing about 1,500 records per second (8,000 if I set safe=False). My question is: what can I do to speed this up further (ideally 20k/second or faster)?
Ideas I can't find a definitive recommendation on:
- Using multiple threads to insert data
- Sharding
- Padding arrays (my arrays grow very slowly; each document's array will have an average length of ~4 at the end of the file)
- Turning journaling off
Apologies if I've left out any required information, this is my first post.
1- You could add an index to speed it up; an index would help you find the documents faster, although the inserts would be slower (you have to update the index as well). Whether the improvement in the retrieving phase compensates for the extra time to update the index depends on how many records you have in the collection, how many indexes you have and how complicated those indexes are.
However, in your case you are only querying by _id, so there's not much more you can do with indexes.
2- Are you using two consecutive updates? I mean, one for the $set and one for the $push?
If that's true, then you should definitely use just one:
self.collection.update({'_id':uid},
{"$push":{'imp_date':<datevalue>,
'imp_creative':<creative_id>,
'imp_pid':<pid>},
"$set":{'geo_id':<geo id>}},
safe=True,upsert=True)
3- The update operation is an atomic operation which might lock other queries. If the document you are about to update is not already in RAM but on disk, Mongo will have to first fetch it from disk and then update it. If you do a find operation first (which doesn't block, as it's a read-only operation), the document will be in RAM for sure, so the update operation (the locking one) will be faster:
self.collection.find_one({'_id':uid})
self.collection.update({'_id':uid},
{"$push":{'imp_date':<datevalue>,
'imp_creative':<creative_id>,
'imp_pid':<pid>},
"$set":{'geo_id':<geo id>}},
safe=True,upsert=True)
4- If your documents don't grow too much, as you have said, it won't be necessary to worry about the padding factor and reallocation issues. Furthermore, in some recent versions (I can't remember if it was since 2.2 or 2.4) collections are created with the powerOfTwo allocation option enabled by default.
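For reference, on servers where power-of-two allocation is not yet the default, it can be enabled per collection with the collMod command; a minimal shell sketch (the collection name is a hypothetical placeholder, since the question never names it):

// Enable power-of-two record allocation for an existing MMAPv1-era collection
db.runCommand({ collMod: "impressions", usePowerOf2Sizes: true });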

Partial doc updates to a large mongo collection - how to not lock up the database?

I've got a mongo db instance with a collection in it which has around 17 million records.
I wish to alter the document structure (to add a new attribute to the document) of all 17 million documents, so that I don't have to programmatically deal with different structures as well as to make queries easier to write.
I've been told though that if I run an update script to do that, it will lock the whole database, potentially taking down our website.
What is the easiest way to alter the document without this happening? (I don't mind if the update happens slowly, as long as it eventually happens)
The query I'm attempting to do is:
db.history.update(
{ type : { $exists: false }},
{
$set: { type: 'PROGRAM' }
},
{ multi: true }
)
You can update the collection in batches (say half a million per batch); this will distribute the load (a sketch follows below).
I created a collection with 20000000 records and ran your query on it. It took ~3 minutes to update on a virtual machine and I could still read from the db in a separate console.
> for(var i=0;i<20000000;i++){db.testcoll.insert({"somefield":i});}
The locking in mongo is quite lightweight, and it is not going to be held for the whole duration of the update. Think of it like 20000000 separate updates. You can read more here:
http://docs.mongodb.org/manual/faq/concurrency/
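A possible shell sketch of that batching approach, using the field and value from the question (the batch size and the pause between batches are illustrative assumptions; the answer suggests batches of up to half a million):

var batchSize = 100000;
while (true) {
    // Collect the _ids of one batch of documents that still lack the field
    var ids = db.history.find({ type: { $exists: false } }, { _id: 1 })
                        .limit(batchSize)
                        .map(function (doc) { return doc._id; });
    if (ids.length === 0) { break; }
    db.history.update(
        { _id: { $in: ids } },
        { $set: { type: 'PROGRAM' } },
        { multi: true }
    );
    // Pause briefly between batches to give other operations room to run
    sleep(1000);
}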
You do actually care if your update query is slow, because of the write lock problem on the database that you are aware of; the two are tightly linked. It's not a simple read query here; you really want this write query to be as fast as possible.
Tuning the "find" part of the query is part of the key here. First, since your collection has millions of documents, it's a good idea to keep field names as short as possible (ideally a single character: type => t). This helps because of the schemaless nature of MongoDB collections: the field names are stored in every document.
Second, and more importantly, you need to make your query use a proper index. For that you need to work around the $exists operator, which is not optimized (there are actually several ways to do that).
Third, you can work on the field values themselves. Use http://bsonspec.org/#/specification to estimate the size of the value you want to store, and possibly pick a better choice (in your case, you could replace the 'PROGRAM' string with a numeric constant, for example, and gain a few bytes in the process, multiplied by the number of documents touched by each multi-update). The smaller the data you want to write, the faster the operation will be.
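As a concrete illustration of those last two suggestions (the one-character field name 't' and the numeric constant are illustrative choices; the query shape still uses $exists, which the paragraph above suggests working around):

// Same update as in the question, with a short field name and a numeric value
db.history.update(
    { t: { $exists: false } },
    { $set: { t: 1 } },        // 1 stands for 'PROGRAM'
    { multi: true }
);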
A few links to other questions which can inspire you:
Can MongoDB use an index when checking for existence of a field with $exists operator?
Improve querying fields exist in MongoDB

Does providing a projection argument to find() limit the data that is added to Mongo's working set?

In Mongo, suppose I have a collection mycollection that has fields a, b, and huge. I very frequently want to perform queries, mapreduce, updates, etc. on a and b, and only occasionally want to return huge in query results as well.
I know that db.mycollection.find() will scan the entire collection and result in Mongo attempting to add the whole collection to the working set, which may exceed the amount of RAM I have available.
If I instead call db.mycollection.find({}, { a : 1, b : 1 }), will this still result in the whole collection being added to the working set or only the terms of my projection?
MongoDB can use something called covered queries: http://docs.mongodb.org/manual/applications/indexes/#create-indexes-that-support-covered-queries These allow you to return all of the values from the index rather than from disk, or from memory if those documents happen to be in memory at the time.
Be warned that you cannot use covered queries on a full table scan; the condition, projection and sort must all be within the index, i.e.:
db.col.ensureIndex({a:1,b:1});
db.col.find({a:1}, {_id:0, a:1, b:1}).sort({b:1});
would work (the sort is optional, it is not strictly needed here). You can add _id to your index if you intend to return that too.
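A quick way to check whether a query is actually covered is to look at its explain() output; a shell sketch using the index created above (the relevant output field depends on the server version, as noted in the comments):

// Assuming the { a: 1, b: 1 } index from above already exists
db.col.find({ a: 1 }, { _id: 0, a: 1, b: 1 }).explain();
// Legacy (2.x) explain output shows "indexOnly" : true for a covered query;
// on 3.0+ run .explain("executionStats") and check for totalDocsExamined: 0.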
Map Reduce does not support covered queries; as far as I know there is no way to project only a certain set of fields into the MR - maybe there is some hack I do not know of. Map Reduce only supports a $match-like operator as its input query, with a separate parameter for the sort of the incoming query ( http://docs.mongodb.org/manual/applications/map-reduce/ ).
Note that for updates, I believe only atomic operations ( http://docs.mongodb.org/manual/tutorial/isolate-sequence-of-operations/ ), excluding findAndModify, do not load the document into your working set; however, "believe" is the keyword there.
Considering you need to do both MR and normal find-and-update on these records, I would strongly recommend you look into why you are paging in so much data and whether you really need to do it that often. It seems like you are trying to do too much processing, too frequently, in too short a window of time.
On the other hand, if this is a script which runs every night or so, then I would not worry too much about its excessive working set (e.g. a scoreboard recalculation script).