heterogeneous bulk update in mongodb

heterogeneous bulk update in mongodb - mongodb

I know that we can bulk update documents in mongodb with
db.collection.update( criteria, objNew, upsert, multi )
in one db call, but it's homogeneous, i.e. all those documents impacted are following one kind of criteria. But what I'd like to do is something like
db.collection.update([{criteria1, objNew1}, {criteria2, objNew2}, ...]
, to send multiple update request which would update maybe absolutely different documents or class of documents in single db call.
What I want to do in my app is to insert/update a bunch of objects with compound primary key, if the key is already existing, update it; insert it otherwise.
Can I do all these in one combine in mongodb?

That's two seperate questions. To the first one; there is no MongoDB native mechanism to bulk send criteria/update pairs although technically doing that in a loop yourself is bound to be about as efficient as any native bulk support.
Checking for the existence of a document based on an embedded document (what you refer to as compound key, but in the interest of correct terminology to avoid confusion it's better to use the mongo name in this case) and insert/update depending on that existence check can be done with upsert :
document A :
{
_id: ObjectId(...),
key: {
name: "Will",
age: 20
}
}
db.users.update({name:"Will", age:20}, {$set:{age: 21}}), true, false)
This upsert (update with insert if no document matches the criteria) will do one of two things depending on the existence of document A :
Exists : Performs update "$set:{age:21}" on the existing document
Doesn't exist : Create a new document with fields "name" and field
"age" with values "Will" and "20" respectively (basically the
criteria are copied into the new doc) and then the update is applied
($set:{age:21}). End result is a document with "name"="Will" and
"age"=21.
Hope that helps

we are seeing some benefits of $in clause.
our use case was to update the 'status' in a document for a large number number records.
In our first cut, we were doing a for loop and doing updates one by 1. But then we switched to using $in clause and that made a huge improvement.

There is no real benefit from doing updates the way you suggest.
The reason that there is a bulk insert API and that it is faster is that Mongo can write all the new documents sequentially to memory, and update indexes and other bookkeeping in one operation.
A similar thing happens with updates that affect more than one document: the update will traverse the index only once and update objects as they are found.
Sending multiple criteria with multiple criteria cannot benefit from any of these optimizations. Each criteria means a separate query, just as if you issued each update separately. The only possible benefit would be sending slightly fewer bytes over the connection. The database would still have to do each query separately and update each document separately.
All that would happen would be that Mongo would queue the updates internally and execute them sequentially (because only one update can happen at any one time), this is exactly the same as if all the updates were sent separately.
It's unlikely that the overhead in sending the queries separately would be significant, Mongo's global write lock will be the limiting factor anyway.

Related

Mongodb always increased "_id" field?

Is _id field in mongodb always increased for the next inserted document in the collection even if we have multiple shards? So if I have collection.watch do I always get higher _id field for the next document than for the prev one? I need this to implement catch-up subscription and not to lose any document. So on every processed document from collection.watch I store its _id and if crash - I can select all documents with _id > last_seen_id in addition to collection.watch.
Or do I have to use some sort of auto-incemented value? I don't wanna cause it will hurt performance a lot and kill reason of sharding.

ObjectIds are guaranteed to be monotonically increasing most of the time, but not all of the time. See What does MongoDB's documentation mean when it says ObjectIDs are "likely unique"? and Can a 4 byte timestamp value in MongoDb ObjectId overflow?. If you need a guaranteed monotonically increasing counter, you need to implement it yourself.
As you pointed out this isn't a trivial thing to implement in a distributed environment, which is why MongoDB doesn't provide this.
One possible solution:
Have a dedicated counter collection
Seed the collection with a document like {i: 1}
Issue find-and-modify operation that uses https://docs.mongodb.com/manual/reference/operator/update/inc/ and no condition (thus affecting all documents in the collection, i.e. the one and only document which is the counter)
Request the new document as the update result (e.g. https://docs.mongodb.com/ruby-driver/master/tutorials/ruby-driver-crud-operations/#update-options return_document: :after)
Use the returned value as the counter
This doesn't get you a queue. If you want a queue, there are various libraries and systems that provide queues.

which branches of a mongodb $or query were satisfied to include a document in the set returned?

Say I have a mongo $or query, something like { $or: [query1, query2, ... queryN] }, where each embedded query could be complex. Upon executing the query, a set of documents matching one or more of the embedded queries is returned. I would like to know which of the N embedded queries was satisfied for each document in the returned set, perhaps by adding a new field that I specify, eg. marks, into each returned document that would hold a list of the indexes of whichever of the queries was satisfied. I need this information to indicate how each document was identified in my application's interface.
I realize I could inspect the set once it is returned and determine the queries that were satisfied, but these queries could be arbitrarily complex and expensive to inspect - besides, this must have already been done inside mongo itself while doing the search.
I also realize I could run each of the N queries sequentially and then mark and merge the results into a growing set, but I want to save that overhead by running a single query instead of N queries.
And I realize that Mongo will certainly stop once the first satisfying query is found for each document, so I may not be able to get the complete set, but then I would at least like some assurance that the queries are executed in a certain order, say 1...N, and each document could be marked with its first satisfying index.
Does anyone know if there's a mechanism in Mongo that can do this?

You can use aggregation.
Use $addFields to add a new field for each query.
You could either $match first, and then add the fields, or add the fields first and on the added fields.

Duplicate efficiently documents in MongoDB

I would like to find-out the most efficient way to duplicate documents in MongoDB, given that I want to take a bunch of documents from an existing collection, update one of their field, unset _id to generate a new one, and push them back in the collection to create duplicates.
This is typically to create a "branching" feature in MongoDB, allowing users to modify data in two separate branches at the same time.
I've tried the following things:
In my server, get data chunks in multiple threads, modify data, and insert modified data with a new _id in the base
This basically works but performance is not super good (~ 20s for 1 million elements).
In the future MongoDB version (tested on version 4.1.10), use the new $out aggregation mechanism to insert in the same collection
This does not seem to work and raise an error message "errmsg" : "$out with mode insertDocuments is not supported when the output collection is the same as the aggregation collection"
Any ideas how to be faster than the first approach? Thanks!

What is the preferred way to add many fields to all documents in a MongoDB collection?

I have have a Python application that is iteratively going through every document in a MongoDB (3.0.2) collection (typically between 10K and 1M documents), and adding new fields (probably doubling/tripling the number of fields in the document).
My initial thought was that I would use upsert the entire of the revised documents (using pyMongo) - now I'm questioning that:
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
Also, is it better to perform a write to the collection on a document by document basis or in bulk?

this is actually a great question that can be solved a few different ways depending on how you are managing your data.
if you are upserting additional fields does this mean your data is appending additional fields at a later point in time with the only changes being the addition of the additional fields? if so you could set the ttl on your documents so that the old ones drop off over time. keep in mind that if you do this you will want to set an index that sorts your results by descending _id so that the most recent additions are selected before the older ones.
the benefit of this of doing it this way is that your are continually writing data as opposed to seeking and updating data so it is faster.
in regards to upserts vs bulk inserts. bulk inserts are always faster than upserts since bulk upserting requires you to find the original document first.
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
you really need to understand your data fully to determine what is best but if only change to the data is additional fields or changes that only need to be considered from that point forward then bulk inserting and setting a ttl on your older data is the better method from the stand point of write operations as opposed to seek, find and update. when using this method you will want to db.document.find_one() as opposed to db.document.find() so that only your current record is returned.
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
bulk inserts will be faster than inserting each one sequentially.

Partial doc updates to a large mongo collection - how to not lock up the database?

I've got a mongo db instance with a collection in it which has around 17 million records.
I wish to alter the document structure (to add a new attribute in the document) of all 17 million documents, so that I dont have to problematically deal with different structures as well as make queries easier to write.
I've been told though that if I run an update script to do that, it will lock the whole database, potentially taking down our website.
What is the easiest way to alter the document without this happening? (I don't mind if the update happens slowly, as long as it eventually happens)
The query I'm attempting to do is:
db.history.update(
{ type : { $exists: false }},
{
$set: { type: 'PROGRAM' }
},
{ multi: true }
)

You can update the collection in batches(say half a million per batch), this will distribute the load.
I created a collection with 20000000 records and ran your query on it. It took ~3 minutes to update on a virtual machine and i could still read from the db in a separate console.
> for(var i=0;i<20000000;i++){db.testcoll.insert({"somefield":i});}
The locking in mongo is quite lightweight, and it is not going to be held for the whole duration of the update. Think of it like 20000000 separate updates. You can read more here:
http://docs.mongodb.org/manual/faq/concurrency/

You do actually care if your update query is slow, because of the write lock problem on the database you are aware of, both are tightly linked. It's not a simple read query here, you really want this write query to be as fast as possible.
Updating the "find" part is part of the key here. First, since your collection has millions of documents, it's a good idea to keep the field name size as small as possible (ideally one single character : type => t). This helps because of the schemaless nature of mongodb collections.
Second, and more importantly, you need to make your query use a proper index. For that you need to workaround the $exists operator which is not optimized (several ways to do it there actually).
Third, you can work on the field values themselves. Use http://bsonspec.org/#/specification to estimate the size of the value you want to store, and eventually pick a better choice (in your case, you could replace the 'PROGRAM' string by a numeric constant for example and gain a few bytes in the process, multiplied by the number of documents to update for each update multiple query). The smaller the data you want to write, the faster the operation will be.
A few links to other questions which can inspire you :
Can MongoDB use an index when checking for existence of a field with $exists operator?
Improve querying fields exist in MongoDB

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse