Meteor.js: MongoDB bulk/batch Upsert

I have been scouring the internet for a few hours looking for an efficient way to do bulk upserts into Meteor.js's smart collections.
Scenario:
I am hitting an API to get updated info for 200 properties asynchronously every 12 hours. For each property I get an array of about 300 JSON objects on average. Around 70% of the objects might not have changed, but the remaining 30% need to be updated in the database. Since there is no way to identify that 30% without matching them against the documents already in the database, I decided to upsert all documents.
My options:
Option 1: Run a loop over the objects array and upsert each document in the database.
Option 2: Remove all documents from the collection and bulk insert the new objects.
For option 1, running a loop and upserting 60K documents (a number that will only grow over time) takes a lot of time, but at the moment it seems like the only plausible option.
For option 2, Meteor.js does not allow bulk inserts into its smart collections, so even there we would have to loop over the array of objects.
Is there another option where I can achieve this efficiently?

MongoDB supports inserting an array of documents, so you can insert all your documents in one call through Meteor's rawCollection().
MyCollection.remove({}); // the collection is now empty
var theRaw = MyCollection.rawCollection();
var mongoInsertSync = Meteor.wrapAsync(theRaw.insert, theRaw);
var result = mongoInsertSync(theArrayOfDocs);
In production code you will want to wrap this in a try/catch to get hold of the error if the insert fails; result is only meaningful if inserting the array of documents succeeds.
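A minimal sketch of that error handling, reusing the names from the snippet above:
var theRaw = MyCollection.rawCollection();
var mongoInsertSync = Meteor.wrapAsync(theRaw.insert, theRaw);

var result;
try {
  // result is only meaningful if the whole array was inserted successfully
  result = mongoInsertSync(theArrayOfDocs);
} catch (e) {
  // e is the error surfaced by the underlying driver insert; log or handle it here
  console.error('Bulk insert failed:', e);
}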

The above solution with rawCollection does insert, but it appears not to support the ordered: false directive for continuing to process when one document fails; the raw collection exits on the first error, which is unfortunate.
"If false, perform an unordered insert, and if an error occurs with one of documents, continue processing the remaining documents in the array."

Related

Mongoose skip and limit (MongoDB Pagination)

I am using skip and limit with Mongoose and MongoDB, but it returns documents in what appears to be a random order, and sometimes it returns the same documents.
I am trying to fetch a limited number of documents and, after skipping, get the next batch of documents.
I guess you are trying to implement pagination here. In both SQL and NoSQL, the data must be sorted in a consistent order to paginate without getting jumbled results on each db call.
for example:
await Event.find({}).sort({ createdDate: 1 }).skip(10).limit(10); // 1 = ascending, -1 = descending
This fetches documents 10-20 in the above case, sorted by createdDate, so the data won't be shuffled between fetches unless you insert or delete documents.
I hope this answers your question
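For reference, a sketch of how this extends to general offset pagination; Event and createdDate come from the answer above, while page and pageSize are hypothetical inputs:
const mongoose = require('mongoose');

// Assumes an 'Event' model with a createdDate field has already been registered
const Event = mongoose.model('Event');

async function getEventsPage(page, pageSize) {
  return Event.find({})
    .sort({ createdDate: 1 })      // stable order; use -1 for newest first
    .skip((page - 1) * pageSize)   // skip the earlier pages
    .limit(pageSize)               // return exactly one page
    .exec();
}

// Usage: second page of 10 events (documents 11-20 in createdDate order)
// const events = await getEventsPage(2, 10);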

Efficiently duplicate documents in MongoDB

I would like to find out the most efficient way to duplicate documents in MongoDB: I want to take a bunch of documents from an existing collection, update one of their fields, unset _id to generate a new one, and push them back into the collection to create duplicates.
This is typically to create a "branching" feature in MongoDB, allowing users to modify data in two separate branches at the same time.
I've tried the following things:
On my server, fetch the data in chunks across multiple threads, modify it, and insert the modified data with a new _id into the database (see the sketch after this question).
This basically works, but performance is not great (~20 s for 1 million elements).
In a future MongoDB version (tested on version 4.1.10), use the new $out aggregation mechanism to insert into the same collection.
This does not seem to work and raises the error "errmsg" : "$out with mode insertDocuments is not supported when the output collection is the same as the aggregation collection".
Any ideas how to be faster than the first approach? Thanks!
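For reference, the first approach above could look roughly like this in mongo shell syntax; the collection name, the branch field and the batch size are placeholders, and this sketch is single-threaded rather than split across threads:
var batch = [];
db.documents.find({ branch: 'main' }).forEach(function (doc) {
  delete doc._id;            // let MongoDB generate a new _id
  doc.branch = 'feature';    // the single field being updated
  batch.push(doc);
  if (batch.length === 1000) {
    db.documents.insertMany(batch, { ordered: false });
    batch = [];
  }
});
if (batch.length > 0) {
  db.documents.insertMany(batch, { ordered: false });
}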

MongoDB bulk update efficiency using forEach

How would you approach bulk/batch updating documents (up to 10k docs) coupled with forEach?
(There are no specific criteria to update by; this is used for random document selection.)
I'm looking at two options:
Collect all document _id in the forEach closure into an array and afterwards update using
collection.update({_id : {$in : idsArray}}, ...)
Add update queries in the forEach closure to a bulk operation and execute once done, somewhere along the lines of
bulk.find({_id: doc._id}).updateOne({...});
bulk.execute();
I'm going to benchmark this soon, but I would like to know what's more I/O efficient and considered 'smart' with Mongo.
OK, so I've benchmarked the two options.
TL;DR option one is twice as fast, so collect ids and update once.
For future reference, some more details:
Total number of documents in db is around 500k.
Documents contain around 20-25 fields each.
Did an update on 10-30k documents.
Results (times are machine specific, but the relative difference is what matters):
One update with ids array: 200-500ms.
Bulk update: 600-1000ms.
Looking back, I thought bulk might be faster because of some hidden optimization, but the logic is simple: fewer operations probably means faster, hence bulk is slower.
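For readers who want to reproduce the comparison, the two options look roughly like this in mongo shell syntax; the collection name, the selection criteria and the $set payload are placeholders:
// Option 1: collect the _ids, then issue a single multi-document update
var ids = [];
db.docs.find(someCriteria).forEach(function (doc) {
  ids.push(doc._id);
});
db.docs.update({ _id: { $in: ids } }, { $set: { processed: true } }, { multi: true });

// Option 2: queue one update per document in an unordered bulk operation
var bulk = db.docs.initializeUnorderedBulkOp();
db.docs.find(someCriteria).forEach(function (doc) {
  bulk.find({ _id: doc._id }).updateOne({ $set: { processed: true } });
});
bulk.execute();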

Get nth item from a collection

I'm in the learning phase of MongoDB.
I have a test website project where each step of a story is served at domain.com/step.
For instance, step 14 is accessed through domain.com/14.
In other words, for the above case, I will need to access the 14th document in my collection to serve it.
I've been using the find().skip(n).limit(1) method for this so far to return the nth document; however, it becomes extremely slow when there are too many documents to skip, so I need a more efficient way to get the nth document in my collection.
Any ideas are appreciated.
Add a field to your documents which tells you which step it is, add an index to that field and query by it.
Document:
{
  step: 14,
  text: "text",
  date: date,
  imageurl: "imageurl"
}
Index:
db.collection.createIndex({step:1});
Query:
db.collection.find({step:14});
Relying on natural order in the collection is not just slow (as you found out); it is also unreliable. When you start a new collection and insert a bunch of documents, you will usually find them in the order you inserted them. But when you change documents after they were inserted, the order can get messed up in unpredictable ways. So never rely on insertion order being consistent.
Exception: Capped Collections guarantee that insertion order stays consistent. But there are very few use-cases where these are useful, and I don't think you have such a case here.

What is the preferred way to add many fields to all documents in a MongoDB collection?

I have a Python application that iterates through every document in a MongoDB (3.0.2) collection (typically between 10K and 1M documents) and adds new fields (probably doubling/tripling the number of fields in each document).
My initial thought was to upsert the entire revised document (using PyMongo); now I'm questioning that:
Given that the revised documents are significantly bigger, should I be inserting only the new fields, or just replacing the document?
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
This is actually a great question that can be solved in a few different ways, depending on how you are managing your data.
If you are upserting additional fields, does that mean your data only gains additional fields at a later point in time, with the addition of those fields being the only change? If so, you could set a TTL on your documents so that the old ones drop off over time. Keep in mind that if you do this, you will want an index that sorts your results by descending _id so that the most recent additions are selected before the older ones.
The benefit of doing it this way is that you are continually writing data as opposed to seeking and then updating it, so it is faster.
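In mongo shell syntax, that setup might look something like this; createdAt is an assumed date field (TTL indexes only work on date fields), and the expiry window and the recordKey used to fetch the latest version are placeholders:
// Expire superseded documents automatically after 30 days (placeholder window)
db.document.createIndex({ createdAt: 1 }, { expireAfterSeconds: 60 * 60 * 24 * 30 });

// Fetch only the most recent version of a record; the default _id index already
// supports the descending sort, roughly equivalent to find_one() with a sort in PyMongo
db.document.find({ recordKey: 'some-key' }).sort({ _id: -1 }).limit(1);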
In regards to upserts vs. bulk inserts: bulk inserts are always faster than upserts, since upserting requires finding the original document first.
Given that the revised documents are significantly bigger, should I be inserting only the new fields, or just replacing the document?
You really need to understand your data fully to determine what is best, but if the only change to the data is additional fields, or changes that only need to be considered from that point forward, then bulk inserting and setting a TTL on your older data is the better method from the standpoint of write operations, as opposed to seek, find and update. When using this method you will want to use db.document.find_one() as opposed to db.document.find(), so that only your current record is returned.
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
Bulk inserts will be faster than inserting each document sequentially.
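As a rough illustration of the difference, in mongo shell syntax (db.document and newDocs are placeholders):
// One round trip for the whole batch
db.document.insertMany(newDocs);

// Versus one round trip per document
newDocs.forEach(function (doc) {
  db.document.insert(doc);
});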