How to solve concurrent read + write update to the document in MongoDB? - mongodb

I have an application that runs multiple instances and that needs to perform this:
1. Read the document.
2. Check the value of the timestamp field.
3. If the incoming document is newer, update the stored document (including the timestamp field).
Even if I use transactions, two parallel operations could perform steps 1 and 2 simultaneously and then both write to the database, potentially writing the "older" document last (since each of them checks the timestamp of the original document rather than the "new" one).
So what I am looking for is some kind of a read lock on the document or some other mechanism that will be able to solve this.
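One possible way to avoid the race, sketched here under assumptions rather than taken from this thread: fold the timestamp check into the filter of the update itself, so that the comparison and the write happen in a single-document operation, which MongoDB applies atomically. The field names `_id` and `timestamp`, the helper names, and the PyMongo-style `collection.update_one` call are assumptions for illustration.

```python
def newer_only_update(incoming):
    """Build a (filter, update) pair that only matches an older stored document.

    Assumes documents carry an `_id` and a comparable `timestamp` field.
    """
    flt = {
        "_id": incoming["_id"],
        # Only match if the stored timestamp is strictly older.
        "timestamp": {"$lt": incoming["timestamp"]},
    }
    # _id is immutable, so leave it out of the $set payload.
    fields = {k: v for k, v in incoming.items() if k != "_id"}
    return flt, {"$set": fields}


def apply_if_newer(collection, incoming):
    """Return True if the stored document was older and has been updated.

    `collection` is assumed to behave like a PyMongo collection.
    """
    flt, update = newer_only_update(incoming)
    result = collection.update_one(flt, update)
    # matched_count is 0 when the stored document is missing or not older.
    return result.matched_count == 1
```

If two writers race, the second one to arrive re-evaluates the filter against the already updated document, so an older incoming document simply matches nothing. Inserting a document that does not exist yet is left out of this sketch.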

Related

Firestore full collection update for schema change

I am attempting to figure out a solid strategy for handling schema changes in Firestore. My thinking is that schema changes would often require reading and then writing to every document in a collection (or possibly documents in a different collection).
Here are my concerns:
I don't know how large the collection will be in the future. Will I hit any limitations on how many documents can be read in a single query?
My current plan is to run the schema change script from Cloud Build. Is it possible this will timeout?
What is the most efficient way to do the actual update? (e.g. read document, write update to document, repeat...)
Should I be using batched writes?
Also, feel free to tell me if you think this is the completely wrong approach to implementing schema changes, and suggest a better solution.
I don't know how large the collection will be in the future. Will I hit any limitations on how many documents can be read in a single query?
If the number of documents gets too large to handle in a single query, you can start paginating the results.
My current plan is to run the schema change script from Cloud Build. Is it possible this will timeout?
That's impossible to say at this moment.
What is the most efficient way to do the actual update? (e.g. read document, write update to document, repeat...)
If you need the existing contents of a document to determine its new contents, then you'll indeed need to read it. If you don't need the existing contents, all you need is the path, and you can consider using the Node.js API to only retrieve the document IDs.
Should I be using batched writes?
Batched writes have no performance advantages. In fact, they're often slower than sending the individual update calls in parallel from your code.
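As a rough illustration of the pagination idea mentioned above, here is a minimal, storage-agnostic sketch. `fetch_page` is a hypothetical callable, not a Firestore API: it is assumed to return up to `page_size` items ordered by some key and positioned after `cursor` (with the Firestore client this would typically be built from an ordered query combined with `start_after` and `limit`).

```python
def iter_pages(fetch_page, page_size=500):
    """Yield pages of results until the source is exhausted.

    fetch_page(cursor, page_size) is a hypothetical callable that returns
    a list of at most page_size items that come after `cursor`
    (or from the start when cursor is None).
    """
    cursor = None
    while True:
        page = fetch_page(cursor, page_size)
        if not page:
            return
        yield page
        if len(page) < page_size:
            # A short page means there is nothing left to read.
            return
        # Resume after the last item of the page just processed.
        cursor = page[-1]
```

Each page can then be migrated independently, which keeps memory use bounded and lets an interrupted run resume from the last cursor it reached.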

MongoDB - Save vs Update [duplicate]

I have around 400 fields in my collection (both top-level and embedded). The write queries have the following characteristics:
All write queries update a single document and, on average, 60 fields in that document.
There are indexed fields in collection but no write query updates an indexed field.
Volume of write queries is very large.
I can use either .save() or .update() to update the document. In update I only pass the fields that need to be updated, whereas in save I pass the entire document. I want to know if using update in this case will give me better performance than save (or vice versa) or does it not make any difference at the database level and both perform equally well?
It doesn't make any significant difference in performance, for the reasons below.
When you save or update a document in MongoDB, you typically call save or update from another application, which could be written in C#, Java, JavaScript, PHP or some other language.
In that case there is inter-process communication (or a network call if MongoDB is running on another machine). Compared to that cost, the difference in time between selectively modifying a document with update and completely replacing it with save is negligible. By the way, save and update will both probably have a run-time complexity of O(n) if there are no indexes.
For a document with 250 fields, the document is probably not large enough for its size to matter. If the update document is significantly smaller than the full document passed to save, use update.
Otherwise, use whichever of save or update is more elegant in the client-side code.
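To make the "only pass the fields that changed" option concrete, here is a minimal sketch that derives a `$set` document from the old and new versions of a document, using dot notation for embedded fields. The helper names are made up for illustration, and removed fields are deliberately ignored (no `$unset` is generated).

```python
def changed_fields(old, new, prefix=""):
    """Return {dotted_path: new_value} for fields that differ in `new`."""
    changes = {}
    for key, value in new.items():
        path = f"{prefix}{key}"
        if key in old and isinstance(value, dict) and isinstance(old[key], dict):
            # Recurse so that only the changed embedded fields are sent.
            changes.update(changed_fields(old[key], value, prefix=f"{path}."))
        elif key not in old or old[key] != value:
            changes[path] = value
    return changes


def build_update(old, new):
    """Return a {"$set": ...} update document, or None if nothing changed."""
    changes = changed_fields(old, new)
    return {"$set": changes} if changes else None
```

For a document where about 60 of 400 fields change, the resulting update document is correspondingly smaller than the full document that save would send.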

MongoDB TTL but to do other stuff

I have a requirement that when a date attribute field is passed, that we would like to trigger two things:
to move the record to be deleted to another table.
to call a function to do other actions.
I understand TTL is only to delete a record when the date field is tripped. Can I hook extra logic to it?
Thanks!
Depending on the requirements there could be quite a few ways to do this.
One way is to execute a script periodically, and run a query to filter documents that have passed certain date value. For each of the documents, perform a document migration to another table and extra actions.
An alternative is to use MongoDB Change Streams. The catch, however, is that delete events from a change stream do not return the document itself (because it has already been deleted).
Instead, if you update a field on documents that have passed a certain date value, you can listen for the update events. For example, set a field to expired:true.
Worth mentioning that if you're going down the route of change streams update events, you could utilise MongoDB Stitch Triggers (relying on change streams). MongoDB Stitch database triggers allow you to automatically execute Stitch functions in response to changes in your MongoDB database.
I suggest writing a function and calling it via a scheduler. That would be the better option.
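The scheduled-script approach described above could look roughly like the sketch below. It assumes PyMongo-style collection objects, a date field named `expireAt`, and a caller-supplied `on_expired` function; all three names are assumptions for illustration.

```python
def archive_expired(source, archive, now, on_expired):
    """Move documents whose expireAt has passed and run an extra action.

    `source` and `archive` are assumed to behave like PyMongo collections.
    Returns the number of documents moved.
    """
    moved = 0
    for doc in source.find({"expireAt": {"$lte": now}}):
        # Copy first; the upsert makes a re-run after a crash harmless.
        archive.replace_one({"_id": doc["_id"]}, doc, upsert=True)
        # Extra action; it may run twice if a crash happens before the delete.
        on_expired(doc)
        # Delete last, so a document is never lost without being archived.
        source.delete_one({"_id": doc["_id"]})
        moved += 1
    return moved
```

The function would be invoked periodically by whatever scheduler is available (cron or similar), passing the current time as `now`.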

What is the preferred way to add many fields to all documents in a MongoDB collection?

I have a Python application that iterates through every document in a MongoDB (3.0.2) collection (typically between 10K and 1M documents) and adds new fields (probably doubling or tripling the number of fields in each document).
My initial thought was that I would upsert the entire revised documents (using pyMongo), but now I'm questioning that:
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
This is actually a great question that can be solved a few different ways, depending on how you are managing your data.
If you are upserting additional fields, does this mean that new fields are appended at a later point in time and that these additions are the only changes? If so, you could set a TTL on your documents so that the old ones drop off over time. Keep in mind that if you do this, you will want an index that sorts your results by descending _id so that the most recent additions are selected before the older ones.
The benefit of doing it this way is that you are continually writing data as opposed to seeking and updating data, so it is faster.
Regarding upserts vs. bulk inserts: bulk inserts are always faster than upserts, since bulk upserting requires finding the original document first.
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
You really need to understand your data fully to determine what is best. But if the only change to the data is additional fields, or changes that only need to be considered from that point forward, then bulk inserting and setting a TTL on your older data is the better method from the standpoint of write operations, as opposed to seek, find and update. When using this method you will want db.document.find_one() as opposed to db.document.find(), so that only your current record is returned.
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
Bulk inserts will be faster than inserting each document sequentially.
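For the other option raised in the question (sending only the new fields, in bulk), a minimal sketch of the batching side is shown below. It is an alternative to the insert-and-TTL approach described in the answer, not part of it. `compute_new_fields` is a hypothetical callable supplied by the application, and the batch size of 1000 is an arbitrary assumption.

```python
def field_update_batches(docs, compute_new_fields, batch_size=1000):
    """Yield lists of (filter, update) pairs that add new fields to documents.

    compute_new_fields(doc) is a hypothetical callable returning a dict of
    the fields to add to `doc`; documents with nothing to add are skipped.
    """
    batch = []
    for doc in docs:
        new_fields = compute_new_fields(doc)
        if not new_fields:
            continue
        # Send only the new fields, not the whole revised document.
        batch.append(({"_id": doc["_id"]}, {"$set": new_fields}))
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch
```

With PyMongo, each pair could be wrapped as `pymongo.UpdateOne(flt, update)` and a whole batch sent in one round trip with `collection.bulk_write(ops, ordered=False)`.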

Solution to Bulk FindAndModify in MongoDB

My use case is as follows -
I have a collection of documents in mongoDB which I have to send for analysis.
The format of the documents is as follows -
{ _id:ObjectId("517e769164702dacea7c40d8") ,
date:"1359911127494",
status:"available",
other_fields... }
I have a reader process which picks the first 100 documents with status:available, sorted by date, and modifies them with status:processing.
ReaderProcess sends the documents for analysis. Once the analysis is complete, the status is changed to processed.
Currently the reader process first fetches 100 documents sorted by date and then updates the status to processing for each document in a loop. Is there a better or more efficient solution for this case?
Also, in future, for scalability, we might go with more than one reader process.
In this case, I want the 100 documents picked by one reader process not to be picked by another reader process. But fetching and updating are separate queries right now, so it is very much possible that multiple reader processes pick the same documents.
Bulk findAndModify (with limit) would have solved all these problems. But unfortunately it is not provided in MongoDB yet. Is there any solution to this problem?
As you mention, there is currently no clean way to do what you want. The best approach at this time for operations like the one you need is this:
1. Reader selects X documents with appropriate limit and sorting.
2. Reader marks the documents returned by 1) with its own unique reader ID (e.g. update({_id:{$in:[<result set ids>]}, status:"available", $isolated:1}, {$set:{readerId:<your reader's ID>, status:"processing"}}, false, true))
3. Reader selects all documents marked as processing and with its own reader ID. At this point it is guaranteed that you have exclusive access to the resulting set of documents.
4. Offer the result set from 3) for your processing.
Note that this works even in highly concurrent situations, as a reader can never reserve documents already reserved by another reader (step 2 can only reserve currently available documents, and writes are atomic). I would add a timestamp with the reservation time as well if you want to be able to time out reservations (for example for scenarios where readers might crash or fail).
EDIT: More details:
All write operations can occasionally yield for pending operations if the write takes a relatively long time. This means that step 3) might not see all documents marked by step 2) unless you take the following steps:
Use an appropriate "w" (write concern) value, meaning 1 or higher. This will ensure that the connection on which the write operation is invoked will wait for it to complete regardless of it yielding.
Make sure you do the read in step 3 on the same connection (only relevant for replica sets with slaveOk-enabled reads) or thread, so that they are guaranteed to be sequential. The former can be done in most drivers with the "requestStart" and "requestDone" methods or similar (Java documentation here).
Add the $isolated flag to your multi-updates to ensure they cannot be interleaved with other write operations.
Also see comments for discussion regarding atomicity/isolation. I incorrectly assumed multi-updates were isolated. They are not, or at least not by default.
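The select, mark, re-select sequence described in this answer can be sketched as follows. This is an illustration under assumptions, not the answerer's code: `collection` is assumed to behave like a PyMongo collection, the field names follow the question (`status`, `date`) plus the `readerId` field from the answer, and `$isolated` is left out because it was removed in MongoDB 4.0.

```python
def reserve_batch(collection, reader_id, limit=100):
    """Reserve up to `limit` available documents for one reader.

    Returns the documents this reader actually managed to reserve.
    """
    # Step 1: pick candidate ids, oldest first.
    candidates = (
        collection.find({"status": "available"}, {"_id": 1})
        .sort("date", 1)
        .limit(limit)
    )
    ids = [doc["_id"] for doc in candidates]
    if not ids:
        return []
    # Step 2: mark only candidates that are still available. Each document
    # is updated atomically, so another reader cannot claim the same one.
    collection.update_many(
        {"_id": {"$in": ids}, "status": "available"},
        {"$set": {"status": "processing", "readerId": reader_id}},
    )
    # Step 3: read back what this reader really got. Restricting to `ids`
    # is an addition here, so earlier unfinished batches are not returned.
    return list(
        collection.find(
            {"_id": {"$in": ids}, "status": "processing", "readerId": reader_id}
        )
    )
```

A reader that loses the race on some documents simply gets a smaller batch back. Storing a reservation timestamp alongside `readerId`, as the answer suggests, would allow stale reservations to be released later.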