Can lock yielding break query isolation? - mongodb

Mongo docs talk about queries yielding locks to avoid blocking other operations. Will Mongo yield the lock from a read to a write that changes the read result?
Say I've got docs {x:1}, {x:2}, {x:2}, {x:1} and I'm running find({x:2}). Assume the fourth doc isn't in the working set, so Mongo page faults and yields the lock to an update({x:1}, {$set:{x:2}}, {multi:true}), which completes and returns the lock to the find. The find would now include the fourth doc but omit the first doc. Does MongoDB work like this?

In MongoDB there is no guarantee of query isolation - in fact, across multiple documents, you are not guaranteed to be looking at the same point in time.
What you describe can absolutely happen, and it does. The same is true of multi-document queries that fetch a large number of documents in batches (when you use a cursor): when the driver issues a getMore for the next batch, the state of the data is not guaranteed to be the same as when you got the previous batch of results.
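To make this concrete, here is a minimal sketch (not from the original question) using pymongo against a throwaway local collection; the client and collection names are placeholders, and the exact result depends on where the batch boundaries fall:

from pymongo import MongoClient

client = MongoClient()           # assumes a local mongod for experimentation
coll = client.test.yield_demo    # hypothetical collection name

coll.delete_many({})
coll.insert_many([{"x": 1}, {"x": 2}, {"x": 2}, {"x": 1}])

# Small batches force a getMore between the second and third result.
cursor = coll.find({"x": 2}, batch_size=2)
first_batch = [cursor.next(), cursor.next()]

# A write that lands between batches changes what the rest of the cursor sees.
coll.update_many({"x": 1}, {"$set": {"x": 2}})

rest = list(cursor)
print(len(first_batch) + len(rest))  # typically 3: the fourth doc now matches,
                                     # but the first was already passed over while x was 1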

Related

Ordering a sequence of writes to MongoDB v4.0 / DocumentDB

Problem
I need to establish write consistency for a sequence of queries using updateMany, against a DocumentDB cluster with only a single primary instance. I am not sure which approach to use, between Transactions, ordered BulkWrites, or simply setting a Majority write concern for each updateMany query.
Environment
AWS DocumentDB cluster, which maps to MongoDB v4.0, via pymongo 3.12.0.
Note: the cluster has a single primary instance, and no other instances. In practice, AWS will have us connect to the cluster in replica set mode. I am not sure whether this means we need to still think about this problem in terms of replica sets.
Description
I have a sequence of documents D, each of which is an array of records. Each record is of the form {field: MyField, from_id: A, to_id: B}.
To process a record, I need to look in my DB for all fields MyField that have value A, and then set that value to B. The actual query I use to do this is updateMany. The code looks something like:
for doc in Documents:
    for record in doc:
        doWriteUpdate(record)

def doWriteUpdate(record):
    query = ...  # format the query based on the record's information
    db.updateMany(query)
I need the update operations to happen such that the writes have actually been applied, and are visible, before the next doWriteUpdate query runs.
This is because I expect to encounter a situation where I can have a record {field: MyField, from_id: A, to_id: B}, and then a subsequent record (whether in the same document, or a following document) {field: MyField, from_id: B, to_id: C}. Being able to properly apply the latter record operation, depends on the former record operation having been committed to the database.
Possible Approaches
Transactions
I have tried wrapping my updateMany operation in a Transaction. If this had worked, I would have called it a day; but I exceed the size allowed: Total size of all transaction operations must be less than 33554432. Without rewriting the queries, this cannot be worked around, because the updateMany has several layers of array-filtering, and digs through a lot of documents. I am not even sure if transactions are appropriate in this case, because I am not using any replica sets, and they seem to be intended for ACID with regard to replication.
Ordered Bulk Writes
BulkWrite.updateMany would appear to guarantee the execution order of a sequence of writes. So, one approach could be to generate the update query strings for each record r in a document D, and then send those through (preserving order) as a BulkWrite. While this would seem to "preserve order" of execution, I don't know whether a) the preservation of execution order also guarantees write consistency (everything executed serially is applied serially), and, more importantly, b) whether the following BulkWrites, for the other documents, will interleave with this one.
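For reference, a minimal pymongo sketch of an ordered bulk write is below. The filter and update shapes are placeholders (the real queries involve array filtering), so this only illustrates the ordered-execution mechanics, not the actual update logic:

from pymongo import MongoClient, UpdateMany

client = MongoClient()
coll = client.mydb.mycoll  # placeholder database/collection names

# One ordered batch: the server applies these serially, in order,
# and stops at the first error rather than continuing.
ops = [
    UpdateMany({"my_field": "A"}, {"$set": {"my_field": "B"}}),  # A -> B
    UpdateMany({"my_field": "B"}, {"$set": {"my_field": "C"}}),  # then B -> C
]
result = coll.bulk_write(ops, ordered=True)
print(result.modified_count)

Note that ordered=True only serializes the operations within this one bulk_write call; it says nothing about interleaving with writes issued on other connections, which is the concern raised in b) above.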
WriteConcern
Pymongo states that writes will block given a desired WriteConcern. My session is single-threaded, so this should give the desired behavior. However, MongoDB says
For multi-document transactions, you set the write concern at the transaction level, not at the individual operation level. Do not explicitly set the write concern for individual write operations in a transaction.
I am not clear on whether this pertains to "transactions" as in the general sense, or MongoDB Transactions set up through session objects. If it means the latter, then it shouldn't apply to my use case. If the former, then I don't know what other approach to use.
The proper write concern is "majority", combined with the "linearizable" read concern. From the MongoDB documentation on Real Time Order:

Combined with "majority" write concern, "linearizable" read concern enables multiple threads to perform reads and writes on a single document as if a single thread performed these operations in real time; that is, the corresponding schedule for these reads and writes is considered linearizable.
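As a rough pymongo sketch of what that combination looks like (collection and field names are placeholders; linearizable read concern only applies to reads on the primary whose query identifies a single document, and DocumentDB's support for these options may differ from genuine MongoDB):

from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient()
coll = client.mydb.mycoll.with_options(
    write_concern=WriteConcern(w="majority"),
    read_concern=ReadConcern("linearizable"),
)

# update_many does not return until the write is majority-acknowledged,
# so a later read through this collection object observes it.
coll.update_many({"my_field": "A"}, {"$set": {"my_field": "B"}})
doc = coll.find_one({"my_field": "B"})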

Spring Data MongoDB Concurrent Updates Behavior

Imagine there's a document containing a single field: {availableSpots: 100}
and there are millions of users racing to get a spot by sending a request to an API server.
Each time a request comes in, the server reads the document and, if availableSpots is > 0, decrements it by 1 and creates a booking in another collection.
Now I read that MongoDB locks the document whenever an update operation is performed.
What will happen if there are a million concurrent requests? Will it take a long time because the same document keeps getting locked? Also, the server reads the value of the document before it tries to update it, and by the time it acquires the lock, the spot may not be available anymore.
It is also possible that several threads see "availableSpots > 0" as true at the same instant, but in reality there may not be enough spots for all of those requests. How to deal with this?
The most important things here are atomicity and concurrency.
1. Atomicity
Your operation to update (decrement by one) if availableSpots > 0:
db.collection.updateOne({ availableSpots: { $gt: 0 } }, { $inc: { availableSpots: -1 } })
is atomic.
$inc is an atomic operation within a single document.
Refer : https://docs.mongodb.com/manual/reference/operator/update/inc/
2. Concurrency
MongoDB has document-level concurrency control for write operations, so each update will take a lock on the document.
Now your questions:
What will happen if there are a million concurrent requests?
Yes, each update will be performed one by one (due to locking), so contention on the same document will slow things down.
the server reads the value of the document before it tries to update it, and by the time it acquires the lock, the spot may not be available anymore.
Since the filter and the decrement are applied as one atomic operation, this will not happen. It will work as you want: only 100 updates will actually modify the document (each reporting a modified count of 1); once availableSpots reaches 0 the filter no longer matches and further updates affect nothing.
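A minimal pymongo sketch of that pattern (database and collection names are placeholders; note that the booking insert is a separate write, so it is not atomic together with the decrement, which is what the transaction discussion further down addresses):

from pymongo import MongoClient

client = MongoClient()
db = client.booking_demo  # placeholder database name

def try_book(user_id):
    # The filter and the decrement run as one atomic operation on the server,
    # so the counter can never be pushed below zero.
    result = db.spots.update_one(
        {"availableSpots": {"$gt": 0}},
        {"$inc": {"availableSpots": -1}},
    )
    if result.modified_count == 1:
        db.bookings.insert_one({"user": user_id})
        return True
    return False  # sold out: the filter matched no document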
MongoDB uses WiredTiger as the default storage engine starting with version 3.2.
WiredTiger provides document-level concurrency:
From docs:
WiredTiger uses document-level concurrency control for write
operations. As a result, multiple clients can modify different
documents of a collection at the same time.
For most read and write operations, WiredTiger uses optimistic
concurrency control. WiredTiger uses only intent locks at the global,
database and collection levels. When the storage engine detects
conflicts between two operations, one will incur a write conflict
causing MongoDB to transparently retry that operation.
When multiple clients are trying to update a value in a document, only that document will be locked, not the entire collection.
My understanding is that you are concerned about the performance of many concurrent ACID-compliant transactions against two separate collections:
a collection (let us call it spots) with one document {availableSpots: 999..}
another collection (let us call it bookings) with multiple documents, one per booking.
Now I read that MongoDB locks the document whenever an update operation is performed.
It is also possible that several threads see "availableSpots > 0" as true at the same instant, but in reality there may not be enough spots for all of those requests. How to deal with this?
With version 4.0, MongoDB provides the ability to perform multi-document transactions against replica sets. (The forthcoming MongoDB 4.2 will extend this multi-document ACID transaction capability to sharded clusters.)
This means that no write operations within a multi-document transaction (such as updates to both the spots and bookings collections, per your proposed approach) are visible outside the transaction until the transaction commits.
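For illustration, a hedged pymongo sketch of such a transaction (assumes MongoDB 4.0+ on a replica set; database, collection, and field names are placeholders, and retry handling for transient transaction errors is omitted):

from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

client = MongoClient()  # must point at a replica set for transactions
db = client.booking_demo

with client.start_session() as session:
    with session.start_transaction(write_concern=WriteConcern("majority")):
        db.spots.update_one(
            {"availableSpots": {"$gt": 0}},
            {"$inc": {"availableSpots": -1}},
            session=session,
        )
        db.bookings.insert_one({"user": "some_user"}, session=session)
        # Neither write is visible to other clients until the block exits
        # and the transaction commits.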
Nevertheless, as noted in the MongoDB documentation on transactions, a denormalized approach will usually provide better performance than multi-document transactions:
In most cases, multi-document transaction incurs a greater performance
cost over single document writes, and the availability of
multi-document transaction should not be a replacement for effective
schema design. For many scenarios, the denormalized data model
(embedded documents and arrays) will continue to be optimal for your
data and use cases. That is, for many scenarios, modeling your data
appropriately will minimize the need for multi-document transactions.
In MongoDB, an operation on a single document is atomic. Because you can use embedded documents and arrays to capture relationships between data in a single document structure instead of normalizing across multiple documents and collections, this single-document atomicity obviates the need for multi-document transactions for many practical use cases.
But do bear in mind that your use case, if implemented within one collection as a single denormalized document containing one availableSpots sub-document and many thousands of bookings sub-documents, may not be feasible as the maximum document size is 16MB.
So, in conclusion, a denormalized approach to write atomicity will usually perform better than a multi-document approach, but is constrained by the maximum document size of 16MB.
You can try using findAndModify() when updating the document. Each time, you will need to cherry-pick whichever fields you want to update in that particular document. Also, since MongoDB replicates data to primary and secondary nodes, you may want to adjust your writeConcern values as well. You can read more about this in the official documentation. I have something similar coded that handles this kind of concurrency issue in MongoDB using Spring's MongoTemplate. Let me know if you want a Java reference for that.
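For what it's worth, pymongo's counterpart to findAndModify is find_one_and_update; here is a sketch (collection and field names are placeholders, and this is a Python stand-in rather than the Spring MongoTemplate code mentioned above) that decrements and returns the updated document in one atomic round trip:

from pymongo import MongoClient, ReturnDocument

client = MongoClient()
coll = client.booking_demo.spots  # placeholder names

# Atomically decrement and get the post-update document back.
doc = coll.find_one_and_update(
    {"availableSpots": {"$gt": 0}},
    {"$inc": {"availableSpots": -1}},
    return_document=ReturnDocument.AFTER,
)
if doc is None:
    print("no spots left")
else:
    print("spots remaining:", doc["availableSpots"])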

MongoDB Chain queries, pseudo transactions

I understand you cannot do transactions in MongoDB, and the thinking is that they're not needed because everything locks the whole database or collection (I am not sure which). However, how then do you perform the following?
How do I chain together multiple insert, update, delete or select queries in MongoDB so that other queries that might operate on the same data wait until these queries finish? An analogy would be serializable transaction isolation in MS SQL Server.
more..
I want to insert/update record into collection A and update a record in collection B and then read Collection A and B but I don't want anyone (process or thread) to read or write to collection A or B until BOTH A and B have been updated or inserted by the first queries.
Yes, that's absolutely possible.
It is called ordered bulk operations on planet Mongo and works like this in the mongo shell:
bulk = db.emptyCollection.initializeOrderedBulkOp()
bulk.insert({name: "First document"})
bulk.find({name: "First document"})
    .update({$set: {name: "First document, updated"}})
bulk.execute()
db.emptyCollection.findOne()
> {_id: <someObjectId>, name: "First document, updated"}
Please read the manual regarding Bulk Write Operations for details.
Edit: Somehow I misread your question. It isn't possible across two collections. Remember, though, that you can have different kinds of documents in one collection. Some ODMs even allow different models to be saved to the same collection. Exploiting this, you should be able to achieve what you want using the above bulk operations. You may want to combine this with application-level locking to prevent writes. But preventing both reading and writing would be the same as a transaction in terms of global, and possibly distributed, locks.

How to lock a Collection in MongoDB

I have a collection in my database
1. I want to lock my collection while a user is updating a document.
2. No operations should be done except reads by other users while the collection is being updated.
Please give suggestions on how to lock a collection in MongoDB.
Best Regards
GSY
MongoDB already implements a writer-greedy, database-level lock.
This means that when a specific document is being written to:
The User collection would be locked
No reads will be available until the data is written
The reason that no reads are available is because MongoDB cannot do a consistent read while writing (darn you physics, you win again).
It is good to note that if you wish for a more complex lock, spanning multiple documents, this is not available in MongoDB and there is no real way of implementing such a thing.
MongoDB locking already does that for you. See what operations acquire which lock and what does each lock mean.
See the MongoDB documentation on write operations, paying special attention to this section:
Isolation of Write Operations
The modification of a single document is always atomic, even if the write operation modifies multiple sub-documents within that document. For write operations that modify multiple documents, the operation as a whole is not atomic, and other operations may interleave.
No other operations are atomic. You can, however, attempt to isolate a write operation that affects multiple documents using the isolation operator.
To isolate a sequence of write operations from other read and write operations, see Perform Two Phase Commits.

Solution to Bulk FindAndModify in MongoDB

My use case is as follows -
I have a collection of documents in MongoDB which I have to send for analysis.
The format of the documents are as follows -
{ _id: ObjectId("517e769164702dacea7c40d8"),
  date: "1359911127494",
  status: "available",
  other_fields... }
I have a reader process which picks the first 100 documents with status:available, sorted by date, and modifies them to status:processing.
ReaderProcess sends the documents for analysis. Once the analysis is complete, the status is changed to processed.
Currently the reader process first fetches 100 documents sorted by date and then updates the status to processing for each document in a loop. Is there any better/more efficient solution for this case?
Also, in future for scalability, we might go with more than one reader process.
In this case, I want the 100 documents picked by one reader process to not get picked by another reader process. But fetching and updating are separate queries right now, so it is very much possible that multiple reader processes pick the same documents.
Bulk findAndModify (with limit) would have solved all these problems. But unfortunately it is not provided in MongoDB yet. Is there any solution to this problem?
As you mention, there is currently no clean way to do what you want. The best approach at this time for operations like the one you need is this:
1) Reader selects X documents with the appropriate limit and sorting.
2) Reader marks the documents returned by 1) with its own unique reader ID (e.g. update({_id:{$in:[<result set ids>]}, state:"available", $isolated:1}, {$set:{readerId:<your reader's ID>, state:"processing"}}, false, true)).
3) Reader selects all documents marked as processing and with its own reader ID. At this point it is guaranteed that you have exclusive access to the resulting set of documents.
4) Offer the result set from 3) for your processing.
Note that this even works in highly concurrent situations, as a reader can never reserve documents already reserved by another reader (note that step 2) can only reserve currently available documents, and writes are atomic). I would add a timestamp with the reservation time as well if you want to be able to time out reservations (for example for scenarios where readers might crash/fail).
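A rough pymongo sketch of steps 1) to 3) (names are placeholders; the $isolated flag discussed in the edit below is omitted here since it was removed in MongoDB 4.2, and the suggested reservation timestamp is left out for brevity):

import uuid
from pymongo import MongoClient

client = MongoClient()
coll = client.mydb.tasks  # placeholder names
READER_ID = str(uuid.uuid4())

# Step 1: pick candidate ids.
candidates = [d["_id"] for d in
              coll.find({"status": "available"}).sort("date", 1).limit(100)]

# Step 2: try to reserve them; the status filter means documents already
# grabbed by another reader are silently skipped.
coll.update_many(
    {"_id": {"$in": candidates}, "status": "available"},
    {"$set": {"status": "processing", "readerId": READER_ID}},
)

# Step 3: read back only what this reader actually reserved.
reserved = list(coll.find({"status": "processing", "readerId": READER_ID}))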
EDIT: More details:
All write operations can occasionally yield to pending operations if the write takes a relatively long time. This means that step 3) might not see all documents marked by step 2) unless you take the following steps:
Use an appropriate "w" (write concern) value, meaning 1 or higher. This will ensure that the connection on which the write operation is invoked will wait for it to complete regardless of it yielding.
Make sure you do the read in step 3) on the same connection (only relevant for replica sets with slaveOk-enabled reads) or thread, so that the operations are guaranteed to be sequential. The former can be done in most drivers with the "requestStart" and "requestDone" methods or similar (Java documentation here).
Add the $isolated flag to your multi-updates to ensure it cannot be interleaved with other write operations.
Also see comments for discussion regarding atomicity/isolation. I incorrectly assumed multi-updates were isolated. They are not, or at least not by default.