Parallel update issue in MongoDB

We have a field that can be updated by a user action, an admin action, and a cron job at the same time. How should we handle this kind of scenario in MongoDB?
For example, there is a "balance" field in the users collection. While the cron job is decreasing the user's balance, the user may be recharging and an admin may be issuing a refund at the same time, and the balance does not end up updated correctly.
Please suggest a solution for this problem.

If possible, use update operations. They are atomic at the document level, so this should not be a problem.
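For the balance example above, each actor can apply its change with a single atomic operator instead of reading and rewriting the field. A minimal sketch in the mongo shell (collection and variable names are assumptions from the question):

// Each caller issues one atomic $inc; concurrent increments cannot overwrite each other.
db.users.updateOne({ _id: userId }, { $inc: { balance: -cronDeduction } })   // cron deduction
db.users.updateOne({ _id: userId }, { $inc: { balance: rechargeAmount } })   // user recharge
db.users.updateOne({ _id: userId }, { $inc: { balance: refundAmount } })     // admin refund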
If you are using a recent version of MongoDB, you can use transactions for read-update-write sequences.
If you cannot do any of these, you can emulate an optimistic locking scheme using versioning to prevent unintended overwrites. There are several ways this can be done, but it generally goes like this:
Read the document. The document has a version field (an integer or a unique ObjectId; don't use a timestamp).
Make modifications in memory and update the version (increment the integer, or generate a new ObjectId).
Update the document with a query that includes (version: oldVersion).
This will fail if someone updated the document after you read it but before you updated it. If it fails, retry.
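A minimal sketch of that optimistic locking loop in the mongo shell (the version field name and the balance change are illustrative assumptions):

// 1. Read the document, including its current version.
var doc = db.users.findOne({ _id: userId });

// 2. Make the change in memory.
var newBalance = doc.balance + rechargeAmount;

// 3. Write back only if the version is unchanged, and bump it in the same update.
var res = db.users.updateOne(
  { _id: userId, version: doc.version },
  { $set: { balance: newBalance }, $inc: { version: 1 } }
);
if (res.modifiedCount === 0) {
  // Someone else updated the document after we read it - re-read and retry.
}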

Related

How to solve concurrent read + write update to the document in MongoDB?

I have an application that runs in multiple instances and needs to do the following:
Read the document
Check a value of the timestamp field
If an incoming document is newer, then update (including the timestamp field).
Even if I use transactions, two parallel operations could perform steps 1 and 2 simultaneously and then both write to the database, potentially writing the "older" document last (since each one checks the timestamp of the original document rather than the "new" one).
So what I am looking for is some kind of a read lock on the document or some other mechanism that will be able to solve this.
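One way to collapse steps 1-3 into a single operation is to put the timestamp comparison in the update filter itself, so the write only succeeds when the stored document is still older. A hedged sketch (collection and field names are assumptions):

// Only writes if the stored timestamp is older than the incoming one.
// If another instance already stored a newer version, nothing is modified.
var res = db.docs.updateOne(
  { _id: incoming._id, timestamp: { $lt: incoming.timestamp } },
  { $set: { payload: incoming.payload, timestamp: incoming.timestamp } }
);
// res.modifiedCount === 0 means the stored copy is already as new or newer.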

MongoDB concurrency best practices

I am new to MongoDB. I am creating an application that manages a very big list of items (resources), and for each resource the application should manage a kind of booking.
My idea is to embed the booking document inside the resource document, and to avoid concurrency problems I need to lock the resource during booking.
I see that MongoDB allows locks at the collection level, but this would create a bottleneck for the booking functionality, because all resources inside the collection would be locked while the current booking is in progress; for a large number of users and a large number of resources this solution would perform poorly.
In addition, if a deadlock occurred while booking a resource, all resources would be locked.
Are there alternative solutions or best practices to improve performance and scalability of this use case?
A possible solution would be a lock not at the collection level but at the document level (the resource in my example); that way a user booking one resource does not block another user from booking a different resource. Even then I am not sure of the final result, because write commands are not executed in parallel: I suppose I'll probably also need a cluster of servers to handle multiple writes in parallel.
You are absolutely right, you should definitely not lock the entire collection for just updating a single document.
Now this problem depends on how you update your document.
If you update your document with a single update query, then since document update is atomic you would have no problem.
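For the booking example, such a single update could atomically claim the resource only if it is still free; a sketch assuming the booking is embedded in the resource document (field names are made up):

// The filter and the write are applied atomically, so two users cannot both book the same resource.
var res = db.resources.updateOne(
  { _id: resourceId, booking: null },                              // only match an unbooked resource
  { $set: { booking: { userId: userId, bookedAt: new Date() } } }
);
if (res.modifiedCount === 0) {
  // The resource was already booked by someone else.
}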
But if you first have to read the document, change it, and then save it, then you have a concurrency problem: just before you save the changed document, it could be updated by some other request, so the document you read is no longer up to date and your updates will not be right either.
The simple solution to this concurrency problem is to store a version number (usually __v) in each of your documents and increment it on every update. Then, every time you do a read & change & update, you make sure that the version of the document you read and the version of that document in the database are identical. When the version numbers differ, the update fails and you can simply try again.
If you are using node.js, then you are probably using Mongoose, and Mongoose generates __v and can do these concurrency checks behind the scenes, so you do not have to do much extra work to solve this concurrency issue.
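For example, with a recent Mongoose version (5.10 or later, if I recall correctly) you can opt in to full optimistic concurrency checks on save(); a sketch with made-up schema fields:

const mongoose = require('mongoose');

// With optimisticConcurrency enabled, save() includes the __v version key in its
// filter and throws a VersionError if the document changed since it was loaded.
const resourceSchema = new mongoose.Schema(
  { name: String, booking: Object },
  { optimisticConcurrency: true }
);
const Resource = mongoose.model('Resource', resourceSchema);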

Zend service solr update document

In Zend_Service_Solr I can add or delete a record.
$solr->addDocument($document);
Is there any way that I can update a record? I couldn't find any documentation for that. Or is there any extension for doing the same?
In most cases, updating a document in Solr means adding the same document again (with the same value for the uniqueKey field).
It's possible to perform certain partial updates in more recent versions of Solr, but these require all fields to be stored (so that the document can just be re-added internally) and a custom update syntax. There is also some work in progress on making non-textual DocValues updatable without resubmitting the complete document, but this is currently not in any released version of Solr.
The best way to handle this is usually to just re-submit the document with updated values, and have a straight forward way of doing that in your application code.

Why does MongoDB only update the first matching document in the collection?

Consider a student collection that contains the following documents.
{name: "Nithin", age: 23}
{name: "Nithin", age: 25}
{name: "Nithin", age: 28}
{name: "Nithin", age: 12}
I want to set age=60 for all the documents whose name is "Nithin".
If we execute the following query it will only update the first document.
db.student.update({name: "Nithin"}, {$set: {age: 60}})
To update all the documents I have to use the query
db.student.update({name: "Nithin"}, {$set: {age: 60}}, false, true)
or
db.student.update({name: "Nithin"}, {$set: {age: 60}}, {multi: true})
What is the reason that, by default, MongoDB does not update all the documents when executing db.student.update({name: "Nithin"}, {$set: {age: 60}})? What is the motivation for requiring a separate form to update all the documents? Does it improve performance?
Originally, in the early days of MongoDB (pre 1.1), it was not possible to update multiple documents. This was a feature added around 1.1.3.
You can see it in the release notes, New Feature 268.
I'm guessing this was not enabled by default for backwards compatibility with previous versions.
This may not really be the reason, but I see the additional multi parameter as a safeguard against accidentally updating multiple records when one intends to update only a single document, much like accidentally running UPDATE ... SET in SQL without specifying additional constraints.
Again, this is just an assumption and may not really be the case.
I suppose part of the reason might be to avoid people coming from the SQL world to think about multi-document updates as isolated transactions.
In fact, during a long update MongoDB will periodically yield control to other queries which can potentially modify the same dataset.
So, by explicitly setting multi=true you are somewhat acknowledging this fact (well, not really, but I guess that's the spirit...)
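For what it's worth, the newer CRUD helpers make the intent explicit, so the multi flag question doesn't come up:

db.student.updateOne({ name: "Nithin" }, { $set: { age: 60 } })    // never updates more than one document
db.student.updateMany({ name: "Nithin" }, { $set: { age: 60 } })   // always updates all matching documents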

Document DB and simulating ACID

See results at the end
I want to use a document DB (for various reasons) - probably CouchDB or MongoDB. However, I also need ACID on my multiple-document transactions.
However, I do plan on working with an "add-only" model - changes are added as new documents (add is add, update is add a copy + transformed data, delete is add an empty document with the same ID + a delete flag). Periodically, I'll run compaction on the database to remove non-current documents.
With that in mind, are there any holes in the following idea:
Maintain a collection for current transactions in progress. This collection will hold documents with transaction IDs (GUIDs + timestamp) of transactions in progress.
Atomicity:
On a transaction:
Add a document to the transactions in progress collection (see the sketch after this list).
Add the new documents (add is add, update is copy+add, delete is add with ID and “deleted” flag).
Each added document will have the following management fields:
Transaction ID.
Previous document ID (linked list).
Remove the document added to the transactions in progress collection.
On transaction fail:
Remove all added documents
Remove the document from the transactions in progress collection.
Periodically:
Go over all transactions in progress, find ones that have been abandoned (>10 minutes?), remove the associated documents in the DB (index on transaction ID) and then remove the transaction-in-progress document.
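A rough sketch of the bookkeeping for the transaction steps above (collection and field names are assumptions):

// 1. Register the transaction before touching any data.
db.transactions_in_progress.insertOne({ _id: txId, startedAt: new Date() });

// 2. Every document written by the transaction carries the management fields.
db.data.insertOne({
  _id: newDocId,
  txId: txId,        // transaction that created this revision
  prevId: oldDocId,  // previous document in the linked list (null for a brand new object)
  deleted: false,    // true for the "delete is add an empty document" case
  // ...payload fields...
});

// 3. On commit, remove the marker; on failure, delete the added documents first.
db.transactions_in_progress.deleteOne({ _id: txId });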
Read transaction consistency (read only committed transactions):
On data retrieval:
Load transactions in progress set.
Load needed documents.
For all documents, if the document transaction ID is in “transactions in progress” or later (using timestamp), load the previous document in the linked list (recursive).
It’s a bit like MVCC, a bit like Git. I set the retrieval context by the transactions I know managed to finish before I started. I avoid a single sequence (hence single execution) by keeping a list of "ongoing transactions" rather than a single "transaction revision". And, of course, I avoid reading non-committed transactions and provide rollback on conflict.
So - are there any holes in this? Will my performance suffer horribly?
Edit1: Please please please - don't hammer on "don't use a document database if you need multi-document transactions". I know; I need a document database anyway, for other reasons.
Edit2: added a timestamp to avoid data from transactions that start after the retrieval transaction has started. The timestamp could possibly be changed to a sequence ID.
Edit3: Here's another algorithm I thought about - it may be better than the one above:
New algorithm - easier to understand (and possibly correct this time :) )
Support structures:
transaction_support_template {
    _created-by-transaction: <txid>
    _made-obsolete-by-transaction: <txid>
}
transaction_record {
    transaction_id: <txid>
    timestamp: <tx timestamp>
    updated_documents: [doc1_id, doc2_id, ...]
}
transaction_number {    // atomic counter - used for ordering transactions
    _id: "transaction_number"
    next_transaction_id: 0    // initial value
}
Note: all IDs are model object IDs, not DB ids (don't confuse with logical IDs which are different).
DB ID - different for each document - but multiple DB documents are revisions of one model object.
Model object ID - same for all revisions of the model object.
Logical ID - client-facing ID.
First time setup:
1. Create the transaction_number document.
Commit process:
1. Get new transaction ID by atomic increment on the transaction number counter.
2. Insert a new transaction record with the transaction id, the timestamp and the updated documents.
3. Create the new version for each document. Make sure the _created-by-transaction is set.
4. Mark the old version of each updated or deleted document by setting
"_made-obsolete-by-transaction" to the transaction id.
This is the time to detect conflicts! If a conflict is seen, roll back.
Note - this can be done with find-and-modify rather than by serializing the entire document again (see the sketch after this list).
5. Remove the transaction record.
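A hedged sketch of commit steps 1 and 4 in the mongo shell (collection names are assumptions; the field names follow the support structures above):

// Step 1: atomically claim the next transaction id from the counter document.
var counter = db.transactions.findAndModify({
  query:  { _id: "transaction_number" },
  update: { $inc: { next_transaction_id: 1 } },
  new: true
});
var txId = counter.next_transaction_id;

// Step 4: seal the previous revision; this is also where conflicts surface.
// If another transaction already marked it obsolete, the filter doesn't match and null is returned.
var sealed = db.data.findAndModify({
  query:  { _id: oldRevisionId, "_made-obsolete-by-transaction": { $exists: false } },
  update: { $set: { "_made-obsolete-by-transaction": txId } }
});
if (sealed === null) {
  // Conflict: another transaction obsoleted this revision first - roll back.
}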
Cleanup process:
1. Go over the transaction records, sorted by id, ascending (oldest transaction first).
2. For each transaction, if it expired (by timestamp), do rollback(txid).
Rollback(txid) process:
1. Get the transaction record for the given transaction id.
2. For each document id in the "updated documents":
2.1 If the document exists and has "_made-obsolete-by-transaction" with
the correct transaction id, remove the _made-obsolete-by-transaction data.
3. For each document whose _created-by-transaction is the given transaction id:
3.1 remove the document.
4. Remove the transaction record document.
Retrieval process:
1. Top-transaction-id = transaction ID counter.
2. Read all transactions from the transactions collection.
Current-transaction-ids[] = Get all transaction IDs.
3. Retrieve documents as needed. Always use "sort by transaction_id, desc" as last sort clause.
3.1 If a document's "_created-by-transaction" is in the Current-transaction-ids[]
or is >= Top-transaction-id - ignore it (not yet committed).
3.2 If a document "_made-obsolete-by-transaction" is not in the Current-transaction-ids[]
and is < Top-transaction-id - ignore it (a newer version was committed).
4. We may have to retrieve more chunks to satisfy the original request if documents were ignored (a sketch of this filter follows).
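The "ignore" rules in step 3 can be pushed into the query itself (as the optimization section below also suggests); a rough sketch, where someSortField stands in for the query's real sort keys:

// Visible documents: created by a transaction that committed before we started,
// and not made obsolete by such a transaction.
var visible = {
  "_created-by-transaction": { $nin: currentTransactionIds, $lt: topTransactionId },
  $or: [
    { "_made-obsolete-by-transaction": { $exists: false } },
    { "_made-obsolete-by-transaction": { $in: currentTransactionIds } },
    { "_made-obsolete-by-transaction": { $gte: topTransactionId } }
  ]
};
// Real sort keys come first; the transaction id is only the final tie-breaker.
db.data.find(visible).sort({ someSortField: 1, "_created-by-transaction": -1 });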
Was the document committed when we started?
If we see a document with transaction ID in the current executing transactions - it's a transaction that
started before we started the retrieval but was not yet committed at that time - so we don't want it.
If we see a document with transaction ID >= top transaction ID - it's a transaction that started after
we started the retrieval - so we don't want it.
Is the document up-to-date (latest version)?
If we see a document with made-obsolete that is not in the current transaction IDs (transactions started
before we started) and is < top transaction ID (transactions started after we started) - then
there was a transaction that finished commit in our past that made this document obsolete - so we don't want it.
Why is sorting not harmed?
Because we add the sort as a last clause, we'll always see the real sorting work first. For each real
sorting "bucket" we might get multiple documents that represent the model object at different versions.
However, the sort order between model objects remains.
Why doesn't the counter make the transactions execute serially (one at a time)?
Because this is not an RDBMS - we don't really have transactions, so we don't wait for a transaction
to commit as we would with "select for update".
Another transaction can make the atomic change as soon as we're done with it.
Compaction:
Once in a while a compaction will have to take place - take all the really old documents and move them to another data store.
This shouldn't affect any running retrieval or transaction.
Optimization:
Put the conditions into the query itself.
Add transaction ID to all indexes.
Make sure documents with the same model object ID don't get sharded into different nodes.
What's the cost?
Assuming we want multiple document versions for history and audit anyway, the extra cost is
atomically updating the counter, creating the transaction record, "sealing" the previous version of each model object
(mark obsolete) and removing the transaction document. This shouldn't be too big.
Note that if the above assumption is not valid, the extra cost is quite high, especially for retrieval.
Results:
I've implemented the above algorithm (the revised one, with minor changes). Functionally, it works. However, the performance (at least on MongoDB with 3 nodes in a master-slave replication topology, no fsync but replication required before "commit" ends) is atrocious. I'm constantly reading, from different threads, documents I've just written. I'm getting constant collection locks on the transactions collection, and my indexes can't keep up with the constant rollover. Performance is capped at 20 TPS for tiny transactions with 10 feeder threads.
In short - not a good general purpose solution.
Without going into the specifics of your plan, I thought it might first be useful to go over MongoDB's support for the ACID requirements.
Atomicity: Mongo supports atomic changes to individual documents. Typically, the most significant atomic operations are "$set" and findAndModify. Some documentation on these operations and atomicity in MongoDB in general (a brief illustration follows the links):
http://www.mongodb.org/display/DOCS/Atomic+Operations
http://www.mongodb.org/display/DOCS/Updating#Updating-%24set
http://www.mongodb.org/display/DOCS/findAndModify+Command
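As a minimal illustration of the two operations (collection and field names are made up, not taken from the links):

// $set changes individual fields atomically within a single document.
db.orders.update({ _id: orderId }, { $set: { status: "paid" } });

// findAndModify reads and updates a document in one atomic step,
// returning null if no document matched the query.
var claimed = db.orders.findAndModify({
  query:  { status: "paid" },
  update: { $set: { status: "shipping" } }
});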
Consistency: Difficult to achieve and quite complex. I won't try to summarize in this post, but there is a great series of posts on the subject:
http://blog.mongodb.org/post/475279604/on-distributed-consistency-part-1
http://blog.mongodb.org/post/498145601/on-distributed-consistency-part-2-some-eventual
Isolation: Isolation in MongoDB does exist for documents, but not at any higher level. Again, this is a complicated subject; besides the Atomic Operations link above, the best resource I have found is the following Stack Overflow thread:
Why doesn't MongoDB use fsync()? (the top answer is a bit of a goldmine for this subject in general, though some of the information regarding durability is out of date)
Durability: The main way that users ensure data durability is by using the getLastError command (see link below for more info) to confirm that a majority of nodes in a replica set have written the data before the call returns.
http://www.mongodb.org/display/DOCS/getLastError+Command#getLastErrorCommand-majority
http://docs.mongodb.org/manual/core/replication-internals/ (linked to in the above document)
Knowing all this about ACID in Mongo, it would be very useful to look over some examples of similar problems that have already been worked out with Mongo. The two following links should be really useful to you, as they are very complete and right on subject.
Two-Phase Commits: http://cookbook.mongodb.org/patterns/perform-two-phase-commits/
Transactions for e-commerce work: http://www.slideshare.net/spf13/mongodb-ecommerce-and-transactions-10524960
Finally, I have to ask: why do you want to have transactions? It is rare that users of MongoDB find they truly need ACID to achieve their goals. It might be worthwhile stepping back and trying to approach the problem from another perspective before you go ahead and implement a whole layer on top of Mongo just to get transactions.