I am learning about MongoDB. If I create a bulk write, is the operation all or nothing? I have a scenario where my users can delete who they are friends with.
FRIEND 1 | FRIEND 2
User B   | User A
User A   | User B
For this to happen I need to delete both rows of the bidirectional relationship. For consistency I need the two deletes to be all or nothing, because if only one of the two operations succeeded I would be left with bad data. Reading the docs I could not find the answer:
https://docs.mongodb.org/manual/core/bulk-write-operations/
db.collection.initializeOrderedBulkOp()
"If an error occurs during the processing of one of the write operations, MongoDB will return without processing any remaining write operations in the list."
There is no mention of rolling back the operations that already ran; it simply stops processing the remaining ones.
db.collection.insert() method
"The insert() method, when passed an array of documents, performs a bulk insert, and inserts each document atomically."
You can roll your own, but use an acknowledged write concern, which you would have to set via your chosen driver. The shell uses acknowledged writes, but perhaps your driver does not.
https://docs.mongodb.org/manual/core/write-concern/
try
    insert 1
catch
    delete
try
    insert 2
catch
    delete 1
    delete 2
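Applied to the friendship example from the question, the roll-your-own idea could be sketched roughly as below. This is only a sketch in mongo shell style and not a real transaction: the function name `unfriend` and the `friend1`/`friend2` field names are assumptions for illustration, and another client can still see the intermediate state between the two deletes.

```javascript
// Hypothetical helper: remove both directions of a friendship and put the
// first row back if the second delete throws.
// `friends` is assumed to be a collection of {friend1, friend2} documents.
function unfriend(friends, a, b) {
  // Ask for acknowledged writes explicitly, as the answer recommends.
  const ack = { writeConcern: { w: 1 } };

  // Remember the first row so it can be restored later if needed.
  const first = friends.findOne({ friend1: a, friend2: b });
  friends.deleteOne({ friend1: a, friend2: b }, ack);
  try {
    friends.deleteOne({ friend1: b, friend2: a }, ack);
  } catch (err) {
    // Compensate: restore the row removed above, then report the failure.
    if (first !== null) {
      friends.insertOne(first, ack);
    }
    throw err;
  }
}
```

In the shell this might be called as `unfriend(db.friends, "User A", "User B")`, where `db.friends` is again an assumed collection name.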
Related
Let's say I have MongoDB collection A and B, and they are in the same database.
I'm renaming B to A, dropping the existing target collection A.
I know renaming takes a really short time.
But what if I send a query to A while mongo is still renaming B to A?
----------|--------------------------------|---------------------------|-------------------
 rename B to A begin                Send query to A           rename B to A done
Am I going to get the result right away, or will the query wait until the rename finishes?
Per the documentation
The db.collection.renameCollection() method and renameCollection command will invalidate open cursors which interrupts queries that are currently returning data.
So you might end up with errors of dead or killed cursors while renaming your collection.
Since MongoDB 4.2, all operations will wait on the rename to be finished.
renameCollection() obtains an exclusive lock on the source and target collections for the duration of the operation. All subsequent operations on the collections must wait until renameCollection() completes. Prior to MongoDB 4.2, renaming a collection within the same database with renameCollection required obtaining an exclusive database lock.
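A rough sketch of what this means for client code, in mongo shell style. Both helper names are made up for illustration, and retrying on any error is a simplifying assumption; real code would probably inspect the error before retrying.

```javascript
// Hypothetical helper: rename `source` to `target`, dropping the existing
// target first (the second argument of renameCollection is dropTarget).
function replaceCollection(db, source, target) {
  return db.getCollection(source).renameCollection(target, true);
}

// Hypothetical helper: run a read and retry it a few times if it throws,
// for example because a rename invalidated its open cursor.
function readWithRetry(readFn, maxAttempts) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return readFn();
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```

For example, `readWithRetry(() => db.A.find({}).toArray(), 3)` would re-run the query up to three times.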
In my app, I am doing the following with MongoDB:
1. Start a MongoDB session and start a transaction
2. Read a document
3. Do some calculations based on values in the document and some other arguments
4. Update the document that was read in step 2 with the results of the calculations in step 3
5. Commit the transaction and end the session
The above procedure is executed with retries on TransientTransactionError, so if the transaction fails due to a concurrency issue, the procedure is retried.
If two concurrent invocations of the procedure both read the document before either of them writes to it, I need only one invocation to write successfully and the other to fail. Otherwise I don't get the result I am trying to achieve.
Can I expect MongoDB to fail one invocation in this scenario, so that the procedure is retried against the updated state of the document?
MongoDB multi-document transactions are atomic (i.e. provide an “all-or-nothing” proposition). When a transaction commits, all data changes made in the transaction are saved and visible outside the transaction. That is, a transaction will not commit some of its changes while rolling back others.
This is also elaborated further in In-progress Transactions and Write Conflicts:
If a transaction is in progress and a write outside the transaction
modifies a document that an operation in the transaction later tries
to modify, the transaction aborts because of a write conflict.
If a transaction is in progress and has taken a lock to modify a
document, when a write outside the transaction tries to modify the
same document, the write waits until the transaction ends.
See also the Write Conflicts section of the video How and When to Use Multi-Document Transactions to learn more about multi-document transactions (write locks and so on).
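The read, calculate, write procedure from the question, with a retry on TransientTransactionError, might be sketched like this in mongo shell style. The function name, its parameters and the `compute` callback are assumptions for illustration; `session` is expected to come from `db.getMongo().startSession()`.

```javascript
// Hypothetical helper: read a document, compute changes, write them back,
// all inside one transaction, retrying when the transaction is aborted
// with the TransientTransactionError label (e.g. after a write conflict).
function updateInTransaction(session, dbName, collName, id, compute, maxRetries) {
  const coll = session.getDatabase(dbName).getCollection(collName);
  for (let attempt = 0; ; attempt++) {
    session.startTransaction();
    try {
      const doc = coll.findOne({ _id: id });          // read
      const changes = compute(doc);                   // calculate
      coll.updateOne({ _id: id }, { $set: changes }); // write
      session.commitTransaction();
      return changes;
    } catch (err) {
      try {
        session.abortTransaction();
      } catch (abortErr) {
        // The transaction may already be aborted; nothing more to do here.
      }
      const labels = err.errorLabels || [];
      if (labels.includes("TransientTransactionError") && attempt < maxRetries) {
        continue; // start over with a fresh read
      }
      throw err;
    }
  }
}
```

Because each retry starts with a fresh read, the calculation always runs against the latest committed version of the document.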
If you are writing to the same document that you read in both transactions then yes, one will roll back. But do make sure that your writes actually change the document, as MongoDB is smart enough not to update if nothing has changed.
This is to prevent lost updates.
Please see the source: https://www.mongodb.com/blog/post/how-to-select--for-update-inside-mongodb-transactions
In fact, I have the same implementation in one of my projects and it works as expected, although there I read multiple documents. In your specific example, that is not the case.
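One way to make sure the write really changes the document, in the spirit of the linked post, is to set a field to a fresh value every time. The sketch below is in mongo shell style; the helper name and the `lockToken` field are assumptions, and `coll` must be obtained from the session so that the write is part of the transaction.

```javascript
// Hypothetical helper: read a document inside a transaction and force a
// real write to it, so that a concurrent transaction touching the same
// document runs into a write conflict.
// `newToken` is any function returning a fresh unique value,
// for example () => new ObjectId() in the shell.
function readForUpdate(coll, id, newToken) {
  // By default findOneAndUpdate returns the document as it was before the
  // update, which is what the caller wants to read.
  return coll.findOneAndUpdate(
    { _id: id },
    { $set: { lockToken: newToken() } }
  );
}
```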
Even if you did not have transactions, you could use findAndModify with an appropriate query part (such as the example for update operation here: https://www.mongodb.com/docs/manual/core/write-operations-atomicity/) to guarantee the behavior you expect.
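A minimal sketch of that non-transactional approach, in mongo shell style. It assumes each document carries a numeric `version` field, which is not something the question mentions; the helper name and parameters are also made up for illustration.

```javascript
// Hypothetical helper: optimistic update without a transaction.
// The update only matches if the document still has the version that was
// read, so a concurrent writer makes this attempt miss and retry.
// `compute` must not return a `version` key, since $inc already changes it.
function updateIfUnchanged(coll, id, compute, maxAttempts) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const doc = coll.findOne({ _id: id });
    if (doc === null) return null;
    const changes = compute(doc);
    const previous = coll.findOneAndUpdate(
      { _id: id, version: doc.version },
      { $set: changes, $inc: { version: 1 } }
    );
    if (previous !== null) return changes; // our write won
  }
  throw new Error("document kept changing; gave up after " + maxAttempts + " attempts");
}
```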
I understand you cannot do transactions in MongoDB, and the thinking is that they are not needed because every operation locks the whole database or collection (I am not sure which). How, then, do you perform the following?
How do I chain together multiple insert, update, delete or select queries in MongoDB so that other queries that might operate on the same data wait until these queries finish? An analogy would be the serializable transaction isolation level in MS SQL Server.
More detail:
I want to insert/update record into collection A and update a record in collection B and then read Collection A and B but I don't want anyone (process or thread) to read or write to collection A or B until BOTH A and B have been updated or inserted by the first queries.
Yes, that's absolutely possible.
In MongoDB this is called an ordered bulk operation, and it works like this in the mongo shell:
var bulk = db.emptyCollection.initializeOrderedBulkOp()
bulk.insert({name:"First document"})
bulk.find({name:"First document"})
    .update({$set:{name:"First document, updated"}})
bulk.execute()
db.emptyCollection.findOne()
> {_id: <someObjectId>, name:"First document, updated"}
Please read the manual regarding Bulk Write Operations for details.
Edit: Somehow I misread your question. This isn't possible across two collections. Remember, though, that you can have different kinds of documents in one collection. Some ODMs even allow different models to be saved to the same collection. Exploiting this, you should be able to achieve what you want using the above bulk operations. You may want to combine this with locking to prevent writes. But preventing both reads and writes would amount to a transaction in terms of global and possibly distributed locks.
I have to write an app that constantly polls a MongoDB collection in a given database. If it finds documents, it reads them, copies them to another database, does some extra processing and deletes them from the original database.
What is the most efficient way to implement this? What are the best practices?
Is it better to process one doc at a time (read one document, copy the document, then delete it),
or is it better to read all documents, copy all of them, then delete all of them?
What would be the best way to handle failures in the middle of one of these read, write and delete steps?
Bulk reads, inserts and deletes are almost always more performant than single document actions. But try to limit it to a maximum number of documents, e.g. in our setup 500 seemed to be optimal.
For handling errors, you could use the following pseudo transaction pattern:
1. findAndModify while setting "state":"pending" for all read documents
2. process documents
3. bulk insert
4. delete all documents with "state":"pending"
If something goes wrong in the processing part or the bulk insert, you can unlock all locked documents and try again.
A more elaborate example of this kind of pseudo-transaction can be found in the MongoDB tutorial:
http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/
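The pending-state pattern above could be sketched roughly as follows, in mongo shell style. The function name, the `worker` field and the `processDoc` callback are assumptions added for illustration; tagging claimed documents with a worker id is one possible way to keep two pollers from deleting each other's documents.

```javascript
// Hypothetical helper: move up to `batchSize` documents from `source` to
// `target`, marking them "pending" while they are being worked on.
function moveBatch(source, target, workerId, batchSize, processDoc) {
  // Claim unclaimed documents one findAndModify at a time.
  // findOneAndUpdate returns the document as it was before the update,
  // so the claimed copies do not carry the state/worker fields.
  const claimed = [];
  while (claimed.length < batchSize) {
    const doc = source.findOneAndUpdate(
      { state: { $exists: false } },
      { $set: { state: "pending", worker: workerId } }
    );
    if (doc === null) break;
    claimed.push(doc);
  }
  if (claimed.length === 0) return 0;

  try {
    // Process the documents, then bulk insert the results.
    target.insertMany(claimed.map(processDoc));
  } catch (err) {
    // Unlock the claimed documents so a later run can try again.
    source.updateMany(
      { state: "pending", worker: workerId },
      { $unset: { state: "", worker: "" } }
    );
    throw err;
  }

  // Delete only the documents this worker claimed.
  source.deleteMany({ state: "pending", worker: workerId });
  return claimed.length;
}
```

A bulk insert that fails halfway can still leave some copies in the target, so a retry has to tolerate documents that were already inserted.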
See results at the end
I want to use a document DB (for various reasons) - probably CouchDB or MongoDB. However, I also need ACID on my multiple-document transactions.
However, I do plan on working with "add-only" model - changes are added as new documents (add is add, update is add a copy+transform data, delete is add empty document with the same ID + delete flag). Periodically, I'll run compaction on the database to remove non-current documents.
With that in mind, are there any holes in the following idea:
Maintain a collection for current transactions in progress. This collection will hold documents with transaction IDs (GUIDs + timestamp) of transactions in progress.
Atomicity:
  On a transaction:
    1. Add a document to the transactions in progress collection.
    2. Add the new documents (add is add, update is copy+add, delete is add with ID and “deleted” flag).
       Each added document will have the following management fields:
         Transaction ID.
         Previous document ID (linked list).
    3. Remove the document added to the transactions in progress collection.
  On transaction fail:
    1. Remove all added documents.
    2. Remove the document from the transactions in progress collection.
  Periodically:
    Go over all transactions in progress, get ones that have been abandoned (>10 minutes?), remove the associated documents in the DB (index on transaction ID) and then remove the transaction in progress.
Read transaction consistency (read only committed transactions):
  On data retrieval:
    1. Load transactions in progress set.
    2. Load needed documents.
    3. For all documents, if the document transaction ID is in “transactions in progress” or later (using timestamp), load the previous document in the linked list (recursive).
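The retrieval rule above can be sketched as a small pure function. The field names (`txId`, `txTimestamp`, `previousId`) and the in-memory `Map` standing in for the database are assumptions made only for this illustration.

```javascript
// Hypothetical sketch: starting from the newest version of a document, walk
// back through the linked list until a version is found whose transaction
// had committed before the read started.
//   docsById        - Map from document ID to document
//   inProgressTxIds - Set of transaction IDs loaded from "transactions in progress"
//   readStartedAt   - timestamp taken when the retrieval started
function resolveVisible(doc, docsById, inProgressTxIds, readStartedAt) {
  let current = doc;
  while (current !== null && current !== undefined) {
    const uncommitted = inProgressTxIds.has(current.txId);
    const startedLater = current.txTimestamp > readStartedAt;
    if (!uncommitted && !startedLater) return current;
    // Fall back to the previous version, if there is one.
    current = current.previousId == null ? null : docsById.get(current.previousId);
  }
  return null; // no committed version is visible to this read
}
```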
It’s a bit like MVCC, a bit like Git. I set the retrieval context by the transactions I know managed to finish before I started. I avoid a single sequence (hence single execution) by keeping a list of “ongoing transactions” and not a “transaction revision”. And, of course, I avoid reading uncommitted transactions and provide rollback on conflict.
So - are there any holes in this? Will my performance suffer horribly?
Edit1: Please please please - don't hammer the "don't use document database if you need multi-document transactions". I know, I need a document database anyway for other reasons.
Edit2: added timestamp to avoid data from transactions that start after retrieval transaction has started. Possibly could change timestamp to sequence ID.
Edit3: Here's another algorithm I thought about - it may be better than the one above:
New algorithm - easier to understand (and possibly correct this time :) )
Support structures:
transaction_support_template {
    _created-by-transaction: <txid>
    _made-obsolete-by-transaction: <txid>
}
transaction_record {
    transaction_id: <txid>
    timestamp: <tx timestamp>
    updated_documents: [doc1_id, doc2_id...]
}
transaction_number { // atomic counter - used for ordering transactions.
    _id: "transaction_number"
    next_transaction_id: 0 // initial.
}
Note: all IDs are model object IDs, not DB ids (don't confuse with logical IDs which are different).
DB ID - different for each document - but multiple DB documents are revisions of one model object.
Model object ID - same for all revisions of the model object.
Logical ID - client-facing ID.
First time setup:
1. Create the transaction_number document:
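A minimal sketch of the counter document and its atomic increment, in mongo shell style. The `counters` collection and both function names are assumptions; the document shape follows the transaction_number structure defined earlier.

```javascript
// Hypothetical helper: create the transaction_number document if it does
// not exist yet. Safe to call more than once.
function createCounter(counters) {
  counters.updateOne(
    { _id: "transaction_number" },
    { $setOnInsert: { next_transaction_id: 0 } },
    { upsert: true }
  );
}

// Hypothetical helper: atomically take the next transaction ID.
// findOneAndUpdate returns the document as it was before the increment,
// so the value handed out is the one the counter held a moment ago.
function nextTransactionId(counters) {
  const before = counters.findOneAndUpdate(
    { _id: "transaction_number" },
    { $inc: { next_transaction_id: 1 } }
  );
  if (before === null) {
    throw new Error("transaction_number document is missing; run first time setup");
  }
  return before.next_transaction_id;
}
```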
Commit process:
1. Get new transaction ID by atomic increment on the transaction number counter.
2. Insert a new transaction record with the transaction id, the timestamp and the updated documents.
3. Create the new version for each document. Make sure the _created-by-transaction is set.
4. Update the old version of each updated or deleted document, setting
"_made-obsolete-by-transaction" to the transaction id.
This is the time to detect conflicts! If a conflict is seen, roll back.
Note - this can be done with find-and-modify rather than by serializing the entire document again.
5. Remove the transaction record.
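The conflict check in step 4 might be sketched with a find-and-modify like the one below, in mongo shell style. The helper name and the `docs` collection are assumptions; only the sealing step is shown, not the whole commit.

```javascript
// Hypothetical helper: mark one old version as obsolete, but only if no
// other transaction has marked it already. Returns false on a conflict,
// in which case the caller should roll back.
function sealOldVersion(docs, oldDbId, txId) {
  const sealed = docs.findOneAndUpdate(
    { _id: oldDbId, "_made-obsolete-by-transaction": { $exists: false } },
    { $set: { "_made-obsolete-by-transaction": txId } }
  );
  // null means the filter did not match: the version is gone or was
  // already made obsolete by someone else.
  return sealed !== null;
}
```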
Cleanup process:
1. Go over transaction record, sorted by id, ascending (oldest transaction first).
2. For each transaction, if it expired (by timestamp), do rollback(txid).
Rollback(txid) process:
1. Get the transaction record for the given transaction id.
2. For each document id in the "updated documents":
2.1 If the document exists and has "_made-obsolete-by-transaction" with
the correct transaction id, remove the _made-obsolete-by-transaction data.
3. For each document whose _created-by-transaction is the given transaction id:
3.1 remove the document.
4. Remove the transaction record document.
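The rollback steps could be sketched like this, in mongo shell style. The collection parameters and the `model_id` field (holding the model object ID of each revision) are assumptions for illustration.

```javascript
// Hypothetical helper: undo everything a transaction did.
// `transactions` holds transaction records, `docs` holds document versions.
function rollback(transactions, docs, txId) {
  // 1. Get the transaction record.
  const record = transactions.findOne({ transaction_id: txId });
  if (record === null) return false; // nothing to roll back

  // 2. Un-seal the old versions this transaction had marked obsolete.
  docs.updateMany(
    {
      model_id: { $in: record.updated_documents },
      "_made-obsolete-by-transaction": txId,
    },
    { $unset: { "_made-obsolete-by-transaction": "" } }
  );

  // 3. Remove the versions this transaction created.
  docs.deleteMany({ "_created-by-transaction": txId });

  // 4. Remove the transaction record itself.
  transactions.deleteOne({ transaction_id: txId });
  return true;
}
```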
Retrieval process:
1. Top-transaction-id = transaction ID counter.
2. Read all transactions from the transactions collection.
Current-transaction-ids[] = Get all transaction IDs.
3. Retrieve documents as needed. Always use "sort by transaction_id, desc" as last sort clause.
3.1 If a document's "_created-by-transaction" is in Current-transaction-ids[]
or is >= Top-transaction-id - ignore it (not yet committed).
3.2 If a document's "_made-obsolete-by-transaction" is not in Current-transaction-ids[]
and is < Top-transaction-id - ignore it (a newer version was committed).
4. We may have to retrieve more chunks to satisfy original requests if documents were ignored.
Was the document committed when we started?
If we see a document with transaction ID in the current executing transactions - it's a transaction that
started before we started the retrieval but was not yet committed at that time - so we don't want it.
If we see a document with transaction ID >= top transaction ID - it's a transaction that started after
we started the retrieval - so we don't want it.
Is the document up-to-date (latest version)?
If we see a document with made-obsolete that is not in the current transaction IDs (transactions started
before we started) and is < top transaction ID (transactions started after we started) - then
there was a transaction that finished commit in our past that made this document obsolete - so we don't want it.
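Rules 3.1 and 3.2 of the retrieval process boil down to a small predicate. The sketch below is plain JavaScript; the function name and the use of a `Set` for the current transaction IDs are assumptions, while the field names follow the support structures above.

```javascript
// Hypothetical sketch: decide whether one document version is visible to a
// retrieval that saw `currentTxIds` (a Set) and `topTxId` when it started.
function isVisible(doc, currentTxIds, topTxId) {
  const createdBy = doc["_created-by-transaction"];
  const obsoleteBy = doc["_made-obsolete-by-transaction"];

  // 3.1: created by a transaction that was still running when we started,
  // or that started after us, so it was not committed yet.
  if (currentTxIds.has(createdBy) || createdBy >= topTxId) return false;

  // 3.2: made obsolete by a transaction that had already committed when we
  // started, so a newer version exists.
  const isObsolete = obsoleteBy !== undefined && obsoleteBy !== null;
  if (isObsolete && !currentTxIds.has(obsoleteBy) && obsoleteBy < topTxId) return false;

  return true;
}
```

The same conditions can also be pushed into the query itself, as the Optimization notes suggest, so that ignored versions are never fetched.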
Why is sorting not harmed?
Because we add the sort as a last clause, we'll always see the real sorting work first. For each real
sorting "bucket" we might get multiple documents that represent the model object at different versions.
However, the sort order between model objects remains.
Why doesn't the counter make the transactions execute serially (one at a time)?
Because this is not RDBMS - we don't really have transactions so we don't wait for the transaction
to commit as we do with "select for update".
Another transaction can make the atomic change as soon as we're done with it.
Compaction:
Once in a while a compaction will have to take place - get all really old documents and move them to another data store.
This shouldn't affect any running retrieval or transaction.
Optimization:
Put the conditions into the query itself.
Add transaction ID to all indexes.
Make sure documents with the same model object ID don't get sharded into different nodes.
What's the cost?
Assuming we want multiple document versions for history and audit anyway, the extra cost is
atomically updating the counter, creating the transaction record, "sealing" the previous version of each model object
(mark obsolete) and removing the transaction document. This shouldn't be too big.
Note that if the above assumption is not valid, the extra cost is quite high, especially for retrieval.
Results:
I've implemented the above algorithm (the revised one with minor changes). Functionally, it's working. However, the performance (at least over MongoDB with 3 nodes in master-slave replication topology, no fsync but replication required before "commit" ends) is atrocious. I'm constantly reading, from different threads, things I've just written. I'm getting constant collection locks on the transactions collection and my indexes can't keep up with the constant rollover. Performance is capped at 20 TPS for tiny tiny transactions with 10 feeder threads.
In short - not a good general purpose solution.
Without going into the specifics of your plan, I thought it might first be useful to go over MongoDB's support for the ACID requirements.
Atomicity: Mongo supports atomic changes for individual documents. Typically, the most significant atomic operations are "$set" and findAndModify. Some documentation on these operations and on atomicity in MongoDB in general:
http://www.mongodb.org/display/DOCS/Atomic+Operations
http://www.mongodb.org/display/DOCS/Updating#Updating-%24set
http://www.mongodb.org/display/DOCS/findAndModify+Command
Consistency: Difficult to achieve and quite complex. I won't try to summarize in this post, but there is a great series of posts on the subject:
http://blog.mongodb.org/post/475279604/on-distributed-consistency-part-1
http://blog.mongodb.org/post/498145601/on-distributed-consistency-part-2-some-eventual
Isolation: Isolation in mongoDB does exist for documents, but not for any higher levels. Again, this is a complicated subject; besides the Atomic Operations link above, the best resource I have found is the following stack overflow thread:
Why doesn't MongoDB use fsync()? (the top answer is a bit of a goldmine for this subject in general, though some of the information regarding durability is out of date)
Durability: The main way that users ensure data durability is by using the getLastError command (see link below for more info) to confirm that a majority of nodes in a replica set have written the data before the call returns.
http://www.mongodb.org/display/DOCS/getLastError+Command#getLastErrorCommand-majority
http://docs.mongodb.org/manual/core/replication-internals/ (linked to in the above document)
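In current shells and drivers the same majority guarantee is usually requested through a write concern on the write itself rather than through a separate getLastError call. A minimal sketch in mongo shell style, where the helper name and the timeout parameter are assumptions:

```javascript
// Hypothetical helper: insert a document and wait until a majority of the
// replica set members have acknowledged the write, or until the timeout.
function insertDurably(coll, doc, timeoutMs) {
  return coll.insertOne(doc, {
    writeConcern: { w: "majority", wtimeout: timeoutMs },
  });
}
```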
Knowing all this about ACID in Mongo, it would be very useful to look over some examples of similar problems that have already been worked out in Mongo. I expect the two following links will be really useful to you, as they are very complete and right on subject.
Two-Phase Commits: http://cookbook.mongodb.org/patterns/perform-two-phase-commits/
Transactions for e-commerce work: http://www.slideshare.net/spf13/mongodb-ecommerce-and-transactions-10524960
Finally, I have to ask: Why do you want to have transactions? It is rare that users of mongoDB find they truly need ACID to achieve their goals. It might be worthwhile stepping back and trying to approach the problem from another perspective before you go ahead and implement a whole layer on top of mongo just to get transactions.