Write operation during a long cursor operation - mongodb

I use MongoDB 2.4 with a single DB.
I find all documents in a collection (50,000+) and, for each one, insert it into another collection.
it = coll1.find()
while (it.hasNext()) {
coll2.save(it.next())
}
Is it a performance issue to make intensive writes while a cursor is open on the same database?

This essentially comes down to a question about concurrency ( http://docs.mongodb.org/manual/faq/concurrency/ ): can reads be performed efficiently under a single, database-level, writer-greedy lock while a write-intensive load is being generated?
MongoDB should be able to juggle your read lock and the write lock quite well here, interleaving operations and yielding the current operation under certain conditions it sees fit in order to keep performance up (see the link supplied above).
This is, of course, in contrast to SQL, where read and write operations are isolated; MongoDB's concurrency rules effectively break the I (isolation) in ACID. On the other hand, the lock in SQL is much more granular, so you would normally get comparable performance there.
If you do see a performance hit, mainly due to IO (remember that reading requires IO as well), then you might find it prudent to batch your writes into groups of, say, 1,000, pausing for about 5 seconds after each batch to let the IO subside; a rough sketch of that batching follows.
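For illustration only, here is a minimal sketch of that batching idea with the legacy Java driver; the batch size and pause length simply mirror the numbers suggested above and are not tuned values.
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;

// Copy every document from coll1 into coll2, pausing after each batch so the
// IO subsystem can catch up. 1,000 and 5 seconds are illustrative, not tuned.
public static void copyInBatches(DBCollection coll1, DBCollection coll2) throws InterruptedException {
    DBCursor it = coll1.find();
    int count = 0;
    while (it.hasNext()) {
        coll2.save(it.next());
        if (++count % 1000 == 0) {
            Thread.sleep(5000); // let pending writes flush before continuing
        }
    }
}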

No, as cursors are not atomic: each read is its own atomic operation. This means MongoDB does not have to ensure that the cursor represents a single snapshot in time.

Related

Is eval that evil?

I understand that eval locks the whole database, which can't be good for throughput; however, I have a scenario where a very specific transaction involving several documents must be isolated.
Because that transaction does not happen very often and is fairly quick (a few updates on indexed queries), I was thinking of using eval to execute it.
Are there any pitfalls that I should be aware of (I have seen several eval=evil posts, but without much explanation)?
Does it make a difference if the database is part of a replica set?
Many developers would suggest that using eval is "evil", as there are obvious security concerns with potentially unsanitized JavaScript code executing within the context of the MongoDB instance; normally MongoDB is immune to that type of injection attack.
Some of the performance issues of using JavaScript in MongoDB via the eval command are mitigated in version 2.4, as multiple JavaScript operations can execute at the same time (depending on the setting of the nolock option). By default, though, it takes a global lock (which appears to be what you specifically want).
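For illustration, here is a hedged sketch of invoking the eval command from the legacy Java driver; the trivial function is made up just to show the command shape, and db is assumed to be an existing com.mongodb.DB handle.
import com.mongodb.BasicDBObject;
import com.mongodb.CommandResult;
import org.bson.types.Code;

// Run server-side JavaScript via the eval command. Setting nolock to true asks the
// server not to take the global write lock (only appropriate if the function does
// not write); leave it at its default of false if you are relying on the global
// lock for transactional isolation.
CommandResult res = db.command(new BasicDBObject("eval",
        new Code("function(a, b) { return a + b; }"))
        .append("args", java.util.Arrays.asList(1, 2))
        .append("nolock", true));
res.throwOnError();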
When eval is used to try to perform an (ACID-like) transactional update to several documents, there is one primary concern. The biggest issue is that, if all operations must succeed for the data to be in a consistent state, the developer is running the risk that a failure midway through the operation (a hardware failure, for example) may leave a partially complete update in the database. Depending on the nature of the work being performed, replication settings, etc., the data may be OK, or it may not.
For situations where database corruption could occur as a result of a partially complete eval operation, I would suggest considering an alternative schema design and avoiding eval. That's not to say it wouldn't work 99.9999% of the time; ultimately it's up to you to decide whether it's worth the risk.
In the case you describe, there are a few options:
{ version: 7, isCurrent: true}
When a version 8 document becomes current, you could for example:
Create a second document that contains only the current version; updating it would be an atomic $set operation. It would mean that all reads would potentially need to read the "find the current version" document first, followed by a read of the full document.
Use a timestamp in place of a boolean value, and find the most current document based on the timestamp (your code could also clear out the fields of older documents, if desired, once the now-current document has been set); a sketch of this option follows below.
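As a rough sketch only, using the legacy Java driver: the collection name docs and the field names version and updatedAt are invented for illustration, and db is assumed to be an existing com.mongodb.DB handle.
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

DBCollection docs = db.getCollection("docs"); // db is an existing DB handle (assumed)

// Write the new version with a fresh timestamp; no "isCurrent" flag has to be
// flipped on the older documents.
docs.insert(new BasicDBObject("version", 8).append("updatedAt", new java.util.Date()));

// The current document is simply the one with the newest timestamp.
DBObject current = docs.find()
        .sort(new BasicDBObject("updatedAt", -1))
        .limit(1)
        .next();
A descending index on updatedAt would keep that lookup cheap.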

atomic operations and atomic transactions

Can someone explain to me what the difference is between atomic operations and atomic transactions? It seems to me that these two are the same thing. Is that correct?
The concept of atomicity is common to atomic transactions and atomic operations, but the two terms usually belong to different domains.
Atomic Transactions are associated with Database operations where a set of actions must ALL complete or else NONE of them complete. For example, if someone is booking a flight, you want to both get payment AND reserve the seat OR do neither. If either one were allowed to succeed without the other also succeeding, the database would be inconsistent.
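As a hedged illustration (plain JDBC against a hypothetical payments/seats schema, none of which comes from the answer itself), an atomic transaction groups both statements so that they commit or roll back together:
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Either the payment row is inserted AND the seat is reserved, or neither happens.
void bookFlight(Connection conn, int customerId, int seatId, double amount) throws SQLException {
    conn.setAutoCommit(false); // start a transaction
    try (PreparedStatement payment = conn.prepareStatement(
             "INSERT INTO payments (customer_id, amount) VALUES (?, ?)");
         PreparedStatement seat = conn.prepareStatement(
             "UPDATE seats SET customer_id = ? WHERE id = ? AND customer_id IS NULL")) {
        payment.setInt(1, customerId);
        payment.setDouble(2, amount);
        payment.executeUpdate();

        seat.setInt(1, customerId);
        seat.setInt(2, seatId);
        if (seat.executeUpdate() == 0) {
            throw new SQLException("seat already taken");
        }
        conn.commit();   // both changes become visible together
    } catch (SQLException e) {
        conn.rollback(); // neither change is applied
        throw e;
    }
}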
Atomic Operations, on the other hand, are usually associated with low-level programming with regard to multi-processing or multi-threading applications, and are similar to Critical Sections.
For example, if two threads both access and modify the same variable, each thread goes through the following steps:
Read the variable from storage into local memory.
Modify the value in local memory.
Write the modified value back to the original storage location.
But in a multi-threaded system, a context switch or interrupt might happen after the first thread has read the value but before it has written it back. A second thread (or the interrupt handler) will then read and modify the OLD value and write its modified value back to storage. When the first thread resumes, it doesn't know that anything has changed, so it writes back its change based on the original value. Hence the modification that the second thread made to the variable is lost.
If an operation is atomic, it is guaranteed to complete without being interrupted once it begins. This is usually accomplished using hardware-level primitives like Test-and-Set or Compare-and-Swap.
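A small, self-contained Java sketch of that guarantee: java.util.concurrent.atomic.AtomicInteger performs the read-modify-write as a single atomic step (a compare-and-swap under the hood), so the lost-update scenario described above cannot occur.
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounterDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger counter = new AtomicInteger(0);

        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter.incrementAndGet(); // atomic read-modify-write; a plain int++ here could lose updates
            }
        };

        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();

        System.out.println(counter.get()); // always 200000
    }
}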
To get a wider picture, you can take a look at:
MySQL Transactions and Atomic Operations
Atomicity (database systems)
Atomicity (Programming)
Some quotes from the above-cited resources:
About databases:
In an atomic transaction, a series of database operations either all occur, or nothing occurs. A guarantee of atomicity prevents updates to the database occurring only partially, which can cause greater problems than rejecting the whole series outright. In other words, atomicity means indivisibility and irreducibility.
About programming:
In concurrent programming, an operation (or set of operations) is atomic, linearizable, indivisible or uninterruptible if it appears to the rest of the system to occur instantaneously. Atomicity is a guarantee of isolation from concurrent processes. Additionally, atomic operations commonly have a succeed-or-fail definition — they either successfully change the state of the system, or have no apparent effect.
I have seen the word transaction used more often for databases, and operation in programming, especially kernel-level programming.
In a statement:
An atomic transaction is the smallest set of operations that performs the required steps. Either all of those required operations happen (successfully) or the atomic transaction fails.
An atomic operation usually has nothing to do with transactions. To my knowledge the term comes from hardware programming, where a set of operations (or a single one) must appear to complete instantaneously.

Why use locking in MongoDB?

MongoDB is from the NoSQL era, and isn't locking something related to RDBMSs? From Wikipedia:
Optimistic concurrency control (OCC) is a concurrency control method for relational database management systems...
So why do I find is_locked in PyMongo, and why does a lock still exist even in a driver that makes non-blocking calls (Motor has is_locked)?
NoSQL does not automatically mean no locks.
There are always some operations that do require a lock.
For example, building an index.
And the official MongoDB documentation is a more reliable source than Wikipedia (no offense meant to Wikipedia :) ).
http://docs.mongodb.org/manual/faq/concurrency/
MongoDB does in-place updates, so it needs to lock in order to modify the database. There are other things that need locks as well, so read the link @Tigra provided for more info.
This is pretty standard as far as databases go, and it isn't an RDBMS-specific thing (Redis also does this, but on a per-key basis).
There are plans to implement collection-level (instead of database-level) locking: https://jira.mongodb.org/browse/SERVER-1240
Some databases, like CouchDB, get around the locking problem by only appending new documents. They create a new, unique revision id and once the document is finished writing, the database points to the new revision. I'm sure there's some kind of concurrency control when changing which revision is used, but it doesn't need to block the database to do that. There are certain downsides to this, such as compaction needing to be run regularly.
MongoDB implements a database-level locking system. This means that operations which are not atomic will lock at the database level, unlike SQL, where most engines lock at the table level for basic operations.
In-place updates only occur with certain operators ($set being one of them); the MongoDB documentation used to have a page listing all of them, but I can't find it now.
MongoDB currently implements a read/write lock whereby each is separate, but they can block each other.
Locks are utterly vital to any database. For example, how can you ensure a consistent read of a document if it is currently being written to? And if you write to the document, how do you ensure that you apply only that single update at once and not multiple updates at the same time?
I am unsure how revision control can prevent this in CouchDB; locks are really quite vital for a consistent read and are separate from revision control, i.e. what if you wish to apply a read lock to the same version, or read a document that is currently being written to a new revision? You will obviously see a lock queue appear. Even though revision control might help a little with write-lock saturation, there will still be a write lock and it will still need to operate at some level.
As for concurrency features: MongoDB has the ability, for one, to yield an operation to other operations if the data it needs is not in RAM. This means that locks will not just sit there waiting for data to be paged in; other operations will run in the meantime.
As a side note, MongoDB actually has more locks than this; it also has a JavaScript lock, which is global and blocking and does not have the normal concurrency features of regular locks.
and even in driver that makes non-blocking calls
Hmm, I think you might be confused about what is meant by a "non-blocking" application or server: http://en.wikipedia.org/wiki/Non-blocking_algorithm

mongodb: any idiom for collapsing 2 upserts into one: create, modifier-to-set-default-values, modifier-to-update-existing-document

I want to avoid doing two operations to achieve the following:
Find document, update with modifier-1.
If the document does not exist, populate default fields with modifier-2, then update with modifier-1.
It's a common pattern, so it should be possible. At the moment I am having to do two upserts.
(Feel free to adjust the pseudocode; I am new to the query language.)
update( {...}, modifier-1, true)
if (upserted)
{
    // check for a race condition: detect whether another query from another
    // thread has not already populated the default values.
    update( {..., if_a_default_value_does_not_exist}, modifier-2, true)
}
I assume that two operations would result in two disk writes; I understand MongoDB does asynchronous disk writes. If I can't do this with one operation, is there some sort of mechanism in place that would merge the writes into a single write before writing to the journal / disk? And yes, this would make a significant difference when loading my 300 GB data set :D
Hassan,
The asynchronous writes to disk you mentioned are accomplished by writing the changes to memory and then fsync'ing them onto disk periodically in the background, so merging the two operations would likely not impact performance here as much as you would think.
The journal is another matter entirely - it is written separately to disk in an idempotent manner for safety to allow for easier recovery/restoration in case of failure or other similar issues. You can always start the DB with journaling off, do the import, and then restart with journaling enabled once the bulk update is done if the journal writes are causing you significant issues.
Finally, be careful with the "not exists" logic in your second modifier: from an indexing perspective a positive operator such as exists is preferred, otherwise indexes may not be used and that will certainly slow down your inserts.
Away from bulk inserts, for single atomic updates you can also explore the use of findAndModify (http://www.mongodb.org/display/DOCS/findAndModify+Command) to do the check and the subsequent change for you; it's hard to tell from the description whether that would be a good fit, because it has its own drawbacks. A rough sketch of the call shape follows.
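For illustration only (the legacy Java driver is assumed, collection is an existing DBCollection handle, and the _id/hits names are invented rather than taken from the question), findAndModify with upsert applies the update and returns the resulting document in one atomic round trip:
import com.mongodb.BasicDBObject;
import com.mongodb.DBObject;

DBObject query  = new BasicDBObject("_id", "some-key");
DBObject update = new BasicDBObject("$inc", new BasicDBObject("hits", 1));

// Arguments: query, fields, sort, remove, update, returnNew, upsert.
// With upsert=true the document is created if missing and updated otherwise,
// and the post-update document is returned because returnNew=true.
DBObject result = collection.findAndModify(query, null, null, false, update, true, true);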

mongodb: should i always use the 'safe' option on updates

When dealing with MongoDB, when should I use the {safe: true} option on queries?
Right now I use the 'safe' option just to check whether my queries were inserted or updated successfully. However, I feel this might be overkill.
Should I assume that 99% of the time my queries (assuming they are properly written) will be inserted/updated, and not have to worry about checking whether they succeeded?
Thoughts?
Assuming that when you say queries you actually mean writes/inserts (the wording of your question makes me think so), the write concern (safe, none, fsync, etc.) can be used to get more speed and less safety when that is acceptable, or less speed and more safety when that is necessary.
As an example, a hypothetical Facebook-style application could use an unsafe write for "Likes" while using a very safe write for password changes. The logic behind this is that there will be many thousands of "Like"-style updates happening every second, and it doesn't matter if one is lost, whereas password updates happen less regularly but it is essential that they succeed.
Therefore, try to tailor your Write Concern choice to the kind of update you are doing, based upon your speed and data integrity requirements.
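A minimal sketch of that per-operation tailoring with the legacy Java driver; the likes and users collection handles, the field names, and newHash are invented for illustration.
import com.mongodb.BasicDBObject;
import com.mongodb.WriteConcern;

// High-volume, low-value writes: fire-and-forget.
likes.setWriteConcern(WriteConcern.NONE);
likes.insert(new BasicDBObject("postId", 42).append("userId", 7));

// Critical writes: wait for acknowledgement and replication before returning.
users.setWriteConcern(WriteConcern.REPLICAS_SAFE);
users.update(new BasicDBObject("_id", 7),
             new BasicDBObject("$set", new BasicDBObject("passwordHash", newHash)));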
Here is another use case where unsafe writes are an appropriate choice: you are making a large number of writes in very short order. In this case you might perform a number of writes and then call getLastError to see whether any of them failed.
// Use a fast write concern while the bulk of the inserts go through.
collection.setWriteConcern(WriteConcern.NORMAL);
collection.getDB().resetError();

// importData is a List<Something> of the records to be inserted.
for (Something data : importData) {
    collection.insert(makeDBObject(data));
}

// A single round trip at the end reports whether any of the writes failed.
collection.getDB().getLastError(WriteConcern.REPLICAS_SAFE).throwOnError();
If this block succeeds without an exception, then all of the data was inserted successfully. If there was an exception, then one or more of the write operations failed, and you will need to retry them (or check for a unique index violation, etc). In real life, you might call getLastError every 10 writes or so, to avoid having to resubmit lots of requests.
This pattern is very nice for performance when performing bulk inserts of large amounts of data.
Safe is only necessary on writes, not reads. Queries are only reads.