mongodb: atomically rename two collections?

I have two existing collections "A" and "B". I need to rename "B" to "C", and rename "A" to "B", without permitting any writes to B during that time. The rename itself activates the global lock, but I need to prevent writes from occurring in between renames. Is this possible?
Here's my code:
db.B.renameCollection('C')
// <-- prevent writes from occurring to B in between these two commands
db.A.renameCollection('B')
Edit: I'm using mongodb version 1.8.1, and changing versions is not currently an option.

MongoDB itself cannot handle this; the only way you could do it is with some custom code.
If this will only occur one time in your app (I guess renaming collections is not something that is done often), you could take a more 'aggressive' approach: keep a flag in your database that means 'db.B has been renamed but db.A not yet'. If all your writes check for this flag before submitting to the server, and simply return if it is set, the app is protected from writing to B in between the two renames.
I consider this the 'aggressive' approach since it clearly affects performance (still, reads are so fast you probably won't feel it).
If your app runs on a single web server (and not a web farm) you can keep the synchronization mechanism in the web app itself, using thread synchronization tools like semaphores, or even a thread-safe variable that serves as the flag I suggested above (this depends on the server-side technology you are using).
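For illustration, a minimal sketch of that flag approach in the mongo shell; the locks collection and the safeWrite helper are assumptions for this example, not part of any driver:
// Set the flag before starting the renames
db.locks.insert({_id: "rename_in_progress"});

// Every application write checks the flag first and bails out if it is set
function safeWrite(doc) {
    if (db.locks.findOne({_id: "rename_in_progress"})) {
        return; // renames in progress: skip, queue, or retry the write
    }
    db.B.insert(doc);
}

// After both renames have completed, clear the flag
db.locks.remove({_id: "rename_in_progress"});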

You can create a function (call it renameCollections) that performs both renames and run it through eval with the lock held:
db.runCommand({eval: renameCollections, args: ["Collection1", "Collection2"], nolock: false});
With nolock set to false, the lock lets you do this kind of operation safely and makes other requests wait while it runs.
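Expanded into a sketch (the renameCollections body below is my guess at what this answer intends, and it assumes the shell-style renameCollection helper is available to server-side JavaScript):
// Runs entirely under eval's global write lock, so no write can interleave
var renameCollections = function(a, b, c) {
    db[b].renameCollection(c);  // B -> C
    db[a].renameCollection(b);  // A -> B
};
db.runCommand({eval: renameCollections, args: ["A", "B", "C"], nolock: false});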

As you might guess: this is not possible. There is no transaction support, only atomic operations.

MongoDB has no notion of transactional renames; in fact I am not sure SQL does in this case either. However, you could accomplish this with a bit of server-side programming and a lock collection.
From your server-side language you can fire off the commands while writing a row to a lock collection; each query against B then checks for the lock, writes if it is not found, and bails out otherwise.
This is a simple method, though most likely a bit tedious, especially if you have a very segmented code base that does not put a standardised query layer between the server-side code and the database.
I should also note that renameCollection will not work on sharded collections; you most likely already knew that, but I thought I would say it anyway. For a sharded collection it would be better to "move" the collection instead via copy operations.
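A hedged sketch of that lock-collection idea in the mongo shell (collection and field names are invented for illustration); inserting on a fixed _id makes acquiring the lock atomic, since a second insert fails with a duplicate-key error:
// Acquire the lock row, then perform the renames
db.locks.insert({_id: "migration", at: new Date()});
if (db.getLastErrorObj().err === null) {   // insert succeeded: we hold the lock
    db.B.renameCollection('C');
    db.A.renameCollection('B');
    db.locks.remove({_id: "migration"});   // release
}

// Each write against B checks for the lock first and bails out if present
if (db.locks.findOne({_id: "migration"}) === null) {
    db.B.insert({some: "document"});
}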

I work for Tokutek on TokuMX, which has multi-statement transactions.
As other answers have said, MongoDB cannot do this (to the best of my knowledge), but TokuMX can. TokuMX has multi-statement transactions on non-sharded clusters. To perform this operation, you can do:
db.beginTransaction()
db.B.renameCollection('C')
db.A.renameCollection('B')
db.commitTransaction()

Related

MongoDB: Switch database/collection referenced by a given name on the fly

My application needs only read access to all of its databases. One of those databases (db_1) hosts a collection coll_1 whose entire contents* need to be replaced periodically**.
My goal is to have no or very little effect on read performance for servers currently connected to the database.
Approaches I have been able to think of so far:
1. renameCollection
Build a temporary collection coll_tmp, then use renameCollection with dropTarget: true to move its contents over to coll_1 (see the sketch after this list). The downside of this approach is that, as far as I can tell, renameCollection does not copy indexes, so once the collection is renamed, coll_1 would need reindexing. While I don't have a good estimate of how long that would take, I would expect query performance to be significantly affected until reindexing completes.
2. TTL Index
Instead of replacing outright, use a time-to-live index to expire documents after the chosen replacement period, and insert new data every period. This seems like a decent solution to me, except that for our specific application old data is better than no data: if the cron job that repopulates the database fails for whatever reason, we could be left with an empty coll_1, which is undesirable. The effect might be negligible, but this solution also requires on-the-fly indexing as every document is inserted.
3. Communicate current database to read-clients
Simply use two different databases (or collections?) and inform connected clients which one is more recent. This solution would allow finishing the indexing of the new coll_1_alt (and then coll_1 again) before making it available. I personally dislike this solution since it couples the read clients very closely to the database itself, and of course communication channels are always imperfect.
4. copyDatabase
Use copyDatabase to rename (designate) an alternate database db_tmp to db_1. db_tmp would also have a collection coll_1. Once reindexing is complete on db_tmp.coll_1, copyDatabase could be used to simply rename db_tmp to db_1. It seems this would require dropping db_1 before renaming, leaving a window in which the data won't be accessible.
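For concreteness, the shell commands behind options 1 and 2 would look roughly like this (collection names follow the question; the createdAt field is an assumption):
// Option 1: swap the freshly built collection into place, dropping the old one
db.coll_tmp.renameCollection("coll_1", true);   // second argument is dropTarget

// Option 2: a TTL index expiring documents 24 hours after their createdAt
db.coll_1.ensureIndex({createdAt: 1}, {expireAfterSeconds: 86400});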
Ideally (and naively), I'd just set db_1 to be something akin to a symlink, switching to the most current database as needed.
Does anyone have good suggestions on how to achieve the desired effect?
*There are about 10 million documents in coll_1.
** The current plan is to replace the collection once every 24 hours. The replacement interval might get as low as once every 30 minutes, but not lower.
The problem you point out in option 4 also applies to option 1: dropTarget likewise means that the collection is briefly unavailable.
Another alternative could be to keep both the old and the new data in the same collection and use a "version ID" that you then still have to communicate to your clients so they can query on it. That at least spares you the reindexing you pointed out for option 1.
I think your best bet is actually option 3; it's the closest thing to switching a symlink, except it happens on the client side.
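A minimal sketch of that version-ID alternative (the versions collection and all field names are illustrative):
// Writer: load the new batch tagged with a fresh version, then publish it
db.coll_1.insert({versionId: 8, data: "..."});
db.versions.update({_id: "coll_1"}, {$set: {current: 8}}, true);  // upsert

// Reader: look up the published version once, then query with it
var v = db.versions.findOne({_id: "coll_1"}).current;
db.coll_1.find({versionId: v});

// Cleanup: superseded documents can be removed at leisure
db.coll_1.remove({versionId: {$lt: v}});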

Is eval that evil?

I understand that eval locks the whole database, which can't be good for throughput - however I have a scenario where a very specific transaction involving several documents must be isolated.
Because that transaction does not happen very often and is fairly quick (a few updates on indexed queries), I was thinking of using eval to execute it.
Are there any pitfalls that I should be aware of (I have seen several eval=evil posts, but without much explanation)?
Does it make a difference if the database is part of a replica set?
Many developers would say eval is "evil" because there are obvious security concerns with potentially unsanitized JavaScript code executing within the context of the MongoDB instance; normally MongoDB is immune to those types of injection attacks.
Some of the performance issues of using JavaScript in MongoDB via the eval command are mitigated in version 2.4, as multiple JavaScript operations can execute at the same time (depending on the setting of the nolock option). By default, though, it takes a global lock (which is what you specifically want, apparently).
When eval is used to try to perform an (ACID-like) transactional update to several documents, there is one primary concern: if all operations must succeed for the data to be in a consistent state, the developer runs the risk that a failure mid-way through the operation (a hardware failure, for example) leaves a partially complete update in the database. Depending on the nature of the work being performed, replication settings, etc., the data may be OK, or may not.
For situations where database corruption could occur as a result of a partially complete eval operation, I would suggest considering an alternative schema design and avoiding eval. That's not to say it wouldn't work 99.9999% of the time; it's ultimately up to you to decide whether the risk is worth it.
In the case you describe, where documents carry a marker like:
{ version: 7, isCurrent: true }
there are a few options when a version 8 document becomes current. You could, for example:
1. Create a second document that holds the current version number; updating it is a single atomic set operation. It would mean that all reads potentially need to fetch the "find the current version" document first, followed by the read of the full document.
2. Use a timestamp in place of a boolean value and find the most current document by timestamp (your code could clear out the fields of older documents, if desired, once the now-current document has been set).
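A sketch of the timestamp variant (the config collection and field names are assumed):
// Publishing a new current document: just insert it with a fresh timestamp
db.config.insert({version: 8, publishedAt: new Date()});

// Readers pick the newest document by timestamp; no flag needs flipping
var current = db.config.find().sort({publishedAt: -1}).limit(1).next();

// Optionally prune superseded documents afterwards
db.config.remove({publishedAt: {$lt: current.publishedAt}});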

Why use Locking in MongoDB?

MongoDB is from the NoSQL era, and isn't a lock something that belongs to RDBMS? From Wikipedia:
Optimistic concurrency control (OCC) is a concurrency control method for relational database management systems...
So why do I find is_locked in PyMongo, and why does Lock still exist even in a driver that makes non-blocking calls: Motor also has is_locked.
NoSQL does not automatically mean no locks.
There are always some operations that do require a lock; building an index, for example.
And the official MongoDB documentation is a more reliable source than Wikipedia (no offense meant to Wikipedia :)):
http://docs.mongodb.org/manual/faq/concurrency/
Mongo does in-place updates, so it needs to lock in order to modify the database. There are other things that need locks too, so read the link @Tigra provided for more info.
This is pretty standard as far as databases and it isn't an RDBMS-specific thing (Redis also does this, but on a per-key basis).
There are plans to implement collection-level (instead of database-level) locking: https://jira.mongodb.org/browse/SERVER-1240
Some databases, like CouchDB, get around the locking problem by only appending new documents. They create a new, unique revision id and once the document is finished writing, the database points to the new revision. I'm sure there's some kind of concurrency control when changing which revision is used, but it doesn't need to block the database to do that. There are certain downsides to this, such as compaction needing to be run regularly.
MongoDB implements database-level locking. This means that operations which are not atomic lock at the database level, unlike most SQL engines, which lock at the table level for basic operations.
In-place updates only occur with certain operators ($set being one of them); the MongoDB documentation used to have a page listing all of them, but I can't find it now.
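For example, an in-place modification with $set (collection and field invented for illustration):
// $set rewrites just the named field; since the document does not grow,
// the update can happen in place instead of moving the document on disk
db.users.update({_id: 1}, {$set: {lastSeen: new Date()}});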
MongoDB currently implements a read/write lock whereby each is separate but they can block each other.
Locks are utterly vital to any database, for example, how can you ensure a consistent read of a document if it is currently being written to? And if you write to the document how do you ensure that you only apply that single update at once and not multiple updates at the same time?
I am unsure how version control alone can prevent this in CouchDB; locks are really quite vital for a consistent read and are separate from version control. For example, what if you wish to apply a read lock to the same version, or read a document that is currently being written to a new revision? You will obviously see a lock queue appear. Even though version control might help a little with write-lock saturation, there will still be a write lock, and it will still need to operate at some level.
As for concurrency features: MongoDB has the ability (for one) to yield an operation when the data it needs is not in RAM. This means that locks will not just sit there waiting for data to be paged in; other operations can run in the meantime.
As a side note, MongoDB actually has more locks than this: it also has a JavaScript lock, which is global and blocking and does not have the normal concurrency features of regular locks.
and even in driver that makes non-blocking calls
Hmm, I think you might be confused by what is meant by a "non-blocking" application or server: http://en.wikipedia.org/wiki/Non-blocking_algorithm

At what level does the MongoDB write lock take place?

I am beginning to research technology for a project that may have frequent large writes. I am wondering at what level the MongoDB write lock takes place: is it at the server level or the database level? I have read http://www.mongodb.org/display/DOCS/How+does+concurrency+work, but the official documentation says a write operation can block all other operations.
To me this means write locks are server-level, but I am hoping they are database-level. Could someone please confirm or deny this?
At the moment, MongoDB does indeed have a global server lock. However, there is some additional code that releases the lock when memory blocks have to be loaded from disk; it uses lock yielding for that. Although this does not solve all concurrency issues, it addresses quite a few of the generally associated problems. This post describes it well: http://blog.pythonisito.com/2011/12/mongodbs-write-lock.html
From MongoDB 2.2 there will be a database-level lock, and more work on yielding has also been done.
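You can observe the global lock from the shell; a quick sketch (the exact output fields vary by server version):
// Global lock statistics accumulated since startup
printjson(db.serverStatus().globalLock);

// Operations currently holding or waiting for locks
printjson(db.currentOp());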

Is there any method to guarantee a transaction from the user end?

Since MongoDB does not support transactions, is there any way to guarantee one?
What do you mean by "guarantee transaction"?
There are two concepts in MongoDB that come close:
Atomic operations
Using safe mode / getlasterror ...
http://www.mongodb.org/display/DOCS/Last+Error+Commands
If you simply need to know whether there was an error when you run an update, for example, you can use the getlasterror command. From the docs:
getlasterror is primarily useful for write operations (although it is set after a command or query too). Write operations by default do not have a return code: this saves the client from waiting for client/server turnarounds during write operations. One can always call getLastError if one wants a return code.
If you're writing data to MongoDB on multiple connections, then it can sometimes be important to call getlasterror on one connection to be certain that the data has been committed to the database. For instance, if you're writing to connection #1 and want those writes to be reflected in reads from connection #2, you can assure this by calling getlasterror after writing to connection #1.
Alternatively, you can use atomic operations for cases where you need to increment a value (an upvote, for example); more about that here:
http://www.mongodb.org/display/DOCS/Atomic+Operations
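In the shell, the two concepts look like this (the scores collection and its fields are illustrative):
// Safe mode: write, then ask the server whether the write succeeded
db.scores.insert({user: "alice", votes: 0});
var err = db.runCommand({getlasterror: 1});
if (err.err !== null) {
    print("write failed: " + err.err);
}

// Atomic operation: increment a counter in a single server-side step
db.scores.update({user: "alice"}, {$inc: {votes: 1}});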
As a side note, MySQL's default storage engine doesn't have transactions either! :)
http://dev.mysql.com/doc/refman/5.1/en/myisam-storage-engine.html
MongoDB only supports atomic operations. There is no way to implement transactions in the ACID sense on top of MongoDB; such transaction support would have to be implemented in the core. But you will never see full transaction support due to the CAP theorem: you cannot have speed, durability and consistency at the same time.
I think it's one of the things you choose to forego when you choose a NoSQL solution.
If transactions are required, perhaps NoSQL is not for you. Time to go back to ACID relational databases.
Unfortunately MongoDB doesn't support transactions out of the box, but you can actually implement ACID optimistic transactions on top of it. I wrote an example and some explanation on a GitHub page.
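The usual shape of such an optimistic scheme is a compare-and-swap on a version field; a minimal sketch, not the linked GitHub implementation (collection and field names assumed):
// Read the document together with its current version
var doc = db.accounts.findOne({_id: 1});

// Write back only if nobody else bumped the version in the meantime
db.accounts.update(
    {_id: 1, version: doc.version},                            // compare
    {$set: {balance: doc.balance - 10}, $inc: {version: 1}}    // swap
);
// If no document matched, it changed underneath us: retry the whole read-modify-write
if (db.getLastErrorObj().n === 0) {
    print("conflict, retry the transaction");
}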