Is it possible to default all MongoDB writes to safe? What is the performance hit from doing this? - mongodb

For MongoDB 2.2.2, is it possible to default all writes to safe, or do you have to include the right flags for each write operation individually?
If you use safe writes, what is the performance hit?
We're using MongoMapper on Rails.

If you are using the latest version of 10gen official drivers, then the default actually is safe, not fire-and-forget (which used to be the default).
You can read this blog post by 10gen co-founder and CTO which explains some history and announces that all MongoDB drivers as of late November use "safe" mode by default rather than "fire-and-forget".
MongoMapper is built on top of 10gen supported Ruby Driver, they also updated their code to be consistent with the new defaults. You can see the check-in and comments here for the master branch. Since I'm not certain what their release schedule is, I would recommend you ask on MongoMapper mailing list.
Even prior to this change you could set "safe" value on connection level in MongoMapper which is as good as global. Starting with 0.11, you can do it in mongo.yml file. You can see how in the release notes.
The bottom line is that you don't have to specify safe mode for each write, but you can still specify higher or lower than default durability for each individual write operation if you wish, but when you switch to the latest versions of everything, then you will be using "safe writes" globally by default.

I do not use mongomapper so I can only answer a little.
In terms of the database, depends. A safe write is basically (basically being the keyword) waiting for the database to do what it would normally do after you got a default "I am done" response from a fire and forget.
There is more work depending on how safe you want the write to be. A good example is a write to a single node and one to many nodes. If you write to that single node you will get a quicker response from the database than if you wish to replicate the command (safely) to other nodes in the network.
Any amount of safe write does, of course, cause a performance hit in the amount of writes you can send to the server since there is more work required before a response is given which means less writes able to be thrown at the database. The key thing is getting the balance just right for your application, between speed and durability of your data.
Many drivers now (including MongoDB PHP 1.3, using a write concern of 1: http://php.net/manual/en/mongo.writeconcerns.php ) are starting to use normal safe writes by default and normal fire and forget queries are starting to abolished by default.
By looking at the mongomapper documentation: http://mongomapper.com/documentation/plugins/safe.html it seems you must still add the flags everywhere.

Related

Is eval that evil?

I understand that eval locks the whole database, which can't be good for throughput - however I have a scenario where a very specific transaction involving several documents must be isolated.
Because that transaction does not happen very often and is fairly quick (a few updates on indexed queries), I was thinking of using eval to execute it.
Are their any pitfalls that I should be aware of (I have seen several eval=evil posts but without much explanation)?
Does it make a difference if the database is part of a replica set?
Many developers would suggest using eval is "evil" as their are obvious security concerns with potentially unsanitized JavaScript code executing within the context of the MongoDB instance. Normally MongoDB is immune to those types of injection attacks.
Some of the performance issues of using JavaScript in MongoDB via the eval command are mitigated in version 2.4, as muliple JavaScript operations can execute at the same time (depending on the setting of the nolock option). By default though, it takes a global lock (which is what you specifically want apparently).
When a eval is being used to try to perform an (ACID-like) transactional update to several documents, there's one primary concern. The biggest issue is that if all operations must succeed for the data to be in a consistent state, the developer is running the risk that a failure mid-way through the operation may result in a partially complete update to the database (like a hardware failure for example). Depending on the nature of the work being performed, replication settings, etc., the data may be OK, or may not.
For situations where database corruption could occur as a result of a partially complete eval operation, I would suggest considering an alternative schema design and avoiding eval. That's not to say that it wouldn't work 99.9999% of the time, it's really up to you to decide ultimately whether it's worth the risk.
In the case you describe, there are a few options:
{ version: 7, isCurrent: true}
When a version 8 document becomes current, you could for example:
Create a second document that contains the current version, this would be an atomic set operation. It would mean that all reads would potentially need to read the "find the current version" document first, followed by the read of the full document.
Use a timestamp in place of a boolean value. Find the most current document based on timestamp (and your code could clear out the fields of older documents if desired once the now current document has been set)

Why using Locking in MongoDB?

MoongoDB is from the NoSql era, and Lock is something related to RDBMS? from Wikipedia:
Optimistic concurrency control (OCC) is a concurrency control method for relational database management systems...
So why do i find in PyMongo is_locked , and even in driver that makes non-blocking calls, Lock still exists, Motor has is_locked.
NoSQL does not mean automatically no locks.
There always some operations that do require a lock.
For example building of index
And official MongoDB documentation is a more reliable source than wikipedia(none offense meant to wikipedia :) )
http://docs.mongodb.org/manual/faq/concurrency/
Mongo does in-place updates, so it needs to lock in order to modify the database. There are other things that need locks, so read the link #Tigra provided for more info.
This is pretty standard as far as databases and it isn't an RDBMS-specific thing (Redis also does this, but on a per-key basis).
There are plans to implement collection-level (instead of database-level) locking: https://jira.mongodb.org/browse/SERVER-1240
Some databases, like CouchDB, get around the locking problem by only appending new documents. They create a new, unique revision id and once the document is finished writing, the database points to the new revision. I'm sure there's some kind of concurrency control when changing which revision is used, but it doesn't need to block the database to do that. There are certain downsides to this, such as compaction needing to be run regularly.
MongoDB implements a Database level locking system. This means that operations which are not atomic will lock on a per database level, unlike SQL whereby most techs lock on a table level for basic operations.
In-place updates only occur on certain operators - $set being one of them, MongoDB documentation did used to have a page that displayed all of them but I can't find it now.
MongoDB currently implements a read/write lock whereby each is separate but they can block each other.
Locks are utterly vital to any database, for example, how can you ensure a consistent read of a document if it is currently being written to? And if you write to the document how do you ensure that you only apply that single update at once and not multiple updates at the same time?
I am unsure how version control can stop this in CouchDB, locks are really quite vital for a consistent read and are separate to version control, i.e. what if you wish to apply a read lock to the same version or read a document that is currently being written to a new revision? You will obviously see a lock queue appear. Even though version control might help a little with write lock saturation there will still be a write lock and it will still need to work on a level.
As for concurrency features; MongoDB has the ability (for one), if the data is not in RAM, to subside a operation for other operations. This means that locks will not just sit there waiting for data to be paged in and other operations will run in the mean time.
As a side note, MongoDB actually has more locks than this, it also has a JavaScript lock which is global and blocking, it does not have the normal concurrency features of regular locks.
and even in driver that makes non-blocking calls
Hmm I think you might be confused by what is meant as a "non-blocking" application or server: http://en.wikipedia.org/wiki/Non-blocking_algorithm

mongodb: atomically rename two collections?

I have two existing collections "A" and "B". I need to rename "B" to "C", and rename "A" to "B", without permitting any writes to B during that time. The rename itself activates the global lock, but I need to prevent writes from occurring in between renames. Is this possible?
Here's my code:
db.B.renameCollection('C')
<-- prevent writes from occurring to B in between commands
db.A.renameCollection('B')
Edit: I'm using mongodb version 1.8.1, and changing versions is not currently an option.
Mongodb itself cannot handle this, the only way you could do this is with some custom code.
If this will only occur one time in your app ( I guess renaming collections is not something that is done often ) you could have a more 'aggressive' approach, where you search for a flag in your database that will mean 'collection db.B has been renamed but db.A not yet'. If all your writes check for this before submitting the write to the server and just return if the flag is set, it can protect the app from writing to db.A after db.B is renamed.
I consider this the 'aggressive' approach since it clearly affects performance ( still, reads are so fast, you probably won't feel it ).
If your app runs on a single web server (and not a web farm) you can have the synchronization mechanism on the web app itself, using thread synchronization tools like semaphores, etc or even some thread safe variable that will be used as the flag I suggested above. (depends on the server side technology you are using )
You can create a function named "renameCollection" and put a lock on it :
db.runCommand({eval:renameCollection,args:["Collection1","Collection2"],nolock:false});
The lock allows to do this kind of operations safely and make wait the requests
As you could guess: this is not possible. No transaction support, only atomic operations.
MongoDB has no sense of transactional renames, in fact I am not sure if SQL does in this case either, however you could accomplish this with a bit of server-side programming and a lock collection.
From your server side language you can fire off the commands while writing a row to a lock table, each query against B will check for lock, if not found will write otherwise will bail out.
This is a simple method however most likely a bit tedious, especially if you have a very segmented code base that does not house a standardised query layer between the server-side code and the database.
I should also note that renameCollection will not work on sharded collections, you most likely already knew that but I thought I would just say it anyway. In the case of sharded collection it would be better to "move" the collection instead via copy OPs.
I work for Tokutek on TokuMX with multi-statement transactions.
As other answers have said, MongoDB cannot do this (to the best of my knowledge), but TokuMX can. TokuMX has multi-statement transactions on non-sharded clusters. To perform this operation, you can do:
db.beginTransaction()
db.B.renameCollection('C')
db.A.renameCollection('B')
db.commitTransaction()

Incrementing hundreds of counters at once, redis or mongodb?

Background/Intent:
So I'm going to create an event tracker from scratch and have a couple of ideas on how to do this but I'm unsure of the best way to proceed with the database side of things. One thing I am interested in doing is allowing these events to be completely dynamic, but at the same time to allow for reporting on relational event counters.
For example, all countries broken down by operating systems. The desired effect would be:
US # of events
iOS - # of events that occured in US
Android - # of events that occured in US
CA # of events
iOS - # of events that occured in CA
Android - # of events that occured in CA
etc.
My intent is to be able to accept these event names like so:
/?country=US&os=iOS&device=iPhone&color=blue&carrier=Sprint&city=orlando&state=FL&randomParam=123&randomParam2=456&randomParam3=789
Which means in order to do the relational counters for something like the above I would potentially be incrementing 100+ counters per request.
Assume there will be 10+ million of the above requests per day.
I want to keep things completely dynamic in terms of the event names being tracked and I also want to do it in such a manner that the lookups on the data remains super quick. As such I have been looking into using redis or mongodb for this.
Questions:
Is there a better way to do this then counters while keeping the fields dynamic?
Provided this was all in one document (structured like a tree), would using the $inc operator in mongodb to increment 100+ counters at the same time in one operation be viable and not slow? The upside here being I can retrieve all of the statistics for one 'campaign' quickly in a single query.
Would this be better suited to redis and to do a zincrby for all of the applicable counters for the event?
Thanks
Depending on how your key structure is laid out I would recommend pipelining the zincr commands. You have an easy "commit" trigger - the request. If you were to iterate over your parameters and zincr each key, then at the end of the request pass the execute command it will be very fast. I've implemented a system like you describe as both a cgi and a Django app. I set up a key structure along the lines of this:
YYYY-MM-DD:HH:MM -> sorted set
And was able to process Something like 150000-200000 increments per second on the redis side with a single process which should be plenty for your described scenario. This key structure allows me to grab data based on windows of time. I also added an expire to the keys to avoid writing a db cleanup process. I then had a cronjob that would do set operations to "roll-up" stats in to hourly, daily, and weekly using variants of the aforementioned key pattern. I bring these ideas up as they are ways you can take advantage of the built in capabilities of Redis to make the reporting side simpler. There are other ways of doing it but this pattern seems to work well.
As noted by eyossi the global lock can be a real problem with systems that do concurrent writes and reads. If you are writing this as a real time system the concurrency may well be an issue. If it is an "end if day" log parsing system then it would not likely trigger the contention unless you run multiple instances of the parser or reports at the time of input. With regards to keeping reads fast In Redis, I would consider setting up a read only redis instance slaved off of the main one. If you put it on the server running the report and point the reporting process at it it should be very quick to generate the reports.
Depending on your available memory, data set size, and whether you store any other type of data in the redis instance you might consider running a 32bit redis server to keep the memory usage down. A 32b instance should be able to keep a lot of this type of data in a small chunk of memory, but if running the normal 64 bit Redis isn't taking too much memory feel free to use it. As always test your own usage patterns to validate
In redis you could use multi to increment multiple keys at the same time.
I had some bad experience with MongoDB, i have found that it can be really tricky when you have a lot of writes to it...
you can look at this link for more info and don't forget to read the part that says "MongoDB uses 1 BFGL (big f***ing global lock)" (which maybe already improved in version 2.x - i didn't check it)
On the other hand, i had a good experience with Redis, i am using it for a lot of read / writes and it works great.
you can find more information about how i am using Redis (to get a feeling about the amount of concurrent reads / writes) here: http://engineering.picscout.com/2011/11/redis-as-messaging-framework.html
I would rather use pipelinethan multiif you don't need the atomic feature..

What level does mongodb write lock take place?

I am beginning to research technology for a project, that can have frequent large writes. I am wondering at what level does mongo write lock take place? Is it at the server level or database level? I have read http://www.mongodb.org/display/DOCS/How+does+concurrency+work but official documentation says. a write operation can block all other operations.
To me this means write locks are server level but I am hoping they are db level. Could someone please confirm or deny this?
At the moment, MongoDB does indeed have a global server lock. However, there is some additional code that will release the lock in case memory blocks have to be loaded from disk. It uses lock-yielding for that. Although this does not solve all concurrency issues, it addresses quite a few of the generally associated problems. This post describes it well: http://blog.pythonisito.com/2011/12/mongodbs-write-lock.html
From MongoDB 2.2, there will be a database-level lock, and also more work on yielding is done.