Does a MongoDB MapReduce job lock the database? I am developing a multi-user MongoDB web application and am worried about multi-user conflicts and performance. Does anyone have any words of wisdom for me?
Simple answer? Sometimes ...
It depends a lot on how you are using map/reduce ... but in my experience it's never been a problem.
There isn't much info on this, but it's clearly stated in the docs that it does sometimes lock, although it "Allows substantial concurrent operation."
There are a couple of questions in the mongodb-user group asking about this ... the best response I've seen officially is that ... "in 1.4 it yields but isn't as nice as it should be, in 1.5 its much friendlier to other requests."
That does not mean that it doesn't block at all, but compared to db.eval() which blocks the whole mongod process ... it's your best bet.
That said, in 1.7.2 and up there is now a nolock option for db.eval() ...
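For illustration, here's a minimal shell sketch of what that looks like, assuming a made-up counters collection (nolock only makes sense when the function doesn't write):

    // Run server-side JavaScript without taking the global write lock.
    // The function must be read-only when nolock is used; "counters" is a made-up collection.
    db.runCommand({
        eval: function(name) { return db.counters.findOne({ _id: name }); },
        args: ["pageviews"],
        nolock: true
    });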
No, mapreduce does not lock the database. See the note here, just after "Using db.eval()" (it explains why mapreduce may be more appropriate to use than eval, because mapreduce does not block).
If you are going to run a lot of mapreduce jobs you should use sharding, because that way the job can run in parallel on all the shards. Unfortunately mapreduce jobs can't run on secondaries in a replica set, since the results must be written and replicas are read-only.
Version 2.1.0 added a "nonAtomic" flag to the output option.
See: https://jira.mongodb.org/browse/SERVER-2581
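For illustration only (collection and field names are made up), the flag goes inside the out document of a mapReduce call, and it is only valid with the merge and reduce output modes:

    // Hypothetical word-count style job over an "events" collection.
    db.events.mapReduce(
        function() { emit(this.type, 1); },                   // map
        function(key, values) { return Array.sum(values); },  // reduce
        {
            // nonAtomic tells the server not to hold the lock for the whole
            // post-processing phase while writing into "event_counts".
            out: { merge: "event_counts", nonAtomic: true }
        }
    );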
I just inherited an application from another developer, and I've been asked to fix some latency issues that users have been experiencing. The problem is that any page that makes db calls to mongo takes several minutes to load in the browser.
When I restart mongo, however, everything speeds up again, and the application functions normally. I see several cron jobs that run throughout the day, and I believe one of these may be causing mongo to slow down.
Unfortunately, I have no experience with mongo (only mysql), and I really don't have any idea of what I'm looking for in terms of things that could be making mongo run so slowly.
Anyways, I was hoping someone could suggest some potential things that could be causing the latency so I can approach this problem better. I have looked in the mongo logs, and the only thing I see that could be of concern is a message that says:
warning: can't find plugin [asc]
I know this may point to an indexing problem, but are there any other obvious things I should be investigating?
From what I read at https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/pqPvMq7cSBw it looks like one of your queries declared
db.a2.find().sort({a:"asc"})
rather than
db.a2.find().sort({a:1})
In MongoDB you need to declare your sort order with either 1 or -1; there are no asc or desc constants for sorting. So I would recommend that you check whether any of your queries runs incorrectly. You can see which queries are running through the log files (with the correct profiling settings): http://docs.mongodb.org/manual/tutorial/manage-the-database-profiler/ . You may also use mongotop (http://docs.mongodb.org/manual/reference/mongotop/) to see where the most time is spent reading/writing data for your collections.
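As a quick sketch of how you might do that from the shell (the 100 ms threshold here is just an arbitrary example value):

    // Profile operations slower than 100 ms in the current database.
    db.setProfilingLevel(1, 100);

    // Look at the five most recent profiled operations.
    db.system.profile.find().sort({ ts: -1 }).limit(5).pretty();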
I am beginning to research technology for a project that can have frequent large writes. I am wondering at what level the MongoDB write lock takes place. Is it at the server level or the database level? I have read http://www.mongodb.org/display/DOCS/How+does+concurrency+work, but the official documentation says a write operation can block all other operations.
To me this means write locks are server level but I am hoping they are db level. Could someone please confirm or deny this?
At the moment, MongoDB does indeed have a global server lock. However, there is some additional code that will release the lock in case memory blocks have to be loaded from disk. It uses lock-yielding for that. Although this does not solve all concurrency issues, it addresses quite a few of the generally associated problems. This post describes it well: http://blog.pythonisito.com/2011/12/mongodbs-write-lock.html
As of MongoDB 2.2, there will be a database-level lock, and more work on yielding has been done as well.
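If you want to see how much contention that lock is actually causing on your server, a couple of shell commands help (the exact output fields vary a bit between versions):

    // Global lock statistics: time spent holding/waiting for the lock, current queue lengths, etc.
    db.serverStatus().globalLock;

    // List the in-progress operations that are currently waiting for a lock.
    db.currentOp().inprog.filter(function(op) { return op.waitingForLock; });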
We maintain a fairly large Sphinx store, about 3.3 million records. We also maintain a fairly well distributed memcached setup over 4 servers.
We were just wondering if it is advisable to store sphinx results for various queries in memcached, which would be fairly easy to implement.
I understand this can be a somewhat broad question, but are there any general ideas?
Also worth mentioning: the memcached connection is always made in the script that accesses Sphinx, so total connection times (Sphinx + memcached vs. just memcached) could be improved. Then again, all queries that do not result in a memcached hit would end up having to send a write to memcached.
So, would it be a good idea to store sphinx results in memcached for future use?
Thanks!
It depends on your situation and your needs. In our projects we had the same architecture, where memcached caches Sphinx search results. But in general, in our projects, there was only a minor chance that the needed search query results were already in the cache, roughly 10% of all queries, because of the large variety of queries and because data is not guaranteed to stay in memcached for long. Furthermore, Sphinx usually searches very fast. So we decided not to use a cache for search.
So you need to run tests; they will tell you.
Since MongoDB does not support transactions, is there any way to guarantee transaction?
What do you mean by "guarantee transaction"?
There are two concepts in MongoDB that are similar:
Atomic operations
Using safe mode / getlasterror ...
http://www.mongodb.org/display/DOCS/Last+Error+Commands
If you simply need to know whether there was an error when you run an update, for example, you can use the getlasterror command. From the docs ...
getlasterror is primarily useful for write operations (although it is set after a command or query too). Write operations by default do not have a return code: this saves the client from waiting for client/server turnarounds during write operations. One can always call getLastError if one wants a return code.
If you're writing data to MongoDB on multiple connections, then it can sometimes be important to call getlasterror on one connection to be certain that the data has been committed to the database. For instance, if you're writing to connection #1 and want those writes to be reflected in reads from connection #2, you can assure this by calling getlasterror after writing to connection #1.
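In the shell, that boils down to something like the following bare-bones sketch (the collection name is made up; on recent drivers this is expressed as a write concern instead):

    // Fire off a write, then explicitly ask the server whether it succeeded.
    db.scores.update({ player: "alice" }, { $set: { score: 42 } });

    // getLastError reports the outcome of the last operation on this connection;
    // w: 1 waits for acknowledgement from the primary.
    db.runCommand({ getLastError: 1, w: 1 });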
Alternatively, you can use atomic operations for cases where you need to increment a value, for example (like an upvote, etc.); more about that here:
http://www.mongodb.org/display/DOCS/Atomic+Operations
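A minimal example of that kind of atomic update (the posts collection and its fields are made up for illustration):

    // $inc increments the counter atomically on the server, so concurrent upvotes
    // from different connections cannot clobber each other.
    db.posts.update({ slug: "hello-world" }, { $inc: { votes: 1 } });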
As a side note, MySQL's default storage engine doesn't have transactions either! :)
http://dev.mysql.com/doc/refman/5.1/en/myisam-storage-engine.html
MongoDB only supports atomic operations. There is no way to implement transactions in the ACID sense on top of MongoDB; such transaction support must be implemented in the core. But you will never see full transaction support due to the CAP theorem: you cannot have speed, durability and consistency at the same time.
I think it's one of the things you choose to forgo when you choose a NoSQL solution.
If transactions are required, perhaps NoSQL is not for you. Time to go back to ACID relational databases.
Unfortunately MongoDB doesn't support transactions out of the box, but you can actually implement ACID optimistic transactions on top of it. I wrote an example and some explanation on a GitHub page.
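The usual building block for this is a compare-and-swap on a version field. A rough sketch of the idea, with made-up names (this is not the code from the GitHub page):

    // Read the document and remember its version.
    var account = db.accounts.findOne({ _id: "alice" });

    // Only apply the change if nobody has bumped the version in the meantime.
    db.accounts.update(
        { _id: "alice", version: account.version },    // optimistic check
        { $inc: { balance: -10, version: 1 } }          // apply the change and bump the version
    );

    // If no document matched, somebody else got there first: retry or abort.
    var err = db.runCommand({ getLastError: 1 });
    if (err.n === 0) { print("conflict - retry the transaction"); }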
I just wanted to know if there is a fundamental difference between HBase, Cassandra, CouchDB and MongoDB. In other words, are they all competing in exactly the same market and trying to solve exactly the same problems, or do they fit best in different scenarios?
All this comes down to the question: what should I choose when? Is it a matter of taste?
Thanks,
Federico
Those are some long answers from @Bohzo (but they are good links).
The truth is, they're "kind of" competing. But they definitely have different strengths and weaknesses and they definitely don't all solve the same problems.
For example Couch and Mongo both provide Map-Reduce engines as part of the main package. HBase is (basically) a layer over top of Hadoop, so you also get M-R via Hadoop. Cassandra is highly focused on being a Key-Value store and has plug-ins to "layer" Hadoop over top (so you can map-reduce).
Some of the DBs provide MVCC (Multi-version concurrency control). Mongo does not.
All of these DBs are intended to scale horizontally, but they do it in different ways. All of these DBs are also trying to provide flexibility in different ways. Flexible document sizes or REST APIs or high redundancy or ease of use, they're all making different trade-offs.
So to your question: In other words, are they all competing in the exact same market and trying to solve the exact same problems?
Yes: they're all trying to solve the issue of database-scalability and performance.
No: they're definitely making different sets of trade-offs.
What should you start with?
Man, that's a tough question. I work for a large company pushing tons of data, and we've been through a few of these over the years. We tried Cassandra at one point a couple of years ago and it couldn't handle the load. We're using Hadoop everywhere, but it definitely has a steep learning curve and it hasn't worked out in some of our environments. More recently we've tried to do Cassandra + Hadoop, but it turned out to be a lot of configuration work.
Personally, my department is moving several things to MongoDB. Our reasons for this are honestly just simplicity.
Setting up Mongo on a linux box takes minutes and doesn't require root access or a change to the file system or anything fancy. There are no crazy config files or java recompiles required. So from that perspective, Mongo has been the easiest "gateway drug" for getting people on to KV/Document stores.
CouchDB and MongoDB are document stores
Cassandra and HBase are key-value based
Here is a detailed comparison between HBase and Cassandra
Here is a (biased) comparison between MongoDB and CouchDB
Short answer: test before you use in production.
I can offer my experience with both HBase (extensive) and MongoDB (just starting).
Even though they are not the same kind of stores, they solve the same problems:
scalable storage of data
random access to the data
low latency access
We were very enthusiastic about HBase at first. It is built on Hadoop (which is rock-solid), it is under Apache, it is active... what more could you want? Our experience:
HBase is fragile
administrator's nightmare (full of configuration settings whose defaults are less than perfect, non-transparent configuration, changes from version to version, ...)
loses data (unless you have set the X configuration and changed Y to ... you get the point :) ) - we found that out when HBase crashed and we lost 2 hours (!!!) of data because the WAL was not set up properly
lacks secondary indexes
lacks any way to perform a backup of database without shutting it down
All in all, HBase was a nightmare. Wouldn't recommend it to anyone except to our direct competitors. :)
MongoDB solves all these problems and many more. It is a delight to set up, administering it is a simple and transparent job, and the default configuration settings actually make sense. You can perform (hot) backups, and you can have secondary indexes. From what I read, I wouldn't recommend MapReduce on MongoDB (JavaScript, only 1 thread per node), but you can use Hadoop for that.
And it is also VERY active when compared to HBase.
Also:
http://www.google.com/trends?q=HBase%2CMongoDB
Need I say more? :)
UPDATE: many months later I must say MongoDB delivered on all accounts and more. The only real downside is that hosting companies do not offer it the way they offer MySQL. ;)
It also looks like MapReduce is bound to become multi-threaded in 2.2. Still, I wouldn't use MR this way. YMMV.
Cassandra is good for writing data. It has the advantage that "writes never fail", and it has no single point of failure.
HBase is very good for data processing. HBase is based on the Hadoop Distributed File System (HDFS), so HBase doesn't need to worry about data replication or data consistency. HBase does have a single point of failure. I am not really sure what that means in practice; if it has a single point of failure, then it is somewhat similar to an RDBMS, where we also have a single point of failure. I might be wrong about this, since I am quite new to it.
What about Riak? Does anyone have experience using Riak? I read somewhere that you need to pay for it, but I am not sure; I would appreciate an explanation.
One more thing: which one would you prefer when you are only concerned with reading a lot of data and writes don't matter much? Imagine you have a database holding a petabyte and you want to make fast searches; which NoSQL database would you prefer?