MongoDB - How does locking work for Map Reduce? - mongodb

Does MongoDB map reduce lock a collection when performing an operation on it?
I have some collections that are widely and intensively used by an application. A Map/Reduce runs in the background every 10 minutes via a cron job, on that widely and intensively used collection.
I want to know if there is a high probability that Map/Reduce won't perform well because other operations are in progress (inserts, updates, and mostly reads) on that collection. In particular, I want to know if Map/Reduce interferes with normal operations performed on the collection by users.

MapReduce, if outputting to a collection, will take multiple write locks out as it writes (as any operation that creates or updates a collection would). If you are doing an in-line MR, you avoid that locking (but have limitations on result size). Even so, there are still read locks and the JavaScript lock (server-side JS on MongoDB is single-threaded right now).
This is all explained (and will be updated if it changes) here:
http://www.mongodb.org/display/DOCS/How+does+concurrency+work#Howdoesconcurrencywork-MapReduce
Note: the SpiderMonkey to V8 JS engine migration issues are ones to watch if multi-threading is something you are concerned about.
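For illustration, here is a rough mongo-shell sketch contrasting the two output modes; the pageviews collection and its url field are hypothetical.

var mapFn = function () { emit(this.url, 1); };
var reduceFn = function (key, values) { return Array.sum(values); };

// Inline output: results are returned in memory, so no write locks are taken
// on an output collection, but the result set must fit in a single document.
db.pageviews.mapReduce(mapFn, reduceFn, { out: { inline: 1 } });

// Collection output: results are written to "pageview_counts", which takes
// write locks while that output collection is created/updated.
db.pageviews.mapReduce(mapFn, reduceFn, { out: "pageview_counts" });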

Related

How long will a mongo internal cache sustain?

I would like to know how long Mongo's internal cache is sustained. I have a scenario in which I have about one million records and I have to perform a search on them using the mongo-java driver.
The initial search takes a lot of time (nearly one minute), whereas consecutive runs of the same query take far less (a few seconds) due to Mongo's internal caching mechanism.
But I do not know how long this cache is sustained: is it until the system reboots, until the collection undergoes a write operation, or something like that?
Any help in understanding this is appreciated!
PS: Regarding the fields the search is performed on, some are indexed and some are not.
Mongo version used: 2.6.1
It will depend on a lot of factors, but the most prominent are the amount of memory in the server and how active the server is, since MongoDB leaves much of the caching to the OS (by memory-mapping its files).
You need to take a long hard look at your log files for the initial query and try to figure out why it takes nearly a minute.
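One way to dig into that is the database profiler; a minimal sketch, with the 100 ms threshold chosen arbitrarily here:

// Profile operations slower than 100 ms.
db.setProfilingLevel(1, 100)

// Inspect the slowest recorded operations, including how many documents
// and index entries were scanned.
db.system.profile.find().sort({ millis: -1 }).limit(5).pretty()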
In most cases there is an internal cache invalidation mechanism that drops the cached data for your query when a write operation occurs. That is the simplest description of the process, based on my own experience.
But, as mentioned earlier, there are many factors beyond simple invalidation that can come into play.
MongoDB automatically uses all free memory on the machine as its cache. It would be better to use MongoDB 3.0+, because it comes with two storage engines, MMAPv1 and WiredTiger.
The major difference between the two is that a write operation in MMAPv1 locks the whole database, whereas WiredTiger locks at the document level.
If you are using MongoDB 2.6 you can check query performance and execution time with the explain() method; in version 3.0+ you can request "executionStats" from explain() in the shell.
You need an index on the fields you query to get results faster. A single collection cannot have more than 64 indexes, and the more indexes a collection has, the bigger the performance impact on write/update operations.
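A minimal shell sketch of both points, assuming a hypothetical users collection queried by email:

// Index the queried field (ensureIndex also works on older versions).
db.users.createIndex({ email: 1 })

// MongoDB 2.6: basic query plan
db.users.find({ email: "a@example.com" }).explain()

// MongoDB 3.0+: detailed timing plus documents/keys examined
db.users.find({ email: "a@example.com" }).explain("executionStats")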

MongoDB realtime query

I've heard about RethinkDB, and since I'm developing a multiplayer online game, I think that if MongoDB pushed changes (say, new rows) instead of making the client pull them, it would be much faster for both the server side and the client side.
Is there any wrapper or technique to make a realtime query to MongoDB, or not?
You can leverage tailable cursors on capped collections. At the lowest level, that would require writing all changes to the capped collection first, then have them be applied by some kind of worker (an event sourcing pattern). That's a severe change of application architecture, so it's probably not what you want.
A more generic approach is to watch the oplog, a special capped collection that is used to synchronize master and secondary nodes and that contains all operations performed on documents, so no change in application architecture is required.
Still, this is somewhat more low-level than what RethinkDB exposes, in particular because you need to perform a diff. There are wrappers that can hide some of the complexity, but I haven't used them and I don't know what programming language you're using. Oplog monitoring is used, for example, by Meteor, which is pretty much built on publish/subscribe and hides most of the complexity, so it's generally possible, though it seems it's more complicated than with RethinkDB.
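A rough shell sketch of watching the oplog with a tailable cursor; it assumes a replica set (the oplog lives in the local database) and a hypothetical mydb.players namespace to filter on.

var oplog = db.getSiblingDB("local").oplog.rs;

var cursor = oplog.find({ ns: "mydb.players" })
                  .addOption(DBQuery.Option.tailable)
                  .addOption(DBQuery.Option.awaitData);

while (!cursor.isExhausted()) {
    if (cursor.hasNext()) {
        var entry = cursor.next();
        // entry.op is "i" (insert), "u" (update) or "d" (delete);
        // entry.o holds the inserted document or the update payload.
        printjson(entry);
    }
}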

Why is the db object not accessible from the map function in MongoDB's MapReduce?

In MongoDB's MapReduce, there used to be a db object (as in db.anotherCollection.find()) accessible inside the map function. But this feature has been removed (from version 1.6 or so), which makes joins difficult. What was the reason? Why was it removed?
As of MongoDB 2.4 there are several reasons to disallow access to the db object from within Map/Reduce functions, including:
Deadlocks: There are potential deadlock scenarios between database and/or JavaScript locks called from within the same server-side function.
Performance: The Map/Reduce pattern calls reduce() multiple times; each iteration is a different JavaScript context and would have to open new connections to the database and allocate additional memory for query results. Long-running JavaScript operations will block other operations.
Security: Cross-database queries require appropriate authentication checks.
The above issues could be further complicated for Map/Reduce jobs reading or writing to sharded clusters. The MongoDB Map/Reduce implementation is currently only designed to work with data from a single input collection, and any historical abuses of the db object within Map/Reduce functions should be considered a bug rather than a feature.
If you want to merge data with Map/Reduce, you can use an Incremental Map/Reduce. Depending on what outcome you are trying to achieve, there are other approaches that may be more straightforward such as adjusting your schema or doing joins in your application code via multiple queries.
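For reference, a minimal sketch of an incremental Map/Reduce; the events collection, its fields, and the lastRunTimestamp variable are hypothetical.

var mapFn = function () {
    emit(this.userId, this.amount);
};

var reduceFn = function (key, values) {
    return Array.sum(values);
};

db.events.mapReduce(mapFn, reduceFn, {
    query: { ts: { $gt: lastRunTimestamp } },   // only process documents added since the last run
    out: { reduce: "totals_by_user" }           // re-reduce new results into the existing output collection
});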

Can I set the priority of a MongoDB query so that batch/archiving operations yield to user requests?

I have been using Mongo's map/reduce for some time and this blocks other operations since the JS engine in 2.0 took out a lock (or so I believe). I am just experimenting with the new aggregation framework in 2.2 and had hoped that since it's just reading it would not need to lock, but according to db.currentOps() it is locking.
So in relation to the new aggregation framework (and indeed with any MongoDB operation) I would like to know if it's possible to indicate a priority of a certain operation so that MongoDB can intelligently yield low priority operations (such as some background updates) to a time-sensitive high-priority operation?
In this doc you can see it says Map Reduce "Allows substantial concurrent operation but exclusive to other javascript execution."
So it already yields to other operations on the database; this is mostly down to map/reduce being a single-threaded JavaScript operation.
But if you want to make sure there is no locking, you can write the output of Map Reduce to another database and then move the collection to the original database:
>use admin
>db.runCommand( {renameCollection: "mapreducedb.mycol", to: "appdb.mycol"} )
Same for the Aggregation Framework.
EDIT: this can't be done for the Aggregation Framework (as of 2.2), as it does not have an $out operator to write to another database. But aggregation operations are still safe to execute on the production/main database, since they yield to other operations.
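As an aside, Map/Reduce can also write its output collection directly into another database, which avoids the rename step; a sketch with hypothetical names, assuming mapFn and reduceFn are your existing job functions:

db.mycol.mapReduce(mapFn, reduceFn, {
    out: { replace: "mycol", db: "mapreducedb" }   // write the output into the mapreducedb database
});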

MongoDB - Materialized View/OLAP Style Aggregation and Performance

I've been reading up on MongoDB. I am particularly interested in the aggregation framework's abilities. I am looking at taking multiple datasets consisting of at least 10+ million rows per month and creating aggregations from this data. This is time series data.
Example: using Oracle OLAP, you can load data at the second/minute level and have it roll up to hours, days, weeks, months, quarters, years, etc. Simply define your dimensions and go from there. This works quite well.
So far I have read that MongoDB can handle the above using its map reduce functionality. Map reduce can be implemented so that it updates results incrementally. This makes sense, since I would be loading new data weekly or monthly and would expect to only have to process the new data being loaded.
I have also read that map reduce in MongoDB can be slow. To overcome this, the idea is to use cheap commodity hardware and spread the load across multiple machines.
So here are my questions.
How good (or bad) does MongoDB handle map reduce in terms of performance? Do you really need a lot of machines to get acceptable performance?
In terms of workflow, is it relatively easy to store and merge the incremental results generated by map reduce?
How much of a performance improvement does the aggregation framework offer?
Does the aggregation framework offer the ability to store results incrementally, in a similar manner to the existing map/reduce functionality?
I appreciate your responses in advance!
How good (or bad) does MongoDB handle map reduce in terms of performance? Do you really need a lot of machines to get acceptable performance?
MongoDB's Map/Reduce implementation (as of 2.0.x) is limited by its reliance on the single-threaded SpiderMonkey JavaScript engine. There has been some experimentation with the v8 JavaScript engine, and improved concurrency and performance is an overall design goal.
The new Aggregation Framework is written in C++ and has a more scalable implementation including a "pipeline" approach. Each pipeline is currently single-threaded, but you can run different pipelines in parallel. The aggregation framework won't currently replace all jobs that can be done in Map/Reduce, but does simplify a lot of common use cases.
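A minimal pipeline sketch (array syntax; the samples collection and its ts/value fields are hypothetical) showing the kind of time-series roll-up the framework handles:

db.samples.aggregate([
    { $match: { ts: { $gte: ISODate("2012-07-01") } } },          // restrict the input range
    { $group: {
        _id: { month: { $month: "$ts" }, day: { $dayOfMonth: "$ts" } },
        total: { $sum: "$value" },                                // roll up values per day
        count: { $sum: 1 }
    } },
    { $sort: { "_id.month": 1, "_id.day": 1 } }
])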
A third option is to use MongoDB for storage in combination with Hadoop via the MongoDB Hadoop Connector. Hadoop currently has a more scalable Map/Reduce implementation and can access MongoDB collections for input and output via the Hadoop Connector.
In terms of workflow, is it relatively easy to store and merge the incremental results generated by map reduce?
Map/Reduce has several output options, including merging the incremental output into a previous output collection or returning the results inline (in memory).
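Sketched with hypothetical collection names (and assuming mapFn and reduceFn are your existing job functions), the main out options look like this:

db.src.mapReduce(mapFn, reduceFn, { out: { inline: 1 } });         // return results in memory
db.src.mapReduce(mapFn, reduceFn, { out: "results" });             // replace the "results" collection
db.src.mapReduce(mapFn, reduceFn, { out: { merge: "results" } });  // overwrite matching keys, keep the rest
db.src.mapReduce(mapFn, reduceFn, { out: { reduce: "results" } }); // re-reduce new output with existing documents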
How much of a performance improvement does the aggregation framework offer?
This really depends on the complexity of your Map/Reduce. Overall the aggregation framework is faster (and in some cases, significantly so). You're best doing a comparison for your own use case(s).
MongoDB 2.2 isn't officially released yet, but the 2.2rc0 release candidate has been available since mid-July.
Does the aggregation framework offer the ability to store results incrementally, in a similar manner to the existing map/reduce functionality?
The aggregation framework is currently limited to returning results inline so you have to process/display the results when they are returned. The result document is also restricted to the maximum document size in MongoDB (currently 16MB).
There is a proposed $out pipeline command (SERVER-3253) which will likely be added in future for more output options.
Some further reading that may be of interest:
a presentation at MongoDC 2011 on Time Series Data Storage in MongoDB
a presentation at MongoSF 2012 on MongoDB's New Aggregation Framework
capped collections, which could be used similar to RRD
Couchbase map reduce is designed for building incremental indexes, which can then be dynamically queried for the level of rollup you are looking for (much like the Oracle example you gave in your question).
Here is a write up of how this is done using Couchbase: http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-sample-patterns-timestamp.html