Why is the db object not accessible from the map function in MongoDB's MapReduce?

In MongoDB's MapReduce, I believe the db object used to be accessible inside the map function (e.g. db.anotherCollection.find()). But this feature was removed (around version 1.6), which makes joins difficult. What was the reason? Why was it removed?

As at MongoDB 2.4, there are several reasons to disallow access to the db object from within Map/Reduce functions, including:
Deadlocks: There are potential deadlock scenarios between database and/or JavaScript locks called from within the same server-side function.
Performance: The Map/Reduce pattern calls reduce() multiple times; each iteration is a different JavaScript context and would have to open new connections to the database and allocate additional memory for query results. Long-running JavaScript operations will block other operations.
Security: Cross-database queries require appropriate authentication checks.
The above issues could be further complicated for Map/Reduce jobs reading from or writing to sharded clusters. The MongoDB Map/Reduce implementation is currently only designed to work with data from a single input collection, and any historical abuse of the db object within Map/Reduce functions should be considered a bug rather than a feature.
If you want to merge data with Map/Reduce, you can use an Incremental Map/Reduce. Depending on what outcome you are trying to achieve, there are other approaches that may be more straightforward such as adjusting your schema or doing joins in your application code via multiple queries.
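For example, here is a minimal sketch of a join done in application code via two queries; the orders/customers collections and the customerId field are hypothetical names used only for illustration:

// Hypothetical collections "orders" and "customers"; names are illustrative only.
var orders = db.orders.find({ status: "shipped" }).toArray();

// Collect the foreign keys and fetch the matching customers with a second query.
var customerIds = orders.map(function (o) { return o.customerId; });
var customers = {};
db.customers.find({ _id: { $in: customerIds } }).forEach(function (c) {
  customers[c._id] = c;
});

// Merge the two result sets in application code.
orders.forEach(function (o) {
  o.customer = customers[o.customerId];
});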

Related

Best way to query entire MongoDB collection for ETL

We want to query an entire live production MongoDB collection (v2.6, around 500GB of data on around 70M documents).
We're wondering what's the best approach for this:
1. A single query with no filtering, opening a cursor and fetching documents in batches of 5-6k
2. Iterating with pagination, using logic like find().limit(5000).skip(currentIteration * 5000)
We're unsure which approach is best practice and will yield the best results with minimum impact on performance.
I would go with 1. and 2. mixed, if possible: iterate over your huge dataset in pages, but access those pages by querying instead of skipping over them, as skipping may be costly, as also pointed out by the docs.
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.
So, if possible, build your pages on an indexed field and process those batches of data with a corresponding query range.
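For instance, a rough sketch of range-based paging on the indexed _id field instead of skip(); the batch size of 5000 and the collection name are assumptions:

// Walk the collection in _id order, using a range query instead of skip().
var lastId = null;
var batchSize = 5000;
while (true) {
  var query = (lastId === null) ? {} : { _id: { $gt: lastId } };
  var batch = db.yourcoll.find(query).sort({ _id: 1 }).limit(batchSize).toArray();
  if (batch.length === 0) break;
  batch.forEach(function (doc) {
    // process each document here
  });
  lastId = batch[batch.length - 1]._id;
}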
The brutal way
Generally speaking, most drivers load batches of documents anyway. So your language's equivalent of
var docs = db.yourcoll.find();
docs.forEach(function (doc) {
  // whatever
});
will actually just create a cursor initially and will then, when the current batch is close to exhaustion, load a new batch transparently. So doing this pagination manually while planning to access every document in the collection will have little to no advantage, but will add the overhead of multiple queries.
As for ETL, manually iterating over the documents to modify them and then store them in a new instance does not, under most circumstances, seem reasonable to me, as you are basically reinventing the wheel.
Alternate approach
Generally speaking, there is no one-size-fits-all "best" way. The best way is the one that best fits your functional and non-functional requirements.
When doing ETL from MongoDB to MongoDB, I usually proceed as follows:
ET…
Unless you have very complicated transformations, MongoDB's aggregation framework is a surprisingly capable ETL tool. I use it regularly for that purpose and have yet to find a problem not solvable with the aggregation framework for in-MongoDB ETL. Given that, in general, each document is processed one by one, the impact on your production environment should be minimal, if noticeable at all. After you have done your transformation, simply use the $out stage to save the results in a new collection.
Even collection-spanning transformations can be achieved using the $lookup stage.
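A minimal sketch of such an in-MongoDB ETL pipeline, assuming MongoDB 3.2+ for $lookup; the orders/customers collections and all field names are placeholders:

// Extract the documents of interest, enrich them with customer data via $lookup,
// and materialize the result into a new collection with $out.
db.orders.aggregate([
  { $match: { status: "shipped" } },
  { $lookup: {
      from: "customers",
      localField: "customerId",
      foreignField: "_id",
      as: "customer"
  } },
  { $unwind: "$customer" },
  { $project: { total: 1, "customer.name": 1 } },
  { $out: "orders_transformed" }
]);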
…L
After you have done the extract and transform on the old instance, you have several possibilities for loading the data into the new MongoDB instance:
Create a temporary replica set consisting of the old instance, the new instance, and an arbiter. Make sure your old instance becomes primary, do the ET part, have the primary step down so that your new instance becomes primary, then remove the old instance and the arbiter from the replica set (see the sketch after this list). The advantage is that you use MongoDB's replication mechanics to get the data from your old instance to your new instance, without needing to worry about partially executed transfers and the like. And you can use it the other way around: transfer the data first, make the new instance the primary, remove the other members from the replica set, perform your transformations, and remove the "old" data afterwards.
Use db.cloneCollection(). The advantage here is that you only transfer the collections you need, at the expense of more manual work.
Use db.cloneDatabase() to copy over the entire DB. Unless you have multiple databases on the original instance, this method has little to no advantage over the replica set method.
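A rough outline of the replica set variant from the first option above, using standard shell helpers; the host names are placeholders, and it assumes the old instance was started with a --replSet option:

// On the old instance:
rs.initiate();                                   // if it is not already a replica set
rs.add("new-instance.example.net:27017");        // add the new instance as a member
rs.addArb("arbiter.example.net:27017");          // add the arbiter

// After the new member has finished its initial sync:
rs.stepDown();                                   // old primary steps down

// On the new primary, remove the old members:
rs.remove("old-instance.example.net:27017");
rs.remove("arbiter.example.net:27017");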
As written, without knowing your exact use cases, transformations and constraints, it is hard to tell which approach makes the most sense for you.
MongoDB 3.4 supports parallel collection scans via the parallelCollectionScan command. I have not tried this myself yet, but it looks interesting to me. Note that this will not work on sharded clusters. If you have a parallel processing setup, it should definitely speed up the scan.
Please see the documentation here: https://docs.mongodb.com/manual/reference/command/parallelCollectionScan/

Can I set the priority of a MongoDB query so that batch/archiving operations are yielded to user requests?

I have been using Mongo's map/reduce for some time, and it blocks other operations since the JS engine in 2.0 takes out a lock (or so I believe). I am just experimenting with the new aggregation framework in 2.2 and had hoped that, since it is only reading, it would not need to lock, but according to db.currentOp() it is locking.
So, in relation to the new aggregation framework (and indeed any MongoDB operation), I would like to know whether it is possible to indicate the priority of a certain operation so that MongoDB can intelligently yield low-priority operations (such as some background updates) to a time-sensitive, high-priority operation?
In this doc you can see it says Map Reduce "Allows substantial concurrent operation but exclusive to other javascript execution."
So this means it already yields to other operations on the database. This is mostly due to map/reduce being a single-threaded JavaScript operation.
But if you want to make sure there is no locking, you can write the output of Map/Reduce to another database and then move the collection to the original database:
use admin
db.runCommand({ renameCollection: "mapreducedb.mycol", to: "appdb.mycol" })
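For completeness, a hedged sketch of the map/reduce call that writes its output to that separate database in the first place; mapFunction, reduceFunction, and the collection names are placeholders:

// Run the map/reduce on the main database, but direct the output to "mapreducedb".
db.mycol.mapReduce(
  mapFunction,
  reduceFunction,
  { out: { replace: "mycol", db: "mapreducedb" } }
);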
Same for the Aggregation Framework
EDIT: this can't be used with the Aggregation Framework (as of 2.2), as it does not have a $out operator to write to another database. But aggregation operations are still safe to execute on the production/main database, as yielding of those operations will still occur.

MongoDB - Materialized View/OLAP Style Aggregation and Performance

I've been reading up on MongoDB. I am particularly interested in the aggregation framework's abilities. I am looking at taking multiple datasets consisting of at least 10+ million rows per month and creating aggregations off of this data. This is time-series data.
Example: using Oracle OLAP, you can load data at the second/minute level and have it roll up to hours, days, weeks, months, quarters, years, and so on: simply define your dimensions and go from there. This works quite well.
So far I have read that MongoDB can handle the above using its map/reduce functionality. Map/reduce can be implemented so that it updates results incrementally. This makes sense, since I would be loading new data, say, weekly or monthly, and I would expect to only have to process the new data being loaded.
I have also read that map/reduce in MongoDB can be slow. To overcome this, the idea is to use cheap commodity hardware and spread the load across multiple machines.
So here are my questions.
How good (or bad) does MongoDB handle map reduce in terms of performance? Do you really need a lot of machines to get acceptable performance?
In terms of workflow, is it relatively easy to store and merge the incremental results generated by map reduce?
How much of a performance improvement does the aggregation framework offer?
Does the aggregation framework offer the ability to store results incrementally, in a similar manner to the existing map/reduce functionality?
I appreciate your responses in advance!
How good (or bad) does MongoDB handle map reduce in terms of performance? Do you really need a lot of machines to get acceptable performance?
MongoDB's Map/Reduce implementation (as of 2.0.x) is limited by its reliance on the single-threaded SpiderMonkey JavaScript engine. There has been some experimentation with the V8 JavaScript engine, and improved concurrency and performance are overall design goals.
The new Aggregation Framework is written in C++ and has a more scalable implementation including a "pipeline" approach. Each pipeline is currently single-threaded, but you can run different pipelines in parallel. The aggregation framework won't currently replace all jobs that can be done in Map/Reduce, but does simplify a lot of common use cases.
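As a small illustration of the pipeline approach, a sketch of an aggregation that replaces a typical counting map/reduce job; the orders collection and its fields are assumptions:

// Count documents per status using basic pipeline stages.
db.orders.aggregate([
  { $match: { created: { $gte: ISODate("2012-01-01") } } },  // narrow the input first
  { $group: { _id: "$status", count: { $sum: 1 } } },        // like emit(status, 1) plus a summing reduce
  { $sort: { count: -1 } }
]);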
A third option is to use MongoDB for storage in combination with Hadoop via the MongoDB Hadoop Connector. Hadoop currently has a more scalable Map/Reduce implementation and can access MongoDB collections for input and output via the Hadoop Connector.
In terms of workflow, is it relatively easy to store and merge the incremental results generated by map reduce?
Map/Reduce has several output options, including merging the incremental output into a previous output collection or returning the results inline (in memory).
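A small sketch of the incremental workflow, where each run processes only documents newer than a stored marker and folds them into the existing results collection; mapFunction, reduceFunction, lastRun, and the collection names are assumptions:

// Process only new documents and reduce them into the existing "session_stats" output.
db.sessions.mapReduce(
  mapFunction,
  reduceFunction,
  {
    query: { ts: { $gt: lastRun } },      // only the data added since the last run
    out: { reduce: "session_stats" }      // merge via reduce into the previous output
  }
);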
How much of a performance improvement does the aggregation framework offer?
This really depends on the complexity of your Map/Reduce. Overall the aggregation framework is faster (and in some cases, significantly so). You're best doing a comparison for your own use case(s).
MongoDB 2.2 isn't officially released yet, but the 2.2rc0 release candidate has been available since mid-July.
Does the aggregation framework offer the ability to store results incrementally, in a similar manner to the existing map/reduce functionality?
The aggregation framework is currently limited to returning results inline so you have to process/display the results when they are returned. The result document is also restricted to the maximum document size in MongoDB (currently 16MB).
There is a proposed $out pipeline command (SERVER-3253) which will likely be added in the future to provide more output options.
Some further reading that may be of interest:
a presentation at MongoDC 2011 on Time Series Data Storage in MongoDB
a presentation at MongoSF 2012 on MongoDB's New Aggregation Framework
capped collections, which could be used similarly to an RRD
Couchbase map reduce is designed for building incremental indexes, which can then be dynamically queried for the level of rollup you are looking for (much like the Oracle example you gave in your question).
Here is a write up of how this is done using Couchbase: http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-sample-patterns-timestamp.html

MongoDB - How does locking work for Map Reduce?

Does MongoDB map reduce lock a collection when performing an operation on it?
I have some collections that are widely and intensively used by an application. A Map/Reduce runs in the background every 10 minutes via a cron job, on that widely and intensively used collection.
I want to know if there is a high probability that Map/Reduce won't perform well because other operations are in progress (inserts, updates, and mostly reads) on that collection. In particular, I want to know whether Map/Reduce interferes with the normal operations performed on the collection by users.
MapReduce, if outputting to a collection, will take multiple write locks as it writes (as would any operation that creates or updates a collection). If you are doing an in-line MR, then you avoid that locking (but have limitations on result size). Even so, there are still read locks and the JavaScript lock (server-side JS on MongoDB is single-threaded right now).
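For example, a hedged sketch of the in-line variant that avoids the output-collection write locks (subject to the result size limit); mapFunction and reduceFunction are placeholders:

// Inline output: results are returned in memory rather than written to a collection.
var result = db.mycol.mapReduce(
  mapFunction,
  reduceFunction,
  { out: { inline: 1 } }
);
printjson(result.results);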
This is all explained (and will be updated if it changes) here:
http://www.mongodb.org/display/DOCS/How+does+concurrency+work#Howdoesconcurrencywork-MapReduce
Note: the SpiderMonkey to V8 JS engine migration issues are ones to watch if multi-threading is something you are concerned about.

When do I need map reduce for database queries?

In CouchDB you always have to use map reduce to query results.
In MongoDB you can use its query methods to retrieve data, but it also lets you do map-reduce.
I wonder, when do I actually need map-reduce?
Are those query methods different from map-reduce or are they just wrappers for map-reduce functions?
MapReduce is needed for aggregations in MongoDB. The normal queries follow a very different (and much faster) code path and they should always be used for real-time operations. MapReduce is definitely not intended for real-time, it's more for batch jobs.
Technically, you could write all your queries using MapReduce, but that would be both painful and slow.
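To make the distinction concrete, a hedged sketch contrasting a normal query with a map/reduce aggregation over the same (hypothetical) posts collection:

// Normal query: fast, can use indexes, suitable for real-time reads.
db.posts.find({ author: "alice" }).sort({ created: -1 }).limit(10);

// Map/reduce: an aggregation (posts per author), better suited to batch jobs.
db.posts.mapReduce(
  function () { emit(this.author, 1); },                 // map
  function (key, values) { return Array.sum(values); },  // reduce
  { out: { inline: 1 } }
);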