MongoDB aggregation comparison: group(), $group and MapReduce - mongodb

I am somewhat confused about when to use group(), aggregate with $group, or MapReduce. I read the documentation at http://www.mongodb.org/display/DOCS/Aggregation for group() and http://docs.mongodb.org/manual/reference/aggregation/group/#_S_group for $group. Is sharding the only situation where group() won't work? Also, I get the feeling that $group is more powerful than group() because it can be used in conjunction with other pipeline operators from the aggregation framework. How does $group compare with MapReduce? I read somewhere that it doesn't generate any temporary collection whereas MapReduce does. Is that so?
Can someone present an illustration or guide me to a link where these three concepts are explained together, taking the same sample data, so I can compare them easily?
EDIT: Also, it would be great if you could point out anything new in these commands since the 2.2 release came out.

It is somewhat confusing since the names are similar, but the group() command is a different feature and implementation from the $group pipeline operator in the Aggregation Framework.
The group() command, Aggregation Framework, and MapReduce are collectively aggregation features of MongoDB. There is some overlap in features, but I'll attempt to explain the differences and limitations of each as at MongoDB 2.2.0.
Note: inline result sets mentioned below refer to queries that are processed in memory with results returned at the end of the function call. Alternative output options (currently only available with MapReduce) could include saving results to a new or existing collection.
group() Command
Simple syntax and functionality for grouping, analogous to GROUP BY in SQL.
Returns result set inline (as an array of grouped items).
Implemented using the JavaScript engine; custom reduce() functions can be written in JavaScript.
Current Limitations
Will not group into a result set with more than 20,000 keys.
Results must fit within the limitations of a BSON document (currently 16MB).
Takes a read lock and does not allow any other threads to execute JavaScript while it is running.
Does not work with sharded collections.
See also: group() command examples.
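To make the key / initial / reduce convention described above concrete, here is a minimal in-memory sketch in plain JavaScript. It is not the server implementation and does not talk to MongoDB; the sample data and the groupInMemory helper are hypothetical and only mimic the shape of a group() call.

```javascript
// Hypothetical sample data; in MongoDB these would be documents in a collection.
const orders = [
  { customer: "ann", amount: 10 },
  { customer: "bob", amount: 5 },
  { customer: "ann", amount: 7 },
];

// In-memory stand-in for group()'s key / initial / reduce options (assumption:
// only these three options are modelled; cond, keyf and finalize are left out).
function groupInMemory(docs, { key, initial, reduce }) {
  const groups = new Map();
  for (const doc of docs) {
    // Build the group key from the fields named in `key`.
    const keyDoc = {};
    for (const field of Object.keys(key)) keyDoc[field] = doc[field];
    const id = JSON.stringify(keyDoc);
    // Each group starts from its own copy of `initial`.
    if (!groups.has(id)) groups.set(id, { ...keyDoc, ...initial });
    // reduce(current, result) updates the running result in place.
    reduce(doc, groups.get(id));
  }
  return [...groups.values()];
}

const totals = groupInMemory(orders, {
  key: { customer: 1 },
  initial: { total: 0 },
  reduce: (curr, result) => { result.total += curr.amount; },
});

console.log(JSON.stringify(totals));
// [{"customer":"ann","total":17},{"customer":"bob","total":5}]
```

The result is a single array of grouped items, which is why the inline size and key-count limits listed above apply.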
MapReduce
Implements the MapReduce model for processing large data sets.
Can choose from one of several output options (inline, new collection, merge, replace, reduce)
MapReduce functions are written in JavaScript.
Supports non-sharded and sharded input collections.
Can be used for incremental aggregation over large collections.
MongoDB 2.2 implements much better support for sharded map reduce output.
Current Limitations
A single emit can hold at most half of MongoDB's maximum BSON document size (that is, 8MB of the 16MB limit).
There is a JavaScript lock, so a mongod server can only execute one JavaScript function at a time; however, most steps of the MapReduce are very short, so locks can be yielded frequently.
MapReduce functions can be difficult to debug. You can use print() and printjson() to include diagnostic output in the mongod log.
MapReduce is generally not intuitive for programmers trying to translate relational query aggregation experience.
See also: Map/Reduce examples.
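For comparison, the same per-customer total can be expressed in the map / emit / reduce style. The sketch below runs entirely in memory; mapReduceInMemory is a hypothetical helper, and emit is passed to map as an argument here, whereas on the server it is available as a global inside the map function.

```javascript
// Hypothetical sample data, same shape as a small orders collection.
const orders = [
  { customer: "ann", amount: 10 },
  { customer: "bob", amount: 5 },
  { customer: "ann", amount: 7 },
];

// In-memory stand-in for the map -> group by key -> reduce flow.
function mapReduceInMemory(docs, map, reduce) {
  const emitted = new Map();
  const emit = (key, value) => {
    if (!emitted.has(key)) emitted.set(key, []);
    emitted.get(key).push(value);
  };
  // As on the server, `this` inside map is the current document.
  for (const doc of docs) map.call(doc, emit);
  // reduce is skipped for keys that have only a single emitted value.
  return [...emitted].map(([key, values]) => ({
    _id: key,
    value: values.length > 1 ? reduce(key, values) : values[0],
  }));
}

const perCustomer = mapReduceInMemory(
  orders,
  function (emit) { emit(this.customer, this.amount); },
  (key, values) => values.reduce((a, b) => a + b, 0),
);

console.log(JSON.stringify(perCustomer));
// [{"_id":"ann","value":17},{"_id":"bob","value":5}]
```

Because reduce may be skipped or called repeatedly for the same key, its return value should have the same shape as the values that map emits.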
Aggregation Framework
New feature in the MongoDB 2.2.0 production release (August, 2012).
Designed with specific goals of improving performance and usability.
Returns result set inline.
Supports non-sharded and sharded input collections.
Uses a "pipeline" approach where objects are transformed as they pass through a series of pipeline operators such as matching, projecting, sorting, and grouping.
Pipeline operators need not produce one output document for every input document: operators may also generate new documents or filter out documents.
Using projections you can add computed fields, create new virtual sub-objects, and extract sub-fields into the top-level of results.
Pipeline operators can be repeated as needed (for example, multiple $project or $group steps).
Current Limitations
Results are returned inline, so are limited to the maximum document size supported by the server (16MB)
Doesn't support as many output options as MapReduce
Limited to operators and expressions supported by the Aggregation Framework (i.e. can't write custom functions)
Newest server feature for aggregation, so has more room to mature in terms of documentation, feature set, and usage.
See also: Aggregation Framework examples.
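As a rough illustration of the pipeline approach, the per-customer total could be written as an array of stages. The collection and field names (orders, customer, amount) are assumptions for the example; the block only builds and prints the pipeline, it does not connect to a server.

```javascript
// Each element is one pipeline stage; documents flow through them in order.
const pipeline = [
  // Keep only documents with a positive amount.
  { $match: { amount: { $gt: 0 } } },
  // Group by customer and sum the amounts.
  { $group: { _id: "$customer", total: { $sum: "$amount" } } },
  // Largest totals first.
  { $sort: { total: -1 } },
];

// In the mongo shell this array would be passed to the collection, e.g.:
//   db.orders.aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```

Since the pipeline is plain data rather than JavaScript code, the server can execute it without the JavaScript engine.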
Can someone present an illustration or guide me to a link where these three concepts are explained together, taking the same sample data, so I can compare them easily?
You generally won't find examples where it would be useful to compare all three approaches, but here are previous StackOverflow questions which show variations:
group() versus Aggregation Framework
MapReduce versus Aggregation Framework

Related

Which one is good mongodb aggregate or mongodb functions

What I mean to ask: between MongoDB aggregation and MongoDB functions, which one is preferable and which one gives better performance?
MongoDB functions ($function)
Defines a custom aggregation function or expression in JavaScript.
Aggregation, by contrast, is a set of prebuilt, supported operations.
Executing JavaScript inside an aggregation expression may decrease performance. Only use the $function operator if the provided pipeline operators cannot fulfill your application's needs.
If you have logic that needs customisation and is not supported out of the box, go for functions.
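A small sketch of the difference, assuming a hypothetical string field called name: the first stage uses a built-in expression, the second wraps equivalent logic in $function. Both are only constructed as plain objects here, not sent to a server.

```javascript
// Built-in expression: evaluated natively by the server.
const builtInStage = { $addFields: { nameUpper: { $toUpper: "$name" } } };

// Custom logic: the same transformation written as a JavaScript function.
const toUpper = function (name) { return name.toUpperCase(); };
const customStage = {
  $addFields: {
    nameUpper: { $function: { body: toUpper, args: ["$name"], lang: "js" } },
  },
};

console.log(Object.keys(builtInStage.$addFields.nameUpper)); // [ '$toUpper' ]
console.log(Object.keys(customStage.$addFields.nameUpper));  // [ '$function' ]
```

When a built-in operator such as $toUpper exists, it is the better choice; $function is for the cases where none fits.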
Reference

When to use map reduce over Aggregation Pipeline in MongoDB?

While looking at documentation for map-reduce, I found that:
NOTE:
For most aggregation operations, the Aggregation Pipeline provides
better performance and more coherent interface. However, map-reduce
operations provide some flexibility that is not presently available in
the aggregation pipeline.
I did not understand much from it.
What are the use cases for using map-reduce over aggregation pipeline?
What flexibility does map-reduce provide?
How much delta is there in performance?
For one thing, Map/Reduce in MongoDB wasn't made for ad-hoc queries, there's considerable overhead to M/R. Even a very simple M/R operation on a small dataset can take in the hundreds of milliseconds because of that overhead.
I can't say much about the performance of M/R compared to the aggregation framework on large datasets in practice, but in theory, M/R operations on a large sharded database should be faster since the shards can run the operations largely in parallel.
As to the flexibility, since M/R actually runs JavaScript methods, you have the full power of the language at your disposal. For example, let's say you wanted to group some data by the cosine of a field's value. Since there's neither a $cos operator in the aggregation framework, nor a meaningful way to build discrete buckets from continuous numbers (something like $truncate), the aggregation framework wouldn't help in that case.
So, in a nutshell, I'd say the use cases are:
Keeping the results of M/R in a separate collection and updating it from time to time (using the out parameter and merging the results).
Complex queries on large sharded data sets.
Queries that are so complex that you can't use the aggregation framework. I'd say that's a pretty certain sign of a design flaw in the data structure, but in principle, it can help.
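The cosine example could look roughly like the following. This is an in-memory sketch with made-up data and a hypothetical angle field (in radians); the map function buckets each document by its cosine truncated to one decimal place, and the reduce function counts documents per bucket.

```javascript
// Hypothetical documents with a numeric field `angle` (radians).
const readings = [
  { angle: 0 }, { angle: 0.1 }, { angle: 0.2 }, { angle: Math.PI }, { angle: 3.0 },
];

// Discrete bucket from a continuous value: cosine truncated to one decimal.
function cosineBucket(x) {
  return Math.trunc(Math.cos(x) * 10) / 10;
}

// map/reduce written in the usual style; emit is passed in for this sketch.
const map = function (emit) { emit(cosineBucket(this.angle), 1); };
const reduce = (key, values) => values.reduce((a, b) => a + b, 0);

// Collect emitted values per key, then reduce each key.
const emitted = new Map();
for (const doc of readings) {
  map.call(doc, (key, value) => {
    emitted.set(key, [...(emitted.get(key) || []), value]);
  });
}
const counts = new Map([...emitted].map(([k, vs]) => [k, reduce(k, vs)]));

console.log([...counts]);
```

Here the angles 0.1 and 0.2 both land in the 0.9 bucket, which is the kind of custom bucketing the answer has in mind.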

Pipeline vs MapReduce in MongoDB

When should I prefer MapReduce over the aggregation pipeline in MongoDB, or vice versa? I feel most aggregation operations are suitable for the pipeline. What kind of problem complexity or what use case should make me go for MapReduce?
As a general rule of thumb: When you can do it with the aggregation pipeline, you should.
One reason is that the aggregation pipeline is able to use indexes and internal optimizations between the aggregation steps which are just not possible with MapReduce.
Aggregation is also a lot more secure when the operation is triggered by user input. When there are any user-supplied parameters to your query, MapReduce forces you to create JavaScript functions through string concatenation. This opens the door for dangerous JavaScript code injection vulnerabilities. The APIs used for creating aggregation pipeline objects (in most programming languages!) usually have fewer such obvious pitfalls.
There are, however, still a few cases which cannot be done easily, or at all, with aggregation. For these cases, MapReduce still has a reason to exist.
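To illustrate the injection concern, here is a sketch with a hypothetical user-supplied value. In the string-concatenated map function the value becomes part of the code, while in the pipeline it remains an ordinary string inside a document. The customer field is an assumption for the example.

```javascript
// Hypothetical user-supplied value, e.g. from a web form.
const userInput = 'x" || true || "';

// String concatenation: the input ends up inside JavaScript source code.
const mapSource =
  'function () { if (this.customer == "' + userInput +
  '") { emit(this.customer, 1); } }';

// Pipeline document: the input is only ever treated as a value to compare.
const pipeline = [
  { $match: { customer: userInput } },
  { $group: { _id: "$customer", count: { $sum: 1 } } },
];

console.log(mapSource);
// function () { if (this.customer == "x" || true || "") { emit(this.customer, 1); } }
console.log(pipeline[0].$match.customer === userInput); // true
```

The generated condition is now always true, so the map function would emit for every document instead of only the requested customer.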
Another limitation of the aggregation framework is that the intermediate dataset after each aggregation step is limited to 100MB unless you use the allowDiskUse option, which really slows down the query. MapReduce usually behaves a lot better when you need to work with a really large dataset.

(Real time) Small data aggregation MongoDB: triggers?

What is a reliable and efficient way to aggregate small data in MongoDB?
Currently, my data that needs to be aggregated is under 1 GB, but can go as high as 10 GB. I'm looking for a real time strategy or near real time (aggregation every 15 minutes).
It seems like Map/Reduce, Hadoop, and Storm are all overkill. I know that triggers don't exist, but I found this one post that may be ideal for my situation. Is creating a trigger in MongoDB an ideal solution for real time small data aggregation?
MongoDB has two built-in options for aggregating data - the aggregation framework and map-reduce.
The aggregation framework is faster (executing as native C++ code as opposed to a JavaScript map-reduce job) but more limited in the sorts of aggregations that are supported. Map-reduce is very versatile and can support very complex aggregations but is slower than the aggregation framework and can be more difficult to code.
Either of these would be a good option for near real time aggregation.
One further consideration to take into account is that as of the 2.4 release the aggregation framework returns a single document containing its results and is therefore limited to returning 16MB of data. In contrast, MongoDB map-reduce jobs have no such limitation and may output directly to a collection. In the upcoming 2.6 release of MongoDB, the aggregation framework will also gain the ability to output directly to a collection, using the new $out operator.
Based on the description of your use case, I would recommend using map-reduce as I assume you need to output more than 16MB of data. Also, note that after the first map-reduce run you may run incremental map-reduce jobs that run only on the data that is new/changed and merge the results into the existing output collection.
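The incremental idea can be sketched in memory as follows: results from a run over only the new documents are merged into the existing output by applying the same reduce function to keys that exist on both sides. The data and the mergeWithReduce helper are hypothetical; on the server the merge is performed by the map-reduce output options rather than by application code.

```javascript
// Existing output from the first run, keyed by _id (hypothetical data).
const existingOutput = new Map([["ann", 17], ["bob", 5]]);

// Results of a second run that only looked at new or changed documents.
const newResults = new Map([["ann", 3], ["cat", 4]]);

// The same reduce function used by the original job.
const reduce = (key, values) => values.reduce((a, b) => a + b, 0);

// Keys present on both sides are re-reduced; new keys are simply added.
function mergeWithReduce(existing, incoming, reduceFn) {
  const merged = new Map(existing);
  for (const [key, value] of incoming) {
    merged.set(
      key,
      merged.has(key) ? reduceFn(key, [merged.get(key), value]) : value,
    );
  }
  return merged;
}

const merged = mergeWithReduce(existingOutput, newResults, reduce);
console.log(JSON.stringify([...merged]));
// [["ann",20],["bob",5],["cat",4]]
```

This only works when reducing already-reduced values gives the same answer as reducing the raw values, which holds for sums and counts.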
As you know, MongoDB doesn't support triggers, but you may easily implement triggers in the application by tailing the MongoDB oplog. This blog post and this SO post cover the topic well.

Examples which can be done by map reduce only and not aggregation framework in mongodb?

I wanted to know about some examples or scenarios related to MongoDB which can be done by map-reduce but not by the aggregation framework.
Map-reduce is considered to be a very powerful tool/mechanism for aggregating data. Could some of you please share a few scenarios where map-reduce cannot do it?
Thanks & Best Regards.
Currently, the aggregation framework in MongoDB is limited to 16MB of returned results.
MapReduce can write its output to a collection and has no size limitations.
MapReduce can group entire documents, whereas the aggregation framework works at the field level. MapReduce can map keys to values and values to keys, which can't be done any other way. MapReduce can also call various JavaScript built-in functions, whereas aggregation is limited to the functions and expressions that are built into its framework.
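One possible reading of "values to keys" is an inverted index: a value inside each document (here a tag) becomes the output key, and the document ids become the values. The sketch below is in-memory only, with hypothetical data; emitted values are wrapped in an object so that map and reduce produce the same shape.

```javascript
// Hypothetical documents; each has an array of tags.
const posts = [
  { _id: 1, tags: ["mongodb", "aggregation"] },
  { _id: 2, tags: ["mongodb"] },
];

// map: every tag (a value in the document) becomes a key.
const map = function (emit) {
  for (const tag of this.tags) emit(tag, { ids: [this._id] });
};

// reduce: concatenate the id lists, returning the same shape as map emits.
const reduce = (key, values) => ({ ids: values.flatMap((v) => v.ids) });

// Collect emitted values per key, then reduce each key.
const emitted = new Map();
for (const doc of posts) {
  map.call(doc, (key, value) => {
    emitted.set(key, [...(emitted.get(key) || []), value]);
  });
}
const index = Object.fromEntries(
  [...emitted].map(([tag, values]) => [tag, reduce(tag, values).ids]),
);

console.log(JSON.stringify(index));
// {"mongodb":[1,2],"aggregation":[1]}
```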