MongoDB - Materialized View/OLAP Style Aggregation and Performance

I've been reading up on MongoDB. I am particularly interested in the aggregation framework's capabilities. I am looking at taking multiple datasets consisting of at least 10+ million rows per month and creating aggregations from this data. This is time series data.
Example: using Oracle OLAP, you can load data at the second/minute level and have it roll up to hours, days, weeks, months, quarters, years, etc. You simply define your dimensions and go from there. This works quite well.
So far I have read that MongoDB can handle the above using its map reduce functionality. Map reduce can be implemented so that it updates results incrementally. This makes sense, since I would be loading new data weekly or monthly and would expect to only have to process the new data being loaded.
I have also read that map reduce in MongoDB can be slow. To overcome this, the idea is to use cheap commodity hardware and spread the load across multiple machines.
So here are my questions.
How well (or badly) does MongoDB handle map reduce in terms of performance? Do you really need a lot of machines to get acceptable performance?
In terms of workflow, is it relatively easy to store and merge the incremental results generated by map reduce?
How much of a performance improvement does the aggregation framework offer?
Does the aggregation framework offer the ability to store results incrementally, in the same way that the existing map/reduce functionality does?
I appreciate your responses in advance!

How well (or badly) does MongoDB handle map reduce in terms of performance? Do you really need a lot of machines to get acceptable performance?
MongoDB's Map/Reduce implementation (as of 2.0.x) is limited by its reliance on the single-threaded SpiderMonkey JavaScript engine. There has been some experimentation with the V8 JavaScript engine, and improved concurrency and performance are overall design goals.
The new Aggregation Framework is written in C++ and has a more scalable implementation including a "pipeline" approach. Each pipeline is currently single-threaded, but you can run different pipelines in parallel. The aggregation framework won't currently replace all jobs that can be done in Map/Reduce, but does simplify a lot of common use cases.
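As a rough illustration of the pipeline approach for time series rollups, here is a minimal sketch that groups raw readings into daily buckets. The operators are standard aggregation stages, but the collection and field names ("events", "ts", "value") are assumptions, not anything from your schema:

    db.events.aggregate([
        // keep only the slice of time to roll up
        { $match: { ts: { $gte: ISODate("2012-07-01"), $lt: ISODate("2012-08-01") } } },
        // extract calendar parts from the timestamp
        { $project: { value: 1,
                      y: { $year: "$ts" }, m: { $month: "$ts" }, d: { $dayOfMonth: "$ts" } } },
        // roll up to one document per day
        { $group: { _id: { year: "$y", month: "$m", day: "$d" },
                    total: { $sum: "$value" },
                    avg:   { $avg: "$value" },
                    count: { $sum: 1 } } }
    ])

The same pattern with $week, $month, etc. in the $project stage gives the other rollup levels.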
A third option is to use MongoDB for storage in combination with Hadoop via the MongoDB Hadoop Connector. Hadoop currently has a more scalable Map/Reduce implementation and can access MongoDB collections for input and output via the Hadoop Connector.
In terms of workflow, is it relatively easy to store and merge the incremental results generated by map reduce?
Map/Reduce has several output options, including merging the incremental output into a previous output collection or returning the results inline (in memory).
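For example, a minimal sketch of an incremental run, where only documents newer than the last run are processed and folded into an existing summary collection (the "events" collection, the "ts"/"value" fields, and the lastRunTimestamp variable are assumptions your application would supply):

    db.events.mapReduce(
        function () {
            // key: calendar day of the reading
            var d = this.ts;
            emit({ y: d.getFullYear(), m: d.getMonth() + 1, d: d.getDate() }, this.value);
        },
        function (key, values) { return Array.sum(values); },
        {
            query: { ts: { $gt: lastRunTimestamp } },  // only process new documents
            out: { reduce: "daily_totals" }            // re-reduce against the existing totals;
                                                       // use { merge: ... } to overwrite keys instead
        }
    )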
How much of a performance improvement does the aggregation framework offer?
This really depends on the complexity of your Map/Reduce. Overall the aggregation framework is faster (and in some cases, significantly so). You're best doing a comparison for your own use case(s).
MongoDB 2.2 isn't officially released yet, but the 2.2rc0 release candidate has been available since mid-July.
Does the aggregation framework offer the ability to store results incrementally, in the same way that the existing map/reduce functionality does?
The aggregation framework is currently limited to returning results inline so you have to process/display the results when they are returned. The result document is also restricted to the maximum document size in MongoDB (currently 16MB).
There is a proposed $out pipeline command (SERVER-3253) which will likely be added in a future release to provide more output options.
Some further reading that may be of interest:
a presentation at MongoDC 2011 on Time Series Data Storage in MongoDB
a presentation at MongoSF 2012 on MongoDB's New Aggregation Framework
capped collections, which could be used similarly to an RRD (round-robin database)

Couchbase map reduce is designed for building incremental indexes, which can then be dynamically queried for the level of rollup you are looking for (much like the Oracle example you gave in your question).
Here is a write up of how this is done using Couchbase: http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-sample-patterns-timestamp.html

Related

When to use map reduce over Aggregation Pipeline in MongoDB?

While looking at documentation for map-reduce, I found that:
NOTE: For most aggregation operations, the Aggregation Pipeline provides better performance and a more coherent interface. However, map-reduce operations provide some flexibility that is not presently available in the aggregation pipeline.
I did not understand much from it.
What are the use cases for using map-reduce over aggregation pipeline?
What flexibility does map-reduce provide?
How much delta is there in performance?
For one thing, Map/Reduce in MongoDB wasn't made for ad-hoc queries; there's considerable overhead to M/R. Even a very simple M/R operation on a small dataset can take hundreds of milliseconds because of that overhead.
I can't say much about the performance of M/R compared to the aggregation framework on large datasets in practice, but in theory, M/R operations on a large sharded database should be faster since the shards can run the operations largely in parallel.
As to the flexibility, since M/R actually runs javascript methods you have the full power of the language at your disposal. For example, let's say you wanted to group some data by the cosine of a field's value. Since there's neither a $cos operator in the aggregation framework, nor a meaningful way to build discrete buckets from continuous numbers (something like $truncate), the aggregation framework wouldn't help in that case.
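To make that concrete, here is a minimal Map/Reduce sketch that buckets documents by the cosine of a field, rounded to one decimal place, and counts each bucket (the "samples" collection and "x" field are made up for the example):

    db.samples.mapReduce(
        function () {
            // arbitrary JavaScript: anything Math offers is fair game
            var bucket = Math.round(Math.cos(this.x) * 10) / 10;
            emit(bucket, 1);
        },
        function (key, values) { return Array.sum(values); },
        { out: { inline: 1 } }
    )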
So, in a nutshell, I'd say the use cases are
Keeping the results of M/R in a separate collection and updating them from time to time (using the out parameter and merging the results)
Complex queries on large sharded data sets
Queries that are so complex that you can't use the aggregation framework. I'd say that's a pretty certain sign of a design flaw in the data structure, but in principle, it can help

How can I improve aggregation processing time with Map Reduce?

I am trying to improve the speed of my aggregation query. My idea is to create a new collection with Map Reduce and to run aggregation on it. Will this decrease processing time? What are the drawbacks?
Yes, it could work; it's a common pattern for improving query speed. But MongoDB is a special case here, because Map Reduce needs JavaScript evaluation while the aggregation framework is implemented natively, so aggregation is faster.
I advise separating the concept from the technology. It's still a good idea to do pre-calculations in a batch job, but you should do it with the aggregation framework instead of Map Reduce.
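A minimal sketch of such a batch job (the collection and field names are assumptions): run the aggregation periodically and write the output into a summary collection yourself, since before the $out stage (added in 2.6) the pipeline can only return results inline:

    var res = db.events.aggregate([
        { $group: { _id: "$customerId", total: { $sum: "$value" }, count: { $sum: 1 } } }
    ]);
    // In 2.2/2.4 the shell returns { result: [...], ok: 1 }; in 2.6+ it returns a cursor,
    // in which case iterate res directly with res.forEach(...)
    res.result.forEach(function (doc) {
        db.customer_totals.save(doc);  // save() inserts or replaces by _id
    });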
Drawbacks of using batch jobs are:
You can only query what ran through your batch job, so there will be a delay between inserting new data and being able to retrieve it.
It's only faster if you're able to reduce the complexity of your real-time query. For example, your new query should read fewer documents than it would without a batch job.
Your application consumes more disk space, because you're creating an additional collection.
It adds complexity. Therefore, start with real-time aggregation and only do pre-calculations if it becomes a performance problem.
For further latency improvements, you might consider implementing a Lambda architecture. It reduces query time to a minimum by pre-calculating results as far as possible.
This approach to architecture attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate precomputed views, while simultaneously using real-time stream processing to provide dynamic views. The two view outputs may be joined before presentation.

(Real time) Small data aggregation MongoDB: triggers?

What is a reliable and efficient way to aggregate small data in MongoDB?
Currently, my data that needs to be aggregated is under 1 GB, but can go as high as 10 GB. I'm looking for a real time strategy or near real time (aggregation every 15 minutes).
It seems like the likes of Map/Reduce, Hadoop, and Storm are all overkill. I know that triggers don't exist, but I found this one post that may be ideal for my situation. Is creating a trigger in MongoDB an ideal solution for real time small data aggregation?
MongoDB has two built-in options for aggregating data - the aggregation framework and map-reduce.
The aggregation framework is faster (executing as native C++ code as opposed to a JavaScript map-reduce job) but more limited in the sorts of aggregations that are supported. Map-reduce is very versatile and can support very complex aggregations but is slower than the aggregation framework and can be more difficult to code.
Either of these would be a good option for near real time aggregation.
One further consideration to take into account is that as of the 2.4 release the aggregation framework returns a single document containing its results and is therefore limited to returning 16MB of data. In contrast, MongoDB map-reduce jobs have no such limitation and may output directly to a collection. In the upcoming 2.6 release of MongoDB, the aggregation framework will also gain the ability to output directly to a collection, using the new $out operator.
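For example, with 2.6 the same kind of pre-aggregation can be written straight to a collection; a minimal sketch with assumed collection and field names:

    db.events.aggregate([
        { $group: { _id: "$sensorId", avgValue: { $avg: "$value" }, count: { $sum: 1 } } },
        { $out: "sensor_summary" }   // replaces the contents of sensor_summary with the pipeline output
    ])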
Based on the description of your use case, I would recommend using map-reduce as I assume you need to output more than 16MB of data. Also, note that after the first map-reduce run you may run incremental map-reduce jobs that run only on the data that is new/changed and merge the results into the existing output collection.
As you know, MongoDB doesn't support triggers, but you may easily implement triggers in the application by tailing the MongoDB oplog. This blog post and this SO post cover the topic well.
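As a very rough sketch of that trigger-like approach (assuming a replica set, so the oplog exists, and a hypothetical mydb.events namespace), you can open a tailable cursor on the oplog and react to each new insert:

    var oplog = db.getSiblingDB("local").oplog.rs;
    var last  = oplog.find().sort({ $natural: -1 }).limit(1).next().ts;
    var cursor = oplog.find({ ts: { $gt: last }, ns: "mydb.events", op: "i" })
                      .addOption(DBQuery.Option.tailable)
                      .addOption(DBQuery.Option.awaitData);
    while (cursor.hasNext()) {
        var entry = cursor.next();
        // entry.o is the inserted document; update your running aggregates here
        printjson(entry.o);
    }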

MongoDB Using Map Reduce against Aggregation

I have seen this asked a couple of years ago. Since then, MongoDB 2.4 has made multi-threaded Map Reduce available (after the switch to the V8 JavaScript engine) and it has become faster than in previous versions, so the argument that it is slow is no longer an issue.
However, I am looking for a scenario where a Map Reduce approach might work better than the Aggregation Framework. In fact, possibly a scenario where the Aggregation Framework cannot work at all, but Map Reduce can get the required results.
Thanks,
John
Take a look at this.
The Aggregation Framework's results are stored in a single document and so are limited to 16 MB: this might not be suitable for some scenarios. With MapReduce there are several output types available, including an entire new collection, so it doesn't have that space limit.
Generally, MapReduce is better when you have to work with large data sets (maybe even the entire collection). Furthermore, it gives much more flexibility (you write your own aggregation logic) instead of being restricted to a set of pipeline commands.
Currently the Aggregation Framework's results can't exceed 16MB. But, I think more importantly, you'll find that the AF is better suited to "here and now" type queries that are dynamic in nature (for example, where filters are provided at run time by the user).
A MapReduce is preplanned and can be far more complex and produce very large outputs (as it just outputs to a new collection). It has no run-time inputs that you can control. You can add complex object manipulation that simply is not possible (or efficient) with the AF. It's simple to manipulate child arrays (or things that are array-like) in MapReduce, for example, since you're just writing JavaScript, whereas in the AF things can become very unwieldy and unmanageable.
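As an illustration of that flexibility, here is a minimal Map/Reduce sketch that walks a child array in plain JavaScript (a hypothetical "orders" collection where each document has an "items" array of { qty, price } sub-documents):

    db.orders.mapReduce(
        function () {
            var customer = this.customerId;
            this.items.forEach(function (item) {
                emit(customer, item.qty * item.price);  // one emit per array element
            });
        },
        function (key, values) { return Array.sum(values); },
        { out: { merge: "order_totals" } }
    )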
The biggest issue is that MapReduce results aren't automatically kept up to date, and it's difficult to predict when the jobs will complete. You'll need to implement your own solution for keeping them up to date (unlike with some other NoSQL options). Usually, that's just a timestamp of some sort and an incremental MapReduce update, as shown here. You'll possibly need to accept that the data may be somewhat stale and that the jobs will take an unknown length of time to complete.
If you hunt around on StackOverflow, you'll find lots of very creative solutions to problems with MongoDB, and many of them use the Aggregation Framework, as it works around limitations of the general query engine in MongoDB and can produce "live/immediate" results. (Some AF pipelines are extremely complex, though, which may be a concern depending on the developers/team/product.)

How to overcome the limitations of the MongoDB aggregation framework

The aggregation framework on MongoDB has certain limitations as per this link.
I want to remove restrictions 2 and 3.
I really do not care what the resulting set's size is. I have a lot of RAM and resources.
And I do not care if it takes more than 10% system resources.
I expect both 2 and 3 to be violated in my application. Mostly 2.
But I really need the aggregation framework. Is there anything that I can do to remove these limitations?
The reason:
The application I have been working on does these things:
The user has the ability to upload a large dataset
We have a menu to let the user sort, aggregate, etc.
The aggregate has no restrictions currently and the user can choose to do whatever he wants. Since the data is not known to the developer and since it is possible to group by any number of columns, the application can error out.
Choosing something other than mongodb is a no go. We have already sunk too much into development with MongoDB
Is it advisable to change the source code of Mongo?
1) Saving aggregated values directly to a collection (like with MapReduce) will be released in future versions, so the first solution is just to wait for a while :)
2) If you hit the 2nd or 3rd limitation, you should probably redesign your data schema and/or aggregation pipeline. If you are working with large time series, you can reduce the number of aggregated docs and do the aggregation in several steps (like MapReduce does). I can't say anything more concrete, because I don't know your data/use cases (leave me a comment).
3) You can choose a different framework. If you are familiar with the MapReduce concept, you can try Hadoop (it can use MongoDB as a data source). I don't have experience with MongoDB-Hadoop integration, but I must warn you NOT to use Mongo's MapReduce -- it performs very poorly on large datasets.
4) You can do the aggregation inside your own code, but you should use a "low-level" language or library. For example, pymongo (http://api.mongodb.org/python/current/) is not suitable for such things, but you can try something like Monary (https://bitbucket.org/djcbeach/monary/wiki/Home) to efficiently extract the data, and NumPy or Pandas to aggregate it the way you want.