MongoDB Using Map Reduce against Aggregation

MongoDB Using Map Reduce against Aggregation - mongodb

I have seen this asked a couple of years ago. Since then MongoDB 2.4 has multi-threaded Map Reduce available (after the switch to the V8 Javascript engine) and has become faster than what it was in previous versions and so the argument of being slow is not an issue.
However, I am looking for a scenario where a Map Reduce approach might work better than the Aggregation Framework. Infact, possibly a scenario where the Aggregation Framework cannot work at all but the Map Reduce can get the required results.
Thanks,
John

Take a look to this.
The Aggregation FW results are stored in a single document so are limited to 16 MB: this might be not suitable for some scenarios. With MapReduce there are several output types available including a new entire collection so it doesn't have space limits.
Generally, MapReduce is better when you have to work with large data sets (may be the entire collection). Furthermore, it gives much more flexibility (you write your own aggregation logic) instead of being restricted to some pipeline commands.

Currently the Aggregation Framework results can't exceed 16MB. But, I think more importantly, you'll find that the AF is better suited to "here and now" type queries that are dynamic in nature (like filters are provided at run-time by the user for example).
A MapReduce is preplanned and can be far more complex and produce very large outputs (as they just output to a new collection). It has no run-time inputs that you can control. You can add complex object manipulation that simply is not possible (or efficient) with the AF. It's simple to manipulate child arrays (or things that are array like) for example in MapReduce as you're just writing JavaScript, whereas in the AF, things can become very unwieldy and unmanageable.
The biggest issue is that MapReduce's aren't automatically kept up to date and they're difficult to predict when they'll complete). You'll need to implement your own solution to keeping them up to date (unlike some other NoSQL options). Usually, that's just a timestamp of some sort and an incremental MapReduce update as shown here). You'll possibly need to accept that the data may be somewhat stale and that they'll take an unknown length of time to complete.
If you hunt around on StackOverflow, you'll find lots of very creative solutions to solving problems with MongoDB and many solutions use the Aggregation Framework as they're working around limitations of the general query engine in MongoDB and can produce "live/immediate" results. (Some AF pipelines are extremely complex though which may be a concern depending on the developers/team/product).

Related

MongoDB Data Transformation for UI display

We've stumbled into a huge issue, which we need to figure out how to achieve in a right manner.
We are using MongoDB via Mongoose and dump a lot of different data.
We need to create a big aggregation from few collections based on certain inputs.
Creating an aggregation function doesn't supply good performance times.
We need to find a technical solution, the correct one,
To actually create "ETL" but no a real ETL, to store a real time sample of the data, so UI layer could query it smoothly.
Let's say i have 5 collections, which i need a real time display of 3 fields of each, and "join" using aggregation won't supply a good enough performance wise solution.
We might need a mediator, we thought dumping data using etl to redshift but it doesn't feel like the right solution.
It seems like a common problem, but we don't seem to find the smooth and correct solution.
We don't mind deploying whatever needed.
Thanks for any advice.

Aggregate,Find,Group confusion?

I am building a web based system for my organization, using Mongo DB, I have gone through the document provided by mongo db and came to the following conclusion:
find: Cannot pull data from sub array.
group: Cannot work in sharded environment.
aggregate:Best for sub arrays, but has performance issue when data set is large.
Map Reduce : Too risky to write map and reduce function.
So,if someone can help me out with the best approach to work with sub array document, in production environment having sharded cluster.
Example:
{"testdata":{"studdet":[{"id","name":"xxxx","marks",80}.....]}}
now my "studdet" is a huge collection of more than 1000, rows for each document,
So suppose my query is:
"Find all the "name" from "studdet" where marks is greater than 80"
its definitely going to be an aggregate query, so is it feasible to go with aggregate in this case because ,"find" cannot do this and "group" will not work in sharded environment, so if I go with aggregate what will be the performance impact, i need to call this query most of the time.

Please have a look at:
http://docs.mongodb.org/manual/core/data-modeling/
and
http://docs.mongodb.org/manual/tutorial/model-embedded-one-to-many-relationships-between-documents/#data-modeling-example-one-to-many
These documents describe the decisions in creating a good document schema in MongoDB. That is one of the hardest things to do in MongoDB, and one of the most important. It will affect your performance etc.
In your case running a database that has a student collection with an array of grades looks to be the best bet.
{_id:, …., grades:[{type:”test”, grade:80},….]}
In general, and, given your sample data set, the aggregation framework is the best choice. The aggregation framework is faster then map reduce in most cases (certainly in execution speed, it is C++ vs javascript for map reduce).
If your data's working set becomes so large you have to shard then aggregation, and everything else, will be slower. Not, however, slower then putting everything on a single machine that has a lot of page faults. Generally you need a working set larger then the RAM available on a modern computer for sharding to be the correct way to go such that you can keep everything in RAM. (At this point a commercial support contract for Mongo for assistance is going to be a less then the cost of hardware, and that include extensive help with schema design.)
If you need anything else please don’t hesitate to ask.
Best,
Charlie

How to overcome the limitations with mongoDB aggregation framework

The aggregation framework on MongoDB has certain limitations as per this link.
I want to remove the restrictions 2, 3.
I really do not care what the resulting set's size is. I have a lot of RAM and resources.
And I do not care if it takes more than 10% system resources.
I expect both 2, 3 to be violated in my application. Mostly 2.
But I really need the aggregation framework. Is there anything that I can do to remove these limitations?
The reason *
The application I have been working has these things
The user has the ability to upload a large dataset
We have a menu to let him sort, aggregate etc
The aggregate has no restrictions currently and the user can choose to do whatever he wants. Since the data is not known to the developer and since it is possible to group by any number of columns, the application can error out.
Choosing something other than mongodb is a no go. We have already sunk too much into development with MongoDB
Is it advisable to change the source code of Mongo?

1) Saving aggregated values directly to some collection(like with MapReduce) will released in future versions, so first solution is just wait for a while :)
2) If you hit 2-nd or 3-rd limitation may you should redesign your data scheme and/or aggregation pipeline. If you working with large time series, you can reduce number of aggregated docs and do aggregation in several steps (like MapReduce do). I can't say more concretely, because I don't know your data/use cases(give me a comment).
3) You can choose different framework. If you familiar with MapReduce concept, you can try Hadoop(it can use MongoDB as data source). I don't have experience with MongoDB-Hadoop integration, but I mast warn you NOT to use Mongo's MapReduce -- it sucks hard on large datasets.
4) You can do aggregation inside your code, but you should use some "lowlevel" language or library. For example, pymongo (http://api.mongodb.org/python/current/) is not suitable for such things, but you can tray something like monary(https://bitbucket.org/djcbeach/monary/wiki/Home) to efficiently extract date and NumPy or Pandas to aggregate it the way want.

MongoDB - Materialized View/OLAP Style Aggregation and Performance

I've been reading up on MongoDB. I am particularly interested in the aggregation frameworks ability. I am looking at taking multiple dataset consisting of at least 10+ million rows per month and creating aggregations off of this data. This is time series data.
Example. Using Oracle OLAP, you can load data at the second/minute level and have this roll up to hours, days, weeks, months, quarters, years etc...simply define your dimensions and go from there. This works quite well.
So far I have read that MongoDB can handle the above using it's map reduce functionality. Map reduce functionality can be implemented so that it updates results incrementally. This makes sense since I would be loading new data say weekly or monthly and I would expect to only have to process new data that is being loaded.
I have also read that map reduce in MongoDB can be slow. To overcome this, the idea is to use a cheap commodity hardware and spread the load across multiple machines.
So here are my questions.
How good (or bad) does MongoDB handle map reduce in terms of performance? Do you really need a lot of machines to get acceptable performance?
In terms of workflow, is it relatively easy to store and merge the incremental results generated by map reduce?
How much of a performance improvement does the aggregation framework offer?
Does the aggregation framework offer the ability to store results incrementally in a similar manner that the map/reduce functionality that already exists does.
I appreciate your responses in advance!

How good (or bad) does MongoDB handle map reduce in terms of performance? Do you really need a lot of machines to get acceptable performance?
MongoDB's Map/Reduce implementation (as of 2.0.x) is limited by its reliance on the single-threaded SpiderMonkey JavaScript engine. There has been some experimentation with the v8 JavaScript engine and improved concurrency and performance is an overall design goal.
The new Aggregation Framework is written in C++ and has a more scalable implementation including a "pipeline" approach. Each pipeline is currently single-threaded, but you can run different pipelines in parallel. The aggregation framework won't currently replace all jobs that can be done in Map/Reduce, but does simplify a lot of common use cases.
A third option is to use MongoDB for storage in combination with Hadoop via the MongoDB Hadoop Connector. Hadoop currently has a more scalable Map/Reduce implementation and can access MongoDB collections for input and output via the Hadoop Connector.
In terms of workflow, is it relatively easy to store and merge the incremental results generated by map reduce?
Map/Reduce has several output options, including merging the incremental output into a previous output collection or returning the results inline (in memory).
How much of a performance improvement does the aggregation framework offer?
This really depends on the complexity of your Map/Reduce. Overall the aggregation framework is faster (and in some cases, significantly so). You're best doing a comparison for your own use case(s).
MongoDB 2.2 isn't officially released yet, but the 2.2rc0 release candidate has been available since mid-July.
Does the aggregation framework offer the ability to store results incrementally in a similar manner that the map/reduce functionality that already exists does.
The aggregation framework is currently limited to returning results inline so you have to process/display the results when they are returned. The result document is also restricted to the maximum document size in MongoDB (currently 16MB).
There is a proposed $out pipeline command (SERVER-3253) which will likely be added in future for more output options.
Some further reading that may be of interest:
a presentation at MongoDC 2011 on Time Series Data Storage in MongoDB
a presentation at MongoSF 2012 on MongoDB's New Aggregation Framework
capped collections, which could be used similar to RRD

Couchbase map reduce is designed for building incremental indexes, which can then be dynamically queried for the level of rollup you are looking for (much like the Oracle example you gave in your question).
Here is a write up of how this is done using Couchbase: http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-sample-patterns-timestamp.html

Is mongoDB efficient in doing multi-key lookups?

I'm evaluating MongoDB, coming from Membased/memcached because I want more flexibility.
Of course Membase is excellent in doing fast (multi)-key lookups.
I like the additional options that MongoDB gives me, but is it also fast in doing multi-key lookups? I've seen the $or and $in operator and I'm sure I can model it with that. I just want to know if it's performant (in the same league) as Membase.
use-case, e.g., Lucene/Solr returns 20 product-ids. Lookup these product-ids in Couchdb to return docs/ appropriate fields.
Thanks,
Geert-Jan

For your use case, I'd say it is, from my experience: I hacked some analytics into a database of mine that made a lot of $in queries with thousands of ids and it worked fine (it was a hack). To my surprise, it worked rather well, in the lower millisecond area.
Of course, it's hard to compare this, and -as usual- theory is a bad companion when it comes to performance. I guess the best way to figure it out is to migrate some test data and send some queries to the system.
Use MongoDB's excellent built-in profiler, use $explain, keep the one index per query rule in mind, take a look at the logs, keep an eye on mongostat, and do some benchmarks. This shouldn't take too long and give you a definite and affirmative answer. If your queries turn out slow, people here and on the news group probably have some ideas how to improve the exact query, or the indexation.

One index per query. It's sometimes thought that queries on multiple
keys can use multiple indexes; this is not the case with MongoDB. If
you have a query that selects on multiple keys, and you want that
query to use an index efficiently, then a compound-key index is
necessary.
http://www.mongodb.org/display/DOCS/Indexing+Advice+and+FAQ#IndexingAdviceandFAQ-Oneindexperquery.
There's more information on that page as well with regard to Indexes.
The bottom line is Mongo will be great if your indexes are in memory and you are indexing on the columns you want to query using composite keys. If you have poor indexing then your performance will suffer as a result. This is pretty much in line with most systems.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse