MongoDB aggregation vs. simple query performance?

I am re-asking this because I think it belongs in a separate thread from my earlier question, in-mongodb-know-index-of-array-element-matched-with-in-operator.
I use MongoDB, and so far I have written all of my queries as simple queries (find, update, etc.), with no aggregations. Then I read several SO posts, for example mongodb-aggregation-match-vs-find-speed, and wondered why I should spend computation time on my application server: the more I compute there, the higher its load. So I started using aggregations and thought I was heading in the right direction. But on my previous question, andreas-limoli told me not to use aggregations because they are slow, and to use simple queries and compute on the server instead. Now I am in a dilemma about which to use. I have worked with MongoDB for a year, but I have no knowledge of how it performs as data size grows, so I don't know which one to pick.
One more thing I could not find anywhere: if aggregation is slower, is that because of $lookup or not? $lookup is the main reason I considered aggregation, because otherwise I have to execute many queries serially and then compute on the server, which seems very poor compared with aggregation.
I also read about the 100MB restriction on MongoDB aggregation when passing data from one pipeline stage to the next. How do people handle that case efficiently? And if they turn on disk usage, which slows everything down, how do they handle that?
Finally, I took a sample collection of 30,000 documents and ran an aggregation with $match and an equivalent find query. The aggregation was slightly faster: it took 180 ms, whereas find took 220 ms.
Please help me out, it would be really helpful.

Aggregation pipelines are costly queries. They can hurt performance as your data grows because of the CPU and memory they use. If you can achieve the same result with a find query, go for it, because aggregation gets costlier as the data in the DB increases.

The aggregation framework in MongoDB is similar to join operations in SQL. Aggregation pipelines are generally resource-intensive operations, so if simple queries satisfy your needs, use them in the first place.
However, if it is absolutely necessary, for example when you need to fetch data from multiple collections, you can use aggregation pipelines.
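As a rough illustration of the multi-collection case, a $lookup stage lets one pipeline do what would otherwise take several serial queries. The sketch below only builds the pipeline as plain Python data. The collection and field names (orders, customers, customer_id, status, total) are hypothetical, and the PyMongo call in the final comment is an assumption about how it would be run, not something executed here.

```python
def orders_with_customer(status):
    """Build a pipeline joining hypothetical `orders` to `customers`."""
    return [
        # Filter first so later stages see as few documents as possible.
        {"$match": {"status": status}},
        # Left outer join: attach matching customers as an array field.
        {"$lookup": {
            "from": "customers",
            "localField": "customer_id",
            "foreignField": "_id",
            "as": "customer",
        }},
        # Flatten the array; orders without a matching customer are dropped.
        {"$unwind": "$customer"},
        # Keep only the fields the application needs.
        {"$project": {"_id": 1, "total": 1, "customer.name": 1}},
    ]


pipeline = orders_with_customer("shipped")
# With PyMongo (assumed, not run here), allowDiskUse lets a stage that
# exceeds the 100MB memory limit spill to disk:
#   db.orders.aggregate(pipeline, allowDiskUse=True)
```

Filtering and projecting early is the usual way to stay under the memory limit before reaching for disk usage.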

Related

MongoDB: is aggregation discouraged over simple queries?

We are running a MongoDB instance for some of our price data, and I would like to find the most recent price update for each product that I have in the database.
Coming from a SQL background, my initial thought was to create a query with a subquery, where the subquery is a GROUP BY query. In the subquery, price updates are grouped by product, and then one can find the most recent update for each product.
I talked to a colleague about this approach and he claimed that the official training material from MongoDB says one should prefer simple queries over aggregated ones, i.e. he would run one query per product and find the most recent price update by ordering the results by update date. The number of queries would then be linear in the number of products.
I do agree that it is simpler to write such a query instead of an aggregated one, but I would have thought that performance-wise it would be faster to go through the collection once and find the updates, i.e. the number of queries would be constant with respect to the number of products.
He also claims that MongoDB can optimize simple queries better when running in a cluster.
Anybody know if that is the case?
I tried to search on the internet and I cannot find such a claim that one should prefer simple queries over aggregated ones.
Another colleague of mine also thought that, since MongoDB is a new technology, aggregation queries may not yet have been optimized for clustered MongoDB instances.
Anybody who can shed some light on these matters?
Thanks in advance
Here is some information on the aggregation pipeline in a sharded MongoDB deployment:
Aggregation Pipeline and Sharded Collections
Assuming you have the right indexes in place on your collections, you shouldn't have any problems using MongoDB aggregation.
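The "most recent price update per product" query from the question can be sketched as a single pipeline: sort by product and date, then keep the first document of each group. The field names (product_id, price, updated_at) are assumptions, since the question does not show its schema, and the pure-Python function is only a reference for what the pipeline is meant to compute.

```python
from datetime import datetime


def latest_price_pipeline():
    """One pass over the collection instead of one query per product."""
    return [
        # Newest update first within each product.
        {"$sort": {"product_id": 1, "updated_at": -1}},
        # $first picks the newest document of each sorted group.
        {"$group": {
            "_id": "$product_id",
            "price": {"$first": "$price"},
            "updated_at": {"$first": "$updated_at"},
        }},
    ]


def latest_price_per_product(docs):
    """Pure-Python reference for the result the pipeline should produce."""
    latest = {}
    for doc in docs:
        seen = latest.get(doc["product_id"])
        if seen is None or doc["updated_at"] > seen["updated_at"]:
            latest[doc["product_id"]] = doc
    return {pid: doc["price"] for pid, doc in latest.items()}


sample = [
    {"product_id": "a", "price": 10, "updated_at": datetime(2020, 1, 1)},
    {"product_id": "a", "price": 12, "updated_at": datetime(2020, 2, 1)},
    {"product_id": "b", "price": 7, "updated_at": datetime(2020, 1, 15)},
]
print(latest_price_per_product(sample))  # {'a': 12, 'b': 7}
```

A compound index on the two sort fields is the kind of "right index" the answer refers to; whether it helps in a given deployment is best checked by measuring.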

Why aggregate+sort is faster than find+sort in mongo?

I'm using mongoose in my project. As the number of documents in my collection grows, find+sort becomes slower, so I use aggregate+$sort instead. I just wonder why it is faster.
Without seeing your data and your query it is difficult to answer why aggregate+sort is faster than find+sort.
But the points below hold for find and aggregate in general:
Well-indexed data (with indexes that suit your query) will always yield faster results for a find query.
In an aggregation pipeline, execution time grows with the number of operations (stages) you use.
With an aggregation pipeline you can create new computed fields such as sum, avg and so on, which is not possible with find.
See this thread for more info:
MongoDB {aggregation $match} vs {find} speed
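To compare the two approaches on equal terms, it helps to write them as equivalent queries against the same index. The sketch below uses PyMongo-style calls rather than mongoose, and assumes the collection argument is a PyMongo Collection (or anything with the same find, aggregate and create_index methods). The field names status and created_at are hypothetical.

```python
def ensure_sort_index(collection):
    """Compound index: the equality field first, then the sort field."""
    return collection.create_index([("status", 1), ("created_at", -1)])


def newest_with_find(collection, query, limit):
    """find + sort + limit, newest documents first."""
    cursor = collection.find(query).sort("created_at", -1).limit(limit)
    return list(cursor)


def newest_with_aggregate(collection, query, limit):
    """The same query expressed as a three-stage pipeline."""
    pipeline = [
        {"$match": query},
        {"$sort": {"created_at": -1}},
        {"$limit": limit},
    ]
    return list(collection.aggregate(pipeline))
```

Both functions should return the same documents, so any timing difference between them reflects how the server executes each form rather than a difference in the work requested.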

(Real time) Small data aggregation MongoDB: triggers?

What is a reliable and efficient way to aggregate small data in MongoDB?
Currently, my data that needs to be aggregated is under 1 GB, but can go as high as 10 GB. I'm looking for a real time strategy or near real time (aggregation every 15 minutes).
It seems like Map/Reduce, Hadoop and Storm are all overkill. I know that triggers don't exist, but I found this one post that may be ideal for my situation. Is creating a trigger in MongoDB an ideal solution for real time small data aggregation?
MongoDB has two built-in options for aggregating data - the aggregation framework and map-reduce.
The aggregation framework is faster (executing as native C++ code as opposed to a JavaScript map-reduce job) but more limited in the sorts of aggregations that are supported. Map-reduce is very versatile and can support very complex aggregations but is slower than the aggregation framework and can be more difficult to code.
Either of these would be a good option for near real time aggregation.
One further consideration to take into account is that as of the 2.4 release the aggregation framework returns a single document containing its results and is therefore limited to returning 16MB of data. In contrast, MongoDB map-reduce jobs have no such limitation and may output directly to a collection. In the upcoming 2.6 release of MongoDB, the aggregation framework will also gain the ability to output directly to a collection, using the new $out operator.
Based on the description of your use case, I would recommend using map-reduce as I assume you need to output more than 16MB of data. Also, note that after the first map-reduce run you may run incremental map-reduce jobs that run only on the data that is new/changed and merge the results into the existing output collection.
As you know, MongoDB doesn't support triggers, but you may easily implement triggers in the application by tailing the MongoDB oplog. This blog post and this SO post cover the topic well.
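For the "aggregate every 15 minutes" case, a scheduled job could run a pipeline that ends in the $out stage mentioned above. This is a minimal sketch under assumptions: the field names ts and type and the target collection name event_counts are hypothetical, and the pipeline is only built here, not sent to a server.

```python
from datetime import datetime, timedelta


def rollup_pipeline(window_start, window_end, target="event_counts"):
    """Count events per type inside a time window and store the result."""
    return [
        # Restrict the input to the window being summarized.
        {"$match": {"ts": {"$gte": window_start, "$lt": window_end}}},
        # One output document per event type.
        {"$group": {"_id": "$type", "count": {"$sum": 1}}},
        # $out must be the last stage; it replaces the target collection.
        {"$out": target},
    ]


end = datetime(2024, 1, 1, 12, 0)
pipeline = rollup_pipeline(end - timedelta(minutes=15), end)
```

Because $out replaces the target collection on every run, this form suits recomputing a whole summary. Merging new results into existing ones is the incremental behaviour the answer attributes to map-reduce.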

MongoDB - Materialized View/OLAP Style Aggregation and Performance

I've been reading up on MongoDB. I am particularly interested in the aggregation framework's abilities. I am looking at taking multiple datasets of 10+ million rows per month and creating aggregations off of this data. This is time series data.
Example. Using Oracle OLAP, you can load data at the second/minute level and have this roll up to hours, days, weeks, months, quarters, years etc...simply define your dimensions and go from there. This works quite well.
So far I have read that MongoDB can handle the above using its map reduce functionality. Map reduce functionality can be implemented so that it updates results incrementally. This makes sense since I would be loading new data say weekly or monthly and I would expect to only have to process new data that is being loaded.
I have also read that map reduce in MongoDB can be slow. To overcome this, the idea is to use a cheap commodity hardware and spread the load across multiple machines.
So here are my questions.
How good (or bad) does MongoDB handle map reduce in terms of performance? Do you really need a lot of machines to get acceptable performance?
In terms of workflow, is it relatively easy to store and merge the incremental results generated by map reduce?
How much of a performance improvement does the aggregation framework offer?
Does the aggregation framework offer the ability to store results incrementally, in a manner similar to the existing map/reduce functionality?
I appreciate your responses in advance!
How good (or bad) does MongoDB handle map reduce in terms of performance? Do you really need a lot of machines to get acceptable performance?
MongoDB's Map/Reduce implementation (as of 2.0.x) is limited by its reliance on the single-threaded SpiderMonkey JavaScript engine. There has been some experimentation with the v8 JavaScript engine and improved concurrency and performance is an overall design goal.
The new Aggregation Framework is written in C++ and has a more scalable implementation including a "pipeline" approach. Each pipeline is currently single-threaded, but you can run different pipelines in parallel. The aggregation framework won't currently replace all jobs that can be done in Map/Reduce, but does simplify a lot of common use cases.
A third option is to use MongoDB for storage in combination with Hadoop via the MongoDB Hadoop Connector. Hadoop currently has a more scalable Map/Reduce implementation and can access MongoDB collections for input and output via the Hadoop Connector.
In terms of workflow, is it relatively easy to store and merge the incremental results generated by map reduce?
Map/Reduce has several output options, including merging the incremental output into a previous output collection or returning the results inline (in memory).
How much of a performance improvement does the aggregation framework offer?
This really depends on the complexity of your Map/Reduce. Overall the aggregation framework is faster (and in some cases, significantly so). You're best doing a comparison for your own use case(s).
MongoDB 2.2 isn't officially released yet, but the 2.2rc0 release candidate has been available since mid-July.
Does the aggregation framework offer the ability to store results incrementally, in a manner similar to the existing map/reduce functionality?
The aggregation framework is currently limited to returning results inline so you have to process/display the results when they are returned. The result document is also restricted to the maximum document size in MongoDB (currently 16MB).
There is a proposed $out pipeline command (SERVER-3253) which will likely be added in future for more output options.
Some further reading that may be of interest:
a presentation at MongoDC 2011 on Time Series Data Storage in MongoDB
a presentation at MongoSF 2012 on MongoDB's New Aggregation Framework
capped collections, which could be used similar to RRD
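The kind of time-series rollup described in the question (raw samples rolled up to hours) can be sketched as a single $group stage keyed on a truncated timestamp. The sketch assumes documents with hypothetical ts and value fields, and it uses the $dateToString operator, which comes from later MongoDB releases than the ones discussed in this answer. The pure-Python function is a reference for the intended result, not a replacement for running the pipeline.

```python
from collections import defaultdict
from datetime import datetime

# Same format string works for $dateToString and for Python's strftime.
HOUR_FORMAT = "%Y-%m-%dT%H:00"


def hourly_rollup_pipeline():
    """Sum values and count samples per hour bucket."""
    return [
        {"$group": {
            "_id": {"$dateToString": {"format": HOUR_FORMAT, "date": "$ts"}},
            "total": {"$sum": "$value"},
            "samples": {"$sum": 1},
        }},
        {"$sort": {"_id": 1}},
    ]


def hourly_rollup(docs):
    """Pure-Python reference for the same hourly buckets."""
    buckets = defaultdict(lambda: {"total": 0, "samples": 0})
    for doc in docs:
        bucket = buckets[doc["ts"].strftime(HOUR_FORMAT)]
        bucket["total"] += doc["value"]
        bucket["samples"] += 1
    return dict(sorted(buckets.items()))


readings = [
    {"ts": datetime(2024, 3, 1, 10, 5), "value": 2},
    {"ts": datetime(2024, 3, 1, 10, 50), "value": 3},
    {"ts": datetime(2024, 3, 1, 11, 1), "value": 4},
]
print(hourly_rollup(readings))
```

Coarser levels (days, months) follow the same pattern with a shorter format string, applied either to the raw data or to the hourly results.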
Couchbase map reduce is designed for building incremental indexes, which can then be dynamically queried for the level of rollup you are looking for (much like the Oracle example you gave in your question).
Here is a write up of how this is done using Couchbase: http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-sample-patterns-timestamp.html

Running MongoDB Queries in Map/Reduce

Is it possible to run MongoDB commands, such as a query to grab additional data or an update, from within MongoDB's MapReduce command, either in the Map or the Reduce function?
Is this completely ludicrous to do anyways? Currently I have some documents that refer to separate collections using the MongoDB DBReference command.
Thanks for the help!
Is it possible to run MongoDB commands... from within MongoDB's MapReduce command.
In theory, this is possible. In practice there are lots of problems with this.
Problem #1: exponential work. M/R is already pretty intense and poorly logged. Adding queries can easily make M/R run out of control.
Problem #2: context. Imagine that you're running a sharded M/R and you are querying into an unsharded collection. Does the current context even have that connection?
You're basically trying to implement JOIN logic and MongoDB has no joins. Instead, you may need to build the final data in a couple of phases by running a few loops on a few sets of data.
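Building the final data in phases can look like the sketch below: one query for the main documents, one batched query for everything they reference, then a merge in application code. The collection arguments are assumed to be PyMongo collections (or any objects with a compatible find method), and the field names published and author_id are hypothetical. Newer servers also offer the $lookup stage mentioned earlier on this page for the same purpose.

```python
def attach_authors(posts, users):
    """Application-side join in two phases instead of querying inside M/R."""
    # Phase 1: fetch the primary documents.
    post_docs = list(posts.find({"published": True}))

    # Phase 2: fetch all referenced documents in one batched query,
    # rather than one query per post.
    author_ids = list({post["author_id"] for post in post_docs})
    authors = {
        user["_id"]: user
        for user in users.find({"_id": {"$in": author_ids}})
    }

    # Merge: attach the referenced document (or None if it is missing).
    for post in post_docs:
        post["author"] = authors.get(post["author_id"])
    return post_docs
```

This keeps the number of round trips constant (two) regardless of how many posts are returned.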