MongoDB: is aggregation discouraged in favor of simple queries? - mongodb

We are running a MongoDB instance for some of our price data, and I would like to find the most recent price update for each product that I have in the database.
Coming from a SQL background, my initial thought was to create a query with a subquery, where the subquery groups the price updates by product so that the most recent update for each product can be found.
I talked to a colleague about this approach, and he claimed that the official MongoDB training material says one should prefer simple queries over aggregations, i.e. he would run a query for each product and find the most recent price update by ordering on the update date. The number of queries would therefore be linear in the number of products.
I agree that such a query is simpler to write than an aggregation, but I would have thought that, performance-wise, it would be faster to go through the collection once, i.e. the number of queries would be constant regardless of the number of products.
He also claims that MongoDB is better able to optimize simple queries when running in a cluster.
Anybody know if that is the case?
I have searched online and cannot find any such claim that one should prefer simple queries over aggregations.
Another colleague of mine suggested that, since MongoDB is a relatively new technology, aggregation queries may simply not yet be well optimized for clustered MongoDB instances.
Can anybody shed some light on these matters?
Thanks in advance

Here is some information on the aggregation pipeline on a sharded MongoDB implementation:
Aggregation Pipeline and Sharded Collections
Assuming you have the right indexes in place on your collections, you shouldn't have any problems using MongoDB aggregation.
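For reference, a single aggregation of the kind described in the question could look roughly like this; the collection and field names (prices, productId, price, updatedAt) are assumptions, not taken from the question:
db.prices.aggregate([
    // most recent update first within each product
    { $sort: { productId: 1, updatedAt: -1 } },
    // keep only the first (i.e. latest) document per product
    { $group: {
        _id: "$productId",
        latestPrice: { $first: "$price" },
        updatedAt:   { $first: "$updatedAt" }
    } }
])
With an index on { productId: 1, updatedAt: -1 } the initial $sort can be covered by the index rather than done in memory.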

Related

How to join two collections in MongoDB?

I've created two collections, one for people's details and another for their income, spending amount, etc.
I want a cumulative result. How can I combine both collections?
You can insert all documents from one collection into another. Here is how to do that using the mongo shell:
// copy every document from collection1 into collection2
db.collection1.find().forEach( function(x){ db.collection2.insert(x); } );
Instead of combining them into one collection (which may or may not be correct in your situation - no way of knowing), take a look at this StackOverflow question. You should also seriously study both the $lookup and aggregation possibilities in MongoDB. Since Mongo is NoSQL, the way we approach looking up information is a bit different.
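If you do want to query across the two collections, a $lookup sketch might look roughly like this; the collection names people and income and the linking field personId are assumptions:
db.people.aggregate([
    { $lookup: {
        from: "income",            // the collection to join against
        localField: "_id",         // field on the people documents
        foreignField: "personId",  // matching field on the income documents
        as: "incomeDetails"        // joined documents end up in this array
    } }
])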

MongoDB Aggregation V/S simple query performance?

I am re-asking this question, as I thought it should be on a separate thread from this one: in-mongodb-know-index-of-array-element-matched-with-in-operator.
I am using MongoDB, and until now I was writing all of my queries using simple operations such as find and update (no aggregations). Then I read many SO posts, see this one for example: mongodb-aggregation-match-vs-find-speed. I started wondering why I was putting more computation on my application server, since the more I compute there, the higher its load becomes, so I tried using aggregations and thought I was going in the right direction. But later, on my previous question, andreas-limoli told me not to use aggregations, as they are slow, and to use simple queries and compute on the server instead. Now I am literally in a dilemma about what I should use. I have been working with MongoDB for a year now, but I have no knowledge of its performance as the data size grows, so I really don't know which one to pick.
One more thing I couldn't find anywhere: if aggregation is slower, is it because of $lookup or not? $lookup is the main reason I considered aggregation in the first place, because otherwise I have to execute many queries serially and then compute on the application server, which seems very poor compared to aggregation.
I also read about the 100MB restriction on MongoDB aggregation when passing data from one pipeline stage to the next. How do people handle that case efficiently? And if they turn on disk usage, which slows everything down, how do they handle that?
I also fetched a sample collection of 30,000 documents and tried running an aggregation with $match as well as a find query; the aggregation was a little faster, taking 180 ms to execute whereas find took 220 ms.
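A comparison of that kind would look roughly like this (the collection and field names here are placeholders, not the real ones):
// plain find with a filter
db.items.find({ category: "books" })
// equivalent aggregation using only a $match stage
db.items.aggregate([ { $match: { category: "books" } } ])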
Please help me out, guys; it would be really helpful for me.
Aggregation pipelines are costly queries. They can impact your performance as your data grows, because of CPU and memory usage. If you can achieve the result with a find query, go for it, because aggregation becomes costlier as the amount of data in the database increases.
The aggregation framework in MongoDB is comparable to join operations in SQL. Aggregation pipelines are generally resource-intensive operations, so if your workload can be satisfied with simple queries, you should use those in the first place.
However, if it is absolutely necessary, for example when you need to fetch data from multiple collections, you can use aggregation pipelines.
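Regarding the 100MB per-stage limit mentioned in the question, the usual workaround is the allowDiskUse option, which lets pipeline stages spill to temporary files at the cost of speed. A minimal sketch (collection and field names are hypothetical):
db.orders.aggregate(
    [
        { $match: { status: "completed" } },
        { $group: { _id: "$customerId", total: { $sum: "$amount" } } }
    ],
    // let stages larger than 100MB spill to temporary files on disk
    { allowDiskUse: true }
)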

MongoDB $lookup: Limitations & Usage

With the new aggregation pipeline stage $lookup we are now able to perform 'left outer joins'.
At first glance, I want to immediately replace one of our denormalised collections with two separate collections and use the $lookup to join them upon querying. This will solve the problem of having, when necessary, to update a huge number of documents. Now we can update just one document.
But surely this is too good to be true? This is a NoSQL, document database after all!
MongoDB's CTO also highlights his concerns:
We’re still concerned that $lookup can be misused to treat MongoDB like a relational database. But instead of limiting its availability, we’re going to help developers know when its use is appropriate, and when it’s an anti-pattern. In the coming months, we will go beyond the existing documentation to provide clear, strong guidance in this area.
What are the limitations of $lookup? Can I use them in real-time, operational querying of our data or should they be left for reporting, offline situations?
I share your same enthusiasm for $lookup.
I think there are trade-offs. One of the major concerns of SQL databases (and which is one of the reasons for the genesis of NoSQL) is that at large scale, joins can take a lot of time (well, relatively speaking).
It definitely helps in giving you a declarative model for your data, but if you start to model your entire NoSQL database as though it's a database of rows and tables (just using refs, for example), then you begin modeling it as though it's simply a SQL database (to a degree). Even MongoDB mentions this (as you quoted in your question):
We’re still concerned that $lookup can be misused to treat MongoDB like a relational database.
You mentioned:
This will solve the problem of having, when necessary, to update a huge number of documents. Now we can update just one document.
I'm not sure what your collections look like exactly, but that definitely sounds like it could be a good use for $lookup.
Can I use them in real-time, operational querying
I would say, again, it depends on your use-case. You'll have to compare:
Desired semantics of your queries (declarative vs imperative)
Whether modeling your data as more relational (and thus using $lookup) in certain circumstances is worth the potential trade-off in computational time (that's assuming that querying across collections is even something to be concerned about, computationally speaking)
etc...
I'm sure in the coming months we'll see perf tests of the "left outer joins" and perhaps MongoDB will start writing some posts about when $lookup is an antipattern.
Hope this answer helps add to the discussion.
First of all, MongoDB is a document-based database and always will be. So the $lookup aggregation pipeline stage, new in version 3.2, didn't turn MongoDB into a relational database (RDBMS), as MongoDB's CTO mentioned:
We’re still concerned that $lookup can be misused to treat MongoDB like a relational database.
The first limitation of $lookup as mentioned in the documentation is that it:
Performs a left outer join to an unsharded collection in the same database to filter in documents from the “joined” collection for processing.
Which means that you can't use it with a sharded collection.
Also, the $lookup stage doesn't work directly with an array, as mentioned in this post, so you will need a preliminary $unwind stage to denormalize the localField if it is an array.
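For example, a sketch of such a preliminary $unwind; the collection names orders and products and the array field productIds are assumptions:
db.orders.aggregate([
    // turn the productIds array into one document per element
    { $unwind: "$productIds" },
    // now each element can be joined against the products collection
    { $lookup: {
        from: "products",
        localField: "productIds",
        foreignField: "_id",
        as: "productDocs"
    } }
])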
Now you said:
This will solve the problem of having, when necessary, to update a huge number of documents.
This is a good idea if your data is updated more often than it is read, as mentioned in 6 Rules of Thumb for MongoDB Schema Design: Part 3, especially if you have large hierarchical data sets.
Denormalizing one or more fields makes sense if those fields are read much more often than they are updated.
I believe that with careful schema design you probably will not need the $lookup operator.

Denormalization vs Parent Referencing vs MapReduce

I have a highly normalized data model with me. Currently I'm using manual referencing by storing the _id and running sequential queries to fetch details from the deepest collection.
The referencing is one-way and the flow has around 5-6 collections. For one particular use case, I'm having to query down to the deepest collection by querying subsequent "_id"s from the higher-level collections. So technically I'm hitting the database every time I run
db.collection_name.find({ _id: **** });
My prime goal is to optimize the read without hugely affecting the atomicity of the other collections. I have read about de-normalization and it does not make sense to me because I want to keep an option for changing the cardinality down the line and hence want to maintain a separate collection altogether.
I was initially thinking of using MapReduce to do an aggregation from the back and have a collection primarily for the particular use-case. But well even that does not sound that good.
In a relational DB, I would break the query into sub-queries and perform a join to get the data sets that intersect with the initial results. Since MongoDB does not support joins, I'm having a tough time figuring anything out.
Please help if you have faced anything like this earlier or have any idea how to resolve it.
Denormalize your data.
MongoDB does not do JOINs - period.
There is no operation on the database which gets data from more than one collection. Not find(), not aggregate() and not MapReduce. When you need to puzzle your data together from more than one collection, there is no other way than doing it on the application layer. For that reason you should organize your data in a way that any common and performance-relevant query can be resolved by querying just a single collection.
In order to do that you might have to create redundancies and transitive dependencies. This is normal in MongoDB.
When this feels "dirty" to you, then you should either accept the fact that your performance will be sub-optimal or use a different kind of database, like a classic relational database or a graph database.
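As a rough illustration of such redundancy (all collection and field names here are hypothetical):
// redundant copies of the customer fields live inside the order document
db.orders.insert({
    _id: 1,
    customerId: 42,
    customerName: "Ada Lovelace",  // duplicated from the customers collection
    customerCity: "London",        // duplicated from the customers collection
    amount: 99.95
})
// the common, performance-relevant query now touches a single collection
db.orders.find({ customerCity: "London" })
The trade-off is that updating a customer's city now means updating every order that carries the copy, which is exactly the write cost you accept in exchange for cheap reads.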

MongoDB: Optimization of Performance: Aggregation Pipeline (one collection) VS Aggregation plus Additional Query on Separate Collection

I would like to know which is faster in terms of querying MongoDB.
Let's say I would like to search for income information based on areas,
and a person can have many residences in different states. Each polygon area would have an associated income for that individual.
I have outlined the options for querying this information below; I would like to know which would be faster to search.
1) Have a single collection which contains two types of documents.
Document1 has polygons with a 2dsphere geospatial index on them. It will be searched with aggregation to return ids that link to Document2, essentially taking the place of a relation in MySQL.
Document2 has the other information (let's say the income amount) and different indexes, including an index on income amount, plus an id which the first document references.
The two document types are searched with an aggregation pipeline.
Stage 1 of the pipeline: search Document1 geospatially and get the id values.
Stage 2 of the pipeline: use the ids found in Document1 to search the second document type, also filtering by income type.
2) Separate the documents so that each has its own collection, avoiding aggregation: query collection1 geospatially and use the person ids found to query collection2 for income info.
3) A third option involving a polyglot setup, a combination of MongoDB and PostGIS: query PostGIS for the id and then use that to search the MongoDB collection. I am including this option since I believe PostGIS is faster than Mongo for geospatial querying, but I am curious whether the speed of PostGIS would be cancelled out by the latency of now querying two databases.
The end goal is to pull back data based on a geospatial radius. One geospatial polygon, representing the area where the person lives and does business for that income, maps to one relational id, and each relational id maps to many sets of data.
Essentially I have a many-to-1-to-many relationship: many geospatial polygons map to one person, who maps to many data sets.
You should generally keep collections limited to a single type of document.
Solution 1 will not work. You cannot use the aggregation pipeline the way you are describing (if I'm understanding you correctly). Also, it sounds as though you are thinking in a relational way about a solution using a non-relational database.
Solution 2 will work but it will not have optimum performance. This solution sounds even more like a relational database solution where the collections are being treated like tables.
Solution 3 will probably work but as you said it will now require two databases.
All three of these solutions are progressively pulling these two types of documents further and further away from one another. I believe the best solution for a document database like MongoDB is to embed this data. It's impossible without a real example of your documents and without a clear understanding of your application to suggest an exact solution. But in general embedding data is preferred over creating relationships between documents in MongoDB. As long as no document will ever get to be over 16MB it's worth looking into whether embedding is the right solution.
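To make the embedding idea concrete, here is a hedged sketch, simplified to a single area per person; all names and values are hypothetical, not taken from the question:
// one person document embeds both the area polygon and the income data
db.people.insert({
    _id: 1,
    name: "Jane",
    homeArea: { type: "Polygon",
                coordinates: [ [ [0,0], [0,1], [1,1], [1,0], [0,0] ] ] },
    incomes: [ { year: 2015, amount: 55000 } ]
})
db.people.createIndex({ homeArea: "2dsphere" })
// find people whose area contains a given point and who match an income filter,
// all in one query against a single collection
db.people.find({
    homeArea: { $geoIntersects: { $geometry: { type: "Point", coordinates: [0.5, 0.5] } } },
    "incomes.amount": { $gte: 50000 }
})
With this shape there is no second query and no pipeline needed for the common read path, which is the point of embedding.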