MongoDB aggregation performance with indexes

I have a collection with a structure similar to this.
{
    "_id" : ObjectId("59d7cd63dc2c91e740afcdb"),
    "dateJoined": ISODate("2014-12-28T16:37:17.984Z"),
    "dateActivatedMonth": 15,
    "enrollments" : [
        { "month": -10, "enrolled": '00' },
        { "month": -9,  "enrolled": '00' },
        { "month": -8,  "enrolled": '01' },
        //other months
        { "month": 8,  "enrolled": '11' },
        { "month": 9,  "enrolled": '11' },
        { "month": 10, "enrolled": '00' }
    ]
}
dateActivatedMonth is the number of months from dateJoined.
month in the enrollments subdocument is a month relative to dateJoined.
I am using the MongoDB aggregation framework to process queries like "all enrollments with enrolled as '01', where the enrolled month is between 10 months before activation and 25 months after activation".
In my aggregation, I first apply all possible filters in the $match stage and then apply the condition on "month" in the $project stage.
db.getCollection("enrollments").aggregate(
{ $match:{ //initial conditions }},
{ $project: {
//Here i will apply month filter by considering the number of month before and after.
}
}
//other pipeline operations
)
All the filters that I am applying are optional, so in some cases $match will not filter anything. The only condition that is guaranteed to be there is the one on "month" in the "enrollments" subdocument.
This query is slow: it takes about 6-7 seconds (the data is also huge). I am looking for ways to improve this, and the first thing I am looking at is creating indexes.
Now my questions are:
Can $project use indexes? I tried creating an index on "month", but I don't see anything about index usage in the queryPlanner output of explain().
I would like to move the "month" condition to $match so that it uses the index. How can I use a field value in $match? Something like this:
db.getCollection('collection').aggregate([
    {
        $match: {
            dateActivatedMonth: { $exists: true }
            //, 'enrollments.month': dateActivatedMonth   -- not working
            //, 'enrollments.month': '$dateActivatedMonth' -- not working
        }
    }
])
Thank you for your patience.

Related

Difference between aggregation and operator

I have been reading up on some MongoDB documentation and ran across some confusing terminology, namely how to differentiate when a symbol will be used as an aggregate function or as an operator.
For example, the $size function either calculates the number of items in an array or checks whether the number of elements in an array is equal to a number. Is there any way to know what the function will do at what time? Through trial and error I discovered that $size will throw an error unless a number is passed to it in the $match stage, but is there some rule/guideline so I can know what it will do beforehand?
db.collection.aggregate([
    {
        $project: {
            key: 1,
            number: {
                $size: "$key"
            }
        }
    },
    {
        $match: {
            key: {
                $size: 1
            }
        }
    }
])
For querying data in MongoDB you can use the find method or the aggregate method. There are operators which you can use with these methods.
The query operators are used with the find method. Some of these can also be used with the $match stage of the aggregate method (details later in the post).
The aggregation pipeline operators are used within the aggregate method's stages. Some of these can also be used with the find method (details later in the post).
You will notice that there are common operator names; for example, $eq, $gte, $or, $type, $size, etc. But their usage and/or functionality can be different. The $eq operator has the same function but different usage syntax, and the $type operator has different functionality (and usage syntax).
And some of these operators can be used with both methods.
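To make that concrete with $type, here is a sketch of the two different forms side by side (the query-operator form matches documents by a field's BSON type, while the aggregation-operator form, available since MongoDB 3.4, returns the type of a value):
// Query operator $type: find users whose age is stored as a string.
db.users.find( { age: { $type: "string" } } )

// Aggregation operator $type: report the BSON type of each age value.
db.users.aggregate([
    { $project: { ageType: { $type: "$age" } } }
])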
Some Usage Scenarios:
Let's consider a users collection and some queries:
{ "_id" : 1, "age" : 21, "firstname" : "John" }
{ "_id" : 2, "age" : 18, "firstname" : "John" }
{ "_id" : 3, "age" : "39", "firstname" : "Johnson" }
The query:
db.users.find( { firstname: { $eq: "John"}, age: { $gt: 20 } } )
This query's filter is the same as { firstname: "John", age: { $gt: 20 } }. It uses the query operators $eq and $gt. The same query can be written in the aggregate method's $match stage:
db.users.aggregate([
{ $match: { firstname: "John", age: { $gt: 20 } } },
])
The operators used in this case are the same query operators. The query comparison operators can be used with the aggregate method's $match and $lookup stages.
Another Scenario:
db.users.aggregate([
{ $project: { ageGreaterThan20: { $gt: [ "$age", 20 ] } } },
])
This is the usage of the aggregation comparison operator $gt. Note this is used within the aggregation query, specifically within the $project stage.
Just as you can use the query operators in an aggregation query, you can also use the aggregation operators within the find method. But these must be wrapped in the "special" $expr operator. For example:
db.users.find( { $expr: { $gt: [ "$age", 20 ] } } )
The advantage of $expr is that a number of aggregation operators can be used within find queries. For example, using $strLenCP:
db.users.find( { $expr: { $gt: [ { $strLenCP: "$firstname" }, 4 ] } } )
You can also use the $expr within the aggregation, within the $match or $lookup stages:
db.users.aggregate([
{ $match: { $expr: { $gt: [ "$age", 20 ] } } },
])
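The same $expr trick applies in $lookup sub-pipelines, which is how you compare a field of the joined collection against a variable from the outer document. A sketch (the orders collection and its fields are hypothetical; $lookup with let/pipeline requires MongoDB 3.6+):
db.orders.aggregate([
    { $lookup: {
        from: "users",
        let: { minAge: "$minAge" },   // variable taken from each orders doc
        pipeline: [
            // $$minAge refers to the outer variable; $age to the users field.
            { $match: { $expr: { $gte: [ "$age", "$$minAge" ] } } }
        ],
        as: "eligibleUsers"
    }}
])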
Finally:
I have been reading up on some mongodb documentation and ran across some confusing terminology, namely how to differentiate when a symbol will be used as an aggregate function or as an operator. ... but is there some rule/guideline so I can know what it will do beforehand?
Reading helps, practicing helps more, and experience helps the most. You use a specific operator in a specific scenario or use case. To achieve some functionality you use an appropriate method and operator(s).
Reference: Operators
As per my understanding, I'm pointing out a few things:
Difference between aggregation and operator
You'll have certain basic functions for CRUD operations like .find(), .insert(), .remove() or .update() on MongoDB (there are a few others like .count() and .distinct(), but those are the primary ones).
Versus
aggregation is a whole framework heavily used for complex reads; only two stages in aggregation are capable of writes, $out and $merge.
Operators:
There are different types of operators:
Query and Projection Operators: These operators are crucial and are used to filter docs and to transform the fields of docs in the response. They are usually used in the filter and project parts of .find(filter, project) or .update(filter, update, options), etc. These are also used in the $match stage of the aggregation pipeline ($match is similar to the filter in .find()). Ex.: $and, $or, $in & more.
Update Operators: As the name suggests, these operators help update documents in the collection. They are usually used in the update part of .update() or .findOneAndUpdate(), etc. Ex.: $set, $unset, $inc & more.
Aggregation:
When it comes to aggregation they call it the aggregation framework, definitely for a reason, as you can do a lot of things with data using aggregation.
Aggregation has the aggregation pipeline, which has the syntax .aggregate([]). The pipeline is an array of stages. Each stage in aggregation performs a certain operation on the data flowing through it. Ex.: $match, $project, $group, etc.
As we know, each document is independent on its own, so most aggregation stages operate independently on each doc flowing through them.
Aggregation Pipeline Operators:
These operators are generally used to achieve what you're looking for; certain operators can't be used in conjunction with other operators or in a particular stage.
Let's say in the $match stage you would mostly use query operators, but not aggregation operators, as aggregation operators can't directly be used in $match, in contrast with the $project or $addFields stages.
Example with Stages & Operators:
Let's say you've got to make a smoothie (a sketch of the corresponding pipeline follows this list):
Out of a bunch of groceries you would filter the needed fruits (docs in a fruits collection) by matching against what you want to blend, using the $match stage ($and to match fruits/veggies, $lt to filter only fruits that haven't expired, and $size to limit the number of fruits needed).
You would peel off the skin or chop the fruits into pieces to keep just the useful parts of the fruits (docs), using $project.
You would group all of the fruits into a blender using $group & add ingredients like cream & sugar using $addFields. You're cautious about sugar quantity, so you'll use operators like $size to check the size, $multiply the number of nutrients based on conditions ($cond), and $divide sugar by nutrients to compute the nutrition value.
You'll iterate on adding ice pieces again and again using $map, with $filter to remove uncrushed ice pieces.
Uhh, you always forget to add veggies (a different collection needs to be merged): based on the fruits that already got blended, you use $lookup to look up the matching veggies (docs) & either blend them in again with $group or just put them on top.
Finally, either you drink it straight from the blender (just return the docs, no more stages) or pour it into a glass using the $out or $merge stage (remember, as I said, these are the only two stages that can write to a collection).
Of course, every stage is optional. You can eat the fruits as-is instead of blending them (get all docs with all their fields), but you know you wouldn't do that (for many reasons, like performance and unnecessary data flowing through the network) unless it's a pre-prepared product (like a small configuration collection with limited data and few docs, where every doc is needed & can easily be retrieved in one DB call).
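A sketch of the smoothie as an actual pipeline (all collection and field names here are invented purely for illustration):
// Hypothetical "fruits" collection; each stage mirrors a step of the analogy.
db.fruits.aggregate([
    // Filter: only unexpired fruits of the kinds we want.
    { $match: { $and: [ { kind: { $in: [ "banana", "mango" ] } },
                        { expiryDate: { $gt: ISODate() } } ] } },
    // Peel/chop: keep only the useful fields.
    { $project: { kind: 1, nutrients: 1, sugar: 1 } },
    // Blend: group everything together and compute totals.
    { $group: { _id: null,
                totalSugar: { $sum: "$sugar" },
                totalNutrients: { $sum: "$nutrients" } } },
    // Add ingredients: a computed field ($max guards against division by zero).
    { $addFields: { nutritionValue: { $divide: [ "$totalNutrients",
                                                 { $max: [ "$totalSugar", 1 ] } ] } } },
    // Pour into a glass: write the result to another collection.
    { $out: "smoothies" }
])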
Note:
Usually you can't use aggregation operators in the filter part, i.e. in .find(filter) or in the $match stage, unless you use $expr.
Starting with MongoDB v4.2 you can run an aggregation pipeline in an update (update-with-an-aggregation-pipeline), where you can take advantage of aggregation stages/operators in the update part.
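A minimal sketch of such an update, reusing the hypothetical users collection from earlier (requires MongoDB 4.2+):
// The update part is a pipeline (note the array), so aggregation
// operators like $toInt and $add can be used to compute new values.
db.users.updateMany(
    { },
    [
        { $set: { age: { $toInt: "$age" } } },            // normalize string ages
        { $set: { ageNextYear: { $add: [ "$age", 1 ] } } }
    ]
)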
While you search MongoDB's documentation you will often find the same operator in multiple places. Let's say you search for $size: you'll find multiple references, one as a query operator, one as a projection operator and one as an aggregation operator. So, depending on where you want to use it, refer to the relevant documentation for usage, because even though the name looks similar and it does almost the same thing, the functionality may differ & the syntax also differs.
You can always learn MongoDB for free at MongoDB University.

Getting the latest xx records with Mongoose, how to order them?

I'm trying to get the last 20 records of the user collection with Mongoose:
User.find({ 'owner': req.params.id })
    .sort({ date: -1 })
    .limit(20)
    .exec(.....)
This works well and shows the last 20 items.
But the items inside the array are sorted from the most recent to the oldest. Is there any way to reverse this with Mongoose?
Thanks
You can certainly do this with an aggregation, such as this:
db.user.aggregate([
    { $match : { "owner" : req.params.id } },
    { $sort : { "date" : -1 } },
    { $limit : 20 },
    { $sort : { "date" : 1 } }
])
Notes on this aggregation:
The first three stages do the same job as the find in your question.
The fourth stage applies a further sort, which re-orders the returned 20 records from oldest to most recent.
I have written it in native MongoDB aggregation syntax; you will need to adjust the code to generate the same aggregation from Mongoose.
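For reference, a rough Mongoose equivalent might look like this (a sketch, assuming a User model; note that aggregate() bypasses Mongoose's casting, so if owner is an ObjectId the string from req.params must be cast explicitly):
var mongoose = require('mongoose');

User.aggregate([
    { $match: { owner: new mongoose.Types.ObjectId(req.params.id) } },
    { $sort: { date: -1 } },
    { $limit: 20 },
    { $sort: { date: 1 } }
]).exec(function (err, docs) {
    // docs: the latest 20 records, ordered oldest to newest
});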
Update: I think this is not possible with a find() plus cursor methods, because you would need two different sort() operations. But MongoDB does not treat them as a sequence of independent operations; the docs give an example of methods written in one order, sort().limit(), being equivalent to the opposite order, limit().sort(), showing that the order cannot be relied upon as meaningful.
Find the total count and select only the latest 20. This may not be the most efficient way, but it will solve your problem.
User.count({ 'owner': req.params.id }, function (err, count) {
    if (count) {
        var skipItem = count - 20;
        User.find({ 'owner': req.params.id })
            .skip(skipItem)
            .limit(20)
            .sort({ date: 1 })
            .exec(.....)
    }
});
db.users.aggregate([
    { $match: {
        'owner': req.params.id
    }},
    { $unwind: '$arrayFieldName' },   // replace arrayFieldName with your array field
    { $sort: {
        'arrayFieldName': -1,         // -1 or 1
        'date': -1
    }}
])

MongoDB aggregate works slowly with a 200 GB collection

I have a sharded mongo collection with 10M elements (200 GB).
Document structure:
{
    _id
    updateDate
    cleanDate
    events1: [{...}, {...}, ...]
    events2: [{...}, {...}, ...]
    events3: [{...}, {...}, ...]
}
There are no indexes except _id.
The 1st Java application creates, reads and updates documents in the collection.
The 2nd Java application has a scheduled task that finds documents with updateDate > cleanDate and removes old objects from the eventX arrays. When the task cleans any object, it updates cleanDate.
I use the following query to get the next portion for cleaning:
myCollection.aggregate([
    { $project : { delta : { $cmp : ['$updateDate', '$cleanDate'] } } },
    { $match : { delta : { $gt : 0 } } },
    { $limit : 10000 }
])
The query execution takes a lot of time (sometimes 10 min and more), especially after the first elements have been cleaned in the collection.
How can I speed up my 2nd app?
The point is that your whole collection goes through $project first, and the $match runs only after that. If you know the last updateDate or cleanDate (maybe you store the last scheduled task run time somewhere), you can limit the initial aggregation set by this field with a $match before projecting.
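A sketch of that idea (lastRunTime is a hypothetical value saved by the scheduler; the index lets the leading $match avoid a full collection scan):
db.myCollection.createIndex({ updateDate: 1 })

db.myCollection.aggregate([
    // Narrow the set first, using the last scheduled run time.
    { $match: { updateDate: { $gt: lastRunTime } } },
    // Then the original compare-and-filter stages.
    { $project: { delta: { $cmp: [ '$updateDate', '$cleanDate' ] } } },
    { $match: { delta: { $gt: 0 } } },
    { $limit: 10000 }
])
On MongoDB 3.6+, the $project/$match pair could also be collapsed into a single stage: { $match: { $expr: { $gt: [ '$updateDate', '$cleanDate' ] } } }.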

MongoDB - Aggregation Framework (Total Count)

When running a normal "find" query on MongoDB I can get the total result count (regardless of limit) by running "count" on the returned cursor. So, even if I limit to result set to 10 (for example) I can still know that the total number of results was 53 (again, for example).
If I understand it correctly, the aggregation framework, however, doesn't return a cursor but simply the results. And so, if I used the $limit pipeline operator, how can I know the total number of results regardless of said limit?
I guess I could run the aggregation twice (once to count the results via $group, and once with $limit for the actual limited results), but this seems inefficient.
An alternative approach could be to attach the total number of results to the documents (via $group) prior to the $limit operation, but this also seems inefficient as this number will be attached to every document (instead of just returned once for the set).
Am I missing something here? Any ideas? Thanks!
For example, if this is the query:
db.article.aggregate([
    { $group : {
        _id : "$author",
        posts : { $sum : 1 }
    }},
    { $sort : { posts: -1 } },
    { $limit : 5 }
]);
How would I know how many results are available (before $limit)? The result isn't a cursor, so I can't just run count on it.
There is a solution using $push and $slice: https://stackoverflow.com/a/39784851/4752635 (@emaniacs mentions it here as well).
But I prefer using 2 queries. The solution pushing $$ROOT and using $slice runs into the 16 MB document memory limitation for large collections. Also, for large collections, two queries together seem to run faster than the one with $$ROOT pushing. You can run them in parallel as well, so you are limited only by the slower of the two queries (probably the one which sorts).
The first query filters and then groups by ID to get the number of filtered elements. Do not sort here; it is unnecessary.
The second query filters, sorts and paginates.
I have settled on this solution using 2 queries and the aggregation framework (note: I use node.js in this example):
var aggregation = [
    {
        // If you can match fields at the beginning, match as many as early as possible.
        $match: {...}
    },
    {
        // Projection.
        $project: {...}
    },
    {
        // Some things you can match only after projection or grouping, so do it now.
        $match: {...}
    }
];
// Copy the filtering stages of the pipeline - they are the same for both
// counting the filtered elements and for the pagination query.
var aggregationPaginated = aggregation.slice(0);
// Count filtered elements.
aggregation.push(
    {
        $group: {
            _id: null,
            count: { $sum: 1 }
        }
    }
);
// Sort in the pagination query.
aggregationPaginated.push(
    {
        $sort: sorting
    }
);
// Paginate.
aggregationPaginated.push(
    {
        $limit: skip + length
    },
    {
        $skip: skip
    }
);
// I use mongoose.
// Get total count.
model.count(function (errCount, totalCount) {
    // Count filtered.
    model.aggregate(aggregation)
        .allowDiskUse(true)
        .exec(function (errFind, documents) {
            if (errFind) {
                // Errors.
                res.status(503);
                return res.json({
                    'success': false,
                    'response': 'err_counting'
                });
            }
            else {
                // Number of filtered elements.
                var numFiltered = documents[0].count;
                // Filter, sort and paginate.
                model.aggregate(aggregationPaginated)
                    .allowDiskUse(true)
                    .exec(function (errFindP, documentsP) {
                        if (errFindP) {
                            // Errors.
                            res.status(503);
                            return res.json({
                                'success': false,
                                'response': 'err_pagination'
                            });
                        }
                        else {
                            return res.json({
                                'success': true,
                                'recordsTotal': totalCount,
                                'recordsFiltered': numFiltered,
                                'response': documentsP
                            });
                        }
                    });
            }
        });
});
Assaf, there are going to be some enhancements to the aggregation framework in the near future that may allow you to do your calculations in one pass easily, but right now it is best to perform your calculations by running two queries in parallel: one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors. Also, note that if all you need to do is count documents, using the count function is a very efficient way of performing the calculation. MongoDB caches counts within btree indexes, allowing for very quick counts on queries.
If these aggregations turn out to be slow, there are a couple of strategies. First off, keep in mind that you want to start the query with a $match, if applicable, to reduce the result set. $matches can also be sped up by indexes. Secondly, you can perform these calculations as pre-aggregations. Instead of possibly running these aggregations every time a user accesses some part of your app, have the aggregations run periodically in the background and store the results in a collection that contains pre-aggregated values. This way, your pages can simply query the pre-calculated values from this collection.
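A sketch of the pre-aggregation idea (the authorPostCounts cache collection name is invented; a scheduled background job would run the first command periodically):
// Background job: pre-aggregate post counts per author into a cache collection.
db.article.aggregate([
    { $group: { _id: "$author", posts: { $sum: 1 } } },
    { $out: "authorPostCounts" }   // replaces the cache collection on each run
])

// Request time: read the pre-calculated values cheaply.
db.authorPostCounts.find().sort({ posts: -1 }).limit(5)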
The $facet aggregation stage can be used on Mongo versions >= 3.4.
It allows you to fork the pipeline at a particular stage into multiple sub-pipelines, in this case building one sub-pipeline to count the number of documents and another one for sorting, skipping and limiting.
This avoids running the same stages multiple times in multiple requests.
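A sketch of what that could look like for the query in the question (MongoDB 3.4+; field names follow the original example):
db.article.aggregate([
    { $group: { _id: "$author", posts: { $sum: 1 } } },
    { $facet: {
        // Sub-pipeline 1: total number of grouped results, before any $limit.
        totalCount: [ { $count: "count" } ],
        // Sub-pipeline 2: the actual top-5 page.
        results: [ { $sort: { posts: -1 } }, { $limit: 5 } ]
    }}
])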
If you don't want to run two queries in parallel (one to aggregate the #posts for your top authors, and another aggregation to calculate the total posts for all authors), you can just remove $limit from the pipeline, and on the results you can use
totalCount = results.length;
results.slice(numberOfSkip, numberOfSkip + numberOfLimit);
ex:
db.article.aggregate([
    { $group : {
        _id : "$author",
        posts : { $sum : 1 }
    }},
    { $sort : { posts: -1 } }
    //{ $skip : yourSkip },   //-- remove this
    //{ $limit : yourLimit }, //-- remove this too
]).exec(function(err, results){
    var totalCount = results.length; //-- get total count here
    results = results.slice(yourSkip, yourSkip + yourLimit);
});
I had the same problem, and solved it with $project, $slice and $$ROOT.
db.article.aggregate([
    { $group : {
        _id : '$author',
        posts : { $sum : 1 },
        articles : { $push: '$$ROOT' }
    }},
    { $sort : { posts : -1 } },
    { $project : { total : '$posts', articles : { $slice : [ '$articles', from, to ] } } }
]).toArray(function(err, result){
    var articles = result[0].articles;
    var total = result[0].total;
});
You need to declare the from and to variables.
https://docs.mongodb.com/manual/reference/operator/aggregation/slice/
In my case, we use a $out stage to dump the result set from the aggregation into a temp/cache collection, then count it. And, since we need to sort and paginate the results, we add an index on the temp collection and save the collection name in the session, removing the collection on session close/cache timeout.
I get the total count with aggregate().toArray().length.

Mongo: Ensuring latest nested attribute has a value between given arguments

I have a mongo collection 'books'. Here's a typical book:
BOOK
    name: 'Test Book'
    author: 'Joe Bloggs'
    print_runs: [
        { publisher: 'OUP',            year: 1981 },
        { publisher: 'Penguin',        year: 1987 },
        { publisher: 'Harper-Collins', year: 1992 }
    ]
I'd like to be able to filter books to return only books whose last print run was after a given date, and/or before a given date...and I've been struggling to find a feasible query. Any suggestions appreciated.
There are a few options, as getting access to the "last" element in the array and filtering only on that is difficult/impossible with the normal find options in MongoDB queries. (Unfortunately, you can't $slice with find.)
Store the most recent publisher and year both in the print_runs array and in special (denormalized/copied) fields directly on the book object, Book.last_published_by and Book.last_published_date for example. Queries would be simple and super fast.
MapReduce. It would be simple enough to emit the last element in the array and then "reduce" it to just that. You'd need to do incremental updates on the MapReduce to keep it accurate.
Write a relatively complex aggregation framework expression.
The aggregation might look like:
db.so.aggregate([
    { $project : { _id: 1, "print_run_year" : "$print_runs.year" } },
    { $unwind : "$print_run_year" },
    { $group : { _id : "$_id", "newest" : { $max : "$print_run_year" } } },
    { $match : { "newest" : { $gt : 1991, $lt: 2000 } } }
])
As it may require a bit of explanation:
It projects and unwinds the years of the print runs for each book.
Then it groups on the _id (of the book) and creates a new computed field called newest, which contains the highest print run year (from the projection).
Then it filters on newest using $gt and $lt.
I'd suggest option #1 above would be the best from an efficiency perspective, followed by the MapReduce, and then a distant third, option #3.
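For completeness, a sketch of what option #1 buys you, assuming a hypothetical denormalized field last_published_date is maintained on every write: the filter becomes a plain indexed find instead of an aggregation.
db.books.createIndex({ last_published_date: 1 })
db.books.find({ last_published_date: { $gt: 1991, $lt: 2000 } })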