How can I improve performance on a MongoDB aggregation query?

I am using the following query to get the count of records per day where the air temperature is below 7.2 degrees. The documentation recommends using the aggregation framework since it is faster than map-reduce:
db.maxial.aggregate([
    { $project: {
        time: 1,
        temp: 1,
        frio: {
            $cond: [ { $lte: [ "$temp", 7.2 ] }, 0.25, 0 ]
        }
    }},
    { $match: {
        time: {
            $gte: new Date('11/01/2011'),
            $lt: new Date('11/03/2011')
        }
    }},
    { $group: {
        _id: {
            ord_date: {
                day: { $dayOfMonth: "$time" },
                month: { $month: "$time" },
                year: { $year: "$time" }
            }
        },
        horasFrio: { $sum: '$frio' }
    }},
    { $sort: { '_id.ord_date': 1 } }
])
I get an average execution time of 2 seconds. Am I doing something wrong? I am already using indexes on the time and temp fields.

You might have indexes defined, but you are not using them. For an aggregation pipeline to "use" an index, the $match stage must come first. Also, if you omit the $project entirely and just include the computation in $group, you are doing it in the most efficient way:
db.maxial.aggregate( [
{ "$match": {
"time": {
"$gte": new Date('2011-11-01'),
"$lt": new Date('2011-11-03')
}
}},
{ "$group": {
"_id": {
"day": { "$dayOfMonth": "$time" },
"month": { "$month": "$time" },
"year": { "$year": "$time" }
},
"horasFrio": {
"$sum": {
"$cond": [{ "$lte": [ "$temp", 7.2 ] }, 0.25, 0 ]
}
}
}},
{ "$sort": { "_id": 1} }
])
$project does not provide the benefits people think it does in terms of "reducing fields" in a direct way.
And beware JavaScript Date object constructors. Unless you issue them in the right way you will get a locally converted date rather than the UTC time reference you should be issuing. That and other misconceptions are cleared up in the rewritten listing.
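The Date caveat can be seen directly in plain JavaScript: an ISO-8601 date-only string is parsed as UTC midnight, while a slash-formatted string is parsed in the local timezone, so the stored instant shifts with whoever runs the query.

```javascript
// ISO date-only strings are parsed as UTC midnight (per the ECMAScript
// spec), so the value is the same regardless of the server's timezone.
const utc = new Date('2011-11-01');
console.log(utc.toISOString()); // 2011-11-01T00:00:00.000Z

// Slash-formatted strings are parsed as *local* time, so this only
// equals the UTC value when the local timezone happens to be UTC+0.
const local = new Date('11/01/2011');
console.log(local.getTime() === utc.getTime());
```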

To improve the performance of an aggregation query you have to use the various pipeline stages in the right order.
Use $match first, followed by $limit and $skip if needed. These all shorten the number of records to be traversed for grouping, and hence improve performance.

Related

Count number of rows and get only the last row in MongoDB

I have a collection of posts as follows:
{
"author": "Rothfuss",
"text": "Name of the Wind",
"likes": 1007,
"date": ISODate("2013-03-20T11:30:05Z")
},
{
"author": "Rothfuss",
"text": "Doors of Stone",
"likes": 1,
"date": ISODate("2051-03-20T11:30:05Z")
}
I want to get the count of each author's posts and his/her last post.
There is a SQL answer for the same question here; I am trying to find its MongoDB alternative.
This is the query I have ended up with so far:
db.collection.aggregate([
    { "$group": {
        "_id": "$author",
        "count": { "$sum": 1 },
        "lastPost": {
            "$max": { "_id": "$date", "post": "$text" }
        }
    }}
])
which seems to work, but different runs generate different results. It can be tested here in Mongo playground.
I don't understand how to use $max to select another property from the document containing the maximum. I am new to MongoDB, so describing the basics is also warmly appreciated.
Extra question
Is it possible to limit $sum to only add posts with likes more than 100?
its different runs generate different results.
I don't understand how to use $max to select another property from the document containing the maximum.
$max does not work across multiple fields, and it is also not reliable on a field holding a text/string value. It will select any of the properties from the group of posts, and the pick can differ every time.
For an accurate result you can add a new $sort stage before the $group stage to sort by date in descending order, and in the group stage select the value with the $first operator:
{ $sort: { date: -1 } },
{
$group: {
_id: "$author",
count: { $sum: 1 },
date: { $first: "$date" },
post: { $first: "$text" }
}
}
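In plain JavaScript terms, the $sort-then-$first pattern amounts to sorting newest-first and keeping the first document seen per author. This is a rough sketch using the sample posts from the question, not the server-side implementation:

```javascript
// Hypothetical in-memory posts, mirroring the shape in the question.
const posts = [
  { author: 'Rothfuss', text: 'Name of the Wind', likes: 1007,
    date: new Date('2013-03-20T11:30:05Z') },
  { author: 'Rothfuss', text: 'Doors of Stone', likes: 1,
    date: new Date('2051-03-20T11:30:05Z') },
];

// $sort: { date: -1 } -- newest first.
posts.sort((a, b) => b.date - a.date);

// $group with $first: the first document seen per author is the newest.
const groups = {};
for (const p of posts) {
  if (!groups[p.author]) {
    groups[p.author] = { count: 0, date: p.date, post: p.text };
  }
  groups[p.author].count += 1;
}

console.log(groups['Rothfuss'].post); // Doors of Stone
```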
Is it possible to limit $sum to only add posts with likes more than 100?
There are two possible readings of your requirement, and I am not sure which you are asking, so let me give both solutions.
If you don't want such posts included in count, but you do still want them considered for the last post's date and text:
The $cond checks whether likes is greater than 100, counting 1 if so and 0 otherwise:
db.collection.aggregate([
{ $sort: { date: -1 } },
{
$group: {
_id: "$author",
count: {
$sum: {
$cond: [{ $gt: ["$likes", 100] }, 1, 0]
}
},
date: { $first: "$date" },
post: { $first: "$text" }
}
}
])
Playground
If you don't want such posts counted, and you also don't want them considered as the last post:
You can add a $match stage as the first stage to apply the greater-than condition, and your final query would be:
db.collection.aggregate([
{ $match: { likes: { $gt: 100 } } },
{ $sort: { date: -1 } },
{
$group: {
_id: "$author",
count: { $sum: 1 },
date: { $first: "$date" },
post: { $first: "$text" }
}
}
])
Playground
Your query looks OK to me; adding a $match stage can filter out the posts that don't have likes > 100. (You can also do it in $sum with $cond, but there is no need here.)
Query
The $max accumulator can be used on documents as well.
Here you can see how MongoDB compares documents.
mongoplayground has a problem and loses the order of fields in the documents (it behaves like they are hashmaps when they are not), so test it in your driver as well.
Test code here.
db.collection.aggregate([
{
"$match": {
"likes": {
"$gt": 100
}
}
},
{
"$group": {
"_id": "$author",
"count": {
"$sum": 1
},
"lastPost": {
"$max": {
_id: "$date",
post: "$text"
}
}
}
}
])
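As a rough sketch of why field order matters for that $max: for two embedded documents of the same shape {_id, post}, the comparison proceeds field by field in document order, so it effectively compares _id (the date) first and only falls back to post on a tie. Assuming that behavior, the pick is deterministic:

```javascript
// Sketch of $max over {_id, post} documents: compare _id first, then
// post, mirroring MongoDB's field-by-field document comparison for
// documents with identical field order. Sample data is hypothetical.
function maxDoc(a, b) {
  if (a._id.getTime() !== b._id.getTime()) return a._id > b._id ? a : b;
  return a.post > b.post ? a : b;
}

const result = [
  { _id: new Date('2013-03-20T11:30:05Z'), post: 'Name of the Wind' },
  { _id: new Date('2051-03-20T11:30:05Z'), post: 'Doors of Stone' },
].reduce(maxDoc);

console.log(result.post); // Doors of Stone
```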

Mongo query: compare multiple of the same field (timestamps) to only show documents that have a few second difference

I am spinning my wheels on this. I need to find all documents within a collection that have a timestamp ("createdTs") with a 3-second-or-less difference (to be clear: month, day, year, and time all the same, save those few seconds). An example of the createdTs field (its type is Date): 2021-04-26T20:39:01.851Z
db.getCollection("CollectionName").aggregate([
{ $match: { memberId: ObjectId("1234") } },
{
$project:
{
year: { $year: "$createdTs" },
month: { $month: "$createdTs" },
day: { $dayOfMonth: "$createdTs" },
hour: { $hour: "$createdTs" },
minutes: { $minute: "$createdTs" },
seconds: { $second: "$createdTs" },
milliseconds: { $millisecond: "$createdTs" },
dayOfYear: { $dayOfYear: "$createdTs" },
dayOfWeek: { $dayOfWeek: "$createdTs" },
week: { $week: "$createdTs" }
}
}
])
I've tried a lot of different variations. Where I'm struggling is how to compare these findings to one another. I'd also prefer to just search the entire collection and not match on the memberId field; just collect any documents that have less than a 3-second createdTs difference, and group/display those.
Is this possible? I'm newer to Mongo, and have spun my wheels on this for two days now. Any advice would be greatly appreciated, thank you!
I saw this on another post, but I'm not sure how to utilize it since I want to compare the same field:
db.collection.aggregate([
{ "$project": {
"difference": {
"$divide": [
{ "$subtract": ["$logoutTime", "$loginTime"] },
60 * 1000 * 60
]
}
}},
{ "$group": {
"_id": "$studentID",
"totalDifference": { "$sum": "$difference" }
}},
{ "$match": { "totalDifference": { "$gte": 20 }}}
])
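The divisor in that snippet, 60 * 1000 * 60, is just the number of milliseconds in an hour; subtracting two Dates yields milliseconds, so the same arithmetic works in plain JavaScript. The timestamps below are made up for illustration:

```javascript
// Subtracting two Dates gives the difference in milliseconds;
// divide by ms-per-hour to express it in hours.
const login = new Date('2021-04-26T20:39:01.851Z');
const logout = new Date('2021-04-26T22:39:01.851Z'); // hypothetical, 2h later

const hours = (logout - login) / (60 * 1000 * 60);
console.log(hours); // 2
```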
I am also trying...
db.getCollection("CollectionName").aggregate([
{ $match: { memberId: ObjectId("1234") } },
{
$project:
{
year: { $year: "$createdTs" },
month: { $month: "$createdTs" },
total:
{ $add: ["$year", "$month"] }
}
}
])
But this returns a total of null. I'm not sure if it's because $year and $month are different? The types are both int32, so I thought that would work. I was also wondering whether there is a way to check that the difference in all the other fields is 0 and then, if the seconds difference is $gte 3 when using $group, go from there.

How to get last two values from array with aggregation group by date in mongodb?

UserActivity.aggregate([
{
$match: {
user_id: {$in: user_id},
"tracker_id": {$in:getTrackerId},
date: { $gte: req.app.locals._mongo_date(req.params[3]),$lte: req.app.locals._mongo_date(req.params[4]) }
}
},
{ $sort: {date: 1 } },
{ $unwind: "$duration" },
{
$group: {
_id: {
tracker: '$tracker_id',
$last:"$duration",
year:{$year: '$date'},
month: {$month: '$date'},
day: {$dayOfMonth: '$date'}
},
resultData: {$sum: "$duration"}
}
},
{
$group: {
_id: {
year: "$_id.year",
$last:"$duration",
month:"$_id.month",
day: "$_id.day"
},
resultData: {
$addToSet: {
tracker: "$_id.tracker",
val: "$resultData"
}
}
}
}
], function (err,tAData) {
tAData.forEach(function(key){
console.log(key);
});
});
I got this output from this collection:
{ _id: { year: 2015, month: 11, day: 1 },
resultData:[ { tracker: 55d2e6b043d77c0877105397, val: 60 },
{ tracker: 55d2e6b043d77c0877105397, val: 75 },
{ tracker: 55d2e6b043d77c0877105397, val: 25 },
{ tracker: 55d2e6b043d77c0877105397, val: 21 } ] }
{ _id: { year: 2015, month: 11, day: 2 },
resultData:[ { tracker: 55d2e6b043d77c0877105397, val: 100 },
{ tracker: 55d2e6b043d77c0877105397, val: 110 },
{ tracker: 55d2e6b043d77c0877105397, val: 40 },
{ tracker: 55d2e6b043d77c0877105397, val: 45 } ] }
But I need the output below; I want to fetch the last two records from each group:
{ _id: { year: 2015, month: 11, day: 1 },
resultData:[ { tracker: 55d2e6b043d77c0877105397, val: 25 },
{ tracker: 55d2e6b043d77c0877105397, val: 21 } ] }
{ _id: { year: 2015, month: 11, day: 2 },
resultData:[ { tracker: 55d2e6b043d77c0877105397, val: 40 },
{ tracker: 55d2e6b043d77c0877105397, val: 45 } ] }
You have clear syntax errors in your $group statement with $last, as that is not a valid usage, but I suspect this has something to do with what you are "trying" to do rather than what you are using to get your actual result.
Getting a result with the "best n values" is a bit of a problem for the aggregation framework. There is this recent answer from myself with a longer explanation of the basic case, but it all boils down to the aggregation framework lacking the basic tools to do the "limited" grouping per grouping key that you want.
Doing it badly
The horrible way to approach this is very "iterative" in the number of results you want to return. It basically means pushing everything into an array, then using operators like $first (after sorting in reverse) to return the result off the stack, subsequently "filtering" that result out (think an array pop or shift operation), and then doing it again to get the next one.
Basically this with a 2 iteration example:
UserActivity.aggregate(
[
{ "$match": {
"user_id": { "$in": user_id },
"tracker_id": { "$in": getTrackerId },
"date": {
"$gte": startDate,
"$lt": endDate
}
}},
{ "$unwind": "$duration" },
{ "$group": {
"_id": {
"tracker_id": "$tracker_id",
"date": {
"$add": [
{ "$subtract": [
{ "$subtract": [ "$date", new Date(0) ] },
{ "$mod": [
{ "$subtract": [ "$date", new Date(0) ] },
1000 * 60 * 60 * 24
]}
]},
new Date(0)
]
},
"val": { "$sum": "$duration" }
}
}},
{ "$sort": { "_id": 1, "val": -1 } },
{ "$group": {
"_id": "$_id.date",
"resultData": {
"$push": {
"tracker_id": "$_id.tracker_id",
"val": "$val"
}
}
}},
{ "$unwind": "$resultData" },
{ "$group": {
"_id": "$_id",
"last": { "$first": "$resultData" },
"resultData": { "$push": "$resultData" }
}},
{ "$unwind": "$resultData" },
{ "$redact": {
"if": { "$eq": [ "$resultData", "$last" ] },
"then": "$$PRUNE",
"else": "$$KEEP"
}},
{ "$group": {
"_id": "$_id",
"last": { "$first": "$last" },
"secondLast": { "$first": "$resultData" }
}},
{ "$project": {
    "resultData": {
        "$map": {
            "input": [0,1],
            "as": "index",
            "in": {
                "$cond": {
                    "if": { "$eq": [ "$$index", 0 ] },
                    "then": "$last",
                    "else": "$secondLast"
                }
            }
        }
    }
}}
],
function (err,tAData) {
console.log(JSON.stringify(tAData,undefined,2))
}
);
Dates are also simplified here to startDate and endDate as predetermined Date object values computed before the pipeline code. But the principles show this is not a performant or very scalable approach, mostly due to needing to put all results into an array and then manipulate it just to get the values.
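The $subtract/$mod arithmetic inside that $group _id is the usual trick for truncating a BSON date to its day boundary: milliseconds since epoch, minus the remainder modulo one day, converted back to a date. The same math in plain JavaScript:

```javascript
const OneDay = 1000 * 60 * 60 * 24;

// Truncate a date to its UTC day boundary: take milliseconds since
// the epoch, drop the remainder modulo one day, convert back to Date.
function truncateToDay(date) {
  const ms = date - new Date(0);        // $subtract: [ "$date", new Date(0) ]
  return new Date(ms - (ms % OneDay));  // $subtract the $mod, $add new Date(0)
}

console.log(truncateToDay(new Date('2015-11-01T13:45:30Z')).toISOString());
// 2015-11-01T00:00:00.000Z
```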
Doing it better
A much better approach is to send an aggregation query to the server for each date in the range, as date is what you want as the eventual key. Since you only return each "key" at once, it is easy to just apply $limit to restrict the response.
The ideal case is to perform these queries in parallel and then combine them. Fortunately the node async library provides an async.map or specifically async.mapLimit which performs this function exactly:
N.B. You don't want async.mapSeries here for best performance, since with it the queries are executed serially and only one operation occurs on the server at a time. Its results are array ordered, but it is going to take longer; a client-side sort makes more sense here.
var dates = [],
returnLimit = 2,
OneDay = 1000 * 60 * 60 * 24;
// Produce an array for each date in the range
for (
var myDate = startDate;
myDate < endDate;
myDate = new Date( myDate.valueOf() + OneDay )
) {
dates.push(myDate);
}
async.mapLimit(
dates,
10,
function (date,callback) {
UserActivity.aggregate(
[
{ "$match": {
"user_id": { "$in": user_id },
"tracker_id": { "$in": getTrackerId },
"date": {
"$gte": date,
"$lt": new Date( date.valueOf() + OneDay )
}
}},
{ "$unwind": "$duration" },
{ "$group": {
"_id": {
"tracker_id": "$tracker_id",
"date": {
"$add": [
{ "$subtract": [
{ "$subtract": [ "$date", new Date(0) ] },
{ "$mod": [
{ "$subtract": [ "$date", new Date(0) ] },
OneDay
]}
]},
new Date(0)
]
},
"val": { "$sum": "$duration" }
}
}},
{ "$sort": { "_id": 1, "val": -1 } },
{ "$limit": returnLimit },
{ "$group": {
"_id": "$_id.date",
"resultData": {
"$push": {
"tracker_id": "$_id.tracker_id",
"val": "$val"
}
}
}}
],
function (err,result) {
callback(err,result[0]);
}
);
},
function (err,results) {
if (err) throw err;
results.sort(function (a,b) {
return a._id - b._id;
});
console.log( JSON.stringify( results, undefined, 2 ) );
}
);
Now that is a much cleaner listing and a lot more efficient and scalable than the first approach. By issuing each aggregation per single date and then combining the results, the "limit" there allows up to 10 queries to execute on the server at the same time ( tune to your needs ) and ultimately return a singular response.
Since these are "async" and not performed in series ( the best performance option ) then you just need to sort the returned array as is done in the final block:
results.sort(function (a,b) {
return a._id - b._id;
});
And then everything is ordered as you would expect in the response.
Forcing the aggregation pipeline to do this where it really is not necessary is a sure path to code that will fail in the future, if it does not do so already. Parallel query operations with combined results "just make sense" for efficient and scalable output.
Also note that you should not use $lte with range selections on dates. Even if you have thought it through, the better approach is a "startDate" with an "endDate" that is the start of the next "whole day" after the range you want. This makes a cleaner selection boundary than saying "the last second of the day".

Aggregate with a Composite Key

I am trying to aggregate some data and group it by time intervals while also maintaining a sub-category, if you will. I want to be able to chart this data so that I have multiple lines, one per office that was called. The X axis will be the time intervals and the Y axis will be the average ring time.
My data looks like this:
Calls: [{
created: ISODate(xyxyx),
officeCalled: 'ABC Office',
answeredAt: ISODate(xyxyx)
},
{
created: ISODate(xyxyx),
officeCalled: 'Office 2',
answeredAt: ISODate(xyxyx)
},
{
created: ISODate(xyxyx),
officeCalled: 'Office 3',
answeredAt: ISODate(xyxyx)
}];
My goal is to get my calls grouped by Time Intervals (30 Minutes/1 Hour/1 Day) AND by the Office Called. So when my aggregate completes, I'm looking for data like this:
[{"_id":TimeInterval1,"calls":[{"office":"ABC Office","ringTime":30720},
{"office":"Office2","ringTime":3070}]},
{"_id":TimeInterval2,"calls":[{"office":"Office1","ringTime":1125},
{"office":"ABC Office","ringTime":15856}]}]
I have been poking around for the past few hours and I was able to aggregate my data, but I haven't figured out how to group it properly so that I have each time interval along with the office data. Here is my latest code:
Call.aggregate([
{$match: {
$and: [
{created: {$exists: 1}},
{answeredAt: {$exists: 1}}]}},
{$project: { created: 1,
officeCalled: 1,
answeredAt: 1,
timeToAns: {$subtract: ["$answeredAt", "$created"]}}},
{$group: {_id: {"day": {"$dayOfYear": "$created"},
"hour": {
"$subtract": [
{"$hour" : "$created"},
{"$mod": [ {"$hour": "$created"}, 2]}
]
},
"officeCalled": "$officeCalled"
},
avgRingTime: {$avg: '$timeToAns'},
total: {$sum: 1}}},
{"$group": {
"_id": "$_id.day",
"calls": {
"$push": {
"office": "$_id.officeCalled",
"ringTime": "$avgRingTime"
},
}
}},
{$sort: {_id: 1}}
]).exec(function(err, results) {
//My results look like this
[{"_id":118,"calls":[{"office":"ABC Office","ringTime":30720},
{"office":"Office 2","ringTime":31384.5},
{"office":"Office 3","ringTime":7686.066666666667},...];
});
This just doesn't quite get it... I get my data, but it's broken down by day only, not the 2-hour time interval that I was shooting for. Let me know if I'm doing this all wrong, please; I am VERY NEW to aggregation, so your help is very much appreciated.
Thank you!!
All you really need to do is include both parts of the _id value you want in the final $group. No idea why you thought to reference only a single field.
Also, "lose the $project", as it is just wasted cycles and processing when you can compute directly in the first $group:
Call.aggregate(
[
{ "$match": {
"created": { "$exists": 1 },
"answeredAt": { "$exists": 1 }
}},
{ "$group": {
"_id": {
"day": {"$dayOfYear": "$created"},
"hour": {
"$subtract": [
{"$hour" : "$created"},
{"$mod": [ {"$hour": "$created"}, 2]}
]
},
"officeCalled": "$officeCalled"
},
"avgRingTime": {
"$avg": { "$subtract": [ "$answeredAt", "$created" ] }
},
"total": { "$sum": 1 }
}},
{ "$group": {
"_id": {
"day": "$_id.day",
"hour": "$_id.hour"
},
"calls": {
"$push": {
"office": "$_id.officeCalled",
"ringTime": "$avgRingTime"
},
},
"total": { "$sum": "$total" }
}},
{ "$sort": { "_id": 1 } }
]
).exec(function(err, results) {
});
Also note the complete omission of $and. It is not needed, as all MongoDB query arguments are already "AND" conditions anyway, unless specifically stated otherwise. Just stick to what is simple. It's meant to be simple.
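The hour expression in the $group key, $hour minus $hour mod 2, just rounds the hour down to an even number, which is what produces the two-hour buckets. In plain JavaScript:

```javascript
// Round an hour down to the nearest even number: hours 0-1 map to
// bucket 0, hours 2-3 to bucket 2, and so on up to 22-23 -> 22.
function twoHourBucket(hour) {
  return hour - (hour % 2);
}

console.log(twoHourBucket(13)); // 12
console.log(twoHourBucket(14)); // 14
```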

Mongodb mapreduce sorting (optimization) or alternative

I have a few documents that look like this:
{
'page_id': 123131,
'timestamp': ISODate('2014-06-10T12:13:59'),
'processed': false
}
The documents have other fields, but these are the only ones relevant for this purpose. On this collection there is also an index on these fields:
{
    'page_id': 1,
    'timestamp': -1
}
I run a mapreduce that returns distinct (page_id, day) results, with day being the date-portion of the timestamp (in the above, it would be 2014-06-10).
This is done with the following mapreduce:
function() {
emit({
site_id: this.page_id,
day: Date.UTC(this.timestamp.getUTCFullYear(),
this.timestamp.getUTCMonth(),
this.timestamp.getUTCDate())
}, {
count: 1
});
}
The reduce function basically just returns { count: 1 }, as I am not really interested in the number, just unique tuples.
I wish to make this more efficient. I tried adding sort: { 'page_id' }, but it triggers an error; googling shows that I can apparently only sort by the key, but since this is not a "raw" key, how does that work?
Also, is there an alternative to this mapreduce that is faster? I know MongoDB has distinct, but from what I can gather it only works on one field. Might the group aggregate function be relevant?
The aggregation framework would seem more appropriate, since it runs in native code whereas mapReduce runs under a JavaScript interpreter instance. MapReduce has its uses, but generally the aggregation framework is best suited to common tasks that do not require the specific processing that only JavaScript methods allow:
db.collection.aggregate([
{ "$group": {
"_id": {
"page": "$page_id",
"day": {
"year": { "$year": "$timestamp" },
"month": { "$month": "$timestamp" },
"day": { "$dayOfMonth": "$timestamp" },
}
},
"count": { "$sum": 1 }
}}
])
This largely makes use of the date aggregation operators. See other aggregation framework operators for more details.
Of course if you wanted to reverse sort those unique dates (which is the opposite of what mapReduce will do) or other fields then just add a $sort to the end of the pipeline for what you want:
db.collection.aggregate([
{ "$group": {
"_id": {
"page": "$page_id",
"day": {
"year": { "$year": "$timestamp" },
"month": { "$month": "$timestamp" },
"day": { "$dayOfMonth": "$timestamp" },
}
},
"count": { "$sum": 1 }
}},
{ "$sort": {
    "_id.day.year": -1, "_id.day.month": -1, "_id.day.day": -1
}}
])
You might want to look at the aggregation framework.
A query like this:
collection.aggregate([
    { $group: {
        _id: {
            year: { $year: "$timestamp" },
            month: { $month: "$timestamp" },
            day: { $dayOfMonth: "$timestamp" },
            pageId: "$page_id"
        }
    }}
])
will give you all unique combinations of the fields you're looking for.
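Conceptually, grouping on a compound _id with no accumulators is just deduplicating (page_id, day) tuples. A plain-JavaScript sketch of the same idea, using made-up documents shaped like the ones in the question:

```javascript
// Deduplicate (page_id, day) tuples with a Set keyed on a composite
// string; each distinct key corresponds to one $group output document.
const docs = [
  { page_id: 123131, timestamp: new Date('2014-06-10T12:13:59Z') },
  { page_id: 123131, timestamp: new Date('2014-06-10T18:00:00Z') },
  { page_id: 555,    timestamp: new Date('2014-06-11T01:00:00Z') },
];

const unique = new Set(docs.map(d =>
  `${d.page_id}:${d.timestamp.toISOString().slice(0, 10)}`));

console.log(unique.size); // 2
```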