Mongo aggregation with paginated data and totals - mongodb

I've crawled all over stack overflow, and have not found any info on how to return proper pagination data included in the resultset.
I'm trying to aggregate some data from my mongo store. What I want, is to have something return:
{
  total: 5320,
  page: 0,
  pageSize: 10,
  data: [
    {
      _id: 234,
      currentEvent: "UPSTREAM_QUEUE",
      events: [
        { ... }, { ... }, { ... }
      ]
    },
    {
      _id: 235,
      currentEvent: "UPSTREAM_QUEUE",
      events: [
        { ... }, { ... }, { ... }
      ]
    }
  ]
}
This is what I have so far:
// page and pageSize are variables
db.mongoAuditEvent.aggregate([
  // Actual grouped data
  {"$group": {
    "_id": "$corrId",
    "currentEvent": {"$last": "$event.status"},
    "events": {$push: "$$ROOT"}
  }},
  // Pagination group
  {"$group": {
    "_id": 0,
    "total": {"$sum": "corrId"},
    "page": page,
    "pageSize": pageSize,
    "data": {
      "$push": {
        "_id": "$_id",
        "currentEvent": "$currentEvent",
        "events": "$events"
      }
    }
  }},
  {"$sort": {"events.timestamp": -1}}, // Latest first
  {"$skip": page},
  {"$limit": pageSize}
], {allowDiskUse: true});
I'm trying to have a pagination group as the root, containing the actual grouped data inside (so that I get actual totals while still retaining skip and limit).
The above code will return the following error in mongo console:
The field 'page' must be an accumulator object
If I remove the page and pageSize from the pagination group, I still get the following error:
BSONObj size: 45707184 (0x2B96FB0) is invalid. Size must be between 0 and 16793600(16MB) First element: id: 0
If I remove the pagination group altogether, the query works fine. But I really need to return the total number of documents stored, and although not strictly necessary, returning page and pageSize would be nice as well.
Can somebody please tell me what I am doing wrong? Or tell me if it is at all possible to do this in one go?

If you have a lot of events, {$push: "$$ROOT"} will make Mongo return an error (the 16MB BSON document limit). I solved it with $facet (only works with version 3.4+):
aggregate([
  { $match: options },
  {
    $facet: {
      edges: [
        { $sort: sort },
        { $skip: skip },
        { $limit: limit },
      ],
      pageInfo: [
        { $group: { _id: null, count: { $sum: 1 } } },
      ],
    },
  },
])
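To make the shape of the $facet output concrete, here is a plain-JavaScript sketch that mimics the two sub-pipelines on an in-memory array (the documents, the ts sort key, and the skip/limit values are all made up for illustration):

```javascript
// In-memory sketch of the $facet pagination pipeline above.
// `docs` stands in for the matched documents; field names are illustrative.
const docs = [
  { _id: 1, ts: 30 },
  { _id: 2, ts: 10 },
  { _id: 3, ts: 50 },
  { _id: 4, ts: 20 },
  { _id: 5, ts: 40 },
];

function paginate(docs, { skip, limit }) {
  // edges sub-pipeline: $sort (ts descending), then $skip, then $limit
  const edges = [...docs]
    .sort((a, b) => b.ts - a.ts)
    .slice(skip, skip + limit);
  // pageInfo sub-pipeline: $group with {$sum: 1} counts every input doc
  const pageInfo = [{ _id: null, count: docs.length }];
  return { edges, pageInfo };
}

const page = paginate(docs, { skip: 1, limit: 2 });
console.log(page.edges.map((d) => d._id)); // [5, 1]
console.log(page.pageInfo[0].count); // 5
```

Both facets run over the same input, which is why the count reflects all matched documents while edges holds only the requested page.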

A performance optimization tip:
When you use a $facet stage for pagination, try to add it as early in the pipeline as possible.
For example, if you want to add a $project or $lookup stage, add it after $facet, not before it.
This can have an impressive effect on aggregation speed, because a $project stage makes MongoDB process every document and touch every field, which is unnecessary work before the result set has been narrowed down.

Did this in two steps instead of one:
// Get the totals
db.mongoAuditEvent.aggregate([
  {$group: {_id: "$corrId"}},
  {$group: {_id: 1, total: {$sum: 1}}}
]);
// Get the data
db.mongoAuditEvent.aggregate([
  {$group: {
    _id: "$corrId",
    currentEvent: {"$last": "$event.status"},
    "events": {$push: "$$ROOT"}
  }},
  {$sort: {"events.timestamp": -1}}, // Latest first
  {$skip: 0},
  {$limit: 10}
], {allowDiskUse: true}).pretty();
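For what it's worth, stitching the two results into the response shape from the question is then plain application-side code. A sketch, with the two query results faked as in-memory arrays (the values are illustrative):

```javascript
// Pretend results of the two aggregate() calls above.
const totalsResult = [{ _id: 1, total: 5320 }]; // from the totals pipeline
const dataResult = [
  { _id: 234, currentEvent: "UPSTREAM_QUEUE", events: [] },
  { _id: 235, currentEvent: "UPSTREAM_QUEUE", events: [] },
];

// Combine both query results into the response shape from the question.
function buildResponse(totalsResult, dataResult, page, pageSize) {
  return {
    total: totalsResult.length ? totalsResult[0].total : 0,
    page,
    pageSize,
    data: dataResult,
  };
}

const response = buildResponse(totalsResult, dataResult, 0, 10);
console.log(response.total, response.data.length); // 5320 2
```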
I would be very happy if anybody got a better solution to this though.


MongoDB sum returning always 0

I am new to MongoDB and I'm trying to create a query that prints all the combinations of points that were assigned to the accommodations, sorted by the number of accommodations that received those points. However, when I execute this query, $sum always returns 0 even though the 3 fields are numeric values:
db.test.aggregate([
  {$addFields: {sumPoints: {$sum: ["$lodging.reviews.cleanliness", "lodging.reviews.location", "lodging.reviews.food"]}}},
  {$group: {
    _id: "$sumPoints",
    count: {$sum: 1}
  }},
  {$sort: {count: 1}},
  {$project: {_id: 0, count: 1, sumPoints: "$_id"}}
])
The image below shows an example document.
[Document example]
Does anyone know what can be the problem?
I tried with that query and the result is just:
{ count: 5984, sumPoints: 0 }
because sumPoints is always returning 0.
I think there are two problems. The first is that you are missing the dollar sign (to indicate that you want to access the fields) on the second and third items. But on top of that, it seems that $sum cannot add up values from the different arrays by itself. Summing the per-field sums worked:
{
  "$addFields": {
    "sumPoints": {
      "$sum": [
        { "$sum": ["$lodging.reviews.cleanliness"] },
        { "$sum": ["$lodging.reviews.location"] },
        { "$sum": ["$lodging.reviews.food"] }
      ]
    }
  }
}
Playground example here
Alternatively, you can use the $reduce operator here:
{
  "$addFields": {
    "sumPoints": {
      "$reduce": {
        "input": "$lodging.reviews",
        "initialValue": 0,
        "in": {
          "$sum": ["$$value", "$$this.cleanliness", "$$this.location", "$$this.food"]
        }
      }
    }
  }
}
Playground example here
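The $reduce version corresponds directly to an ordinary fold over the array. A plain-JavaScript sketch of the same computation, with a made-up reviews array standing in for $lodging.reviews:

```javascript
// Illustrative reviews array (stands in for $lodging.reviews).
const reviews = [
  { cleanliness: 4, location: 5, food: 3 },
  { cleanliness: 5, location: 4, food: 4 },
];

// Equivalent of the $reduce expression: start at initialValue 0 and add the
// three numeric fields of each review to the running $$value.
const sumPoints = reviews.reduce(
  (value, r) => value + r.cleanliness + r.location + r.food,
  0
);
console.log(sumPoints); // 25
```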
In the future please also provide the text for your sample documents (or, better yet, a playground example directly) so that it is easier to assist.

Sort element with property: true to the top, but only one out of many

My app can search through a database of resources using MongoDB's aggregation pipeline. Some of these documents have the property sponsored: true.
I want to move exactly one of these sponsored entries to the top of the search results, but keep the natural ordering for the remaining ones (no matter whether sponsored or not).
Below is my code. My idea was to make use of addFields but change the logic so that it only applies to the first element that meets the condition. Is this possible?
[...]
const aggregationResult = await Resource.aggregate()
  .search({
    compound: {
      must: [
        [...]
      ],
      should: [
        [...]
      ]
    }
  })
  [...]
  // only do this for the first sponsored result
  .addFields({
    selectedForSponsoredSlot: { $cond: [{ $eq: ['$sponsored', true] }, true, false] }
  })
  .sort({
    selectedForSponsoredSlot: -1,
    _id: 1
  })
  .facet({
    results: [
      { $match: matchFilter },
      { $skip: (page - 1) * pageSize },
      { $limit: pageSize },
    ],
    totalResultCount: [
      { $match: matchFilter },
      { $group: { _id: null, count: { $sum: 1 } } }
    ],
    [...]
  })
  .exec();
[...]
Update:
One option is to change your $facet a bit:
Move the $match out of the $facet, since it is relevant to all the pipelines.
Instead of two pipelines, one for the results and one for the counting, there are now three: one more for sponsored documents only.
Remove items that were already seen on previous pages, according to the sponsored item's relevance score.
Remove the item that is in the sponsored array from the allDocs array (if it is on this page).
$slice the allDocs array to the right size, so the sponsored item completes the wanted pageSize.
$project to concatenate the sponsored and allDocs docs.
db.collection.aggregate([
  {$sort: {relevance: -1, _id: 1}},
  {$match: matchFilter},
  {$facet: {
    allDocs: [{$skip: (page - 1) * (pageSize - 1)}, {$limit: pageSize + 1}],
    sponsored: [{$match: {sponsored: true}}, {$limit: 1}],
    count: [{$count: "total"}]
  }},
  {$set: {
    allDocs: {
      $slice: [
        "$allDocs",
        {$cond: [{$gte: [{$first: "$sponsored.relevance"},
                         {$first: "$allDocs.relevance"}]}, 1, 0]},
        pageSize + 1
      ]
    }
  }},
  {$set: {
    allDocs: {
      $filter: {
        input: "$allDocs",
        cond: {$not: {$in: ["$$this._id", "$sponsored._id"]}}
      }
    }
  }},
  {$set: {allDocs: {$slice: ["$allDocs", 0, (pageSize - 1)]}}},
  {$project: {
    results: {$concatArrays: ["$sponsored", "$allDocs"]},
    totalResultCount: {$first: "$count.total"}
  }}
])
See how it works on the playground example
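Stripped of the paging arithmetic, the core of the trick can be sketched in plain JavaScript: from a relevance-ordered list, pull out at most one sponsored document and move it to the front, leaving everything else in its natural order (the documents here are made up):

```javascript
// Documents already sorted by relevance (descending); data is illustrative.
const results = [
  { _id: 1, relevance: 9, sponsored: false },
  { _id: 2, relevance: 8, sponsored: true },
  { _id: 3, relevance: 7, sponsored: false },
  { _id: 4, relevance: 6, sponsored: true },
];

function promoteOneSponsored(docs) {
  // Counterpart of the sponsored facet: {$match: {sponsored: true}}, {$limit: 1}
  const first = docs.find((d) => d.sponsored);
  if (!first) return docs;
  // Counterpart of the $filter + $concatArrays stages: drop the promoted
  // document from its natural position and put it on top.
  return [first, ...docs.filter((d) => d._id !== first._id)];
}

console.log(promoteOneSponsored(results).map((d) => d._id)); // [2, 1, 3, 4]
```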

Efficiently find the most recent filtered document in MongoDB collection using datetime field

I have a large collection of documents with datetime fields in them, and I need to retrieve the most recent document for any given queried list.
Sample data:
[
  {"_id": "42.abc", "ts_utc": "2019-05-27T23:43:16.963Z"},
  {"_id": "42.def", "ts_utc": "2019-05-27T23:43:17.055Z"},
  {"_id": "69.abc", "ts_utc": "2019-05-27T23:43:17.147Z"},
  {"_id": "69.def", "ts_utc": "2019-05-27T23:44:02.427Z"}
]
Essentially, I need to get the most recent record for the "42" group as well as the most recent record for the "69" group. Using the sample data above, the desired result for the "42" group would be document "42.def".
My current solution is to query each group one at a time (looping with PyMongo), sort by the ts_utc field, and limit it to one, but this is really slow.
// Requires official MongoShell 3.6+
db = db.getSiblingDB("someDB");
db.getCollection("collectionName").find(
  { "_id": /^42\..*/ }
).sort(
  { "ts_utc": -1.0 }
).limit(1);
Is there a faster way to get the results I'm after?
Assuming all your documents have the format displayed above, you can split the _id into two parts (on the dot character) and use aggregation to find the max ts_utc per leading (numeric) element.
That way you can do it in one shot, instead of iterating per group.
db.foo.aggregate([
{ $project: { id_parts : { $split: ["$_id", "."] }, ts_utc : 1 }},
{ $group: {"_id" : { $arrayElemAt: [ "$id_parts", 0 ] }, max : {$max: "$ts_utc"}}}
])
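In plain-JavaScript terms, that pipeline is a split followed by a per-key maximum. A sketch using the sample documents from the question:

```javascript
const docs = [
  { _id: "42.abc", ts_utc: "2019-05-27T23:43:16.963Z" },
  { _id: "42.def", ts_utc: "2019-05-27T23:43:17.055Z" },
  { _id: "69.abc", ts_utc: "2019-05-27T23:43:17.147Z" },
  { _id: "69.def", ts_utc: "2019-05-27T23:44:02.427Z" },
];

// $project with $split, then $group with $max over ts_utc.
// ISO-8601 strings compare correctly as plain strings.
const maxPerGroup = {};
for (const doc of docs) {
  const key = doc._id.split(".")[0]; // $arrayElemAt 0 of the $split
  if (!(key in maxPerGroup) || doc.ts_utc > maxPerGroup[key]) {
    maxPerGroup[key] = doc.ts_utc; // $max accumulator
  }
}
console.log(maxPerGroup);
// { "42": "2019-05-27T23:43:17.055Z", "69": "2019-05-27T23:44:02.427Z" }
```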
As #danh mentioned in the comments, the best approach is probably to add an auxiliary field to indicate the grouping. You may further index the auxiliary field to boost performance.
Here is an ad-hoc way to derive the field and get the latest result per grouping:
db.collection.aggregate([
  {
    "$addFields": {
      "group": {
        "$arrayElemAt": [{ "$split": ["$_id", "."] }, 0]
      }
    }
  },
  { $sort: { ts_utc: -1 } },
  {
    "$group": {
      "_id": "$group",
      "doc": { "$first": "$$ROOT" }
    }
  },
  { "$replaceRoot": { "newRoot": "$doc" } }
])
Here is the Mongo playground for your reference.
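The sort-then-$first pattern is easy to mirror in plain JavaScript, which also makes the expected result easy to check (sample documents copied from the question):

```javascript
const docs = [
  { _id: "42.abc", ts_utc: "2019-05-27T23:43:16.963Z" },
  { _id: "42.def", ts_utc: "2019-05-27T23:43:17.055Z" },
  { _id: "69.abc", ts_utc: "2019-05-27T23:43:17.147Z" },
  { _id: "69.def", ts_utc: "2019-05-27T23:44:02.427Z" },
];

// $addFields group + $sort ts_utc descending + $group with $first.
function latestPerGroup(docs) {
  const latest = {};
  for (const doc of [...docs].sort((a, b) => b.ts_utc.localeCompare(a.ts_utc))) {
    const group = doc._id.split(".")[0]; // the derived auxiliary field
    if (!(group in latest)) latest[group] = doc; // $first after the sort
  }
  return latest;
}

const latestDocs = latestPerGroup(docs);
console.log(latestDocs["42"]._id, latestDocs["69"]._id); // 42.def 69.def
```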

Aggregation with update in mongoDB

I have a collection with many similarly structured documents; two of them look like this:
Input:
{
  "_id": ObjectId("525c22348771ebd7b179add8"),
  "cust_id": "A1234",
  "score": 500,
  "status": "A",
  "clear": "No"
}
{
  "_id": ObjectId("525c22348771ebd7b179add9"),
  "cust_id": "A1234",
  "score": 1600,
  "status": "B",
  "clear": "No"
}
By default, clear is "No" for every document.
Requirement: I have to add up the score of all documents with the same cust_id, provided they have status "A" or status "B". If the total score exceeds 2000, then I have to update the clear attribute to "Yes" for all of the documents with that cust_id.
Expected output:
{
  "_id": ObjectId("525c22348771ebd7b179add8"),
  "cust_id": "A1234",
  "score": 500,
  "status": "A",
  "clear": "Yes"
}
{
  "_id": ObjectId("525c22348771ebd7b179add9"),
  "cust_id": "A1234",
  "score": 1600,
  "status": "B",
  "clear": "Yes"
}
Yes because 1600+500 = 2100, and 2100 > 2000.
My Approach:
I was only able to get the sum with an aggregation, but failed at updating:
db.aggregation.aggregate([
  {$match: {
    $or: [
      {status: 'A'},
      {status: 'B'}
    ]
  }},
  {$group: {
    _id: '$cust_id',
    total: {$sum: '$score'}
  }},
  {$match: {
    total: {$gt: 2000}
  }}
])
Please suggest me how do I proceed.
After a lot of trouble experimenting in the mongo shell, I've finally got a solution to my question.
Pseudocode:
# To get the list of customers whose total score is greater than 2000
cust_to_clear = db.col.aggregate(
  {$match: {$or: [{status: 'A'}, {status: 'B'}]}},
  {$group: {_id: '$cust_id', total: {$sum: '$score'}}},
  {$match: {total: {$gt: 2000}}}
)
# To loop through the result fetched above and update clear
cust_to_clear.result.forEach(
  function(x) {
    db.col.update({cust_id: x._id}, {$set: {clear: 'Yes'}}, {multi: true});
  }
)
Please comment, if you have any different solution for the same question.
With Mongo 4.2 it is now possible to do this using an update with an aggregation pipeline. Example 2 in the docs shows how to do conditional updates:
db.runCommand({
  update: "students",
  updates: [
    {
      q: {},
      u: [
        { $set: { average: { $avg: "$tests" } } },
        { $set: { grade: { $switch: {
          branches: [
            { case: { $gte: ["$average", 90] }, then: "A" },
            { case: { $gte: ["$average", 80] }, then: "B" },
            { case: { $gte: ["$average", 70] }, then: "C" },
            { case: { $gte: ["$average", 60] }, then: "D" }
          ],
          default: "F"
        } } } }
      ],
      multi: true
    }
  ],
  ordered: false,
  writeConcern: { w: "majority", wtimeout: 5000 }
})
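The $avg plus $switch logic in that update pipeline is just a cascading comparison. A plain-JavaScript rendering of the same branches (the test scores are made up):

```javascript
// Mirrors the $avg and $switch stages of the update pipeline above.
function gradeFor(tests) {
  const average = tests.reduce((sum, t) => sum + t, 0) / tests.length;
  if (average >= 90) return "A";
  if (average >= 80) return "B";
  if (average >= 70) return "C";
  if (average >= 60) return "D";
  return "F"; // the switch's default branch
}

console.log(gradeFor([95, 92, 90])); // A
console.log(gradeFor([61, 70, 65])); // average ≈ 65.3 → D
console.log(gradeFor([40, 50])); // F
```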
Another example:
db.c.update({}, [
{$set:{a:{$cond:{
if: {}, // some condition
then:{} , // val1
else: {} // val2 or "$$REMOVE" to not set the field or "$a" to leave existing value
}}}}
]);
You need to do this in two steps:
Identify customers (cust_id) with a total score greater than 2000
For each of these customers, set clear to Yes
You already have a good solution for the first part. The second part should be implemented as separate update() calls to the database.
Pseudocode:
# Get list of customers using the aggregation framework
cust_to_clear = db.col.aggregate(
{$match:{$or:[{status:'A'},{status:'B'}]}},
{$group:{_id:'$cust_id', total:{$sum:'$score'}}},
{$match:{total:{$gt:2000}}}
)
# Loop over customers and update "clear" to "yes"
for customer in cust_to_clear:
id = customer[_id]
db.col.update(
{"_id": id},
{"$set": {"clear": "Yes"}}
)
This isn't ideal because you have to make a database call for every customer. If you need to do this kind of operation often, you might revise your schema to include the total score in each document. (This would have to be maintained by your application.) In this case, you could do the update with a single command:
db.col.update(
{"total_score": {"$gt": 2000}},
{"$set": {"clear": "Yes"}},
{"multi": true}
)
Short Answer: To avoid looping a Database query, just add $merge to the end and specify your collection like so:
db.aggregation.aggregate([
  {$match: {
    $or: [
      {status: 'A'},
      {status: 'B'}
    ]
  }},
  {$group: {
    _id: '$cust_id',
    total: {$sum: '$score'}
  }},
  {$match: {
    total: {$gt: 2000}
  }},
  {$merge: "<collection name here>"}
])
Elaboration: The accepted solution loops through a database query, which is not time-efficient, and it is also a lot more code.
Mitar's answer is not updating through an aggregation, but the opposite: using an aggregation within Mongo's update. If you're wondering what the advantage of doing it this way is, well, you can use the whole aggregation pipeline, as opposed to being restricted to only the few stages specified in the update documentation.
Here is an example of an aggregate that won't work with Mongo's update:
db.getCollection('foo').aggregate([
  { $addFields: {
    testField: {
      $in: ["someValueInArray", '$arrayFieldInFoo']
    }
  }},
  { $merge: "foo" }
])
This will output the updated collection with a new test field that will be true if "someValueInArray" is in "arrayFieldInFoo" or false otherwise. This is NOT possible currently with Mongo.update since $in cannot be used inside update aggregate.
Update: Changed from $out to $merge, since $out would only work when updating the entire collection, as $out replaces the entire collection with the result of the aggregate. $merge only overwrites where the aggregate matches a document (much safer).
In MongoDB 2.6, it will be possible to write the output of an aggregation query with the same command.
More information here : http://docs.mongodb.org/master/reference/operator/aggregation/out/
The solution which I found is using "$out"
*) e.g. adding a field:
db.socios.aggregate([
  {
    $lookup: {
      from: 'cuotas',
      localField: 'num_socio',
      foreignField: 'num_socio',
      as: 'cuotas'
    }
  },
  {
    $addFields: { codigo_interno: 1001 }
  },
  {
    $out: 'socios' // Collection to modify
  }
])
*) e.g. modifying a field:
db.socios.aggregate([
  {
    $lookup: {
      from: 'cuotas',
      localField: 'num_socio',
      foreignField: 'num_socio',
      as: 'cuotas'
    }
  },
  {
    $set: { codigo_interno: 1001 }
  },
  {
    $out: 'socios' // Collection to modify
  }
])

mongodb aggregation framework group + project

I have the following issue:
This query returns 1 result, which is what I want:
> db.items.aggregate([ {$group: { "_id": "$id", version: { $max: "$version" } } }])
{
  "result" : [
    {
      "_id" : "b91e51e9-6317-4030-a9a6-e7f71d0f2161",
      "version" : 1.2000000000000002
    }
  ],
  "ok" : 1
}
This query (I just added a projection so I can later query for the entire document) returns multiple results. What am I doing wrong?
> db.items.aggregate([ {$group: { "_id": "$id", version: { $max: "$version" } }, $project: { _id : 1 } }])
{
  "result" : [
    {
      "_id" : ObjectId("5139310a3899d457ee000003")
    },
    {
      "_id" : ObjectId("513931053899d457ee000002")
    },
    {
      "_id" : ObjectId("513930fd3899d457ee000001")
    }
  ],
  "ok" : 1
}
Found the answer.
1. First I need to get all the _ids:
db.items.aggregate([
  { '$match': { 'owner.id': '9e748c81-0f71-4eda-a710-576314ef3fa' } },
  { '$group': { _id: '$item.id', dbid: { $max: "$_id" } } }
]);
2. Then I need to query the documents:
db.items.find({ _id: { '$in': "IDs returned from aggregate" } });
which will look like this:
db.items.find({ _id: { '$in': [ '1', '2', '3' ] } });
(I know it's late, but I'm still answering so that other people don't have to go searching for the right answer somewhere else.)
See the answer from Deka; that will do the job.
Not all accumulators are available in the $project stage. We need to consider what we can do in $project with respect to accumulators and what we can do in $group. Let's take a look at this:
db.companies.aggregate([
  {
    $match: { funding_rounds: { $ne: [] } }
  },
  { $unwind: "$funding_rounds" },
  {
    $sort: {
      "funding_rounds.funded_year": 1,
      "funding_rounds.funded_month": 1,
      "funding_rounds.funded_day": 1
    }
  },
  {
    $group: {
      _id: { company: "$name" },
      funding: {
        $push: {
          amount: "$funding_rounds.raised_amount",
          year: "$funding_rounds.funded_year"
        }
      }
    }
  }
]).pretty()
Here we're checking that funding_rounds is not empty. The stream is then unwound and passed to $sort and the later stages. We'll see one document for each element of the funding_rounds array for every company. So, the first thing we do here is $sort based on:
funding_rounds.funded_year
funding_rounds.funded_month
funding_rounds.funded_day
In the group stage by company name, the array is getting built using $push. $push is supposed to be part of a document specified as the value for a field we name in a group stage. We can push on any valid expression. In this case, we're pushing on documents to this array and for every document that we push it's being added to the end of the array that we're accumulating. In this case, we're pushing on documents that are built from the raised_amount and funded_year. So, the $group stage is a stream of documents that have an _id where we're specifying the company name.
Notice that $push is available in $group stages but not in $project stage. This is because $group stages are designed to take a sequence of documents and accumulate values based on that stream of documents.
$project on the other hand, works with one document at a time. So, we can calculate an average on an array within an individual document inside a project stage. But doing something like this where one at a time, we're seeing documents and for every document, it passes through the group stage pushing on a new value, well that's something that the $project stage is just not designed to do. For that type of operation we want to use $group.
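The difference can be seen in miniature in plain JavaScript: a $group-style fold collapses a stream into one result per key, while a $project-style map reshapes each document independently (the field names here are made up):

```javascript
const stream = [
  { company: "Acme", amount: 10 },
  { company: "Acme", amount: 20 },
  { company: "Beta", amount: 5 },
];

// $group with $push: accumulate values across many input documents.
function groupPush(docs) {
  const out = {};
  for (const d of docs) {
    (out[d.company] ??= []).push(d.amount); // $push appends per _id
  }
  return out;
}

// $project: reshape one document at a time; no accumulation across docs.
const reshaped = stream.map((d) => ({ name: d.company, usd: d.amount }));

console.log(groupPush(stream)); // { Acme: [ 10, 20 ], Beta: [ 5 ] }
console.log(reshaped.length); // 3 — same number of docs out as in
```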
Let's take a look at another example:
db.companies.aggregate([
  {
    $match: { funding_rounds: { $exists: true, $ne: [] } }
  },
  { $unwind: "$funding_rounds" },
  {
    $sort: {
      "funding_rounds.funded_year": 1,
      "funding_rounds.funded_month": 1,
      "funding_rounds.funded_day": 1
    }
  },
  {
    $group: {
      _id: { company: "$name" },
      first_round: { $first: "$funding_rounds" },
      last_round: { $last: "$funding_rounds" },
      num_rounds: { $sum: 1 },
      total_raised: { $sum: "$funding_rounds.raised_amount" }
    }
  },
  {
    $project: {
      _id: 0,
      company: "$_id.company",
      first_round: {
        amount: "$first_round.raised_amount",
        article: "$first_round.source_url",
        year: "$first_round.funded_year"
      },
      last_round: {
        amount: "$last_round.raised_amount",
        article: "$last_round.source_url",
        year: "$last_round.funded_year"
      },
      num_rounds: 1,
      total_raised: 1
    }
  },
  {
    $sort: { total_raised: -1 }
  }
]).pretty()
In the $group stage, we're using the $first and $last accumulators. Again, as with $push, we can't use $first and $last in $project stages, because project stages are not designed to accumulate values across multiple documents; they reshape documents one at a time. The total number of rounds is calculated using the $sum operator: the value 1 simply counts each document grouped under a given _id value. The $project may seem complex, but it's just making the output pretty; it only carries over num_rounds and total_raised from the previous stage.
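As a miniature of that behaviour: after the sort, $first keeps the first document seen for each group key while $last is overwritten by every later one. A plain-JavaScript sketch with made-up funding rounds:

```javascript
// Funding rounds already sorted by date ascending; data is illustrative.
const rounds = [
  { company: "Acme", amount: 1, year: 2010 },
  { company: "Acme", amount: 5, year: 2015 },
  { company: "Beta", amount: 2, year: 2012 },
];

// $group with $first, $last and $sum accumulators, per company key.
function summarize(rounds) {
  const groups = {};
  for (const r of rounds) {
    const g = (groups[r.company] ??= {
      first_round: r, // $first: the first doc seen for this key sticks
      last_round: r,
      num_rounds: 0,
      total_raised: 0,
    });
    g.last_round = r; // $last: overwritten by every later doc
    g.num_rounds += 1; // $sum: 1
    g.total_raised += r.amount; // $sum: "$amount"
  }
  return groups;
}

const acme = summarize(rounds)["Acme"];
console.log(acme.first_round.year, acme.last_round.year); // 2010 2015
console.log(acme.num_rounds, acme.total_raised); // 2 6
```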