Is there any way to get the last non-zero value in an aggregation? Is that possible?
Scenario:
I have an events collection, in which I store all the events from users. I want to fetch the list of users whose last purchased item cost $1.9 and who logged in at least once last week.
My events collection will have records like
{_id:ObjectId("58af54d5ab7df73d71822708"),uid:1,event:"login"}
{_id:ObjectId("58db7189296fdedde1c04bc1"),uid:2,event:"login"}
{_id:ObjectId("5888419bfa4b69dc4af7c76c"),uid:2,event:"purchase",amount:3}
{_id:ObjectId("5888419bfa4b69dc4af7d45c"),uid:1,event:"purchase",amount:1.9}
{_id:ObjectId("5888819bfa4b69dc4af7c76c"),uid:1,event:"custom",type:3,value:2}
What I am trying to do:
db.events.aggregate([{
    $group: {
        _id: '$uid',
        last_login: {
            $max: {
                $cond: [{
                    $eq: ['$event', 'login']
                }, '$_id', 0]
            }
        },
        last_amount: {
            $last: {
                $cond: [{
                    $eq: ['$event', 'purchase']
                }, '$amount', 0]
            }
        }
    }
}, {
    $match: {
        last_login: {
            $gte: ObjectId("58af54d50000000000000000")
        },
        last_amount: 1.9
    }
}])
which obviously will fail, because $last will pick up the 0 from the $cond whenever the last event for a user is not a purchase.
The output I am expecting is
{_id:1,last_login:ObjectId("58af54d5ab7df73d71822708"),last_amount:1.9}
The query is system generated. Please help.
To answer your question: you cannot modify the behaviour of $last using $cond, i.e. there is no way to express 'the last entry that satisfies a criterion'.
There are a few alternatives that you could try:
Alter your document schema to match the access pattern, especially if this is a frequently used query in your application. You should use the schema as leverage to optimise your application's queries.
Use one of the MongoDB-supported drivers and post-process the results in application code.
Depending on the use case, you could execute 2 separate aggregation queries: first query all the user uids that logged in at least once last week, then query only those users whose last purchased item value is $1.9 (a rough sketch follows below).
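For the last alternative, a minimal sketch in the mongo shell could look like the one below. It reuses the field names from the question (uid, event, amount); the ObjectId threshold standing in for "one week ago" is an assumption, and it presumes a shell version where aggregate() returns a cursor:
var loggedIn = db.events.aggregate([
    // users with at least one login event since the chosen cut-off _id
    { $match: { event: 'login', _id: { $gte: ObjectId("58af54d50000000000000000") } } },
    { $group: { _id: '$uid' } }
]).map(function (doc) { return doc._id; });
db.events.aggregate([
    // among those users, look only at purchase events
    { $match: { event: 'purchase', uid: { $in: loggedIn } } },
    // ObjectIds grow over time, so the last one per uid is the latest purchase
    { $sort: { _id: 1 } },
    { $group: { _id: '$uid', last_amount: { $last: '$amount' } } },
    { $match: { last_amount: 1.9 } }
]);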
I found a workaround. Instead of $last, I used $push in the $group and added $slice with $setDifference in $project to remove the null values. The query now looks something like
db.events.aggregate([{
    $group: {
        _id: '$uid',
        last_login: {
            $max: {
                $cond: [{
                    $eq: ['$event', 'login']
                }, '$_id', 0]
            }
        },
        last_amount: {
            $push: {
                $cond: [{
                    $eq: ['$event', 'purchase']
                }, '$amount', null]
            }
        }
    }
}, {
    $project: {
        last_login: 1,
        last_amount: {
            $slice: [{
                $setDifference: ['$last_amount', [null]]
            }, -1, 1]
        }
    }
}, {
    $match: {
        last_login: {
            $gte: ObjectId("58af54d50000000000000000")
        },
        last_amount: 1.9
    }
}])
Related
I have a large collection of documents with datetime fields in them, and I need to retrieve the most recent document for any given queried list.
Sample data:
[
{"_id": "42.abc",
"ts_utc": "2019-05-27T23:43:16.963Z"},
{"_id": "42.def",
"ts_utc": "2019-05-27T23:43:17.055Z"},
{"_id": "69.abc",
"ts_utc": "2019-05-27T23:43:17.147Z"},
{"_id": "69.def",
"ts_utc": "2019-05-27T23:44:02.427Z"}
]
Essentially, I need to get the most recent record for the "42" group as well as the most recent record for the "69" group. Using the sample data above, the desired result for the "42" group would be document "42.def".
My current solution is to query each group one at a time (looping with PyMongo), sort by the ts_utc field, and limit it to one, but this is really slow.
// Requires official MongoShell 3.6+
db = db.getSiblingDB("someDB");
db.getCollection("collectionName").find(
{
"_id" : /^42\..*/
}
).sort(
{
"ts_utc" : -1.0
}
).limit(1);
Is there a faster way to get the results I'm after?
Assuming all your documents have the format displayed above, you can split the _id into two parts (on the dot character) and use aggregation to find the max timestamp for each numeric prefix.
That way you can do it in one shot, instead of iterating over each group.
db.foo.aggregate([
{ $project: { id_parts : { $split: ["$_id", "."] }, ts_utc : 1 }},
{ $group: {"_id" : { $arrayElemAt: [ "$id_parts", 0 ] }, max : {$max: "$ts_utc"}}}
])
As #danh mentioned in the comment, the best thing you can do is probably to add an auxiliary field to indicate the grouping. You may further index the auxiliary field to boost performance.
Here is an ad-hoc way to derive the field and get the latest result per grouping:
db.collection.aggregate([
{
"$addFields": {
"group": {
"$arrayElemAt": [
{
"$split": [
"$_id",
"."
]
},
0
]
}
}
},
{
$sort: {
ts_utc: -1
}
},
{
"$group": {
"_id": "$group",
"doc": {
"$first": "$$ROOT"
}
}
},
{
"$replaceRoot": {
"newRoot": "$doc"
}
}
])
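If you decide to persist the auxiliary field rather than derive it on every query, a minimal sketch could look like the following; the field name group is an assumption, and the pipeline-style update requires MongoDB 4.2+:
db.collection.updateMany({}, [
    // store the numeric prefix of _id as its own field
    { $set: { group: { $arrayElemAt: [{ $split: ["$_id", "."] }, 0] } } }
]);
// a compound index lets the per-group "latest" lookup be served from the index
db.collection.createIndex({ group: 1, ts_utc: -1 });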
I have a mongo collection that stores data on page views such as location, user type (e.g. admin, user) and time spent on page. I want to use $match to get a subset of the documents and then use $group to group them by state and also by user type. The $match is rather expensive, so I was wondering if there was a way through the aggregation pipeline to somehow reuse the $match and get two sets of grouped data rather than needing to run two aggregates.
Current js pseudocode:
groupedByState = Views.aggregate([
    { $match: { ... } },
    { $group: {
        _id: '$state',
        secondsViewed: { $avg: '$seconds_viewed' },
    }},
])
groupedByUserType = Views.aggregate([
    { $match: { ... } },
    { $group: {
        _id: '$user_type',
        secondsViewed: { $avg: '$seconds_viewed' },
    }},
])
You can use the $facet aggregation operator to perform multiple $group operations against the output of the $match:
Views.aggregate([
    { $match: { ... } },
    { $facet: {
        byState: [{
            $group: {
                _id: '$state',
                secondsViewed: { $avg: '$seconds_viewed' }
            }
        }],
        byUserType: [{
            $group: {
                _id: '$user_type',
                secondsViewed: { $avg: '$seconds_viewed' }
            }
        }]
    }}
])
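Note that $facet emits a single result document with one array field per facet, so both groupings come back in one round trip; the shape is roughly:
{
    byState: [ /* one document per state */ ],
    byUserType: [ /* one document per user type */ ]
}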
I would like to retrieve a list of documents by _id (with a limit), ranked in descending order by their timestamp, based on a list of ObjectIds.
Corresponding to this:
db.collection.aggregate([
    { $match: { _id: { $in: [ ObjectId("X"), ObjectId("Y") ] } } },
    { $sort: { timestamp: -1 } },
    { $group: { _id: "$_id" } },
    { $skip: 0 },
    { $limit: 100 }
])
Knowing that the list from the loop may contain well over 1000 ObjectIds (in the $in array), do you think my solution is viable? Isn't there a faster and less resource-intensive way?
Best Regards.
I'm having trouble with my database because I have documents representing my users whose email field uses different cases (due to the ability to create ghost users, waiting for them to register). When a user registers, I use the lowered version of his email and overwrite the previous entry. The problem is that the 'ghost' email has not been lowered.
If a ghost is created for Foo#bar.com and then Foo#bar.com registers, he will be known as 'foo#bar.com', so the Foo#bar.com entry will just pollute my database.
I'm looking for a way to find the duplicate entries and remove the irrelevant ones (by hand) before I push my fix for the case issue. Ideas?
Thank you!
Try this:
db.users.aggregate([
{ $match: {
"username": { $exists: true }
}},
{ $project: {
"username": { "$toLower": [ "$username" ]}
}},
{ $group: {
_id: "$username",
total: { $sum : 1 }
}},
{ $match: {
total: { $gte: 2 }
}},
{ $sort: {
total: -1
}}
]);
This will find every user with a username, make the user names lower case, then group them by username and display the usernames that have a count greater than 1.
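A variant of the same idea, adapted to the email field from the question (the field name is an assumption based on the question, not the answer above), which also keeps the original _ids and raw values so the duplicates can be reviewed and removed by hand:
db.users.aggregate([
    { $match: { "email": { $exists: true } } },
    { $project: { "email_lower": { "$toLower": "$email" }, "email": 1 } },
    { $group: {
        _id: "$email_lower",
        total: { $sum: 1 },
        // keep enough information to decide which entry to delete by hand
        docs: { $push: { _id: "$_id", email: "$email" } }
    }},
    { $match: { total: { $gte: 2 } } },
    { $sort: { total: -1 } }
]);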
You can use $project and the $toLower operator to achieve what you are looking for. Assuming that your attribute name is "email" in your collection document, here is an example of how to achieve this:
db.yourcollection.aggregate([
{ $project: {
"email": { "$toLower" : [ "$email" ] }
}},
{ $match: {
"email": /foo#bar.com/
}}
]);
I have the following issue:
This query returns 1 result, which is what I want:
> db.items.aggregate([ {$group: { "_id": "$id", version: { $max: "$version" } } }])
{
"result" : [
{
"_id" : "b91e51e9-6317-4030-a9a6-e7f71d0f2161",
"version" : 1.2000000000000002
}
],
"ok" : 1
}
This query (I just added a projection so I can later query for the entire document) returns multiple results. What am I doing wrong?
> db.items.aggregate([ {$group: { "_id": "$id", version: { $max: "$version" } }, $project: { _id : 1 } }])
{
"result" : [
{
"_id" : ObjectId("5139310a3899d457ee000003")
},
{
"_id" : ObjectId("513931053899d457ee000002")
},
{
"_id" : ObjectId("513930fd3899d457ee000001")
}
],
"ok" : 1
}
Found the answer.
1. First I need to get all the _ids:
db.items.aggregate( [
{ '$match': { 'owner.id': '9e748c81-0f71-4eda-a710-576314ef3fa' } },
{ '$group': { _id: '$item.id', dbid: { $max: "$_id" } } }
]);
2. Then I need to query the documents:
db.items.find({ _id: { '$in': "IDs returned from aggregate" } });
which will look like this:
db.items.find({ _id: { '$in': [ '1', '2', '3' ] } });
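In a mongo shell version where aggregate() returns a cursor, the two steps can be wired together roughly like this (a sketch reusing the field names from the answer above):
var ids = db.items.aggregate([
    { '$match': { 'owner.id': '9e748c81-0f71-4eda-a710-576314ef3fa' } },
    { '$group': { _id: '$item.id', dbid: { $max: "$_id" } } }
]).map(function (doc) { return doc.dbid; });
// fetch the full documents for the collected ids
db.items.find({ _id: { '$in': ids } });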
(I know it's late, but I'm still answering so that other people don't have to go searching for the right answer somewhere else.)
See the answer by Deka; it will do the job.
Not all accumulators are available in the $project stage. We need to consider what we can do in $project with respect to accumulators and what we can do in $group. Let's take a look at this:
db.companies.aggregate([{
$match: {
funding_rounds: {
$ne: []
}
}
}, {
$unwind: "$funding_rounds"
}, {
$sort: {
"funding_rounds.funded_year": 1,
"funding_rounds.funded_month": 1,
"funding_rounds.funded_day": 1
}
}, {
$group: {
_id: {
company: "$name"
},
funding: {
$push: {
amount: "$funding_rounds.raised_amount",
year: "$funding_rounds.funded_year"
}
}
}
}, ]).pretty()
Here we're checking that funding_rounds is not empty. The array is then unwound and passed to $sort and the later stages, so we'll see one document for each element of the funding_rounds array for every company. The first thing we do after that is $sort based on:
funding_rounds.funded_year
funding_rounds.funded_month
funding_rounds.funded_day
In the $group stage, grouping by company name, the funding array is built using $push. $push appears as the value of a field we name inside the $group stage, and we can push any valid expression. Here, for every document flowing into the stage, a small document built from raised_amount and funded_year is appended to the end of the array we're accumulating. So the output of the $group stage is a stream of documents whose _id carries the company name.
Notice that $push is available in $group stages but not in the $project stage. This is because $group stages are designed to take a sequence of documents and accumulate values based on that stream of documents.
$project, on the other hand, works with one document at a time. So we can calculate an average over an array within an individual document inside a $project stage. But accumulating across documents, where each document that passes through the stage pushes a new value onto a growing array, is something the $project stage is simply not designed to do. For that type of operation we want to use $group.
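As a small illustration of the per-document case, here is a minimal sketch; the students collection and its scores array field are hypothetical, not part of the dataset above:
db.students.aggregate([
    // $avg is allowed inside $project here because it operates on an
    // array contained in a single document, not across documents
    { $project: { _id: 0, name: 1, avgScore: { $avg: "$scores" } } }
])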
Let's take a look at another example:
db.companies.aggregate([{
$match: {
funding_rounds: {
$exists: true,
$ne: []
}
}
}, {
$unwind: "$funding_rounds"
}, {
$sort: {
"funding_rounds.funded_year": 1,
"funding_rounds.funded_month": 1,
"funding_rounds.funded_day": 1
}
}, {
$group: {
_id: {
company: "$name"
},
first_round: {
$first: "$funding_rounds"
},
last_round: {
$last: "$funding_rounds"
},
num_rounds: {
$sum: 1
},
total_raised: {
$sum: "$funding_rounds.raised_amount"
}
}
}, {
$project: {
_id: 0,
company: "$_id.company",
first_round: {
amount: "$first_round.raised_amount",
article: "$first_round.source_url",
year: "$first_round.funded_year"
},
last_round: {
amount: "$last_round.raised_amount",
article: "$last_round.source_url",
year: "$last_round.funded_year"
},
num_rounds: 1,
total_raised: 1,
}
}, {
$sort: {
total_raised: -1
}
}]).pretty()
In the $group stage, we're using the $first and $last accumulators. Again, as with $push, we can't use $first and $last in $project stages, because $project stages are not designed to accumulate values across multiple documents; they're designed to reshape documents one at a time. The total number of rounds is calculated using the $sum operator: the value 1 simply counts each document grouped under a given _id value. The final $project may look complex, but it's just making the output pretty: it reshapes first_round and last_round and passes num_rounds and total_raised through from the previous stage.