How to use Mongo aggregate to create distinct items from array - mongodb

I am having trouble understanding how to use aggregate pipelines in Mongo.
Given the following list of documents:
db.dishes.insertMany([
{_id: "Vanilla Sundae", keywords: ["vanilla", "ice cream", "desert"] },
{_id: "Vanilla Cake", keywords: ["vanilla", "cake", "baking", "desert"] },
{_id: "Chocolate Cake", keywords: ["chocolate", "cake", "baking", "desert"] }
])
How do I create an aggregate that would return a list of distinct keywords and counts of docs by keywords:
[
{"_id": "vanilla", "count": 2},
{"_id": "ice cream", "count": 1},
{"_id": "desert", "count": 3},
{"_id": "baking", "count": 2},
{"_id": "cake", "count": 2},
{"_id": "chocolate", "count": 1}
]

You can use $unwind to deconstruct the array and $group to count the documents per keyword:
db.collection.aggregate([
{ $unwind: "$keywords" },
{
$group: {
_id: "$keywords",
count: { $sum: 1 }
}
}
])
Working Mongo playground
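To see what the two stages compute, here is a minimal plain-JavaScript sketch that mimics them in memory. It is not MongoDB code; the function name countKeywords is made up for illustration, and the order of the output rows is not guaranteed to match what the server returns.

```javascript
// In-memory imitation of { $unwind: "$keywords" } followed by
// { $group: { _id: "$keywords", count: { $sum: 1 } } }.
const dishes = [
  { _id: "Vanilla Sundae", keywords: ["vanilla", "ice cream", "desert"] },
  { _id: "Vanilla Cake", keywords: ["vanilla", "cake", "baking", "desert"] },
  { _id: "Chocolate Cake", keywords: ["chocolate", "cake", "baking", "desert"] },
];

// Hypothetical helper, not part of any MongoDB driver.
function countKeywords(docs) {
  const counts = new Map();
  for (const doc of docs) {
    // "$unwind": visit every array element as if it were its own document
    for (const keyword of doc.keywords) {
      // "$group" with $sum: 1: add one per occurrence of the key
      counts.set(keyword, (counts.get(keyword) || 0) + 1);
    }
  }
  return [...counts].map(([_id, count]) => ({ _id, count }));
}

const keywordCounts = countKeywords(dishes);
console.log(keywordCounts);
```

Each keyword appears at most once per dish here, so the count equals the number of documents containing it. If a document could repeat a keyword, the count would be the number of occurrences instead.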

You can use $unwind combined with the $group operator to achieve this.
db.collection.aggregate([
{ "$unwind": { path: "$keywords" } },
{ "$group": { "_id": "$keywords", "count": { $sum: 1 } } }
])
This should do the trick! :)
I'm attaching the MongoDB playground here.

Mongodb Aggregations - Group by date including condition

I have a series of documents gathered by aggregation grouping. This is the result for one document:
{
"_id": {
"ip": "79.xxx.xxx.117",
"myDate": "2022-10-19"
},
"date": "2022-10-19",
"allVisitedPages": [
{
"page": "/",
"time": {
"time": "2022-10-19T11:35:44.655Z",
"tz": "-120",
"_id": "634fe1100a011986b7137da0"
}
},
{
"page": "/2",
"time": {
"time": "2022-10-19T12:14:29.536Z",
"tz": "-120",
"_id": "634fea257acb264f23d421f1"
}
},
{
"page": "/",
"time": {
"time": "2022-10-19T15:37:30.002Z",
"tz": "-120",
"_id": "634fea266001ea364eeb38ea"
}
},
],
"visitedPages": 3,
"createdAt": "2022-10-19T11:35:44.920Z"
},
I want to get this (in this case 2 documents as the time difference between array position 2 and 3 is greater than 2 hours):
{
"_id": {
"ip": "79.xxx.xxx.117",
"myDate": "2022-10-19"
},
"date": "2022-10-19",
"allVisitedPages": [
{
"page": "/",
"durationInMinutes": "39",
"time": {
"time": "2022-10-19T11:35:44.655Z",
"tz": "-120",
"_id": "634fe1100a011986b7137da0"
}
},
{
"page": "/2",
"durationInMinutes": "2",
"time": {
"time": "2022-10-19T12:14:29.536Z",
"tz": "-120",
"_id": "634fea257acb264f23d421f1"
}
}
],
"visitedPages": 2,
},
{
"_id": {
"ip": "79.xxx.xxx.117",
"myDate": "2022-10-19"
},
"date": "2022-10-19",
"allVisitedPages": [
{
"page": "/",
"durationInMinutes": "2",
"time": {
"time": "2022-10-19T15:37:30.002Z",
"tz": "-120",
"_id": "634fea266001ea364eeb38ea"
}
},
],
"visitedPages": 1,
},
I want to get a new grouping document whenever the time between one array position and the next is greater than 2 hours. The last array position should always show "2".
I tried $divide and $dateDiff, but this is not possible in the group stage as it's a unary operator. Another approach I tried was to calculate the sum of the start and end times by dividing. But how can I do this at the array level in the group stage? Could someone point me in the right direction, if this is possible at all?
You can group and then reduce, but another option is to use $setWindowFields to calculate your grouping index before grouping:
db.collection.aggregate([
{$setWindowFields: {
partitionBy: {$concat: ["$ip", "$date"]},
sortBy: {"time.time": 1},
output: {prevtime: {
$push: "$time.time",
window: {documents: [-1, "current"]}
}}
}},
{$addFields: {
minutesDiff: {
$toInt: {
$dateDiff: {
startDate: {$first: "$prevtime"},
endDate: {$last: "$prevtime"},
unit: "minute"
}
}
}
}},
{$addFields: {deltaIndex: {$cond: [{$gt: ["$minutesDiff", 120]}, 1, 0]}}},
{$setWindowFields: {
partitionBy: {$concat: ["$ip", "$date"]},
sortBy: {"time.time": 1},
output: {
groupIndex: {
$sum: "$deltaIndex",
window: {documents: ["unbounded", "current"]}
},
duration: {
$push: "$minutesDiff",
window: {documents: ["current", 1]}
}
}
}
},
{$set: {
duration: {
$cond: [
{$and: [
{$eq: [{$size: "$duration"}, 2]},
{$lte: [{$last: "$duration"}, 120]}
]},
{$last: "$duration"},
2
]
}
}},
{$group: {
_id: {ip: "$ip", myDate: "$date", groupIndex: "$groupIndex"},
date: {$first: "$date"},
allVisitedPages: {$push: {page: "$page", time: "$time", duration: "$duration"}},
visitedPages: {$sum: 1}
}},
{$unset: "_id.groupIndex"}
])
See how it works on the playground example
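The grouping logic of this pipeline can be sketched in plain JavaScript, outside MongoDB. This is only an illustration under assumptions: the helper names (minutesBetween, splitVisits) and the constants are made up, and the input is assumed to be the flat list of visits for one ip and date. The minute difference counts whole-minute boundaries crossed, which is how $dateDiff measures durations.

```javascript
const GAP_MINUTES = 120;   // a gap larger than this starts a new group
const LAST_DURATION = 2;   // fallback duration for the last visit of a group

// Whole-minute boundaries crossed between two ISO timestamps.
function minutesBetween(startIso, endIso) {
  const toMinute = (iso) => Math.floor(Date.parse(iso) / 60000);
  return toMinute(endIso) - toMinute(startIso);
}

// Hypothetical helper: splits one visitor's visits into groups.
function splitVisits(visits) {
  const sorted = [...visits].sort(
    (a, b) => Date.parse(a.time.time) - Date.parse(b.time.time)
  );
  const groups = [];
  sorted.forEach((visit, i) => {
    const prev = sorted[i - 1];
    const next = sorted[i + 1];
    // Start a new group on the first visit or after a long gap.
    if (!prev || minutesBetween(prev.time.time, visit.time.time) > GAP_MINUTES) {
      groups.push([]);
    }
    // Duration is the time until the next visit when it belongs to the same group.
    const toNext = next ? minutesBetween(visit.time.time, next.time.time) : null;
    const duration = toNext !== null && toNext <= GAP_MINUTES ? toNext : LAST_DURATION;
    groups[groups.length - 1].push({ page: visit.page, time: visit.time, duration });
  });
  return groups;
}

const visitGroups = splitVisits([
  { page: "/", time: { time: "2022-10-19T11:35:44.655Z", tz: "-120" } },
  { page: "/2", time: { time: "2022-10-19T12:14:29.536Z", tz: "-120" } },
  { page: "/", time: { time: "2022-10-19T15:37:30.002Z", tz: "-120" } },
]);
console.log(JSON.stringify(visitGroups, null, 2));
```

With the sample visits from the question this yields two groups: the first holds "/" and "/2", the second holds the last visit to "/".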

Filter nested objects

I have a collection of docs like
{'id':1, 'score': 1, created_at: ISODate(...)}
{'id':1, 'score': 2, created_at: ISODate(...)}
{'id':2, 'score': 1, created_at: ISODate(...)}
{'id':2, 'score': 20, created_at: ISODate(...)}
etc.
Does anyone know how to find docs that were created within the past 24hrs where the difference of the score value between the two most recent docs of the same id is less than 5?
So far I can only find all docs created within the past 24hrs:
[{
$project: {
_id: 0,
score: 1,
created_at: 1
}
}, {
$match: {
$expr: {
$gte: [
'$created_at',
{
$subtract: [
'$$NOW',
86400000
]
}
]
}
}
}]
Any advice appreciated.
Edit: By the two most recent docs, the oldest of the two can be created more than 24hrs ago. So the most recent doc would be created within the past 24hrs, but the oldest doc could be created over 24hrs ago.
If I understand you correctly, you want something like:
db.collection.aggregate([
{$match: {$expr: {$gte: ["$created_at", {$subtract: ["$$NOW", 86400000]}]}}},
{$sort: {created_at: -1}},
{$group: {_id: "$id", data: {$push: "$$ROOT"}}},
{$project: {pair: {$slice: ["$data", 0, 2]}, scores: {$slice: ["$data.score", 0, 2]}}},
{$match: {$expr: {
$lte: [{$abs: {$subtract: [{$first: "$scores"}, {$last: "$scores"}]}}, 5]
}}},
{$unset: "scores"}
])
See how it works on the playground example
EDIT:
According to your comment, one option is:
db.collection.aggregate([
{$setWindowFields: {
partitionBy: "$id",
sortBy: {created_at: -1},
output: {data: {$push: "$$ROOT", window: {documents: ["current", 1]}}}
}},
{$group: {
_id: "$id",
created_at: {$first: "$created_at"},
pair: {$first: "$data"}
}},
{$match: {$expr: {$and: [
{$gte: ["$created_at", {$dateAdd: {startDate: "$$NOW", unit: "day", amount: -1}}]},
{$eq: [{$size: "$pair"}, 2]},
{$lte: [{$abs: {$subtract: [{$first: "$pair.score"},
{$last: "$pair.score"}]}}, 5]}
]}}},
{$project: {_id: 0, pair: 1}}
])
See how it works on the playground example
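As a cross-check of the intended logic, here is a small plain-JavaScript sketch of the same selection, run in memory rather than in MongoDB. The function name recentClosePairs, the sample timestamps and the fixed "now" value are assumptions made for illustration. Like the pipelines above, it treats a difference of exactly 5 as a match.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Hypothetical helper: for each id, take the two most recent docs and keep
// the pair when the newest doc is at most 24 hours old and the scores
// differ by no more than maxDiff.
function recentClosePairs(docs, now, maxDiff = 5) {
  const byId = new Map();
  for (const doc of docs) {
    if (!byId.has(doc.id)) byId.set(doc.id, []);
    byId.get(doc.id).push(doc);
  }
  const result = [];
  for (const group of byId.values()) {
    group.sort((a, b) => b.created_at - a.created_at); // newest first
    const pair = group.slice(0, 2);
    if (pair.length < 2) continue;                      // need two docs to compare
    if (now - pair[0].created_at > DAY_MS) continue;    // newest doc is too old
    if (Math.abs(pair[0].score - pair[1].score) <= maxDiff) result.push(pair);
  }
  return result;
}

const now = new Date("2022-10-20T12:00:00Z"); // fixed for a reproducible example
const closePairs = recentClosePairs(
  [
    { id: 1, score: 1, created_at: new Date("2022-10-18T10:00:00Z") },
    { id: 1, score: 2, created_at: new Date("2022-10-20T08:00:00Z") },
    { id: 2, score: 1, created_at: new Date("2022-10-20T09:00:00Z") },
    { id: 2, score: 20, created_at: new Date("2022-10-20T10:00:00Z") },
    { id: 3, score: 3, created_at: new Date("2022-10-17T09:00:00Z") },
    { id: 3, score: 4, created_at: new Date("2022-10-18T09:00:00Z") },
  ],
  now
);
console.log(closePairs);
```

Only id 1 qualifies: its newest doc is recent and the scores differ by 1, even though the older doc of the pair is more than 24 hours old.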
If I've understood correctly, you can try this query:
First, $match as you already do, to get the documents from the last day.
Then $sort by date so that the most recent documents come first.
$group by id; because the most recent documents come first, the first two elements of the array built with $push are the two most recent scores.
Now you only need to $sum these two values.
Finally, filter again to keep the results that are less than ($lt) 5.
db.collection.aggregate([
{
$match: {
$expr: {
$gte: [
"$created_at",
{
$subtract: [
"$$NOW",
86400000
]
}
]
}
}
},
{
"$sort": {
"created_at": -1
}
},
{
"$group": {
"_id": "$id",
"score": {
"$push": "$score"
}
}
},
{
"$project": {
"score": {
"$sum": {
"$firstN": {
"n": 2,
"input": "$score"
}
}
}
}
},
{
"$match": {
"score": {
"$lt": 5
}
}
}
])
Example here
Edit: $firstN is new in version 5.2. On earlier versions you can use $slice instead.

Count the documents and sum of values of fields in all documents of a mongodb

I have a set of documents modified from mongodb using
[{"$project":{"pred":1, "base-url":1}},
{"$group":{
"_id":"$base-url",
"invalid":{"$sum": { "$cond": [{ "$eq": ["$pred", "invalid"] }, 1, 0] }},
"pending":{"$sum": { "$cond": [{ "$eq": ["$pred", "null"] }, 1, 0] }},
}},
]
to get the below documents
[{'_id': 'https://www.example1.org/', 'invalid': 3, 'pending': 6},
{'_id': 'https://example2.com/', 'invalid': 10, 'pending': 4},
{'_id': 'https://www.example3.org/', 'invalid': 2, 'pending': 6}]
How to get the count of documents and sum of other fields to obtain the following result
{"count":3, "invalid":15,"pending":16}
You just need a $group stage with $sum.
The $sum docs have good examples.
db.collection.aggregate([
{
$group: {
_id: null,
pending: {
$sum: "$pending"
},
invalid: {
$sum: "$invalid"
},
count: {
$sum: 1 //counting each record
}
}
},
{
$project: {
_id: 0 //removing _id field from the final output
}
}
])
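For comparison, the same totals can be computed in plain JavaScript with a single reduce over the grouped documents. This is just an in-memory sketch of what grouping on _id: null does, using the three documents from the question; the variable names are made up.

```javascript
const perUrl = [
  { _id: "https://www.example1.org/", invalid: 3, pending: 6 },
  { _id: "https://example2.com/", invalid: 10, pending: 4 },
  { _id: "https://www.example3.org/", invalid: 2, pending: 6 },
];

// One accumulator object for the whole collection, like _id: null in $group.
const totals = perUrl.reduce(
  (acc, doc) => ({
    count: acc.count + 1,                 // $sum: 1
    invalid: acc.invalid + doc.invalid,   // $sum: "$invalid"
    pending: acc.pending + doc.pending,   // $sum: "$pending"
  }),
  { count: 0, invalid: 0, pending: 0 }
);

console.log(totals);
```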

How can I project top 5 counts and sum the rest in MongoDB?

I have the following documents:
_id: "Team 1"
count: 1200
_id: "Team 2"
count: 1170
_id: "Team 3"
count: 1006
_id: "Team 4"
count: 932
_id: "Team 5"
count: 931
_id: "Team 6"
count: 899
_id: "Team 7"
count: 895
The list is already sorted and everything, I just need to project this as an array of top 5 based on count and then the rest should be summed as 'others'. If possible I'd like to also add the percentage that each element in the list makes up of the full count. Like this:
[
{"name":"Team 1", "count":1200, "percent":25},
{"name":"Team 2", "count":1170,"percent":15},
{"name":"Team 3", "count":1006,"percent":10},
{"name":"Team 4", "count":932,"percent":5},
{"name":"Team 5", "count":931,"percent":5},
{"name":"Other", "count":1794, "percent":40}
]
Query
$setWindowFields to sort and add the sort-rank to each document
group by null with 3 accumulators
push the first 5 documents unchanged
sum the count of the rest (rank>5)
total sum
$map to divide the counts with the total sum for the 5 top documents, to get the percentage also
add also the percentage for the rest of documents
unwind and replace the root, with those documents that have count and percentage
Playmongo (put the mouse at the end of each stage to see the stage in and out)
db.collection.aggregate(
[{"$setWindowFields":
{"output": {"rank": {"$rank": {}}}, "sortBy": {"count": -1}}},
{"$group":
{"_id": null,
"top5":
{"$push": {"$cond": [{"$lte": ["$rank", 5]}, "$$ROOT", "$$REMOVE"]}},
"other": {"$sum": {"$cond": [{"$lte": ["$rank", 5]}, 0, "$count"]}},
"all": {"$sum": "$count"}}},
{"$project":
{"_id": 0,
"docs":
{"$concatArrays":
[{"$map":
{"input": "$top5",
"in":
{"name": "$$this._id",
"count": "$$this.count",
"percentage":
{"$multiply": [{"$divide": ["$$this.count", "$all"]}, 100]}}}},
[{"name": "other",
"count": "$other",
"percentage":
{"$multiply": [{"$divide": ["$other", "$all"]}, 100]}}]]}}},
{"$unwind": "$docs"}, {"$replaceRoot": {"newRoot": "$docs"}}])
Another way to do it is with $facet, since $setWindowFields only works with MongoDB v5 or later.
mongoPlayground
db.collection.aggregate([
{ $sort: { count: -1 } },
{
"$facet": {
others: [
{ "$skip": 5 },
{
"$group": {
"_id": "others",
"count": { "$sum": "$count" }
}
}
],
top5: [ { "$limit": 5 } ]
}
},
{
"$project": { result: { "$concatArrays": [ "$others", "$top5" ] } }
},
{
"$addFields": { totalCount: { "$sum": "$result.count" } }
},
{ $unwind: "$result" },
{
$project: {
_id: "$result._id",
count: "$result.count",
percent: {
$round: [
{ "$multiply": [ { $divide: [ "$result.count", "$totalCount" ] }, 100 ] },
0
]
}
}
}
])
If you have MongoDB version 5.0 or higher, you can use $setWindowFields as in @Takis's nice answer. Otherwise, you can $group, $slice and $reduce your way to the answer:
$sort to put the highest counts first, then $group to push all documents into one array called all and to $sum the total count.
$slice the all array to keep only the top N.
$reduce the top N to sum their counts.
Add the others to the top N array, with a count of sum - sum(topN).
$unwind and format.
db.collection.aggregate([
{$sort: {count: -1}},
{$group: {_id: null, all: {$push: "$$ROOT"}, sum: {$sum: "$count"}}},
{$project: {_id: null, sum: 1, res: {$slice: ["$all", 5]}}},
{$project: {sum: 1, res: 1, topN: {
$reduce: {
input: "$res",
initialValue: 0,
in: {$add: ["$$value", "$$this.count"]}
}
}
}
},
{
$project: {_id: 0, sum: 1, res: {
$concatArrays: [
[{_id: "other", count: {$subtract: ["$sum", "$topN"]}}],
"$res"
]
}
}
},
{$unwind: "$res"},
{$project: {_id: "$res._id", count: "$res.count",
percent: { $round: [{$multiply:
[{$divide: ["$res.count", "$sum"]}, 100]}, 0]
}
}
}
])
Playground example
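The steps above can also be sketched in plain JavaScript to check the numbers. This is an in-memory illustration, not MongoDB code; the function name topNWithOther is hypothetical, and "Other" is appended last (as in the desired output of the question) rather than first. Math.round is used for rounding, which can differ from $round for exact .5 values.

```javascript
const teams = [
  { _id: "Team 1", count: 1200 },
  { _id: "Team 2", count: 1170 },
  { _id: "Team 3", count: 1006 },
  { _id: "Team 4", count: 932 },
  { _id: "Team 5", count: 931 },
  { _id: "Team 6", count: 899 },
  { _id: "Team 7", count: 895 },
];

// Hypothetical helper: keep the top n rows, fold the rest into "Other",
// then add each row's share of the total as a rounded percentage.
function topNWithOther(docs, n = 5) {
  const sorted = [...docs].sort((a, b) => b.count - a.count); // $sort
  const sum = sorted.reduce((acc, d) => acc + d.count, 0);    // $sum of all counts
  const top = sorted.slice(0, n);                             // $slice
  const topSum = top.reduce((acc, d) => acc + d.count, 0);    // $reduce
  const rows = [
    ...top.map((d) => ({ name: d._id, count: d.count })),
    { name: "Other", count: sum - topSum },                   // sum - sum(topN)
  ];
  return rows.map((r) => ({ ...r, percent: Math.round((r.count / sum) * 100) }));
}

const ranking = topNWithOther(teams);
console.log(ranking);
```

Because each percentage is rounded on its own, the rounded values are not guaranteed to add up to exactly 100.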

Single MongoDB query to aggregate count

I have a collection peopleColl containing records with people data. Each record is uniquely indexed by id and has a managers field of type array.
Example:
{
id: 123,
managers: [456, 789]
},
{
id: 321,
managers: [555, 789]
}
I want to write a single query to find all people with the same manager, for several ids (managers). So given [456, 555, 789] the desired output would be:
{
456: 1,
555: 1,
789: 2
}
I can do it (slowly) in a for-loop in Python as follows:
idToCount = {id: peopleColl.count({"managers": id}) for id in ids}
Edit: I am primarily interested in solutions <= MongoDB 3.4
You can try the aggregation below in MongoDB 3.4.4 and above:
db.collection.aggregate([
{ "$unwind": "$managers" },
{ "$group": { "_id": "$managers", "count": { "$sum": 1 }}},
{ "$group": {
"_id": null,
"data": {
"$push": {
"k": { "$toLower": "$_id" },
"v": "$count"
}
}
}},
{ "$replaceRoot": { "newRoot": { "$arrayToObject": "$data" }}}
])
Output
[
{
"456": 1,
"555": 1,
"789": 2
}
]
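The shape of that result can be reproduced in plain JavaScript, which may help when checking the pipeline against a small sample. This is an in-memory sketch with a made-up function name (countByManager); Object.fromEntries plays the role of $arrayToObject, and the keys are converted to strings just as the $toLower trick does in the pipeline.

```javascript
const people = [
  { id: 123, managers: [456, 789] },
  { id: 321, managers: [555, 789] },
];

// Hypothetical helper: count how many people list each manager id.
function countByManager(docs) {
  const counts = new Map();
  for (const person of docs) {
    for (const manager of person.managers) {   // $unwind
      const key = String(manager);             // object keys must be strings
      counts.set(key, (counts.get(key) || 0) + 1); // $group with $sum: 1
    }
  }
  return Object.fromEntries(counts);           // like $arrayToObject on k/v pairs
}

const managerCounts = countByManager(people);
console.log(managerCounts);
```

Like the pipeline, this counts every manager that occurs in the data; restricting it to a given list of ids would need an extra filter step.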
You can try the pipeline below.
db.collection.aggregate([
{ "$unwind": "$managers" },
{ "$group": { "_id": "$managers", "count": { "$sum": 1 }}}
])
Output:
{'_id': 456, 'count': 1},
{'_id': 555, 'count': 1},
{'_id': 789, 'count': 2}
You can then loop through the result and build the id-to-count mapping:
result = db.collection.aggregate([
{ "$unwind": "$managers" },
{ "$group": { "_id": "$managers", "count": { "$sum": 1 }}}
])
iD_Count = {}
result.forEach(function(d, i) {
iD_Count[d._id] = d.count;
})
iD_Count:
{
456: 1,
555: 1,
789: 2
}
You can try the aggregation below in 3.6.
db.colname.aggregate([
{"$unwind":"$managers"},
{"$group":{"_id":"$managers","count":{"$sum":1}}},
{"$group":{
"_id":null,
"managerandcount":{"$mergeObjects":{"$arrayToObject":[[["$_id","$count"]]]}}
}},
{"$replaceRoot":{"newRoot":"$managerandcount"}}
])