MongoDB -- Find duplicate documents by multiple keys

I have a collection with documents that look like the following:
{
"_id" : ObjectId("55b377cb66b393427367c3e2"),
"comment" : "This is a comment",
"url_key" : "55b377cb66b393427367c3df", //This is an ObjectId from another record in a different collection
}
I need to find records in this collection that contain duplicate values for both the comment AND the url_key.
I can easily find duplicate records (using aggregate) for a single key (e.g. comment), but I can't figure out how to group by/aggregate on multiple keys.
Here's my current aggregation pipeline:
db.comments.aggregate([
  { $group: { _id: { comment: "$comment" }, uniqueIds: { $addToSet: "$_id" }, count: { $sum: 1 } } },
  { $match: { count: { $gte: 2 } } },
  { $sort: { count: -1 } },
  { $limit: 10 }
]);

Is it as simple as grouping by multiple keys or did I misunderstand your question?
...
{ $group: { _id: { comment: "$comment", url_key: "$url_key" }, count: { $sum: 1 } } },
{ $match: { count: { $gte: 2 } } },
...
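To make the compound-key idea concrete, here is a minimal plain-JavaScript sketch that mimics what a $group stage keyed on both comment and url_key computes. It runs in memory on a few made-up sample documents rather than against MongoDB; the helper name findDuplicates and the sample data are assumptions for illustration only.

```javascript
// In-memory illustration of grouping by two keys; this is not MongoDB code.
// The sample documents are made up for the example.
const docs = [
  { _id: 1, comment: "This is a comment", url_key: "a" },
  { _id: 2, comment: "This is a comment", url_key: "a" },
  { _id: 3, comment: "This is a comment", url_key: "b" },
  { _id: 4, comment: "Another comment", url_key: "a" },
];

// Hypothetical helper: bucket documents by the (comment, url_key) pair,
// then keep only the buckets that hold two or more documents.
function findDuplicates(documents) {
  const groups = new Map();
  for (const doc of documents) {
    // JSON.stringify of an array gives an unambiguous composite key.
    const key = JSON.stringify([doc.comment, doc.url_key]);
    if (!groups.has(key)) {
      groups.set(key, { comment: doc.comment, url_key: doc.url_key, uniqueIds: [], count: 0 });
    }
    const group = groups.get(key);
    group.uniqueIds.push(doc._id);
    group.count += 1;
  }
  return [...groups.values()]
    .filter((g) => g.count >= 2)          // like { $match: { count: { $gte: 2 } } }
    .sort((a, b) => b.count - a.count);   // like { $sort: { count: -1 } }
}

const duplicates = findDuplicates(docs);
console.log(duplicates);
```

Documents 1 and 2 share both fields and land in the same bucket. Document 3 has the same comment but a different url_key, so it does not count as a duplicate.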

Related

Sum value when satisfy condition in MongoDB

I am trying to get the sum of values when a certain condition is satisfied in the document.
In the query below I want the sum of currentValue only for documents where componentId = "ABC":
db.Pointnext_Activities.aggregate(
{ $project: {
_id: 0,
componentId:1,
currentValue:1
}
},
{ $group:
{ _id: "$componentId",
total: { $sum: "$currentValue" }
}
}
)
Please try this:
db.Pointnext_Activities.aggregate([{ $match: { componentId: 'ABC' } },
{
$group:
{
_id: "$componentId",
total: { $sum: "$currentValue" }
}
}, { $project: { 'componentId': '$_id', total: 1, _id: 0 } }])
If you only need the total and don't care about componentId being returned, try this:
db.Pointnext_Activities.aggregate([{ $match: { componentId: 'ABC' } },
{
$group:
{
_id: "",
total: { $sum: "$currentValue" }
}
}, {$project :{total :1, _id:0}}])
In an aggregation it is usually best to start with a filter stage, i.e. $match, so that only the documents you need are passed on to the later stages.
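As a rough illustration of the filter-then-sum order, the following plain-JavaScript sketch does the same thing in memory. It is not MongoDB code; the sample documents and the helper name totalForComponent are assumptions made for the example.

```javascript
// In-memory illustration of $match followed by $group/$sum; not MongoDB code.
// The sample documents are made up.
const activities = [
  { componentId: "ABC", currentValue: 10 },
  { componentId: "XYZ", currentValue: 99 },
  { componentId: "ABC", currentValue: 5 },
];

// Hypothetical helper: filter first (like $match), then add up (like $sum).
function totalForComponent(documents, componentId) {
  return documents
    .filter((doc) => doc.componentId === componentId)
    .reduce((total, doc) => total + doc.currentValue, 0);
}

const total = totalForComponent(activities, "ABC");
console.log({ componentId: "ABC", total });
```

Because the filter runs first, the "XYZ" document never reaches the summing step.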

MongoDB - aggregating with nested objects, and changeable keys

I have a document which describes counts of different things observed by a camera within a 15 minute period. It looks like this:
{
"_id" : ObjectId("5b1a709a83552d002516ac19"),
"start" : ISODate("2018-06-08T11:45:00.000Z"),
"end" : ISODate("2018-06-08T12:00:00.000Z"),
"recording" : ObjectId("5b1a654683552d002516ac16"),
"data" : {
"counts" : {
"5b434d05da1f0e00252566be" : 12,
"5b434d05da1f0e00252566cc" : 4,
"5b434d05da1f0e00252566ca" : 1
}
}
}
The keys inside the data.counts object change with each document and refer to additional data that is fetched at a later date. There is no fixed limit on the number of keys inside data.counts (but there are usually about 20).
I am trying to aggregate all these 15 minute documents up to daily aggregated documents.
I have this query at the moment to do that:
db.getCollection("segments").aggregate([
{$match:{
"recording": ObjectId("5bf7f68ad8293a00261dd83f")
}},
{$project:{
"start": 1,
"recording": 1,
"data": 1
}},
{$group:{
_id: { $dateToString: { format: "%Y-%m-%d", date: "$start" } },
"segments": { $push: "$$ROOT" }
}},
{$sort: {_id: -1}},
]);
This does the grouping and returns all the segments in an array.
I want to also aggregate the information inside data.counts, so that I get the sum of values for all keys that are the same within the daily group.
This would save me from having another service loop through each 15 minute segment summing values with the same keys. E.g. the query would return something like this:
{
"_id" : "2019-02-27",
"counts" : {
"5b434d05da1f0e00252566be" : 351,
"5b434d05da1f0e00252566cc" : 194,
"5b434d05da1f0e00252566ca" : 111
... any other keys that were found within a day
}
}
How might I amend the query I already have, or use a different query?
Thanks!
You could use the $facet pipeline stage to create two sub-pipelines; one for segments and another for counts. These sub-pipelines can be joined by using $zip to stitch them together and $map to merge each 2-element array produced from zip. Note this will only work correctly if the sub-pipelines output sorted arrays of the same size, which is why we group and sort by start_date in each sub-pipeline.
Here's the query:
db.getCollection("segments").aggregate([{
$match: {
recording: ObjectId("5b1a654683552d002516ac16")
}
}, {
$project: {
start: 1,
recording: 1,
data: 1,
start_date: { $dateToString: { format: "%Y-%m-%d", date: "$start" }}
}
}, {
$facet: {
segments_pipeline: [{
$group: {
_id: "$start_date",
segments: {
$push: {
start: "$start",
recording: "$recording",
data: "$data"
}
}
}
}, {
$sort: {
_id: -1
}
}],
counts_pipeline: [{
$project: {
start_date: "$start_date",
count: { $objectToArray: "$data.counts" }
}
}, {
$unwind: "$count"
}, {
$group: {
_id: {
start_date: "$start_date",
count_id: "$count.k"
},
count_sum: { $sum: "$count.v" }
}
}, {
$group: {
_id: "$_id.start_date",
counts: {
$push: {
$arrayToObject: [[{
k: "$_id.count_id",
v: "$count_sum"
}]]
}
}
}
}, {
$project: {
counts: { $mergeObjects: "$counts" }
}
}, {
$sort: {
_id: -1
}
}]
}
}, {
$project: {
result: {
$map: {
input: { $zip: { inputs: ["$segments_pipeline", "$counts_pipeline"] }},
in: { $mergeObjects: "$$this" }
}
}
}
}, {
$unwind: "$result"
}, {
$replaceRoot: {
newRoot: "$result"
}
}])
Try it out here: Mongoplayground.
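For readers who want to check what the counts sub-pipeline is meant to produce, here is a small plain-JavaScript sketch of the same per-day summing done in memory. It is not MongoDB code; the sample segments, the short keys a, b and c, and the helper name dailyCounts are made up for illustration.

```javascript
// In-memory illustration of the counts sub-pipeline; not MongoDB code.
// Sample segments are made up; the keys inside data.counts vary per document.
const segments = [
  { start: new Date("2019-02-27T11:45:00Z"), data: { counts: { a: 12, b: 4 } } },
  { start: new Date("2019-02-27T12:00:00Z"), data: { counts: { a: 3, c: 1 } } },
  { start: new Date("2019-02-28T00:00:00Z"), data: { counts: { b: 7 } } },
];

// Hypothetical helper: sum every key of data.counts per UTC calendar day.
function dailyCounts(docs) {
  const byDay = new Map();
  for (const doc of docs) {
    // UTC date string, comparable to $dateToString with format "%Y-%m-%d".
    const day = doc.start.toISOString().slice(0, 10);
    const counts = byDay.get(day) || {};
    // Object.entries plays the role of $objectToArray followed by $unwind.
    for (const [key, value] of Object.entries(doc.data.counts)) {
      counts[key] = (counts[key] || 0) + value;
    }
    byDay.set(day, counts);
  }
  // Newest day first, like { $sort: { _id: -1 } }.
  return [...byDay.entries()]
    .map(([day, counts]) => ({ _id: day, counts }))
    .sort((x, y) => (x._id < y._id ? 1 : -1));
}

const daily = dailyCounts(segments);
console.log(JSON.stringify(daily, null, 2));
```

Keys that appear in only some segments of a day (such as c here) still show up in that day's totals, which is the behaviour the question asks for.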

Mongo aggregation pipeline, finding out the total number of entries in an array per user

I have a collection, let's call it 'users'. Each document in this collection has a property entries, which holds a variable-length array of strings.
I want to find out the total number of these strings across my collection.
db.users.find()
> [{ entries: [] }, { entries: ['entry1','entry2']}, {entries: ['entry1']}]
So far I have made many attempts; here is one of my closest.
db.users.aggregate([
{ $project:
{ numberOfEntries:
{ $size: "$entries" } }
},
{ $group:
{_id: { total_entries: { $sum: "$entries"}
}
}
}
])
What this gives me is a list of the users with the total number of entries. What I want is each of those total_entries figures added up to get my total. Any ideas what I am doing wrong, or is there a better way to start this?
A possible solution could be:
db.users.aggregate([{
$group: {
_id: 'some text here',
count: {$sum: {$size: '$entries'}}
}
}]);
This will give you the total count of all entries across all users and looks like:
[
{
_id: 'some text here',
count: 3
}
]
I would use $unwind in the case that you want individual entry counts.
That would look like
db.users.aggregate([
  { $unwind: '$entries' },
  { $group: {
    _id: '$entries',
    count: { $sum: 1 }
  } }
])
and this will give you something along the lines of:
[
{
_id: 'entry1',
count: 2
},
{
_id: 'entry2',
count: 1
}
]
In case you want the overall number of distinct entries:
> db.users.aggregate([
{ $unwind: "$entries" },
{ $group: { _id: "$entries" } },
{ $count: "total" }
])
{ "total" : 2 }
In case you want the overall number of entries:
> db.users.aggregate( [ { $unwind: "$entries" }, { $count: "total" } ] )
{ "total" : 3 }
This makes use of the $unwind stage, which outputs one document per element of an array field:
> db.users.aggregate( [ { $unwind: "$entries" } ] )
{ "_id" : ObjectId("5a81a7a1318e1cfc10250430"), "entries" : "entry1" }
{ "_id" : ObjectId("5a81a7a1318e1cfc10250430"), "entries" : "entry2" }
{ "_id" : ObjectId("5a81a7a1318e1cfc10250431"), "entries" : "entry1" }
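The flattening and the two counts built on it can be mimicked in plain JavaScript. The sketch below is an in-memory illustration, not MongoDB code; the numeric _id values are made up, and the entries arrays are the ones from the question.

```javascript
// In-memory illustration of $unwind and the counts built on top of it.
const users = [
  { _id: 1, entries: [] },
  { _id: 2, entries: ["entry1", "entry2"] },
  { _id: 3, entries: ["entry1"] },
];

// Like $unwind: one output document per array element.
// Documents with an empty array produce no output at all.
const unwound = users.flatMap((user) =>
  user.entries.map((entry) => ({ _id: user._id, entries: entry }))
);

// Like { $count: "total" } directly after $unwind.
const totalEntries = unwound.length;

// Like grouping by "$entries" first and counting the groups.
const distinctEntries = new Set(unwound.map((doc) => doc.entries)).size;

console.log(unwound);
console.log({ totalEntries, distinctEntries });
```

Note that the user with an empty entries array disappears after the flattening step, which does not matter for these totals but would matter if per-user rows were needed.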
You were on the right track. You just needed to specify an _id of null in the $group stage, which calculates the accumulated values over all input documents as a whole, i.e.:
db.users.aggregate([
{
"$project": {
"numberOfEntries": {
"$size": {
"$ifNull": ["$entries", []]
}
}
}
},
{
"$group": {
"_id": null, /* _id of null to get the accumulated values for all the docs */
"totalEntries": { "$sum": "$numberOfEntries" }
}
}
])
Or with just a single pipeline stage:
db.users.aggregate([
{
"$group": {
"_id": null, /* _id of null to get the accumulated values for all the docs */
"totalEntries": {
"$sum": {
"$size": {
"$ifNull": ["$entries", []]
}
}
}
}
}
])
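The $ifNull guard matters because $size raises an error when its argument is missing or not an array. As a rough in-memory analogue (plain JavaScript, not MongoDB code), the sketch below sums array lengths while treating a missing entries field as an empty array; the fourth sample document is made up to show that case.

```javascript
// In-memory illustration of summing array sizes with a null-safe default.
const userDocs = [
  { entries: [] },
  { entries: ["entry1", "entry2"] },
  { entries: ["entry1"] },
  { name: "no entries field" }, // made-up document without the array
];

// (doc.entries ?? []) mirrors { $ifNull: ["$entries", []] }.
const totalEntries = userDocs.reduce(
  (sum, doc) => sum + (doc.entries ?? []).length,
  0
);

console.log({ _id: null, totalEntries });
```

Without the fallback, the document lacking entries would throw here as well, since undefined has no length property.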

Mongodb : Search for entries having one to many mapping between 2 fields

I have a mongodb database, containing entities of ECommerceProducts. There are two fields, "productId" and "skuId". The thing is many of the records are duplicated, i.e., it is possible that two entries have same "productId" as well as same "skuId".
I want to find the set of productIds that have multiple (distinct) skuIds present.
This is what I have till now:
db.urls.aggregate([
{ $group: {
_id: { productId: "$productId" },
count: { $sum: 1 }
} },
{ $match: {
count: { $gte: 2 }
} },
{ $sort : { count : -1} },
{ $limit : 10 }
]);
This code gives me the list of Duplicate productIds and how many times they have occurred. How can I also get the list of different skuIds these contain?
You can use the $addToSet accumulator:
db.urls.aggregate([
{ $group: {
_id: { productId: "$productId" },
skuId: {$addToSet: "$skuId"},
count: { $sum: 1 }
} },
{ $match: {
count: { $gte: 2 }
} },
{ $sort : { count : -1} },
{ $limit : 10 }
]);
This returns every productId that appears more than once, together with the distinct set of skuId values it uses.
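One caveat: count in the pipeline above counts documents, so a product whose single SKU is stored twice also passes the $match. If the goal is strictly products with two or more distinct SKUs, one option is to filter on the size of the collected set instead. The sketch below shows such a pipeline as plain data (not run against a server here) next to an in-memory equivalent on made-up sample data; the field names skuIds and skuCount are chosen for this example.

```javascript
// Pipeline sketch as plain data; it could be passed to db.urls.aggregate(...).
// skuIds and skuCount are field names chosen for this example.
const pipeline = [
  { $group: { _id: "$productId", skuIds: { $addToSet: "$skuId" } } },
  { $project: { skuIds: 1, skuCount: { $size: "$skuIds" } } },
  { $match: { skuCount: { $gte: 2 } } },
  { $sort: { skuCount: -1 } },
];

// In-memory equivalent on made-up sample data, to show the intended result.
const products = [
  { productId: "p1", skuId: "s1" },
  { productId: "p1", skuId: "s1" }, // exact duplicate: same product, same SKU
  { productId: "p2", skuId: "s1" },
  { productId: "p2", skuId: "s2" },
];

// Collect the distinct SKUs per product (the role of $addToSet).
const skusByProduct = new Map();
for (const { productId, skuId } of products) {
  if (!skusByProduct.has(productId)) skusByProduct.set(productId, new Set());
  skusByProduct.get(productId).add(skuId);
}

// Keep only products with at least two distinct SKUs.
const multiSku = [...skusByProduct.entries()]
  .filter(([, skus]) => skus.size >= 2)
  .map(([productId, skus]) => ({ _id: productId, skuIds: [...skus] }));

console.log(multiSku);
```

Here p1 occurs twice but only ever with s1, so it is excluded; p2 is the only product with more than one distinct SKU.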

Convert to lowercase in group aggregation

I want to return an aggregate of blog post tags and their total count. My blog posts are stored like so:
{
"_id" : ObjectId("532c323bb07ab5aace243c8e"),
"title" : "Fitframe.js - Responsive iframes made easy",
"tags" : [
"JavaScript",
"jQuery",
"RWD"
]
}
I'm then executing the following pipeline:
printjson(db.posts.aggregate(
{
$project: {
tags: 1,
count: { $add: 1 }
}
},
{
$unwind: '$tags'
},
{
$group: {
_id: '$tags',
count: {
$sum: '$count'
},
tags_lower: { $toLower: '$tags' }
}
},
{
$sort: {
_id: 1
}
}
));
So that the results are sorted correctly I need to sort on a lowercase version of each tag. However, when executing the above code I get the following error:
aggregate failed: {
"errmsg" : "exception: unknown group operator '$toLower'",
"code" : 15952,
"ok" : 0
}
Do I need to do another projection to add the lowercase tag?
Yes, you must add it in a projection. $toLower will not work as a field of $group: only accumulator operators such as $sum ( http://docs.mongodb.org/manual/reference/operator/aggregation-group/ ) are accepted at the top level of a $group stage.
You don't need to add another projection ... you could fix it when you do the $group:
db.posts.aggregate(
{
$project: {
tags: 1,
count: { $add: 1 }
}
},
{
$unwind: '$tags'
},
{
$group: {
_id: { tag: '$tags', lower: { $toLower : '$tags' } },
count: {
$sum: '$count'
}
}
},
{
$sort: {
"_id.lower": 1
}
}
)
In the above example, I've preserved the original name and added the lower case version to the _id.
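To see why sorting on the lowercase copy changes the order, here is a small plain-JavaScript sketch of the same idea done in memory. It is not MongoDB code; the second sample post and its tags are made up for illustration.

```javascript
// In-memory illustration of counting tags and sorting case-insensitively.
const posts = [
  { title: "Post A", tags: ["JavaScript", "jQuery", "RWD"] },
  { title: "Post B", tags: ["css", "JavaScript"] }, // made-up second post
];

// Count each tag by its original spelling.
const counts = new Map();
for (const post of posts) {
  for (const tag of post.tags) { // like $unwind: '$tags'
    counts.set(tag, (counts.get(tag) || 0) + 1);
  }
}

// Keep the original spelling and carry a lowercase copy used only for sorting,
// like _id: { tag, lower } followed by a $sort on "_id.lower".
const tagCounts = [...counts.entries()]
  .map(([tag, count]) => ({ _id: { tag, lower: tag.toLowerCase() }, count }))
  .sort((a, b) =>
    a._id.lower < b._id.lower ? -1 : a._id.lower > b._id.lower ? 1 : 0
  );

console.log(tagCounts.map((t) => `${t._id.tag}: ${t.count}`));
```

A plain case-sensitive sort would place "JavaScript" and "RWD" before "css", because uppercase letters compare lower than lowercase ones; sorting on the lowercase copy puts "css" first.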
Add another projection step between $unwind and $group:
...
{$project: {
tags: {$toLower: '$tags'},
count: 1
}}
...
And remove tags_lower from the $group stage.