How to find duplicate entries in collection using aggregation


{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":1, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":2, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":3, "name" : "baz"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":4, "name" : "foo"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":5, "name" : "bar"}
{ "_id" : ObjectId("4f127fa55e7242718200002d"), "id":6, "name" : "bar"}
I want to find all the duplicated entries in this collection by the "name" field using aggregation. E.g. "foo" appears twice and "bar" appears 3 times.

You can use the $group stage in an aggregation pipeline:
db.collection.aggregate([
  {
    $group: {
      _id: "$name",
      count: { $sum: 1 },
      name: { $first: "$name" }
    }
  }
])

You can group by name and count, then filter for groups with a count greater than 1.
db.collection.aggregate([
  {
    $group: {
      _id: "$name",
      count: { $sum: 1 }
    }
  },
  {
    $match: { count: { $gt: 1 } }
  }
])
Output:
{ "_id" : "foo", "count":2}
{ "_id" : "bar", "count":3}
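To make it easier to see what the two stages compute, here is a minimal in-memory sketch in plain JavaScript. It does not talk to MongoDB; the helper name `findDuplicateNames` and the `sampleDocs` array are assumptions made for illustration, with the data mirroring the documents in the question.

```javascript
// Hypothetical helper (not a MongoDB API): mimics $group (count per name)
// followed by $match (count greater than 1) on a plain array.
function findDuplicateNames(docs) {
  const counts = new Map();
  for (const doc of docs) {
    counts.set(doc.name, (counts.get(doc.name) || 0) + 1);
  }
  // Keep only names that occur more than once, in first-seen order.
  return [...counts]
    .filter(([, count]) => count > 1)
    .map(([name, count]) => ({ _id: name, count }));
}

const sampleDocs = [
  { id: 1, name: "foo" },
  { id: 2, name: "bar" },
  { id: 3, name: "baz" },
  { id: 4, name: "foo" },
  { id: 5, name: "bar" },
  { id: 6, name: "bar" }
];

console.log(findDuplicateNames(sampleDocs));
```

On a real collection the pipeline itself is the better tool, since the grouping then happens on the server rather than in the client.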

Related

Mongo Aggregate Group List / Filtered Results

Here are my records in MongoDB:
{ "_id" : 1,"userId" : "x", "name" : "Central", "borough": "Manhattan"},
{ "_id" : 2,"userId" : "x", "name" : "Rock", "borough" : "Queens"},
{ "_id" : 3,"userId" : "y", "name" : "Empire", "borough" : "Brooklyn"},
{ "_id" : 4,"userId" : "y", "name" : "Stana", "borough" : "Manhattan"},
{ "_id" : 5,"userId" : "y", "name" : "Jane", "borough" :"Brooklyn"}
How can I aggregate these by the userId field to get a result like this?
[
{
x : [{"_id":1,"name":"Central"},{"_id":2,"name":"Rock"}],
y:[{ "_id" : 3,"name":"Empire"},{ "_id" : 4,"name":"Stana"},{ "_id" : 5,"name":"Jane"}]
}
]

$group by userId and construct docs array with required fields
$group by null and construct array of docs in key-value format
$arrayToObject convert docs array to object
$replaceRoot to replace above converted object to root
db.collection.aggregate([
  {
    $group: {
      _id: "$userId",
      docs: {
        $push: {
          _id: "$_id",
          name: "$name"
        }
      }
    }
  },
  {
    $group: {
      _id: null,
      docs: {
        $push: {
          k: "$_id",
          v: "$docs"
        }
      }
    }
  },
  { $replaceRoot: { newRoot: { $arrayToObject: "$docs" } } }
])
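The four steps above can be sketched in plain JavaScript, which may help when checking the expected shape of the result. This is an in-memory illustration only; `groupByUserId` and `records` are hypothetical names, and the data is copied from the question.

```javascript
// Hypothetical helper (not a MongoDB API): builds one object keyed by userId,
// like the two $group stages followed by $arrayToObject and $replaceRoot.
function groupByUserId(docs) {
  const result = {};
  for (const { _id, userId, name } of docs) {
    if (!result[userId]) {
      result[userId] = [];
    }
    // Keep only the fields requested in the question.
    result[userId].push({ _id, name });
  }
  return result;
}

const records = [
  { _id: 1, userId: "x", name: "Central", borough: "Manhattan" },
  { _id: 2, userId: "x", name: "Rock", borough: "Queens" },
  { _id: 3, userId: "y", name: "Empire", borough: "Brooklyn" },
  { _id: 4, userId: "y", name: "Stana", borough: "Manhattan" },
  { _id: 5, userId: "y", name: "Jane", borough: "Brooklyn" }
];

console.log(JSON.stringify(groupByUserId(records), null, 2));
```

The sketch visits documents in array order, whereas $group does not promise any particular order for its output, so the key order in the pipeline result may differ.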

Is there a way to add a counter to mongodb query?

I would like to add a counter to the documents that match my query. E.g., 1st document has counter = 1, 2nd document has counter = 2, and so on.
Here's a snippet of the data:
"_id": ObjectId("5d1b9aea5c1dd54e8c773f42")
"timestamp":
[
"systemTimestamp": 2019-07-02T17:56:53.765+00:00
"serverTimestamp": 0001-01-01T00:00:00.000+00:00
"systemTimeZone": "System.CurrentSystemTimeZone"
]
"urlData":
[0]:
"fullUrl":"https://imgur.com/gallery/EfaQnPY"
"UID":"00000-W3W6C42GWTRE960"
"safety": "safe"
My query (this is copied from the Compass UI):
$match:
{
$and: [{"UID": "00000-WVUCW3JW7OTHDVE"},
{"timestamp.serverTimestamp":
{
$gte:ISODate("2019-08-01T00:00"),
$lte:ISODate("2019-09-30T00:00")
}}]
}
$unwind:
{
path: "$urlData",
includeArrayIndex: 'index'
}
$match:
{
"index": 0
}
$project:
{
_id: 0,
date: { $dateToString: {
format: "%Y-%m-%d",
date: "$timestamp.serverTimestamp"}},
safety: "$safety",
url: "$urlData.fullUrl",
UID: "$UID"
}
Is there any way to add something to $project to include a counter?

The answer is in your question itself. You can get the expected output by pushing the output of the last pipeline stage into an array and then unwinding that array again with includeArrayIndex.
Let's say, I have the following data:
{
"_id" : ObjectId("5d81c3b7a832f81a9e02337b"),
"first" : "John",
"last" : "Smith"
}
{
"_id" : ObjectId("5d81c3b7a832f81a9e02337c"),
"first" : "Alice",
"last" : "Johnson"
}
{
"_id" : ObjectId("5d81c3b7a832f81a9e02337d"),
"first" : "Bob",
"last" : "Williams"
}
On running the following query:
db.collection.aggregate([
  {
    $group: {
      "_id": null,
      "data": {
        $push: "$$ROOT"
      }
    }
  },
  {
    $unwind: {
      "path": "$data",
      includeArrayIndex: 'counter'
    }
  },
  {
    $addFields: {
      "data.counter": {
        $sum: ["$counter", 1]
      }
    }
  },
  {
    $replaceRoot: {
      "newRoot": "$data"
    }
  }
]).pretty()
Output would be:
{
"_id" : ObjectId("5d81c3b7a832f81a9e02337b"),
"first" : "John",
"last" : "Smith",
"counter" : 1
}
{
"_id" : ObjectId("5d81c3b7a832f81a9e02337c"),
"first" : "Alice",
"last" : "Johnson",
"counter" : 2
}
{
"_id" : ObjectId("5d81c3b7a832f81a9e02337d"),
"first" : "Bob",
"last" : "Williams",
"counter" : 3
}
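The idea behind this pipeline (collect everything into one array, then number the elements by position) can be sketched in plain JavaScript. This is an in-memory illustration, not a MongoDB call; `addCounter` and `people` are hypothetical names, and the ObjectId values are left out for brevity.

```javascript
// Hypothetical helper (not a MongoDB API): numbers documents by position.
// The array index is zero-based, like includeArrayIndex, hence the "+ 1".
function addCounter(docs) {
  return docs.map((doc, index) => ({ ...doc, counter: index + 1 }));
}

const people = [
  { first: "John", last: "Smith" },
  { first: "Alice", last: "Johnson" },
  { first: "Bob", last: "Williams" }
];

console.log(addCounter(people));
```

The counter follows whatever order the documents arrive in, so if a particular numbering matters it is probably worth adding a $sort stage before the $group in the pipeline.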

Count of MongoDB aggregation match results

I'm working with a MongoDB collection that has a lot of duplicate keys. I regularly do aggregation queries to find out what those duplicates are, so that I can dig in and find out what is and isn't different about them.
Unfortunately the database is huge and duplicates are often intentional. What I'd like to do is to find the count of keys that have duplicates, instead of printing a result with thousands of lines of output. Is this possible?
(Side Note: I do all of my querying through the shell, so solutions that don't require external tools or a lot of code would be preferred, but I understand that's not always possible.)
Example Records:
{ "_id" : 1, "type" : "example", "key" : "111111", "value" : "abc" }
{ "_id" : 2, "type" : "example", "key" : "222222", "value" : "def" }
{ "_id" : 3, "type" : "example", "key" : "222222", "value" : "ghi" }
{ "_id" : 4, "type" : "example", "key" : "333333", "value" : "jkl" }
{ "_id" : 5, "type" : "example", "key" : "333333", "value" : "mno" }
{ "_id" : 6, "type" : "example", "key" : "333333", "value" : "pqr" }
{ "_id" : 7, "type" : "example", "key" : "444444", "value" : "stu" }
{ "_id" : 8, "type" : "example", "key" : "444444", "value" : "vwx" }
{ "_id" : 9, "type" : "example", "key" : "444444", "value" : "yz1" }
{ "_id" : 10, "type" : "example", "key" : "444444", "value" : "234" }
Here is the query that I've been using to find duplicates based on key:
db.collection.aggregate([
  {
    $match: {
      type: "example"
    }
  },
  {
    $group: {
      _id: "$key",
      count: {
        $sum: 1
      }
    }
  },
  {
    $match: {
      count: {
        $gt: 1
      }
    }
  }
])
Which gives me an output of:
{
"_id": "222222",
"count": 2
},
{
"_id": "333333",
"count": 3
},
{
"_id": "444444",
"count": 4
}
The result I want to get instead:
3

You are almost there, just missing the last $count:
db.collection.aggregate([
  {
    $match: {
      type: "example"
    }
  },
  {
    $group: {
      _id: "$key",
      count: {
        $sum: 1
      }
    }
  },
  {
    $match: {
      count: {
        $gt: 1
      }
    }
  },
  {
    $count: "count"
  }
])
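As a cross-check of the expected number, the same logic can be sketched in plain JavaScript. This is an in-memory illustration; `countDuplicatedKeys` and `exampleRecords` are hypothetical names, and the data repeats the example records from the question (the value field is omitted because it does not affect the count).

```javascript
// Hypothetical helper (not a MongoDB API): counts how many distinct keys
// occur more than once among documents of the given type.
function countDuplicatedKeys(docs, type) {
  const counts = new Map();
  for (const doc of docs) {
    if (doc.type !== type) continue; // first $match
    counts.set(doc.key, (counts.get(doc.key) || 0) + 1); // $group with $sum: 1
  }
  let duplicated = 0;
  for (const count of counts.values()) {
    if (count > 1) duplicated += 1; // second $match, then $count
  }
  return duplicated;
}

const keys = ["111111", "222222", "222222", "333333", "333333", "333333",
  "444444", "444444", "444444", "444444"];
const exampleRecords = keys.map((key, i) => ({ _id: i + 1, type: "example", key }));

console.log(countDuplicatedKeys(exampleRecords, "example")); // 3
```

Note that the $count stage returns a document such as { "count" : 3 } rather than a bare number, so the value still has to be read from that field.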

Akrion's answer seems to be correct, but I can't test it because we're on an older version of MongoDB. A coworker gave me an alternative solution that works on 3.2 (not sure about other versions).
Adding .toArray() will convert the results to an array, and you can then get the size of the array using .length.
db.collection.aggregate([
  {
    $match: {
      type: "example"
    }
  },
  {
    $group: {
      _id: "$key",
      count: {
        $sum: 1
      }
    }
  },
  {
    $match: {
      count: {
        $gt: 1
      }
    }
  }
]).toArray().length

mongodb count number of documents for every category

My collection looks like this:
{
"_id":ObjectId("5744b6cd9c408cea15964d18"),
"uuid":"bbde4bba-062b-4024-9bb0-8b12656afa7e",
"version":1,
"categories":["sport"]
},
{
"_id":ObjectId("5745d2bab047379469e10e27"),
"uuid":"bbde4bba-062b-4024-9bb0-8b12656afa7e",
"version":2,
"categories":["sport", "shopping"]
},
{
"_id":ObjectId("5744b6359c408cea15964d15"),
"uuid":"561c3705-ba6d-432b-98fb-254483fcbefa",
"version":1,
"categories":["politics"]
}
I want to count the number of documents for every category. To do this, I unwind the categories array:
db.collection.aggregate(
{$unwind: '$categories'},
{$group: {_id: '$categories', count: {$sum: 1}} }
)
Result:
{ "_id" : "sport", "count" : 2 }
{ "_id" : "shopping", "count" : 1 }
{ "_id" : "politics", "count" : 1 }
Now I want to count the number of documents for every category, but where document version is the latest version.
This is where I am stuck.

It's ugly but I think this gives you what you're after:
db.collection.aggregate(
  // One document per (document, category) pair
  { $unwind : "$categories" },
  // Per uuid: collect every version/category pair and find the highest version
  { $group :
    { "_id" : { "uuid" : "$uuid" },
      "doc" : { $push : { "version" : "$version", "category" : "$categories" } },
      "maxVersion" : { $max : "$version" }
    }
  },
  { $unwind : "$doc" },
  // Flag the pairs that belong to the latest version of their uuid
  { $project : { "_id" : 0, "uuid" : "$_id.uuid", "category" : "$doc.category", "isCurrentVersion" : { $eq : [ "$doc.version", "$maxVersion" ] } } },
  { $match : { "isCurrentVersion" : true }},
  // Count the remaining documents per category
  { $group : { "_id" : "$category", "count" : { $sum : 1 } } }
)
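The intent of this pipeline (keep only the latest version of each uuid, then count categories) may be easier to follow as a plain JavaScript sketch. This is an in-memory illustration, not MongoDB code; `countCategoriesOfLatestVersions` and `articles` are hypothetical names, and the data is taken from the question with the _id fields omitted.

```javascript
// Hypothetical helper (not a MongoDB API): keeps the highest-version document
// per uuid, then counts how often each category occurs among those documents.
function countCategoriesOfLatestVersions(docs) {
  const latest = new Map(); // uuid -> document with the highest version so far
  for (const doc of docs) {
    const current = latest.get(doc.uuid);
    if (!current || doc.version > current.version) {
      latest.set(doc.uuid, doc);
    }
  }
  const counts = {};
  for (const doc of latest.values()) {
    for (const category of doc.categories) {
      counts[category] = (counts[category] || 0) + 1;
    }
  }
  return counts;
}

const articles = [
  { uuid: "bbde4bba-062b-4024-9bb0-8b12656afa7e", version: 1, categories: ["sport"] },
  { uuid: "bbde4bba-062b-4024-9bb0-8b12656afa7e", version: 2, categories: ["sport", "shopping"] },
  { uuid: "561c3705-ba6d-432b-98fb-254483fcbefa", version: 1, categories: ["politics"] }
];

console.log(countCategoriesOfLatestVersions(articles));
```

For the sample data each of the three categories ends up with a count of 1, because version 1 of the first uuid is superseded by version 2.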

You can do this by first grouping the denormalized documents (produced by the $unwind step) by two keys: the categories and version fields. This is necessary for the next pipeline step, which uses the $sort operator to order the grouped documents and their accumulated counts by version (descending) and categories (ascending).
Another grouping is then required to pick the top document in each categories group after ordering, using the $first operator. The following shows this:
db.collection.aggregate(
  { "$unwind": "$categories" },
  {
    "$group": {
      "_id": {
        'categories': '$categories',
        'version': '$version'
      },
      "count": { "$sum": 1 }
    }
  },
  { "$sort": { "_id.version": -1, "_id.categories": 1 } },
  {
    "$group": {
      "_id": "$_id.categories",
      "count": { "$first": "$count" },
      "version": { "$first": "$_id.version" }
    }
  }
)
Sample Output
{ "_id" : "shopping", "count" : 1, "version" : 2 }
{ "_id" : "sport", "count" : 1, "version" : 2 }
{ "_id" : "politics", "count" : 1, "version" : 1 }
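A plain JavaScript sketch of the same group, sort, and $first logic is shown here to make the behaviour explicit. It is an in-memory illustration only; `countForHighestVersionPerCategory` and `versionedDocs` are hypothetical names, and the data is taken from the question with the _id fields omitted.

```javascript
// Hypothetical helper (not a MongoDB API): counts documents per
// (category, version) pair and keeps, for each category, the count that
// belongs to its highest version.
function countForHighestVersionPerCategory(docs) {
  const perCategory = new Map(); // category -> Map(version -> count)
  for (const doc of docs) {
    for (const category of doc.categories) {
      if (!perCategory.has(category)) {
        perCategory.set(category, new Map());
      }
      const versions = perCategory.get(category);
      versions.set(doc.version, (versions.get(doc.version) || 0) + 1);
    }
  }
  const result = [];
  for (const [category, versions] of perCategory) {
    const version = Math.max(...versions.keys());
    result.push({ _id: category, count: versions.get(version), version });
  }
  return result;
}

const versionedDocs = [
  { uuid: "bbde4bba-062b-4024-9bb0-8b12656afa7e", version: 1, categories: ["sport"] },
  { uuid: "bbde4bba-062b-4024-9bb0-8b12656afa7e", version: 2, categories: ["sport", "shopping"] },
  { uuid: "561c3705-ba6d-432b-98fb-254483fcbefa", version: 1, categories: ["politics"] }
];

console.log(countForHighestVersionPerCategory(versionedDocs));
```

As the sketch suggests, this approach selects the highest version seen per category rather than per uuid, so it appears to match the question only when all documents in a category share the same latest version number.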

MongoDb get distinct items after grouping

I'm using mongodb with the following collection sample
{
"_id" : ObjectId("5703750ca9c436386c4814c9"),
"user_id" : NumberLong(17),
"activitytype_id" : NumberLong(1),
"created_date" : ISODate("2015-10-03T03:52:03.000Z")
},
{
"_id" : ObjectId("5703750ca9c436386c4814ca"),
"s_id" : NumberLong(132919),
"user_id" : NumberLong(17),
"activitytype_id" : NumberLong(4),
"created_date" : ISODate("2016-03-18T17:13:43.000Z")
},
{
"_id" : ObjectId("5703750ca9c436386c4814cb"),
"s_id" : NumberLong(215283),
"user_id" : NumberLong(17),
"activitytype_id" : NumberLong(4),
"created_date" : ISODate("2015-10-03T04:12:33.000Z")
}
,
{
"_id" : ObjectId("5703750ca9c436386c4814cc"),
"s_id" : NumberLong(360888),
"user_id" : NumberLong(17),
"activitytype_id" : NumberLong(4),
"created_date" : ISODate("2015-10-03T04:12:41.000Z")
}
This is my aggregation pipeline
db.activitylogs.aggregate([
  { $group: {
    _id: {
      user_id: "$user_id",
      activitytype_id: "$activitytype_id"
    },
    activity_log_docs: {
      $addToSet: {
        s_id: "$s_id",
        friend_id: "$friend_id",
        playlist_id: "$playlist_id",
        created_date: "$created_date"
      }
    }
  }},
])
I need to get distinct s_id in activity_log_docs.
I need to avoid duplicated s_id values in the activity_log_docs array, so that each s_id appears only once.

I think something like this should do:
db.activitylogs.aggregate([
  // First group: one document per (user_id, activitytype_id, s_id)
  { $group: {
    _id: {
      user_id: "$user_id",
      activitytype_id: "$activitytype_id",
      s_id: "$s_id"
    },
    friend_id: { $first: "$friend_id" },
    playlist_id: { $first: "$playlist_id" },
    created_date: { $first: "$created_date" }
  }},
  // Second group: collect the now-distinct s_id entries per user and activity type
  { $group: {
    _id: {
      user_id: "$_id.user_id",
      activitytype_id: "$_id.activitytype_id"
    },
    activity_log_docs: {
      $addToSet: {
        s_id: "$_id.s_id",
        friend_id: "$friend_id",
        playlist_id: "$playlist_id",
        created_date: "$created_date"
      }
    }
  }},
])
But please double-check your own field names.
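The two-step grouping can also be sketched in plain JavaScript to show why each s_id ends up only once per group. This is an in-memory illustration, not MongoDB code; `groupDistinctBySId` and `activityLogs` are hypothetical names, and the sample uses plain numbers and date strings in place of NumberLong and ISODate, with one duplicated s_id added as an assumption for demonstration.

```javascript
// Hypothetical helper (not a MongoDB API): groups logs by user_id and
// activitytype_id, keeping only the first entry seen for each s_id.
function groupDistinctBySId(logs) {
  const groups = new Map(); // "user_id|activitytype_id" -> group
  for (const log of logs) {
    const groupKey = `${log.user_id}|${log.activitytype_id}`;
    if (!groups.has(groupKey)) {
      groups.set(groupKey, {
        _id: { user_id: log.user_id, activitytype_id: log.activitytype_id },
        seen: new Map() // s_id -> first entry with that s_id
      });
    }
    const group = groups.get(groupKey);
    if (!group.seen.has(log.s_id)) {
      // Like $first in the first $group stage: later duplicates are ignored.
      group.seen.set(log.s_id, {
        s_id: log.s_id,
        friend_id: log.friend_id,
        playlist_id: log.playlist_id,
        created_date: log.created_date
      });
    }
  }
  return [...groups.values()].map(({ _id, seen }) => ({
    _id,
    activity_log_docs: [...seen.values()]
  }));
}

const activityLogs = [
  { user_id: 17, activitytype_id: 1, created_date: "2015-10-03T03:52:03.000Z" },
  { user_id: 17, activitytype_id: 4, s_id: 132919, created_date: "2016-03-18T17:13:43.000Z" },
  { user_id: 17, activitytype_id: 4, s_id: 215283, created_date: "2015-10-03T04:12:33.000Z" },
  // Assumed duplicate of s_id 215283, added only to demonstrate the deduplication.
  { user_id: 17, activitytype_id: 4, s_id: 215283, created_date: "2015-10-03T04:12:41.000Z" }
];

console.log(JSON.stringify(groupDistinctBySId(activityLogs), null, 2));
```

Which of the duplicated entries survives depends on the order in which documents are visited; in the pipeline, a $sort before the first $group would be one way to make that choice deliberate.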