Find duplicate records in MongoDB - mongodb

How would I find duplicate fields in a mongo collection?
I'd like to check if any of the "name" fields are duplicates.
{
"name" : "ksqn291",
"__v" : 0,
"_id" : ObjectId("540f346c3e7fc1054ffa7086"),
"channel" : "Sales"
}
Many thanks!

Use aggregation on name and get name with count > 1:
db.collection.aggregate([
{"$group" : { "_id": "$name", "count": { "$sum": 1 } } },
{"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } },
{"$project": {"name" : "$_id", "_id" : 0} }
]);
To sort the results by most to least duplicates:
db.collection.aggregate([
{"$group" : { "_id": "$name", "count": { "$sum": 1 } } },
{"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } },
{"$sort": {"count" : -1} },
{"$project": {"name" : "$_id", "_id" : 0} }
]);
To check a field other than "name", change "$name" to "$column_name".
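For anyone who wants to see what the pipeline computes without a running server, here is a minimal sketch in plain JavaScript. It only mimics the $group, $match, $sort and $project stages in memory. findDuplicateValues and sampleDocs are made-up names for illustration and are not part of any MongoDB API.

```javascript
// In-memory sketch of: $group by field, $match count > 1, $sort desc, $project.
// findDuplicateValues is a hypothetical helper, not a MongoDB function.
function findDuplicateValues(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    const value = doc[field];
    // Mirrors {"_id": {"$ne": null}}: skip null or missing values.
    if (value === null || value === undefined) continue;
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  return [...counts.entries()]
    .filter(([, count]) => count > 1)        // $match: count > 1
    .sort((a, b) => b[1] - a[1])             // $sort: most duplicated first
    .map(([value]) => ({ [field]: value })); // $project: keep only the field
}

// Hypothetical sample data, loosely based on the document in the question.
const sampleDocs = [
  { name: "ksqn291", channel: "Sales" },
  { name: "ksqn291", channel: "Support" },
  { name: "abc", channel: "Sales" },
  { channel: "Sales" }, // no name at all
];

console.log(findDuplicateValues(sampleDocs, "name"));
```

Passing a different field name as the second argument plays the same role as changing "$name" in the pipeline.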

You can find the list of duplicate names using the following aggregate pipeline:
Group all the records having similar name.
Match those groups having records greater than 1.
Then group again to project all the duplicate names as an array.
The Code:
db.collection.aggregate([
{$group:{"_id":"$name","name":{$first:"$name"},"count":{$sum:1}}},
{$match:{"count":{$gt:1}}},
{$project:{"name":1,"_id":0}},
{$group:{"_id":null,"duplicateNames":{$push:"$name"}}},
{$project:{"_id":0,"duplicateNames":1}}
])
Output:
{ "duplicateNames" : [ "ksqn291", "ksqn29123213Test" ] }

The answer anhic gave can be very inefficient if you have a large database and the attribute name is present only in some of the documents.
To improve efficiency you can add a $match to the aggregation.
db.collection.aggregate([
{"$match": {"name" :{ "$ne" : null } } },
{"$group" : {"_id": "$name", "count": { "$sum": 1 } } },
{"$match": {"count" : {"$gt": 1} } },
{"$project": {"name" : "$_id", "_id" : 0} }
])

Another option is to use the $sortByCount stage.
db.collection.aggregate([
{ $sortByCount: '$name' }
])
This is a combination of $group and $sort.
The $sortByCount stage is equivalent to the following $group + $sort sequence:
{ $group: { _id: <expression>, count: { $sum: 1 } } },
{ $sort: { count: -1 } }
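Note that $sortByCount on its own lists every value, including those that occur only once, so you still need a filter on count > 1 to keep just the duplicates. A small plain JavaScript sketch of the equivalence, where sortByCountSketch is a made-up helper name and the data is invented:

```javascript
// In-memory sketch of { $sortByCount: "$field" }: group, count, sort descending.
// sortByCountSketch is a hypothetical name, not a MongoDB API.
function sortByCountSketch(docs, field) {
  const counts = new Map();
  for (const doc of docs) {
    // $group puts documents with a missing field under the null key.
    const key = doc[field] === undefined ? null : doc[field];
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([key, count]) => ({ _id: key, count }))
    .sort((a, b) => b.count - a.count);
}

const sortByCountDocs = [
  { name: "a" }, { name: "b" }, { name: "a" },
  { name: "c" }, { name: "b" }, { name: "a" },
];

const allCounts = sortByCountSketch(sortByCountDocs, "name");
// Same idea as appending { $match: { count: { $gt: 1 } } } to the pipeline.
const onlyDuplicates = allCounts.filter((row) => row.count > 1);
console.log(allCounts, onlyDuplicates);
```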

db.getCollection('orders').aggregate([
{$group: {
_id: {name: "$name"},
uniqueIds: {$addToSet: "$_id"},
count: {$sum: 1}
}
},
{$match: {
count: {"$gt": 1}
}
}
])
First, the $group stage groups the documents by the chosen field, collecting the _id values and counting the documents in each group.
If the count is greater than 1, the field value is duplicated in the collection, and the $match stage keeps only those groups.

This is how we can achieve this in MongoDB Compass.

In case you need to see all duplicated rows:
db.collection.aggregate([
{"$group" : { "_id": "$name", "count": { "$sum": 1 }, "data": { "$push": "$$ROOT" } } },
{"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } },
{"$unwind": "$data"}
]);
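To illustrate the shape of the result, here is a rough in-memory equivalent in plain JavaScript. Each output row carries the group key, the group size, and one original document under data, which is what $push with $$ROOT followed by $unwind produces. duplicatedRows and the sample documents are made up for this sketch.

```javascript
// In-memory sketch: collect whole documents per key, keep groups with more
// than one member, then emit one row per document (like $unwind on "data").
// duplicatedRows is a hypothetical helper name.
function duplicatedRows(docs, field) {
  const groups = new Map();
  for (const doc of docs) {
    const key = doc[field];
    if (key === null || key === undefined) continue; // _id: { $ne: null }
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(doc); // $push: "$$ROOT"
  }
  return [...groups.entries()]
    .filter(([, rows]) => rows.length > 1) // count > 1
    .flatMap(([key, rows]) =>
      rows.map((data) => ({ _id: key, count: rows.length, data })));
}

const rowDocs = [
  { _id: 1, name: "x" },
  { _id: 2, name: "y" },
  { _id: 3, name: "x" },
  { _id: 4 },
];
const duplicatedRowResult = duplicatedRows(rowDocs, "name");
console.log(duplicatedRowResult);
```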

If somebody is looking for a query for duplicates with an extra "and" condition, like "and where someOtherField is true":
The trick is to start with a $match on that condition, because after the grouping the original fields are no longer available.
db.collection.aggregate([
// Do a first match before the grouping
{ $match: { "someOtherField": true } },
{ $group: {
_id: { name: "$name" },
count: { $sum: 1 }
}},
{ $match: { count: { $gte: 2 } } }
]);
I searched for a very long time to find this notation; I hope it helps somebody with the same problem.
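The effect of filtering before the grouping can be seen in a small plain JavaScript sketch. duplicatesWhere and the sample data are invented for illustration; the predicate plays the role of the first $match.

```javascript
// In-memory sketch: filter first, then group and count, then keep count >= 2.
// duplicatesWhere is a hypothetical helper name.
function duplicatesWhere(docs, field, predicate) {
  const counts = new Map();
  for (const doc of docs.filter(predicate)) { // first $match
    const key = doc[field];
    counts.set(key, (counts.get(key) || 0) + 1); // $group with $sum: 1
  }
  return [...counts.entries()]
    .filter(([, count]) => count >= 2) // second $match
    .map(([value, count]) => ({ _id: { [field]: value }, count }));
}

const flaggedDocs = [
  { name: "a", someOtherField: true },
  { name: "a", someOtherField: true },
  { name: "b", someOtherField: true },
  { name: "b", someOtherField: false },
];

// Only "a" is duplicated among documents where someOtherField is true.
const flaggedDuplicates = duplicatesWhere(flaggedDocs, "name", (d) => d.someOtherField === true);
console.log(flaggedDuplicates);
```

Without the predicate, "b" would also be reported, since the grouping alone cannot tell the two "b" documents apart.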

Search for duplicates in MongoDB Compass using $sortByCount:
[screenshot]: https://i.stack.imgur.com/L85QV.png

Sometimes you want to find duplicates regardless of case, for instance before creating a case-insensitive index. In this case you can use this aggregation pipeline:
db.collection.aggregate([
{'$group': {'_id': {'$toLower': '$name'}, 'count': { '$sum': 1 }, 'duplicates': { '$push': '$$ROOT' } } },
{'$match': { 'count': { '$gt': 1 } } }
]);
Explanation:
Group by name, converting it to lower case first, and push the documents to the duplicates array.
Match those groups that have more than 1 record (the duplicates).
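As a rough illustration of the same idea outside the database, here is a plain JavaScript sketch that groups by the lower-cased name. caseInsensitiveDuplicates is a made-up helper, and skipping non-string values is a simplification of this sketch, not something the pipeline does.

```javascript
// In-memory sketch: group by lower-cased value, keep groups with count > 1.
// caseInsensitiveDuplicates is a hypothetical helper name.
function caseInsensitiveDuplicates(docs, field) {
  const groups = new Map();
  for (const doc of docs) {
    // Simplification for this sketch: ignore missing or non-string values.
    if (typeof doc[field] !== "string") continue;
    const key = doc[field].toLowerCase(); // plays the role of $toLower
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(doc); // $push: "$$ROOT"
  }
  return [...groups.entries()]
    .filter(([, docsInGroup]) => docsInGroup.length > 1)
    .map(([key, docsInGroup]) => ({
      _id: key,
      count: docsInGroup.length,
      duplicates: docsInGroup,
    }));
}

const mixedCaseDocs = [
  { name: "Alice" },
  { name: "ALICE" },
  { name: "bob" },
  {},
];
const mixedCaseResult = caseInsensitiveDuplicates(mixedCaseDocs, "name");
console.log(JSON.stringify(mixedCaseResult));
```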

Related

Check duplicates of certain field for documents array with inner array

I have 2 objects:
{
_id: ObjectId("5cd9010310b80b3e38cd3f88"),
subGroup: [
{
bookList: [
{
title: "A good book",
id: "abc123"
}
]
}
]
}
{
_id: ObjectId("5cd9010710b80b3e38cd3f89"),
subGroup: [
{
bookList: [
{
title: "A good book",
id: "abc123"
}
]
}
]
}
These are 2 different objects. I would like to detect the occurrence of these 2 objects where the title is duplicated (i.e. the same).
I tried this query
db.scope.aggregate({"$unwind": "$subGroup.bookList"}, {"$group" : { "_id": "$title", "count": { "$sum": 1 } } }, {"$match": {"id" :{ "$ne" : null } , "count" : {"$gt": 1} } })
which I based on other threads on Stack Overflow. However, it does not return anything. How can I solve this?
There are a few issues here:
$unwind should be run on subGroup and on subGroup.bookList separately
when specifying _id for the $group stage you should use the full path (subGroup.bookList.title)
in your $match stage you want to check whether _id (not id) is $ne null
Try:
db.col.aggregate([
{"$unwind": "$subGroup"},
{"$unwind": "$subGroup.bookList"},
{"$group" : { "_id": "$subGroup.bookList.title", "count": { "$sum": 1 } } },
{"$match": { "_id" :{ "$ne" : null } , "count" : { "$gt": 1} } }
])
Mongo playground
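To make the two $unwind steps concrete, here is a sketch in plain JavaScript that walks the same nesting with two loops. It assumes subGroup is an array of objects that each hold a bookList array, which is what the corrected pipeline implies. duplicateTitles is a made-up helper, and the string _id values stand in for ObjectIds.

```javascript
// In-memory sketch: one loop per $unwind, then count titles and keep count > 1.
// duplicateTitles is a hypothetical helper name.
function duplicateTitles(docs) {
  const counts = new Map();
  for (const doc of docs) {
    for (const group of doc.subGroup || []) {    // $unwind: "$subGroup"
      for (const book of group.bookList || []) { // $unwind: "$subGroup.bookList"
        if (book.title === null || book.title === undefined) continue;
        counts.set(book.title, (counts.get(book.title) || 0) + 1);
      }
    }
  }
  return [...counts.entries()]
    .filter(([, count]) => count > 1)
    .map(([title, count]) => ({ _id: title, count }));
}

const bookDocs = [
  { _id: "5cd9010310b80b3e38cd3f88", subGroup: [{ bookList: [{ title: "A good book", id: "abc123" }] }] },
  { _id: "5cd9010710b80b3e38cd3f89", subGroup: [{ bookList: [{ title: "A good book", id: "abc123" }] }] },
];
const titleResult = duplicateTitles(bookDocs);
console.log(titleResult);
```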

MongoDB: Null check in between Pipeline Stages

If I create a collection like so:
db.People.insert({"Name": "John"})
and run a simple mongo aggregate, like so:
db.People.aggregate([{$match: {Name: "John"}}, {$group: {_id: "null", count: {$sum: 1}}}])
This counts all the Johns in the collection and returns this
{ "_id" : "null", "count" : 1 }
Which is nice. But if I search for the name "Clarice" that does not exist at all, it returns null.
I would like it to return
{ "_id" : "null", "count" : 0 }
I have not found a way to achieve this. I would have to include some kind of null-check between the $match- and $group-stage.
You have to use the $facet stage along with the $ifNull operator, e.g.:
db.People.aggregate([
{ "$facet": {
"array": [
{ "$match": { Name:"John" }},
{ "$group": {
"_id": null,
"count": { "$sum": 1 }
}},
{ "$project": { "_id": 0, "count": 1 }}
]
}},
{ "$project": {
"count": {
"$ifNull": [{ "$arrayElemAt": ["$array.count", 0] }, 0 ]
}
}}
])
Output:
{ "count" : 1 }
For any other name, the output is as follows:
{ "count" : 0 }
There is a similar answer at "$addFields when no $match found".
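The reason this works is that $facet always emits one document, even when its sub-pipeline produces nothing, so there is something for $ifNull to fall back from. A plain JavaScript sketch of that logic, with countByNameSketch and peopleDocs as made-up names:

```javascript
// In-memory sketch of the $facet + $ifNull idea.
// countByNameSketch is a hypothetical helper name.
function countByNameSketch(people, name) {
  const matched = people.filter((p) => p.Name === name); // $match
  // $group emits no document at all when its input is empty.
  const grouped = matched.length > 0 ? [{ count: matched.length }] : [];
  const first = grouped[0]; // $arrayElemAt: ["$array.count", 0]
  return { count: first === undefined ? 0 : first.count }; // $ifNull fallback
}

const peopleDocs = [{ Name: "John" }];
console.log(countByNameSketch(peopleDocs, "John"));
console.log(countByNameSketch(peopleDocs, "Clarice"));
```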
Simply use count:
db.People.count({Name:"John"})
This will return the exact number.
Otherwise, you need to check whether the result is an empty array. Below is the code for Node using LoopBack:
db.People.aggregate([
{$match: {Name: "John"}},
{$group: {_id: "null", count: {$sum: 1}}}
],(err,res)=>{
if(err) return cb(err)
if(res.length) return cb(err,res)
else return cb(err,{_id:null,count:0})
})
You can use $ifNull in your $match stage.
If you can provide a collection of examples, it is easier to elaborate an answer on it.
Edit: if you group by Name, the result for "John" is one, and for "Clarice" it is an empty array, which is correct. Here is the aggregation query:
db.People.aggregate([
{
$match: { Name: "John" }
},
{
$group: { _id: "$Name", count: { $sum: 1 } }
}
])

Find number of duplicates documents

I had a bug in my code during development that created some duplicated users in my MongoDB.
Collection example:
{
"_id" : ObjectId("5abb9d72b884fb00389efeef"),
"user" : ObjectId("5abb9d72b884fb00389efee5"),
"displayName" : "test",
"fullName" : "test test test",
"email" : "test#mail.com",
"phoneNumber" : "99999999999",
"createdAt" : ISODate("2016-05-18T13:49:38.533Z")
}
I was able to find the duplicated users with this query:
db.users.aggregate([{$group: {_id: "$user", "Total": {$sum: 1}}}, {
$match: { "Total": {$gt: 1}}}])
And count them with this one:
db.users.aggregate([{$group: {_id: "$user", "Total": {$sum: 1}}}, {
$match: { "Total": {$gt: 1}}}, { $count: "Total"}])
I want to know how many users I'll need to delete, but the second query only returns me the total of unique users affected.
How can I get a sum of duplicated users? Or a sum of "Total".
Expected result:
{ "Total" : **** }
Well, you can do this with the following pipeline
[
{ $group: {
_id: null,
uniqueValues: { $addToSet: "$user" },
count: { $sum: 1 }
}},
{ $project: {
total: { $subtract: [ "$count", { $size: "$uniqueValues" } ] }
}}
]
I don't have your data set, so I didn't test this locally. Try this query:
db.users.aggregate([
{$group: {_id: "$user", Total: {$sum: 1}}}, //group by user and count each.
{$addFields: {Total: {$subtract:["$Total",1]}}}, // you need duplicate count, so forget first instance of it.
{$group:{_id:null, Total: {$sum:"$Total"}}}, // your _id is unique, perform a sum out of it
{$project:{_id:0, Total:1}} // at the end the result is total number of 'duplicate' users.
])
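Both pipelines above aim at the same number: total documents minus distinct users, which equals the sum of (count - 1) over the groups. A quick plain JavaScript sketch to check that the two views agree, using invented string ids in place of ObjectIds. extraCopiesByTotal and extraCopiesPerGroup are made-up helper names.

```javascript
// First approach: total documents minus the number of distinct values.
// Both helpers are hypothetical names used only for this sketch.
function extraCopiesByTotal(docs, field) {
  const distinct = new Set(docs.map((d) => String(d[field]))); // $addToSet
  return docs.length - distinct.size; // $subtract: [count, $size]
}

// Second approach: per group, forget the first instance and sum the rest.
function extraCopiesPerGroup(docs, field) {
  const counts = new Map();
  for (const d of docs) {
    const key = String(d[field]);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  let total = 0;
  for (const count of counts.values()) total += count - 1;
  return total;
}

const userDocs = [
  { user: "u1" }, { user: "u1" }, { user: "u1" },
  { user: "u2" }, { user: "u2" },
  { user: "u3" },
];
console.log(extraCopiesByTotal(userDocs, "user"), extraCopiesPerGroup(userDocs, "user"));
```

With this sample, three documents would have to be deleted to leave one per user.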

mongodb $aggregate empty array and multiple documents

MongoDB has the documents below:
> db.test.find({name:{$in:["abc","abc2"]}})
{ "_id" : 1, "name" : "abc", "scores" : [ ] }
{ "_id" : 2, "name" : "abc2", "scores" : [ 10, 20 ] }
I want to get the length of the scores array for each document. How should I do that?
I tried the command below:
db.test.aggregate({$match:{name:"abc2"}}, {$unwind: "$scores"}, {$group: {_id:null, count:{$sum:1}}} )
Result:
{ "_id" : null, "count" : 2 }
But the command below:
db.test.aggregate({$match:{name:"abc"}}, {$unwind: "$scores"}, {$group: {_id:null, count:{$sum:1}}} )
returns nothing. Questions:
How should I get the length of scores for 2 or more documents in one command?
Why does the second command return nothing, and how should I check if the array is empty?
So this is actually a common problem. The result of the $unwind phase in an aggregation pipeline where the array is "empty" is to "remove" the document from the pipeline results.
In order to return a count of "0" for such an "empty" array, you need to do something like the following.
In MongoDB 2.6 or greater, just use $size:
db.test.aggregate([
{ "$match": { "name": "abc" } },
{ "$group": {
"_id": null,
"count": { "$sum": { "$size": "$scores" } }
}}
])
In earlier versions you need to do this:
db.test.aggregate([
{ "$match": { "name": "abc" } },
{ "$project": {
"name": 1,
"scores": {
"$cond": [
{ "$eq": [ "$scores", [] ] },
{ "$const": [false] },
"$scores"
]
}
}},
{ "$unwind": "$scores" },
{ "$group": {
"_id": null,
"count": { "$sum": {
"$cond": [
"$scores",
1,
0
]
}}
}}
])
The modern operation is simple since $size will just "measure" the array. In the latter case you need to "replace" the array with a single false value when it is empty, so that $unwind does not "destroy" the document for an "empty" array.
So replacing with false allows the $cond "ternary" to choose whether to add 1 or 0 to the $sum of the overall statement.
That is how you get the length of "empty arrays".
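The difference between the two approaches is easy to see in a plain JavaScript sketch using the two documents from the question. sizePerDocument and unwindThenCount are made-up helper names; the first measures the array like $size, the second mimics $unwind followed by a per-document $group.

```javascript
// Like $size: measure each array directly, so empty arrays give 0.
// Both helpers are hypothetical names used only for this sketch.
function sizePerDocument(docs) {
  return docs.map((d) => ({
    _id: d._id,
    count: Array.isArray(d.scores) ? d.scores.length : 0,
  }));
}

// Like $unwind + $group by _id: a document with an empty array produces
// no unwound rows, so it never reaches the grouping step.
function unwindThenCount(docs) {
  const counts = new Map();
  for (const d of docs) {
    for (const _score of d.scores || []) {
      counts.set(d._id, (counts.get(d._id) || 0) + 1);
    }
  }
  return [...counts.entries()].map(([_id, count]) => ({ _id, count }));
}

const scoreDocs = [
  { _id: 1, name: "abc", scores: [] },
  { _id: 2, name: "abc2", scores: [10, 20] },
];
console.log(sizePerDocument(scoreDocs));
console.log(unwindThenCount(scoreDocs));
```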
To get the length of scores in 2 or more documents, you just need to change the _id value in the $group stage, which holds the distinct group-by key. In this case you need to group by the document _id.
Your second aggregation returns nothing because the $match stage passed a document with an empty scores array, and $unwind drops such documents. To check that the array is not empty, your match query should be
{'scores.0': {$exists: true}} or {scores: {$not: {$size: 0}}}
Overall, your aggregation should look like this:
db.test.aggregate([
{ "$match": {"scores.0": { "$exists": true } } },
{ "$unwind": "$scores" },
{
"$group": {
"_id": "$_id",
"count": { "$sum": 1 }
}
}
])