I want to find duplicate documents in my MongoDB database, and I have already got part of the way there. Say my documents look like this:
{
"_id" : ObjectId("5900b01b2ce12a2383328e61"),
"Bank Name" : "Seaway Bank and Trust Company",
"City" : "Chicago",
"ST" : "IL",
"CERT" : 19328,
"Acquiring Institution" : "State Bank of Texas",
"Closing Date" : "27-Jan-17",
"Updated Date" : "17-Feb-17"
}
and I have written a query like this:
db.list.aggregate([
{$group: {
_id: {CERT: "$CERT"},
uniqueIds: {$addToSet: "$_id"},
count: {$sum: 1}
}
},
{$match: {
count: {"$gt": 1}
}
},
{$sort: {
count: -1
}
}
]);
This gives me the IDs of all documents whose CERT value appears in more than one document, which is correct. In addition, I want to add a condition: where ST is not equal to IL. How can I do that?
Please help!
You can add another $match stage with ST not equal to IL before the $group stage, so that documents with "ST" equal to "IL" are ignored:
Final Query:
db.list.aggregate([
{
$match : {
"ST" : {$ne : "IL"}
}
},
{
$group: {
_id: {CERT: "$CERT"},
uniqueIds: {$addToSet: "$_id"},
count: {$sum : 1}
}
},
{
$match: {
count: {"$gt": 1}
}
},
{
$sort: {
count: -1
}
}
]);
Hope this helps!
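To see why the $match has to come before the $group, here is a minimal plain-JavaScript sketch that mimics the pipeline on an in-memory array. It does not talk to MongoDB; the sample documents and the helper name groupByCert are made up for illustration.

```javascript
// Hypothetical sample documents; only CERT and ST matter here.
const docs = [
  { _id: 1, CERT: 100, ST: "IL" },
  { _id: 2, CERT: 100, ST: "TX" },
  { _id: 3, CERT: 100, ST: "TX" },
  { _id: 4, CERT: 200, ST: "IL" },
  { _id: 5, CERT: 200, ST: "IL" },
  { _id: 6, CERT: 300, ST: "GA" },
];

// Mimics the $group stage: one group per CERT, collecting ids and counting.
// (_id values are unique, so pushing behaves like $addToSet here.)
function groupByCert(input) {
  const groups = new Map();
  for (const doc of input) {
    if (!groups.has(doc.CERT)) {
      groups.set(doc.CERT, { _id: { CERT: doc.CERT }, uniqueIds: [], count: 0 });
    }
    const group = groups.get(doc.CERT);
    group.uniqueIds.push(doc._id);
    group.count += 1;
  }
  return [...groups.values()];
}

// Filter on ST first ($match), then group, then keep groups with count > 1.
const duplicatesOutsideIL = groupByCert(docs.filter((d) => d.ST !== "IL"))
  .filter((g) => g.count > 1)
  .sort((a, b) => b.count - a.count);

console.log(JSON.stringify(duplicatesOutsideIL));
```

After grouping, the ST field is no longer available on the grouped documents, which is why the filter on ST is applied to the raw documents first.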
You can use this:
db.list.aggregate([
{$group: {
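// notIL is true when ST is not "IL"; inside $group, $ne takes two arguments
_id: {CERT: "$CERT", notIL: {$ne: ["$ST", "IL"]}},
uniqueIds: {$addToSet: "$_id"},
count: {$sum: 1}
}
},
{$match: {
// keep only the groups whose documents are outside IL
"_id.notIL": true,
count: {"$gt": 1}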
}
},
{$sort: {
count: -1
}
}
]);
Let me know if it does not work or you need more help.
Related
I have the following MongoDB query:
db.my_collection.aggregate([
{
$group: {"_id": "$day", count: {$sum: "$myValue"}}
}
])
It returns the following result:
{
"_id" : ISODate("2020-02-10T00:00:00.000+01:00"),
"count" : 10
},
{
"_id" : ISODate("2020-02-01T00:00:00.000+01:00"),
"count" : 2
}
Is it possible to make two arrays from this result as below?
{
"days": [ISODate("2020-02-10T00:00:00.000+01:00"), ISODate("2020-02-01T00:00:00.000+01:00")],
"values": [10, 2]
}
Yes, just add another $group stage:
db.my_collection.aggregate([
{
$group: {
"_id": "$day", count: {$sum: "$myValue"}
}
},
{
$group: {
"_id": null,
days: {$push: "$_id"},
values: {$push: "$count"}
}
}
])
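As a rough illustration of what the second $group does, here is a plain-JavaScript sketch that mimics grouping with _id: null and two $push accumulators. The input array stands in for the output of the first $group stage, with dates written as strings for brevity; the variable names are made up.

```javascript
// Hypothetical output of the first $group stage.
const grouped = [
  { _id: "2020-02-10", count: 10 },
  { _id: "2020-02-01", count: 2 },
];

// Mimics { $group: { _id: null, days: { $push: "$_id" }, values: { $push: "$count" } } }:
// every row lands in the same group, and both arrays are filled in the same pass,
// so days[i] and values[i] always belong together.
const combined = grouped.reduce(
  (acc, row) => {
    acc.days.push(row._id);
    acc.values.push(row.count);
    return acc;
  },
  { _id: null, days: [], values: [] }
);

console.log(JSON.stringify(combined));
```

The result still carries "_id": null; if that is unwanted, a final $project stage with _id: 0 should remove it.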
How would I find duplicate fields in a mongo collection?
I'd like to check if any of the "name" fields are duplicates.
{
"name" : "ksqn291",
"__v" : 0,
"_id" : ObjectId("540f346c3e7fc1054ffa7086"),
"channel" : "Sales"
}
Many thanks!
Use aggregation on name and get name with count > 1:
db.collection.aggregate([
{"$group" : { "_id": "$name", "count": { "$sum": 1 } } },
{"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } },
{"$project": {"name" : "$_id", "_id" : 0} }
]);
To sort the results by most to least duplicates:
db.collection.aggregate([
{"$group" : { "_id": "$name", "count": { "$sum": 1 } } },
{"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } },
{"$sort": {"count" : -1} },
{"$project": {"name" : "$_id", "_id" : 0} }
]);
To check a field other than "name", change "$name" to "$column_name".
You can find the list of duplicate names using the following aggregate pipeline:
Group all the records having the same name.
Match those groups having more than one record.
Then group again to collect all the duplicate names into an array.
The code:
db.collection.aggregate([
{$group:{"_id":"$name","name":{$first:"$name"},"count":{$sum:1}}},
{$match:{"count":{$gt:1}}},
{$project:{"name":1,"_id":0}},
{$group:{"_id":null,"duplicateNames":{$push:"$name"}}},
{$project:{"_id":0,"duplicateNames":1}}
])
Output:
{ "duplicateNames" : [ "ksqn291", "ksqn29123213Test" ] }
The answer anhic gave can be very inefficient if you have a large database and the attribute name is present only in some of the documents.
To improve efficiency you can add a $match to the aggregation.
db.collection.aggregate([
{"$match": {"name" :{ "$ne" : null } } },
{"$group" : {"_id": "$name", "count": { "$sum": 1 } } },
{"$match": {"count" : {"$gt": 1} } },
{"$project": {"name" : "$_id", "_id" : 0} }
])
Another option is to use the $sortByCount stage.
db.collection.aggregate([
{ $sortByCount: '$name' }
])
It is a combination of $group and $sort.
The $sortByCount stage is equivalent to the following $group + $sort sequence:
{ $group: { _id: <expression>, count: { $sum: 1 } } },
{ $sort: { count: -1 } }
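One thing worth noting is that $sortByCount on its own returns every distinct value, including those that occur only once. The plain-JavaScript sketch below mimics the $group + $sort sequence on a made-up list of names and then applies the count > 1 filter that a following $match stage would provide. The function name sortByCount is just a local helper here, not a driver API.

```javascript
// Hypothetical "name" values taken from a collection.
const names = ["ksqn291", "abc", "ksqn291", "abc", "ksqn291", "xyz"];

// Mimics { $group: { _id: <expression>, count: { $sum: 1 } } } followed by
// { $sort: { count: -1 } }.
function sortByCount(values) {
  const counts = new Map();
  for (const value of values) {
    counts.set(value, (counts.get(value) || 0) + 1);
  }
  return [...counts.entries()]
    .map(([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count);
}

const allCounts = sortByCount(names);

// Equivalent of adding { $match: { count: { $gt: 1 } } } after $sortByCount.
const duplicatesOnly = allCounts.filter((row) => row.count > 1);

console.log(JSON.stringify(allCounts));
console.log(JSON.stringify(duplicatesOnly));
```

So to list only duplicates, the $sortByCount stage would need a $match on count after it.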
db.getCollection('orders').aggregate([
{$group: {
_id: {name: "$name"},
uniqueIds: {$addToSet: "$_id"},
count: {$sum: 1}
}
},
{$match: {
count: {"$gt": 1}
}
}
])
The $group stage first groups the documents by the chosen field.
For each group we collect the unique _id values and count the documents. If the count is greater than 1, the value is duplicated in the collection, which is what the $match stage filters for.
This is how we can achieve this in MongoDB Compass.
In case you need to see all duplicated rows:
db.collection.aggregate([
{"$group" : { "_id": "$name", "count": { "$sum": 1 },"data": { "$push": "$$ROOT" }}},
{"$unwind": "$data"},
{"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } },
]);
If somebody is looking for a query for duplicates with an extra "and" where clause, like "and where someOtherField is true":
The trick is to start with that other $match, because after the grouping you don't have all the data available anymore:
db.collection.aggregate([
// Do a first match before the grouping
{ $match: { "someOtherField": true }},
{ $group: {
_id: { name: "$name" },
count: { $sum: 1 }
}},
{ $match: { count: { $gte: 2 } }},
]);
I searched for a very long time to find this notation; I hope it helps somebody with the same problem.
Search for duplicates in MongoDB Compass using $sortByCount:
[screenshot]: https://i.stack.imgur.com/L85QV.png
Sometimes you want to find duplicates regardless of case, for instance when you want to create a case-insensitive index. In that case you can use this aggregation pipeline:
db.collection.aggregate([
{'$group': {'_id': {'$toLower': '$name'}, 'count': { '$sum': 1 }, 'duplicates': { '$push': '$$ROOT' } } },
{'$match': { 'count': { '$gt': 1 } } }
]);
Explanation:
Group by name after converting it to lower case, and push the documents into the duplicates array.
Match those groups having more than one record (the duplicates).
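The two steps above can be sketched in plain JavaScript on an in-memory array, which may help when checking what the pipeline should return. The sample documents are made up, and toLowerCase() stands in for $toLower.

```javascript
// Hypothetical sample documents.
const people = [
  { _id: 1, name: "Alice" },
  { _id: 2, name: "alice" },
  { _id: 3, name: "Bob" },
  { _id: 4, name: "ALICE" },
];

// Step 1: group by the lower-cased name and push whole documents,
// like 'duplicates': { '$push': '$$ROOT' }.
const byLowerName = new Map();
for (const doc of people) {
  const key = doc.name.toLowerCase();
  if (!byLowerName.has(key)) {
    byLowerName.set(key, { _id: key, count: 0, duplicates: [] });
  }
  const group = byLowerName.get(key);
  group.count += 1;
  group.duplicates.push(doc);
}

// Step 2: keep only the groups with more than one document.
const caseInsensitiveDuplicates = [...byLowerName.values()].filter(
  (group) => group.count > 1
);

console.log(JSON.stringify(caseInsensitiveDuplicates));
```

Pushing the whole documents keeps the original spellings visible, which is handy when deciding which copy to keep.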
A bug in my code during development created some duplicate users in my MongoDB database.
Collection example:
"_id" : ObjectId("5abb9d72b884fb00389efeef"),
"user" : ObjectId("5abb9d72b884fb00389efee5"),
"displayName" : "test",
"fullName" : "test test test",
"email" : "test#mail.com",
"phoneNumber" : "99999999999",
"createdAt" : ISODate("2016-05-18T13:49:38.533Z")
I was able to find the duplicated users with this query:
db.users.aggregate([{$group: {_id: "$user", "Total": {$sum: 1}}}, {
$match: { "Total": {$gt: 1}}}])
And count them with this one:
db.users.aggregate([{$group: {_id: "$user", "Total": {$sum: 1}}}, {
$match: { "Total": {$gt: 1}}}, { $count: "Total"}])
I want to know how many users I'll need to delete, but the second query only returns the number of distinct users affected.
How can I get the total number of duplicated users, that is, a sum of "Total"?
Expected result:
{ "Total" : **** }
Well, you can do this with the following pipeline:
db.users.aggregate([
{ $group: {
_id: null,
uniqueValues: { $addToSet: "$user" },
count: { $sum: 1 }
}},
// documents to delete = all documents minus one per distinct user
{ $project: {
total: { $subtract: [ "$count", { $size: "$uniqueValues" } ] }
}}
]);
I don't have your data set, so I haven't tested this locally. Try this query:
db.users.aggregate([
{$group: {_id: "$user", Total: {$sum: 1}}}, // group by user and count the documents for each
{$addFields: {Total: {$subtract: ["$Total", 1]}}}, // one document per user is kept, so the extras are count - 1
{$group: {_id: null, Total: {$sum: "$Total"}}}, // put everything in a single group and sum the extras
{$project: {_id: 0, Total: 1}} // the result is the total number of duplicate documents
])
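Both pipelines above should arrive at the same number. Here is a small plain-JavaScript sketch, on made-up data, that computes the count both ways and also shows the figure the original $count query reports. All variable names are hypothetical.

```javascript
// Hypothetical documents: user "a" appears 3 times, "b" twice, "c" once.
const userDocs = [
  { user: "a" }, { user: "a" }, { user: "a" },
  { user: "b" }, { user: "b" },
  { user: "c" },
];

// Approach 1: total documents minus the number of distinct users.
const distinctUsers = new Set(userDocs.map((d) => d.user));
const totalBySubtraction = userDocs.length - distinctUsers.size;

// Approach 2: count per user, subtract 1 from each count, then sum.
const perUser = new Map();
for (const doc of userDocs) {
  perUser.set(doc.user, (perUser.get(doc.user) || 0) + 1);
}
let totalByExtras = 0;
for (const count of perUser.values()) {
  totalByExtras += count - 1;
}

// What the question's $count query reports: users having more than one document.
const affectedUsers = [...perUser.values()].filter((count) => count > 1).length;

console.log({ totalBySubtraction, totalByExtras, affectedUsers });
```

With this data there are two affected users but three documents to delete, which is the difference the question is asking about.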
I have the following dataset:
{
patientId: 228,
medication: {
atHome : [
{
"drug" : "tylenol",
"start" : "3",
"stop" : "7"
},
{
"drug" : "advil",
"start" : "0",
"stop" : "2"
},
{
"drug" : "vitaminK",
"start" : "0",
"stop" : "11"
}
],
}
}
When I execute the following aggregate everything looks great.
db.test01.aggregate(
[
{$match: {patientId: 228}},
{$project: {
patientId: 1,
"medication.atHome.drug": 1
}
},
]);
Results (Exactly what I wanted):
{
"_id" : ObjectId("5a57b7d17af6772ebf647939"),
"patientId" : NumberInt(228),
"medication" : {
"atHome" : [
{"drug" : "tylenol"},
{"drug" : "advil"},
{"drug" : "vitaminK"}
]}
}
We then wanted to add $ifNull to change nulls to a default value, but this bungled the results.
db.test01.aggregate(
[
{$match: {patientId: 228}},
{$project: {
patientId: {$ifNull: ["$patientId", NumberInt(-1)]},
"medication.atHome.drug": {$ifNull: ["$medication.atHome.drug", "Unknown"]}
}
},
]);
Results from ifNull (Not what I was hoping for):
{
"_id" : ObjectId("5a57b7d17af6772ebf647939"),
"patientId" : NumberInt(228),
"medication" : {
"atHome" : [
{"drug" : ["tylenol", "advil", "vitaminK"]},
{"drug" : ["tylenol", "advil", "vitaminK"]},
{"drug" : ["tylenol", "advil", "vitaminK"]},
]}
}
What am I missing or not understanding?
To give default values to attributes of documents that are elements of an array, you need to $unwind the array, check the attributes for null, and then group everything back together. Here is the query:
db.test01.aggregate([
// unwind to evaluate the array elements one by one
{$unwind: "$medication.atHome"},
{$project: {
patientId: {$ifNull: ["$patientId", -1]},
"medication.atHome.drug": {$ifNull: ["$medication.atHome.drug", "Unknown"]}
}
},
// group to put atHome documents to an array again
{$group: {
_id: {_id: "$_id", patientId: "$patientId"},
"atHome": {$push: "$medication.atHome" }
}
},
// project to get a document of required format
{$project: {
_id: "$_id._id",
patientId: "$_id.patientId",
"medication.atHome": "$atHome"
}
}
])
UPDATE:
There is a neater query that achieves the same result. It uses the $map operator to evaluate each array element and therefore does not require unwinding.
db.test01.aggregate([
{$project:
{
patientId: {$ifNull: ["$patientId", -1]},
"medication.atHome": {
$map: {
input: "$medication.atHome",
as: "e",
// $ifNull covers both a null drug and a missing drug field
in: { drug: { $ifNull: ["$$e.drug", "Unknown"] } }
}
}
}
}
])
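The reason the plain $ifNull projection in the question misbehaves is that the path "$medication.atHome.drug" resolves to an array of all drug values, so that same array is never null and gets written into every element. Defaulting has to happen per element, which is what $map does. A minimal plain-JavaScript sketch of that per-element logic follows; the sample document is made up, and the ?? operator stands in for a null-or-missing check.

```javascript
// Hypothetical document: one drug present, one null, one missing.
const patient = {
  patientId: 228,
  medication: {
    atHome: [
      { drug: "tylenol", start: "3", stop: "7" },
      { drug: null, start: "0", stop: "2" },
      { start: "0", stop: "11" },
    ],
  },
};

// Apply the default to each array element separately, keeping only "drug".
const projected = {
  patientId: patient.patientId ?? -1,
  medication: {
    atHome: patient.medication.atHome.map((e) => ({
      drug: e.drug ?? "Unknown",
    })),
  },
};

console.log(JSON.stringify(projected));
```

Each element is evaluated on its own, so a present value is kept and only the null or missing ones fall back to the default.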