Find number of duplicates documents - mongodb

I had a bug in my code during development that created some duplicate users in my MongoDB.
Example document:
{
"_id" : ObjectId("5abb9d72b884fb00389efeef"),
"user" : ObjectId("5abb9d72b884fb00389efee5"),
"displayName" : "test",
"fullName" : "test test test",
"email" : "test@mail.com",
"phoneNumber" : "99999999999",
"createdAt" : ISODate("2016-05-18T13:49:38.533Z")
}
I was able to find the duplicated users with this query:
db.users.aggregate([
{$group: {_id: "$user", "Total": {$sum: 1}}},
{$match: {"Total": {$gt: 1}}}
])
And count them with this one:
db.users.aggregate([
{$group: {_id: "$user", "Total": {$sum: 1}}},
{$match: {"Total": {$gt: 1}}},
{$count: "Total"}
])
I want to know how many users I'll need to delete, but the second query only returns the number of distinct users affected.
How can I get the total number of duplicated users, i.e. a sum of "Total"?
Expected result:
{ "Total" : **** }

Well, you can do this with the following pipeline:
[
{ $group: {
_id: null,
uniqueValues: { $addToSet: "$user" },
count: { $sum: 1 }
}},
{ $project: {
total: { $subtract: [ "$count", { $size: "$uniqueValues" } ] }
}}
]
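A minimal way to run it, assuming the same users collection as in the question (I've added _id: 0 to the $project purely to tidy the output):
db.users.aggregate([
{ $group: { _id: null, uniqueValues: { $addToSet: "$user" }, count: { $sum: 1 } } },
{ $project: { _id: 0, total: { $subtract: ["$count", { $size: "$uniqueValues" }] } } }
])
// Returns a single document, e.g. { "total" : <number of documents you would need to delete> }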

I don't have your data set, so I didn't test this locally. Try this query:
db.users.aggregate([
{$group: {_id: "$user", Total: {$sum: 1}}}, // group by user and count each
{$addFields: {Total: {$subtract: ["$Total", 1]}}}, // you only need the duplicates, so subtract the first (kept) instance of each user
{$group: {_id: null, Total: {$sum: "$Total"}}}, // _id is now unique per user, so sum the remaining counts
{$project: {_id: 0, Total: 1}} // the result is the total number of duplicate documents
])
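As a quick sanity check against a scratch collection (users_test is a hypothetical name), three documents for one user plus two for another should report (3 - 1) + (2 - 1) = 3 duplicates:
db.users_test.insertMany([
{ user: 1 }, { user: 1 }, { user: 1 }, // user 1 appears 3 times
{ user: 2 }, { user: 2 } // user 2 appears 2 times
])
db.users_test.aggregate([
{ $group: { _id: "$user", Total: { $sum: 1 } } },
{ $addFields: { Total: { $subtract: ["$Total", 1] } } },
{ $group: { _id: null, Total: { $sum: "$Total" } } },
{ $project: { _id: 0, Total: 1 } }
])
// Expected: { "Total" : 3 }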

Related

A mongo DB query with a MAX(DATE)

Hello, I have a DB where a user can have many entries. I need a query that groups by user and brings me the last entry date in the system, along with additional data such as previo and the ID of the transaction. The date field is createdAt. It brings me a date, but not the last one... here is the code:
db.getCollection("usersos").aggregate(
[
{
"$group" : {
"_id" : {
"_id" : "$_id",
"user" : "$user",
"previo" : "$previo"
},
"MAX(createdAt)" : {
"$max" : "$createdAt"
}
}
},
{
"$project" : {
"user" : "$_id.user",
"MAX(createdAt)" : "$MAX(createdAt)",
"_id" : "$_id._id",
"previo" : "$_id.previo"
}
}
]
);
I'm just starting with NoSQL, so any help is appreciated, and please excuse my mistakes.
Including $_id in the group key means that every input document ends up as its own group, i.e. no grouping will really happen.
You could try pre-sorting by createdAt, which might be helped by an index on that field, then the group can select $first to get the first entry for each field that you care about.
db.usersos.aggregate([
{$sort: {createdAt: -1}},
{$group: {
_id:"$user",
docId: {$first: "$_id"},
previo: {$first: "$previo"},
createdAt: {$first: "$createdAt"}
}},
{ $project: {
user: "$_id",
_id: "$docId",
previo: 1,
createdAt: 1
}}
])
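As a side note on the index suggestion above, a descending index on createdAt (a sketch, adjust to your data) lets the initial $sort avoid an in-memory sort:
// Hypothetical supporting index for the {$sort: {createdAt: -1}} stage.
db.usersos.createIndex({ createdAt: -1 })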

Mongodb find duplicates where second column matches

I want to find duplicate documents in my MongoDB database, and I have already achieved a portion of it. Let's say my document is something like this:
{
"_id" : ObjectId("5900b01b2ce12a2383328e61"),
"Bank Name" : "Seaway Bank and Trust Company",
"City" : "Chicago",
"ST" : "IL",
"CERT" : 19328,
"Acquiring Institution" : "State Bank of Texas",
"Closing Date" : "27-Jan-17",
"Updated Date" : "17-Feb-17"
}
and I have written a query like this:
db.list.aggregate([
{$group: {
_id: {CERT: "$CERT"},
uniqueIds: {$addToSet: "$_id"},
count: {$sum: 1}
}
},
{$match: {
count: {"$gt": 1}
}
},
{$sort: {
count: -1
}
}
]);
This gives me the IDs of all the documents where CERT is repeated in more than one document, which is correct. But in addition to this, I want to add "and where ST is not equal to IL". How can I do that?
Please help!
You can just add another $match, with ST not equal to IL, before the $group; this ignores the documents with "ST" == "IL":
Final Query:
db.list.aggregate([
{
$match : {
"ST" : {$ne : "IL"}
}
},
{
$group: {
_id: {CERT: "$CERT"},
uniqueIds: {$addToSet: "$_id"},
count: {$sum : 1}
}
},
{
$match: {
count: {"$gt": 1}
}
},
{
$sort: {
count: -1
}
}
]);
Hope this Helps!
You can use this
db.list.aggregate([
{$group: {
// Note: inside a $group key you must use the aggregation expression form of $ne;
// this groups by a true/false flag (whether ST differs from "IL") rather than
// excluding the "IL" documents, so the pre-$group $match above is usually preferable.
_id: {CERT: "$CERT", ST: {$ne: ["$ST", "IL"]}},
uniqueIds: {$addToSet: "$_id"},
count: {$sum: 1}
}
},
{$match: {
count: {"$gt": 1}
}
},
{$sort: {
count: -1
}
}
]);
Let me know if it did not work or if you need some more help.

MongoDB group and only show results whose count is greater than 1 [duplicate]

How would I find duplicate field values in a mongo collection?
I'd like to check if any of the "name" fields are duplicates.
{
"name" : "ksqn291",
"__v" : 0,
"_id" : ObjectId("540f346c3e7fc1054ffa7086"),
"channel" : "Sales"
}
Many thanks!
Use aggregation on name and get name with count > 1:
db.collection.aggregate([
{"$group" : { "_id": "$name", "count": { "$sum": 1 } } },
{"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } },
{"$project": {"name" : "$_id", "_id" : 0} }
]);
To sort the results by most to least duplicates:
db.collection.aggregate([
{"$group" : { "_id": "$name", "count": { "$sum": 1 } } },
{"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } },
{"$sort": {"count" : -1} },
{"$project": {"name" : "$_id", "_id" : 0} }
]);
To use this with a field other than "name", change "$name" to "$column_name".
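For instance, to look for duplicate email values instead (email being a hypothetical field name here):
// Same pipeline, just grouping on "email" instead of "name".
db.collection.aggregate([
{"$group" : { "_id": "$email", "count": { "$sum": 1 } } },
{"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } },
{"$project": {"email" : "$_id", "_id" : 0} }
]);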
You can find the list of duplicate names using the following aggregate pipeline:
Group all the records having the same name.
Match those groups having more than one record.
Then group again to project all the duplicate names as an array.
The Code:
db.collection.aggregate([
{$group:{"_id":"$name","name":{$first:"$name"},"count":{$sum:1}}},
{$match:{"count":{$gt:1}}},
{$project:{"name":1,"_id":0}},
{$group:{"_id":null,"duplicateNames":{$push:"$name"}}},
{$project:{"_id":0,"duplicateNames":1}}
])
Output:
{ "duplicateNames" : [ "ksqn291", "ksqn29123213Test" ] }
The answer anhic gave can be very inefficient if you have a large database and the name attribute is present in only some of the documents.
To improve efficiency you can add a $match to the aggregation.
db.collection.aggregate([
{"$match": {"name" :{ "$ne" : null } } },
{"$group" : {"_id": "$name", "count": { "$sum": 1 } } },
{"$match": {"count" : {"$gt": 1} } },
{"$project": {"name" : "$_id", "_id" : 0} }
])
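As an optional further tweak, an index on name lets that initial $match use the index rather than scan every document; a minimal sketch:
// Hypothetical: supporting index for the match on name not null.
db.collection.createIndex({ name: 1 })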
Another option is to use the $sortByCount stage.
db.collection.aggregate([
{ $sortByCount: '$name' }
])
This is the combination of $group & $sort.
The $sortByCount stage is equivalent to the following $group + $sort sequence:
{ $group: { _id: <expression>, count: { $sum: 1 } } },
{ $sort: { count: -1 } }
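Since $sortByCount outputs { _id, count } documents, you can append a $match to keep only the duplicates; a small sketch:
// Only values that occur more than once, most frequent first.
db.collection.aggregate([
{ $sortByCount: "$name" },
{ $match: { count: { $gt: 1 } } }
])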
db.getCollection('orders').aggregate([
{$group: {
_id: {name: "$name"},
uniqueIds: {$addToSet: "$_id"},
count: {$sum: 1}
}
},
{$match: {
count: {"$gt": 1}
}
}
])
The first $group stage groups according to the field. We also collect the unique _ids and count them; if the count is greater than 1, the value is duplicated in the collection, and that is what the $match stage filters for.
This is also how we can achieve it in MongoDB Compass.
In case you need to see all duplicated rows:
db.collection.aggregate([
{"$group" : { "_id": "$name", "count": { "$sum": 1 },"data": { "$push": "$$ROOT" }}},
{"$unwind": "$data"},
{"$match": {"_id" :{ "$ne" : null } , "count" : {"$gt": 1} } },
]);
If somebody is looking for a query for duplicates with an extra "and" condition, like "and where someOtherField is true":
The trick is to start with that other $match, because after the grouping you no longer have all the data available.
db.collection.aggregate([
// Do a first match before the grouping
{ $match: { "someOtherField": true }},
{ $group: {
_id: { name: "$name" },
count: { $sum: 1 }
}},
{ $match: { count: { $gte: 2 } }}
])
I searched for a very long time to find this notation, hope I can help somebody with the same problem
Search for duplicates in MongoDB Compass using $sortByCount:
[screenshot]: https://i.stack.imgur.com/L85QV.png
Sometimes you want to find duplicates regardless of case, for instance when you want to create a case-insensitive index. In this case you can use this aggregation pipeline:
db.collection.aggregate([
{'$group': {'_id': {'$toLower': '$name'}, 'count': { '$sum': 1 }, 'duplicates': { '$push': '$$ROOT' } } },
{'$match': { 'count': { '$gt': 1 } } }
]);
Explanation:
Group by name, but first change it to lower case, and push the docs to the duplicates array.
Match those groups having more than one record (the duplicates).
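As a hedged follow-up on the case-insensitive index motivation: once the duplicates found above are removed, one way to create such an index is with a collation of strength 2 (making it unique is my assumption; adjust the locale as needed):
// Hypothetical: case-insensitive unique index on "name" via collation.
// Creation fails while case-insensitive duplicates still exist.
db.collection.createIndex(
{ name: 1 },
{ unique: true, collation: { locale: "en", strength: 2 } }
)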

$group after $lookup is taking way too long

I have the following mongo collection:
{
"_id" : "22pTvYLd7azAAPL5T",
"plate" : "ABC-123",
"company": "AMZ",
"_portfolioType" : "account"
},
{
"_id" : "22pTvYLd7azAAPL5T",
"plate" : "ABC-123",
"_portfolioType" : "sale",
"price": 87.3
},
{
"_id" : "22pTvYLd7azAAPL5T",
"plate" : "ABC-123",
"_portfolioType" : "sale",
"price": 88.9
}
I am trying to aggregate all documents which have the same value in the plate field. Below is the query I have written so far:
db.getCollection('temp').aggregate([
{
$lookup: {
from: 'temp',
let: { 'p': '$plate', 't': '$_portfolioType' },
pipeline: [{
'$match': {
'_portfolioType': 'sale',
'$expr': { '$and': [
{ '$eq': [ '$plate', '$$p' ] },
{ '$eq': [ '$$t', 'account' ] }
]}
}
}],
as: 'revenues'
},
},
{
$project: {
plate: 1,
company: 1,
totalTrades: { $arrayElemAt: ['$revenues', 0] },
},
},
{
$addFields: {
revenue: { $add: [{ $multiply: ['$totalTrades.price', 100] }, 99] },
},
},
{
$group: {
_id: '$company',
revenue: { $sum: '$revenue' },
}
}
])
The query works fine if I remove the $group stage; however, as soon as I add the $group stage, mongo seems to process indefinitely. I tried adding a $match as the first stage to limit the number of documents to process, but without any luck. E.g.:
{
$match: { $or: [{ _portfolioType: 'account' }, { _portfolioType: 'sale' }] }
},
I also tried using { explain: true } but it doesn't return anything helpful.
As Neil Lunn noticed, you very likely don't need the lookup to reach your "end goal", which is still quite vague.
Please read comments and adjust as needed:
db.temp.aggregate([
{$group:{
// Get unique plates
_id: "$plate",
// Not clear what you expect if there are documents with
// different company, and the same plate.
// Assuming "it never happens"
// You may need to $cond it here with {$eq: ["$_portfolioType", "account"]}
// but you never voiced it.
company: {$first:"$company"},
// Not exactly all documents with _portfolioType: sale,
// but rather price from all documents for this plate.
// Assuming price field is available only in documents
// with "_portfolioType" : "sale". Otherwise add a $cond here.
// If you really need "all documents", push $$ROOT instead.
prices: {$push: "$price"}
}},
{$project: {
company: 1,
// Apply your math here, or on the previous stage
// to calculate revenue per plate
revenue: "$prices"
}},
{$group: {
// Get document for each "company"
_id: "$company",
// Revenue associated with plate
revenuePerPlate: {$push: {"k":"$_id", "v":"$revenue"}}
}},
{$project:{
_id: 0,
company: "$_id",
// Count of unique plate
platesCnt: {$size: "$revenuePerPlate"},
// arrayToObject if you wish plate names as properties
revenuePerPlate: {$arrayToObject: "$revenuePerPlate"}
}}
])
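If the revenue formula from the question (price * 100 + 99 per sale document) is what is actually wanted, a hedged sketch of the same pipeline with that math filled in could look like the following; it carries over the assumptions noted in the comments above (price only present on "sale" documents, one company per plate):
db.temp.aggregate([
{$group: {
_id: "$plate", // one group per plate
company: {$first: "$company"},
prices: {$push: "$price"} // prices from the "sale" documents
}},
{$project: {
company: 1,
// the question's formula applied to each price, then summed per plate
revenue: {$sum: {$map: {
input: "$prices", as: "p",
in: {$add: [{$multiply: ["$$p", 100]}, 99]}
}}}
}},
{$group: {_id: "$company", revenue: {$sum: "$revenue"}}}
])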
