MongoDB: Deduplicate collection

I'm working with MongoDB and Node. I have a collection with a large number of records, an unknown number of which are duplicates. I'm trying to remove the duplicates following "Remove duplicate records from mongodb 4.0" and https://docs.mongodb.com/manual/aggregation/ .
I am using the MongoDB Compass tool, and I am able to run code in the MongoDB shell at the bottom of this tool.
So far I have:
db.hayes.aggregate([
{"$group" : {_id:"$PropertyId", count:{$sum:1}}}
]);
{ "_id" : "R135418", "count" : 10 }
{ "_id" : "R47410", "count" : 17 }
{ "_id" : "R130794", "count" : 10 }
{ "_id" : "R92923", "count" : 18 }
{ "_id" : "R107811", "count" : 11 }
{ "_id" : "R91389", "count" : 15 }
{ "_id" : "R22047", "count" : 12 }
{ "_id" : "R103664", "count" : 10 }
{ "_id" : "R121349", "count" : 12 }
{ "_id" : "R143168", "count" : 8 }
{ "_id" : "R85918", "count" : 13 }
{ "_id" : "R41641", "count" : 13 }
{ "_id" : "R160910", "count" : 11 }
{ "_id" : "R48919", "count" : 11 }
{ "_id" : "M119387", "count" : 10 }
{ "_id" : "R161734", "count" : 12 }
{ "_id" : "R41259", "count" : 13 }
{ "_id" : "R156538", "count" : 7 }
{ "_id" : "R60868", "count" : 10 }
How do I now select one document from each group to avoid duplicates? (I can see that loading the result into a new collection will likely involve {$out: "theCollectionWithoutDuplicates"}.)
edit:
The output from db.hayes.aggregate([
{
$group: {
_id: "$PropertyId",
count: {
$sum: 1
},
ids: {
$addToSet: "$_id"
}
}
},
{
$match: {
count: {
$gt: 1
}
}
}
]) gives output looking like:
{ "_id" : "M118975", "count" : 8, "ids" : [ ObjectId("60147f84e9fdd41da73272d6"), ObjectId("601427ac432deb152a70b8fd"), ObjectId("6014639be210571a70d1118f"), ObjectId("60145e9ae210571a70d0062f"), ObjectId("60145545b6f7a917817e9519"), ObjectId("6014619be210571a70d0a091"), ObjectId("60145dc3d5a2811a459b4e07"), ObjectId("60146641e210571a70d1a3cd") ] }
{ "_id" : "R88986", "count" : 10, "ids" : [ ObjectId("60131752de3d3a09bc1eb04b"), ObjectId("6013348385dcda0eb5b8d40c"), ObjectId("60145297b6f7a917817e1928"), ObjectId("601458eeb08c4919df85f63d"), ObjectId("601462f4e210571a70d0e961"), ObjectId("60142ad9c0db1716068a612e"), ObjectId("601425263df18a145b2fd0a8"), ObjectId("60145be5d5a2811a459aea7e"), ObjectId("6014634ce210571a70d0fe5c"), ObjectId("60131a1ab7335806a1816b95") ] }
{ "_id" : "P119977", "count" : 11, "ids" : [ ObjectId("601468b9597abd1bfd0798a4"), ObjectId("60144b7dbfa28016887b0e8f"), ObjectId("60147094c4bca31cfdb12d1d"), ObjectId("60144de7bfa28016887b698b"), ObjectId("60135aa63674d90dffec3759"), ObjectId("60135f552441920e97e858a3"), ObjectId("601428b3432deb152a70f32e"), ObjectId("60141b222ac11f13055725a5"), ObjectId("60145326b6f7a917817e38b6"), ObjectId("6014882c5322582035e83f63"), ObjectId("6014741ae9fdd41da7313a44") ] }
However, when I run the forEach loop it runs for minutes and then crashes.
Originally the database mydb was 0.173 GB, but now it is 0.368 GB.
Any idea what is wrong?
edit 2:
I rebooted, then reran your entire script. This time it completed in 3-4 minutes with no errors.
> show dbs
admin 0.000GB
config 0.000GB
local 0.000GB
myNewDatabase 0.000GB
mydb 0.396GB
> db.hayes.aggregate([ {"$group" : {_id:"$PropertyId", count:{$sum:1}}},{$count:"total"} ]);
{ "total" : 103296 }
> db.hayes.aggregate([ {"$group" : {_id:"$PropertyId", count:{$sum:1}}} ]);
{ "_id" : "R96274", "count" : 1 }
{ "_id" : "R106186", "count" : 1 }
{ "_id" : "R169417", "count" : 1 }
{ "_id" : "R140542", "count" : 1 }
So it looks like it worked this time, but why is 'mydb' getting larger?

Here is how to keep a single document from every group of duplicates and remove the rest:
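// "test" is this answer's collection name; substitute your own (db.hayes in the question).
db.test.aggregate([
{
// Group by PropertyId, counting the documents and collecting their _id values.
$group: {
_id: "$PropertyId",
count: {
$sum: 1
},
ids: {
$addToSet: "$_id"
}
}
},
{
// Keep only the groups that contain duplicates.
$match: {
count: {
$gt: 1
}
}
}
]).forEach(function(d){
// Drop the first _id from the array so that one document survives.
d.ids.shift();
// Print the _id values that are about to be deleted.
printjson(d.ids);
// Delete the remaining documents of this group.
db.test.remove({
_id: {
$in: d.ids
}
})
})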
Explained:
You group by PropertyId and collect the _id of every document in the group in the ids array.
You keep only the groups that have more than one document (the duplicates).
You loop over those groups, remove the first _id from the ids array (that document is kept), and delete the documents whose _id values remain in the array.
You can run this multiple times; if there are no duplicates, nothing is deleted.
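To make the effect of the loop concrete, here is a minimal plain-JavaScript sketch that mimics the same steps on an in-memory array instead of a live collection. The function name idsToRemove and the sample documents are made up for illustration; it only shows which _id values would be handed to remove(), assuming the first _id seen in each group is the one to keep.

```javascript
// Hypothetical in-memory stand-in for the $group + $match + forEach steps.
// It does not talk to MongoDB; it only shows which _id values would be deleted.
function idsToRemove(docs, key) {
  // Step 1: collect the _id values per duplicate key (like $group with ids).
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc[key])) groups.set(doc[key], []);
    groups.get(doc[key]).push(doc._id);
  }
  // Steps 2 and 3: look only at groups with more than one document,
  // skip the first _id (the survivor) and collect the rest for deletion.
  const toRemove = [];
  for (const ids of groups.values()) {
    if (ids.length > 1) toRemove.push(...ids.slice(1));
  }
  return toRemove;
}

// Sample data, invented for this example.
const sampleDocs = [
  { _id: 1, PropertyId: "R1" },
  { _id: 2, PropertyId: "R1" },
  { _id: 3, PropertyId: "R2" },
  { _id: 4, PropertyId: "R1" },
];
console.log(idsToRemove(sampleDocs, "PropertyId")); // [ 2, 4 ]
```

Running the function again on the documents that survive returns an empty array, which matches the point above that a second run deletes nothing.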

Related

Get cursor count in Mongodb with aggregation framework

I'm working with MongoDB and Node. I have a collection with a large number of records, an unknown number of which are duplicates. I'm trying to remove the duplicates following "Remove duplicate records from mongodb 4.0" and https://docs.mongodb.com/manual/aggregation/ .
So far I have:
db.hayes.aggregate([
{"$group" : {_id:"$PropertyId", count:{$sum:1}}}
]);
{ "_id" : "R135418", "count" : 10 }
{ "_id" : "R47410", "count" : 17 }
{ "_id" : "R130794", "count" : 10 }
{ "_id" : "R92923", "count" : 18 }
{ "_id" : "R107811", "count" : 11 }
{ "_id" : "R91389", "count" : 15 }
{ "_id" : "R22047", "count" : 12 }
{ "_id" : "R103664", "count" : 10 }
{ "_id" : "R121349", "count" : 12 }
{ "_id" : "R143168", "count" : 8 }
{ "_id" : "R85918", "count" : 13 }
{ "_id" : "R41641", "count" : 13 }
{ "_id" : "R160910", "count" : 11 }
{ "_id" : "R48919", "count" : 11 }
{ "_id" : "M119387", "count" : 10 }
{ "_id" : "R161734", "count" : 12 }
{ "_id" : "R41259", "count" : 13 }
{ "_id" : "R156538", "count" : 7 }
{ "_id" : "R60868", "count" : 10 }
To get the number of groups, I tried this in the mongo shell:
> const cursor = db.hayes.aggregate([{"$group" :
{_id:"$PropertyId", count:{$sum:1}}} ]);
> cursor.count()
uncaught exception: TypeError: cursor.count is not a function :
@(shell):1:1
Apparently this works with the db.collection.find statement. How do I do this with the aggregation framework?
Add the following stage after the $group stage to get the number of groups:
{$count:"Total"}
The count method on the cursor changes the query being sent from find to count. This only works if you are sending a find query to begin with, i.e. not when you are aggregating.
See https://docs.mongodb.com/manual/reference/method/cursor.count/#cursor.count which includes guidance for how to count when aggregating.
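As a rough illustration of what $group followed by {$count:"Total"} computes, here is a small plain-JavaScript sketch on an in-memory array. The function name countGroups and the sample documents are hypothetical, and the code only emulates the result; it is not how the server evaluates the pipeline.

```javascript
// Hypothetical in-memory equivalent of $group on a key followed by $count.
function countGroups(docs, key) {
  // A Set keeps one entry per distinct key value, i.e. one per group.
  const distinct = new Set(docs.map((doc) => doc[key]));
  // Note: for empty input this returns { Total: 0 }, whereas the $count
  // stage emits no document at all when it receives no input.
  return { Total: distinct.size };
}

// Sample data, invented for this example.
const propertyDocs = [
  { PropertyId: "R135418" },
  { PropertyId: "R135418" },
  { PropertyId: "R47410" },
];
console.log(countGroups(propertyDocs, "PropertyId")); // { Total: 2 }
```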

Does MongoDB support aggregate queries on the result of the aggregate query?

I have an aggregate query that returns the count of records a property has.
db.collection.aggregate([
{
$group : {
_id : "$propertyId",
count: { $sum: 1 }
}
},
{
$sort : { count: 1 }
}
],
{
allowDiskUse:true
});
This gives me a result that looks like this.
{ "_id" : 1234, "count" : 1 }
{ "_id" : 1235, "count" : 1 }
{ "_id" : 1236, "count" : 2 }
{ "_id" : 1237, "count" : 3 }
{ "_id" : 1238, "count" : 3 }
Now I want to count the counts. So the above result would turn into this.
{ "_id" : 1, "count" : 2 }
{ "_id" : 2, "count" : 1 }
{ "_id" : 3, "count" : 2 }
Is this possible to do with a query, or do I need to write some code to get this done?
I updated the query to have another "step" that counts the counts. This is how it looks.
db.collection.aggregate([
{
$group : {
_id : "$propertyId",
count: { $sum: 1 }
}
},
{
$group : {
_id : "$count",
countOfCounts: { $sum: 1 }
}
},
{
$sort : { countOfCounts: 1 }
}
],
{
allowDiskUse:true
});
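For anyone who wants to check the two-step grouping by hand, here is a minimal plain-JavaScript sketch of the same idea on an in-memory array. The function name tallyCounts is made up, and the sample array is built to reproduce the counts from the question (two ids that occur once, one that occurs twice, two that occur three times).

```javascript
// Hypothetical in-memory stand-in for the two $group stages plus $sort.
function tallyCounts(docs, key) {
  // First $group: number of documents per key value.
  const perKey = new Map();
  for (const doc of docs) {
    perKey.set(doc[key], (perKey.get(doc[key]) || 0) + 1);
  }
  // Second $group: number of key values that share each count.
  const perCount = new Map();
  for (const n of perKey.values()) {
    perCount.set(n, (perCount.get(n) || 0) + 1);
  }
  // $sort by countOfCounts ascending; ties are broken by _id here only to
  // make this sketch deterministic (the pipeline above does not do that).
  return [...perCount]
    .map(([count, total]) => ({ _id: count, countOfCounts: total }))
    .sort((a, b) => a.countOfCounts - b.countOfCounts || a._id - b._id);
}

// Sample data matching the counts shown in the question.
const propertyRecords = [1234, 1235, 1236, 1236, 1237, 1237, 1237, 1238, 1238, 1238]
  .map((id) => ({ propertyId: id }));
console.log(tallyCounts(propertyRecords, "propertyId"));
```

With this input the sketch reports that a count of 2 occurs once and that counts of 1 and 3 each occur twice, which is the result the question asks for (with the field named countOfCounts, as in the updated query).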

Counting how many times unique values occur in an array across a MongoDB collection

So I have a collection of users. The user document is a very simple document, and looks like this:
{
username: "player101",
badges: ["score10", "score100"]
}
So how can I query to see how many times each unique value in the badges array occurs across the entire collection?
Use an aggregation with $unwind and $group stages, counting each badge with the $sum accumulator:
db.players.aggregate([
{
$unwind: "$badges"
},
{
$group:
{
_id: "$badges",
count: { $sum: 1 }
}
}
]);
On a players collection with these documents:
{ "username" : "player101", "badges" : [ "score10", "score100" ] }
{ "username" : "player102", "badges" : [ "score11", "score100" ] }
{ "username" : "player103", "badges" : [ "score11", "score101" ] }
{ "username" : "player104", "badges" : [ "score12", "score100" ] }
this gives the result:
{ "_id" : "score101", "count" : 1 }
{ "_id" : "score11", "count" : 2 }
{ "_id" : "score12", "count" : 1 }
{ "_id" : "score100", "count" : 3 }
{ "_id" : "score10", "count" : 1 }
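The same counting can be sketched in plain JavaScript on an in-memory array, which may help when checking the pipeline's output. The function name countBadges is hypothetical; flattening the arrays plays the role of $unwind and the tally plays the role of $group with $sum.

```javascript
// Hypothetical in-memory equivalent of $unwind on badges followed by $group.
function countBadges(users) {
  const counts = {};
  // flatMap yields one element per badge, like $unwind; users without
  // a badges array contribute nothing.
  for (const badge of users.flatMap((user) => user.badges || [])) {
    counts[badge] = (counts[badge] || 0) + 1;
  }
  return counts;
}

// The four sample documents from the answer above.
const players = [
  { username: "player101", badges: ["score10", "score100"] },
  { username: "player102", badges: ["score11", "score100"] },
  { username: "player103", badges: ["score11", "score101"] },
  { username: "player104", badges: ["score12", "score100"] },
];
console.log(countBadges(players));
```

Unlike the pipeline, this returns a single object keyed by badge name rather than one document per badge.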

Sum the different grades by date in MongoDB [duplicate]

This question already has an answer here:
Mongodb count distinct with multiple group fields
I'm using the restaurants dataset from the MongoDB website. A document has arrays like the following:
{
"grades" : [
{
"date" : ISODate("2014-06-10T00:00:00.000Z"),
"grade" : "A"
},
{
"date" : ISODate("2013-06-05T00:00:00.000Z"),
"grade" : "B",
"score" : 7
},
{
"date" : ISODate("2012-04-13T00:00:00.000Z"),
"grade" : "A"
},
{
"date" : ISODate("2011-10-12T00:00:00.000Z"),
"grade" : "A"
}
]
}
I'm trying to get a list of all dates, with a count of how many of each grade there was on that day.
I've got this far:
db.restaurants.aggregate([{
$unwind : {
path: '$grades'
}
}, {
$group: {
_id: '$grades.date',
grades: {
$push: '$grades.grade'
}
}
}])
Which gives me each date and the grades on that date.
How do I now count the number of each unique grade?
Figured it out thanks to this question.
The solution is actually much simpler than I was thinking:
db.restaurants.aggregate([{
$unwind : {
path: '$grades'
}
}, {
$group: {
_id: {
date: '$grades.date',
grade: '$grades.grade'
},
count: {
$sum: 1
}
}
}])
This gives a result like:
/* 1 */
{
"_id" : {
"date" : ISODate("2014-06-23T00:00:00.000Z"),
"grade" : "C"
},
"count" : 4
}
/* 2 */
{
"_id" : {
"date" : ISODate("2011-11-01T00:00:00.000Z"),
"grade" : "C"
},
"count" : 3
}
/* 3 */
{
"_id" : {
"date" : ISODate("2014-05-06T00:00:00.000Z"),
"grade" : "A"
},
"count" : 121
}
/* 4 */
{
"_id" : {
"date" : ISODate("2012-08-21T00:00:00.000Z"),
"grade" : "C"
},
"count" : 5
}
/* 5 */
{
"_id" : {
"date" : ISODate("2013-09-04T00:00:00.000Z"),
"grade" : "C"
},
"count" : 4
}
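To illustrate why grouping on a compound _id works, here is a minimal plain-JavaScript sketch that does the same thing on an in-memory array. The function name countGradesByDate and the two sample restaurants are invented for the example; the string key built from date and grade stands in for the compound _id.

```javascript
// Hypothetical in-memory equivalent of $unwind on grades followed by
// $group on the pair (date, grade).
function countGradesByDate(restaurants) {
  const groups = new Map();
  for (const restaurant of restaurants) {
    // Iterating the array plays the role of $unwind.
    for (const g of restaurant.grades || []) {
      // One string key per (date, grade) pair, like the compound _id.
      const key = g.date.toISOString() + "|" + g.grade;
      if (!groups.has(key)) {
        groups.set(key, { _id: { date: g.date, grade: g.grade }, count: 0 });
      }
      groups.get(key).count += 1;
    }
  }
  return [...groups.values()];
}

// Sample data, invented for this example.
const sampleRestaurants = [
  { grades: [
    { date: new Date("2014-06-10T00:00:00.000Z"), grade: "A" },
    { date: new Date("2013-06-05T00:00:00.000Z"), grade: "B" },
  ] },
  { grades: [
    { date: new Date("2014-06-10T00:00:00.000Z"), grade: "A" },
    { date: new Date("2014-06-10T00:00:00.000Z"), grade: "C" },
  ] },
];
console.log(countGradesByDate(sampleRestaurants));
```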

mongodb aggregation find min value and other fields in nested array

Is it possible to find the max date in a nested array and show its price, along with a parent field such as the actual price?
I want the result to look like this:
{
"_id" : ObjectId("5547e45c97d8b2c816c994c8"),
"actualPrice":19500,
"lastModifDate" :ISODate("2015-05-04T22:53:50.583Z"),
"price":"16000"
}
The data:
db.adds.findOne()
{
"_id" : ObjectId("5547e45c97d8b2c816c994c8"),
"addTitle" : "Clio pack luxe",
"actualPrice" : 19500,
"fistModificationDate" : ISODate("2015-05-03T22:00:00Z"),
"addID" : "1746540",
"history" : [
{
"price" : 18000,
"modifDate" : ISODate("2015-05-04T22:01:47.272Z"),
"_id" : ObjectId("5547ec4bfeb20b0414e8e51b")
},
{
"price" : 16000,
"modifDate" : ISODate("2015-05-04T22:53:50.583Z"),
"_id" : ObjectId("5547f87e83a1dae00bc033fa")
},
{
"price" : 19000,
"modifDate" : ISODate("2015-04-04T22:53:50.583Z"),
"_id" : ObjectId("5547f87e83a1dae00bc033fe")
}
],
"__v" : 1
}
My query:
db.adds.aggregate(
[
{ $match:{addID:"1746540"}},
{ $unwind:"$history"},
{ $group:{
_id:0,
lastModifDate:{$max:"$history.modifDate"}
}
}
])
I don't know how to include the other fields. I tried $project but I get errors.
Thanks for helping.
You could try the following aggregation pipeline, which does not need the $group stage: $project carries the fields through, and $sort with $limit picks the history entry with the latest date:
db.adds.aggregate([
{
"$match": {"addID": "1746540"}
},
{
"$unwind": "$history"
},
{
"$project": {
"actualPrice": 1,
"lastModifDate": "$history.modifDate",
"price": "$history.price"
}
},
{
"$sort": { "lastModifDate": -1 }
},
{
"$limit": 1
}
])
Output
/* 1 */
{
"result" : [
{
"_id" : ObjectId("5547e45c97d8b2c816c994c8"),
"actualPrice" : 19500,
"lastModifDate" : ISODate("2015-05-04T22:53:50.583Z"),
"price" : 16000
}
],
"ok" : 1
}
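For a single document, the same selection can be sketched in plain JavaScript on an in-memory object. The function name latestHistoryEntry is hypothetical, and the _id is written as a plain string because ObjectId is not available outside the shell or driver.

```javascript
// Hypothetical in-memory equivalent of $unwind + $project + $sort + $limit
// for one document: pick the history entry with the latest modifDate.
function latestHistoryEntry(ad) {
  // Nothing to pick if the document has no history.
  if (!ad.history || ad.history.length === 0) return null;
  // Date objects compare by timestamp, so > finds the most recent entry.
  const latest = ad.history.reduce((best, entry) =>
    entry.modifDate > best.modifDate ? entry : best
  );
  return {
    _id: ad._id,
    actualPrice: ad.actualPrice,
    lastModifDate: latest.modifDate,
    price: latest.price,
  };
}

// The document from the question, reduced to the fields used here.
const sampleAd = {
  _id: "5547e45c97d8b2c816c994c8",
  actualPrice: 19500,
  history: [
    { price: 18000, modifDate: new Date("2015-05-04T22:01:47.272Z") },
    { price: 16000, modifDate: new Date("2015-05-04T22:53:50.583Z") },
    { price: 19000, modifDate: new Date("2015-04-04T22:53:50.583Z") },
  ],
};
console.log(latestHistoryEntry(sampleAd));
```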