Mongodb: Deduplicate collection


I'm working with MongoDB and Node. I have a collection with a large number of records, an unknown number of which are duplicates. I'm trying to remove the duplicates by following "Remove duplicate records from mongodb 4.0" and https://docs.mongodb.com/manual/aggregation/ .
I am using the MongoDB Compass tool, and I can run code in the MongoDB shell at the bottom of it.
So far I have:
db.hayes.aggregate([
{"$group" : {_id:"$PropertyId", count:{$sum:1}}}
]);
{ "_id" : "R135418", "count" : 10 }
{ "_id" : "R47410", "count" : 17 }
{ "_id" : "R130794", "count" : 10 }
{ "_id" : "R92923", "count" : 18 }
{ "_id" : "R107811", "count" : 11 }
{ "_id" : "R91389", "count" : 15 }
{ "_id" : "R22047", "count" : 12 }
{ "_id" : "R103664", "count" : 10 }
{ "_id" : "R121349", "count" : 12 }
{ "_id" : "R143168", "count" : 8 }
{ "_id" : "R85918", "count" : 13 }
{ "_id" : "R41641", "count" : 13 }
{ "_id" : "R160910", "count" : 11 }
{ "_id" : "R48919", "count" : 11 }
{ "_id" : "M119387", "count" : 10 }
{ "_id" : "R161734", "count" : 12 }
{ "_id" : "R41259", "count" : 13 }
{ "_id" : "R156538", "count" : 7 }
{ "_id" : "R60868", "count" : 10 }
How do I now select one document from each group to avoid duplicates? (I can see that loading the result into a new collection will likely involve {$out: "theCollectionWithoutDuplicates"}.)
edit:
The output from db.hayes.aggregate([
{
$group: {
_id: "$PropertyId",
count: {
$sum: 1
},
ids: {
$addToSet: "$_id"
}
}
},
{
$match: {
count: {
$gt: 1
}
}
}
]) gives output looking like:
{ "_id" : "M118975", "count" : 8, "ids" : [ ObjectId("60147f84e9fdd41da73272d6"), ObjectId("601427ac432deb152a70b8fd"), ObjectId("6014639be210571a70d1118f"), ObjectId("60145e9ae210571a70d0062f"), ObjectId("60145545b6f7a917817e9519"), ObjectId("6014619be210571a70d0a091"), ObjectId("60145dc3d5a2811a459b4e07"), ObjectId("60146641e210571a70d1a3cd") ] }
{ "_id" : "R88986", "count" : 10, "ids" : [ ObjectId("60131752de3d3a09bc1eb04b"), ObjectId("6013348385dcda0eb5b8d40c"), ObjectId("60145297b6f7a917817e1928"), ObjectId("601458eeb08c4919df85f63d"), ObjectId("601462f4e210571a70d0e961"), ObjectId("60142ad9c0db1716068a612e"), ObjectId("601425263df18a145b2fd0a8"), ObjectId("60145be5d5a2811a459aea7e"), ObjectId("6014634ce210571a70d0fe5c"), ObjectId("60131a1ab7335806a1816b95") ] }
{ "_id" : "P119977", "count" : 11, "ids" : [ ObjectId("601468b9597abd1bfd0798a4"), ObjectId("60144b7dbfa28016887b0e8f"), ObjectId("60147094c4bca31cfdb12d1d"), ObjectId("60144de7bfa28016887b698b"), ObjectId("60135aa63674d90dffec3759"), ObjectId("60135f552441920e97e858a3"), ObjectId("601428b3432deb152a70f32e"), ObjectId("60141b222ac11f13055725a5"), ObjectId("60145326b6f7a917817e38b6"), ObjectId("6014882c5322582035e83f63"), ObjectId("6014741ae9fdd41da7313a44") ] }
However, when I run the forEach loop, it runs for several minutes and then crashes.
The database mydb was originally 0.173 GB, but it is now 0.368 GB.
Any idea what is wrong?
edit 2:
I rebooted, then reran your entire script. This time it completed in 3-4 minutes with no errors.
> show dbs
admin 0.000GB
config 0.000GB
local 0.000GB
myNewDatabase 0.000GB
mydb 0.396GB
> db.hayes.aggregate([ {"$group" : {_id:"$PropertyId", count:{$sum:1}}},{$count:"total"} ]);
{ "total" : 103296 }
> db.hayes.aggregate([ {"$group" : {_id:"$PropertyId", count:{$sum:1}}} ]);
{ "_id" : "R96274", "count" : 1 }
{ "_id" : "R106186", "count" : 1 }
{ "_id" : "R169417", "count" : 1 }
{ "_id" : "R140542", "count" : 1 }
So it looks like it worked this time, but why is 'mydb' getting larger?

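Here is how to keep a single document from every group of duplicates and remove the rest (the example uses a collection named test; substitute your own collection name, hayes in the question):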
db.test.aggregate([
{
$group: {
_id: "$PropertyId",
count: {
$sum: 1
},
ids: {
$addToSet: "$_id"
}
}
},
{
$match: {
count: {
$gt: 1
}
}
}
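]).forEach(function(d){
// Drop the first _id from the list: that document is the one that is kept.
d.ids.shift();
printjson(d.ids);
// Delete every remaining document in this group.
// deleteMany replaces the deprecated remove() helper.
db.test.deleteMany({
_id: {
$in: d.ids
}
})
})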
Explained:
You group by PropertyId and collect the _id of every document in the ids array.
You keep only the groups that have more than one document (the duplicates).
You loop over those groups, remove the first _id from each ids array (that document is kept), and delete the documents whose _id values remain in the array.
You can run this multiple times: if there are no duplicates, nothing is deleted.
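To see what the loop above does without touching a database, here is a minimal in-memory sketch in plain JavaScript. It is not MongoDB API: the helper name idsToDelete and the sample documents are made up for illustration and only mirror the $group / $match / shift() logic.

```javascript
// Hypothetical helper (not part of MongoDB): mirrors $group by PropertyId
// plus $match count > 1, then skips the first _id of each group.
function idsToDelete(docs) {
  const groups = new Map();
  for (const doc of docs) {
    if (!groups.has(doc.PropertyId)) {
      groups.set(doc.PropertyId, []);
    }
    groups.get(doc.PropertyId).push(doc._id);
  }

  const toDelete = [];
  for (const ids of groups.values()) {
    if (ids.length > 1) {
      // ids[0] is the document that survives; the rest are duplicates.
      toDelete.push(...ids.slice(1));
    }
  }
  return toDelete;
}

// Made-up sample data, not taken from the hayes collection.
const sampleDocs = [
  { _id: 1, PropertyId: "R1" },
  { _id: 2, PropertyId: "R1" },
  { _id: 3, PropertyId: "R2" },
  { _id: 4, PropertyId: "R1" },
];

console.log(idsToDelete(sampleDocs)); // [ 2, 4 ]
```

In the real pipeline, $addToSet does not guarantee the order of the ids array, so which document of each group survives is arbitrary. If a specific one must be kept (for example the oldest), the pipeline would need a $sort before the $group and $push instead of $addToSet.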

Related

Get cursor count in Mongodb with aggregation framework

I'm working with MongoDB and Node. I have a collection with a large number of records, an unknown number of which are duplicates. I'm trying to remove the duplicates by following "Remove duplicate records from mongodb 4.0" and https://docs.mongodb.com/manual/aggregation/ .
So far I have:
db.hayes.aggregate([
{"$group" : {_id:"$PropertyId", count:{$sum:1}}}
]);
{ "_id" : "R135418", "count" : 10 }
{ "_id" : "R47410", "count" : 17 }
{ "_id" : "R130794", "count" : 10 }
{ "_id" : "R92923", "count" : 18 }
{ "_id" : "R107811", "count" : 11 }
{ "_id" : "R91389", "count" : 15 }
{ "_id" : "R22047", "count" : 12 }
{ "_id" : "R103664", "count" : 10 }
{ "_id" : "R121349", "count" : 12 }
{ "_id" : "R143168", "count" : 8 }
{ "_id" : "R85918", "count" : 13 }
{ "_id" : "R41641", "count" : 13 }
{ "_id" : "R160910", "count" : 11 }
{ "_id" : "R48919", "count" : 11 }
{ "_id" : "M119387", "count" : 10 }
{ "_id" : "R161734", "count" : 12 }
{ "_id" : "R41259", "count" : 13 }
{ "_id" : "R156538", "count" : 7 }
{ "_id" : "R60868", "count" : 10 }
To get the number of groups, I tried the following in the mongo shell:
> const cursor = db.hayes.aggregate([{"$group" :
{_id:"$PropertyId", count:{$sum:1}}} ]);
> cursor.count()
uncaught exception: TypeError: cursor.count is not a function :
#(shell):1:1
Apparently this works with db.collection.find(). How do I do this with the aggregation framework?

Add the following stage after the group stage to see the groups count:
{$count:"Total"}

The count() method on a cursor changes the operation being sent to the server from find to count. This only works if you are sending a find query to begin with, not when you are aggregating.
See https://docs.mongodb.com/manual/reference/method/cursor.count/#cursor.count , which includes guidance on how to count when aggregating.
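As a rough illustration of what $group followed by {$count:"Total"} computes, here is an in-memory sketch in plain JavaScript. The helper countGroups and the sample data are hypothetical, not MongoDB API.

```javascript
// Hypothetical helper: the number of groups produced by grouping on `key`
// is the number of distinct values of that field.
function countGroups(docs, key) {
  const distinct = new Set(docs.map((doc) => doc[key]));
  // Note: the real $count stage emits no document at all for empty input;
  // this sketch returns { Total: 0 } in that case.
  return { Total: distinct.size };
}

// Made-up sample data.
const sampleDocs = [
  { PropertyId: "R135418" },
  { PropertyId: "R135418" },
  { PropertyId: "R47410" },
];

console.log(countGroups(sampleDocs, "PropertyId")); // { Total: 2 }
```

In the shell itself, calling toArray().length on the aggregation cursor gives the same number, but it transfers every group to the client first, so the $count stage is the cheaper choice for large collections.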

Does MongoDB support aggregate queries on the result of the aggregate query?

I have an aggregate query that returns the count of records a property has.
db.collection.aggregate([
{
$group : {
_id : "$propertyId",
count: { $sum: 1 }
}
},
{
$sort : { count: 1 }
}
],
{
allowDiskUse:true
});
This gives me a result that looks like this.
{ "_id" : 1234, "count" : 1 }
{ "_id" : 1235, "count" : 1 }
{ "_id" : 1236, "count" : 2 }
{ "_id" : 1237, "count" : 3 }
{ "_id" : 1238, "count" : 3 }
Now I want to count the counts. So the above result would turn into this.
{ "_id" : 1, "count" : 2 }
{ "_id" : 2, "count" : 1 }
{ "_id" : 3, "count" : 2 }
Is this possible to do with a query, or do I need to write some code to get this done?

I updated the query with a second $group stage that counts the counts. This is how it looks:
db.collection.aggregate([
{
$group : {
_id : "$propertyId",
count: { $sum: 1 }
}
},
{
$group : {
_id : "$count",
countOfCounts: { $sum: 1 }
}
},
{
$sort : { countOfCounts: 1 }
}
],
{
allowDiskUse:true
});
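The two $group stages can be pictured as the same tally applied twice. The following plain JavaScript sketch assumes made-up input documents chosen to reproduce the counts from the question; the helper name tally is hypothetical.

```javascript
// Hypothetical helper: count how often each key occurs in an iterable.
function tally(items, keyFn) {
  const counts = new Map();
  for (const item of items) {
    const key = keyFn(item);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Made-up documents: 1234 and 1235 once, 1236 twice, 1237 and 1238 three times.
const docs = [1234, 1235, 1236, 1236, 1237, 1237, 1237, 1238, 1238, 1238]
  .map((propertyId) => ({ propertyId }));

// Stage 1: number of documents per propertyId.
const perProperty = tally(docs, (doc) => doc.propertyId);

// Stage 2: how many properties share each count.
const countOfCounts = tally(perProperty.values(), (count) => count);

console.log([...countOfCounts.entries()]); // [ [ 1, 2 ], [ 2, 1 ], [ 3, 2 ] ]
```

Each pair reads as [count, countOfCounts], which matches the result asked for in the question: two properties with one record, one with two, and two with three.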

Counting how many times unique values occur in an array across a MongoDB collection

So I have a collection of users. The user document is a very simple document, and looks like this:
{
username: "player101",
badges: ["score10", "score100"]
}
So how can I query to see how many times each unique value in the badges array occurs across the entire collection?

Use aggregation with $unwind and $group stages, counting each badge with the $sum accumulator:
db.players.aggregate([
{
$unwind: "$badges"
},
{
$group:
{
_id: "$badges",
count: { $sum: 1 }
}
}
]);
on collection players with documents
{ "username" : "player101", "badges" : [ "score10", "score100" ] }
{ "username" : "player102", "badges" : [ "score11", "score100" ] }
{ "username" : "player103", "badges" : [ "score11", "score101" ] }
{ "username" : "player104", "badges" : [ "score12", "score100" ] }
gives you the result
{ "_id" : "score101", "count" : 1 }
{ "_id" : "score11", "count" : 2 }
{ "_id" : "score12", "count" : 1 }
{ "_id" : "score100", "count" : 3 }
{ "_id" : "score10", "count" : 1 }
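For intuition, the $unwind plus $group combination behaves like a nested loop over documents and array elements. This is a minimal plain JavaScript sketch using the four sample documents from the answer; countBadges is a hypothetical helper name.

```javascript
// Hypothetical helper: the inner loop plays the role of $unwind,
// the counter update plays the role of $group with $sum: 1.
function countBadges(players) {
  const counts = {};
  for (const player of players) {
    // Like $unwind by default, documents with a missing or empty
    // badges array contribute nothing.
    for (const badge of player.badges || []) {
      counts[badge] = (counts[badge] || 0) + 1;
    }
  }
  return counts;
}

const players = [
  { username: "player101", badges: ["score10", "score100"] },
  { username: "player102", badges: ["score11", "score100"] },
  { username: "player103", badges: ["score11", "score101"] },
  { username: "player104", badges: ["score12", "score100"] },
];

const badgeCounts = countBadges(players);
console.log(badgeCounts);
```

The counts agree with the aggregation result above, for example score100 occurs three times and score11 twice.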

Sum the different grades by date in MongoDB [duplicate]

This question already has an answer here: Mongodb count distinct with multiple group fields.
I'm using the restaurants dataset from the MongoDB website. A document has arrays like the following:
{
"grades" : [
{
"date" : ISODate("2014-06-10T00:00:00.000Z"),
"grade" : "A"
},
{
"date" : ISODate("2013-06-05T00:00:00.000Z"),
"grade" : "B",
"score" : 7
},
{
"date" : ISODate("2012-04-13T00:00:00.000Z"),
"grade" : "A"
},
{
"date" : ISODate("2011-10-12T00:00:00.000Z"),
"grade" : "A"
}
]
}
I'm trying to get a list of all dates, with a count of how many of each grade there was on that day.
I've got this far:
db.restaurants.aggregate([{
$unwind : {
path: '$grades'
}
}, {
$group: {
_id: '$grades.date',
grades: {
$push: '$grades.grade'
}
}
}])
Which gives me each date and the grades on that date.
How do I now count the number of each unique grade?

Figured it out with thanks to this question.
The solution is actually much simpler than I was thinking:
db.restaurants.aggregate([{
$unwind : {
path: '$grades'
}
}, {
$group: {
_id: {
date: '$grades.date',
grade: '$grades.grade'
},
count: {
$sum: 1
}
}
}])
This gives a result like:
/* 1 */
{
"_id" : {
"date" : ISODate("2014-06-23T00:00:00.000Z"),
"grade" : "C"
},
"count" : 4
}
/* 2 */
{
"_id" : {
"date" : ISODate("2011-11-01T00:00:00.000Z"),
"grade" : "C"
},
"count" : 3
}
/* 3 */
{
"_id" : {
"date" : ISODate("2014-05-06T00:00:00.000Z"),
"grade" : "A"
},
"count" : 121
}
/* 4 */
{
"_id" : {
"date" : ISODate("2012-08-21T00:00:00.000Z"),
"grade" : "C"
},
"count" : 5
}
/* 5 */
{
"_id" : {
"date" : ISODate("2013-09-04T00:00:00.000Z"),
"grade" : "C"
},
"count" : 4
}
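Grouping on a compound _id amounts to counting per (date, grade) pair. The sketch below shows that idea in plain JavaScript; the helper countGradesByDate and the two sample restaurants are assumptions for illustration, not data from the restaurants dataset.

```javascript
// Hypothetical helper: count grades per (date, grade) pair, mirroring
// $unwind on grades followed by $group on a compound _id.
function countGradesByDate(restaurants) {
  const counts = new Map();
  for (const restaurant of restaurants) {
    for (const entry of restaurant.grades || []) {
      // Combine both fields into one key, like _id: { date, grade }.
      const key = `${entry.date.toISOString()}|${entry.grade}`;
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  }
  return counts;
}

// Made-up sample data.
const restaurants = [
  {
    grades: [
      { date: new Date("2014-06-10T00:00:00.000Z"), grade: "A" },
      { date: new Date("2013-06-05T00:00:00.000Z"), grade: "B" },
    ],
  },
  {
    grades: [
      { date: new Date("2014-06-10T00:00:00.000Z"), grade: "A" },
      { date: new Date("2014-06-10T00:00:00.000Z"), grade: "C" },
    ],
  },
];

const gradeCounts = countGradesByDate(restaurants);
console.log(gradeCounts.get("2014-06-10T00:00:00.000Z|A")); // 2
```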

mongodb aggregation find min value and other fields in nested array

Is it possible to find the entry with the max date in a nested array and show its price, together with a parent field such as the actual price?
The result I want looks like this:
{
"_id" : ObjectId("5547e45c97d8b2c816c994c8"),
"actualPrice":19500,
"lastModifDate" :ISODate("2015-05-04T22:53:50.583Z"),
"price":"16000"
}
The data :
db.adds.findOne()
{
"_id" : ObjectId("5547e45c97d8b2c816c994c8"),
"addTitle" : "Clio pack luxe",
"actualPrice" : 19500,
"fistModificationDate" : ISODate("2015-05-03T22:00:00Z"),
"addID" : "1746540",
"history" : [
{
"price" : 18000,
"modifDate" : ISODate("2015-05-04T22:01:47.272Z"),
"_id" : ObjectId("5547ec4bfeb20b0414e8e51b")
},
{
"price" : 16000,
"modifDate" : ISODate("2015-05-04T22:53:50.583Z"),
"_id" : ObjectId("5547f87e83a1dae00bc033fa")
},
{
"price" : 19000,
"modifDate" : ISODate("2015-04-04T22:53:50.583Z"),
"_id" : ObjectId("5547f87e83a1dae00bc033fe")
}
],
"__v" : 1
}
My query:
db.adds.aggregate(
[
{ $match:{addID:"1746540"}},
{ $unwind:"$history"},
{ $group:{
_id:0,
lastModifDate:{$max:"$history.modifDate"}
}
}
])
I don't know how to include the other fields. I tried $project but got errors.
Thanks for helping.

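You could try the following aggregation pipeline, which does not need the $group stage because the $project stage takes care of the field projection. Sorting by lastModifDate in descending order and limiting the result to one document leaves only the most recent history entry: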
db.adds.aggregate([
{
"$match": {"addID": "1746540"}
},
{
"$unwind": "$history"
},
{
"$project": {
"actualPrice": 1,
"lastModifDate": "$history.modifDate",
"price": "$history.price"
}
},
{
"$sort": { "lastModifDate": -1 }
},
{
"$limit": 1
}
])
Output
/* 1 */
{
"result" : [
{
"_id" : ObjectId("5547e45c97d8b2c816c994c8"),
"actualPrice" : 19500,
"lastModifDate" : ISODate("2015-05-04T22:53:50.583Z"),
"price" : 16000
}
],
"ok" : 1
}
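The same selection can be sketched in plain JavaScript to make the logic explicit: pick the history entry with the greatest modifDate and return it next to the parent fields. The helper name latestPrice is hypothetical, and _id is written as a plain string here because ObjectId is not available outside the shell.

```javascript
// Hypothetical helper: mirrors $unwind + $project + $sort desc + $limit 1
// for a single document.
function latestPrice(ad) {
  if (!ad.history || ad.history.length === 0) {
    return null; // $unwind would drop such a document
  }
  // Date objects compare by timestamp when used with ">".
  const latest = ad.history.reduce((best, entry) =>
    entry.modifDate > best.modifDate ? entry : best
  );
  return {
    _id: ad._id,
    actualPrice: ad.actualPrice,
    lastModifDate: latest.modifDate,
    price: latest.price,
  };
}

// The document from the question, reduced to the fields that matter.
const ad = {
  _id: "5547e45c97d8b2c816c994c8",
  actualPrice: 19500,
  history: [
    { price: 18000, modifDate: new Date("2015-05-04T22:01:47.272Z") },
    { price: 16000, modifDate: new Date("2015-05-04T22:53:50.583Z") },
    { price: 19000, modifDate: new Date("2015-04-04T22:53:50.583Z") },
  ],
};

const latestResult = latestPrice(ad);
console.log(latestResult.price, latestResult.lastModifDate.toISOString());
// 16000 2015-05-04T22:53:50.583Z
```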
