I'm working with mongo and node. I have a collection with a large number of records, an unknown number of which are duplicates. I'm trying to remove the duplicates following Remove duplicate records from mongodb 4.0 and https://docs.mongodb.com/manual/aggregation/ .
So far I have:
db.hayes.aggregate([
... {"$group" : {_id:"$PropertyId", count:{$sum:1}}}
... ]
... );
{ "_id" : "R135418", "count" : 10 }
{ "_id" : "R47410", "count" : 17 }
{ "_id" : "R130794", "count" : 10 }
{ "_id" : "R92923", "count" : 18 }
{ "_id" : "R107811", "count" : 11 }
{ "_id" : "R91389", "count" : 15 }
{ "_id" : "R22047", "count" : 12 }
{ "_id" : "R103664", "count" : 10 }
{ "_id" : "R121349", "count" : 12 }
{ "_id" : "R143168", "count" : 8 }
{ "_id" : "R85918", "count" : 13 }
{ "_id" : "R41641", "count" : 13 }
{ "_id" : "R160910", "count" : 11 }
{ "_id" : "R48919", "count" : 11 }
{ "_id" : "M119387", "count" : 10 }
{ "_id" : "R161734", "count" : 12 }
{ "_id" : "R41259", "count" : 13 }
{ "_id" : "R156538", "count" : 7 }
{ "_id" : "R60868", "count" : 10 }
To get the number of groups, I tried the following in the mongo shell:
> const cursor = db.hayes.aggregate([{"$group" :
{_id:"$PropertyId", count:{$sum:1}}} ]);
> cursor.count()
uncaught exception: TypeError: cursor.count is not a function :
#(shell):1:1
Apparently this works with the db.collection.find statement. How do I do this with the aggregation framework?
Add the following stage after the group stage to see the count of groups:
{$count:"Total"}
The count method on the cursor changes the query being sent from a find to a count. This only works if you are sending a find query to begin with, i.e., not when you are aggregating.
See https://docs.mongodb.com/manual/reference/method/cursor.count/#cursor.count, which includes guidance on how to count when aggregating.
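For example, appending that stage to the pipeline from the question (the "Total" label is arbitrary) returns a single document holding the number of groups:
db.hayes.aggregate([
    { "$group" : { _id: "$PropertyId", count: { $sum: 1 } } },
    { "$count" : "Total" }    // emits one document like { "Total" : <number of groups> }
]);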
Related
I'm working with mongo and node. I have a collection with a large number of records, an unknown number of which are duplicates. I'm trying to remove the duplicates following Remove duplicate records from mongodb 4.0 and https://docs.mongodb.com/manual/aggregation/ .
I am using the MongoDB Compass tool and am able to run the code in the MongoDB shell at the bottom of the tool.
So far I have:
db.hayes.aggregate([
... {"$group" : {_id:"$PropertyId", count:{$sum:1}}}
... ]
... );
{ "_id" : "R135418", "count" : 10 }
{ "_id" : "R47410", "count" : 17 }
{ "_id" : "R130794", "count" : 10 }
{ "_id" : "R92923", "count" : 18 }
{ "_id" : "R107811", "count" : 11 }
{ "_id" : "R91389", "count" : 15 }
{ "_id" : "R22047", "count" : 12 }
{ "_id" : "R103664", "count" : 10 }
{ "_id" : "R121349", "count" : 12 }
{ "_id" : "R143168", "count" : 8 }
{ "_id" : "R85918", "count" : 13 }
{ "_id" : "R41641", "count" : 13 }
{ "_id" : "R160910", "count" : 11 }
{ "_id" : "R48919", "count" : 11 }
{ "_id" : "M119387", "count" : 10 }
{ "_id" : "R161734", "count" : 12 }
{ "_id" : "R41259", "count" : 13 }
{ "_id" : "R156538", "count" : 7 }
{ "_id" : "R60868", "count" : 10 }
How do I now select one document from each of the groups to avoid duplicates? (I can see that loading the result into a new collection will likely involve {$out: "theCollectionWithoutDuplicates"}.)
edit:
Running db.hayes.aggregate([
{
$group: {
_id: "$PropertyId",
count: {
$sum: 1
},
ids: {
$addToSet: "$_id"
}
}
},
{
$match: {
count: {
$gt: 1
}
}
}
]) gives output looking like:
{ "_id" : "M118975", "count" : 8, "ids" : [ ObjectId("60147f84e9fdd41da73272d6"), ObjectId("601427ac432deb152a70b8fd"), ObjectId("6014639be210571a70d1118f"), ObjectId("60145e9ae210571a70d0062f"), ObjectId("60145545b6f7a917817e9519"), ObjectId("6014619be210571a70d0a091"), ObjectId("60145dc3d5a2811a459b4e07"), ObjectId("60146641e210571a70d1a3cd") ] }
{ "_id" : "R88986", "count" : 10, "ids" : [ ObjectId("60131752de3d3a09bc1eb04b"), ObjectId("6013348385dcda0eb5b8d40c"), ObjectId("60145297b6f7a917817e1928"), ObjectId("601458eeb08c4919df85f63d"), ObjectId("601462f4e210571a70d0e961"), ObjectId("60142ad9c0db1716068a612e"), ObjectId("601425263df18a145b2fd0a8"), ObjectId("60145be5d5a2811a459aea7e"), ObjectId("6014634ce210571a70d0fe5c"), ObjectId("60131a1ab7335806a1816b95") ] }
{ "_id" : "P119977", "count" : 11, "ids" : [ ObjectId("601468b9597abd1bfd0798a4"), ObjectId("60144b7dbfa28016887b0e8f"), ObjectId("60147094c4bca31cfdb12d1d"), ObjectId("60144de7bfa28016887b698b"), ObjectId("60135aa63674d90dffec3759"), ObjectId("60135f552441920e97e858a3"), ObjectId("601428b3432deb152a70f32e"), ObjectId("60141b222ac11f13055725a5"), ObjectId("60145326b6f7a917817e38b6"), ObjectId("6014882c5322582035e83f63"), ObjectId("6014741ae9fdd41da7313a44") ] }
However, when I run the forEach loop it runs for minutes and then crashes.
Originally the database mydb was 0.173 GB, but now it is 0.368 GB.
Any idea what is wrong?
edit 2:
I rebooted, then reran your entire script. This time it completed in 3-4 minutes with no errors.
> show dbs
admin 0.000GB
config 0.000GB
local 0.000GB
myNewDatabase 0.000GB
mydb 0.396GB
> db.hayes.aggregate([ {"$group" : {_id:"$PropertyId", count:{$sum:1}}},{$count:"total"} ]);
{ "total" : 103296 }
> db.hayes.aggregate([ {"$group" : {_id:"$PropertyId", count:{$sum:1}}} ]);
{ "_id" : "R96274", "count" : 1 }
{ "_id" : "R106186", "count" : 1 }
{ "_id" : "R169417", "count" : 1 }
{ "_id" : "R140542", "count" : 1 }
So it looks like it worked this time, but why is 'mydb' getting larger?
Here is how to keep a single document from every duplicate group and remove the rest:
db.test.aggregate([
{
$group: {
_id: "$PropertyId",
count: {
$sum: 1
},
ids: {
$addToSet: "$_id"
}
}
},
{
$match: {
count: {
$gt: 1
}
}
}
]).forEach(function(d){
    d.ids.shift();        // keep the first _id by dropping it from the deletion list
    printjson(d.ids);     // log the _ids about to be removed
    db.test.remove({      // delete the remaining duplicate documents
        _id: {
            $in: d.ids
        }
    })
})
Explained:
You group by PropertyId and preserve every document's _id in the ids array.
You filter only the groups that have more than 1 document (the duplicates).
You loop over all groups, remove the first _id from the ids array (so that document is kept), and delete the remaining duplicates.
You can execute this multiple times; if there are no duplicates, no deletions will be executed.
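If you would rather leave the original collection untouched, here is a minimal sketch of the $out idea mentioned in the question (this is an assumption-laden alternative, not the answer above; it assumes MongoDB 3.4+ for $replaceRoot, and the output collection name is the one suggested in the question):
db.hayes.aggregate([
    { $group: { _id: "$PropertyId", doc: { $first: "$$ROOT" } } },   // keep one whole document per PropertyId
    { $replaceRoot: { newRoot: "$doc" } },                           // promote it back to the top level
    { $out: "theCollectionWithoutDuplicates" }                       // write the de-duplicated set to a new collection
])
For a large collection you may need to pass { allowDiskUse: true } as the options argument to aggregate.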
I have the following dataset after completing some aggregation magic:
{ "_id" : "5700edfe03fcdb000347bebb", "comment" : { "commentor" : "56f3f70d4de8c74a69d1d5e1", "id" : ObjectId("570175e6c002e46edb922aa1")}, "max" : ObjectId("570175e6c002e46edb922aa3")}
{ "_id" : "5700edfe03fcdb000347bebb", "comment" : { "commentor" : "56f3f70d4de8c74a69d1d5e6", "id" : ObjectId("570175e6c002e46edb922aa2")}, "max" : ObjectId("570175e6c002e46edb922aa3")}
{ "_id" : "5700edfe03fcdb000347bebb", "comment" : { "commentor" : "56f3f70d4de8c74a69d1d5e1", "id" : ObjectId("570175e6c002e46edb922aa3")}, "max" : ObjectId("570175e6c002e46edb922aa3")}
The _id represents a post and in the post, there are comments. In this case, there are 3 comments; 2 by the same commentor ("56f3f70d4de8c74a69d1d5e1") and one by another commentor ("56f3f70d4de8c74a69d1d5e6").
I want to write an aggregation query that would count up all the unique comments by commentor ("56f3f70d4de8c74a69d1d5e1") only and return that the commentor commented twice on post "5700edfe03fcdb000347bebb".
I tried the following:
{ "$group" : { "_id" : "$_id", "count" : { "$sum" : "$comment.commentor" } } }
The results were:
{ "_id" : "5700edfe03fcdb000347bebb", "count" : 0 }
Please note that I'm not trying to count up all the comments by all the commentors in that post, so I'm not trying to do this:
{ "$group" : { "_id" : "$_id", "count" : { "$sum" : 1 } } }
This would result in:
{ "_id" : "5700edfe03fcdb000347bebb", "count" : 3 }
I just want the count of comments by user ("56f3f70d4de8c74a69d1d5e1") on that post.
EDIT:
After some research, I see that $sum only works on numeric fields and not non-numeric fields: https://docs.mongodb.com/manual/reference/operator/aggregation/sum/#grp._S_sum
Is there any way I can get the number of comments posted by user ("56f3f70d4de8c74a69d1d5e1") per post "5700edfe03fcdb000347bebb"?
So after a bit of trial and error, I managed to figure it out.
group2 = {
"$group" : {
"_id" : "$_id",
"count" : {
"$sum" : {"$cond" : [ {"$eq" : ["$comms.c", "56f3f70d4de8c74a69d1d5e1"] }, 1 ,0 ] }
}
}
}
We are summing up 1s on the condition that comms.c equals the user "56f3f70d4de8c74a69d1d5e1".
Result:
{ "_id" : "5700edfe03fcdb000347bebb", "count" : 2 }
I'm doing a fuzzy match on an input sentence, and I currently have a step in the aggregation framework like this:
{ "$group" : { "_id" : "$_id" , "score" : { "$sum" : 1}}}
but I'd like to be able to score shorter matches higher and want to do something like:
{ "$group" : { "_id" : "$_id" , "score" : { "$sum" : "1 / $length"}}}
Is something like this possible?
Yes, it should be possible (assuming $length is a field name in your documents), but the command should look like this:
{ "$group" : { "_id" : "$_id" , "score" : { $sum : {$divide: [1, "$length"]}}}}
You can find more details about the available arithmetic expressions in the MongoDB aggregation operator documentation.
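One caveat: $divide raises an error if $length is missing or zero in any document. A defensive variant of the same stage (still a sketch, assuming $length is a field in your documents) could be:
{ "$group" : { "_id" : "$_id", "score" : { "$sum" : {
    "$cond" : [
        { "$gt" : [ { "$ifNull" : ["$length", 0] }, 0 ] },   // only divide when length is present and positive
        { "$divide" : [1, "$length"] },
        0                                                     // otherwise contribute nothing to the score
    ]
} } } }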
I guess you want something like this. The default "_id" values are unique, so you probably want to group on some other field. So I used another field, idd, instead of _id here.
> db.tmp1.aggregate({'$group':{'_id':'$idd', 'count':{'$sum':1} }},{ $project : { _id: 1, suminv :{$divide:[1, '$count'] } } } );
{ "_id" : 2, "suminv" : 0.3333333333333333 }{ "_id" : 1, "suminv" : 0.5 }
> db.tmp1.find();
{ "_id" : ObjectId("572a5a74024dc1f2fe4b432b"), "idd" : 1, "score" : 2 }
{ "_id" : ObjectId("572a5a79024dc1f2fe4b432c"), "idd" : 1, "score" : 1 }
{ "_id" : ObjectId("572a5a82024dc1f2fe4b432d"), "idd" : 2, "score" : 1 }
{ "_id" : ObjectId("572a5a86024dc1f2fe4b432e"), "idd" : 2, "score" : 2 }
{ "_id" : ObjectId("572a5a8e024dc1f2fe4b432f"), "idd" : 2, "score" : 3 }
I have the following collection in mongodb.
{ "_id" : ObjectId("519a35ee8f2ceda43f42add5"), "articulo" : "Sobre mongodb", "autor" : "xxxx1", "calificacion" : 3 }
{ "_id" : ObjectId("519a360b8f2ceda43f42add6"), "articulo" : "Aggregation framework", "autor" : "xxxx1", "calificacion" : 5 }
{ "_id" : ObjectId("519a361b8f2ceda43f42add7"), "articulo" : "Sobre journal", "autor" : "xxxx2", "calificacion" : 4 }
{ "_id" : ObjectId("519a362e8f2ceda43f42add8"), "articulo" : "Manipulando datos", "autor" : "xxxx1", "calificacion" : 2 }
{ "_id" : ObjectId("519a36418f2ceda43f42add9"), "articulo" : "MongoDB for dba", "autor" : "xxxx2", "calificacion" : 5 }
{ "_id" : ObjectId("519a4aa18f2ceda43f42adda"), "articulo" : "ejemplo2", "autor" : "xxxx1", "calificacion" : 5 }
I want to count the number of articles (articulo) with the max grade (calificacion) by author (autor).
xxxx1 has 2 articles with a grade of 5
xxxx2 has 1 article with a grade of 5
(I don't know what's the max grade)
I've tried this:
db.ejemplo.aggregate([
{$group:{_id: "$autor" , calificacion:{$max:"$calificacion" }}}
])
but I only get each author's max grade, not the number of articles. Can I do this with the Aggregation Framework?
You can try an aggregation operation like this:
db.ejemplo.aggregate([
    { $group : { _id : { autor : "$autor",                          // count articles per (author, grade) pair
                         calificacion : "$calificacion" },
                 articulos : { $sum : 1 },
    }},
    { $sort : { "_id.calificacion" : -1 }},                         // highest grade first
    { $group : { _id : "$_id.autor",                                // for each author, keep the first (highest) grade
                 calificacion : { $first : "$_id.calificacion" },
                 articulos : { $first : "$articulos" },             // and the article count that goes with it
    }}
])
And the result is like this:
{
"result" : [
{
"_id" : "xxxx1",
"calificacion" : 5,
"articulos" : 2
},
{
"_id" : "xxxx2",
"calificacion" : 5,
"articulos" : 1
}
],
"ok" : 1
}
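A hedged alternative sketch (not the answer above, just one other way to get the same result, assuming MongoDB 3.2+ for $filter and $size): collect each author's grades, take the maximum, and count how many grades equal it:
db.ejemplo.aggregate([
    { $group : {
        _id : "$autor",
        calificaciones : { $push : "$calificacion" },   // all grades for this author
        calificacion : { $max : "$calificacion" }       // the author's highest grade
    }},
    { $project : {
        calificacion : 1,
        articulos : { $size : { $filter : {             // count the grades equal to the max
            input : "$calificaciones",
            as : "c",
            cond : { $eq : ["$$c", "$calificacion"] }
        } } }
    }}
])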
Here is the query:
db.posts.find({"project.id":5,"project.sections":6,"reading":0,"publicate":1},{"date":1}).sort({"date":-1}).limit(20)
And here is the output for it:
{ "_id" : ObjectId("51342351b6f8f38564000001"), "date" : ISODate("2013-03-05T12:38:41.731Z") }
{ "_id" : ObjectId("510ff98da80f733357000002"), "date" : ISODate("2013-02-04T19:20:25.618Z") }
{ "_id" : ObjectId("50fe4bafb6f8f3a14d000002"), "date" : ISODate("2013-01-22T08:45:16.590Z") }
{ "_id" : ObjectId("50fada8ea80f737202000039"), "date" : ISODate("2013-01-19T19:16:23.294Z") }
{ "_id" : ObjectId("50e0101fa80f73d664000002"), "date" : ISODate("2012-12-30T09:58:33.881Z") }
{ "_id" : ObjectId("50dd54d4b6f8f3923d000014"), "date" : ISODate("2012-12-30T09:52:30.993Z") }
{ "_id" : ObjectId("50ccd4a0a80f73b742000008"), "date" : ISODate("2012-12-15T20:58:18.946Z") }
{ "_id" : ObjectId("50c0e38eb6f8f32121000018"), "date" : ISODate("2012-12-06T18:35:43.098Z") }
{ "_id" : ObjectId("50314562b6f8f3f844000000"), "date" : ISODate("2012-08-22T07:06:54.822Z") }
{ "_id" : ObjectId("502012f3b6f8f3df3a000001"), "date" : ISODate("2012-08-06T19:23:10.586Z") }
{ "_id" : ObjectId("4fe6ea5ab6f8f39f59000000"), "date" : ISODate("2012-06-24T10:25:32.969Z") }
{ "_id" : ObjectId("516bbcb2a80f73a55a000000"), "date" : ISODate("2013-04-15T10:36:32.688Z") }
{ "_id" : ObjectId("516a5f62a80f733e60000000"), "date" : ISODate("2013-04-14T09:00:19.459Z") }
{ "_id" : ObjectId("515e3f2ca80f738536000003"), "date" : ISODate("2013-04-05T03:07:53.960Z") }
{ "_id" : ObjectId("5155b7c4b6f8f3ad15000001"), "date" : ISODate("2013-03-29T16:18:44.228Z") }
{ "_id" : ObjectId("514009e8a80f73f429000001"), "date" : ISODate("2013-03-29T12:31:01.898Z") }
{ "_id" : ObjectId("515566d5a80f73437d000005"), "date" : ISODate("2013-03-29T10:10:15.113Z") }
{ "_id" : ObjectId("514572cbb6f8f36525000001"), "date" : ISODate("2013-03-17T07:39:33.738Z") }
{ "_id" : ObjectId("51432a77b6f8f3024d000000"), "date" : ISODate("2013-03-15T14:07:46.648Z") }
{ "_id" : ObjectId("513d4afcb6f8f3727b000000"), "date" : ISODate("2013-03-11T17:46:21.183Z") }
As you can see, the order is wrong as if sorting works in some weird way. Here is the output of explain() for that query:
"cursor" : "BtreeCursor project.id_1_project.sections_1_reading_1_publicate_1_date_-1",
"nscanned" : 929,
"nscannedObjects" : 915,
"n" : 8,
"millis" : 23,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : true,
"indexOnly" : false,
"indexBounds" : {
...}
But if I disable the index it sorts fine:
db.posts.find({"project.id":5,"project.sections":3,"reading":0,"publicate":1},{"date":1}).hint({$natural:1}).sort({"date":-1}).limit(20)
{ "_id" : ObjectId("51475ee4b6f8f3526f000004"), "date" : ISODate("2013-04-16T10:51:04.962Z") }
{ "_id" : ObjectId("5166e61fa80f73b658000001"), "date" : ISODate("2013-04-11T16:58:11.848Z") }
{ "_id" : ObjectId("514afc12a80f735162000000"), "date" : ISODate("2013-03-25T02:51:18.309Z") }
{ "_id" : ObjectId("513db351b6f8f3d601000006"), "date" : ISODate("2013-03-11T10:49:27.585Z") }
{ "_id" : ObjectId("5105ff74a80f739704000006"), "date" : ISODate("2013-02-19T11:19:57.448Z") }
{ "_id" : ObjectId("5121de84b6f8f3b20c000009"), "date" : ISODate("2013-02-18T07:58:40.779Z") }
{ "_id" : ObjectId("511dbc4ab6f8f3a550000006"), "date" : ISODate("2013-02-15T04:51:39.767Z") }
{ "_id" : ObjectId("51053aafa80f73ae74000002"), "date" : ISODate("2013-01-27T14:44:48.931Z") }
{ "_id" : ObjectId("50f1c7c4b6f8f3ed2e000003"), "date" : ISODate("2013-01-12T20:48:04.451Z") }
{ "_id" : ObjectId("50ec5111b6f8f3180e000034"), "date" : ISODate("2013-01-09T10:25:50.736Z") }
{ "_id" : ObjectId("50d36076b6f8f3707400000f"), "date" : ISODate("2012-12-20T19:14:40.412Z") }
{ "_id" : ObjectId("50b4f7b6b6f8f3d261000003"), "date" : ISODate("2012-11-27T17:52:24.675Z") }
{ "_id" : ObjectId("50a0b83eb6f8f30a74000001"), "date" : ISODate("2012-11-12T09:14:04.652Z") }
{ "_id" : ObjectId("5092746eb6f8f3c92d000000"), "date" : ISODate("2012-11-06T12:02:21.634Z") }
{ "_id" : ObjectId("50926d48b6f8f31d15000000"), "date" : ISODate("2012-11-01T13:11:40.107Z") }
{ "_id" : ObjectId("508a471cb6f8f33568000000"), "date" : ISODate("2012-10-26T19:41:50.516Z") }
{ "_id" : ObjectId("508998c5b6f8f3b977000000"), "date" : ISODate("2012-10-26T07:59:18.278Z") }
{ "_id" : ObjectId("5088c043b6f8f3442b000003"), "date" : ISODate("2012-10-25T05:08:12.372Z") }
{ "_id" : ObjectId("50857833b6f8f37770000001"), "date" : ISODate("2012-10-22T17:06:37.667Z") }
{ "_id" : ObjectId("507e2f0ab6f8f34c2d000000"), "date" : ISODate("2012-10-17T04:32:10.337Z") }
I have tried rebuilding the index for the whole collection using db.bla.reIndex(), but it didn't help. All other queries on the same collection that use the same index work just fine.
MongoDB 2.0.9
What could be the reason behind this behaviour?
You are sorting by date and then limiting the output.
MongoDB fetches documents matching the query with the dates ordered, then cuts off the first 20 records. Now MongoDB has 20 records from the previously ordered set of ObjectIds, and it shows these 20 without the order, because the final limit creates a set of documents and the find command returns whatever is in that set without ordering. You have to chain another sort command like this:
db.posts.find({"project.id":5,"project.sections":6,"reading":0,"publicate":1},{"date":1}).sort({"date":-1}).limit(20).sort({"date":-1}))
It may seem awkward, but it is actually naturally ordered. Suppose you want the list of documents in the natural order but only the first 20 with the most recent dates; your query is doing exactly that. But if you want a list of documents ordered by date, you should add one more sort command at the end of the query.