Filtering a list of votes where more than x matches are found - mongodb

I have the following vote data in a large collection:
{
"user_id" : ObjectId("53ac7bce4eaf6de4d5601c1a"),
"article_id" : ObjectId("53ab27504eaf6de4d5601be5"),
"score" : 5
},
{
"user_id" : ObjectId("53ac7bce4eaf6de4d5601c1b"),
"article_id" : ObjectId("53ab27504eaf6de4d5601be5"),
"score" : 3
},
{
"user_id" : ObjectId("53ac7bce4eaf6de4d5601c1c"),
"article_id" : ObjectId("53ab27504eaf6de4d5601be5"),
"score" : 3
},
...
I'm looking to filter this collection down to articles that have received more than 3 votes (as above) and output the matching vote documents as-is, excluding any votes on articles with fewer total votes.
Any help much appreciated. This collection can be huge so efficiency would be ideal.

This is normally not something you do in a single operation, but you can if those really are your only fields and there are not too many matching documents, since every vote for an article is accumulated into one array within a single grouped document:
db.collection.aggregate([
{ "$group": {
"_id": "$article_id",
"docs": {
"$push": {
"user_id": "$user_id",
"article_id": "$article_id",
"score": "$score"
}
},
"votes": { "$sum": 1 }
}},
{ "$match": { "votes": { "$gt": 3 } } },
{ "$unwind": "$docs" },
{ "$project": {
"user_id": "$docs.user_id",
"article_id": "$docs.article_id",
"score": "$docs.score"
}}
])
You can clean that up a little with MongoDB 2.6 and greater, which provides the $$ROOT system variable in the pipeline:
db.collection.aggregate([
{ "$group": {
"_id": "$article_id",
"docs": {
"$push": "$$ROOT"
},
"votes": { "$sum": 1 }
}},
{ "$match": { "votes": { "$gt": 3 } } },
{ "$unwind": "$docs" },
{ "$project": {
"user_id": "$docs.user_id",
"article_id": "$docs.article_id",
"score": "$docs.score"
}}
])
Otherwise, accept that you are doing this in a few steps and process the list of "article_id" values returned with a "votes" count greater than three:
var ids = db.collection.aggregate([
{ "$group": {
"_id": "$article_id",
"votes": { "$sum": 1 }
}},
{ "$match": { "votes": { "$gt": 3 } } },
]).toArray().map(function(x){ return x._id });
db.collection.find({ "article_id": { "$in": ids } })
In the shell on versions earlier than 2.6, aggregate() returns a single document rather than a cursor, so instead of calling .toArray() you would read the array from the "result" key of that document.
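To make the two-step logic concrete, here is a minimal in-memory sketch in plain JavaScript. It does not talk to MongoDB; the sample data, the short string ids standing in for ObjectId values, and the function names are all assumptions made for illustration.

```javascript
// Hypothetical sample data; short string ids stand in for ObjectId values.
const votes = [
  { user_id: "u1", article_id: "a1", score: 5 },
  { user_id: "u2", article_id: "a1", score: 3 },
  { user_id: "u3", article_id: "a1", score: 3 },
  { user_id: "u4", article_id: "a1", score: 4 },
  { user_id: "u5", article_id: "a2", score: 2 }
];

// Step 1: count votes per article (mirrors the $group with $sum).
function countByArticle(docs) {
  const counts = new Map();
  for (const doc of docs) {
    counts.set(doc.article_id, (counts.get(doc.article_id) || 0) + 1);
  }
  return counts;
}

// Step 2: keep ids with more than `min` votes (mirrors the $match),
// then select the original documents (mirrors find() with $in).
function votesForPopularArticles(docs, min) {
  const counts = countByArticle(docs);
  const ids = new Set(
    [...counts].filter(([, n]) => n > min).map(([id]) => id)
  );
  return docs.filter(doc => ids.has(doc.article_id));
}

const kept = votesForPopularArticles(votes, 3);
console.log(kept.length); // 4
```

Against a real collection the second step is the find() with $in, so an index on "article_id" would likely help with large data.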

Related

How to aggregate into a map [duplicate]

I have data in a profile collection:
[
{
name: "Harish",
gender: "Male",
caste: "Vokkaliga",
education: "B.E"
},
{
name: "Reshma",
gender: "Female",
caste: "Vokkaliga",
education: "B.E"
},
{
name: "Rangnath",
gender: "Male",
caste: "Lingayath",
education: "M.C.A"
},
{
name: "Lakshman",
gender: "Male",
caste: "Lingayath",
education: "B.Com"
},
{
name: "Reshma",
gender: "Female",
caste: "Lingayath",
education: "B.E"
}
]
Here I need to calculate the total number of documents for each distinct gender, each distinct caste and each distinct education.
Expected output:
{
gender: [{
name: "Male",
total: "3"
},
{
name: "Female",
total: "2"
}],
caste: [{
name: "Vokkaliga",
total: "2"
},
{
name: "Lingayath",
total: "3"
}],
education: [{
name: "B.E",
total: "3"
},
{
name: "M.C.A",
total: "1"
},
{
name: "B.Com",
total: "1"
}]
}
How can I get the expected result using MongoDB aggregation?
There are different approaches depending on the version available, but they all essentially break down to transforming your document fields into separate documents in an "array", then "unwinding" that array with $unwind and doing successive $group stages in order to accumulate the output totals and arrays.
MongoDB 3.4.4 and above
Recent releases have special operators like $arrayToObject and $objectToArray, which make the transformation of the source document into the initial "array" more dynamic than in earlier releases:
db.profile.aggregate([
{ "$project": {
"_id": 0,
"data": {
"$filter": {
"input": { "$objectToArray": "$$ROOT" },
"cond": { "$in": [ "$$this.k", ["gender","caste","education"] ] }
}
}
}},
{ "$unwind": "$data" },
{ "$group": {
"_id": "$data",
"total": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.k",
"v": {
"$push": { "name": "$_id.v", "total": "$total" }
}
}},
{ "$group": {
"_id": null,
"data": { "$push": { "k": "$_id", "v": "$v" } }
}},
{ "$replaceRoot": {
"newRoot": {
"$arrayToObject": "$data"
}
}}
])
So using $objectToArray you turn the initial document into an array of its keys and values, held as "k" and "v" in the resulting array of objects. We apply $filter here in order to select by "key". Here we use $in with a list of the keys we want, but this could just as well be a list of keys to "exclude" where that was shorter. It's just using logical operators to evaluate the condition.
The end stage here uses $replaceRoot and since all our manipulation and "grouping" in between still keeps that "k" and "v" form, we then use $arrayToObject here to promote our "array of objects" in result to the "keys" of the top level document in output.
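As a rough illustration of the "k" and "v" round trip, the following plain JavaScript mimics what $objectToArray, $filter and $arrayToObject do to a single document. The helper names are hypothetical and this runs in memory only, not on the server.

```javascript
// Rough equivalent of $objectToArray: { a: 1 } -> [{ k: "a", v: 1 }]
const objectToArray = obj =>
  Object.entries(obj).map(([k, v]) => ({ k, v }));

// Rough equivalent of $arrayToObject: [{ k: "a", v: 1 }] -> { a: 1 }
const arrayToObject = arr =>
  Object.fromEntries(arr.map(({ k, v }) => [k, v]));

const wanted = ["gender", "caste", "education"];
const doc = {
  name: "Harish",
  gender: "Male",
  caste: "Vokkaliga",
  education: "B.E"
};

// Same shape as the "data" field built by the $project stage with $filter.
const data = objectToArray(doc).filter(e => wanted.includes(e.k));
console.log(data);

// Converting back promotes each "k" to a top-level key again.
console.log(arrayToObject(data));
```

The "name" field is dropped by the filter, which is why only the three wanted keys reach the later $group stages.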
MongoDB 3.6 $mergeObjects
As an extra wrinkle here, MongoDB 3.6 includes $mergeObjects, which can also be used as an "accumulator" in a $group pipeline stage. It replaces the $push, and the final $replaceRoot then simply shifts the "data" key to the "root" of the returned document instead:
db.profile.aggregate([
{ "$project": {
"_id": 0,
"data": {
"$filter": {
"input": { "$objectToArray": "$$ROOT" },
"cond": { "$in": [ "$$this.k", ["gender","caste","education"] ] }
}
}
}},
{ "$unwind": "$data" },
{ "$group": { "_id": "$data", "total": { "$sum": 1 } }},
{ "$group": {
"_id": "$_id.k",
"v": {
"$push": { "name": "$_id.v", "total": "$total" }
}
}},
{ "$group": {
"_id": null,
"data": {
"$mergeObjects": {
"$arrayToObject": [
[{ "k": "$_id", "v": "$v" }]
]
}
}
}},
{ "$replaceRoot": { "newRoot": "$data" } }
])
This is not really that different to what is being demonstrated overall, but simply demonstrates how $mergeObjects can be used in this way and may be useful in cases where the grouping key was something different and we did not want that final "merge" to the root space of the object.
Note that the $arrayToObject is still needed to transform the "value" back into the name of the "key", but we just do it during the accumulation rather than after the grouping, since the new accumulation allows the "merge" of keys.
MongoDB 3.2
Taking it back a version or even if you have a MongoDB 3.4.x that is less than the 3.4.4 release, we can still use much of this but instead we deal with the creation of the array in a more static fashion, as well as handling the final "transform" on output differently due to the aggregation operators we don't have:
db.profile.aggregate([
{ "$project": {
"data": [
{ "k": "gender", "v": "$gender" },
{ "k": "caste", "v": "$caste" },
{ "k": "education", "v": "$education" }
]
}},
{ "$unwind": "$data" },
{ "$group": {
"_id": "$data",
"total": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.k",
"v": {
"$push": { "name": "$_id.v", "total": "$total" }
}
}},
{ "$group": {
"_id": null,
"data": { "$push": { "k": "$_id", "v": "$v" } }
}},
/*
{ "$replaceRoot": {
"newRoot": {
"$arrayToObject": "$data"
}
}}
*/
]).map( d =>
d.data.map( e => ({ [e.k]: e.v }) )
.reduce((acc,curr) => Object.assign(acc,curr),{})
)
This is exactly the same thing, except instead of having a dynamic transform of the document into the array, we "explicitly" assign each array member with the same "k" and "v" notation. We keep those key names purely for convention at this point, since none of the aggregation operators here depend on them.
Also instead of using $replaceRoot, we just do exactly the same thing as what the previous pipeline stage implementation was doing there but in client code instead. All MongoDB drivers have some implementation of cursor.map() to enable "cursor transforms". Here with the shell we use the basic JavaScript functions of Array.map() and Array.reduce() to take that output and again promote the array content to being the keys of the top level document returned.
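The cursor transform can be tried on its own without a server. This sketch applies the same map() and reduce() calls to a hand-written document in the shape the pipeline returns; the sample values and the name "promote" are assumptions for illustration.

```javascript
// A document in the shape produced by the final $group stage.
const result = {
  _id: null,
  data: [
    { k: "gender", v: [{ name: "Male", total: 3 }, { name: "Female", total: 2 }] },
    { k: "caste", v: [{ name: "Lingayath", total: 3 }, { name: "Vokkaliga", total: 2 }] }
  ]
};

// Turn each { k, v } pair into a one-key object, then merge them all.
const promote = d =>
  d.data
    .map(e => ({ [e.k]: e.v }))
    .reduce((acc, curr) => Object.assign(acc, curr), {});

const out = promote(result);
console.log(Object.keys(out)); // [ 'gender', 'caste' ]
```

The computed property name `[e.k]` is what promotes the stored key string to an actual key of the output object.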
MongoDB 2.6
And falling back to MongoDB 2.6 to cover the versions in between, the only thing that changes here is the usage of $map and a $literal for input with the array declaration:
db.profile.aggregate([
{ "$project": {
"data": {
"$map": {
"input": { "$literal": ["gender","caste", "education"] },
"as": "k",
"in": {
"k": "$$k",
"v": {
"$cond": {
"if": { "$eq": [ "$$k", "gender" ] },
"then": "$gender",
"else": {
"$cond": {
"if": { "$eq": [ "$$k", "caste" ] },
"then": "$caste",
"else": "$education"
}
}
}
}
}
}
}
}},
{ "$unwind": "$data" },
{ "$group": {
"_id": "$data",
"total": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.k",
"v": {
"$push": { "name": "$_id.v", "total": "$total" }
}
}},
{ "$group": {
"_id": null,
"data": { "$push": { "k": "$_id", "v": "$v" } }
}},
/*
{ "$replaceRoot": {
"newRoot": {
"$arrayToObject": "$data"
}
}}
*/
])
.map( d =>
d.data.map( e => ({ [e.k]: e.v }) )
.reduce((acc,curr) => Object.assign(acc,curr),{})
)
Since the basic idea here is to "iterate" a provided array of the field names, the actual assignment of values comes by "nesting" the $cond statements. For three possible outcomes this means only a single nesting in order to "branch" for each outcome.
Modern MongoDB releases from 3.4 onward have $switch, which makes this branching simpler, yet this demonstrates the logic was always possible: the $cond operator has been around since the aggregation framework was introduced in MongoDB 2.2.
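For comparison, a sketch of how the same $project stage might look with $switch in place of the nested $cond. It is written here as plain JavaScript objects so it can be inspected without a server; the variable names are hypothetical and the stage has not been run against a live deployment.

```javascript
// Expression for the "v" value inside the $map "in" clause,
// using $switch (MongoDB 3.4+) instead of nested $cond.
const valueForKey = {
  "$switch": {
    "branches": [
      { "case": { "$eq": ["$$k", "gender"] }, "then": "$gender" },
      { "case": { "$eq": ["$$k", "caste"] }, "then": "$caste" }
    ],
    // Anything that is not "gender" or "caste" falls through to education.
    "default": "$education"
  }
};

// Drop-in candidate for the first stage of the pipeline.
const projectStage = {
  "$project": {
    "data": {
      "$map": {
        "input": { "$literal": ["gender", "caste", "education"] },
        "as": "k",
        "in": { "k": "$$k", "v": valueForKey }
      }
    }
  }
};

console.log(JSON.stringify(projectStage, null, 2));
```

Each extra field adds one flat entry to "branches" rather than one more level of nesting.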
Again, the same transformation on the cursor result applies, as there is nothing new there; most programming languages have been able to do this for years, if not from inception.
Of course the basic process can be done all the way back to MongoDB 2.2, just by applying the array creation and $unwind in a different way. But no-one should be running any MongoDB under 2.8 at this point in time, and official support even for 3.0 is fast running out.
Output
For visualization, the output of all demonstrated pipelines here has the following form before the last "transform" is done:
/* 1 */
{
"_id" : null,
"data" : [
{
"k" : "gender",
"v" : [
{
"name" : "Male",
"total" : 3.0
},
{
"name" : "Female",
"total" : 2.0
}
]
},
{
"k" : "education",
"v" : [
{
"name" : "M.C.A",
"total" : 1.0
},
{
"name" : "B.E",
"total" : 3.0
},
{
"name" : "B.Com",
"total" : 1.0
}
]
},
{
"k" : "caste",
"v" : [
{
"name" : "Lingayath",
"total" : 3.0
},
{
"name" : "Vokkaliga",
"total" : 2.0
}
]
}
]
}
And then either by the $replaceRoot or the cursor transform as demonstrated the result becomes:
/* 1 */
{
"gender" : [
{
"name" : "Male",
"total" : 3.0
},
{
"name" : "Female",
"total" : 2.0
}
],
"education" : [
{
"name" : "M.C.A",
"total" : 1.0
},
{
"name" : "B.E",
"total" : 3.0
},
{
"name" : "B.Com",
"total" : 1.0
}
],
"caste" : [
{
"name" : "Lingayath",
"total" : 3.0
},
{
"name" : "Vokkaliga",
"total" : 2.0
}
]
}
So whilst we can put some new and fancy operators into the aggregation pipeline where we have those available, the most common use case is in these "end of pipeline transforms" in which case we may as well simply do the same transformation on each document in the cursor results returned instead.

Using the aggregation framework to compare array element overlap

I have a collection with documents structured like below:
{
carrier: "abc",
flightNumber: 123,
dates: [
ISODate("2015-01-01T00:00:00Z"),
ISODate("2015-01-02T00:00:00Z"),
ISODate("2015-01-03T00:00:00Z")
]
}
I would like to search the collection to see if there are any documents with the same carrier and flightNumber that also have dates in the dates array that overlap. For example:
{
carrier: "abc",
flightNumber: 123,
dates: [
ISODate("2015-01-01T00:00:00Z"),
ISODate("2015-01-02T00:00:00Z"),
ISODate("2015-01-03T00:00:00Z")
]
},
{
carrier: "abc",
flightNumber: 123,
dates: [
ISODate("2015-01-03T00:00:00Z"),
ISODate("2015-01-04T00:00:00Z"),
ISODate("2015-01-05T00:00:00Z")
]
}
If the above records were present in the collection I would like to return them because they both have carrier: abc, flightNumber: 123 and they also have the date ISODate("2015-01-03T00:00:00Z") in the dates array. If this date were not present in the second document then neither should be returned.
Typically I would do this by grouping and counting like below:
db.flights.aggregate([
{
$group: {
_id: { carrier: "$carrier", flightNumber: "$flightNumber" },
uniqueIds: { $addToSet: "$_id" },
count: { $sum: 1 }
}
},
{
$match: {
count: { $gt: 1 }
}
}
])
But I'm not sure how I could modify this to look for array overlap. Can anyone suggest how to achieve this?
Use $unwind on the array when you want to group on the values it contains:
db.flights.aggregate([
{ "$unwind": "$dates" },
{ "$group": {
"_id": { "carrier": "$carrier", "flightNumber": "$flightNumber", "date": "$dates" },
"count": { "$sum": 1 },
"_ids": { "$addToSet": "$_id" }
}},
{ "$match": { "count": { "$gt": 1 } } },
{ "$unwind": "$_ids" },
{ "$group": { "_id": "$_ids" } }
])
That does in fact tell you which documents contain the "overlap", because the "same dates" along with the other grouping key values you are concerned about have a "count" greater than one, which indicates the overlap.
Anything after the $match is really just for "presentation" as there is no point reporting the same _id value for multiple overlaps if you just want to see the overlaps. In fact if you want to see them together it would probably be best to leave the "grouped set" alone.
Now you could add a $lookup to that if retrieving the actual documents was important to you:
db.flights.aggregate([
{ "$unwind": "$dates" },
{ "$group": {
"_id": { "carrier": "$carrier", "flightNumber": "$flightNumber", "date": "$dates" },
"count": { "$sum": 1 },
"_ids": { "$addToSet": "$_id" }
}},
{ "$match": { "count": { "$gt": 1 } } },
{ "$unwind": "$_ids" },
{ "$group": { "_id": "$_ids" } },
{ "$lookup": {
"from": "flights",
"localField": "_id",
"foreignField": "_id",
"as": "_ids"
}},
{ "$unwind": "$_ids" },
{ "$replaceRoot": {
"newRoot": "$_ids"
}}
])
And even do a $replaceRoot or $project to make it return the whole document. Or you could have even done $addToSet with $$ROOT if it was not a problem for size.
But the overall point is covered in the first three pipeline stages, or mostly in just the "first". If you want to work with arrays "across documents", then the primary operator is still $unwind.
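The unwind-then-group idea can be sketched in memory as follows. This is plain JavaScript rather than a pipeline; the numeric _id values, the date strings standing in for ISODate values, and the function name are assumptions for illustration.

```javascript
// Hypothetical sample data; date strings stand in for ISODate values.
const flights = [
  { _id: 1, carrier: "abc", flightNumber: 123,
    dates: ["2015-01-01", "2015-01-02", "2015-01-03"] },
  { _id: 2, carrier: "abc", flightNumber: 123,
    dates: ["2015-01-03", "2015-01-04", "2015-01-05"] },
  { _id: 3, carrier: "xyz", flightNumber: 9, dates: ["2015-01-03"] }
];

function overlappingIds(docs) {
  // Mirrors $unwind + $group: one bucket per carrier/flightNumber/date.
  const buckets = new Map();
  for (const doc of docs) {
    for (const date of doc.dates) {
      const key = JSON.stringify([doc.carrier, doc.flightNumber, date]);
      if (!buckets.has(key)) buckets.set(key, new Set());
      buckets.get(key).add(doc._id);
    }
  }
  // Mirrors the $match and the final $group: keep ids from shared buckets.
  // Note this counts distinct documents per bucket, whereas the pipeline's
  // $sum counts unwound entries.
  const ids = new Set();
  for (const members of buckets.values()) {
    if (members.size > 1) members.forEach(id => ids.add(id));
  }
  return [...ids];
}

console.log(overlappingIds(flights)); // [ 1, 2 ]
```

Document 3 shares the date but not the carrier and flightNumber, so it lands in its own bucket and is not reported.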
Alternately for a more "reporting" like format:
db.flights.aggregate([
{ "$addFields": { "copy": "$$ROOT" } },
{ "$unwind": "$dates" },
{ "$group": {
"_id": {
"carrier": "$carrier",
"flightNumber": "$flightNumber",
"dates": "$dates"
},
"count": { "$sum": 1 },
"_docs": { "$addToSet": "$copy" }
}},
{ "$match": { "count": { "$gt": 1 } } },
{ "$group": {
"_id": {
"carrier": "$_id.carrier",
"flightNumber": "$_id.flightNumber",
},
"overlaps": {
"$push": {
"date": "$_id.dates",
"_docs": "$_docs"
}
}
}}
])
Which would report the overlapped dates within each group and tell you which documents contained the overlap:
{
"_id" : {
"carrier" : "abc",
"flightNumber" : 123.0
},
"overlaps" : [
{
"date" : ISODate("2015-01-03T00:00:00.000Z"),
"_docs" : [
{
"_id" : ObjectId("5977f9187dcd6a5f6a9b4b97"),
"carrier" : "abc",
"flightNumber" : 123.0,
"dates" : [
ISODate("2015-01-03T00:00:00.000Z"),
ISODate("2015-01-04T00:00:00.000Z"),
ISODate("2015-01-05T00:00:00.000Z")
]
},
{
"_id" : ObjectId("5977f9187dcd6a5f6a9b4b96"),
"carrier" : "abc",
"flightNumber" : 123.0,
"dates" : [
ISODate("2015-01-01T00:00:00.000Z"),
ISODate("2015-01-02T00:00:00.000Z"),
ISODate("2015-01-03T00:00:00.000Z")
]
}
]
}
]
}

Remove duplicate documents based on field

I've seen a number of solutions on this, however they are all for Mongo v2 and are not suitable for V3.
My document looks like this:
{
"_id" : ObjectId("582c98667d81e1d0270cb3e9"),
"asin" : "B01MTKPJT1",
"url" : "https://www.amazon.com/Trump-President-Presidential-Victory-T-Shirt/dp/B01MTKPJT1%3FSubscriptionId%3DAKIAIVCW62S7NTZ2U2AQ%26tag%3Dselfbalancingscooters-21%26linkCode%3Dxm2%26camp%3D2025%26creative%3D165953%26creativeASIN%3DB01MTKPJT1",
"image" : "http://ecx.images-amazon.com/images/I/41RvN8ud6UL.jpg",
"salesRank" : NumberInt(442137),
"title" : "Trump Wins 45th President Presidential Victory T-Shirt",
"brand" : "\"Getting Political On Me\"",
"favourite" : false,
"createdAt" : ISODate("2016-11-16T17:33:26.763+0000"),
"updatedAt" : ISODate("2016-11-16T17:33:26.763+0000")
}
and my collection contains around 500k documents. I want to remove all duplicate documents (except for one) where the ASIN is the same.
How can I achieve this?
This is something we can actually do using the aggregation framework and without client side processing.
MongoDB 3.4
db.collection.aggregate(
[
{ "$sort": { "_id": 1 } },
{ "$group": {
"_id": "$asin",
"doc": { "$first": "$$ROOT" }
}},
{ "$replaceRoot": { "newRoot": "$doc" } },
{ "$out": "collection" }
]
)
MongoDB version <= 3.2:
db.collection.aggregate(
[
{ "$sort": { "_id": 1 } },
{ "$group": {
"_id": "$asin",
"doc": { "$first": "$$ROOT" }
}},
{ "$project": {
"_id": "$doc._id",
"asin": "$doc.asin",
"url": "$doc.url",
"image": "$doc.image",
"salesRank": "$doc.salesRank",
"title": "$doc.title",
"brand": "$doc.brand",
"favourite": "$doc.favourite",
"createdAt": "$doc.createdAt",
"updatedAt": "$doc.updatedAt"
}},
{ "$out": "collection" }
]
)
Alternatively, use a loop. It will take time, but it will do the work:
db.amazon_sales.find({}, {asin:1}).sort({_id:1}).forEach(function(doc){
db.amazon_sales.remove({_id:{$gt:doc._id}, asin:doc.asin});
})
Then add this index so that duplicates cannot be inserted again:
db.amazon_sales.createIndex( { "asin": 1 }, { unique: true } )
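Both approaches keep the document with the lowest _id for each ASIN. A minimal in-memory sketch of that rule, using hypothetical sample data with numeric _id values and a made-up helper name:

```javascript
// Hypothetical sample data; numeric ids stand in for ObjectId values.
const products = [
  { _id: 3, asin: "B01", title: "third copy" },
  { _id: 1, asin: "B01", title: "first copy" },
  { _id: 2, asin: "B02", title: "other" }
];

function dedupeBy(docs, field) {
  // Mirrors the $sort on _id (copying first so the input is untouched).
  const sorted = [...docs].sort((a, b) => a._id - b._id);
  // Mirrors $group with $first: the first document seen per key wins.
  const first = new Map();
  for (const doc of sorted) {
    if (!first.has(doc[field])) first.set(doc[field], doc);
  }
  return [...first.values()];
}

const unique = dedupeBy(products, "asin");
console.log(unique.map(d => d._id)); // [ 1, 2 ]
```

On a real collection it is worth testing either approach against a copy first, since both $out and remove() discard data.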

Mongodb aggregate collection

I'm learning aggregate in mongodb. I'm working with the collection:
{
"body" : ""
,
"email" : "oJJFLCfA#qqlBNdpY.com",
"author" : "Linnie Weigel"
},
{
"body" : ""
,
"email" : "ptHfegMX#WgxhlEeV.com",
"author" : "Dinah Sauve"
},
{
"body" : ""
,
"email" : "kfPmikkG#SBxfJifD.com",
"author" : "Zachary Langlais"
}
{
"body" : ""
,
"email" : "gqEMQEYg#iiBqZCez.com",
"author" : "Jesusa Rickenbacker"
}
]
I am trying to obtain the number of bodies (comments) for each author. But when I run the aggregate command with $sum, the result is 1 (because the structure has only one element). How can I do this operation? I tried with $addToSet, but I don't know how to get at each element of the collection and perform the operation on it.
In order to count the comments by each author you want to $group by that author and $sum the occurrences. Basically just a "$sum: 1" operation. But it seems like you have "comments" as an array here based on your own comments and the closing bracket on your partial data listing. For that you need to process with $unwind first:
db.collection.aggregate([
{ "$unwind": "$comments" },
{ "$group": {
"_id": "$comments.author",
"count": { "$sum": 1 }
}}
])
That will obtain the total of all author comments by author for the entire collection. If you were just after getting the total comments by author per document ( or what looks like a blog post model ) then you use the document _id as part of the group statement:
db.collection.aggregate([
{ "$unwind": "$comments" },
{ "$group": {
"_id": {
"_id": "$_id",
"author": "$comments.author"
},
"count": { "$sum": 1 }
}}
])
And if you then want the summary of author counts per document with just a single document returned with all the authors in an array, then use $addToSet from here, with another $group pipeline stage:
db.collection.aggregate([
{ "$unwind": "$comments" },
{ "$group": {
"_id": {
"_id": "$_id",
"author": "$comments.author"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id._id",
"comments": {
"$addToSet": {
"author": "$_id.author",
"count": "$count"
}
}
}}
])
But really, the author values are already unique and "sets" are not ordered in any way, so you might change this using $push after first introducing a $sort to have the list ordered by the number of comments made:
db.collection.aggregate([
{ "$unwind": "$comments" },
{ "$group": {
"_id": {
"_id": "$_id",
"author": "$comments.author"
},
"count": { "$sum": 1 }
}},
{ "$sort": { "_id._id": 1, "count": -1 } },
{ "$group": {
"_id": "$_id._id",
"comments": {
"$push": {
"author": "$_id.author",
"count": "$count"
}
}
}}
])
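The per-document summary can be sketched in plain JavaScript to show the expected shape. The sample post, its "p1" id and the helper name are assumptions for illustration; the "comments" array layout is the one assumed in the answer above.

```javascript
// Hypothetical blog post with an embedded "comments" array.
const posts = [
  { _id: "p1", comments: [
    { author: "Linnie Weigel" },
    { author: "Dinah Sauve" },
    { author: "Linnie Weigel" }
  ] }
];

// Count comments per author within one post, most active author first.
function authorCounts(post) {
  const counts = new Map();
  for (const c of post.comments) {
    counts.set(c.author, (counts.get(c.author) || 0) + 1);
  }
  return [...counts]
    .map(([author, count]) => ({ author, count }))
    .sort((a, b) => b.count - a.count);
}

// Mirrors the final $group: one summary document per post.
const summary = posts.map(p => ({ _id: p._id, comments: authorCounts(p) }));
console.log(JSON.stringify(summary));
```

Sorting before building the array is the in-memory counterpart of placing $sort ahead of the $group that uses $push.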

Aggregate Query in Mongodb returns specific field

Document Sample:
{
"_id" : ObjectId("53329dfgg43771e49538b4567"),
"u" : {
"_id" : ObjectId("532a435gs4c771edb168c1bd7"),
"n" : "Salman khan",
"e" : "salman#gmail.com"
},
"ps" : 0,
"os" : 1,
"rs" : 0,
"cd" : 1395685800,
"ud" : 0
}
Query:
db.collectiontmp.aggregate([
{$match: {os:1}},
{$project : { name:{$toUpper:"$u.e"} , _id:0 } },
{$group: { _id: "$u._id",total: {$sum:1} }},
{$sort: {total: -1}}, { $limit: 10 }
]);
I need following things from the above query:
Group by u._id
Returns total number of records and email from the record, as shown below:
{
"result":
[
{
"email": "",
"total": ""
},
{
"email": "",
"total": ""
}
],
"ok":
1
}
The first thing you are doing wrong here is not understanding how $project is intended to work. Pipeline stages such as $project and $group will only output the fields that are "explicitly" identified. So only the fields you say to output will be available to the following pipeline stages.
Specifically here you "project" only part of the "u" field in your document and you therefore removed the other data from being available. The only present field here now is "name", which is the one you "projected".
Perhaps it was really your intention to do something like this:
db.collectiontmp.aggregate([
{ "$group": {
"_id": {
"_id": "$u._id",
"email": { "$toUpper": "$u.e" }
},
"total": { "$sum": 1 },
}},
{ "$project": {
"_id": 0,
"email": "$_id.email",
"total": 1
}},
{ "$sort": { "total": -1 } },
{ "$limit": 10 }
])
Or even:
db.collectiontmp.aggregate([
{ "$group": {
"_id": "$u._id",
"email": { "$first": { "$toUpper": "$u.e" } },
"total": { "$sum": 1 },
}},
{ "$project": {
"_id": 0,
"email": 1,
"total": 1
}},
{ "$sort": { "total": -1 } },
{ "$limit": 10 }
])
That gives you the sort of output you are looking for.
Remember that as this is a "pipeline", then only the "output" from a prior stage is available to the "next" stage. There is no "global" concept of the document as this is not a declarative statement such as in SQL, but a "pipeline".
So think of the Unix pipe "|" operator, or look that up if it is unfamiliar. Then your thinking will fall into place.
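The effect of projecting too early can be reproduced in memory. In this sketch each array plays the role of one stage's output; the sample documents and email addresses are made up for illustration.

```javascript
// Hypothetical input documents.
const docs = [
  { u: { _id: "a", e: "salman@example.com" }, os: 1 },
  { u: { _id: "a", e: "salman@example.com" }, os: 1 },
  { u: { _id: "b", e: "other@example.com" }, os: 1 }
];

// Stage 1 mirrors the question's $project: only "name" survives.
const projected = docs.map(d => ({ name: d.u.e.toUpperCase() }));

// Stage 2 mirrors the $group on "$u._id". The "u" field no longer exists
// in the stage input, so every document falls into the same null group.
const groups = new Map();
for (const d of projected) {
  const key = d.u ? d.u._id : null;
  groups.set(key, (groups.get(key) || 0) + 1);
}

console.log([...groups]); // [ [ null, 3 ] ]
```

Grouping first and reshaping afterwards, as in the corrected pipelines, keeps "u._id" available at the point where it is needed.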