Mongodb aggregate collection - mongodb

I'm learning aggregate in mongodb. I'm working with the collection:
{
"body" : ""
,
"email" : "oJJFLCfA#qqlBNdpY.com",
"author" : "Linnie Weigel"
},
{
"body" : ""
,
"email" : "ptHfegMX#WgxhlEeV.com",
"author" : "Dinah Sauve"
},
{
"body" : ""
,
"email" : "kfPmikkG#SBxfJifD.com",
"author" : "Zachary Langlais"
}
{
"body" : ""
,
"email" : "gqEMQEYg#iiBqZCez.com",
"author" : "Jesusa Rickenbacker"
}
]
I try to obtain the number of body of each author. But when I execute the command sum of aggregate mongodb, the result is 1(because the structure has only one element) . How can I do this operation?. I tried with $addToSet. But I don't know how to obtain each element of collection and to do the operation.

In order to count the comments by each author you want to $group by that author and $sum the occurrences. Basically just a "$sum: 1" operation. But it seems like you have "comments" as an array here based on your own comments and the closing bracket on your partial data listing. For that you need to process with $unwind first:
db.collection.aggregate([
{ "$unwind": "$comments" },
{ "$group": {
"_id": "$comments.author",
"count": { "$sum": 1 }
}}
])
That will obtain the total of all author comments by author for the entire collection. If you were just after getting the total comments by author per document ( or what looks like a blog post model ) then you use the document _id as part of the group statement:
db.collection.aggregate([
{ "$unwind": "$comments" },
{ "$group": {
"_id": {
"_id": "$_id"
"author": "$comments.author"
},
"count": { "$sum": 1 }
}}
])
And if you then want the summary of author counts per document with just a single document returned with all the authors in an array, then use $addToSet from here, with another $group pipeline stage:
db.collection.aggregate([
{ "$unwind": "$comments" },
{ "$group": {
"_id": {
"_id": "$_id"
"author": "$comments.author"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id._id",
"comments": {
"$addToSet": {
"author": "$_id.author",
"count": "$count"
}
}
}}
])
But really, the author values are already unique and "sets" are not ordered in any way, so you might change this using $push after first introducing a $sort to have the list ordered by the number of comments made:
db.collection.aggregate([
{ "$unwind": "$comments" },
{ "$group": {
"_id": {
"_id": "$_id"
"author": "$comments.author"
},
"count": { "$sum": 1 }
}},
{ "$sort": { "_id._id": 1, "count": -1 } },
{ "$group": {
"_id": "$_id._id",
"comments": {
"$push": {
"author": "$_id.author",
"count": "$count"
}
}
}}
])

Related

Group sums of an attribute by the values of another array attribute

I have a collection "tagsCount" that looks like that:
{
"_id" : ObjectId("59e3a46a48507851d411ad78"),
"tags" : [ "Marketing" ],
"cpt" : 14354
},
{
"_id" : ObjectId("59e3a46a48507851d411ad79"),
"tags" : [
"chatbot",
"Content marketing",
"Intelligence artificielle",
"Marketing digital",
"Personnalisation"
],
"cpt" : 9037
}
Of course there are many more lines.
I want to get the sum of "cpt" grouped by the values of "tags".
I have come up with that:
db.tagsCount.aggregate([
{ "$project": { "tags":1 }},
{ "$unwind": "$tags"},
{ "$group": {
"_id" : "$tags",
cpt : "$cpt" ,
"count": { "$sum": "$cpt" }
}}
])
But that doesn't do the trick, I have the list of all different tags and the count have a value a 0.
Is it possible to do what I want?
The problem is that your aggregation pipeline starts with $project which selects only tags to the next stages and that's why you're executing $group on documents without cpt. Here's my working example:
db.tagsCount.aggregate([
{ "$unwind": "$tags"},
{ "$group": {
"_id": "$tags",
"count": { "$sum": "$cpt" }
}},
{ "$project": { "tag": "$_id", "_id": 0, "count": 1 }}
])

Using the aggregation framework to compare array element overlap

I have a collections with documents structured like below:
{
carrier: "abc",
flightNumber: 123,
dates: [
ISODate("2015-01-01T00:00:00Z"),
ISODate("2015-01-02T00:00:00Z"),
ISODate("2015-01-03T00:00:00Z")
]
}
I would like to search the collection to see if there are any documents with the same carrier and flightNumber that also have dates in the dates array that over lap. For example:
{
carrier: "abc",
flightNumber: 123,
dates: [
ISODate("2015-01-01T00:00:00Z"),
ISODate("2015-01-02T00:00:00Z"),
ISODate("2015-01-03T00:00:00Z")
]
},
{
carrier: "abc",
flightNumber: 123,
dates: [
ISODate("2015-01-03T00:00:00Z"),
ISODate("2015-01-04T00:00:00Z"),
ISODate("2015-01-05T00:00:00Z")
]
}
If the above records were present in the collection I would like to return them because they both have carrier: abc, flightNumber: 123 and they also have the date ISODate("2015-01-03T00:00:00Z") in the dates array. If this date were not present in the second document then neither should be returned.
Typically I would do this by grouping and counting like below:
db.flights.aggregate([
{
$group: {
_id: { carrier: "$carrier", flightNumber: "$flightNumber" },
uniqueIds: { $addToSet: "$_id" },
count: { $sum: 1 }
}
},
{
$match: {
count: { $gt: 1 }
}
}
])
But I'm not sure how I could modify this to look for array overlap. Can anyone suggest how to achieve this?
You $unwind the array if you want to look at the contents as "grouped" within them:
db.flights.aggregate([
{ "$unwind": "$dates" },
{ "$group": {
"_id": { "carrier": "$carrier", "flightnumber": "$flightnumber", "date": "$dates" },
"count": { "$sum": 1 },
"_ids": { "$addToSet": "$_id" }
}},
{ "$match": { "count": { "$gt": 1 } } },
{ "$unwind": "$_ids" },
{ "$group": { "_id": "$_ids" } }
])
That does in fact tell you which documents where the "overlap" resides, because the "same dates" along with the other same grouping key values that you are concerned about have a "count" which occurs more than once. Indicating the overlap.
Anything after the $match is really just for "presentation" as there is no point reporting the same _id value for multiple overlaps if you just want to see the overlaps. In fact if you want to see them together it would probably be best to leave the "grouped set" alone.
Now you could add a $lookup to that if retrieving the actual documents was important to you:
db.flights.aggregate([
{ "$unwind": "$dates" },
{ "$group": {
"_id": { "carrier": "$carrier", "flightnumber": "$flightnumber", "date": "$dates" },
"count": { "$sum": 1 },
"_ids": { "$addToSet": "$_id" }
}},
{ "$match": { "count": { "$gt": 1 } } },
{ "$unwind": "$_ids" },
{ "$group": { "_id": "$_ids" } },
}},
{ "$lookup": {
"from": "flights",
"localField": "_id",
"foreignField": "_id",
"as": "_ids"
}},
{ "$unwind": "$_ids" },
{ "$replaceRoot": {
"newRoot": "$_ids"
}}
])
And even do a $replaceRoot or $project to make it return the whole document. Or you could have even done $addToSet with $$ROOT if it was not a problem for size.
But the overall point is covered in the first three pipeline stages, or mostly in just the "first". If you want to work with arrays "across documents", then the primary operator is still $unwind.
Alternately for a more "reporting" like format:
db.flights.aggregate([
{ "$addFields": { "copy": "$$ROOT" } },
{ "$unwind": "$dates" },
{ "$group": {
"_id": {
"carrier": "$carrier",
"flightNumber": "$flightNumber",
"dates": "$dates"
},
"count": { "$sum": 1 },
"_docs": { "$addToSet": "$copy" }
}},
{ "$match": { "count": { "$gt": 1 } } },
{ "$group": {
"_id": {
"carrier": "$_id.carrier",
"flightNumber": "$_id.flightNumber",
},
"overlaps": {
"$push": {
"date": "$_id.dates",
"_docs": "$_docs"
}
}
}}
])
Which would report the overlapped dates within each group and tell you which documents contained the overlap:
{
"_id" : {
"carrier" : "abc",
"flightNumber" : 123.0
},
"overlaps" : [
{
"date" : ISODate("2015-01-03T00:00:00.000Z"),
"_docs" : [
{
"_id" : ObjectId("5977f9187dcd6a5f6a9b4b97"),
"carrier" : "abc",
"flightNumber" : 123.0,
"dates" : [
ISODate("2015-01-03T00:00:00.000Z"),
ISODate("2015-01-04T00:00:00.000Z"),
ISODate("2015-01-05T00:00:00.000Z")
]
},
{
"_id" : ObjectId("5977f9187dcd6a5f6a9b4b96"),
"carrier" : "abc",
"flightNumber" : 123.0,
"dates" : [
ISODate("2015-01-01T00:00:00.000Z"),
ISODate("2015-01-02T00:00:00.000Z"),
ISODate("2015-01-03T00:00:00.000Z")
]
}
]
}
]
}

Aggregate Query in Mongodb returns specific field

Document Sample:
{
"_id" : ObjectId("53329dfgg43771e49538b4567"),
"u" : {
"_id" : ObjectId("532a435gs4c771edb168c1bd7"),
"n" : "Salman khan",
"e" : "salman#gmail.com"
},
"ps" : 0,
"os" : 1,
"rs" : 0,
"cd" : 1395685800,
"ud" : 0
}
Query:
db.collectiontmp.aggregate([
{$match: {os:1}},
{$project : { name:{$toUpper:"$u.e"} , _id:0 } },
{$group: { _id: "$u._id",total: {$sum:1} }},
{$sort: {total: -1}}, { $limit: 10 }
]);
I need following things from the above query:
Group by u._id
Returns total number of records and email from the record, as shown below:
{
"result":
[
{
"email": "",
"total": ""
},
{
"email": "",
"total": ""
}
],
"ok":
1
}
The first thing you are doing wrong here is not understanding how $project is intended to work. Pipeline stages such as $project and $group will only output the fields that are "explicitly" identified. So only the fields you say to output will be available to the following pipeline stages.
Specifically here you "project" only part of the "u" field in your document and you therefore removed the other data from being available. The only present field here now is "name", which is the one you "projected".
Perhaps it was really your intention to do something like this:
db.collectiontmp.aggregate([
{ "$group": {
"_id": {
"_id": "$u._id",
"email": { "$toUpper": "$u.e" }
},
"total": { "$sum": 1 },
}},
{ "$project": {
"_id": 0,
"email": "$_id.email",
"total": 1
}},
{ "$sort": { "total": -1 } },
{ "$limit": 10 }
])
Or even:
db.collectiontmp.aggregate([
{ "$group": {
"_id": "$u._id",
"email": { "$first": { "$toUpper": "$u.e" } }
"total": { "$sum": 1 },
}},
{ "$project": {
"_id": 0,
"email": 1,
"total": 1
}},
{ "$sort": { "total": -1 } },
{ "$limit": 10 }
])
That gives you the sort of output you are looking for.
Remember that as this is a "pipeline", then only the "output" from a prior stage is available to the "next" stage. There is no "global" concept of the document as this is not a declarative statement such as in SQL, but a "pipeline".
So think Unix pipe "|" command, or otherwise look that up. Then your thinking will fall into place.

Mongodb aggregation, finding within an array of values

I have a schemea that creates documents using the following structure:
{
"_id" : "2014-07-16:52TEST",
"date" : ISODate("2014-07-16T23:52:59.811Z"),
"name" : "TEST"
"values" : [
[
1405471921000,
0.737121
],
[
1405471922000,
0.737142
],
[
1405471923000,
0.737142
],
[
1405471924000,
0.737142
]
]
}
In the values, the first index is a timestamp. What I'm trying to do is query a specific timestamp to find the closest value ($gte).
I've tried the following aggregate query:
[
{ "$match": {
"values": {
"$elemMatch": { "0": {"$gte": 1405471923000} }
},
"name" : 'TEST'
}},
{ "$project" : {
"name" : 1,
"values" : 1
}},
{ "$unwind": "$values" },
{ "$match": { "values.0": { "$gte": 1405471923000 } } },
{ "$limit" : 1 },
{ "$sort": { "values.0": -1 } },
{ "$group": {
"_id": "$name",
"values": { "$push": "$values" },
}}
]
This seems to work, but it doesn't pull the closest value. It seems to pull anything greater or equal to and the sort doesn't seem to get applied, so it will pull a timestamp that is far in the future.
Any suggestions would be great!
Thank you
There are a couple of things wrong with the approach here even though it is a fair effort. You are right that you need to $sort here, but the problem is that you cannot "sort" on an inner element with an array. In order to get a value that can be sorted you must $unwind the array first as it otherwise will not sort on an array position.
You also certainly do not want $limit in the pipeline. You might be testing this against a single document, but "limit" will actually act on the entire set of documents in the pipeline. So if more than one document was matching your condition then they would be thrown away.
The key thing you want to do here is use $first in your $group stage, which is applied once you have sorted to get the "closest" element that you want.
db.collection.aggregate([
// Documents that have an array element matching the condition
{ "$match": {
"values": { "$elemMatch": { "0": {"$gte": 1405471923000 } } }
}},
// Unwind the top level array
{ "$unwind": "$values" },
// Filter just the elements that match the condition
{ "$match": { "values.0": { "$gte": 1405471923000 } } },
// Take a copy of the inner array
{ "$project": {
"date": 1,
"name": 1,
"values": 1,
"valCopy": "$values"
}},
// Unwind the inner array copy
{ "$unwind": "$valCopy" },
// Filter the inner elements
{ "$match": { "valCopy": { "$gte": 1405471923000 } }},
// Sort on the now "timestamp" values ascending for nearest
{ "$sort": { "valCopy": 1 } },
// Take the "first" values
{ "$group": {
"_id": "$_id",
"date": { "$first": "$date" },
"name": { "$first": "$name" },
"values": { "$first": "$values" },
}},
// Optionally push back to array to match the original structure
{ "$group": {
"_id": "$_id",
"date": { "$first": "$date" },
"name": { "$first": "$name" },
"values": { "$push": "$values" },
}}
])
And this produces your document with just the "nearest" timestamp value matching the original document form:
{
"_id" : "2014-07-16:52TEST",
"date" : ISODate("2014-07-16T23:52:59.811Z"),
"name" : "TEST",
"values" : [
[
1405471923000,
0.737142
]
]
}

Filtering a list of votes where more than x matches are found

I have the following vote data in a large collection:
{
"user_id" : ObjectId("53ac7bce4eaf6de4d5601c1a"),
"article_id" : ObjectId("53ab27504eaf6de4d5601be5"),
"score" : 5
},
{
"user_id" : ObjectId("53ac7bce4eaf6de4d5601c1b"),
"article_id" : ObjectId("53ab27504eaf6de4d5601be5"),
"score" : 3
},
{
"user_id" : ObjectId("53ac7bce4eaf6de4d5601c1c"),
"article_id" : ObjectId("53ab27504eaf6de4d5601be5"),
"score" : 3
},
...
I'm looking to filter this collection where more than 3 votes have been obtained for a single article (as above) and output as-is (excluding any vote entries on articles < 3 total votes).
Any help much appreciated. This collection can be huge so efficiency would be ideal.
Normally not something you do in a single operation, but you can do this if those really are your only fields and there are not too many matching documents.
db.collection.aggregate([
{ "$group": {
"_id": "$article_id",
"docs": {
"$push": {
"user_id": "$user_id",
"article_id": "$article_id",
"score": "$score"
}
},
"votes": { "$sum": 1 }
}},
{ "$match": { "votes": { "$gt": 3 } } },
{ "$unwind": "$docs" },
{ "$project": {
"user_id": "$docs.user_id",
"article_id": "$docs.article_id",
"score": "$docs.score"
}}
])
You can clean that up a little with MongoDB 2.6 and greater which provides a system variable in the pipeline for $$ROOT:
db.collection.aggregate([
{ "$group": {
"_id": "$article_id",
"docs": {
"$push": "$$ROOT"
},
"votes": { "$sum": 1 }
}},
{ "$match": { "votes": { "$gt": 3 } } },
{ "$unwind": "$docs" },
{ "$project": {
"user_id": "$docs.user_id",
"article_id": "$docs.article_id",
"score": "$docs.score"
}}
])
Otherwise you can accept that you are doing this in a few steps and process the list of "article_id" values returned with a "count" greater than three:
var ids = db.collection.aggregate([
{ "$group": {
"_id": "$article_id",
"votes": { "$sum": 1 }
}},
{ "$match": { "votes": { "$gt": 3 } } },
]).toArray().map(function(x){ return x._id });
db.collection.find({ "article_id": { "$in": ids } })
If that was a shell operation then you would use the "results" key from the array of results that was returned by default in versions earlier to 2.6.