How to project additional data from an aggregate result with MongoDB? - mongodb

I'm learning MongoDB and trying to group a collection.
What I'm looking for is to group by year, get the max of the "average note" field, and display the primary name of the document that holds that maximum.
For example, if I have:
Name | Average | Year
Name_01 | 7.56 | 1995
Name_02 | 8.96 | 1995
Name_03 | 3.25 | 2005
Name_04 | 4.36 | 2005
Name_05 | 7.52 | 2020
I need:
Name | Average | Year
Name_02 | 8.96 | 1995
Name_05 | 7.52 | 2020
Name_04 | 4.36 | 2005
I already did the group and the max. Here is my code:
db.foobar.aggregate([
{
$group: { _id: '$year_published', max: { $max: '$statistics.average' }}
},
{
$project: { _id: 1, max: 1 }
},
{
$sort: { max: -1 }
}
])
Which gives me this kind of result:
{
"result" : [
{
"_id" : 1999,
"max" : 8.0343000000000000
},
{
"_id" : 1985,
"max" : 7.8833299999999999
}
// And so on...
]
}
But I'd also like to project the primary name of the document related to the "max" to get something like:
{
"result" : [
{
"_id" : 1999,
"max" : 8.0343000000000000,
"name": "Foo Bar"
},
{
"_id" : 1985,
"max" : 7.8833299999999999,
"name": "Lorem Ipsum"
}
// And so on...
]
}
NB: The next part of the question adds complexity regarding the name (because of my document structure). It's not my main concern right now, but I've added it to the question to describe the whole problem.
The primary name is a bit tricky to get. For each document, I've got an array of objects like that:
{
"names" : [
{
"type" : "primary",
"value" : "Foo bar"
},
{
"type" : "alternate",
"value" : "Foo foo"
},
{
"type" : "alternate",
"value" : "Bar bar"
}
]
}
And what I'm trying to get is the name with "primary" type (i. e. "Foo bar" in my example).
Here is the structure of my documents:
{
"_id" : ObjectId("56338f2bdc99b8ec22a43328"),
"names" : [
{
"type" : "primary",
"value" : "Foo bar"
},
{
"type" : "alternate",
"value" : "Barr foo"
}
],
"year_published" : 1992,
"statistics" : {
"average" : 6.6057699999999997
}
}
I think I'm not far off, but I don't know how to do it. Could you please help me?

If you want the "paired" values out of a particular document with a "max" value then $max is not for you. Instead, what you need to do is $sort the data first and then use the $first operator.
db.foobar.aggregate([
{ "$sort": { "year_published": 1, "statistics.average": -1 } },
{ "$group": {
"_id": "$year_published",
"max": { "$first": "$statistics.average" },
"name": {
"$first": {
"$setDifference": [
{ "$map": {
"input": "$names",
"as": "name",
"in": {
"$cond": {
"if": { "$eq": [ "$$name.type", "primary" ] },
"then": "$$name.value",
"else": false
}
}
}},
[false]
]
}
}
}},
{ "$unwind": "$name" }
])
The $first and $last operators act on "grouping boundary" data. This means they return data from the property that occurs either at the beginning or end of the documents sharing the value that was used for the grouping _id.
That is why you "sort" first, so the documents are in order for selection.
By contrast $max and $min just pick the "max/min" value from anywhere in the documents in the sample. That's fine when it's all you want, but if you want "related" fields, then you must sort first.
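To make the "sort first, then take the first of each group" idea concrete, here is a minimal plain-JavaScript sketch that runs without MongoDB. The flattened sample documents and the helper name `topPerYear` are assumptions made up for illustration; it only mimics what $sort followed by $group with $first does.

```javascript
// Hypothetical sample data, flattened for brevity.
const docs = [
  { name: "Name_01", average: 7.56, year: 1995 },
  { name: "Name_02", average: 8.96, year: 1995 },
  { name: "Name_03", average: 3.25, year: 2005 },
  { name: "Name_04", average: 4.36, year: 2005 },
  { name: "Name_05", average: 7.52, year: 2020 },
];

// Mimics { $sort: { year: 1, average: -1 } } followed by a $group
// that takes $first of both "average" and "name" for each year.
function topPerYear(input) {
  const sorted = [...input].sort(
    (a, b) => a.year - b.year || b.average - a.average
  );
  const groups = new Map();
  for (const doc of sorted) {
    // $first: only the first document seen for each key is kept.
    if (!groups.has(doc.year)) {
      groups.set(doc.year, { _id: doc.year, max: doc.average, name: doc.name });
    }
  }
  return [...groups.values()];
}

const result = topPerYear(docs);
console.log(result);
```

Because both fields are taken from the same first document, the name always belongs to the document that holds the maximum, which is exactly what $max alone cannot guarantee.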
That's the basics of it. The other part for dealing with filtering your array is most optimally done with the $map and $setDifference combination as shown. The $map allows testing of a condition via $cond on each array element "in-line", and returns the value depending on which is true or false. The result is still of course an array of equal length.
The $setDifference essentially filters out anything returned as false, so the only thing left should be the "primary". Still an array, which is why $unwind is still used, though it's only a single element array.
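As a rough illustration of the $map / $setDifference / $unwind combination, the same three steps can be sketched in plain JavaScript on the "names" array from the question. This is only an analogy to show the data flow, not how the server implements these operators.

```javascript
// One document's "names" array, as in the question.
const names = [
  { type: "primary", value: "Foo bar" },
  { type: "alternate", value: "Foo foo" },
  { type: "alternate", value: "Bar bar" },
];

// $map with $cond: same length as the input, non-matches become false.
const mapped = names.map((n) => (n.type === "primary" ? n.value : false));

// $setDifference against [false]: set semantics, so duplicates collapse
// and every false is dropped.
const filtered = [...new Set(mapped)].filter((v) => v !== false);

// $unwind on a single-element array leaves just that element.
const primaryName = filtered[0];

console.log(mapped, filtered, primaryName);
```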
From MongoDB 3.2 onwards this can be done a little better with $filter and $arrayElemAt. Here's a glimpse:
db.foobar.aggregate([
{ "$sort": { "year_published": 1, "statistics.average": -1 } },
{ "$group": {
"_id": "$year_published",
"max": { "$first": "$statistics.average" },
"name": {
"$first": {
"$arrayElemAt": [
{ "$filter": {
"input": "$names",
"as": "name",
"cond": {
"$eq": [ "$$name.type", "primary" ]
}
}},
0
]
}
}
}}
])
But none of this changes the basic rules of "sort first" and then just pick up the values from the grouping boundary.

Please try the code below.
You need to select the "name" field in the $group pipeline stage with the help of $first.
$first selects the value that results from applying an expression to the first document in a group of documents that share the same group-by key.
db.foobar.aggregate([
{ "$unwind" : "$names" },
{ $match :
{ "names.type" : "primary"}
} ,
{ $sort :
{ "year_published" : 1, "statistics.average" : -1 }
},
{ $group :
{ _id : "$year_published" ,
name : {
$first : "$names.value"
},
max: { $max: "$statistics.average" }
}
},
{ $sort:
{ max: -1 }
}
]).pretty();
This will give you the required result :
{
"result" : [
{
"_id" : 1999,
"max" : 8.0343000000000000,
"name": "Foo Bar"
},
{
"_id" : 1985,
"max" : 7.8833299999999999,
"name": "Lorem Ipsum"
}
// And so on...
]
}

Related

How to find percentage of grouping containing a specific word

I am trying to calculate the percentage of listings in a MongoDB collection that contain a specific word, grouped by a field of the documents.
I have managed to get the count of listings containing the word per group, but not the percentage relative to the total count of each group's listings.
My collection looks like this:
{
"_id" : "103456",
"metadata" : {
"type" : "Bike",
"brand" : "Siamoto",
"model" : "Siamoto vespa '01 - € 550 EUR (Negotiable)"
}
},
{
"_id" : "103457",
"metadata" : {
"type" : "Bike",
"brand" : "BMW",
"model" : "BMW ADFR '06 - € 5680 EUR"
}
}
I want to project the percentage of ads per metadata.brand that contain the word "Negotiable" in metadata.model.
I have used for the count something like:
db.advertisements.aggregate([
{ $match: { $text: { $search: "Negotiable" } } },
{ $group: { _id: "$metadata.brand", Count: { $sum: 1} } }
])
and it worked but I can't find a workaround for the percentage. Thanks to all
For what you are trying to do, using a $text search or even a $regex is the wrong approach. All these can do is return the "matching" documents only from within the collection.
Using Aggregate to Count String Matches
Whilst not as flexible as a regular expression (sadly there is no aggregation operator equivalent at this time, though one is planned for a future release; see SERVER-11947), the better option is to use $indexOfCP to test for the occurrence of the "string" and then count those matches against the "total counts" from each grouping:
db.advertisements.aggregate([
{ "$group": {
"_id": "$metadata.brand",
"totalCount": { "$sum": 1 },
"matchedCount": {
"$sum": {
"$cond": [{ "$ne": [{ "$indexOfCP": [ "$metadata.model", "Negotiable" ] }, -1 ] }, 1, 0]
}
}
}},
{ "$addFields": {
"percentage": {
"$cond": {
"if": { "$ne": [ "$matchedCount", 0 ] },
"then": {
"$multiply": [
{ "$divide": [ "$matchedCount", "$totalCount" ] },
100
]
},
"else": 0
}
}
}},
{ "$sort": { "percentage": -1 } }
])
And the results:
{ "_id" : "Siamoto", "totalCount" : 1, "matchedCount" : 1, "percentage" : 100 }
{ "_id" : "BMW", "totalCount" : 1, "matchedCount" : 0, "percentage" : 0 }
Note that the $group is used to accumulate both the total documents found within the "brand" and those where the string was matched. The $cond operator used here is a "ternary" or if/then/else expression which evaluates a boolean expression and then returns one value where true or another where false. In this case the condition is that $indexOfCP does NOT return -1, the value which means "not found".
The "percentage" is actually computed in a separate stage, where we use $addFields to add the "additional field". The operation is basically a $divide over the two accumulated values from the previous stage. The $cond simply returns 0 straight away when nothing matched (the divisor, totalCount, is always at least 1 within a group, so this is a shortcut rather than a strict "divide by 0" guard), and the $multiply just moves the decimal places into something that looks more like a "percentage". The basic premise is that calculations which require "totals" to be accumulated first will always be done in a "later stage".
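The two-stage shape (accumulate totals first, derive the percentage afterwards) can be sketched in plain JavaScript without a database. The third listing below is invented purely so that one brand has a partial match; everything else mirrors the question's document shape.

```javascript
// Sample listings in the question's shape; the third one is hypothetical.
const ads = [
  { metadata: { brand: "Siamoto", model: "Siamoto vespa '01 - € 550 EUR (Negotiable)" } },
  { metadata: { brand: "BMW", model: "BMW ADFR '06 - € 5680 EUR" } },
  { metadata: { brand: "BMW", model: "BMW R1200 '10 - € 7000 EUR (Negotiable)" } },
];

// Stage 1 (like $group): accumulate both counters per brand in one pass.
const groups = new Map();
for (const { metadata } of ads) {
  const g = groups.get(metadata.brand) ??
    { _id: metadata.brand, totalCount: 0, matchedCount: 0 };
  g.totalCount += 1;
  // indexOf returns -1 when the substring is absent, like $indexOfCP.
  g.matchedCount += metadata.model.indexOf("Negotiable") !== -1 ? 1 : 0;
  groups.set(metadata.brand, g);
}

// Stage 2 (like $addFields + $sort): derive the percentage from finished totals.
const report = [...groups.values()]
  .map((g) => ({ ...g, percentage: (g.matchedCount / g.totalCount) * 100 }))
  .sort((a, b) => b.percentage - a.percentage);

console.log(report);
```

The percentage cannot be computed inside the first loop because the total for a brand is not known until every listing has been seen, which is the same reason the pipeline needs a later stage.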
MongoDB 4.2 (proposed) Preview
FYI, using the current "unfinalized" syntax for $regexFind, proposed for MongoDB 4.2 (and yet to be finalized, if it is included in that release at all), this would be something like:
db.advertisements.aggregate([
{ "$group": {
"_id": "$metadata.brand",
"totalCount": { "$sum": 1 },
"matchedCount": {
"$sum": {
"$cond": {
"if": {
"$ne": [
{ "$regexFind": {
"input": "$metadata.model",
"regex": /Negotiable/i
}},
null
]
},
"then": 1,
"else": 0
}
}
}
}},
{ "$addFields": {
"percentage": {
"$cond": {
"if": { "$ne": [ "$matchedCount", 0 ] },
"then": {
"$multiply": [
{ "$divide": [ "$matchedCount", "$totalCount" ] },
100
]
},
"else": 0
}
}
}},
{ "$sort": { "percentage": -1 } }
])
Again noting strongly that the "current" implementation may be subject to change by the time it is released. This is how it works on the current 4.1.9-17-g0a856820ba development release.
Using MapReduce
An alternate approach where either your MongoDB version does not support $indexOfCP OR you need more flexibility in how you "match the string" is to use mapReduce for the aggregation instead:
db.advertisements.mapReduce(
function() {
emit(this.metadata.brand, {
totalCount: 1,
matchedCount: (/Negotiable/i.test(this.metadata.model)) ? 1 : 0
});
},
function(key,values) {
var obj = { totalCount: 0, matchedCount: 0 };
values.forEach(value => {
obj.totalCount += value.totalCount;
obj.matchedCount += value.matchedCount;
});
return obj;
},
{
"out": { "inline": 1 },
"finalize": function(key,value) {
value.percentage = (value.matchedCount != 0)
? (value.matchedCount / value.totalCount) * 100
: 0;
return value;
}
}
)
This has a similar result, but in a very "mapReduce" specific way:
{
"_id" : "BMW",
"value" : {
"totalCount" : 1,
"matchedCount" : 0,
"percentage" : 0
}
},
{
"_id" : "Siamoto",
"value" : {
"totalCount" : 1,
"matchedCount" : 1,
"percentage" : 100
}
}
The logic is pretty much the same. We "emit" using the "key" for the "brand" and then use another ternary to determine whether to count a "match" or not, in this case with a regular expression test() operation, which even uses "case insensitive" matching as an example.
The "reducer" part simply accumulates the values that were emitted, and the finalize function is where the "percentage" is returned by the same division and multiplication process.
The main difference between the two other than basic capabilities is that the mapReduce cannot do "further things" beyond the accumulation and basic manipulation in the finalize. The "sorting" demonstrated in the aggregation pipeline cannot be done with mapReduce without outputting to a separate collection and doing a separate find() and sort() on those documents contained.
Either way works, and it just depends on your needs and the capabilities of what you have available. Of course any aggregate() approach will be much faster than using the JavaScript evaluation of mapReduce. So you probably want aggregate() as your preference where possible.

Database sort on numeric values that are actually strings

I have the query below. In it, field_a is a String property and field_b is an array of type Number. I want an array of the unique combinations of field_a and field_b. Here field_a contains numeric values, but in string format, so I want to apply a natural sort in the aggregation pipeline. As far as I can tell, $natural can only be used with a query such as db.collection.find().sort( { $natural: 1 } ).
So how can I use a natural sort in MongoDB, or do I need to depend on JS functions or on lodash/underscore.js?
db.collection.aggregate([
{"$group": { "_id": { field_a: "$field_a", field_b: "$field_b" } } },
{ $project: { a: "$_id" } },
{"$group": {"_id": 'a', "res": {"$addToSet": "$_id" }}},
{"$unwind": "$res"},
{"$sort": { "res": 1}},
{"$group": { "_id": null, "res": {"$push": "$res" }}}
])
In short, this is what you want to do here:
db.collection.aggregate([
{ "$group": {
"_id": {
"field_a": {
"$concat": [
{ "$substrCP": [
"0000000000",
0,
{ "$subtract": [ 10, { "$strLenCP": "$field_a" } ] }
]},
"$field_a"
]
},
"field_b": "$field_b"
}
}},
{ "$sort": { "_id": 1 } }
])
Explanation
As a basic concept, the problem you have is that "strings" sort in a way that does not match how numeric sorts work.
As a brief example, these documents use a string value:
{ "_id" : ObjectId("5928276f84c3559bc2fd458b"), "a" : "5" }
{ "_id" : ObjectId("5928277484c3559bc2fd458c"), "a" : "50" }
{ "_id" : ObjectId("5928277e84c3559bc2fd458d"), "a" : "60" }
{ "_id" : ObjectId("5928278284c3559bc2fd458e"), "a" : "6" }
If you try to sort these, then the lexical order applies:
> db.list.find().sort({ "a": 1 })
{ "_id" : ObjectId("5928276f84c3559bc2fd458b"), "a" : "5" }
{ "_id" : ObjectId("5928277484c3559bc2fd458c"), "a" : "50" }
{ "_id" : ObjectId("5928278284c3559bc2fd458e"), "a" : "6" }
{ "_id" : ObjectId("5928277e84c3559bc2fd458d"), "a" : "60" }
As strings, this makes sense, since "50" begins with "5" and is therefore less than "6".
Since the aggregation framework cannot "cast" these as numeric values, then your only option is to present the "strings" in a way in which they will order lexically in the same way as they would with numeric values.
In brief terms we "zero pad" them, which is essentially making the values fixed length strings which are prefixed or "padded" with 0 which makes the "strings" appear to be ordered like they would be numerically:
db.list.aggregate([
{ "$project": {
"a": {
"$concat": [
{ "$substrCP": [
"0000000000",
0,
{ "$subtract": [ 10, { "$strLenCP": "$a" } ] }
]},
"$a"
]
}
}},
{ "$sort": { "a": 1 } }
])
And this will produce a list in order like:
{ "_id" : ObjectId("5928276f84c3559bc2fd458b"), "a" : "0000000005" }
{ "_id" : ObjectId("5928278284c3559bc2fd458e"), "a" : "0000000006" }
{ "_id" : ObjectId("5928277484c3559bc2fd458c"), "a" : "0000000050" }
{ "_id" : ObjectId("5928277e84c3559bc2fd458d"), "a" : "0000000060" }
The basic premise here is that you take a "template" string, which is in this example a string of 0 which is 10 characters long. Then we look at the length of the field data to transform using $strLenCP and $subtract that length from 10 which is the length of the template string used here.
The difference in length is fed to the $substrCP operator as the number of characters to take from the template. This output is then fed to $concat in order to make a "string" which like the template is 10 characters long, but starts with zeros and ends in the initial numeric string.
In your actual usage, this will end up as part of a composite key. Yet since the transformed key is first in the key order, then simply sorting by _id considers this and primarily sorts by values in the first key and then the second.
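The zero-padding trick is easy to check outside the database. The following plain-JavaScript sketch applies the same template-and-substring idea to the four sample values; it assumes, like the pipeline above, that no value is longer than the 10-character template.

```javascript
// Numeric strings as in the example documents.
const values = ["5", "50", "60", "6"];

// A plain string sort is lexical: "50" sorts before "6".
const lexical = [...values].sort();

// Same idea as $substrCP + $concat: take (template length - value length)
// zeros from the template and prepend them to the value.
const template = "0000000000";
const pad = (s) => template.substring(0, template.length - s.length) + s;

const padded = values.map(pad).sort();

console.log(lexical);
console.log(padded);
```

In JavaScript itself `String.prototype.padStart` would do the same job; the manual version is used here only because it mirrors the aggregation operators step by step.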

Find MongoDB object using value of another field

I recently found difficulty in finding an object stored in a document with its key in another field of that same document.
{
list : {
"red" : "397n8",
"blue" : "j3847",
"pink" : "8nc48",
"green" : "983c4"
},
result : [
{ "id" : "397n8", value : "anger" },
{ "id" : "j3847", value : "water" },
{ "id" : "8nc48", value : "girl" },
{ "id" : "983c4", value : "evil" }
]
}
I am trying to get the value for 'blue' which has an id of 'j3847' and a value of 'water'.
db.docs.find( { result.id : list.blue }, { result.value : 1 } );
# list.blue would return water
# list.pink would return girl
# list.green would return evil
I tried many things and even found a great article on how to update a value using another value in the same document, "Update MongoDB field using value of another field", which I based my attempts on, with no success.
How can I find a MongoDB object using value of another field ?
You can do it with the $filter operator within mongo aggregation. It returns an array with only those elements that match the condition:
db.docs.aggregate([
{
$project: {
result: {
$filter: {
input: "$result",
as:"item",
cond: { $eq: ["$list.blue", "$$item.id"]}
}
}
}
}
])
Output for this query looks like this:
{
"_id" : ObjectId("569415c8299692ceedf86573"),
"result" : [ { "id" : "j3847", "value" : "water" } ]
}
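For intuition, the $filter stage behaves much like `Array.prototype.filter` applied to the document's own array, with the comparison value read from another field of the same document. The sketch below is plain JavaScript; `lookup` is a hypothetical helper name and the ids are written as strings.

```javascript
// The document from the question, with the ids written as strings.
const doc = {
  list: { red: "397n8", blue: "j3847", pink: "8nc48", green: "983c4" },
  result: [
    { id: "397n8", value: "anger" },
    { id: "j3847", value: "water" },
    { id: "8nc48", value: "girl" },
    { id: "983c4", value: "evil" },
  ],
};

// Rough equivalent of $filter with cond { $eq: ["$list.blue", "$$item.id"] },
// generalised so the colour key is a parameter.
const lookup = (d, colour) =>
  d.result.filter((item) => item.id === d.list[colour]);

console.log(lookup(doc, "blue"));
```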
One way is to use the $where operator, though I would not recommend it: it forces a full collection scan regardless of what other conditions could otherwise use an index, and it invokes the JavaScript interpreter over each document, which is going to be considerably slower than native code.
That being said, use the alternative .aggregate() method for this type of comparison instead, which is definitely the better option:
db.docs.aggregate([
{ "$unwind": "$result" },
{
"$project": {
"result": 1,
"same": { "$eq": [ "$list.blue", "$result.id" ] }
}
},
{ "$match": { "same": true } },
{
"$project": {
"_id": 0,
"value": "$result.value"
}
}
])
When the $unwind operator is applied to the result array field, it generates a new record for each element of that array. It basically flattens the data, and the subsequent $project step then inspects each member of the array to check whether the two fields are the same.
Sample Output
{
"result" : [
{
"value" : "water"
}
],
"ok" : 1
}
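The flattening that $unwind performs, and the project/match/project steps that follow it, can be sketched in plain JavaScript as below. The single shortened sample document is an assumption for illustration, and the sketch only imitates the data flow of the pipeline.

```javascript
// A shortened, hypothetical version of the question's document.
const docs = [
  {
    _id: 1,
    list: { blue: "j3847" },
    result: [
      { id: "397n8", value: "anger" },
      { id: "j3847", value: "water" },
    ],
  },
];

// Like $unwind: one output record per element of the "result" array.
const unwound = docs.flatMap((d) =>
  d.result.map((r) => ({ ...d, result: r }))
);

// Like $project + $match + $project: compute the flag, keep the
// matching records, then reshape to just the value.
const values = unwound
  .map((d) => ({ result: d.result, same: d.list.blue === d.result.id }))
  .filter((d) => d.same)
  .map((d) => ({ value: d.result.value }));

console.log(unwound.length, values);
```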
Another alternative is to use the $map and $setDifference operators in a single $project step. This avoids $unwind, which can be costly on very large collections and can in some cases run into the 16MB BSON limit:
db.docs.aggregate([
{
"$project": {
"result": {
"$setDifference": [
{
"$map": {
"input": "$result",
"as": "r",
"in": {
"$cond": [
{ "$eq": [ "$$r.id", "$list.blue" ] },
"$$r",
false
]
}
}
},
[false]
]
}
}
}
])
Sample Output
{
"result" : [
{
"_id" : ObjectId("569412e5a51a6656962af1c7"),
"result" : [
{
"id" : "j3847",
"value" : "water"
}
]
}
],
"ok" : 1
}

MongoDb aggregate and group by two fields depending on values

I want to aggregate over a collection where a type is given. If the type is foo I want to group by the field author, if the type is bar I want to group by user.
All this should happen in one query.
Example Data:
{
"_id": 1,
"author": {
"someField": "abc",
},
"type": "foo"
}
{
"_id": 2,
"author": {
"someField": "abc",
},
"type": "foo"
}
{
"_id": 3,
"user": {
"someField": "abc",
},
"type": "bar"
}
This user field is only existing if the type is bar.
So basically something like that... tried to express it with an $or.
function () {
var results = db.vote.aggregate( [
{ $or: [ {
{ $match : { type : "foo" } },
{ $group : { _id : "$author", sumAuthor : {$sum : 1} } } },
{ { $match : { type : "bar" } },
{ $group : { _id : "$user", sumUser : {$sum : 1} } }
} ] }
] );
return results;
}
Does someone have a good solution for this?
I think it can be done by
db.c.aggregate([{
$group : {
_id : {
$cond : [{
$eq : [ "$type", "foo"]
}, "$author", "$user"]
},
sum : {
$sum : 1
}
}
}]);
The solution below can be cleaned up a bit...
For "bar" (note: for "foo", you have to change a bit)
db.vote.aggregate(
{
$project:{
user:{ $ifNull: ["$user", "notbar"]},
type:1
}
},
{
$group:{
_id:{_id:"$user.someField"},
sumUser:{$sum:1}
}
}
)
Also note: in your final answer, anything that is not of type "bar" will have an _id of null.
What you want here is the $cond operator, which is a ternary operator returning a specific value where the condition is true or false.
db.vote.aggregate([
{ "$group": {
"_id": null,
"sumUser": {
"$sum": {
"$cond": [ { "$eq": [ "$type", "bar" ] }, 1, 0 ]
}
},
"sumAuthor": {
"$sum": {
"$cond": [ { "$eq": [ "$type", "foo" ] }, 1, 0 ]
}
}
}}
])
This basically tests the "type" of the current document and decides whether to pass either 1 or 0 to the $sum operation.
This also avoids errant grouping should the "user" and "author" fields contain the same values as they do in your example. The end result is a single document with the count of both types.
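The "pass 1 or 0 to the sum depending on the type" idea looks like this in plain JavaScript. The sketch assumes, as in the question's sample data, that documents of type "foo" carry an author and documents of type "bar" carry a user.

```javascript
// Sample votes from the question (only the fields that matter here).
const votes = [
  { _id: 1, author: { someField: "abc" }, type: "foo" },
  { _id: 2, author: { someField: "abc" }, type: "foo" },
  { _id: 3, user: { someField: "abc" }, type: "bar" },
];

// Each document contributes 1 or 0 to each counter, like $sum over $cond.
const totals = votes.reduce(
  (acc, v) => ({
    sumUser: acc.sumUser + (v.type === "bar" ? 1 : 0),
    sumAuthor: acc.sumAuthor + (v.type === "foo" ? 1 : 0),
  }),
  { sumUser: 0, sumAuthor: 0 }
);

console.log(totals);
```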

MongoDB aggregation on another aggregation suggestions

I have a JSON file imported into MongoDB. Every line in it is a user, and there is a product field containing the product's name. I know the price of every product (there are only a few),
but this information is not stored in the JSON.
I was able to write an aggregation that retrieves the number of times a user bought a product, but I would like a query that directly returns the amount of money each user spent.
This is my query:
db.source.aggregate([
{"$match": {
"$and":[
{"productName":{
"$in":[
"product1","product2","product3",
"product4","product5","product6"
]
}},
{ "$or": [
{"appID" : "nameOfAPP"},
{"appID": "NameOfAPP2"}
]}
]
}},
{ "$group": {
"_id": {
"id_user": "$id_user",
"productName": "$productName"
},
"count": { "$sum": 1}
}},
{ "$sort" : { "count": -1 } }
])
so the output is like that:
{ "_id" : { "id_user" : "user1", "productID" : "product2" }, "count" : 433 }
{ "_id" : { "id_user" : "user2", "productID" : "product1" }, "count" : 370 }
{ "_id" : { "id_user" : "user1", "productID" : "product3" }, "count" : 300 }
{ "_id" : { "id_user" : "user3", "productID" : "product6" }, "count" : 250 }
{ "_id" : { "id_user" : "user2", "productID" : "product5" }, "count" : 140 }
{ "_id" : { "id_user" : "user3", "productID" : "product4" }, "count" : 90 }
I know that product 1 costs 20$, product 2 costs 40$, product 3 costs 55$, product 4 costs -90$, product 5 costs 110$, product 6 costs 200$.
I would like to have an output like that:
{ "_id" : { "id_user" : "user1"}, "money_spent" : 600$ }
{ "_id" : { "id_user" : "user2"}, "money_spent" : 400$ }
etc
Can you help me get that result? I am new to MongoDB.
Thanks in advance.
If you cannot go back to the original source data and are only working with an import, then do this:
db.source.aggregate([
{"$match": {
"$and":[
{ "productName": {
"$in":[
"product1","product2","product3",
"product4","product5","product6"
]
}},
{ "$or": [
{"appID" : "nameOfAPP"},
{"appID": "NameOfAPP2"}
]}
]
}},
{ "$group": {
"_id": "$id_user",
"cost": {
"$sum": {
"$cond": [
{ "$eq": ["$productName", "product1"] },
20,
{ "$cond": [
{ "$eq": ["$productName", "product2"] },
40,
{ "$cond": [
{ "$eq": [ "$productName", "product3"] },
55,
{ "$cond": [
{ "$eq": [ "$productName", "product4" ] },
-90,
{ "$cond": [
{ "$eq": [ "$productName", "product5" ] },
110,
200
]}
]}
]}
]}
]
}
}
}}
])
The $cond operator evaluates whether your field value matches the condition and returns the appropriate value, which is then simply passed to $sum to get your result.
$cond provides a "ternary" operator or "if .. then .. else" that is used to evaluate the condition you provide in the first argument. You construct this to "cascade" where the condition evaluates to false in order to move on to the next condition to evaluate, otherwise return the value that matches your condition.
In this way your "known" values are applied as you aggregate for your expected total.
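The cascade of nested $cond expressions corresponds to a chained ternary. The following plain-JavaScript sketch uses the prices stated in the question (including the negative value given for product 4); the purchase records and the helper name `priceOf` are hypothetical.

```javascript
// Chained ternary mirroring the nested $cond cascade.
const priceOf = (name) =>
  name === "product1" ? 20
  : name === "product2" ? 40
  : name === "product3" ? 55
  : name === "product4" ? -90
  : name === "product5" ? 110
  : 200; // final "else": anything left is treated as product6

// Hypothetical purchase records, one entry per purchase.
const purchases = [
  { id_user: "user1", productName: "product2" },
  { id_user: "user1", productName: "product3" },
  { id_user: "user2", productName: "product1" },
  { id_user: "user2", productName: "product5" },
  { id_user: "user2", productName: "product1" },
];

// Like $group by id_user with $sum over the cascaded condition.
const spent = {};
for (const p of purchases) {
  spent[p.id_user] = (spent[p.id_user] ?? 0) + priceOf(p.productName);
}

console.log(spent);
```

Note that the last branch is a catch-all, which is only safe because the earlier $match already restricts the documents to the six known products.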