how to count number of keys in subdocument using aggregation pipeline? - mongodb

Suppose I have a document like this:
{
"_id" : ObjectId("57eb386e37b4842ff5f386c9"),
"lesson_id" : ObjectId("57e27cd190e6993e393f5c74"),
"student_id" : ObjectId("57d3c3f590e6995fe8de7932"),
"answer_records" : {
"1" : {
"answer" : [
"A"
]
},
"3" : {
"answer" : [
"C"
]
}
}
}
I want to count the number of answer records in the collection. This document contributes two answer records, "1" and "3". So my question is how to achieve this using the aggregation pipeline.

In your case, it is far easier to just use JS.
In the mongo shell:
var json=db.sof.findOne().answer_records;
Object.keys(json).length;
Prints 2 for the number of answer records in the said document.

For MongoDB 3.4.4 and newer, use the $objectToArray operator within an aggregation pipeline to convert the subdocument to an array. The returned array contains one element for each field/value pair in the original document, and each element is a document with two fields, k and v.
Once you have the array, use an $addFields pipeline stage to create a field that holds the count; the count itself is derived with the $size operator.
All of this can be done in a single stage by nesting the expressions as follows:
db.collection.aggregate([
{
"$addFields": {
"answers_count": {
"$size": {
"$objectToArray": "$answer_records"
}
}
}
}
])
Sample Output
{
"_id" : ObjectId("57eb386e37b4842ff5f386c9"),
"lesson_id" : ObjectId("57e27cd190e6993e393f5c74"),
"student_id" : ObjectId("57d3c3f590e6995fe8de7932"),
"answer_records" : {
"1" : {
"answer" : [
"A"
]
},
"3" : {
"answer" : [
"C"
]
}
},
"answers_count": 2
}
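To make the k/v shape concrete, here is a minimal plain-JavaScript sketch that can be run without a server. The helper objectToArray is a hypothetical stand-in written for this illustration; it only approximates what the $objectToArray operator produces for a flat subdocument, and the array length plays the role of $size.

```javascript
// Hypothetical helper (not a MongoDB API): mimics the shape that
// $objectToArray produces, one { k, v } element per field.
function objectToArray(obj) {
  return Object.entries(obj).map(([k, v]) => ({ k, v }));
}

// Trimmed copy of the sample document from the question.
const doc = {
  answer_records: {
    "1": { answer: ["A"] },
    "3": { answer: ["C"] }
  }
};

const pairs = objectToArray(doc.answer_records);
const answersCount = pairs.length; // the role $size plays in the pipeline

console.log(pairs.map(p => p.k)); // [ '1', '3' ]
console.log(answersCount);        // 2
```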
For MongoDB server versions that do not support the above operators, you would need to change your schema design in order to run efficient queries with the aggregation framework. As it stands, you would need to preprocess the documents on either the client or the server with JavaScript, so you would not be able to take full advantage of MongoDB's native infrastructure built for faster querying.
The ideal design follows:
{
"_id" : ObjectId("57eb386e37b4842ff5f386c9"),
"lesson_id" : ObjectId("57e27cd190e6993e393f5c74"),
"student_id" : ObjectId("57d3c3f590e6995fe8de7932"),
"answer_records" : [
{ "id": "1", "answer": "A" },
{ "id": "3", "answer": "C" }
]
}
You can then simply apply a $project stage that uses the $size operator to return the length of the answer_records array per document:
db.collection.aggregate([
{
"$project": {
"lesson_id": 1,
"student_id": 1,
"count": { "$size": "$answer_records" }
}
}
])
If you want the total number of answer records for the whole collection, add a $group stage that accumulates the total across all documents using an _id of null:
db.collection.aggregate([
{
"$project": {
"count": { "$size": "$answer_records" }
}
},
{
"$group": {
"_id": null,
"total_answers": { "$sum": "$count" }
}
}
])
Otherwise, with the current design, your only option is MapReduce, which is much slower:
db.collection.mapReduce(
function() {
emit(this._id, Object.keys(this.answer_records).length);
},
function() { },
{ "out": { "inline": 1 } }
)
Sample Output:
{
"results" : [
{
"_id" : ObjectId("57eb386e37b4842ff5f386c9"),
"value" : 2
}
],
....
}
To get the total for all the documents in the collection, run this mapReduce operation:
db.collection.mapReduce(
function() {
emit(null, Object.keys(this.answer_records).length);
},
function(key, values) {
return Array.sum(values);
},
{ "out": { "inline": 1 } }
)
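One caveat with the map functions above: Object.keys throws a TypeError when answer_records is missing, so a single document without that field would break the job. The following plain-JavaScript sketch shows one way to guard against that and sum the counts. The function name countAnswers and the sample documents are made up for illustration.

```javascript
// Hypothetical guard: count keys only when answer_records is an object.
function countAnswers(doc) {
  const records = doc.answer_records;
  return records && typeof records === "object"
    ? Object.keys(records).length
    : 0;
}

// Made-up sample data: the last document has no answer_records at all.
const docs = [
  { answer_records: { "1": { answer: ["A"] }, "3": { answer: ["C"] } } },
  { answer_records: { "2": { answer: ["B"] } } },
  {}
];

// Same idea as emitting per-document counts and summing them in reduce.
const total = docs.map(countAnswers).reduce((sum, n) => sum + n, 0);

console.log(total); // 3
```

The same guard could be placed inside the map function before calling emit.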

Related

Mongodb splitting aggregation result

I'm currently trying to split an aggregation result into two different arrays using only MongoDB.
My main goal is to create two subsets of users with the same distribution regarding the number of interactions they have made. For this I'm currently making this request:
db.getCollection('Interaction').aggregate([
{ $group : { _id : "$userId", count: { $sum: 1 }}},
{ $sort : { count : -1 }},
{ $group : { _id :{$mod : [_rand() * 2, 2]}, ids : { $push: "$_id"}}}
])
My main issue is that the _rand() function is called only once during the aggregation, so all my results end up in a single array.
Also, a random distribution is not ideal. Is there a way to use the index of each result?
Edit 1 :
After @dnickless's answer I still have an issue with the distribution in the groupBy part. Ideally I would like to do something like this:
db.getCollection('Interaction').aggregate([
{ $group : { _id : "$userId", count: { $sum: 1 }}},
{ $sort : { count : -1 }},
{ $bucket: {
groupBy: { $mod: [ { $indexOfArray : ??? }, 2 ] },
boundaries: [ 0, 1 ],
default: 2,
output: {
"users": { $push: "$_id"}
}
}
}
],
{ allowDiskUse: true })
That would split the even-index and odd-index results into two separate arrays. But I would like to apply $indexOfArray to the current aggregation result.
To give you more context here is my Interaction object model :
{ "_id" : ObjectId("5af01..."), "name" : "WATCH", "date" : ISODate("2018-05-07T09:32:53.219Z") }
Without the bucket part I have this result :
{ "_id" : "5b1e7f...", "count" : 43.0 }
{ "_id" : "5b1e75...", "count" : 41.0 }
{ "_id" : "5b1e7a...", "count" : 40.0 }
...
I would like my answer to look like this :
{
{ "_id" : 0, "users" : [ "5b1e7f...", "5b1e7a...", ... ] }, // even index results
{ "_id" : 1, "users" : [ "5b1e75...", ... ] } // odd index results
}
My end goal is to split my users in 2 groups with evenly distributed numbers of interactions.
Edit 2 :
Finally found a solution to resolve my problem :
db.getCollection('Interaction').aggregate([
{ $group : { _id : "$userId", count: { $sum: 1 }}},
{ $sort : { count : -1 }},
{ $group : { _id : "whatever" , user : { $push : { _id : "$_id" , count : "$count"}}}},
{ $unwind : { path : "$user" , "includeArrayIndex" : "rank"}},
{ $bucket: {
groupBy: { $mod: [ "$rank" , 2 ] },
boundaries: [ 0, 1 ],
default: 2,
output: {
"users": { $push: "$user._id"}
}
}
}
],
{ allowDiskUse: true })
It is probably not the most optimized solution, but it does the job :)
If you have any advice on how to improve it, I'm still interested.
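For readers who want to check the even/odd split outside the database, here is a small plain-JavaScript sketch of the same idea: sort by count, then assign each user to a group by the parity of its rank. The user ids and counts are made-up sample values.

```javascript
// Made-up per-user interaction counts (what the first $group would return).
const counts = [
  { _id: "u3", count: 40 },
  { _id: "u1", count: 43 },
  { _id: "u5", count: 7 },
  { _id: "u2", count: 41 },
  { _id: "u4", count: 12 }
];

// Equivalent of { $sort: { count: -1 } }, done on a copy.
const sorted = [...counts].sort((a, b) => b.count - a.count);

// Equivalent of grouping on rank % 2: even ranks in group 0, odd in group 1.
const groups = [[], []];
sorted.forEach((user, rank) => groups[rank % 2].push(user._id));

console.log(groups[0]); // [ 'u1', 'u3', 'u5' ]
console.log(groups[1]); // [ 'u2', 'u4' ]
```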
I don't fully understand what exactly you are trying to achieve here without seeing some sample input and output. However, have you tried using $bucketAuto? Something like this:
db.getCollection('Interaction').aggregate([
{ $group : { _id : "$userId", count: { $sum: 1 }}},
{ $bucketAuto : {
groupBy : "$count",
buckets : 2, // number of buckets goes here
output : {
ids : { $push : "$_id" }
}
}
}])
If you want to go more sophisticated regarding the distribution aspect you could perhaps try something like this which would throw all even counts into one pot and all odd ones into another:
$bucket: {
groupBy: { $mod: [ "$count", 2 ] },
boundaries: [ 0, 1 ],
default: 2,
output: {
"docs": { $push: "$$ROOT" }
}
}
Depending on the type of your userId field you could perhaps come up with a more "random" distribution.
Lastly, I am not sure what exactly you mean by
"Is there a way to use the index of each result ?"
Perhaps something like $size, $arrayElemAt and/or $indexOfArray...?
Alternatively, you could perhaps try to $slice the sorted array into two equally sized parts (using $size and $divide by 2), then $reverseArray one of them and $zip both arrays up again, which should result in something like shuffling a deck of playing cards. After that, you would need to flatten the nested array into a single one again (using $reduce and $concatArrays or so) and then slice the array into two parts again, which should be what you are looking for, if I am not too tired by now to think through the statistical parts here.
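The slice, reverse, zip and flatten steps described above can be sketched in plain JavaScript as follows. This is one literal reading of the description with made-up counts, not a tested pipeline, and how evenly the two resulting groups are balanced depends on the data.

```javascript
// Made-up interaction counts, already sorted in descending order.
const sortedCounts = [43, 41, 40, 12, 7, 3];
const half = Math.floor(sortedCounts.length / 2);

// $slice into two halves, then $reverseArray the second one.
const top = sortedCounts.slice(0, half);           // [43, 41, 40]
const bottom = sortedCounts.slice(half).reverse(); // [3, 7, 12]

// $zip pairs the elements up: [[43, 3], [41, 7], [40, 12]].
const zipped = top.map((value, i) => [value, bottom[i]]);

// $reduce with $concatArrays flattens the pairs again.
const shuffled = zipped.reduce((acc, pair) => acc.concat(pair), []);

// Final split into two parts.
const groupA = shuffled.slice(0, half);
const groupB = shuffled.slice(half);

console.log(shuffled); // [ 43, 3, 41, 7, 40, 12 ]
console.log(groupA);   // [ 43, 3, 41 ]
console.log(groupB);   // [ 7, 40, 12 ]
```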

MongoDB - Operations with nested fields

I have twitter data that looks like this:
db.users.findOne()
{
"_id" : ObjectId("578ffa8e7eb9513f4f55a935"),
"user_name" : "koteras",
"retweet_count" : 0,
"tweet_followers_count" : 461,
"source" : "Twitter for iPhone",
"coordinates" : null,
"tweet_mentioned_count" : 1,
"tweet_ID" : "755891629932675072",
"tweet_text" : "RT #ochocinco: I beat them all for 10 straight hours #FIFA16KING",
"user" : {
"CreatedAt" : ISODate("2011-12-27T09:04:01Z"),
"FavouritesCount" : 5223,
"FollowersCount" : 461,
"FriendsCount" : 619,
"UserId" : 447818090,
"Location" : "501"
}
}
For example, I want to find the number of users that have "FollowersCount" greater than "FavouritesCount". How can I do that?
The $where operator is specifically designed for this.
db.users.find( { $where: function() { return (this.user.FollowersCount > this.user.FavouritesCount) } } );
But keep in mind that this runs single-threaded JavaScript code and will be slower.
Another option is to use an aggregation pipeline that projects the difference and then applies a $match on it:
db.users.aggregate([
{$project: {
diff: {$subtract: ["$user.FollowersCount", "$user.FavouritesCount"]},
// project remaining fields here
}
},
{$match: {diff: {$gt: 0}}}
])
In my experience I have found the second one to be much faster than the first.
To get the number of users that have "FollowersCount" greater than "FavouritesCount", you could use the aggregation framework which has some operators that you can apply.
One approach uses the $cmp comparison operator within a $project stage, followed by a $match stage that filters documents on the $cmp value ($cmp returns 1 when the first value is greater than the second). You can then get the final user count by applying a $group stage that aggregates the filtered documents:
db.users.aggregate([
{
"$project": {
"hasMoreFollowersThanFavs": {
"$cmp": [ "$user.FollowersCount", "$user.FavouritesCount" ]
}
}
},
{ "$match": { "hasMoreFollowersThanFavs": 1 } },
{
"$group": {
"_id": null,
"count": { "$sum": 1 }
}
}
])
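As a quick illustration of why the $match stage tests for 1, here is a plain-JavaScript sketch of the three-way comparison and the resulting count. The cmp function is a hypothetical stand-in for the $cmp operator, and the second and third sample users are made up.

```javascript
// Hypothetical stand-in for $cmp: -1 if a < b, 0 if equal, 1 if a > b.
function cmp(a, b) {
  if (a < b) return -1;
  if (a > b) return 1;
  return 0;
}

const users = [
  { user: { FollowersCount: 461, FavouritesCount: 5223 } }, // from the question
  { user: { FollowersCount: 900, FavouritesCount: 12 } },   // made up
  { user: { FollowersCount: 50, FavouritesCount: 50 } }     // made up
];

// $project (cmp) + $match (=== 1) + $group ($sum: 1), collapsed into one chain.
const count = users
  .map(u => cmp(u.user.FollowersCount, u.user.FavouritesCount))
  .filter(result => result === 1)
  .length;

console.log(count); // 1
```

Note that ties produce 0 and are therefore excluded, which matches the "greater than" requirement.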
Another option is a single stage with the $redact operator, which combines the functionality of $project and $match above: it keeps all documents that match the specified condition using the $$KEEP system variable and discards those that don't using the $$PRUNE system variable:
db.users.aggregate([
{
"$redact": {
"$cond": [
{
"$eq": [
{ "$cmp": [ "$user.FollowersCount", "$user.FavouritesCount" ] },
1
]
},
"$$KEEP",
"$$PRUNE"
]
}
},
{
"$group": {
"_id": null,
"count": { "$sum": 1 }
}
}
])
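On servers that support it (assuming MongoDB 3.6 or newer), the $expr query operator allows the same field-to-field comparison directly in a filter. The sketch below builds such a filter and checks it locally with a deliberately tiny evaluator. The helpers resolvePath and matches are hypothetical and understand only this one filter shape; they are not part of any MongoDB driver.

```javascript
// Filter using $expr (assumption: server is MongoDB 3.6+).
// In the shell it would be used as: db.users.find(filter).count()
const filter = {
  $expr: { $gt: ["$user.FollowersCount", "$user.FavouritesCount"] }
};

// Hypothetical helper: resolves a "$a.b" field path against a document.
function resolvePath(doc, path) {
  return path
    .slice(1) // drop the leading "$"
    .split(".")
    .reduce((value, key) => (value == null ? undefined : value[key]), doc);
}

// Hypothetical evaluator: handles only { $expr: { $gt: [pathA, pathB] } }.
function matches(doc, f) {
  const [left, right] = f.$expr.$gt;
  return resolvePath(doc, left) > resolvePath(doc, right);
}

const sample = [
  { user: { FollowersCount: 461, FavouritesCount: 5223 } }, // from the question
  { user: { FollowersCount: 900, FavouritesCount: 12 } }    // made up
];

const matchingCount = sample.filter(doc => matches(doc, filter)).length;
console.log(matchingCount); // 1
```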

Combining group and project in mongoDB aggregation framework

my document looks like this:
{
"_id" : ObjectId("5748d1e2498ea908d588b65e"),
"some_item" : {
"_id" : ObjectId("5693afb1b49eb7d5ed97de14"),
"item_property_1" : 1.0,
"item_property_2" : 2.0,
},
"timestamp" : "2016-05-28",
"price_information" : {
"arbitrary_value" : 111,
"hourly_rates" : [
{
"price" : 74.45,
"hour" : "0"
},
{
"price" : 74.45,
"hour" : "1"
},
{
"price" : 74.45,
"hour" : "2"
},
]
}
}
I did average the price per day via:
db.hourly.aggregate([
{$match: {timestamp : "2016-05-28"}},
{$unwind: "$price_information.hourly_rates"},
{$group: { _id: "$unique_item_identifier", total_price: { $avg: "$price_information.hourly_rates.price"}}}
]);
I am struggling with bringing (projecting) other params into the result set. I would also like to have some_item and timestamp in the result set. I tried to use a $project: {some_item: 1, total_price: 1, ...} within the query, but that wasn't right.
My desired output would be like:
{
"_id" : ObjectId("5693afb1b49eb7d5ed97de27"),
"someItem" : {
"_id" : ObjectId("5693afb1b49eb7d5ed97de14"),
"item_property_1" : 1.0,
"item_property_2" : 2.0,
},
"timestamp" : "2016-05-28",
"price_information" : {
"avg_price": 34
}
}
If somebody could give me a hint, how to project the grouping and the other params into the result set, I would be thankful.
Best
Rob
If using MongoDB 3.2 or newer, you can use $avg in the $project stage, since there it returns the average of the specified expression or list of expressions for each document, e.g.:
db.hourly.aggregate([
{ "$match": { "timestamp": "2016-05-28" } },
{
"$project": {
"price_information": {
"avg_price": { "$avg": "$price_information.hourly_rates.price" }
},
"someItem": "$some_item",
"timestamp": 1,
}
}
]);
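To see what the projected document should contain, here is a plain-JavaScript sketch that computes the per-document average and carries the other fields along. The avg helper is hypothetical, and the prices are made-up round numbers so the result is easy to verify by hand.

```javascript
// Hypothetical helper: arithmetic mean, null for an empty list in this sketch.
function avg(values) {
  if (values.length === 0) return null;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Trimmed document in the shape from the question, with made-up prices.
const doc = {
  some_item: { item_property_1: 1.0, item_property_2: 2.0 },
  timestamp: "2016-05-28",
  price_information: {
    arbitrary_value: 111,
    hourly_rates: [
      { price: 70, hour: "0" },
      { price: 74, hour: "1" },
      { price: 78, hour: "2" }
    ]
  }
};

// Same shape the $project stage is meant to produce.
const projected = {
  someItem: doc.some_item,
  timestamp: doc.timestamp,
  price_information: {
    avg_price: avg(doc.price_information.hourly_rates.map(r => r.price))
  }
};

console.log(projected.price_information.avg_price); // 74
```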
In previous versions of MongoDB, $avg is available in the $group stage only. So to include the other fields, use the $first operator in your grouping:
db.hourly.aggregate([
{ "$match": { "timestamp": "2016-05-28" } },
{ "$unwind": "$price_information.hourly_rates" },
{
"$group": {
"_id": "$_id",
"avg_price": { "$avg": "$price_information.hourly_rates.price" },
"someItem": { "$first": "$some_item" },
"timestamp": { "$first": "$timestamp" },
}
},
{
"$project": {
"price_information": { "avg_price": "$avg_price" },
"someItem": 1,
"timestamp": 1
}
}
]);
Note: How useful the $first operator is in a $group stage depends largely on how the documents entering that stage are ordered, as well as on the group-by key. Because $first returns the value from the first document in each group of documents that share the same group-by key, the $group stage should logically be preceded by a $sort stage so that the input documents arrive in a defined order. It is only sensible to use $first when you know the order in which the data is being processed.
However, since the above groups by the main document's _id key, applying $first to fields outside the unwound array (that is, anything other than the flattened price_information.hourly_rates fields) is guaranteed to return the original value. Hence there is no need for a pre-sort stage in this case.

Find MongoDB object using value of another field

I recently found difficulty in finding an object stored in a document with its key in another field of that same document.
{
list : {
"red" : "397n8",
"blue" : "j3847",
"pink" : "8nc48",
"green" : "983c4"
},
result : [
{ "id" : "397n8", "value" : "anger" },
{ "id" : "j3847", "value" : "water" },
{ "id" : "8nc48", "value" : "girl" },
{ "id" : "983c4", "value" : "evil" }
]
}
I am trying to get the value for 'blue' which has an id of 'j3847' and a value of 'water'.
db.docs.find( { result.id : list.blue }, { result.value : 1 } );
# list.blue would return water
# list.pink would return girl
# list.green would return evil
I tried many things and even found a great article on how to update a value using a value in the same document, "Update MongoDB field using value of another field", which I based my attempts on, with no success... :/
How can I find a MongoDB object using the value of another field?
You can do it with the $filter operator within mongo aggregation. It returns an array with only those elements that match the condition:
db.docs.aggregate([
{
$project: {
result: {
$filter: {
input: "$result",
as:"item",
cond: { $eq: ["$list.blue", "$$item.id"]}
}
}
}
}
])
Output for this query looks like this:
{
"_id" : ObjectId("569415c8299692ceedf86573"),
"result" : [ { "id" : "j3847", "value" : "water" } ]
}
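The lookup that $filter performs here can be checked in plain JavaScript with Array.prototype.filter. This is only a local sketch of the logic; the lookup function is a hypothetical name, and the ids are assumed to be stored as strings.

```javascript
// Sample document from the question, with ids assumed to be strings.
const doc = {
  list: { red: "397n8", blue: "j3847", pink: "8nc48", green: "983c4" },
  result: [
    { id: "397n8", value: "anger" },
    { id: "j3847", value: "water" },
    { id: "8nc48", value: "girl" },
    { id: "983c4", value: "evil" }
  ]
};

// Hypothetical helper: keep only the result entries whose id equals list[color],
// mirroring cond: { $eq: ["$list.<color>", "$$item.id"] }.
function lookup(document, color) {
  const id = document.list[color];
  return document.result.filter(item => item.id === id);
}

console.log(lookup(doc, "blue")); // [ { id: 'j3847', value: 'water' } ]
console.log(lookup(doc, "pink")); // [ { id: '8nc48', value: 'girl' } ]
```

A colour that is not present in list yields an empty array in this sketch.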
One way is to use the $where operator, though I would not recommend it: it forces a full collection scan regardless of whether other conditions could use an index, and it invokes the JavaScript interpreter on each document, which is considerably slower than native code.
Instead, use the .aggregate() method for this type of comparison, which is definitely the better option:
db.docs.aggregate([
{ "$unwind": "$result" },
{
"$project": {
"result": 1,
"same": { "$eq": [ "$list.blue", "$result.id" ] }
}
},
{ "$match": { "same": true } },
{
"$project": {
"_id": 0,
"value": "$result.value"
}
}
])
When the $unwind operator is applied to the result array field, it generates a new document for each element of that array. This flattens the data so that the subsequent $project step can inspect each array member and check whether the two fields are the same.
Sample Output
{
"result" : [
{
"value" : "water"
}
],
"ok" : 1
}
Another alternative is to use the $map and $setDifference operators in a single $project step, which avoids $unwind; $unwind can be costly on very large collections and may in some cases run into the 16MB BSON limit:
db.docs.aggregate([
{
"$project": {
"result": {
"$setDifference": [
{
"$map": {
"input": "$result",
"as": "r",
"in": {
"$cond": [
{ "$eq": [ "$$r.id", "$list.blue" ] },
"$$r",
false
]
}
}
},
[false]
]
}
}
}
])
Sample Output
{
"result" : [
{
"_id" : ObjectId("569412e5a51a6656962af1c7"),
"result" : [
{
"id" : "j3847",
"value" : "water"
}
]
}
],
"ok" : 1
}
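The two-step logic of that $project (map non-matching elements to false, then remove false) can be sketched in plain JavaScript as follows. The ids are assumed to be strings, and unlike $setDifference, which treats its inputs as sets, this filter does not collapse duplicate elements.

```javascript
// Trimmed sample document, with ids assumed to be strings.
const doc = {
  list: { blue: "j3847" },
  result: [
    { id: "397n8", value: "anger" },
    { id: "j3847", value: "water" },
    { id: "8nc48", value: "girl" }
  ]
};

// Step 1 ($map + $cond): keep the element if its id matches, else false.
const mapped = doc.result.map(r => (r.id === doc.list.blue ? r : false));

// Step 2 ($setDifference with [false]): drop the false placeholders.
const matched = mapped.filter(entry => entry !== false);

console.log(mapped.length); // 3
console.log(matched);       // [ { id: 'j3847', value: 'water' } ]
```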

List values in MongoDB aggregation pipeline belonging to different documents

I got this result coming from an Aggregation pipeline I did:
{
"_id" : ISODate("2015-05-14T08:22:09.441Z"),
"values" : {
"v1_min" : 15.872267760931187,
"v1" : 15.909139078185774,
"v1_max" : 20.6420184124931776
}
},
{
"_id" : ISODate("2015-05-13T08:22:09.441Z"),
"values" : {
"v1_min" : 2.872263320931187,
"v1" : 7.909132898185774,
"v1_max" :44.6498764124931776
}
},
{...}
Do you think it's possible to get a structure like this one
{
"_id" : [ISODate("2015-05-14T08:22:09.441Z"),ISODate("2015-05-13T08:22:09.441Z")],
"values" : {
"v1_min" : [15.872267760931187, 2.872263320931187],
"v1" : [15.909139078185774, 7.909132898185774],
"v1_max" : [20.6420184124931776, 44.6498764124931776]
}
}
by adding some other stages to my aggregation pipeline?
If so, how would you do it?
I'd rather not handle this in application code because I think the MongoDB aggregation framework is faster and would do a better job than me.
Yes, it's quite possible indeed. The following aggregation pipeline will achieve the desired output:
db.collection.aggregate([
{
"$group": {
"_id": null,
"ids": {
"$push": "$_id"
},
"v1_min": {
"$push": "$values.v1_min"
},
"v1": {
"$push": "$values.v1"
},
"v1_max": {
"$push": "$values.v1_max"
}
}
},
{
"$project": {
"_id": "$ids",
"values": {
"v1_min": "$v1_min",
"v1": "$v1",
"v1_max": "$v1_max"
}
}
}
]);
-- EDIT --
Use $push instead of $addToSet, as the latter only adds an element if the resulting array does not already contain it. (Thanks to @SylvainLeroux for the positive contributions)
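The difference matters as soon as two documents share a value. The plain-JavaScript sketch below uses made-up, shortened data: the array built like $push keeps every value in input order, while the set-based one (standing in for $addToSet) drops the repeat and therefore no longer lines up with the _id array.

```javascript
// Made-up, shortened documents; the first and third share the same v1.
const docs = [
  { _id: "2015-05-14", values: { v1: 15.9 } },
  { _id: "2015-05-13", values: { v1: 7.9 } },
  { _id: "2015-05-12", values: { v1: 15.9 } }
];

// Like $push: one entry per input document, in input order.
const ids = docs.map(d => d._id);
const pushed = docs.map(d => d.values.v1);

// Like $addToSet: duplicates are collapsed.
const asSet = [...new Set(pushed)];

console.log(ids.length);    // 3
console.log(pushed.length); // 3
console.log(asSet.length);  // 2
```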