How to calculate a sum of specific documents using aggregation? - mongodb

I have a schema representing a message thread. So each document in the mongo database looks something like:
{
    id: "thread_id",
    participants: ["user1", "user2"],
    unReadMessageCounts: [
        {
            participant: "user1",
            count: 5
        },
        {
            participant: "user2",
            count: 3
        }
    ]
}
What I want to do is get a sum of all unread message counts for a given user - say, "user2". I know I could do this by just doing a find() on the collection and then writing a function to sum up the counts for a given user. But I'd like to use mongo's aggregate functionality if possible. I know I can do a match to first select all threads in which "user2" is a participant, but then how do I construct the group and/or sum expressions to pull out the right field from the document?

Use the following aggregation pipeline to get the desired result. The initial step filters the incoming documents with the $match operator to keep only threads in which "user2" is a participant.
The next pipeline stage "denormalizes" the unReadMessageCounts array through the $unwind operator, which outputs one document per array element (two per document in your sample data above).
Further filtering is necessary to aggregate data for the correct participant, and this is done through another $match pipeline step.
The final aggregation operation using $group specifies a group _id of null, calculating the total count for all documents in the pipeline with the $sum accumulator operator on the "unReadMessageCounts.count" field.
So, running this aggregation pipeline on the sample data given:
db.collection.aggregate([
    { "$match": { "unReadMessageCounts.participant": "user2" } },
    { "$unwind": "$unReadMessageCounts" },
    { "$match": { "unReadMessageCounts.participant": "user2" } },
    {
        "$group": {
            "_id": null,
            "total": { "$sum": "$unReadMessageCounts.count" }
        }
    }
])
will yield the result:
/* 0 */
{
    "result" : [
        {
            "_id" : null,
            "total" : 3
        }
    ],
    "ok" : 1
}
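As a side note (an addition, not part of the original answer): on MongoDB 3.2 or newer you can avoid the $unwind altogether by filtering and summing the array in place. A minimal sketch, assuming the same collection and field names:

db.collection.aggregate([
    { "$match": { "unReadMessageCounts.participant": "user2" } },
    {
        "$project": {
            // extract just the count values belonging to "user2" from each thread
            "userCounts": {
                "$map": {
                    "input": {
                        "$filter": {
                            "input": "$unReadMessageCounts",
                            "as": "el",
                            "cond": { "$eq": [ "$$el.participant", "user2" ] }
                        }
                    },
                    "as": "m",
                    "in": "$$m.count"
                }
            }
        }
    },
    {
        "$group": {
            "_id": null,
            // inner $sum totals each document's array, outer $sum accumulates across documents
            "total": { "$sum": { "$sum": "$userCounts" } }
        }
    }
])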

You can use the $redact operator, as shown here, to limit the size of the documents to process in the pipeline; then you $unwind your documents, and in the $group stage you use the $sum accumulator operator to return the total of unread messages for "user2".
db.collection.aggregate([
{ "$match": {
"unReadMessageCounts": {
"$elemMatch": { "participant": "user2" }
}
}},
{ "$redact": {
"$cond": [
{ "$or": [
{ "$eq": [ "$participant", "user2" ] },
{ "$not" : "$participant" }
]},
"$$DESCEND",
"$$PRUNE"
]
}},
{ "$unwind": "$unReadMessageCounts" },
{ "$group": {
"_id": null,
"total": { "$sum": "$unReadMessageCounts.count" }
}}
])

Related

MongoDB - Operations with nested fields

I have twitter data that looks like this:
db.users.findOne()
{
    "_id" : ObjectId("578ffa8e7eb9513f4f55a935"),
    "user_name" : "koteras",
    "retweet_count" : 0,
    "tweet_followers_count" : 461,
    "source" : "Twitter for iPhone",
    "coordinates" : null,
    "tweet_mentioned_count" : 1,
    "tweet_ID" : "755891629932675072",
    "tweet_text" : "RT #ochocinco: I beat them all for 10 straight hours #FIFA16KING",
    "user" : {
        "CreatedAt" : ISODate("2011-12-27T09:04:01Z"),
        "FavouritesCount" : 5223,
        "FollowersCount" : 461,
        "FriendsCount" : 619,
        "UserId" : 447818090,
        "Location" : "501"
    }
}
For example, I want to find the number of users that have "FollowersCount" greater than "FavouritesCount". How can I do that?
The $where operator is specifically designed for this.
db.users.find( { $where: function() { return (this.user.FollowersCount > this.user.FavouritesCount) } } );
But keep in mind that this runs single-threaded JavaScript code, so it will be slower.
Another option is to use an aggregation pipeline, projecting the difference and then using a $match on it:
db.users.aggregate([
    { $project: {
        diff: { $subtract: [ "$user.FollowersCount", "$user.FavouritesCount" ] },
        // project remaining fields here
    }},
    { $match: { diff: { $gt: 0 } } }
])
In my experience I have found the second one to be much faster than the first.
To get the number of users that have "FollowersCount" greater than "FavouritesCount", you could use the aggregation framework, which has some comparison operators you can apply.
Consider the first use case, which manipulates the comparison operators within the $project pipeline and uses a subsequent $match pipeline to filter documents based on the $cmp value. You can then get the final user count by applying a $group pipeline that aggregates the filtered documents:
db.users.aggregate([
{
"$project": {
"hasMoreFollowersThanFavs": {
"$cmp": [ "$user.FollowersCount", "$user.FavouritesCount" ]
}
}
},
{ "$match": { "hasMoreFollowersThanFavs": 1 } },
{
"$group": {
"_id": null,
"count": { "$sum": 1 }
}
}
])
Another option is using a single pipeline with the $redact operator, which incorporates the functionality of $project and $match as above: it returns all documents that match the specified condition using the $$KEEP system variable and discards those that don't match using the $$PRUNE system variable:
db.users.aggregate([
{
"$redact": {
"$cond": [
{
"$eq": [
{ "$cmp": [ "$user.FollowersCount", "$user.FavouritesCount" ] },
1
]
},
"$$KEEP",
"$$PRUNE"
]
}
},
{
"$group": {
"_id": null,
"count": { "$sum": 1 }
}
}
])
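As a side note (an addition beyond the original answers): on MongoDB 3.6 or newer, the $expr operator lets you run the same aggregation comparison directly inside a plain find query, which is usually the simplest option:

// count users whose FollowersCount exceeds FavouritesCount
db.users.find({
    "$expr": { "$gt": [ "$user.FollowersCount", "$user.FavouritesCount" ] }
}).count()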

Mongo find duplicates for entries for two or more fields

I have documents like this:
{
    "_id" : ObjectId("557eaf444ba222d545c3dffc"),
    "foreing" : ObjectId("538726124ba2222c0c0248ae"),
    "value" : "test"
}
I want to find all documents which have duplicate values for the pair foreing & value.
You can easily identify the duplicates by running the following aggregation pipeline operation:
db.collection.aggregate([
{
"$group": {
"_id": { "foreing": "$foreing", "value": "$value" },
"uniqueIds": { "$addToSet": "$_id" },
"count": { "$sum": 1 }
}
},
{ "$match": { "count": { "$gt": 1 } } }
])
The $group operator in the first step groups the documents by the foreing and value key values, and builds an array of the _id values of each grouped document in the uniqueIds field using the $addToSet operator. This gives you an array of unique expression values for each group. The total number of grouped documents is computed with the $sum operator for use in the later pipeline stage.
In the second pipeline stage, the $match operator filters out all documents with a count of 1. The filtered-out documents represent unique index keys.
The remaining documents will be those in the collection that have duplicate key values for the pair foreing & value.
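One practical note (an assumption on my part, not something from the original answer): on a large collection the $group stage can exceed the 100MB memory limit for a pipeline stage, in which case you can let it spill to disk:

// allowDiskUse lets memory-hungry stages such as $group write temporary files
db.collection.aggregate([
    {
        "$group": {
            "_id": { "foreing": "$foreing", "value": "$value" },
            "uniqueIds": { "$addToSet": "$_id" },
            "count": { "$sum": 1 }
        }
    },
    { "$match": { "count": { "$gt": 1 } } }
], { "allowDiskUse": true })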
We only have to group on the basis of the two keys and select the elements with a count greater than 1 to find the duplicates.
The query will look like this:
db.mycollection.aggregate(
{ $group: {
_id: { foreing: "$foreing", value: "$value" },
count: { $sum: 1 },
docs: { $push: "$_id" }
}},
{ $match: {
count: { $gt : 1 }
}}
)
The output will look like this:
{
"result" : [
{
"_id" : {
"foreing" : 1,
"value" : 2
},
"count" : 2,
"docs" : [
ObjectId("34567887654345678987"),
ObjectId("34567887654345678988")
]
}
],
"ok" : 1
}
Reference: How to find mongo documents with a same field

MongoDB Count and groupby subdocument

I have the following document:
{
"id":1,
"url":"mysite.com",
"views":
[
{"ip":"1.1.1.1","date":"01-01-2015"},
{"ip":"2.2.2.2","date":"01-01-2015"},
{"ip":"1.1.1.1","date":"01-01-2015"},
{"ip":"1.1.1.1","date":"01-01-2015"}
]
}
If I want to count how many unique IPs there are (a group-by), how can I do that with mongo?
Use the aggregation framework to get the desired result. The aggregation pipeline will have a $unwind operation as the first step, which deconstructs the views array field from the input documents to output a document for each element; each output document replaces the array with an element value. The next pipeline stage, $group, then groups the documents by the "views.ip" field, calculates the count field for each group, and outputs a document for each unique IP.
The new per-IP documents have two fields: the _id field and the count field. The _id field contains the value of the unique IP address, i.e. the group-by key. The count field is a calculated field holding the total number of views per unique IP; to calculate the value, $group uses the $sum operator. So your final aggregation pipeline would look like this:
db.collection.aggregate([
{
"$unwind": "$views"
},
{
"$group": {
"_id": "$views.ip",
"count": {
"$sum": 1
}
}
}
])
Output:
/* 1 */
{
"result" : [
{
"_id" : "2.2.2.2",
"count" : 1
},
{
"_id" : "1.1.1.1",
"count" : 3
}
],
"ok" : 1
}
-- UPDATE --
To get the grand total across all unique IPs (the sum of all their counts), you need another $group pipeline stage, this time with an _id of null; that is, you group all the documents from the previous pipeline stream into one, then use the same $sum operation on that group to total up the count fields. The aggregation pipeline would look like this in the end:
db.collection.aggregate([
{
"$unwind": "$views"
},
{
"$group": {
"_id": "$views.ip",
"count": {
"$sum": 1
}
}
},
{
"$group": {
"_id": null,
"total": {
"$sum": "$count"
}
}
}
])
Output:
/* 1 */
{
"result" : [
{
"_id" : null,
"total" : 4
}
],
"ok" : 1
}
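If what you actually want is the number of distinct IPs (2 in this sample) rather than the sum of their view counts, a small variation of the second $group should do it; a minimal sketch:

db.collection.aggregate([
    { "$unwind": "$views" },
    // one document per distinct IP
    { "$group": { "_id": "$views.ip" } },
    // then count those documents
    { "$group": { "_id": null, "total": { "$sum": 1 } } }
])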

mongodb $aggregate empty array and multiple documents

My MongoDB collection has the documents below:
> db.test.find({name:{$in:["abc","abc2"]}})
{ "_id" : 1, "name" : "abc", "scores" : [ ] }
{ "_id" : 2, "name" : "abc2", "scores" : [ 10, 20 ] }
I want to get the length of the scores array for each document. How should I do that?
I tried the command below:
db.test.aggregate({$match:{name:"abc2"}}, {$unwind: "$scores"}, {$group: {_id:null, count:{$sum:1}}} )
Result:
{ "_id" : null, "count" : 2 }
But below command:
db.test.aggregate({$match:{name:"abc"}}, {$unwind: "$scores"}, {$group: {_id:null, count:{$sum:1}}} )
Returns nothing. Questions:
How should I get the length of scores for 2 or more documents in one command?
Why does the second command return nothing? And how should I check if the array is empty?
So this is actually a common problem. In an aggregation pipeline, the result of the $unwind phase where the array is "empty" is to "remove" the document from the pipeline results.
In order to return a count of "0" for such an "empty" array, you need to do something like the following.
In MongoDB 2.6 or greater, just use $size:
db.test.aggregate([
{ "$match": { "name": "abc" } },
{ "$group": {
"_id": null,
"count": { "$sum": { "$size": "$scores" } }
}}
])
In earlier versions you need to do this:
db.test.aggregate([
{ "$match": { "name": "abc" } },
{ "$project": {
"name": 1,
"scores": {
"$cond": [
{ "$eq": [ "$scores", [] ] },
{ "$const": [false] },
"$scores"
]
}
}},
{ "$unwind": "$scores" },
{ "$group": {
"_id": null,
"count": { "$sum": {
"$cond": [
"$scores",
1,
0
]
}}
}}
])
The modern operation is simple since $size will just "measure" the array. In the latter case you need to "replace" the array with a single false value when it is empty, to stop $unwind from "destroying" the document for an "empty" statement.
So replacing the empty array with false allows the $cond "ternary" to choose whether to add 1 or 0 to the $sum of the overall statement.
That is how you get the length of "empty" arrays.
To get the length of scores in 2 or more documents, you just need to change the _id value in the $group pipeline, which contains the distinct group-by key; in this case you need to group by the document _id, as in the sketch below.
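A minimal sketch of that per-document version, assuming MongoDB 2.6+ so that $size copes with the empty array:

db.test.aggregate([
    { "$match": { "name": { "$in": [ "abc", "abc2" ] } } },
    { "$group": {
        "_id": "$_id",
        // $size returns 0 for an empty array, so "abc" is not lost
        "count": { "$sum": { "$size": "$scores" } }
    }}
])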
Your second aggregation returns nothing because the only document passing the $match stage has an empty scores array, and $unwind drops documents whose array is empty. To filter out documents whose array is empty, your match query should be
{'scores.0': {$exists: true}} or {scores: {$not: {$size: 0}}}
Overall, your aggregation should look like this:
db.test.aggregate([
{ "$match": {"scores.0": { "$exists": true } } },
{ "$unwind": "$scores" },
{
"$group": {
"_id": "$_id",
"count": { "$sum": 1 }
}
}
])
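Run against the sample data above, this returns only the document with a non-empty array:

{ "_id" : 2, "count" : 2 }

The empty-array document is filtered out up front, which is why the $size approach above is preferable when you need an actual 0 count.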

How to optimize mongoDB query?

I have the following sample document in MongoDB.
{
    "location" : {
        "language" : null,
        "country" : "null",
        "city" : "null",
        "state" : null,
        "continent" : "null",
        "latitude" : "null",
        "longitude" : "null"
    },
    "request" : [
        {
            "referrer" : "direct",
            "url" : "http://www.google.com/",
            "title" : "index page",
            "currentVisit" : "1401282897",
            "visitedTime" : "1401282905"
        },
        {
            "referrer" : "direct",
            "url" : "http://www.stackoverflow.com/",
            "title" : "index page",
            "currentVisit" : "1401282900",
            "visitedTime" : "1401282905"
        },
        ......
    ],
    "uuid" : "109eeee0-e66a-11e3"
}
Note:
The database contains more than 10845 documents.
Each document contains nearly 100 requests (100 objects in the request array).
Technology/Language - node.js
I enabled profiling to check the execution times:
First Query - 13899ms
Second Query - 9024ms
Third Query - 8310ms
Fourth Query - 6858ms
There is not much difference when using indexing.
Queries:
I have the following aggregation queries to execute to fetch the data.
var match = {"request.currentVisit":{$gte:core.getTime()[1].toString(),$lte:core.getTime()[0].toString()}};
For Example: var match = {"request.currentVisit":{$gte:"1401282905",$lte:"1401282935"}};
For the third and fourth queries, use request.visitedTime instead of request.currentVisit.
First
[
{ "$project":{
"request.currentVisit":1,
"request.url":1
}},
{ "$match":{
"request.1": {$exists:true}
}},
{ "$unwind": "$request" },
{ "$match": match },
{ "$group": {
"_id": {
"url":"$request.url"
},
"count": { "$sum": 1 }
}},
{ "$sort":{ "count": -1 } }
]
Second
[
{ "$project": {
"request.currentVisit":1,
"request.url":1
}},
{ "$match": {
"request":{ "$size": 1 }
}},
{ "$unwind": "$request" },
{ "$match": match },
{ "$group": {
"_id":{
"url":"$request.url"
},
"count":{ "$sum": 1 }
}},
{ "$sort": { "count": -1} }
]
Third
[
{ "$project": {
"request.visitedTime":1,
"uuid":1
}},
{ "$match":{
"request.1": { "$exists": true }
}},
{ "$match": match },
{ "$group": {
"_id": "$uuid",
"count":{ "$sum": 1 }
}},
{ "$group": {
"_id": null,
"total": { "$sum":"$count" }}
}}
]
Forth
[
{ "$project": {
"request.visitedTime":1,
"uuid":1
}},
{ "$match":{
"request":{ "$size": 1 }
}},
{ "$match": match },
{ "$group": {
"_id":"$uuid",
"count":{ "$sum": 1 }
}},
{ "$group": {
"_id":null,
"total": { "$sum": "$count" }
}}
]
Problem:
It is taking more than 38091 ms in total to fetch the data.
Is there any way to optimize the queries?
Any suggestions would be appreciated.
Well, there are a few problems, and you definitely need indexes, but you cannot have compound ones here. It is the "timestamp" values that you are querying within the array that you want to index. It would also be advisable to either convert these to numeric values rather than the current strings, or indeed to BSON Date types. The latter form is actually stored internally as a numeric timestamp value, so there is a general reduction in storage size, which also reduces the index size, as well as being more efficient to match on than strings.
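A minimal sketch of the index creation (the collection name events is my assumption, as it is not given in the question; on shells older than 3.0 use ensureIndex instead of createIndex):

// multikey indexes on the timestamp values inside the array
db.events.createIndex({ "request.currentVisit": 1 })
db.events.createIndex({ "request.visitedTime": 1 })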
The big problem with each query is that you always dive into the "array" contents after processing an $unwind and then "filter" that with $match. While this is what you want to do for your final result, since you have not applied the same filter at an earlier stage, you have many documents in the pipeline that do not match these conditions when you $unwind. The result is "lots" of documents you do not need being processed in this stage. And here you cannot use an index.
Where you also need this match is at the start of the pipeline stages. This narrows the documents down to the "possible" matches before the actual array is filtered.
So using the first as an example:
[
    { "$match": {
        "request.currentVisit": {
            "$gte": "1401282905", "$lte": "1401282935"
        }
    }},
    { "$unwind": "$request" },
    { "$match": {
        "request.currentVisit": {
            "$gte": "1401282905", "$lte": "1401282935"
        }
    }},
    { "$group": {
        "_id": {
            "url": "$request.url"
        },
        "count": { "$sum": 1 }
    }},
    { "$sort": { "count": -1 } }
]
So a few changes. There is a $match at the head of the pipeline. This narrows down the documents and is able to use an index. That is the most important performance consideration. Golden rule: always "match" first.
The $project you had in there was redundant, as you cannot project "just" the fields of an array that is not yet unwound. There is also a misconception that people $project first to reduce the pipeline. The effect is very minimal; if there is in fact a later $project or $group statement that actually limits the fields, then this will be "forward optimized", so things do get taken out of pipeline processing for you. Still, the $match statement above does more to optimize.
The other $match stage checking that the array is actually there can be dropped, as you are now "implicitly" doing that at the start of the pipeline. If more conditions make you more comfortable, then add them to that initial pipeline stage.
The rest remains unchanged: you then $unwind the array and $match to filter the items that you actually want before moving on to your remaining processing. By now, the input documents have been significantly reduced, or reduced as much as they are going to be.
The other alternative that you can do with MongoDB 2.6 and greater is to "filter" the array content before you even $unwind it. This would produce a listing like this:
[
    { "$match": {
        "request.currentVisit": {
            "$gte": "1401282905", "$lte": "1401282935"
        }
    }},
    { "$project": {
        "request": {
            "$setDifference": [
                {
                    "$map": {
                        "input": "$request",
                        "as": "el",
                        "in": {
                            "$cond": [
                                {
                                    "$and": [
                                        { "$gte": [ "$$el.currentVisit", "1401282905" ] },
                                        { "$lte": [ "$$el.currentVisit", "1401282935" ] }
                                    ]
                                },
                                "$$el",
                                false
                            ]
                        }
                    }
                },
                [false]
            ]
        }
    }},
    { "$unwind": "$request" },
    { "$group": {
        "_id": {
            "url": "$request.url"
        },
        "count": { "$sum": 1 }
    }},
    { "$sort": { "count": -1 } }
]
That may save you some processing by being able to "filter" the array before the $unwind, which is possibly better than doing the $match afterwards.
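As a further aside (an addition beyond the original answer): on MongoDB 3.2 or newer, the $filter operator expresses the same array filtering far more directly than the $map/$setDifference trick, so that $project stage could be written as:

{ "$project": {
    "request": {
        "$filter": {
            "input": "$request",
            "as": "el",
            "cond": {
                "$and": [
                    { "$gte": [ "$$el.currentVisit", "1401282905" ] },
                    { "$lte": [ "$$el.currentVisit", "1401282935" ] }
                ]
            }
        }
    }
}}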
But this is the general rule for all of your statements: you need usable indexes and you need to $match first.
It is possible that the actual results you really want could be obtained in a single query, but as it stands your question is not presented that way. Try changing your processing as outlined, and you should see a notable improvement.
If you are still trying to come to terms with how this could possibly be a single query, then you can always ask another question.