MongoDB - Operations with nested fields

I have twitter data that looks like this:
db.users.findOne()
{
"_id" : ObjectId("578ffa8e7eb9513f4f55a935"),
"user_name" : "koteras",
"retweet_count" : 0,
"tweet_followers_count" : 461,
"source" : "Twitter for iPhone",
"coordinates" : null,
"tweet_mentioned_count" : 1,
"tweet_ID" : "755891629932675072",
"tweet_text" : "RT #ochocinco: I beat them all for 10 straight hours #FIFA16KING",
"user" : {
"CreatedAt" : ISODate("2011-12-27T09:04:01Z"),
"FavouritesCount" : 5223,
"FollowersCount" : 461,
"FriendsCount" : 619,
"UserId" : 447818090,
"Location" : "501"
}
}
For example, I want to find the number of users that have "FollowersCount" greater than "FavouritesCount". How can I do that?

The $where operator is specifically designed for this.
db.users.find( { $where: function() { return (this.user.FollowersCount > this.user.FavouritesCount) } } );
But keep in mind that this runs single-threaded JavaScript code on the server and will be slower.
Another option is to use an aggregation pipeline that projects the difference and then a $match on that difference:
db.users.aggregate([
{$project: {
diff: {$subtract: ["$user.FollowersCount", "$user.FavouritesCount"]},
// project remaining fields here
}
},
{$match: {diff: {$gt: 0}}}
])
In my experience I have found the second one to be much faster than the first.
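As an aside, on MongoDB 3.6 or newer the same field-to-field comparison can be written with the $expr query operator, which avoids the JavaScript evaluation of $where entirely, and pairing it with countDocuments() answers the "how many users" part directly. A minimal sketch, assuming a 3.6+ server and a shell/driver recent enough to provide countDocuments():
// Sketch only: $expr requires MongoDB 3.6+, countDocuments() a reasonably recent shell
db.users.countDocuments({
    "$expr": { "$gt": [ "$user.FollowersCount", "$user.FavouritesCount" ] }
})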

To get the number of users that have "FollowersCount" greater than "FavouritesCount", you can use the aggregation framework, which has comparison operators you can apply.
Consider the first approach, which uses the $cmp comparison operator within a $project pipeline and a subsequent $match pipeline to filter documents based on the $cmp value. You can then get the final user count by applying a $group pipeline that counts the filtered documents:
db.users.aggregate([
{
"$project": {
"hasMoreFollowersThanFavs": {
"$cmp": [ "$user.FollowersCount", "$user.FavouritesCount" ]
}
}
},
{ "$match": { "hasMoreFollowersThanFavs": 1 } },
{
"$group": {
"_id": null,
"count": { "$sum": 1 }
}
}
])
Another option is a single pipeline with the $redact operator, which incorporates the functionality of $project and $match above: it keeps all documents that match the specified condition using the $$KEEP system variable and discards those that don't match using the $$PRUNE system variable:
db.collection.aggregate([
{
"$redact": {
"$cond": [
{
"$eq": [
{ "$cmp": [ "$user.FollowersCount", "$user.FavouritesCount" ] },
1
]
},
"$$KEEP",
"$$PRUNE"
]
}
},
{
"$group": {
"_id": null,
"count": { "$sum": 1 }
}
}
])
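As a minor aside, if you are on MongoDB 3.4 or newer, the trailing $group stage in either pipeline can be replaced by the shorthand $count stage, which emits the same single-document count:
// Equivalent to { "$group": { "_id": null, "count": { "$sum": 1 } } } on MongoDB 3.4+
{ "$count": "count" }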

Related

How to find percentage of grouping containing a specific word

I am trying to calculate the percentage of listings in a MongoDB collection that contain a specific word, grouped by one of the collection's fields.
I have managed to group the count of listings containing the word, but not to express it as a percentage of each group's total listings.
My collection looks like this:
{
"_id" : "103456",
"metadata" : {
"type" : "Bike",
"brand" : "Siamoto",
"model" : "Siamoto vespa '01 - € 550 EUR (Negotiable)"
}
},
{
"_id" : "103457",
"metadata" : {
"type" : "Bike",
"brand" : "BMW",
"model" : "BMW ADFR '06 - € 5680 EUR"
}
}
I want to project the percentage of ads per metadata.brand that contain the word "Negotiable" in metadata.model.
I have used for the count something like:
db.advertisements.aggregate([
{ $match: { $text: { $search: "Negotiable" } } },
{ $group: { _id: "$metadata.brand", Count: { $sum: 1} } }
])
and it worked but I can't find a workaround for the percentage. Thanks to all
For what you are trying to do, using a $text search or even a $regex query is the wrong approach. All either of these can do is return the "matching" documents from the collection.
Using Aggregate to Count String Matches
Whilst not as flexible as a regular expression (and sadly there is no aggregation operator equivalent at this time, but there will be in future releases; see SERVER-11947), the better option is to use $indexOfCP to match the occurrence of the "string" and then count those against the "total counts" from each grouping:
db.advertisements.aggregate([
{ "$group": {
"_id": "$metadata.brand",
"totalCount": { "$sum": 1 },
"matchedCount": {
"$sum": {
"$cond": [{ "$ne": [{ "$indexOfCP": [ "$metadata.model", "Negotiable" ] }, -1 ] }, 1, 0]
}
}
}},
{ "$addFields": {
"percentage": {
"$cond": {
"if": { "$ne": [ "$matchedCount", 0 ] },
"then": {
"$multiply": [
{ "$divide": [ "$matchedCount", "$totalCount" ] },
100
]
},
"else": 0
}
}
}},
{ "$sort": { "percentage": -1 } }
])
And the results:
{ "_id" : "Siamoto", "totalCount" : 1, "matchedCount" : 1, "percentage" : 100 }
{ "_id" : "BMW", "totalCount" : 1, "matchedCount" : 0, "percentage" : 0 }
Note that the $group is used to accumulate both the total documents found within the "brand" and those where the string was matched. The $cond operator used here is a "ternary" or if/then/else statement which evaluates a boolean expression and then returns one value where true or another where false. In this case the condition is the $indexOfCP NOT returning -1, i.e. "not found".
The "percentage" is actually done in a separate stage, which in this case uses $addFields to add the "additional field". The operation is basically a $divide over the two accumulated values from the previous stage. The $cond is just applied to avoid "divide by zero" errors, and the $multiply moves the decimal places into something that looks more like a "percentage". The basic premise is that calculations which require "totals" to be accumulated first will always be a manipulation in a "later stage".
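As a concrete (hypothetical) example, a brand with a matchedCount of 3 out of a totalCount of 12 would come out as (3 / 12) * 100 = 25 percent.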
MongoDB 4.2 (proposed) Preview
FYI, with the current "unfinalized" syntax for $regexFind from MongoDB 4.2 onwards (proposed, and yet to be finalized if included in that release) this would be something like:
db.advertisements.aggregate([
{ "$group": {
"_id": "$metadata.brand",
"totalCount": { "$sum": 1 },
"matchedCount": {
"$sum": {
"$cond": {
"if": {
"$ne": [
{ "$regexFind": {
"input": "$metadata.model",
"regex": /Negotiable/i
}},
null
]
},
"then": 1,
"else": 0
}
}
}
}},
{ "$addFields": {
"percentage": {
"$cond": {
"if": { "$ne": [ "$matchedCount", 0 ] },
"then": {
"$multiply": [
{ "$divide": [ "$matchedCount", "$totalCount" ] },
100
]
},
"else": 0
}
}
}},
{ "$sort": { "percentage": -1 } }
])
Again noting strongly that the "current" implementation may be subject to change by the time it is released. This is how it works on the current 4.1.9-17-g0a856820ba development release.
Using MapReduce
An alternate approach, where either your MongoDB version does not support $indexOfCP or you need more flexibility in how you "match the string", is to use mapReduce for the aggregation instead:
db.advertisements.mapReduce(
function() {
emit(this.metadata.brand, {
totalCount: 1,
matchedCount: (/Negotiable/i.test(this.metadata.model)) ? 1 : 0
});
},
function(key,values) {
var obj = { totalCount: 0, matchedCount: 0 };
values.forEach(value => {
obj.totalCount += value.totalCount;
obj.matchedCount += value.matchedCount;
});
return obj;
},
{
"out": { "inline": 1 },
"finalize": function(key,value) {
value.percentage = (value.matchedCount != 0)
? (value.matchedCount / value.totalCount) * 100
: 0;
return value;
}
}
)
This has a similar result, but in a very "mapReduce" specific way:
{
"_id" : "BMW",
"value" : {
"totalCount" : 1,
"matchedCount" : 0,
"percentage" : 0
}
},
{
"_id" : "Siamoto",
"value" : {
"totalCount" : 1,
"matchedCount" : 1,
"percentage" : 100
}
}
The logic is pretty much the same. We "emit" using the "key" for the "brand" and then use another ternary to determine whether to count a "match" or not; in this case it is a regular expression test() operation, even using "case insensitive" matching as an example.
The "reducer" part simply accumulates the values that were emitted, and the finalize function is where the "percentage" is returned by the same division and multiplication process.
The main difference between the two, other than basic capabilities, is that mapReduce cannot do "further things" beyond the accumulation and basic manipulation in the finalize. The "sorting" demonstrated in the aggregation pipeline cannot be done with mapReduce without outputting to a separate collection and doing a separate find() and sort() on the documents it contains.
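For illustration, a sketch of that output-to-collection approach, assuming the map, reduce, and finalize functions shown above have been assigned to the variables mapFn, reduceFn, and finalizeFn, and that "advertisement_percentages" is just a hypothetical name for the output collection:
db.advertisements.mapReduce(mapFn, reduceFn, {
    "out": "advertisement_percentages",   // results are written to this collection
    "finalize": finalizeFn
});
// The sort is then a normal query against the output collection
db.advertisement_percentages.find().sort({ "value.percentage": -1 });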
Either way works, and it just depends on your needs and the capabilities of what you have available. Of course any aggregate() approach will be much faster than using the JavaScript evaluation of mapReduce. So you probably want aggregate() as your preference where possible.

Combining group and project in mongoDB aggregation framework

my document looks like this:
{
"_id" : ObjectId("5748d1e2498ea908d588b65e"),
"some_item" : {
"_id" : ObjectId("5693afb1b49eb7d5ed97de14"),
"item_property_1" : 1.0,
"item_property_2" : 2.0,
},
"timestamp" : "2016-05-28",
"price_information" : {
"arbitrary_value" : 111,
"hourly_rates" : [
{
"price" : 74.45,
"hour" : "0"
},
{
"price" : 74.45,
"hour" : "1"
},
{
"price" : 74.45,
"hour" : "2"
},
]
}
}
I averaged the price per day via:
db.hourly.aggregate([
{$match: {timestamp : "2016-05-28"}},
{$unwind: "$price_information.hourly_rates"},
{$group: { _id: "$unique_item_identifier", total_price: { $avg: "$price_information.hourly_rates.price"}}}
]);
I am struggling with bringing (projecting) other params into the result set. I would like to also have some_item and timestamp in the result set. I tried to use a $project: {some_item: 1, total_price: 1, ...} within the query, but that wasn't right.
My desired output would be like:
{
"_id" : ObjectId("5693afb1b49eb7d5ed97de27"),
"someItem" : {
"_id" : ObjectId("5693afb1b49eb7d5ed97de14"),
"item_property_1" : 1.0,
"item_property_2" : 2.0,
},
"timestamp" : "2016-05-28",
"price_information" : {
"avg_price": 34
}
}
If somebody could give me a hint, how to project the grouping and the other params into the result set, I would be thankful.
Best
Rob
If using MongoDB 3.2 and newer, you can use $avg in the $project pipeline, since it returns the average of the specified expression or list of expressions for each document, e.g.:
db.hourly.aggregate([
{ "$match": { "timestamp": "2016-05-28" } },
{
"$project": {
"price_information": {
"avg_price": { "$avg": "$price_information.hourly_rates.price" }
},
"someItem": "$some_item",
"timestamp": 1,
}
}
]);
In previous versions of MongoDB, $avg is available in the $group stage only. So to include the other fields, use the $first operator in your grouping:
db.hourly.aggregate([
{ "$match": { "timestamp": "2016-05-28" } },
{ "$unwind": "$price_information.hourly_rates" },
{
"$group": {
"_id": "$_id",
"avg_price": { "$avg": "$price_information.hourly_rates.price" },
"someItem": { "$first": "$some_item" },
"timestamp": { "$first": "$timestamp" },
}
},
{
"$project": {
"price_information": { "avg_price": "$avg_price" },
"someItem": 1,
"timestamp": 1
}
}
]);
Note: Usage of the $first operator in a $group stage will largely depend on how the documents entering that stage are ordered, as well as on the group-by key. Because $first returns the first document value in a group of documents that share the same group-by key, the $group stage logically should follow a $sort stage so that the input documents are in a defined order. This is only sensible to use when you know the order that the data is being processed in.
However, as the above is grouping by the main document's _id key, the $first operator applied to the non-denormalized fields (as opposed to the flattened price_information array fields) is guaranteed to return the original value in the result, so there is no need for a pre-sort stage to define the order in this case.
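Purely for illustration, if the match did cover a range where order matters, say a hypothetical variant that spans a month of dates and wants $first to pick each item's earliest day, the $sort would sit immediately before the $group:
// Hypothetical sketch: a $sort before $group makes $first deterministic (earliest day per item)
db.hourly.aggregate([
    { "$match": { "timestamp": { "$gte": "2016-05-01", "$lte": "2016-05-31" } } },  // ISO date strings compare lexicographically
    { "$sort": { "timestamp": 1 } },
    { "$unwind": "$price_information.hourly_rates" },
    {
        "$group": {
            "_id": "$some_item._id",
            "avg_price": { "$avg": "$price_information.hourly_rates.price" },
            "timestamp": { "$first": "$timestamp" }   // earliest date in the range, thanks to the $sort
        }
    }
]);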

How to calculate a sum of specific documents using aggregation?

I have a schema representing a message thread. So each document in the mongo database looks something like:
{
id: "thread_id",
participants: ["user1", "user2"],
unReadMessageCounts: [
{
participant: "user1",
count: 5
},
{
participant: "user2",
count: 3
}
]
}
What I want to do is get a sum of all unread messages counts for a given user - say, "user2". I know I could do this by just doing a find() on the collection and then writing a function to sum up to counts for a given user. But I'd like to use mongo's aggregate functionality if possible. I know I can do a match to first select all threads in which "user2" is a participant, but then how do I construct the group and/or sum expressions to pull out the right field from the document?
Use the following aggregation pipeline to get the desired result. The initial step filters the incoming documents to only those with a "user2" participant by way of the $match operator.
The next pipeline stage then "denormalizes" the unReadMessageCounts array through the $unwind operator, which outputs 2 documents from the array for each input document (in your above sample data).
Further filtering is necessary to aggregate data for the correct participant, and this is done through another $match pipeline step.
The final aggregation operation using $group specifies a group _id of null, calculating the total counts for all documents in the pipeline using the accumulator operator $sum on the "unReadMessageCounts.count" field.
So, running this aggregation pipeline on the sample data given:
db.collection.aggregate([
{
"$match": { "unReadMessageCounts.participant": "user2" }
},
{ "$unwind" : "$unReadMessageCounts" },
{
"$match": { "unReadMessageCounts.participant": "user2" }
},
{
"$group": {
"_id": null,
"total": { "$sum": "$unReadMessageCounts.count" }
}
}
])
will yield the result:
/* 0 */
{
"result" : [
{
"_id" : null,
"total" : 3
}
],
"ok" : 1
}
You can use the $redact operator as shown here to limit the size of the documents to process in the pipeline, then $unwind your documents, and in the $group stage use the $sum accumulator operator to return the total of unread messages for "user2".
db.collection.aggregate([
{ "$match": {
"unReadMessageCounts": {
"$elemMatch": { "participant": "user2" }
}
}},
{ "$redact": {
"$cond": [
{ "$or": [
{ "$eq": [ "$participant", "user2" ] },
{ "$not" : "$participant" }
]},
"$$DESCEND",
"$$PRUNE"
]
}},
{ "$unwind": "$unReadMessageCounts" },
{ "$group": {
"_id": null,
"total": { "$sum": "$unReadMessageCounts.count" }
}}
])

How to optimize mongoDB query?

I have the following sample document in MongoDB:
{
"location" : {
"language" : null,
"country" : "null",
"city" : "null",
"state" : null,
"continent" : "null",
"latitude" : "null",
"longitude" : "null"
},
"request" : [
{
"referrer" : "direct",
"url" : "http://www.google.com/",
"title" : "index page",
"currentVisit" : "1401282897",
"visitedTime" : "1401282905"
},
{
"referrer" : "direct",
"url" : "http://www.stackoverflow.com/",
"title" : "index page",
"currentVisit" : "1401282900",
"visitedTime" : "1401282905"
},
......
],
"uuid" : "109eeee0-e66a-11e3"
}
Note:
The database contains more than 10845 documents.
Each document contains nearly 100 requests (100 objects in the request array).
Technology/Language - node.js
I used setProfiling to check the execution times:
First Query - 13899ms
Second Query - 9024ms
Third Query - 8310ms
Fourth Query - 6858ms
There is not much difference when using indexing.
Queries:
I am executing the following aggregation queries to fetch the data.
var match = {"request.currentVisit":{$gte:core.getTime()[1].toString(),$lte:core.getTime()[0].toString()}};
For Example: var match = {"request.currentVisit":{$gte:"1401282905",$lte:"1401282935"}};
For the third and fourth queries, request.visitedTime is used instead of request.currentVisit.
First
[
{ "$project":{
"request.currentVisit":1,
"request.url":1
}},
{ "$match":{
"request.1": {$exists:true}
}},
{ "$unwind": "$request" },
{ "$match": match },
{ "$group": {
"_id": {
"url":"$request.url"
},
"count": { "$sum": 1 }
}},
{ "$sort":{ "count": -1 } }
]
Second
[
{ "$project": {
"request.currentVisit":1,
"request.url":1
}},
{ "$match": {
"request":{ "$size": 1 }
}},
{ "$unwind": "$request" },
{ "$match": match },
{ "$group": {
"_id":{
"url":"$request.url"
},
"count":{ "$sum": 1 }
}},
{ "$sort": { "count": -1} }
]
Third
[
{ "$project": {
"request.visitedTime":1,
"uuid":1
}},
{ "$match":{
"request.1": { "$exists": true }
}},
{ "$match": match },
{ "$group": {
"_id": "$uuid",
"count":{ "$sum": 1 }
}},
{ "$group": {
"_id": null,
"total": { "$sum": "$count" }
}}
]
Forth
[
{ "$project": {
"request.visitedTime":1,
"uuid":1
}},
{ "$match":{
"request":{ "$size": 1 }
}},
{ "$match": match },
{ "$group": {
"_id":"$uuid",
"count":{ "$sum": 1 }
}},
{ "$group": {
"_id":null,
"total": { "$sum": "$count" }
}}
]
Problem:
It is taking more than 38091 ms to fetch the data.
Is there any way to optimize the query?
Any suggestion would be appreciated.
Well there are a few problems and you definitely need indexes, but you cannot have compound ones. It is the "timestamp" values that you are querying within the array that you want to index. It would also be advised that you either convert these to numeric values rather than the current strings, or indeed to BSON Date types. The latter form is actually internally stored as a numeric timestamp value, so there is a general storage size reduction, which also reduces the index size as well as being more efficient to match on the numeric values.
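If you do decide to convert, a one-off migration in the legacy mongo shell could look something like the sketch below. Here "events" is only a placeholder for your actual collection name, and it assumes every currentVisit/visitedTime value is a string of epoch seconds; your query values would then need to become Date (or numeric) values as well:
// Sketch only: converts the string timestamps in each request entry to BSON Dates
db.events.find().forEach(function(doc) {
    doc.request.forEach(function(r) {
        r.currentVisit = new Date(parseInt(r.currentVisit, 10) * 1000);
        r.visitedTime  = new Date(parseInt(r.visitedTime, 10) * 1000);
    });
    db.events.save(doc);   // write the converted document back
});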
The big problem with each query is that you always dive into the "array" contents only after processing an $unwind, and then "filter" that with a $match. While this is what you want to do for your result, since you have not applied the same filter at an earlier stage, you have many documents in the pipeline that do not match these conditions when you $unwind. The result is "lots" of documents you do not need being processed in this stage. And here you cannot use an index.
Where you need this match is at the start of the pipeline stages. This narrows down the documents to the "possible" matches before the actual array is filtered.
So using the first as an example:
[
    { "$match": {
        "request.currentVisit": {
            "$gte": "1401282905", "$lte": "1401282935"
        }
    }},
    { "$unwind": "$request" },
    { "$match": {
        "request.currentVisit": {
            "$gte": "1401282905", "$lte": "1401282935"
        }
    }},
    { "$group": {
        "_id": {
            "url": "$request.url"
        },
        "count": { "$sum": 1 }
    }},
    { "$sort": { "count": -1 } }
]
So a few changes. There is a $match at the head of the pipeline. This narrows down documents and is able to use an index. That is the most important performance consideration. Golden rule, always "match" first.
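For reference, the indexes backing that initial $match would be single-field multikey indexes on the timestamp fields being queried. A sketch only, with "events" again standing in for your actual collection name:
db.events.createIndex({ "request.currentVisit": 1 })
db.events.createIndex({ "request.visitedTime": 1 })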
The $project you had in there was redundant, as you cannot project "just" the fields of an array that is not yet unwound. There is also a misconception that people should $project first to reduce the pipeline. The effect is very minimal: if there is in fact a later $project or $group statement that actually limits the fields, then this will be "forward optimized" and fields do get taken out of the pipeline processing for you. Still, the $match statement above does more to optimize.
The other $match stage that checked whether the array is actually there can be dropped, as you are now "implicitly" doing that at the start of the pipeline. If more conditions make you more comfortable, then add them to that initial pipeline stage.
The rest remains unchanged, as you then $unwind the array and $match to filter the items that you actually want before moving on to your remaining processing. By now, the input documents have been significantly reduced, or reduced as much as they are going to be.
The other alternative, available with MongoDB 2.6 and greater, is to "filter" the array content before you even $unwind it. This would produce a listing like this:
[
    { "$match": {
        "request.currentVisit": {
            "$gte": "1401282905", "$lte": "1401282935"
        }
    }},
    { "$project": {
        "request": {
            "$setDifference": [
                {
                    "$map": {
                        "input": "$request",
                        "as": "el",
                        "in": {
                            "$cond": [
                                {
                                    "$and": [
                                        { "$gte": [ "$$el.currentVisit", "1401282905" ] },
                                        { "$lte": [ "$$el.currentVisit", "1401282935" ] }
                                    ]
                                },
                                "$$el",
                                false
                            ]
                        }
                    }
                },
                [false]
            ]
        }
    }},
    { "$unwind": "$request" },
    { "$group": {
        "_id": {
            "url": "$request.url"
        },
        "count": { "$sum": 1 }
    }},
    { "$sort": { "count": -1 } }
]
That may save you something by being able to "filter" the array before the $unwind, which is possibly better than doing the $match afterwards.
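As a side note, on MongoDB 3.2 and newer the $filter operator expresses the same array filtering far more directly than the $map/$setDifference combination; a sketch of just that $project stage under the same conditions:
{ "$project": {
    "request": {
        "$filter": {
            "input": "$request",
            "as": "el",
            "cond": {
                "$and": [
                    { "$gte": [ "$$el.currentVisit", "1401282905" ] },
                    { "$lte": [ "$$el.currentVisit", "1401282935" ] }
                ]
            }
        }
    }
}}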
But this is the general rule for all of your statements. You need usable indexes and you need to $match first.
It is possible that the actual results you really want could be obtained in a single query, but as it stands your question is not presented that way. Try changing your processing as outlined, and you should see a notable improvement.
If you are still then trying to come to terms with how this could possibly be singular, then you can always ask another question.

Mongodb $cond in aggregation framework

I have a collection with documents that look like the following:
{
ipAddr: '1.2.3.4',
"results" : [
{
"Test" : "Sight",
"Score" : "FAIL",
"Reason" : "S1002"
},
{
"Test" : "Speed",
"Score" : "FAIL",
"Reason" : "85"
},
{
"Test" : "Sound",
"Score" : "FAIL",
"Reason" : "A1001"
}
],
"finalGrade" : "FAILED"
}
Here's the aggregation query I'm trying to write. What I want to do (see the commented-out piece) is create a grouped field, per ipAddr, of the 'Reason / Error' code, but only if the Reason code begins with a specific letter, and only add each code once. I tried the following:
db.aggregate([
{$group:
{ _id: "$ipAddr",
attempts: {$sum:1},
results: {$push: "$finalGrade"},
// errorCodes: {$addToSet: {$cond: ["$results.Reason": /[A|B|S|N.*/, "$results.Reason", ""]}},
finalResult: {$last: "$finalGrade"} }
}
]);
Everything works, excluding the commented-out 'errorCodes' line. The logic I'm attempting to create is:
"Add to the errorCodes set the value of the results.Reason code IF it begins with an A, B, S, or N; otherwise there is nothing to add."
For the record above, the errorCodes set should contain:
...
errorCodes: [S1002,A1001],
...
The commented-out line does not work because the condition given to $cond there is not a valid aggregation expression: a bare regular expression cannot be used as a boolean in that position. $project is the phase where you can transform the original document based on $cond expressions (among other things).
You need two steps in the aggregation pipeline before you can $group: first you need to $unwind the results array, and next you need to $match to filter out the results you don't care about.
That would do the simple thing of just throwing out the results with error codes you don't care about keeping, but it sounds like you want to count the total number of failures including all error codes, and then only add particular ones to the output array? There isn't a straightforward way to do that; you would have to make two $unwind/$group passes in the pipeline.
Something similar to this will do it:
db.aggregate([
{$unwind : "$results"},
{$group:
{ _id: "$ipAddr",
attempts: {$sum:1},
results: {$push : "$results"},
finalGrade: {$last : "$finalGrade" }
}
},
{$unwind: "$results"},
{$match: {"results.Reason":/yourMatchExpression/} },
{$group:
{ _id: "$ipAddr",
attempts: {$last:"$attempts"},
errorCodes: {$addToSet: "$results.Reason"},
finalResult: {$last: "$finalGrade"}
}
}
]);
If you only want to count attempts that have the matching error code then you can do that with a single $group - you will need to do $unwind, $match and $group. You could use $project with $cond as you had it, but then your array of errorCodes will have an empty string entry along with all the proper error codes.
As of Mongo 2.4, $regex can be used for pattern matching, but not as an expression returning a boolean, which is what's required by $cond
You can then either use a $match stage with the $regex operator:
http://mongotry.herokuapp.com/#?bookmarkId=52fb39e207fc4c02006fcfed
[
{
"$unwind": "$results"
},
{
"$match": {
"results.Reason": {
"$regex": "[SA].*"
}
}
},
{
"$group": {
"_id": "$ipAddr",
"attempts": {
"$sum": 1
},
"results": {
"$push": "$finalGrade"
},
"finalResult": {
"$last": "$finalGrade"
},
"errorCodes": {
"$addToSet": "$results.Reason"
}
}
}
]
or you can use $substr, as your pattern matching is very simple:
http://mongotry.herokuapp.com/index.html#?bookmarkId=52fb47bc7f295802001baa38
[
{
"$unwind": "$results"
},
{
"$group": {
"_id": "$ipAddr",
"errorCodes": {
"$addToSet": {
"$cond": [
{
"$or": [
{
"$eq": [
{
"$substr": [
"$results.Reason",
0,
1
]
},
"A"
]
},
{
"$eq": [
{
"$substr": [
"$results.Reason",
0,
1
]
},
"S"
]
}
]
},
"$results.Reason",
"null"
]
}
}
}
}
]
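As a closing aside, MongoDB 4.2 and later added the $regexMatch aggregation operator, which does return a boolean and so can drive $cond directly; a sketch of just the $group stage, keeping the same "null" placeholder caveat as the $substr variant above:
{
    "$group": {
        "_id": "$ipAddr",
        "errorCodes": {
            "$addToSet": {
                "$cond": [
                    { "$regexMatch": { "input": "$results.Reason", "regex": /^[ABSN]/ } },
                    "$results.Reason",
                    "null"
                ]
            }
        }
    }
}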