MongoDB - average of a feature after slicing the max of another feature in a group of documents - mongodb

I am very new in mongodb and trying to work around a couple of queries, which I am not even sure if they 're feasible.
The structure of each document is:
{
"_id" : {
"$oid": Text
},
"grade": Text,
"type" : Text,
"score": Integer,
"info" : {
"range" : NumericText,
"genre" : Text,
"special": {keys:values}
}
};
The first query would give me:
per grade (thinking I have to group by "grade")
the highest range (thinking I have to call $max:$range, it should work with a string)
the score average (thinking I have to call $avg:$score)
I tried something like the following, which apparently is wrong:
collection.aggregate([{
'$group': {'_id':'$grade',
'highest_range': {'$max':'$info',
'average_score': {'$avg':'$score'}}}
}])
The second query would give the distinct genre records.
Any help is valuable!
ADDITION - providing an example of the document and the output:
{
"_id" : {
"$oid": '60491ea71f8'
},
"grade": D,
"type" : Shop,
"score": 4,
"info" : {
"range" : "2",
"genre" : 'Pet shop',
"special": {'ClientsParking':True,
'AcceptsCreditCard':True,
'BikeParking':False}
}
};
And the output I am looking into is something within lines:
[{grade: A, "highest_range":"4", "average_score":3.5},
{grade: B, "highest_range":"7", "average_score":8.3},
{grade: C, "highest_range":"3", "average_score":2.4}]

I think you are looking for this:
db.collection.aggregate([
{
'$group': {
'_id': '$grade',
'highest_range': { '$max': '$info.range' },
'average_score': { '$avg': '$score' }
}
}
])
However, $min, $max, $avg works only on numbers, not strings.
You could try { '$first': '$info.range' } or { '$last': '$info.range' }. But it requires $sort for proper result. Not clear what you mean by "highest range".

Related

MongoDB unwind output, grabbing one string

"_id": "Long_ID_Stuff"
"GFV" : "user001"
"hf": "NA"
"h" : {
"totalSamples" : 16,
"hist" : [
["US",16]]
"newEvent" :[
["US", NumberLong("654654654654")]
]
}
I am trying to pull out just the "US" portion of this document in a query and so far it has been giving me nothing.
My query thus far is:
db.x_collection.aggregate([{$unwind :"$h.hist"},{$match : { m:"TOP_COUNTRIES"}},{$match: {"h.lastUpdate":{$gt:1446336000000}}},{$match: {"h.hist":"US"}}]).pretty()
Do I need to do a $unwind: $h, then $unwind: $h.hist?
Without a little more information this is my best guess at what you are looking for. Given that the collection you are aggregating is the "h" collection.
db.x_collection.aggregate([
{ $unwind : "$h.hist"},
{ $match :
{ h.hist : "US" },
{ lastUpdate: { $gt:1446336000000 }},
{ m: "TOP_COUNTRIES" }
}
]).pretty();
if "h" is an array you will need to add this:
{ $unwind : $h },

Compare a date of two elements

My problem is difficult to explain :
In my website I save every action of my visitors (view, click, buy etc).
I have a simple collection named "flow" where my data is registered
{
"_id" : ObjectId("534d4a9a37e4fbfc0bf20483"),
"profile" : ObjectId("534bebc32939ffd316a34641"),
"activities" : [
{
"id" : ObjectId("534bebc42939ffd316a3af62"),
"date" : ISODate("2013-12-13T22:39:45.808Z"),
"verb" : "like",
"product" : "5"
},
{
"id" : ObjectId("534bebc52939ffd316a3f480"),
"date" : ISODate("2013-12-20T19:19:10.098Z"),
"verb" : "view",
"product" : "6"
},
{
"id" : ObjectId("534bebc32939ffd316a3690f"),
"date" : ISODate("2014-01-01T07:11:44.902Z"),
"verb" : "buy",
"product" : "5"
},
{
"id" : ObjectId("534bebc42939ffd316a3741b"),
"date" : ISODate("2014-01-11T08:49:02.684Z"),
"verb" : "favorite",
"product" : "26"
}
]
}
I would like to aggregate these data to retrieve the number of people who made an action (for example "view") and then another later in time (for example "buy"). To to that I need to compare "date" inside my "activities" array...
I tried to use aggregation framework to do that but I do not see how too make this request
This is my beginning :
db.flows.aggregate([
{ $project: { profile: 1, activities: 1, _id: 0 } },
{ $match: { $and: [{'activities.verb': 'view'}, {'activities.verb': 'buy'}] }}, //First verb + second verb
{ $unwind: '$activities' },
{ $match: { 'activities.verb': {$in:['view', 'buy']} } }, //First verb + second verb,
{
$group: {
_id: '$profile',
view: { $push: { $cond: [ { $eq: [ "$activities.verb", "view" ] } , "$activities.date", null ] } },
buy: { $push: { $cond: [ { $eq: [ "$activities.verb", "buy" ] } , "$activities.date", null ] } }
}
}
])
Maybe the format of my collection "flow" is not the best to do what I want...If you have any better idea dont hesitate
Thank you for your help !
Here is the aggregation that will give you the total number of buyers who viewed first and then bought (though not necessarily the same product that they viewed).
db.flow.aggregate(
{$match: {"activities.verb":{$all:["view","buy"]}}},
{$unwind :"$activities"},
{$match: {"activities.verb":{$in:["view","buy"]}}},
{$group: {
_id:"$_id",
firstViewed:{$min:{$cond:{
if:{$eq:["$activities.verb","view"]},
then : "$activities.date",
else : new Date(9999,0,1)
}}},
lastBought: {$max:{$cond:{
if:{$eq:["$activities.verb","buy"]},
then:"$activities.date",
else:new Date(1900,0,1)}
}}}
},
{$project: {viewedThenBought:{$cond:{
if:{$gt:["$lastBought","$firstViewed"]},
then:1,
else:0
}}}},
{$group:{_id:null,totalViewedThenBought:{$sum:"$viewedThenBought"}}}
)
Here you first pass through the pipeline only the documents that have all the "verbs" you are interested in. When you group the first time, you want to use the earliest "view" and the last "buy" and the next project compares them to see if they viewed before they bought.
The last step gives you the count of all the people who satisfied your criteria.
Be careful to leave out all $project phases that don't actually compute any new fields (like you very first $project). The aggregation framework is smart enough to never pass through any fields that it sees are not used in any later stages, so there is never a need to $project just to "eliminate" fields as that will happen automatically.
For your query:
I would like to aggregate these data to retrieve the number of people who made an action
Try this:
db.flows.aggregate([
// De-normalize the array into individual documents
{"$unwind" : "$activities"},
// Match for the verbs you are interested in
{"$match" : {"activities.verb":{$in:["buy", "view"]}}},
// Group by verb to get the count
{"$group" : {_id:"$activities.verb", count:{$sum:1}}}
])
The above query would produce an output like:
{
"result" : [
{
"_id" : "buy",
"count" : 1
},
{
"_id" : "view",
"count" : 1
}
],
"ok" : 1
}
Note: The $and operator in your query ({ $match: { $and: [{'activities.verb': 'view'}, {'activities.verb': 'buy'}] }}) is not required as that's the default if you specify multiple conditions. Only if you need a logical OR, $or operator is required.
If you want to use the date in the aggregation query to do queries like how many "views by day", etc.. the Date Aggregation Operators will come in handy.
I see where you are going with this and I think you are basically on the right track. So more or less un-altered (but for formatting preference) and the few tweeks at the end:
db.flows.aggregate([
// Try to $match "first" always to make sure you can get an index
{ "$match": {
"$and": [
{"activities.verb": "view"},
{"activities.verb": "buy"}
]
}},
// Don't worry, the optimizer "sees" this and will sort of "blend" with
// with the first stage.
{ "$project": {
"profile": 1,
"activities": 1,
"_id": 0
}},
{ "$unwind": "$activities" },
{ "$match": {
"activities.verb": { "$in":["view", "buy"] }
}},
{ "$group": {
"_id": "$profile",
"view": { "$min": { "$cond": [
{ "$eq": [ "$activities.verb", "view" ] },
"$activities.date",
null
]}},
"buy": { "$max": { "$cond": [
{ "$eq": [ "$activities.verb", "buy" ] },
"$activities.date",
null
]}}
}},
{ "$project": {
"viewFirst": { "$lt": [ "$view", "$buy" ] }
}}
])
So essentially the $min and $max operators should be self explanatory in the context in that you should be looking for the "first" view to correspond with the "last" purchase. As for me, and would make sense, you would actually be matching these by product (but hint: "Grouping") but I'll leave that part up to you.
The other advantage here is that the false values will always be negated if there is an actual date to match the "verb". Otherwise this goes through as false and this turns out to be okay.
That is because the next thing you do is $project to "compare" the values and ask the question "Did the 'view' happen before the 'buy'?" which is a logical evaluation of the "less than" $lt operator.
As for the schema itself. If you are storing a lot of these "events" then you are probably better off flattening things out into separate documents and finding some way to mark each with the same "session" identifier if that is separate to "profile".
Getting away from large arrays ( which this seems to lead to ) if likely going to help performance, and with care, makes little different to the aggregation process.

MongoDB group aggregation of multiple elements and sub-documents

I'm new to MongoDB and am trying to figure out how to group based on two elements, one of which is time and the other is a sub-document. My data structure is based on the cube structure:
{
"_id" : ObjectId("52d931f9f61313b46bf456b0"),
"type" : "build",
"time" : ISODate("2014-01-17T01:27:18.413Z"),
"data" : {
"build_number" : 7,
"build_duration" : 885843,
"build_url" : "job/Test_Job/7/",
"build_project_name" : "Test_Job",
"build_result" : "SUCCESS"
}
}
I was able to get some Stackoverflow help grouping when my structure was flat, but I'm having trouble with the data sub-document. Here is one of many query variations I have tried:
db.nb.aggregate(
{
$group: {
_id: {
dayOfMonth: { $dayOfMonth: "$time" },
build_project_name: { data: $build_project_name }
},
build_duration: { $avg: data: { "$build_duration" } }
},
}
)
I've tried many different variations on the syntax, but can't seem to get it quite right. Thank you in advance.
I think pretty much you want to do this:
db.nb.aggregate(
[
{$group:
{_id:
{ dayOfMonth: { $dayOfMonth: "$time" },
build_project_name: "$data.build_project_name"
},
build_duration: { $avg: "$data.build_duration" }}
}
])
First, remember aggregate receives an array of operation for input:
db.collection.aggregate([
{...},
{...}
])
Second, references to sub-documents are represented like a tree, so $data.buildduration points to node data "field" builduration inside of data.

MongoDB 2.2 Aggregation Framework group by field name

Is it possible to group-by field name? Or do I need a different structure so I can group-by value?
I know we can use group by on values and we can unwind arrays, but is it possible to get total apples, pears and oranges owned by John amongst the three houses here without specifying "apples", "pears" and "oranges" explicitly as part of the query? (so NOT like this);
// total all the fruit John has at each house
db.houses.aggregate([
{
$group: {
_id: null,
"apples": { $sum: "$people.John.items.apples" },
"pears": { $sum: "$people.John.items.pears" },
"oranges": { $sum: "$people.John.items.oranges" },
}
},
])
In other words, can I group-by the first field-name under "items" and get the aggregate sum of apples:104, pears:202 and oranges:306, but also bananas, melons and anything else that might be there? Or do I need to restructure the data into an array of key/value pairs like categories?
db.createCollection("houses");
db.houses.remove();
db.houses.insert(
[
{
House: "birmingham",
categories : [
{
k : "location",
v : { d : "central" }
}
],
people: {
John: {
items: {
apples: 2,
pears: 1,
oranges: 3,
}
},
Dave: {
items: {
apples: 30,
pears: 20,
oranges: 10,
},
},
},
},
{
House: "London", categories: [{ k: "location", v: { d: "central" } }, { k: "type", v: { d: "rented" } }],
people: {
John: { items: { apples: 2, pears: 1, oranges: 3, } },
Dave: { items: { apples: 30, pears: 20, oranges: 10, }, },
},
},
{
House: "Cambridge", categories: [{ k: "type", v: { d: "rented" } }],
people: {
John: { items: { apples: 100, pears: 200, oranges: 300, } },
Dave: { items: { apples: 0.3, pears: 0.2, oranges: 0.1, }, },
},
},
]
);
Secondly, and more importantly, could I then also group by "house.categories.k" ? In other words, is it possible to find out how many "apples" "John" has in "rented" vs "owned" or "friends" houses (so group by "categories.k.type")?
Finally - if this is even possible, is it sensible? At first I thought it was quite useful to create dictionaries of nested objects using actual field names of the object, as it seemed a logical use of a document database, and it seemed to make the MR queries easier to write vs arrays, but now I'm starting to wonder if this is all a bad idea and having variable field names makes it very tricky/inefficient to write aggregation queries.
OK, so I think I have this partially solved. At least for the shape of data in the initial question.
// How many of each type of fruit does John have at each location
db.houses.aggregate([
{
$unwind: "$categories"
},
{
$match: { "categories.k": "location" }
},
{
$group: {
_id: "$categories.v.d",
"numberOf": { $sum: 1 },
"Total Apples": { $sum: "$people.John.items.apples" },
"Total Pears": { $sum: "$people.John.items.pears" },
}
},
])
which yields;
{
"result" : [
{
"_id" : "central",
"numberOf" : 2,
"Total Apples" : 4,
"Total Pears" : 2
}
],
"ok" : 1
}
Note that there's only "central", but if I had other "location"s in my DB I'd get a range of totals for each location. I wouldn't need the $unwind step if I had named properties instead of an array for "categories", but this is where I find the structure is at odds with itself. There are several keywords likely under "categories". The sample data shows "type" and "location" but there could be around 10 of these categorizations all with different values. So if I used named fields;
"categories": {
location: "london",
type: "owned",
}
...the problem I then have is indexing. I can't afford to simply index "location" since those are user-defined categories, and if 10,000 users choose 10,000 different ways of categorizing their houses I'd need 10,000 indexes, one for each field. But by making it an array I only need one on the array field itself. The downside is the $unwind step. I ran into this before with MapReduce. The last thing you want to be doing is a ForEach loop in JavaScript to cycle an array if you can help it. What you really want is to filter out the fields by name because it's much quicker.
Now this is all well and good where I already know what fruit I'm looking for, but if I don't, it's much harder. I can't (as far as I can see) $unwind or otherwise ForEach "people.John.items" here. If I could, I'd be overjoyed. So since the names of fruit are again user-defined, it looks like I need to convert them to an array as well, like this;
{
"people" : {
"John" : {
"items" : [
{ k:"apples", v:100 },
{ k:"pears", v:200 },
{ k:"oranges", v:300 },
]
},
}
}
So that now allows me get the fruit (where I don't know which fruit to look for) totalled, again by location;
db.houses.aggregate([
{
$unwind: "$categories"
},
{
$match: { "categories.k": "location" }
},
{
$unwind: "$people.John.items"
},
{
$group: { // compound key - thanks to Jenna
_id: { fruit:"$people.John.items.k", location:"$categories.v.v" },
"numberOf": { $sum: 1 },
"Total Fruit": { $sum: "$people.John.items.v" },
}
},
])
So now I'm doing TWO $unwinds. If you're thinking that looks grotesquely ineffecient, you'd be right. If I have just 10,000 house records, with 10 categories each, and 10 types of fruit, this query takes half a minute to run.
OK, so I can see that moving the $match before the $unwind improves things significantly but then it's the wrong output. I don't want an entry for every category, I want to filter out just the "location" categories.
I would have made this comment, but it's easier to format in a response text box.
{ _id: 1,
house: "New York",
people: {
John: {
items: {apples: 1, oranges:2}
}
Dave: {
items: {apples: 2, oranges: 1}
}
}
}
{ _id: 2,
house: "London",
people: {
John: {
items: {apples: 3, oranges:2}
}
Dave: {
items: {apples: 1, oranges:3}
}
}
}
Just to make sure I understand your question, is this what you're trying to accomplish?
{location: "New York", johnFruit:3}
{location: "London", johnFruit: 5}
Since categories is not nested under house, you can't group by "house.categories.k", but you can use a compound key for the _id of $group to get this result:
{ $group: _id: {house: "$House", category: "$categories.k"}
Although "k" doesn't contain the information that you're presumably trying to group by. And as for "categories.k.type", type is the value of k, so you can't use this syntax. You would have to group by "categories.v.d".
It may be possible with your current schema to accomplish this aggregation using $unwind, $project, possibly $match, and finally $group, but the command won't be pretty. If possible, I would highly recommend restructuring your data to make this aggregation much simpler. If you would like some help with schema, please let us know.
I'm not sure if this is a possible solution, but what if you begin the aggregation process by determining the number of different locations using distinct(), and run separate aggregation commands for each location? distinct() may not be efficient, but every subsequent aggregation will be able to use $match, and therefore, the index on categories. You could use the same logic to count the fruit for "categories.type".
{
"_id" : 1,
"house" : "New York",
"people" : {
"John" : [{"k" : "apples","v" : 1},{"k" : "oranges","v" : 2}],
"Dave" : [{"k" : "apples","v" : 2},{"k" : "oranges","v" : 1}]
},
"categories" : [{"location" : "central"},{"type" : "rented"}]
}
{
"_id" : 2,
"house" : "London",
"people" : {
"John" : [{"k" : "apples","v" : 3},{"k" : "oranges","v" : 2}],
"Dave" : [{"k" : "apples","v" : 3},{"k" : "oranges","v" : 1}]
},
"categories" : [{"location" : "suburb"},{"type" : "rented"}]
}
{
"_id" : 3,
"house" : "London",
"people" : {
"John" : [{"k" : "apples","v" : 0},{"k" : "oranges","v" : 1}],
"Dave" : [{"k" : "apples","v" : 2},{"k" : "oranges","v" : 4}]
},
"categories" : [{"location" : "central"},{"type" : "rented"}]
}
Run distinct() and iterate through the results by running aggregate() commands for each unique value of "categories.location":
db.agg.distinct("categories.location")
[ "central", "suburb" ]
db.agg.aggregate(
{$match: {categories: {location:"central"}}}, //the index entry is on the entire
{$unwind: "$people.John"}, //document {location:"central"}, so
{$group:{ //use this syntax to use the index
_id:"$people.John.k",
"numberOf": { $sum: 1 },
"Total Fruit": { $sum: "$people.John.v"}
}
}
)
{
"result" : [
{
"_id" : "oranges",
"numberOf" : 2,
"Total Fruit" : 3
},
{
"_id" : "apples",
"numberOf" : 2,
"Total Fruit" : 1
}
],
"ok" : 1
}

MongoDb aggregate: Select all group by x

I am trying to convert the following SQL-like statement into a mongo-query, using the new aggregation framework.
SELECT * FROM ...
GROUP BY class
So far I have managed to write the following, which works well - yet only a single field is selected / returned.
db.studentMarks.aggregate(
{
$project: {
class : 1 // Inclusion mode
}
},
{
$group: {
_id: "$class"
}
}
);
I have also attempted play with the $project pipelines exclusion mode, by adding a field name which never exists, in order to trick MongoDb to return all the fields. While the syntax is correct, no results are returned. E.g:
db.studentMarks.aggregate(
{
$project: {
noneExistingField : 0 // Exclusion mode...
// Attempt to trick mongo into returning all fields
// sadly this fails - empty array is returned.
}
},
{
$group: {
_id: "$class"
}
}
);
The reason why I need all fields to be returned, is that I will not know (in the future) what fields are going to be present or not. E.g. a "student" might have field x, y, z or not. Thus, I need to "select all", group the results by a single field and return it, for a generic purpose - doesn't have to be for "student marks", can be for any type of dataset, without knowing all the fields.
For the mentioned reason, I cannot simply project all the fields which must be returned - once again - because I will not know all the fields.
I hope that someone knows a good way of solving my problem, using the new aggregation framework.
Simply don't use the project operator in the pipeline.
db.studentMarks.aggregate({
$group: { _id: "$class" }
});
You can try group by all the fields push each to the array.
Try below.
Option1
`db.StudentMarks.aggregate(
{
$project: {
class : 1 // Inclusion mode
}
},
{ $group : {
_id : "$class"
,field2: {$push: "$field2"}
,field3: {$push: "$field3"}
,field4: {$push: "$field4"}
}}, $sort{field2:-1})`
I have not tried above, but below works for me.
Option2
`db.StudentMarks.aggregate(
{
{ $group : {
_id : "$class"
,field2: {$push: "$field2"}
,field3: {$push: "$field3"}
,field4: {$push: "$field4"}
}}, $sort{field2:-1})`
Result
`{
"result" : [
{
"_id" : 59,
"field2" : [
59,
59,
59
],
"field3" : [
ObjectId("51e7072086eretetterr0001"),
ObjectId("51e7076786cb343453535354"),
ObjectId("51e716598435353534345335")
],
"field4" : [
11111,
11111,
11111
]
},
{
"_id" : 58,
"field2" : [
58,
58,
58,
58,
58
],
"field3" : [
ObjectId("51e469ad3843534535353535"),
ObjectId("51e7178f86cb349845667775"),
ObjectId("51e71df286cb349842456gtg"),
ObjectId("51e71ea686cb345646466466"),
ObjectId("51e71eac86cb346546464646")
],
"field4" : [
11111,
11111,
11111,
11111,
11111
]
}
],
"ok" : 1
}`