MongoDb aggregate: Select all group by x

MongoDb aggregate: Select all group by x - mongodb

I am trying to convert the following SQL-like statement into a mongo-query, using the new aggregation framework.
SELECT * FROM ...
GROUP BY class
So far I have managed to write the following, which works well - yet only a single field is selected / returned.
db.studentMarks.aggregate(
{
$project: {
class : 1 // Inclusion mode
}
},
{
$group: {
_id: "$class"
}
}
);
I have also attempted play with the $project pipelines exclusion mode, by adding a field name which never exists, in order to trick MongoDb to return all the fields. While the syntax is correct, no results are returned. E.g:
db.studentMarks.aggregate(
{
$project: {
noneExistingField : 0 // Exclusion mode...
// Attempt to trick mongo into returning all fields
// sadly this fails - empty array is returned.
}
},
{
$group: {
_id: "$class"
}
}
);
The reason why I need all fields to be returned, is that I will not know (in the future) what fields are going to be present or not. E.g. a "student" might have field x, y, z or not. Thus, I need to "select all", group the results by a single field and return it, for a generic purpose - doesn't have to be for "student marks", can be for any type of dataset, without knowing all the fields.
For the mentioned reason, I cannot simply project all the fields which must be returned - once again - because I will not know all the fields.
I hope that someone knows a good way of solving my problem, using the new aggregation framework.

Simply don't use the project operator in the pipeline.
db.studentMarks.aggregate({
$group: { _id: "$class" }
});

You can try group by all the fields push each to the array.
Try below.
Option1
`db.StudentMarks.aggregate(
{
$project: {
class : 1 // Inclusion mode
}
},
{ $group : {
_id : "$class"
,field2: {$push: "$field2"}
,field3: {$push: "$field3"}
,field4: {$push: "$field4"}
}}, $sort{field2:-1})`
I have not tried above, but below works for me.
Option2
`db.StudentMarks.aggregate(
{
{ $group : {
_id : "$class"
,field2: {$push: "$field2"}
,field3: {$push: "$field3"}
,field4: {$push: "$field4"}
}}, $sort{field2:-1})`
Result
`{
"result" : [
{
"_id" : 59,
"field2" : [
59,
59,
59
],
"field3" : [
ObjectId("51e7072086eretetterr0001"),
ObjectId("51e7076786cb343453535354"),
ObjectId("51e716598435353534345335")
],
"field4" : [
11111,
11111,
11111
]
},
{
"_id" : 58,
"field2" : [
58,
58,
58,
58,
58
],
"field3" : [
ObjectId("51e469ad3843534535353535"),
ObjectId("51e7178f86cb349845667775"),
ObjectId("51e71df286cb349842456gtg"),
ObjectId("51e71ea686cb345646466466"),
ObjectId("51e71eac86cb346546464646")
],
"field4" : [
11111,
11111,
11111,
11111,
11111
]
}
],
"ok" : 1
}`

Related

MongoDB - average of a feature after slicing the max of another feature in a group of documents

I am very new in mongodb and trying to work around a couple of queries, which I am not even sure if they 're feasible.
The structure of each document is:
{
"_id" : {
"$oid": Text
},
"grade": Text,
"type" : Text,
"score": Integer,
"info" : {
"range" : NumericText,
"genre" : Text,
"special": {keys:values}
}
};
The first query would give me:
per grade (thinking I have to group by "grade")
the highest range (thinking I have to call $max:$range, it should work with a string)
the score average (thinking I have to call $avg:$score)
I tried something like the following, which apparently is wrong:
collection.aggregate([{
'$group': {'_id':'$grade',
'highest_range': {'$max':'$info',
'average_score': {'$avg':'$score'}}}
}])
The second query would give the distinct genre records.
Any help is valuable!
ADDITION - providing an example of the document and the output:
{
"_id" : {
"$oid": '60491ea71f8'
},
"grade": D,
"type" : Shop,
"score": 4,
"info" : {
"range" : "2",
"genre" : 'Pet shop',
"special": {'ClientsParking':True,
'AcceptsCreditCard':True,
'BikeParking':False}
}
};
And the output I am looking into is something within lines:
[{grade: A, "highest_range":"4", "average_score":3.5},
{grade: B, "highest_range":"7", "average_score":8.3},
{grade: C, "highest_range":"3", "average_score":2.4}]

I think you are looking for this:
db.collection.aggregate([
{
'$group': {
'_id': '$grade',
'highest_range': { '$max': '$info.range' },
'average_score': { '$avg': '$score' }
}
}
])
However, $min, $max, $avg works only on numbers, not strings.
You could try { '$first': '$info.range' } or { '$last': '$info.range' }. But it requires $sort for proper result. Not clear what you mean by "highest range".

Aggregate on array of embedded documents

I have a mongodb collection with multiple documents. Each document has an array with multiple subdocuments (or embedded documents i guess?). Each of these subdocuments is in this format:
{
"name": string,
"count": integer
}
Now I want to aggregate these subdocuments to find
The top X counts and their name.
Same as 1. but the names have to match a regex before sorting and limiting.
I have tried the following for 1. already - it does return me the top X but unordered, so I'd have to order them again which seems somewhat inefficient.
[{
$match: {
_id: id
}
}, {
$unwind: {
path: "$array"
}
}, {
$sort: {
'count': -1
}
}, {
$limit: x
}]
Since i'm rather new to mongodb this is pretty confusing for me. Happy for any help. Thanks in advance.

The sort has to include the array name in order to avoid an additional sort later on.
Given the following document to work with:
{
students: [{
count: 4,
name: "Ann"
}, {
count: 7,
name: "Brad"
}, {
count: 6,
name: "Beth"
}, {
count: 8,
name: "Catherine"
}]
}
As an example, the following aggregation query will match any name containing the letters "h" and "e". This needs to happen after the "$unwind" step in order to only keep the ones you need.
db.tests.aggregate([
{$match: {
_id: ObjectId("5c1b191b251d9663f4e3ce65")
}},
{$unwind: {
path: "$students"
}},
{$match: {
"students.name": /[he]/
}},
{$sort: {
"students.count": -1
}},
{$limit: 2}
])
This is the output given the above mentioned input:
{ "_id" : ObjectId("5c1b191b251d9663f4e3ce65"), "students" : { "count" : 8, "name" : "Catherine" } }
{ "_id" : ObjectId("5c1b191b251d9663f4e3ce65"), "students" : { "count" : 6, "name" : "Beth" } }
Both names contain the letters "h" and "e", and the output is sorted from high to low.
When setting the limit to 1, the output is limited to:
{ "_id" : ObjectId("5c1b191b251d9663f4e3ce65"), "students" : { "count" : 8, "name" : "Catherine" } }
In this case only the highest count has been kept after having matched the names.
=====================
Edit for the extra question:
Yes, the first $match can be changed to filter on specific universities.
{$match: {
university: "University X"
}},
That will give one or more matching documents (in case you have a document per year or so) and the rest of the aggregation steps would still be valid.
The following match would retrieve the students for the given university for a given academic year in case that would be needed.
{$match: {
university: "University X",
academic_year: "2018-2019"
}},
That should narrow it down to get the correct documents.

Project values of different columns into one field

{
"_id" : ObjectId("5ae84dd87f5b72618ba7a669"),
"main_sub" : "MATHS",
"reporting" : [
{
"teacher" : "ABC"
}
],
"subs" : [
{
"sub" : "GEOMETRIC",
"teacher" : "XYZ",
}
]
}
{
"_id" : ObjectId("5ae84dd87f5b72618ba7a669"),
"main_sub" : "SOCIAL SCIENCE",
"reporting" : [
{
"teacher" : "XYZ"
}
],
"subs" : [
{
"sub" : "CIVIL",
"teacher" : "ABC",
}
]
}
I have simplified the structure of the documents that i have.
The basic structure is that I have a parent subject with an array of reporting teachers and an array of sub-subjects(each having a teacher)
I now want to extract all the subject(parent/sub-subjects) along with the condition if they are sub-subjects or not which are taught by a particular teacher.
For eg:
for teacher ABC i want the following structure:
[{'subject':'MATHS', 'is_parent':'True'}, {'subject':'CIVIL', 'is_parent':'FALSE'}]
-- What is the most efficient query possible ..? I have tried $project with $cond and $switch but in both the cases I have had to repeat the conditional statement for 'subject' and 'is_parent'
-- Is it advised to do the computation in a query or should I get the data dump and then modify the structure in the server code? AS in, I could $unwind and get a mapping of the parent subjects with each sub-subject and then do a for loop.
I have tried
db.collection.aggregate(
{$unwind:'$reporting'},
{$project:{
'result':{$cond:[
{$eq:['ABC', '$reporting.teacher']},
"$main_sub",
"$subs.sub"]}
}}
)
then I realised that even if i transform the else part into another query for the sub-subjects I will have to write the exact same thing for the property of is_parent

You have 2 arrays, so you need to unwind both - the reporting and the subs.
After that stage each document will have at most 1 parent teacher-subj and at most 1 sub teacher-subj pairs.
You need to unwind them again to have a single teacher-subj per document, and it's where you define whether it is parent or not.
Then you can group by teacher. No need for $conds, $filters, or $facets. E.g.:
db.collection.aggregate([
{ $unwind: "$reporting" },
{ $unwind: "$subs" },
{ $project: {
teachers: [
{ teacher: "$reporting.teacher", sub: "$main_sub", is_parent: true },
{ teacher: "$subs.teacher", sub: "$subs.sub", is_parent: false }
]
} },
{ $unwind: "$teachers" },
{ $group: {
_id: "$teachers.teacher",
subs: { $push: {
subject: "$teachers.sub",
is_parent: "$teachers.is_parent"
} }
} }
])

MongoDB : limit query to a field and array projection

I have a collection that contains following information
{
"_id" : 1,
"info" : { "createdby" : "xyz" },
"states" : [ 11, 10, 9, 3, 2, 1 ]}
}
I project only states by using query
db.jobs.find({},{states:1})
Then I get only states (and whole array of state values) ! or I can select only one state in that array by
db.jobs.find({},{states : {$slice : 1} })
And then I get only one state value, but along with all other fields in the document as well.
Is there a way to select only "states" field, and at the same time slice only one element of the array. Of course, I can exclude fields but I would like to have a solution in which I can specify both conditions.

You can do this in two ways:
1> Using mongo projection like
<field>: <1 or true> Specify the inclusion of a field
and
<field>: <0 or false> Specify the suppression of the field
so your query as
db.jobs.find({},{states : {$slice : 1} ,"info":0,"_id":0})
2> Other way using mongo aggregation as
db.jobs.aggregate({
"$unwind": "$states"
}, {
"$match": {
"states": 11
}
}, // match states (optional)
{
"$group": {
"_id": "$_id",
"states": {
"$first": "$states"
}
}
}, {
"$project": {
"_id": 0,
"states": 1
}
})

Compare a date of two elements

My problem is difficult to explain :
In my website I save every action of my visitors (view, click, buy etc).
I have a simple collection named "flow" where my data is registered
{
"_id" : ObjectId("534d4a9a37e4fbfc0bf20483"),
"profile" : ObjectId("534bebc32939ffd316a34641"),
"activities" : [
{
"id" : ObjectId("534bebc42939ffd316a3af62"),
"date" : ISODate("2013-12-13T22:39:45.808Z"),
"verb" : "like",
"product" : "5"
},
{
"id" : ObjectId("534bebc52939ffd316a3f480"),
"date" : ISODate("2013-12-20T19:19:10.098Z"),
"verb" : "view",
"product" : "6"
},
{
"id" : ObjectId("534bebc32939ffd316a3690f"),
"date" : ISODate("2014-01-01T07:11:44.902Z"),
"verb" : "buy",
"product" : "5"
},
{
"id" : ObjectId("534bebc42939ffd316a3741b"),
"date" : ISODate("2014-01-11T08:49:02.684Z"),
"verb" : "favorite",
"product" : "26"
}
]
}
I would like to aggregate these data to retrieve the number of people who made an action (for example "view") and then another later in time (for example "buy"). To to that I need to compare "date" inside my "activities" array...
I tried to use aggregation framework to do that but I do not see how too make this request
This is my beginning :
db.flows.aggregate([
{ $project: { profile: 1, activities: 1, _id: 0 } },
{ $match: { $and: [{'activities.verb': 'view'}, {'activities.verb': 'buy'}] }}, //First verb + second verb
{ $unwind: '$activities' },
{ $match: { 'activities.verb': {$in:['view', 'buy']} } }, //First verb + second verb,
{
$group: {
_id: '$profile',
view: { $push: { $cond: [ { $eq: [ "$activities.verb", "view" ] } , "$activities.date", null ] } },
buy: { $push: { $cond: [ { $eq: [ "$activities.verb", "buy" ] } , "$activities.date", null ] } }
}
}
])
Maybe the format of my collection "flow" is not the best to do what I want...If you have any better idea dont hesitate
Thank you for your help !

Here is the aggregation that will give you the total number of buyers who viewed first and then bought (though not necessarily the same product that they viewed).
db.flow.aggregate(
{$match: {"activities.verb":{$all:["view","buy"]}}},
{$unwind :"$activities"},
{$match: {"activities.verb":{$in:["view","buy"]}}},
{$group: {
_id:"$_id",
firstViewed:{$min:{$cond:{
if:{$eq:["$activities.verb","view"]},
then : "$activities.date",
else : new Date(9999,0,1)
}}},
lastBought: {$max:{$cond:{
if:{$eq:["$activities.verb","buy"]},
then:"$activities.date",
else:new Date(1900,0,1)}
}}}
},
{$project: {viewedThenBought:{$cond:{
if:{$gt:["$lastBought","$firstViewed"]},
then:1,
else:0
}}}},
{$group:{_id:null,totalViewedThenBought:{$sum:"$viewedThenBought"}}}
)
Here you first pass through the pipeline only the documents that have all the "verbs" you are interested in. When you group the first time, you want to use the earliest "view" and the last "buy" and the next project compares them to see if they viewed before they bought.
The last step gives you the count of all the people who satisfied your criteria.
Be careful to leave out all $project phases that don't actually compute any new fields (like you very first $project). The aggregation framework is smart enough to never pass through any fields that it sees are not used in any later stages, so there is never a need to $project just to "eliminate" fields as that will happen automatically.

For your query:
I would like to aggregate these data to retrieve the number of people who made an action
Try this:
db.flows.aggregate([
// De-normalize the array into individual documents
{"$unwind" : "$activities"},
// Match for the verbs you are interested in
{"$match" : {"activities.verb":{$in:["buy", "view"]}}},
// Group by verb to get the count
{"$group" : {_id:"$activities.verb", count:{$sum:1}}}
])
The above query would produce an output like:
{
"result" : [
{
"_id" : "buy",
"count" : 1
},
{
"_id" : "view",
"count" : 1
}
],
"ok" : 1
}
Note: The $and operator in your query ({ $match: { $and: [{'activities.verb': 'view'}, {'activities.verb': 'buy'}] }}) is not required as that's the default if you specify multiple conditions. Only if you need a logical OR, $or operator is required.
If you want to use the date in the aggregation query to do queries like how many "views by day", etc.. the Date Aggregation Operators will come in handy.

I see where you are going with this and I think you are basically on the right track. So more or less un-altered (but for formatting preference) and the few tweeks at the end:
db.flows.aggregate([
// Try to $match "first" always to make sure you can get an index
{ "$match": {
"$and": [
{"activities.verb": "view"},
{"activities.verb": "buy"}
]
}},
// Don't worry, the optimizer "sees" this and will sort of "blend" with
// with the first stage.
{ "$project": {
"profile": 1,
"activities": 1,
"_id": 0
}},
{ "$unwind": "$activities" },
{ "$match": {
"activities.verb": { "$in":["view", "buy"] }
}},
{ "$group": {
"_id": "$profile",
"view": { "$min": { "$cond": [
{ "$eq": [ "$activities.verb", "view" ] },
"$activities.date",
null
]}},
"buy": { "$max": { "$cond": [
{ "$eq": [ "$activities.verb", "buy" ] },
"$activities.date",
null
]}}
}},
{ "$project": {
"viewFirst": { "$lt": [ "$view", "$buy" ] }
}}
])
So essentially the $min and $max operators should be self explanatory in the context in that you should be looking for the "first" view to correspond with the "last" purchase. As for me, and would make sense, you would actually be matching these by product (but hint: "Grouping") but I'll leave that part up to you.
The other advantage here is that the false values will always be negated if there is an actual date to match the "verb". Otherwise this goes through as false and this turns out to be okay.
That is because the next thing you do is $project to "compare" the values and ask the question "Did the 'view' happen before the 'buy'?" which is a logical evaluation of the "less than" $lt operator.
As for the schema itself. If you are storing a lot of these "events" then you are probably better off flattening things out into separate documents and finding some way to mark each with the same "session" identifier if that is separate to "profile".
Getting away from large arrays ( which this seems to lead to ) if likely going to help performance, and with care, makes little different to the aggregation process.