MongoDB group aggregation of multiple elements and sub-documents

I'm new to MongoDB and am trying to figure out how to group based on two elements, one of which is time and the other is a sub-document. My data structure is based on the cube structure:
{
"_id" : ObjectId("52d931f9f61313b46bf456b0"),
"type" : "build",
"time" : ISODate("2014-01-17T01:27:18.413Z"),
"data" : {
"build_number" : 7,
"build_duration" : 885843,
"build_url" : "job/Test_Job/7/",
"build_project_name" : "Test_Job",
"build_result" : "SUCCESS"
}
}
I was able to get some Stack Overflow help with grouping when my structure was flat, but I'm having trouble with the data sub-document. Here is one of the many query variations I have tried:
db.nb.aggregate(
{
$group: {
_id: {
dayOfMonth: { $dayOfMonth: "$time" },
build_project_name: { data: $build_project_name }
},
build_duration: { $avg: data: { "$build_duration" } }
},
}
)
I've tried many different variations on the syntax, but can't seem to get it quite right. Thank you in advance.

I think this is pretty much what you want:
db.nb.aggregate(
[
{$group:
{_id:
{ dayOfMonth: { $dayOfMonth: "$time" },
build_project_name: "$data.build_project_name"
},
build_duration: { $avg: "$data.build_duration" }}
}
])
First, remember that aggregate takes an array of pipeline stages as input:
db.collection.aggregate([
{...},
{...}
])
Second, references to sub-documents are written as a dotted path, like walking a tree, so "$data.build_duration" points to the build_duration field inside the data sub-document.
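To make the dotted-path idea concrete, here is a minimal plain-JavaScript sketch that needs no MongoDB. It only mimics what the $group stage does; resolvePath, averageDurationPerDay and the sample documents are made up for illustration and are not part of any MongoDB API.

```javascript
// Plain-JavaScript sketch of how a path like "data.build_duration" is
// resolved and how the $group stage buckets documents.

// Walk a dotted field path against a document, one key at a time.
function resolvePath(doc, path) {
  return path
    .split(".")
    .reduce((node, key) => (node == null ? undefined : node[key]), doc);
}

// Hypothetical sample documents shaped like the one in the question.
const builds = [
  { time: new Date("2014-01-17T01:27:18.413Z"), data: { build_project_name: "Test_Job", build_duration: 885843 } },
  { time: new Date("2014-01-17T09:00:00.000Z"), data: { build_project_name: "Test_Job", build_duration: 900001 } },
  { time: new Date("2014-01-18T03:00:00.000Z"), data: { build_project_name: "Test_Job", build_duration: 500000 } },
];

// Group by (day of month, project name) and average the duration.
function averageDurationPerDay(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const id = {
      dayOfMonth: doc.time.getUTCDate(), // $dayOfMonth uses UTC unless told otherwise
      build_project_name: resolvePath(doc, "data.build_project_name"),
    };
    const key = JSON.stringify(id); // compound _id used as the bucket key
    const group = groups.get(key) || { _id: id, sum: 0, count: 0 };
    group.sum += resolvePath(doc, "data.build_duration");
    group.count += 1;
    groups.set(key, group);
  }
  return [...groups.values()].map((g) => ({
    _id: g._id,
    build_duration: g.sum / g.count,
  }));
}

console.log(averageDurationPerDay(builds));
```

The two builds on the 17th end up in one bucket and the build on the 18th in another, which is the same shape the aggregation returns.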

Related

MongoDB - average of a feature after slicing the max of another feature in a group of documents

I am very new to MongoDB and am trying to work out a couple of queries, which I am not even sure are feasible.
The structure of each document is:
{
"_id" : {
"$oid": Text
},
"grade": Text,
"type" : Text,
"score": Integer,
"info" : {
"range" : NumericText,
"genre" : Text,
"special": {keys:values}
}
};
The first query would give me:
per grade (thinking I have to group by "grade")
the highest range (thinking I have to call $max:$range, it should work with a string)
the score average (thinking I have to call $avg:$score)
I tried something like the following, which apparently is wrong:
collection.aggregate([{
'$group': {'_id':'$grade',
'highest_range': {'$max':'$info',
'average_score': {'$avg':'$score'}}}
}])
The second query would give the distinct genre records.
Any help is valuable!
ADDITION - providing an example of the document and the output:
{
"_id" : {
"$oid": '60491ea71f8'
},
"grade": D,
"type" : Shop,
"score": 4,
"info" : {
"range" : "2",
"genre" : 'Pet shop',
"special": {'ClientsParking':True,
'AcceptsCreditCard':True,
'BikeParking':False}
}
};
And the output I am looking for is something along these lines:
[{grade: A, "highest_range":"4", "average_score":3.5},
{grade: B, "highest_range":"7", "average_score":8.3},
{grade: C, "highest_range":"3", "average_score":2.4}]
I think you are looking for this:
db.collection.aggregate([
{
'$group': {
'_id': '$grade',
'highest_range': { '$max': '$info.range' },
'average_score': { '$avg': '$score' }
}
}
])
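However, $avg only works on numbers, and on strings $min and $max compare lexicographically rather than numerically (for example, "10" sorts before "2"). Since range is stored as a string, you may want to convert it first, for example with $toInt (MongoDB 4.0 and later).
You could also try { '$first': '$info.range' } or { '$last': '$info.range' }, but those need a preceding $sort to give a meaningful result. It is not clear what you mean by "highest range".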
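The difference between comparing range values as strings and as numbers is easy to see in plain JavaScript. This is only an illustration with made-up values; for ASCII digits like these, JavaScript's string comparison orders them the same way a plain string comparison in MongoDB would.

```javascript
// Hypothetical "range" values stored as strings, as in the question.
const ranges = ["2", "10", "7"];

// Comparing as strings: "7" wins, because "1" < "2" < "7" character by character.
const maxAsString = ranges.reduce((a, b) => (a > b ? a : b));

// Comparing as numbers: convert first, then take the maximum.
const maxAsNumber = Math.max(...ranges.map(Number));

console.log(maxAsString, maxAsNumber);
```

If "highest range" means the numerically largest value, the values need to be converted to numbers before taking the maximum.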

Select latest document after grouping them by a field in MongoDB

I got a question that I would expect to be pretty simple, but I cannot figure it out. What I want to do is this:
Find all documents in a collection and:
sort the documents by a certain date field
apply distinct on one of its other fields, but return the whole document
Best shown in an example.
This is a mock input:
[
{
"commandName" : "migration_a",
"executionDate" : ISODate("1998-11-04T18:46:14.000Z")
},
{
"commandName" : "migration_a",
"executionDate" : ISODate("1970-05-09T20:16:37.000Z")
},
{
"commandName" : "migration_a",
"executionDate" : ISODate("2005-11-08T11:58:52.000Z")
},
{
"commandName" : "migration_b",
"executionDate" : ISODate("2016-06-02T19:48:34.000Z")
}
]
The expected output is:
[
{
"commandName" : "migration_a",
"executionDate" : ISODate("2005-11-08T11:58:52.000Z")
},
{
"commandName" : "migration_b",
"executionDate" : ISODate("2016-06-02T19:48:34.000Z")
}
]
Or, in other words:
Group the input data by the commandName field
Inside each group sort the documents
Return the newest document from each group
My attempts to write this query have failed:
The distinct() function will only return the value of the field I am distinct-ing on, not the whole document. That makes it unsuitable for my case.
I tried writing an aggregate query, but ran into the issue of how to sort and select a single document from inside each group. The sort aggregation stage will sort the groups among one another, which is not what I want.
I am not too well-versed in Mongo and this is where I hit a wall. Any ideas on how to continue?
For reference, this is the work-in-progress aggregation query I am trying to expand on:
db.getCollection('some_collection').aggregate([
{ $group: { '_id': '$commandName', 'docs': {$addToSet: '$$ROOT'} } },
{ $sort: {'_id.docs.???': 1}}
])
Post-resolved edit
Thank you for the answers. I got what I needed. For future reference, this is the full query that will do what was requested and also return a list of the filtered documents, not groups.
db.getCollection('some_collection').aggregate([
{ $sort: {'executionDate': 1}},
{ $group: { '_id': '$commandName', 'result': { $last: '$$ROOT'} } },
{ $replaceRoot: {newRoot: '$result'} }
])
The query result without the $replaceRoot stage would be:
[
{
"_id": "migration_a",
"result": {
"commandName" : "migration_a",
"executionDate" : ISODate("2005-11-08T11:58:52.000Z")
}
},
{
"_id": "migration_b",
"result": {
"commandName" : "migration_b",
"executionDate" : ISODate("2016-06-02T19:48:34.000Z")
}
}
]
The outer _id and result are just "group-wrappers" around the actual document I want, which is nested under the result key. Moving the nested document to the root of the result is done using the $replaceRoot stage. The query result when using that stage is:
[
{
"commandName" : "migration_a",
"executionDate" : ISODate("2005-11-08T11:58:52.000Z")
},
{
"commandName" : "migration_b",
"executionDate" : ISODate("2016-06-02T19:48:34.000Z")
}
]
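For anyone who wants to see the three stages without a database, here is a rough plain-JavaScript imitation of sort, group-with-$last and $replaceRoot. It is a sketch only: latestPerCommand is a made-up helper, and unlike this Map-based version, MongoDB does not guarantee the order of the groups that $group emits.

```javascript
// The mock input from the question, with ISODate replaced by Date.
const commands = [
  { commandName: "migration_a", executionDate: new Date("1998-11-04T18:46:14.000Z") },
  { commandName: "migration_a", executionDate: new Date("1970-05-09T20:16:37.000Z") },
  { commandName: "migration_a", executionDate: new Date("2005-11-08T11:58:52.000Z") },
  { commandName: "migration_b", executionDate: new Date("2016-06-02T19:48:34.000Z") },
];

function latestPerCommand(docs) {
  // { $sort: { executionDate: 1 } }: oldest first.
  const sorted = [...docs].sort((a, b) => a.executionDate - b.executionDate);

  // { $group: { _id: '$commandName', result: { $last: '$$ROOT' } } }:
  // a later document for the same key overwrites the earlier one.
  const lastByName = new Map();
  for (const doc of sorted) lastByName.set(doc.commandName, doc);
  const grouped = [...lastByName].map(([_id, result]) => ({ _id, result }));

  // { $replaceRoot: { newRoot: '$result' } }: unwrap the nested document.
  return grouped.map((g) => g.result);
}

console.log(latestPerCommand(commands));
```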
Try this:
db.getCollection('some_collection').aggregate([
{ $sort: {'executionDate': -1}},
{ $group: { '_id': '$commandName', 'doc': {$first: '$$ROOT'} } }
])
I believe this will result in what you're looking for:
db.collection.aggregate([
{
$group: {
"_id": "$commandName",
"executionDate": {
"$last": "$executionDate"
}
}
}
])
Of course, if you want to match your expected output exactly, you can add a sort (this may not be necessary since your goal is to simply return the newest document from each group):
{
$sort: {
"executionDate": 1
}
}
The use case in the question is nearly covered by the documentation for the $last aggregation operator.
In summary: the $group stage should follow a $sort stage so that the input documents arrive in a defined order, since $last simply picks the last document in each group.
Query:
db.collection.aggregate([
{
$sort: {
executionDate: 1
}
},
{
$group: {
_id: "$commandName",
executionDate: {
$last: "$executionDate"
}
}
}
]);
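Why the $sort matters can be shown with a tiny plain-JavaScript stand-in for $last. This is a sketch with made-up data and a made-up lastPerGroup helper, not MongoDB code: it keeps whatever value arrives last for each key, so the answer depends entirely on the input order.

```javascript
// Keep the value of valueField from the last document seen for each key.
function lastPerGroup(docs, keyField, valueField) {
  const last = new Map();
  for (const doc of docs) last.set(doc[keyField], doc[valueField]);
  return Object.fromEntries(last);
}

// Hypothetical input where the newest run happens to be stored first.
// ISO date strings are used so that string order equals chronological order.
const runs = [
  { commandName: "migration_a", executionDate: "2005-11-08" },
  { commandName: "migration_a", executionDate: "1970-05-09" },
];

// Without sorting, "last" just means "last in input order".
const withoutSort = lastPerGroup(runs, "commandName", "executionDate");

// Sorting ascending first makes "last" mean "newest".
const ascending = [...runs].sort((a, b) =>
  a.executionDate.localeCompare(b.executionDate)
);
const withSort = lastPerGroup(ascending, "commandName", "executionDate");

console.log(withoutSort, withSort);
```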

How to get all subdocuments _id into variable

I'm trying to get the _ids of the families subdocuments into a variable.
Here is my schema:
families: [
{
_id: {
type: mongoose.Types.ObjectId
},
name: {
type: String
},
relation: {
type: String
}
}
]
The problem is that I can get the parent's _id into a variable, but when I try to get the families _ids, the console log shows undefined.
What is the proper query to get families subdocuments _ids into variable?
Please try this:
db.yourCollection.aggregate([
{ $unwind: '$families' },
{ $project: { Ids: '$families._id' } },
{ $group: { '_id': '$_id', subDocumentsIDs: { $push: '$Ids' } } }
])
Output:
/* 1 */
{
"_id" : ObjectId("5d58d3205a0d22d3c85d16f1"),
"subDocumentsIDs" : [
ObjectId("5d570b350e2fb4f72533d512"),
ObjectId("5d570b350e2fb4f71533d510"),
ObjectId("5d570b350e2fb4172533d511")
]
}
/* 2 */
{
"_id" : ObjectId("5d58d3105a0d22d3c85d1591"),
"subDocumentsIDs" : [
ObjectId("5d570b350e2fb4f72533d312"),
ObjectId("5d570b350e2fb4f71533d310"),
ObjectId("5d570b350e2fb4172533d311")
]
}
Please treat this as a basic example and enhance it as needed. Using $unwind as an early stage can hurt performance on a large collection, but you can avoid that by putting a $match stage first. Since you said you can already get the parent _id, use it in $match to filter the documents.
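If it helps to see what each stage does to the data, here is a plain-JavaScript imitation of the pipeline. The documents and the collectFamilyIds helper are made up, and short strings stand in for real ObjectId values.

```javascript
// Hypothetical parent documents; strings stand in for ObjectIds.
const people = [
  { _id: "p1", families: [
    { _id: "f1", name: "Ann", relation: "mother" },
    { _id: "f2", name: "Bob", relation: "father" },
  ] },
  { _id: "p2", families: [{ _id: "f3", name: "Cy", relation: "brother" }] },
];

function collectFamilyIds(docs) {
  // { $unwind: '$families' }: one record per array element.
  const unwound = docs.flatMap((doc) =>
    doc.families.map((family) => ({ _id: doc._id, families: family }))
  );

  // { $project: { Ids: '$families._id' } }: keep only the sub-document _id.
  const projected = unwound.map((d) => ({ _id: d._id, Ids: d.families._id }));

  // { $group: { _id: '$_id', subDocumentsIDs: { $push: '$Ids' } } }
  const groups = new Map();
  for (const d of projected) {
    if (!groups.has(d._id)) groups.set(d._id, []);
    groups.get(d._id).push(d.Ids);
  }
  return [...groups].map(([_id, subDocumentsIDs]) => ({ _id, subDocumentsIDs }));
}

console.log(collectFamilyIds(people));
```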

Aggregate on array of embedded documents

I have a MongoDB collection with multiple documents. Each document has an array with multiple subdocuments (or embedded documents, I guess?). Each of these subdocuments is in this format:
{
"name": string,
"count": integer
}
Now I want to aggregate these subdocuments to find
1. The top X counts and their names.
2. Same as 1., but the names have to match a regex before sorting and limiting.
I have tried the following for 1. already - it does return me the top X but unordered, so I'd have to order them again which seems somewhat inefficient.
[{
$match: {
_id: id
}
}, {
$unwind: {
path: "$array"
}
}, {
$sort: {
'count': -1
}
}, {
$limit: x
}]
Since I'm rather new to MongoDB, this is pretty confusing for me. Happy for any help. Thanks in advance.
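After $unwind, each count lives under the array field, so the $sort key has to include the array name (for example "array.count" rather than "count"). A bare "count" refers to a field that does not exist at the top level, which is why the results come back unordered and would need an additional sort later on.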
Given the following document to work with:
{
students: [{
count: 4,
name: "Ann"
}, {
count: 7,
name: "Brad"
}, {
count: 6,
name: "Beth"
}, {
count: 8,
name: "Catherine"
}]
}
As an example, the following aggregation query will match any name containing the letter "h" or "e" (the character class [he] matches either one). This needs to happen after the "$unwind" step in order to only keep the ones you need.
db.tests.aggregate([
{$match: {
_id: ObjectId("5c1b191b251d9663f4e3ce65")
}},
{$unwind: {
path: "$students"
}},
{$match: {
"students.name": /[he]/
}},
{$sort: {
"students.count": -1
}},
{$limit: 2}
])
This is the output given the above mentioned input:
{ "_id" : ObjectId("5c1b191b251d9663f4e3ce65"), "students" : { "count" : 8, "name" : "Catherine" } }
{ "_id" : ObjectId("5c1b191b251d9663f4e3ce65"), "students" : { "count" : 6, "name" : "Beth" } }
Both names contain the letters "h" and "e", and the output is sorted from high to low.
When setting the limit to 1, the output is limited to:
{ "_id" : ObjectId("5c1b191b251d9663f4e3ce65"), "students" : { "count" : 8, "name" : "Catherine" } }
In this case only the highest count has been kept after having matched the names.
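The same unwind, match, sort and limit sequence can be tried out in plain JavaScript against the sample document. This is just a sketch: topStudents is a made-up helper and "doc1" stands in for the real ObjectId.

```javascript
// The sample document from above; "doc1" stands in for the ObjectId.
const university = {
  _id: "doc1",
  students: [
    { count: 4, name: "Ann" },
    { count: 7, name: "Brad" },
    { count: 6, name: "Beth" },
    { count: 8, name: "Catherine" },
  ],
};

function topStudents(doc, pattern, limit) {
  return doc.students
    // $unwind: one record per student, nested under the array's field name.
    .map((student) => ({ _id: doc._id, students: student }))
    // $match on "students.name" with a regex.
    .filter((record) => pattern.test(record.students.name))
    // $sort on "students.count" descending; note the array name in the key.
    .sort((a, b) => b.students.count - a.students.count)
    // $limit
    .slice(0, limit);
}

console.log(topStudents(university, /[he]/, 2));
```

"Ann" and "Brad" contain neither letter, so only "Catherine" and "Beth" survive the match before sorting and limiting.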
=====================
Edit for the extra question:
Yes, the first $match can be changed to filter on specific universities.
{$match: {
university: "University X"
}},
That will give one or more matching documents (in case you have a document per year or so) and the rest of the aggregation steps would still be valid.
The following match would retrieve the students for the given university for a given academic year in case that would be needed.
{$match: {
university: "University X",
academic_year: "2018-2019"
}},
That should narrow it down to get the correct documents.

mongodb aggregation framework group + project

I have the following issue:
This query returns one result, which is what I want:
> db.items.aggregate([ {$group: { "_id": "$id", version: { $max: "$version" } } }])
{
"result" : [
{
"_id" : "b91e51e9-6317-4030-a9a6-e7f71d0f2161",
"version" : 1.2000000000000002
}
],
"ok" : 1
}
This query (I just added a projection so I can later query for the entire document) returns multiple results. What am I doing wrong?
> db.items.aggregate([ {$group: { "_id": "$id", version: { $max: "$version" } }, $project: { _id : 1 } }])
{
"result" : [
{
"_id" : ObjectId("5139310a3899d457ee000003")
},
{
"_id" : ObjectId("513931053899d457ee000002")
},
{
"_id" : ObjectId("513930fd3899d457ee000001")
}
],
"ok" : 1
}
Found the answer.
1. First I need to get all the _ids:
db.items.aggregate( [
{ '$match': { 'owner.id': '9e748c81-0f71-4eda-a710-576314ef3fa' } },
{ '$group': { _id: '$item.id', dbid: { $max: "$_id" } } }
]);
2. Then I need to query the documents:
db.items.find({ _id: { '$in': "IDs returned from aggregate" } });
which will look like this:
db.items.find({ _id: { '$in': [ '1', '2', '3' ] } });
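The glue between the two steps is just turning the aggregate output into an array for $in. A minimal sketch with made-up values follows; aggregateResult is a hypothetical stand-in for what step 1 returns (in the old output format shown in the question that array sits under the result key, while current shells return a cursor, where toArray() gives the same array).

```javascript
// Hypothetical output of the aggregate call in step 1.
const aggregateResult = [
  { _id: "item-1", dbid: "5139310a3899d457ee000003" },
  { _id: "item-2", dbid: "513931053899d457ee000002" },
  { _id: "item-3", dbid: "513930fd3899d457ee000001" },
];

// Collect the dbid of each group; these are the document _ids to fetch.
const ids = aggregateResult.map((group) => group.dbid);

// The filter document to pass to db.items.find(...) in step 2.
const filter = { _id: { $in: ids } };

console.log(JSON.stringify(filter));
```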
(I know it's late, but I'm still answering so that other people don't have to search for the right answer somewhere else.)
See Deka's answer; it will do the job.
Not all accumulators are available in $project stage. We need to consider what we can do in project with respect to accumulators and what we can do in group. Let's take a look at this:
db.companies.aggregate([{
$match: {
funding_rounds: {
$ne: []
}
}
}, {
$unwind: "$funding_rounds"
}, {
$sort: {
"funding_rounds.funded_year": 1,
"funding_rounds.funded_month": 1,
"funding_rounds.funded_day": 1
}
}, {
$group: {
_id: {
company: "$name"
},
funding: {
$push: {
amount: "$funding_rounds.raised_amount",
year: "$funding_rounds.funded_year"
}
}
}
}, ]).pretty()
The $match stage first keeps only the companies whose funding_rounds array is not empty. The array is then unwound before the $sort and later stages, so those stages see one document for each element of the funding_rounds array for every company. The first thing we do after that is $sort based on:
funding_rounds.funded_year
funding_rounds.funded_month
funding_rounds.funded_day
In the $group stage, which groups by company name, the array is built using $push. $push goes inside a document given as the value of a field we name in the $group stage, and it accepts any valid expression. Here we push documents built from raised_amount and funded_year, and each one is appended to the end of the array being accumulated. So the output of the $group stage is a stream of documents, each with an _id that specifies the company name.
Notice that $push is available in $group stages but not in $project stages. This is because $group stages are designed to take a sequence of documents and accumulate values based on that stream of documents.
$project, on the other hand, works with one document at a time. We can calculate an average over an array within an individual document inside a $project stage, but seeing documents one at a time and pushing a new value for each one as it passes through is something $project is just not designed to do. For that type of operation we want to use $group.
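To picture the accumulation, here is a small plain-JavaScript imitation of the $sort and $group/$push steps. The company data and the pushFundingPerCompany helper are made up, and the input is assumed to be already unwound (one record per funding round).

```javascript
// Hypothetical records after $unwind: one per funding round.
const rounds = [
  { name: "Acme", funding_rounds: { raised_amount: 500, funded_year: 2010 } },
  { name: "Acme", funding_rounds: { raised_amount: 100, funded_year: 2008 } },
  { name: "Globex", funding_rounds: { raised_amount: 300, funded_year: 2009 } },
];

function pushFundingPerCompany(docs) {
  // $sort by year (month and day are left out of this sketch).
  const sorted = [...docs].sort(
    (a, b) => a.funding_rounds.funded_year - b.funding_rounds.funded_year
  );

  // $group with $push: every incoming record appends to its company's array.
  const groups = new Map();
  for (const doc of sorted) {
    const funding = groups.get(doc.name) || [];
    funding.push({
      amount: doc.funding_rounds.raised_amount,
      year: doc.funding_rounds.funded_year,
    });
    groups.set(doc.name, funding);
  }
  return [...groups].map(([company, funding]) => ({ _id: { company }, funding }));
}

console.log(JSON.stringify(pushFundingPerCompany(rounds)));
```

Because the records were sorted first, each company's funding array comes out in chronological order.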
Let's take a look at another example:
db.companies.aggregate([{
$match: {
funding_rounds: {
$exists: true,
$ne: []
}
}
}, {
$unwind: "$funding_rounds"
}, {
$sort: {
"funding_rounds.funded_year": 1,
"funding_rounds.funded_month": 1,
"funding_rounds.funded_day": 1
}
}, {
$group: {
_id: {
company: "$name"
},
first_round: {
$first: "$funding_rounds"
},
last_round: {
$last: "$funding_rounds"
},
num_rounds: {
$sum: 1
},
total_raised: {
$sum: "$funding_rounds.raised_amount"
}
}
}, {
$project: {
_id: 0,
company: "$_id.company",
first_round: {
amount: "$first_round.raised_amount",
article: "$first_round.source_url",
year: "$first_round.funded_year"
},
last_round: {
amount: "$last_round.raised_amount",
article: "$last_round.source_url",
year: "$last_round.funded_year"
},
num_rounds: 1,
total_raised: 1,
}
}, {
$sort: {
total_raised: -1
}
}]).pretty()
In the $group stage, we're using the $first and $last accumulators. As with $push, we can't use these accumulators in $project stages, because $project stages are not designed to accumulate values based on multiple documents. Rather, they're designed to reshape documents one at a time. The total number of rounds is calculated using the $sum operator: summing the value 1 for each document simply counts the documents grouped under a given _id value. The $project stage may seem complex, but it's just making the output pretty, and it passes num_rounds and total_raised through unchanged from the previous stage.
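A plain-JavaScript imitation of these accumulators may make the roles of $first, $last and $sum clearer. It is only a sketch: summarizeRounds and the data are made up, and the input is assumed to be unwound and already sorted by date.

```javascript
// Hypothetical records after $unwind and $sort (oldest round first).
const sortedRounds = [
  { name: "Acme", funding_rounds: { raised_amount: 100, funded_year: 2008 } },
  { name: "Globex", funding_rounds: { raised_amount: 900, funded_year: 2009 } },
  { name: "Acme", funding_rounds: { raised_amount: 500, funded_year: 2010 } },
];

function summarizeRounds(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const round = doc.funding_rounds;
    const group = groups.get(doc.name);
    if (!group) {
      // First record of the group: it is both $first and (so far) $last.
      groups.set(doc.name, {
        company: doc.name,
        first_round: round,
        last_round: round,
        num_rounds: 1, // { $sum: 1 } counts records
        total_raised: round.raised_amount, // { $sum: "$funding_rounds.raised_amount" }
      });
    } else {
      group.last_round = round; // $last keeps being overwritten
      group.num_rounds += 1;
      group.total_raised += round.raised_amount;
    }
  }
  // Final { $sort: { total_raised: -1 } }
  return [...groups.values()].sort((a, b) => b.total_raised - a.total_raised);
}

console.log(summarizeRounds(sortedRounds));
```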