MongoDB: nested field in $group's _id

Assume we have documents like this in the collection
{
_id: {
element_id: '12345',
name: 'foobar'
},
value: {
count: 1
}
}
I am using the aggregation framework to do a $group, like so
db.collection.aggregate([
{ $group: { _id: '$_id.element_id', total: { $sum: '$value.count' } } }
])
And got a result of
{ "result" : [ { "_id" : null, "total" : 1 } ], "ok" : 1 }
Notice that the _id field in the result is null. From experimentation it seems that $group is not allowing a nested field declaration for its _id (e.g. $_id.element_id).
Why is this? And is there a workaround for it?
Thank you.

I found a workaround using $project.
db.collection.aggregate([
{ $project: { element_id: '$_id.element_id', count: '$value.count' } },
{ $group: { _id: '$element_id', total: { $sum: '$count' } } }
])
From the docs: $project reshapes a document stream by renaming, adding, or removing fields.
http://docs.mongodb.org/manual/reference/aggregation/#_S_project

This turns out to have been issue SERVER-7491, which appears to have been fixed in 2.2.2 (released about 3 days ago).
The workaround mentioned above worked well for me in 2.2.1. As a note, when using the $project workaround (pre-2.2.2), excluding _id from the $project with _id: 0 is inadvisable, as it appears to behave quite strangely: within the same aggregation I ended up with some results working properly and some where that portion of the _id field was missing in the end result.
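For completeness: on 2.2.2 and later (with SERVER-7491 fixed), the original query from the question should work as written, without the intermediate $project stage:
db.collection.aggregate([
    { $group: { _id: '$_id.element_id', total: { $sum: '$value.count' } } }
])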

Related

Select latest document after grouping them by a field in MongoDB

I have a question that I would expect to be pretty simple, but I cannot figure it out. What I want to do is this:
Find all documents in a collection and:
sort the documents by a certain date field
apply distinct on one of its other fields, but return the whole document
Best shown in an example.
This is a mock input:
[
{
"commandName" : "migration_a",
"executionDate" : ISODate("1998-11-04T18:46:14.000Z")
},
{
"commandName" : "migration_a",
"executionDate" : ISODate("1970-05-09T20:16:37.000Z")
},
{
"commandName" : "migration_a",
"executionDate" : ISODate("2005-11-08T11:58:52.000Z")
},
{
"commandName" : "migration_b",
"executionDate" : ISODate("2016-06-02T19:48:34.000Z")
}
]
The expected output is:
[
{
"commandName" : "migration_a",
"executionDate" : ISODate("2005-11-08T11:58:52.000Z")
},
{
"commandName" : "migration_b",
"executionDate" : ISODate("2016-06-02T19:48:34.000Z")
}
]
Or, in other words:
Group the input data by the commandName field
Inside each group sort the documents
Return the newest document from each group
My attempts to write this query have failed:
The distinct() function will only return the values of the field I am distinct-ing on, not the whole document. That makes it unsuitable for my case.
I tried writing an aggregate query, but ran into the issue of how to sort-and-select a single document from inside each group. The $sort aggregation stage sorts the groups relative to one another, which is not what I want.
I am not too well-versed in Mongo and this is where I hit a wall. Any ideas on how to continue?
For reference, this is the work-in-progress aggregation query I am trying to expand on:
db.getCollection('some_collection').aggregate([
{ $group: { '_id': '$commandName', 'docs': {$addToSet: '$$ROOT'} } },
{ $sort: {'_id.docs.???': 1}}
])
Post-resolved edit
Thank you for the answers. I got what I needed. For future reference, this is the full query that will do what was requested and also return a list of the filtered documents, not groups.
db.getCollection('some_collection').aggregate([
{ $sort: {'executionDate': 1}},
{ $group: { '_id': '$commandName', 'result': { $last: '$$ROOT'} } },
{ $replaceRoot: {newRoot: '$result'} }
])
The query result without the $replaceRoot stage would be:
[
{
"_id": "migration_a",
"result": {
"commandName" : "migration_a",
"executionDate" : ISODate("2005-11-08T11:58:52.000Z")
}
},
{
"_id": "migration_b",
"result": {
"commandName" : "migration_b",
"executionDate" : ISODate("2016-06-02T19:48:34.000Z")
}
}
]
The outer _id and result fields are just "group-wrappers" around the actual document I want, which is nested under the result key. Moving the nested document to the root of the result is done using the $replaceRoot stage. The query result when using that stage is:
[
{
"commandName" : "migration_a",
"executionDate" : ISODate("2005-11-08T11:58:52.000Z")
},
{
"commandName" : "migration_b",
"executionDate" : ISODate("2016-06-02T19:48:34.000Z")
}
]
Try this:
db.getCollection('some_collection').aggregate([
{ $sort: {'executionDate': -1}},
{ $group: { '_id': '$commandName', 'doc': {$first: '$$ROOT'} } }
])
I believe this will result in what you're looking for:
db.collection.aggregate([
{
$group: {
"_id": "$commandName",
"executionDate": {
"$last": "$executionDate"
}
}
}
])
You can check it out here
Of course, if you want to match your expected output exactly, you can add a sort (this may not be necessary since your goal is to simply return the newest document from each group):
{
$sort: {
"executionDate": 1
}
}
You can check this version out here.
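If you want the output fields to be named commandName and executionDate (matching the expected output shape exactly) rather than _id and executionDate, a $project rename can be appended. This is a sketch building on the answer above, not part of the original answer:
db.collection.aggregate([
    { $group: { "_id": "$commandName", "executionDate": { "$last": "$executionDate" } } },
    { $project: { _id: 0, commandName: "$_id", executionDate: 1 } },
    { $sort: { executionDate: 1 } }
])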
The use case the question presents is nearly covered in the $last aggregation operator documentation, which summarises: the $group stage should follow a $sort stage so that the input documents are in a defined order, since $last simply picks the last document from a group.
Query:
db.collection.aggregate([
{
$sort: {
executionDate: 1
}
},
{
$group: {
_id: "$commandName",
executionDate: {
$last: "$executionDate"
}
}
}
]);

mongo: find non-superseded documents

I have a collection with documents like:
{
"_id" : "ThisIsASampleId_rand12345",
"timestamp" : ISODate("2019-04-30T10:53:34.515Z"),
"mySpecialId" : "specialId_12345",
"status" : "error",
}
My goal is to find all documents with {status: 'error'}, so long as no subsequent documents exist with the same mySpecialId and status 'success'.
Clearly I can do db.jobs.find({status: 'error'}), but after that, I get lost.
Do I need to do a $lookup in an aggregation pipeline into the same collection, using "mySpecialId" as both local and foreign fields, with a $match that includes something like {$gt: {timestamp: $PREVIOUS_TIMESTAMP}}? That feels wrong, somehow.
Is there a simpler/better/more elegant way to do this?
You can $sort your collection by the timestamp field and then run $group with the $last operator to get the most recent document for each mySpecialId. Then you can simply check whether that last document's status is error. If it isn't, then either all documents in that group had success, or an error appeared but was superseded by a success. To get back the original shape of your documents you can use $replaceRoot.
db.col.aggregate([
{
$sort: { timestamp: 1 }
},
{
$group: {
_id: "$mySpecialId",
lastDoc: { $last: "$$ROOT" }
}
},
{
$match: {
"lastDoc.status": "error"
}
},
{
$replaceRoot: {
newRoot: "$lastDoc"
}
}
])
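For reference, the self-$lookup idea from the question could be written roughly as follows. This is only a sketch, assuming MongoDB 3.6+ for the pipeline form of $lookup and the jobs collection name used in the find() above; the $sort/$group approach shown here avoids the join entirely.
db.jobs.aggregate([
    { $match: { status: "error" } },
    { $lookup: {
        from: "jobs",
        let: { specialId: "$mySpecialId", ts: "$timestamp" },
        pipeline: [
            { $match: { $expr: { $and: [
                { $eq: [ "$mySpecialId", "$$specialId" ] },
                { $eq: [ "$status", "success" ] },
                { $gt: [ "$timestamp", "$$ts" ] }
            ] } } }
        ],
        as: "laterSuccesses"
    } },
    // keep only the error documents with no later success for the same mySpecialId
    { $match: { laterSuccesses: { $size: 0 } } }
])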

How to remove only one or two fields from documents in mongodb?

This is a very easy question; I'm just having a really bad brain freeze. In my aggregation, I just want to remove the '_id' field by using $project but return everything else. However, I'm getting
$projection requires at least one output field
I would think it's something like:
db.coll.aggregate( [ { $match .... }, { $project: { _id: 0 }}])
From v4.2, you can make use of the $unset aggregation pipeline stage to remove one or more fields. You can also exclude a field or fields from an embedded document using dot notation.
To remove a single field:
db.coll.aggregate([ { $unset: "_id" } ])
To remove multiple fields:
db.coll.aggregate([ { $unset: [ "_id", "name" ] } ])
To remove embedded fields:
db.coll.aggregate([
{ $unset: [ "_id", "author.name" ] }
])
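Note that, like $project, the $unset aggregation stage only shapes the pipeline output; it does not modify the stored documents. If the goal is to actually delete a field from the stored documents, that is done with the $unset update operator instead. A sketch, reusing the name field from the example above:
db.coll.updateMany({}, { $unset: { name: "" } })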
You need to explicitly include fields when using aggregation either via various pipeline operations or via $project. There currently isn't a way to return all fields unless explicitly defined by field name:
$project : {
_id : 0,
"Name" : 1,
"Address" : 1
}
You can exclude the _id using the technique you used and as shown above.
You can do this just with the exact syntax that you wrote in your question.
Example document:
Person
{
_id: ObjectId('6023a13b756e30fec9f77b26'),
name: 'Pablo',
lastname: 'Presidente',
}
If you do an aggregation (for example one that uses $lookup), you can remove, let's say, the _id field like this:
db.person.aggregate( [ { task1 }, { ... }, { taskN }, { $project: { _id: 0 }}])
This way you can also exclude fields from other documents related to your aggregation; you would do that like this:
db.person.aggregate( [ { task1 }, { ... }, { taskN }, { $project: { _id: 0, 'otherDocument._id': 0 }}])
Performance-wise I don't know if this is any good, but leaving that aside, this works like a charm!
More info: https://docs.mongodb.com/manual/reference/operator/aggregation/project/#exclude-fields-from-output-documents
Also, you can use $unset, though I haven't tried it out. From the docs ($unset and $project):
$unset is an alias for the $project stage that removes/excludes fields.
If you are doing a simple find, it is mostly the same; here are the docs with some examples:
https://docs.mongodb.com/manual/tutorial/project-fields-from-query-results/#return-all-but-the-excluded-fields
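For instance, the equivalent exclusion in a plain find() projection (reusing the hypothetical person collection and fields from above) would be:
db.person.find({}, { _id: 0, 'otherDocument._id': 0 })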
Hope this is useful, best regards!

How can I get duplicate elements identified by 2 keys in a mongo collection?

I have a collection with a few million documents in which I need to find at least one duplicate document. The duplication criterion is based on 2 keys, not one, so I need to find at least 2 documents which both have { property1: value1, property2: value2 }.
For this I am trying to use the aggregation framework, as in the following example:
db.listings.aggregate({
    $group: {
        _id: { property1: "$property1", property2: "$property2" },
        count: { $sum: 1 }
    }
}, {
    $match: {
        count: { $gt: 1 }
    }
}, {
    $limit: 1
})
I think this should be working, BUT
Mongo returns the following error:
{
    "code" : 16390,
    "ok" : 0,
    "errmsg" : "exception: sharded pipeline failed on shard shard1: { errmsg: \"exception: aggregation result exceeds maximum document size (16MB)\", code: 16389, ok: 0.0}"
}
I have also tried
db.collection.aggregate({
    $group: {
        _id: { $concat: [ "$property1", ": ", "$property2" ] },
        count: { $sum: 1 }
    }
})
I got the same result.
Does anyone have a better idea how to do this? I am not really a mongo expert, but I have to do this one way or the other.
Thanks in advance
Your idea to shrink the doc as much as possible with $concat is a good one, but $concat is a $project operator, not a $group operator. So try something like this:
db.collection.aggregate(
{ $project: { _id: { $concat: ["$property1", ":", "$property2"] }}},
{ $group: { _id: '$_id', c: { $sum: 1 }}},
{ $match: { c: { $gt: 1 }}})
It still may use too much memory, but it's worth a shot.
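As a side note, on MongoDB 2.6 and later aggregate() returns a cursor, so the 16MB result-document limit behind the error no longer applies, and allowDiskUse lets the $group stage spill to disk if it runs out of memory. A sketch against the same collection:
db.listings.aggregate(
    [
        { $group: { _id: { property1: "$property1", property2: "$property2" }, count: { $sum: 1 } } },
        { $match: { count: { $gt: 1 } } }
    ],
    { allowDiskUse: true }
)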
Using map-reduce is an alternative. Here you can find examples:
http://docs.mongodb.org/manual/tutorial/map-reduce-examples/
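A rough sketch of that map-reduce alternative for counting duplicates (the output collection name dup_counts is made up for illustration):
db.listings.mapReduce(
    function () { emit(this.property1 + ":" + this.property2, 1); },
    function (key, values) { return Array.sum(values); },
    { out: "dup_counts" }
);
// keys emitted more than once are the duplicates
db.dup_counts.find({ value: { $gt: 1 } });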

mongodb aggregation framework group + project

I have the following issue:
This query returns 1 result, which is what I want:
> db.items.aggregate([ {$group: { "_id": "$id", version: { $max: "$version" } } }])
{
"result" : [
{
"_id" : "b91e51e9-6317-4030-a9a6-e7f71d0f2161",
"version" : 1.2000000000000002
}
],
"ok" : 1
}
This query (I just added a projection so I can later query for the entire document) returns multiple results. What am I doing wrong?
> db.items.aggregate([ {$group: { "_id": "$id", version: { $max: "$version" } }, $project: { _id : 1 } }])
{
"result" : [
{
"_id" : ObjectId("5139310a3899d457ee000003")
},
{
"_id" : ObjectId("513931053899d457ee000002")
},
{
"_id" : ObjectId("513930fd3899d457ee000001")
}
],
"ok" : 1
}
I found the answer.
1. First I need to get all the _ids:
db.items.aggregate( [
{ '$match': { 'owner.id': '9e748c81-0f71-4eda-a710-576314ef3fa' } },
{ '$group': { _id: '$item.id', dbid: { $max: "$_id" } } }
]);
2. Then I need to query the documents:
db.items.find({ _id: { '$in': "IDs returned from aggregate" } });
which will look like this:
db.items.find({ _id: { '$in': [ '1', '2', '3' ] } });
(I know it's late, but I'm still answering so that other people don't have to go searching for the right answer somewhere else.)
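A small sketch of gluing the two steps together in the shell instead of copying the IDs by hand (assumes MongoDB 2.6+, where aggregate() returns a cursor):
var ids = db.items.aggregate([
    { '$match': { 'owner.id': '9e748c81-0f71-4eda-a710-576314ef3fa' } },
    { '$group': { _id: '$item.id', dbid: { $max: '$_id' } } }
]).toArray().map(function (doc) { return doc.dbid; });
db.items.find({ _id: { '$in': ids } });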
See the answer by Deka; it will do the job.
Not all accumulators are available in the $project stage. We need to consider what we can do in $project with respect to accumulators and what we can do in $group. Let's take a look at this:
db.companies.aggregate([{
$match: {
funding_rounds: {
$ne: []
}
}
}, {
$unwind: "$funding_rounds"
}, {
$sort: {
"funding_rounds.funded_year": 1,
"funding_rounds.funded_month": 1,
"funding_rounds.funded_day": 1
}
}, {
$group: {
_id: {
company: "$name"
},
funding: {
$push: {
amount: "$funding_rounds.raised_amount",
year: "$funding_rounds.funded_year"
}
}
}
}, ]).pretty()
Here we're checking that funding_rounds is not empty. Then it's unwound and passed to $sort and the later stages. We'll see one document for each element of the funding_rounds array for every company. So, the first thing we're going to do here is $sort based on:
funding_rounds.funded_year
funding_rounds.funded_month
funding_rounds.funded_day
In the group stage by company name, the array is getting built using $push. $push is supposed to be part of a document specified as the value for a field we name in a group stage. We can push on any valid expression. In this case, we're pushing on documents to this array and for every document that we push it's being added to the end of the array that we're accumulating. In this case, we're pushing on documents that are built from the raised_amount and funded_year. So, the $group stage is a stream of documents that have an _id where we're specifying the company name.
Notice that $push is available in $group stages but not in $project stage. This is because $group stages are designed to take a sequence of documents and accumulate values based on that stream of documents.
$project, on the other hand, works with one document at a time. So we can calculate an average over an array within an individual document inside a $project stage. But doing something like this, where we see documents one at a time and, for every document that passes through the group stage, push a new value onto an array we are accumulating, is something the $project stage is just not designed to do. For that type of operation we want to use $group.
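To make the contrast concrete, here is a sketch of the kind of per-document calculation that does belong in $project (assuming MongoDB 3.2+, where $avg can be used as an expression; it averages the raised_amount values of the funding_rounds array within each single document):
db.companies.aggregate([
    { $project: {
        _id: 0,
        name: 1,
        average_raised: { $avg: "$funding_rounds.raised_amount" }
    } }
])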
Let's take a look at another example:
db.companies.aggregate([{
$match: {
funding_rounds: {
$exists: true,
$ne: []
}
}
}, {
$unwind: "$funding_rounds"
}, {
$sort: {
"funding_rounds.funded_year": 1,
"funding_rounds.funded_month": 1,
"funding_rounds.funded_day": 1
}
}, {
$group: {
_id: {
company: "$name"
},
first_round: {
$first: "$funding_rounds"
},
last_round: {
$last: "$funding_rounds"
},
num_rounds: {
$sum: 1
},
total_raised: {
$sum: "$funding_rounds.raised_amount"
}
}
}, {
$project: {
_id: 0,
company: "$_id.company",
first_round: {
amount: "$first_round.raised_amount",
article: "$first_round.source_url",
year: "$first_round.funded_year"
},
last_round: {
amount: "$last_round.raised_amount",
article: "$last_round.source_url",
year: "$last_round.funded_year"
},
num_rounds: 1,
total_raised: 1,
}
}, {
$sort: {
total_raised: -1
}
}]).pretty()
In the $group stage, we're using the $first and $last accumulators. Again, as with $push, we can't use $first and $last in $project stages, because project stages are not designed to accumulate values across multiple documents; rather, they're designed to reshape documents one at a time. The total number of rounds is calculated using the $sum operator; the value 1 simply counts the number of documents grouped under each given _id value. The $project may seem complex, but it's just making the output pretty: it reshapes the first and last rounds and includes num_rounds and total_raised from the previous stage.