How to join to two additional collections with conditions - mongodb

select tb1.*,tb3 from tb1,tb2,tb3
where tb1.id=tb2.profile_id and tb2.field='<text>'
and tb3.user_id = tb2.id and tb3.status =0
Actually I converted the SQL to a MongoDB query as follows. This is the MongoDB query I used:
db.getCollection('tb1').aggregate([
{ $lookup:
{ from: 'tb2',
localField: 'id',
foreignField: 'profile_id',
as: 'tb2detail'
}
},
{ $lookup:
{ from: 'tb3',
localField: 'tb2.id',
foreignField: 'user_id',
as: 'tb3details'
}
},
{ $match:
{ 'status':
{ '$ne': 'closed'
},
'tb2.profile_type': 'agent',
'tb3.status': 0
}
}
])
but I am not getting the expected result.
Any help will be appreciated.

What you are missing here is that $lookup produces an "array" in the output field specified by as in its arguments. This is the general concept of MongoDB "relations", in that a "relation" between documents is represented as a "sub-property" that is "within" the document itself, being either singular or an "array" for many.
Since MongoDB is "schemaless", the general presumption of $lookup is that you mean "many" and the result is therefore "always" an array. So looking for the "same result as in SQL" then you need to $unwind that array after the $lookup. Whether it's "one" or "many" is of no consequence, since it's still "always" an array:
db.getCollection('tb1').aggregate([
// Filter conditions from the source collection
{ "$match": { "status": { "$ne": "closed" } }},
// Do the first join
{ "$lookup": {
"from": "tb2",
"localField": "id",
"foreignField": "profileId",
"as": "tb2"
}},
// $unwind the array to denormalize
{ "$unwind": "$tb2" },
// Then match on the condition for tb2
{ "$match": { "tb2.profile_type": "agent" } },
// join the second additional collection
{ "$lookup": {
"from": "tb3",
"localField": "tb2.id",
"foreignField": "id",
"as": "tb3"
}},
// $unwind again to de-normalize
{ "$unwind": "$tb3" },
// Now filter the condition on tb3
{ "$match": { "tb3.status": 0 } },
// Project only wanted fields. In this case, exclude "tb2"
{ "$project": { "tb2": 0 } }
])
Here you need to note the other things you are missing in the translation:
Sequence is "important"
Aggregation pipelines are more "tersely expressive" than SQL. They are in fact best considered as "a sequence of steps" applied to the datasource in order to collate and transform the data. The best analog to this is "piped" command line instructions, such as:
ps -ef | grep mongod | grep -v grep | awk '{ print $1 }'
Where the "pipe" | can be considered as a "pipeline stage" in a MongoDB aggregation "pipeline".
As such we want to $match in order to filter things from the "source" collection as our first operation. And this is generally good practice since it removes any documents that did not meet required conditions from further conditions. Just like what is happening in our "command line pipe" example, where we take "input" then "pipe" to a grep to "remove" or "filter".
Paths Matter
Where the very next thing you do here is "join" via $lookup. The result is an "array" of the items from the "from" collection argument matched by the supplied fields to output in the "as" "field path" as an "array".
The naming chosen here is important, since now the "document" from the source collection considers all items from the "join" to now exist at that given path. To make this easy, I use the same "collection" name as the "join" for the new "path".
So starting from the first "join" the output is to "tb2" and that will hold all the results from that collection. There is also an important thing to note with the following sequence of $unwind and then $match, as to how MongoDB actually processes the query.
Certain Sequences "really" matter
Since it "looks like" there are "three" pipeline stages, being $lookup then $unwind and then $match. But in "fact" MongoDB really does something else, which is demonstrated in the output of { "explain": true } added to the .aggregate() command:
{
"$lookup" : {
"from" : "tb2",
"as" : "tb2",
"localField" : "id",
"foreignField" : "profileId",
"unwinding" : {
"preserveNullAndEmptyArrays" : false
},
"matching" : {
"profile_type" : {
"$eq" : "agent"
}
}
}
},
{
"$lookup" : {
"from" : "tb3",
"as" : "tb3",
"localField" : "tb2.id",
"foreignField" : "id",
"unwinding" : {
"preserveNullAndEmptyArrays" : false
},
"matching" : {
"status" : {
"$eq" : 0.0
}
}
}
},
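For reference, that output comes from passing the explain option to the same command. A minimal sketch of the invocation, with the pipeline elided:
// request the explain output for the pipeline shown earlier
db.getCollection('tb1').aggregate(
  [ /* ...the pipeline stages shown above... */ ],
  { "explain": true }
)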
So aside from the first point of "sequence" applying where you need to put the $match statements where they are needed and do the "most good", this actually becomes "really important" with the concept of "joins". The thing to note here is that our sequences of $lookup then $unwind and then $match, actually get processed by MongoDB as just the $lookup stages, with the other operations "rolled up" into the one pipeline stage for each.
This is an important distinction to other ways of "filtering" the results obtained by $lookup. Since in this case, the actual "query" conditions on the "join" from $match are performed on the collection to join "before" the results are returned to the parent.
This in combination with $unwind ( which is translated into unwinding ) as shown above is how MongoDB actually deals with the possibility that the "join" could result in producing an array of content in the source document which causes it to exceed the 16MB BSON limit. This would only happen in cases where the result being joined to is very large, but the same advantage is in where the "filter" is actually applied, being on the target collection "before" results are returned.
It is that kind of handling that best "correlates" to the same behavior as a SQL JOIN. It is also therefore the most effective way to obtain results from a $lookup where there are other conditions to apply to the JOIN aside from simply the "local" or "foreign" key values.
Also note that the other behavior change is from what is essentially a LEFT JOIN performed by $lookup, where the "source" document would always be retained regardless of the presence of a matching document in the "target" collection. Instead the $unwind adds to this by "discarding" any results from the "source" which did not have anything matching from the "target" by the additional conditions in $match.
In fact they are even discarded beforehand due to the implied preserveNullAndEmptyArrays: false which is included and would discard anything where the "local" and "foreign" keys did not even match between the two collections. This is a good thing for this particular type of query as the "join" is intended to be "equal" on those values.
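If you actually wanted the LEFT JOIN style retention instead, $unwind can be told to keep documents that had no matches. A minimal sketch, assuming MongoDB 3.2 or later for the option:
// keep "tb1" documents even when the $lookup found nothing in "tb2"
{ "$unwind": { "path": "$tb2", "preserveNullAndEmptyArrays": true } }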
Conclusion
As noted before, MongoDB generally treats "relations" a lot differently to how you would use a "Relational Database" or RDBMS. The general concept of "relations" is in fact "embedding" the data, either as a single property or as an array.
You may actually desire such output, which is also part of the reason why, without the $unwind sequence here, the output of $lookup is actually an "array". However using $unwind in this context is actually the most effective thing to do, as well as giving a guarantee that the "joined" data does not cause the aforementioned BSON limit to be exceeded as a result of that "join".
If you actually want arrays of output, then the best thing to do here would be to use the $group pipeline stage, and possibly as multiple stages in order to "normalize" and "undo the results" of $unwind
{ "$group": {
"_id": "$_id",
"tb1_field": { "$first": "$tb1_field" },
"tb1_another": { "$first": "$tb1_another" },
"tb3": { "$push": "$tb3" }
}}
Here you would, for this case, list all the fields you require from "tb1" by their property names, using $first to keep only the "first" occurrence ( essentially repeated by the results of "tb2" and "tb3" unwound ) and then $push the "detail" from "tb3" into an "array" to represent the relation to "tb1".
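Put together with the earlier pipeline, that re-normalizing stage would simply sit at the end. A sketch only, where tb1_field stands in for whatever actual fields of "tb1" you need to keep:
db.getCollection('tb1').aggregate([
  { "$match": { "status": { "$ne": "closed" } } },
  { "$lookup": { "from": "tb2", "localField": "id", "foreignField": "profile_id", "as": "tb2" } },
  { "$unwind": "$tb2" },
  { "$match": { "tb2.profile_type": "agent" } },
  { "$lookup": { "from": "tb3", "localField": "tb2.id", "foreignField": "user_id", "as": "tb3" } },
  { "$unwind": "$tb3" },
  { "$match": { "tb3.status": 0 } },
  // put the matched tb3 detail back into an array per tb1 document
  { "$group": {
    "_id": "$_id",
    "tb1_field": { "$first": "$tb1_field" },
    "tb3": { "$push": "$tb3" }
  }}
])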
But the general form of the aggregation pipeline as given is the exact representation of how results would be obtained from the original SQL, which is "denormalized" output as a result of the "join". Whether you want to "normalize" results again after this is up to you.

Related

mongodb combining $lookup and $match gives error "must contain exactly one field"

When I run the $match by itself it works, and when I run the $lookup by itself it works.
Can I not combine these two? And would the "from" for the $lookup not need to be the output of the $match somehow, instead of the original collection name?
db.NealTestBay.aggregate([
{
"$match": { type: "parent"},
"$lookup": {
"from": "NealTestBay",
"localField": "toSku",
"foreignField": "toSku",
"as": "grp"
}
}
])
I'm getting this error when I combine them:
[Error] A pipeline stage specification object must contain exactly one field.
at line 1, column 1
My full sample database and rows are in my more complex question here: MongoDB - Can I loop through a set of rows, sum other rows based on the first rows, then update my original rows in one command?. I'm trying to break it down into a series of pipeline stages.
So my first step is to find the rows with type="parent", then join them with the rows that have type="child" and have the same value for the field "toSku".
I also tried using the "pipeline" under the $lookup, but it told me that it doesn't allow that at the same time as using "localField" and "foreignField".
Based on @Takis' comment, this seems to work:
db.NealTestBay.aggregate([
{
"$match": { type: "parent"},
},
{
"$lookup": {
"from": "NealTestBay",
"localField": "toSku",
"foreignField": "toSku",
"as": "grp"
}
}
])
It wasn't so much a typo as just reading blogs and guessing at what I'm doing.
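As an aside, on MongoDB 3.6 and later the extra condition on the joined side (the type: "child" rows mentioned above) could also be pushed into the pipeline form of $lookup, which, as the error message you saw suggests, replaces "localField"/"foreignField" rather than combining with them. A sketch under those assumptions:
db.NealTestBay.aggregate([
  { "$match": { "type": "parent" } },
  { "$lookup": {
    "from": "NealTestBay",
    "let": { "sku": "$toSku" },
    "pipeline": [
      // join on the same toSku value, but only to the "child" rows
      { "$match": {
        "$expr": { "$eq": [ "$toSku", "$$sku" ] },
        "type": "child"
      }}
    ],
    "as": "grp"
  }}
])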

MongoDB query for finding number of people with conflicting schedules [duplicate]

I have startTime and endTime for all records like this:
{
startTime : 21345678
endTime : 31345678
}
I am trying to find the number of all the conflicts. For example, if there are two records and they overlap, the number of conflicts is 1. If there are three records and two of them overlap, the number of conflicts is 1. If there are three records and all three overlap, the number of conflicts is 3, i.e. [(X1, X2), (X1, X3), (X2, X3)]
As an algorithm I am thinking of sorting the data by start time and, for each sorted record, checking the end time and finding the records with start time less than the end time. This will be O(n²) time. A better approach will be using an interval tree, inserting each record into the tree and finding the counts when overlaps occur. This will be O(n log n) time.
I have not used mongoDB much so what kind of query can I use to achieve something like this?
As you correctly mention, there are different approaches with varying complexity inherent to their execution. This basically covers how they are done, and which one you implement actually depends on what your data and use case are best suited to.
Current Range Match
MongoDB 3.6 $lookup
The most simple approach can be employed using the new syntax of the $lookup operator with MongoDB 3.6 that allows a pipeline to be given as the expression to "self join" to the same collection. This can basically query the collection again for any items where the starttime "or" endtime of the current document falls between the same values of any other document, not including the original of course:
db.getCollection('collection').aggregate([
{ "$lookup": {
"from": "collection",
"let": {
"_id": "$_id",
"starttime": "$starttime",
"endtime": "$endtime"
},
"pipeline": [
{ "$match": {
"$expr": {
"$and": [
{ "$ne": [ "$$_id", "$_id" },
{ "$or": [
{ "$and": [
{ "$gte": [ "$$starttime", "$starttime" ] },
{ "$lte": [ "$$starttime", "$endtime" ] }
]},
{ "$and": [
{ "$gte": [ "$$endtime", "$starttime" ] },
{ "$lte": [ "$$endtime", "$endtime" ] }
]}
]},
]
},
"as": "overlaps"
}},
{ "$count": "count" },
]
}},
{ "$match": { "overlaps.0": { "$exists": true } } }
])
The single $lookup performs the "join" on the same collection allowing you to keep the "current document" values for the "_id", "starttime" and "endtime" values respectively via the "let" option of the pipeline stage. These will be available as "local variables" using the $$ prefix in subsequent "pipeline" of the expression.
Within this "sub-pipeline" you use the $match pipeline stage and the $expr query operator, which allows you to evaluate aggregation framework logical expressions as part of the query condition. This allows the comparison between values as it selects new documents matching the conditions.
The conditions simply look for the "processed documents" where the "_id" field is not equal to the "current document", $and where either the "starttime"
$or "endtime" values of the "current document" falls between the same properties of the "processed document". Noting here that these as well as the respective $gte and $lte operators are the "aggregation comparison operators" and not the "query operator" form, as the returned result evaluated by $expr must be boolean in context. This is what the aggregation comparison operators actually do, and it's also the only way to pass in values for comparison.
Since we only want the "count" of the matches, the $count pipeline stage is used to do this. The result of the overall $lookup will be a "single element" array where there was a count, or an "empty array" where there was no match to the conditions.
An alternate case would be to "omit" the $count stage and simply allow the matching documents to return. This allows easy identification, but as an "array embedded within the document" you do need to be mindful of the number of "overlaps" that will be returned as whole documents and that this does not cause a breach of the BSON limit of 16MB. In most cases this should be fine, but for cases where you expect a large number of overlaps for a given document this can be a real case. So it's really something more to be aware of.
The $lookup pipeline stage in this context will "always" return an array in result, even if empty. The name of the output property "merging" into the existing document will be "overlaps" as specified in the "as" property to the $lookup stage.
Following the $lookup, we can then do a simple $match with a regular query expression employing the $exists test for the 0 index value of output array. Where there actually is some content in the array and therefore "overlaps" the condition will be true and the document returned, showing either the count or the documents "overlapping" as per your selection.
Other versions - Queries to "join"
The alternate case where your MongoDB lacks this support is to "join" manually by issuing the same query conditions outlined above for each document examined:
db.getCollection('collection').find().map( d => {
var overlaps = db.getCollection('collection').find({
"_id": { "$ne": d._id },
"$or": [
{ "starttime": { "$gte": d.starttime, "$lte": d.endtime } },
{ "endtime": { "$gte": d.starttime, "$lte": d.endtime } }
]
}).toArray();
return ( overlaps.length !== 0 )
? Object.assign(
d,
{
"overlaps": {
"count": overlaps.length,
"documents": overlaps
}
}
)
: null;
}).filter(e => e != null);
This is essentially the same logic except we actually need to go "back to the database" in order to issue the query to match the overlapping documents. This time it's the "query operators" used to find where the current document values fall between those of the processed document.
Because the results are already returned from the server, there is no BSON limit restriction on adding content to the output. You might have memory restrictions, but that's another issue. Simply put we return the array rather than cursor via .toArray() so we have the matching documents and can simply access the array length to obtain a count. If you don't actually need the documents, then using .count() instead of .find() is far more efficient since there is not the document fetching overhead.
The output is then simply merged with the existing document, where the other important distinction is that since these are "multiple queries" there is no way of providing the condition that they must "match" something. So this leaves us with considering there will be results where the count ( or array length ) is 0, and all we can do at this time is return a null value which we can later .filter() from the result array. Other methods of iterating the cursor employ the same basic principle of "discarding" results where we do not want them. But nothing stops the query being run on the server, and this filtering is "post processing" in some form or other.
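As a side note, a count-only variant of the same loop might look like the sketch below, where "overlapCount" is just an illustrative property name:
db.getCollection('collection').find().map( d => {
  // count the overlapping documents without fetching them
  var overlaps = db.getCollection('collection').count({
    "_id": { "$ne": d._id },
    "$or": [
      { "starttime": { "$gte": d.starttime, "$lte": d.endtime } },
      { "endtime": { "$gte": d.starttime, "$lte": d.endtime } }
    ]
  });
  return ( overlaps > 0 ) ? Object.assign(d, { "overlapCount": overlaps }) : null;
}).filter(e => e != null);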
Reducing Complexity
So the above approaches work with the structure as described, but of course the overall complexity requires that for each document you must essentially examine every other document in the collection in order to look for overlaps. Therefore whilst using $lookup allows for some "efficiency" in reduction of transport and response overhead, it still suffers the same problem that you are still essentially comparing each document to everything.
A better solution "where you can make it fit" is to instead store a "hard value" representative of the interval on each document. For instance we could "presume" that there are solid "booking" periods of one hour within a day for a total of 24 booking periods. This "could" be represented something like:
{ "_id": "A", "booking": [ 10, 11, 12 ] }
{ "_id": "B", "booking": [ 12, 13, 14 ] }
{ "_id": "C", "booking": [ 7, 8 ] }
{ "_id": "D", "booking": [ 9, 10, 11 ] }
With data organized like that where there was a set indicator for the interval the complexity is greatly reduced since it's really just a matter of "grouping" on the interval value from the array within the "booking" property:
db.booking.aggregate([
{ "$unwind": "$booking" },
{ "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
{ "$match": { "docs.1": { "$exists": true } } }
])
And the output:
{ "_id" : 10, "docs" : [ "A", "D" ] }
{ "_id" : 11, "docs" : [ "A", "D" ] }
{ "_id" : 12, "docs" : [ "A", "B" ] }
That correctly identifies that for the 10 and 11 intervals both "A" and "D" contain the overlap, whilst "B" and "A" overlap on 12. Other intervals and documents matching are excluded via the same $exists test except this time on the 1 index ( or second array element being present ) in order to see that there was "more than one" document in the grouping, hence indicating an overlap.
This simply employs the $unwind aggregation pipeline stage to "deconstruct/denormalize" the array content so we can access the inner values for grouping. This is exactly what happens in the $group stage where the "key" provided is the booking interval id and the $push operator is used to "collect" data about the current document which was found in that group. The $match is as explained earlier.
This can even be expanded for alternate presentation:
db.booking.aggregate([
{ "$unwind": "$booking" },
{ "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
{ "$match": { "docs.1": { "$exists": true } } },
{ "$unwind": "$docs" },
{ "$group": {
"_id": "$docs",
"intervals": { "$push": "$_id" }
}}
])
With output:
{ "_id" : "B", "intervals" : [ 12 ] }
{ "_id" : "D", "intervals" : [ 10, 11 ] }
{ "_id" : "A", "intervals" : [ 10, 11, 12 ] }
It's a simplified demonstration, but where the data you have would allow it for the sort of analysis required then this is the far more efficient approach. So if you can keep the "granularity" to be fixed to "set" intervals which can be commonly recorded on each document, then the analysis and reporting can use the latter approach to quickly and efficiently identify such overlaps.
Essentially, this is how you would implement what you basically mentioned as a "better" approach anyway, and the first being a "slight" improvement over what you originally theorized. See which one actually suits your situation, but this should explain the implementation and the differences.

Accelerate mongo update within two collections

I have a Payments collection with a playerId field, which is the _id key of the Person collection. I need to compute, once, the maximum payment of each person and save the value to that person's document. This is how I do it now:
db.Person.find().forEach( function(person) {
var cursor = db.Payment.aggregate([
{$match: {playerId: person._id}},
{$group: {
_id:"$playerId",
maxp: {$max:"$amount"}
}}
]);
var maxPay = 0;
if (cursor.hasNext()) {
maxPay = cursor.next().maxp;
}
person.maxPay = maxPay;
db.Person.save(person);
});
I suppose seeking maxPay on the Payments collection once for all Persons should be faster, but I don't know how to write that in code. Could you help me please?
You can run just a single aggregation pipeline operation which has a $lookup pipeline initially to do a "left join" on the Payment collection. This is necessary in order to get the data from the right collection (payments) embedded within the resulting documents as an array called payments.
The $unwind pipeline stage that follows deconstructs the embedded payments array, i.e. it will generate a new record for each and every element of the payments data field. It basically flattens the data, which will be useful for the next $group stage.
In this $group pipeline stage, you calculate your desired aggregates by applying the accumulator expression(s). If for instance your Person schema has other fields you wish to retain, then the $first accumulator operator should suffice in addition to the $max operator for the extra maxPay field.
UPDATE
Unfortunately, there is no operator to "include all fields" in the $group aggregation pipeline operation. This is because the $group pipeline step is mostly used to group and calculate/aggregate data from collection fields (sum, avg, etc.) and returning all the collection's fields is not the pipeline's intended purpose. The group pipeline operator is similar to the SQL's GROUP BY clause where you can't use GROUP BY unless you use any of the aggregation functions (accumulator operators in MongoDB). The same way, if you need to retain most fields, you have to use an aggregation function in MongoDB as well. In this case, you have to apply $first to each field you want to keep.
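For example, a sketch of that per-field approach, assuming the Person documents carry firstName and lastName fields (as in the commented demo fields further down):
{
  "$group": {
    "_id": "$_id",
    "maxPay": { "$max": "$payments.amount" },
    "firstName": { "$first": "$firstName" },
    "lastName": { "$first": "$lastName" }
  }
}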
You can also use the $$ROOT system variable which references the root document. Keep all fields of this document in a field within the $group pipeline, for example:
{
"$group": {
"_id": "$_id",
"maxPay": { "$max": "$payments.amount" },
"doc": { "$first": "$$ROOT" }
}
}
The drawback with this approach is you would need a further $project pipeline to reshape the fields so that they match the original schema because the documents from the resulting pipeline will have only three fields; _id, maxPay and the embedded doc field.
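On MongoDB 3.6 and later, one way to do that reshaping without listing every field explicitly is $replaceRoot with $mergeObjects, a sketch:
// promote the embedded "doc" back to the top level and merge in the computed maxPay
{
  "$replaceRoot": {
    "newRoot": { "$mergeObjects": [ "$doc", { "maxPay": "$maxPay" } ] }
  }
}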
The final pipeline stage, $out, writes the resulting documents of the aggregation pipeline to the same collection, akin to updating the Person collection by atomically replacing the existing collection with the new results collection. The $out operation does not change any indexes that existed on the previous collection. If the aggregation fails, the $out operation makes no changes to the pre-existing collection:
db.Person.aggregate([
{
"$lookup": {
"from": "Payment",
"localField": "_id",
"foreignField": "playerId",
"as": "payments"
}
},
{ "$unwind": {
"path": "$payments",
"preserveNullAndEmptyArrays": true
} },
{
"$group": {
"_id": "$_id",
"maxPay": { "$max": "$payments.amount" },
/* extra fields for demo purposes
"firstName": { "$first": "$firstName" },
"lastName": { "$first": "$lastName" }
*/
}
},
{ "$out": "Person" }
])

Get distinct records with specified fields that match a value, paginated

I'm trying to get all documents in my MongoDB collection
by distinct customer ids (custID)
where status code == 200
paginated (skip and limit)
return specified fields
var Order = mongoose.model('Order', orderSchema());
My original thought was to use mongoose db query, but you can't use distinct with skip and limit as Distinct is a method that returns an "array", and therefore you cannot modify something that is not a "Cursor":
Order
.distinct('request.headers.custID')
.where('response.status.code').equals(200)
.limit(limit)
.skip(skip)
.exec(function (err, orders) {
callback({
data: orders
});
});
So then I thought to use Aggregate, using $group to get distinct customerID records, $match to return all unique customerID records that have status code of 200, and $project to include the fields that I want:
Order.aggregate(
[
{
"$project" :
{
'request.headers.custID' : 1,
//other fields to include
}
},
{
"$match" :
{
"response.status.code" : 200
}
},
{
"$group": {
"_id": "$request.headers.custID"
}
},
{
"$skip": skip
},
{
"$limit": limit
}
],
function (err, order) {}
);
This returns an empty array though. If I remove the $project, only the request.headers.custID field is returned, when in fact I need more.
Any thoughts?
The thing you need to understand about aggregation pipelines is that the word "pipeline" generally means that each stage only receives the input that is emitted by the preceding stage in order of execution. The best analog to think of here is a "unix pipe" |, where the output of one command is "piped" to the other:
ps aux | grep mongo | tee out.txt
So aggregation pipelines work in much the same way as that, where the other main thing to consider is both $project and $group stages operate on only emitting those fields you ask for, and no others. This takes a little getting used to compared to declarative approaches like SQL, but with a little practice it becomes second nature.
Other things to get used to are that stages like $match are more important to place at the beginning of a pipeline than field selection. The primary reason for this is possible index selection and usage, which speeds things up immensely. Also, field selection via $project followed by $group is somewhat redundant, as both essentially select fields anyway, and they are usually best combined where appropriate.
Hence most optimally you do:
Order.aggregate(
[
{ "$match" : {
"response.status.code" : 200
}},
{ "$group": {
"_id": "$request.headers.custID", // the grouping key
"otherField": { "$first": "$otherField" },
// and so on for each field to select
}},
{ "$skip": skip },
{ "$limit": limit }
],
function (err, order) {}
);
The main thing to remember here about $group is that all fields other than _id ( which is the grouping key ) require the use of an accumulator to select, since there are in fact potentially multiple occurrences of the values for the grouping key.
In this case we are using $first as an accumulator, which will take the first occurrence from the grouping boundary. Commonly this is used following a $sort, but it does not need to be, just as long as you understand the behavior of what is selected.
Other accumulators like $max simply take the largest value of the field from within the values inside the grouping key, and are therefore independent of the "current record/document" unlike $first or $last. So it all depends on your needs.
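To make that concrete, a small sketch mixing the two, where "createdAt" is an assumed field used purely to give the $sort something meaningful to order by:
Order.aggregate(
  [
    { "$match": { "response.status.code": 200 } },
    // sort so that $first picks the most recent order per customer
    { "$sort": { "request.headers.custID": 1, "createdAt": -1 } },
    { "$group": {
      "_id": "$request.headers.custID",
      "latestOrder": { "$first": "$_id" },                   // order-dependent
      "highestStatus": { "$max": "$response.status.code" }   // order-independent
    }}
  ],
  function (err, results) {}
);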
Of course you can shortcut the selection in modern MongoDB releases after MongoDB 2.6 with the $$ROOT variable:
Order.aggregate(
[
{ "$match" : {
"response.status.code" : 200
}},
{ "$group": {
"_id": "$request.headers.custID", // the grouping key
"document": { "$first": "$$ROOT" }
}},
{ "$skip": skip },
{ "$limit": limit }
],
function (err, order) {}
);
Which would take a copy of all fields in the document and place them under the named key ( which is "document" in this case ). It's a shorter way to notate, but of course the resulting document has a different structure, being now all under the one key as sub-fields.
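If you later want the original top-level shape back, MongoDB 3.4 and later can promote that sub-document again, a sketch:
// replace each result document with the stored copy of the original
{ "$replaceRoot": { "newRoot": "$document" } }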
But as long as you understand the basic principles of a "pipeline" and don't exclude data you want to use in later stages by previous stages, then you generally should be okay.

Filtering and Matching Arrays within Arrays

I am looking to query a JSON document which has a nested array structure. Each design element has multiple SLID and status entries. I want to write a MongoDB query to get designs with the highest SLID and a status of "OLD".
Here is the sample JSON:
{
"_id" : ObjectId("55cddc30f1a3c59ca1e88f30"),
"designs" : [
{
"Deid" : 1,
"details" : [
{
"SLID" : 1,
"status" : "OLD"
},
{
"SLID" : 2,
"status" : "NEW"
}
]
},
{
"Deid" : 2,
"details" : [
{
"SLID" : 1,
"status" : "NEW"
},
{
"SLID" : 2,
"status" : "NEW"
},
{
"SLID" : 3,
"status" : "OLD"
}
]
}
]
}
In this sample the expected query should return the following, as it has the highest SLID with status "OLD":
{
"_id" : ObjectId("55cddc30f1a3c59ca1e88f30"),
"designs" : [
{
"Deid" : 2,
"details" : [
{
"SLID" : 3,
"status" : "OLD"
}
]
}
]
}
I have tried the following query but it kept returning the other details array element (which has status "NEW") along with the above element.
db.Collection.find({"designs": {$all: [{$elemMatch: {"details.status": "OLD"}}]}},
{"designs.details":{$slice:-1}})
Edit:
To summarize the problem:
The requirement is to get every design from the document set whose highest SLID (always the last item in the details array) has a status of "OLD".
Present Problem
What you should have been picking up from the previously linked question is that the positional $ operator itself is only capable of matching the first matched element within an array. When you have nested arrays like you do, then this means "always" the "outer" array can only be reported and never the actual matched position within the inner array nor any more than a single match.
Other examples show usage of the aggregation framework for MongoDB in order to "filter" elements from the array by generally processing with $unwind and then using conditions to match the array elements that you require. This is generally what you need to do in this case to get matches from your "inner" array. While there have been improvements since the first answers, your "last match" or effectively a "slice" condition, excludes other present possibilities. Therefore:
db.junk.aggregate([
{ "$match": {
"designs.details.status": "OLD"
}},
{ "$unwind": "$designs" },
{ "$unwind": "$designs.details" },
{ "$group": {
"_id": {
"_id": "$_id",
"Deid": "$designs.Deid"
},
"details": { "$last": "$designs.details"}
}},
{ "$match": {
"details.status": "OLD"
}},
{ "$group": {
"_id": "$_id",
"details": { "$push": "$details"}
}},
{ "$group": {
"_id": "$_id._id",
"designs": {
"$push": {
"Deid": "$_id.Deid",
"details": "$details"
}
}
}}
])
Which would return on your document or any others like it a result like:
{
"_id" : ObjectId("55cddc30f1a3c59ca1e88f30"),
"designs" : [
{
"Deid" : 2,
"details" : [
{
"SLID" : 3,
"status" : "OLD"
}
]
}
]
}
The key there being to $unwind both arrays and then $group back on the relevant unique elements in order to "slice" the $last element from each "inner" array.
The next of your conditions requires a $match in order to see that the "status" field of that element is the value that you want. Then of course, since the documents have been essentially "de-normalized" by the $unwind operations and even with the subsequent $group, the following $group statements re-construct the document into its original form.
Aggregation pipelines can either be quite simple or quite difficult depending on what you want to do, and reconstruction of documents with filtering like this means you need to take care in the steps, particularly if there are other fields involved. As you should also appreciate here, this process of $unwind to de-normalize and $group operations is not very efficient, and can cause significant overhead depending on the number of possible documents that can be met by the initial $match query.
Better Solution
While currently only available in the present development branch, there are some new operators available to the aggregation pipeline that make this much more efficient, and effectively "on par" with the performance of a general query. These are notably the $filter and $slice operators, which can be employed in this case as follows:
db.junk.aggregate([
{ "$match": {
"designs.details.status": "OLD"
}},
{ "$redact": {
"$cond": [
{ "$gt": [
{ "$size":{
"$filter": {
"input": "$designs",
"as": "designs",
"cond": {
"$anyElementTrue":[
{ "$map": {
"input": {
"$slice": [
"$$designs.details",
-1
]
},
"as": "details",
"in": {
"$eq": [ "$$details.status", "OLD" ]
}
}}
]
}
}
}},
0
]},
"$$KEEP",
"$$PRUNE"
]
}},
{ "$project": {
"designs": {
"$map": {
"input": {
"$filter": {
"input": "$designs",
"as": "designs",
"cond": {
"$anyElementTrue":[
{ "$map": {
"input": {
"$slice": [
"$$designs.details",
-1
]
},
"as": "details",
"in": {
"$eq": [ "$$details.status", "OLD" ]
}
}}
]
}
}
},
"as": "designs",
"in": {
"Deid": "$$designs.Deid",
"details": { "$slice": [ "$$designs.details", -1] }
}
}
}
}}
])
This effectively makes the operations just a $match and $project stage only, which is basically what is done with a general .find() operation. The only real addition here is a $redact stage, which allows the documents to be additionally filtered from the initial query condition by futher logical conditions that can inspect the document.
In this case, we can see that the document not only contains an "OLD" status, but also that at least one of the inner arrays has that status in its own last element; otherwise it is "pruned" from the results for not meeting that condition.
In both the $redact and $project, the $slice operator is used to get the last entry from the "details" array within the "designs" array. In the initial case it is applied with $filter to remove any elements where the condition did not match from the "outer" or "designs" array, and then later in the $project to just show the last element from the "designs" array in final presentation. That last "reshape" is done by $map to replace the whole arrays with the last element slice only.
Whilst the logic there seems much more long winded than the initial statement, the performance gain is potentially "huge" due to being able to treat each document as a "unit" without the need to denormalize or otherwise de-construct until the final projection is made.
Best Solution for Now
In summary, the current processes you can use to achieve the result are simply not efficient for solving the problem. It would in fact be more efficient to simply match the documents that meet the basic condition ( contain a "status" that is "OLD" ) in conjunction with a $where condition to test the last element of each array. However the actual "filtering" of the arrays in output is best left to client code:
db.junk.find({
"designs.details.status": "OLD",
"$where": function() {
return this.designs.some(function(design){
return design.details.slice(-1)[0].status == "OLD";
});
}
}).forEach(function(doc){
doc.designs = doc.designs.filter(function(design) {
return design.details.slice(-1)[0].status == "OLD";
}).map(function(design) {
design.details = design.details.slice(-1);
return design;
});
printjson(doc);
});
So the query condition at least only returns the documents that match all conditions, and then the client side code filters out the content from arrays by testing the last elements and then just slicing out that final element as the content to display.
Right now, that is probably the most efficient way to do this as it mirrors the future aggregation operation capabilities.
The problems are really in the structure of your data. While it may suit the display purposes of your application, the usage of nested arrays makes this notoriously difficult to query as well as "impossible" to atomically update due to the limitations of the positional operator mentioned before.
While you can always $push new elements to the inner array by matching its existence within, or just the presence of, the outer array element, what you cannot do is alter the "status" of an inner element in an atomic operation. In order to modify in such a way, you need to retrieve the entire document, then modify the contents in code, and save back the result.
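To make the distinction concrete, the $push that is possible might look like the sketch below, using the sample document's values and an illustrative new entry with SLID 4:
// atomic: the positional $ resolves the first matched "designs" element
db.junk.update(
  { "_id": ObjectId("55cddc30f1a3c59ca1e88f30"), "designs.Deid": 2 },
  { "$push": { "designs.$.details": { "SLID": 4, "status": "NEW" } } }
)
Changing the "status" of one specific inner element, by contrast, needs the full read-modify-write cycle just described.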
The problems with that process mean you are likely to "collide" on updates and possibly overwrite the changes made by another concurrent request to the one you are currently processing.
For these reasons you really should reconsider all of your design goals for the application and the suitability to such a structure. Keeping a more denormalized form may cost you in some other areas, but it sure would make things much more simple to both query and update with the kind of inspection level this seems to require.
The end conclusion here should be that you reconsider the design. Though getting your results is both possible now and in the future, the other operational blockers should be enough to warrant a change.