Is there a way to use the $sort operator twice within a single aggregation pipeline?
I know that using a singular $sort with two keys works properly, i.e. sort by the first key, then the second.
My current project requires multiple $sort stages to exist, for example
db.collection.aggregate([
    { $sort: { "age": 1 } },
    { $sort: { "score": -1 } }
])
Currently, the second stage doesn't respect the result of the first stage. Is there any workaround for that?
Is it possible to, for example, assign each document a new field 'index' after the first stage, storing its index within the current array of results, and use that field in the second $sort stage?
You can use multiple keys in a single $sort:
db.collection.aggregate([
    { "$sort": { "age": 1, "score": -1 } }
])
Here is a Mongo playground link you can refer to:
https://mongoplayground.net/p/ZaRX_XNSXhu
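As for the "index" idea from the question: you can materialize each document's position after the first sort and use it as a tie-breaker in the second, although the single multi-key $sort above is the simpler tool. A sketch of that approach, assuming MongoDB 3.4+ for $replaceRoot (note that $group with $push: "$$ROOT" gathers the whole result into one document, so this runs into the 16MB document limit on large collections):
db.collection.aggregate([
    { $sort: { "age": 1 } },
    // collect the sorted documents into one array; each element's
    // position records the order produced by the first $sort
    { $group: { _id: null, docs: { $push: "$$ROOT" } } },
    // unwind with includeArrayIndex to expose that position as a field
    { $unwind: { path: "$docs", includeArrayIndex: "index" } },
    // the second sort can now use "index" to break ties
    { $sort: { "docs.score": -1, "index": 1 } },
    // restore the original document shape
    { $replaceRoot: { newRoot: "$docs" } }
])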
I have a collection with about 50,000 items, with indexes created on e.g. name and _id.
If I use db.items.find().sort({ name: 1, _id: 1 })
or:
db.items.aggregate([
    { $match: {} },
    { $sort: { name: 1, _id: 1 } }
])
then it exceeds the RAM limit: "Executor error during find command :: caused by :: Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit." and I have to pass { allowDiskUse: true } to aggregate if I want this to work.
However, when I use a $group stage in the aggregation pipeline, it does not exceed the RAM limit and it works:
db.items.aggregate([
    { $match: {} },
    { $sort: { name: 1, _id: 1 } },
    { $group: {
        _id: 1,
        x: { $push: { _id: '$_id' } }
    }}
])
Why is this happening with $sort alone, but not with $sort + $group?
I have a theory that it's connected to this feature:
If a pipeline sorts and groups by the same field and the $group stage only uses the $first accumulator operator, consider adding an index on the grouped field which matches the sort order. In some cases, the $group stage can use the index to quickly find the first document of each group.
While the pipeline optimizations and the way things "actually" run are a black box, this is the only thing I can think of (that is mentioned in the docs, at least).
I'm assuming this "optimization" kicks in, making the $group stage utilize the index, meaning the pipeline might be holding less memory as it uses the index for the scan. Eventually you're also not returning the name field, making the total result smaller.
Again, this is pure speculation, but it's the best I've got.
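Whatever the optimizer does internally, the error message itself points to the two practical fixes: create an index that matches the sort pattern, or let the blocking sort spill to disk. For example:
// an index matching the sort pattern lets $sort walk the index
// instead of sorting everything in memory
db.items.createIndex({ name: 1, _id: 1 })

// alternatively, allow the in-memory sort to spill to disk
db.items.aggregate(
    [ { $match: {} }, { $sort: { name: 1, _id: 1 } } ],
    { allowDiskUse: true }
)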
I have this aggregation query in MongoDB:
db.questions.aggregate([
    { $project: {
        question: 1, detail: 1, choices: 1, answer: 1,
        percent_false: {
            $multiply: [100, { $divide: ["$answear_false", { $add: ["$answear_false", "$answear_true"] }] }]
        },
        percent_true: {
            $multiply: [100, { $divide: ["$answear_true", { $add: ["$answear_false", "$answear_true"] }] }]
        }
    }},
    { $match: { status: 'active' } }
]).pretty()
I want to use $match on the two computed fields "percent_true" and "percent_false", like this:
$match : {percent_true:{$gte:20}}
How can I do that?
Since the aggregation framework works in stages, you can treat the computed fields as if they were normal fields, because from the $match stage's perspective they are normal.
{ $project: {
    question: 1, detail: 1, choices: 1, answer: 1,
    // keep status in the projection, otherwise the later $match on it can never succeed
    status: 1,
    percent_false: {
        $multiply: [100, { $divide: ["$answear_false", { $add: ["$answear_false", "$answear_true"] }] }]
    },
    percent_true: {
        $multiply: [100, { $divide: ["$answear_true", { $add: ["$answear_false", "$answear_true"] }] }]
    }
}},
{ $match: {
    status: 'active',
    percent_true: { $gte: 20 }
    // when documents get fed to $match they already have a percent_true
    // field, so you can match on it as normal
}}
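On MongoDB 3.4 and later, $addFields is a convenient alternative to $project here, since it appends the computed fields while keeping all existing ones, so status and the other fields don't need to be re-listed:
db.questions.aggregate([
    { $addFields: {
        percent_true: {
            $multiply: [100, { $divide: ["$answear_true", { $add: ["$answear_false", "$answear_true"] }] }]
        }
    }},
    // status survives $addFields untouched, so this match works as-is
    { $match: { status: 'active', percent_true: { $gte: 20 } } }
])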
I'm trying to get all documents in my MongoDB collection
by distinct customer ids (custID)
where status code == 200
paginated (skip and limit)
return specified fields
var Order = mongoose.model('Order', orderSchema());
My original thought was to use a Mongoose query, but you can't use distinct with skip and limit, as distinct is a method that returns an array rather than a cursor, so there is nothing to chain those cursor methods on:
Order
.distinct('request.headers.custID')
.where('response.status.code').equals(200)
.limit(limit)
.skip(skip)
.exec(function (err, orders) {
callback({
data: orders
});
});
So then I thought to use aggregate, using $group to get distinct customerID records, $match to return all unique customerID records that have a status code of 200, and $project to include the fields that I want:
Order.aggregate(
[
{
"$project" :
{
'request.headers.custID' : 1,
//other fields to include
}
},
{
"$match" :
{
"response.status.code" : 200
}
},
{
"$group": {
"_id": "$request.headers.custID"
}
},
{
"$skip": skip
},
{
"$limit": limit
}
],
function (err, order) {}
);
This returns an empty array, though. If I remove the $project stage, only the request.headers.custID field is returned, when in fact I need more.
Any thoughts?
The thing you need to understand about aggregation pipelines is that the word "pipeline" generally means each stage receives only the input emitted by the preceding stage, in order of execution. The best analogy to think of here is the Unix pipe |, where the output of one command is "piped" to the next:
ps aux | grep mongo | tee out.txt
So aggregation pipelines work in much the same way, and the other main thing to consider is that both $project and $group stages emit only those fields you ask for, and no others. This takes a little getting used to compared to declarative approaches like SQL, but with a little practice it becomes second nature.
Another thing to get used to is that stages like $match are more important to place at the beginning of a pipeline than field selection. The primary reason for this is possible index selection and usage, which speeds things up immensely. Also, a $project followed by a $group is somewhat redundant, as both essentially select fields anyway, and they are usually best combined where appropriate.
Hence, most optimally, you do:
Order.aggregate(
[
{ "$match" : {
"response.status.code" : 200
}},
{ "$group": {
"_id": "$request.headers.custID", // the grouping key
"otherField": { "$first": "$otherField" },
// and so on for each field to select
}},
{ "$skip": skip },
{ "$limit": limit }
],
function (err, order) {}
);
The main thing to remember about $group here is that all fields other than _id (the grouping key) require an accumulator to select them, since there can in fact be multiple occurrences of the values for the grouping key.
In this case we are using $first as an accumulator, which takes the first occurrence from the grouping boundary. Commonly this is used following a $sort, but it does not need to be, as long as you understand the behavior of what is selected.
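For example, a minimal sketch of pairing a $sort with $first, assuming a createdAt field (hypothetical here, standing in for whatever you would order by) so each group keeps the value from the newest matching order:
Order.aggregate(
    [
        { "$match": { "response.status.code": 200 } },
        // sort so the newest document per custID arrives first at $group
        { "$sort": { "request.headers.custID": 1, "createdAt": -1 } },
        { "$group": {
            "_id": "$request.headers.custID",
            // $first now deterministically picks from the newest document
            "otherField": { "$first": "$otherField" }
        }}
    ],
    function (err, order) {}
);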
Other accumulators like $max simply take the largest value of the field from within the values inside the grouping key, and are therefore independent of the "current record/document", unlike $first or $last. So it all depends on your needs.
Of course you can shortcut the selection in modern MongoDB releases (2.6 and later) with the $$ROOT variable:
Order.aggregate(
[
{ "$match" : {
"response.status.code" : 200
}},
{ "$group": {
"_id": "$request.headers.custID", // the grouping key
"document": { "$first": "$$ROOT" }
}},
{ "$skip": skip },
{ "$limit": limit }
],
function (err, order) {}
);
That takes a copy of all fields in the document and places them under the named key ("document" in this case). It's a shorter way to notate, but of course the resulting document has a different structure, with everything now under the one key as sub-fields.
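If that nested structure is inconvenient, then on MongoDB 3.4 and later a $replaceRoot stage appended after the $group promotes the sub-document back to the top level:
// promote the grouped sub-document back to being the whole document
{ "$replaceRoot": { "newRoot": "$document" } }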
But as long as you understand the basic principles of a "pipeline" and don't exclude data you want to use in later stages by previous stages, then you generally should be okay.
I want to run a query to get all Articles that have more than 6 com entries, and then sort by the length of the com list.
For this I'm doing:
ArticleModel.objects.filter(com__6__exists=True).order_by('-com.length')[:50]
com is supposed to be a ListField, but the ordering does not work. How can I fix it? Thanks.
Standard queries cannot do this, as the "sort" needs to be done on a physical field present in the document. The best way to do this is to actually keep a count of your "list" as another field in the document. That also makes your query more efficient, as that "counter" field can be indexed, so the basic query becomes (raw MongoDB syntax):
{ "comLength": { "$gt": 6 } }
If you cannot or do not want to change the document structure then the only way to otherwise sort on the length of your list is to $project it via .aggregate():
ArticleModel._get_collection().aggregate([
{ "$match": { "com.6": { "$exists": true } }},
{ "$project": {
"com": 1,
"otherField": 1,
"comLength": { "$size": "$com" }
}},
{ "$sort": { "comLength": -1 } }
])
And that considers that you have MongoDB 2.6 at least for the use of the $size aggregation operator. If you don't then you have to $unwind and $group in order to calculate the length of arrays:
ArticleModel._get_collection().aggregate([
{ "$match": { "com.6": { "$exists": true } }},
{ "$unwind": "$com" },
{ "$group": {
"_id": "$_id",
"otherField": { "$first": "$otherField" }
"com": { "$push": "$com" },
"comLength": { "$sum": 1 }
}},
{ "$sort": { "comLength": -1 } }
])
So if you are going to go down that route, then take a good look at the documentation, since you are possibly not used to the raw MongoDB syntax and have been using the query DSL that MongoEngine provides.
Overall, only the aggregation providers in .aggregate() or .mapReduce() can actually "create a field" that is not present within the document. There is also no test for the "current" length available to standard projection or sorting of documents.
Your best option is to add another field and keep it in sync with the actual array length. Failing that, the above shows you the general approach.
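Keeping such a counter in sync is simple if every write that pushes to the array also increments it in the same atomic update. A sketch in raw MongoDB syntax (the collection and variable names here are illustrative):
db.articles.update(
    { "_id": articleId },
    {
        "$push": { "com": newComment },   // add the new list entry
        "$inc": { "comLength": 1 }        // and bump the counter in the same operation
    }
)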
If you're creating the database and you know such a request will be made a lot, it's recommended to add a "com_length" field to ArticleModel and have it filled in automatically on every save by overriding the save() method.
Add this inside your ArticleModel in models.py:
def save(self, *args, **kwargs):
self.com_length = len(self.com)
return super(ArticleModel, self).save(*args, **kwargs)
then for requesting the asked question:
ArticleModel.objects.filter(com__6__exists=True).order_by('-com_length')[:50]
How can I get and return the first element in an array using a Mongo aggregation?
I tried using this code:
db.my_collection.aggregate([
{ $project: {
resp : { my_field: { $slice: 1 } }
}}
])
but I get the following error:
uncaught exception: aggregate failed: {
"errmsg" : "exception: invalid operator '$slice'",
"code" : 15999,
"ok" : 0
}
Note that 'my_field' is an array of 4 elements, and I only need to return the first element.
Since 3.2, we can use $arrayElemAt to get the first element in an array
db.my_collection.aggregate([
{ $project: {
resp : { $arrayElemAt: ['$my_field',0] }
}}
])
Currently, the $slice operator is unavailable in the $project stage of the aggregation pipeline.
So what you could do is,
First $unwind the my_field array, then group the documents back together and take the $first element of each group.
db.my_collection.aggregate([
{$unwind:"$my_field"},
{$group:{"_id":"$_id","resp":{$first:"$my_field"}}},
{$project:{"_id":0,"resp":1}}
])
Or use the find() command, where you can make use of the $slice operator in the projection part.
db.my_collection.find({},{"my_field":{$slice:1}})
Update: based on your comments, say you want only the second item in an array, for the record with a particular _id:
var field = 2;
var id = ObjectId("...");
Then the below aggregation command gives you the 2nd item in the my_field array of the record with that _id:
db.my_collection.aggregate([
{$match:{"_id":id}},
{$unwind:"$my_field"},
{$skip:field-1},
{$limit:1}
])
The above logic cannot be applied to more than one record, since that would involve a $group operator after the $unwind. The $group operator produces a single record for all the records in a particular group, making $limit or $skip operators applied in later stages ineffective.
A small variation on the find() query above would yield the expected result as well:
db.my_collection.find({},{"my_field":{$slice:[field-1,1]}})
Apart from these, there is always a way to do it on the client side, though a bit costly if the number of records is very large:
var field = 2;
db.my_collection.find().map(function(doc){
return doc.my_field[field-1];
})
Choosing from the above options depends upon your data size and app design.
Starting Mongo 4.4, the aggregation operator $first can be used to access the first element of an array:
// { "my_field": ["A", "B", "C"] }
// { "my_field": ["D"] }
db.my_collection.aggregate([
{ $project: { resp: { $first: "$my_field" } } }
])
// { "resp" : "A" }
// { "resp" : "D" }
The $slice operator is scheduled to be made available in the $project operation in Mongo 3.1.4, according to this ticket: https://jira.mongodb.org/browse/SERVER-6074
This will make the problem go away.
This version is currently only a developer release and is not yet stable (as of July 2015). Expect this around October/November time.
From Mongo 3.1.6 onwards:
db.my_collection.aggregate([
{
"$project": {
"newArray" : { "$slice" : [ "$oldarray" , 0, 1 ] }
}
}
])
where 0 is the start index and 1 is the number of elements to slice.
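Note that this form of $slice also accepts a negative position to count from the end of the array, so for instance the last element can be taken the same way (still returned as a one-element array):
db.my_collection.aggregate([
    {
        "$project": {
            // -1 means "the last 1 element" of the array
            "lastElem": { "$slice": [ "$oldarray", -1 ] }
        }
    }
])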