I've got MongoDB set up and running, and I'm issuing a query that looks like this:
db.indexCollection.aggregate([
{$match:
{'term': {$regex: 'gas'},
'term': {$regex: 'carbon'},
'term': {$regex: 'hydro'}}
},
{$sort:
{'documentFrequency': 1,
'postingsList._id.termFrequency': 1}
},
{$group:
{_id: {_id: '$postingsList._id.documentID'}}
}
],
{allowDiskUse: true});
The documentIDs that are returned should contain all of the three words (or subwords) specified in the match expression (logical AND), sorted by document frequency in ascending order (priority to small document frequencies) and term frequency in descending order (priority to documents that make the most use of these terms, thus have the highest term frequency).
However, I notice that when I change the value associated with 'postingsList._id.termFrequency' from 1 to -1, nothing changes, which makes me think that I'm doing something wrong with the sorting. The order of the output (just a list of documentIDs) does change if I change the value associated with 'documentFrequency'.
Any ideas?
EDIT:
Sample output format when I execute a simple find query on the key 'term',
{
"_id" : ObjectId("5941b6ad3de5cb1799f8ea26"),
"term" : "gases",
"documentFrequency" : 2,
"postingsList" : [
{
"documentID" : 8317982,
"termFrequency" : 4
},
{
"documentID" : 9587169,
"termFrequency" : 1
}
]
}
The output after the $group stage is not as you were hoping, because a $group stage's output may not reflect the order of the input data; to be specific, the documentation says:
$group does not order its output documents.
If you want the output to be ordered after you do a $group, then you need to put a $sort after the $group.
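A minimal sketch of that reordering, assuming each posting has the shape shown in the sample document (postingsList.documentID and postingsList.termFrequency). The $match stage is left out, and the documentFrequency and termFrequency fields on the grouped output, along with the choice of $min and $max to fill them, are illustrative assumptions rather than part of the original query.

```javascript
// Sketch only: sort AFTER grouping, because $group does not order its output.
const pipeline = [
  // One document per (term, posting) pair.
  { $unwind: '$postingsList' },
  // Collapse to one document per documentID, keeping the values to sort on.
  { $group: {
      _id: '$postingsList.documentID',
      documentFrequency: { $min: '$documentFrequency' },
      termFrequency: { $max: '$postingsList.termFrequency' }
  } },
  // Ascending document frequency, then descending term frequency.
  { $sort: { documentFrequency: 1, termFrequency: -1 } }
];

// In the mongo shell this would be run as:
// db.indexCollection.aggregate(pipeline, { allowDiskUse: true });
```

With the $sort as the last stage, flipping termFrequency between 1 and -1 should now change the order of the returned documentIDs.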
I am trying to apply a sort to the whole group and then limit the results.
But my mongoose code below applies the sort only to the limited subset.
collection.aggregate([
{ $sort : {NAME: -1}} ,
{ $match : { NAME : {$regex : `.*${query.NAME.toUpperCase()}.*`} } },
{ $group : { _id : "$NAME", NAME:{$first:"$NAME"} }},
{ $skip : 1},
{ $limit : 10}], function(err,data){});
That is, it sorts only the first 10 results in the group, instead of sorting everything and then showing the first 10 results.
Thanks in advance.
See this link to the documentation. I haven't tested this, since I have neither the environment nor the database to do so, but I believe you want to put your $sort stage after $group, just before $skip and $limit in the pipeline.
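As an untested sketch of that ordering: buildNamePipeline is a hypothetical helper name, and its skip and limit parameters stand in for the literal 1 and 10 from the question.

```javascript
// Hypothetical helper: builds the pipeline with $sort placed after $group,
// so the whole grouped result is ordered before paging is applied.
function buildNamePipeline(name, skip, limit) {
  return [
    // Filter first so less data flows through the later stages.
    { $match: { NAME: { $regex: `.*${name.toUpperCase()}.*` } } },
    { $group: { _id: '$NAME', NAME: { $first: '$NAME' } } },
    // Sort every group, then page through the sorted result.
    { $sort: { NAME: -1 } },
    { $skip: skip },
    { $limit: limit }
  ];
}

const namePipeline = buildNamePipeline('smith', 1, 10);
// Passed to mongoose in the same way as in the question, roughly:
// collection.aggregate(namePipeline, function (err, data) {});
```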
I have inserted a sample document
db.test.insert({
x:1,
a:[
{b:1,c:1,d:1},
{b:2,c:2}
]
})
I am facing two problems when I try to use the $filter aggregation operator, as in my query below:
db.test.aggregate(
{$project:{
a:{$filter:{
input : '$a',
as : 'item',
cond : '$$item.d'
}}
}}
)
Element Existence
1] How do I test the existence of element a.d, I found a way of just using cond : '$$item.d', but I think there should be a better way of doing it.
Selective Projection
2] How do I selectively project b and d nodes.
I tried the code below and it works, but I think there is a pipeline within the projection as well. I therefore applied the projection twice on the same node: once to filter the array elements, and once to select the fields of each array element.
db.test.aggregate(
{$project:{
a:{$filter:{
input : '$a',
as : 'item',
cond : '$$item.d'
}},
a:{b:1, d:1}
}}
)
I seem to get the solution, but I think there may be a better way. Thanks for reply!
(1) It appears to me that the $exists operator is not yet available in aggregation expressions such as the cond of $filter. You may wish to check whether there is a Jira ticket requesting this; if so, watch it and vote for it, and if not, add one.
Your workaround, I believe, will only return elements where item.d is truthy, rather than elements where it exists. So if item.d is null, false, or 0, the element will not be returned. I would suggest trying this instead:
cond : { $gte : [ '$$item.d', null ] }
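The difference between "truthy" and "exists" can be illustrated with a plain-JavaScript analogy. This is not MongoDB code, the sample elements are made up, and JavaScript truthiness is close to but not identical to how aggregation expressions coerce values to booleans.

```javascript
// Plain-JavaScript analogy, not MongoDB code.
const elements = [
  { b: 1, d: 1 },     // d present and truthy
  { b: 2, d: 0 },     // d present but falsy
  { b: 3, d: null },  // d present but null
  { b: 4 }            // d missing
];

// Analogous to cond: '$$item.d' (keeps only truthy values of d).
const truthyD = elements.filter(item => Boolean(item.d));

// Analogous to an existence test (keeps every element that has a d key).
const hasD = elements.filter(item => 'd' in item);

console.log(truthyD.map(item => item.b)); // [ 1 ]
console.log(hasD.map(item => item.b));    // [ 1, 2, 3 ]
```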
(2) I'm not 100% sure I understood the question, but if I do, I think the way to do it is to have two $project's in the pipeline. So something like this:
db.test.aggregate(
[ { $project:
{a:{$filter:{input:'$a',as:'item',cond:{$gte:['$$item.d',null]}}}}
},
{ $project: { a : { b : 1, d : 1 } } }
]
)
I'm trying to get all documents in my MongoDB collection
by distinct customer ids (custID)
where status code == 200
paginated (skipped and limit)
return specified fields
var Order = mongoose.model('Order', orderSchema());
My original thought was to use a mongoose query, but you can't combine distinct with skip and limit: distinct is a method that returns an array, and you cannot apply cursor modifiers to something that is not a cursor:
Order
.distinct('request.headers.custID')
.where('response.status.code').equals(200)
.limit(limit)
.skip(skip)
.exec(function (err, orders) {
callback({
data: orders
});
});
So then I thought to use Aggregate, using $group to get distinct customerID records, $match to return all unique customerID records that have status code of 200, and $project to include the fields that I want:
Order.aggregate(
[
{
"$project" :
{
'request.headers.custID' : 1,
//other fields to include
}
},
{
"$match" :
{
"response.status.code" : 200
}
},
{
"$group": {
"_id": "$request.headers.custID"
}
},
{
"$skip": skip
},
{
"$limit": limit
}
],
function (err, order) {}
);
This returns an empty array though. If I remove project, only $request.headers.custID field is returned when in fact I need more.
Any thoughts?
The thing you need to understand about aggregation pipelines is that the word "pipeline" means each stage only receives the input emitted by the preceding stage, in order of execution. The best analogy is the Unix pipe |, where the output of one command is "piped" to the next:
ps aux | grep mongo | tee out.txt
Aggregation pipelines work in much the same way. The other main thing to consider is that both $project and $group stages emit only the fields you ask for, and no others. This takes a little getting used to compared to declarative approaches like SQL, but with a little practice it becomes second nature.
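To see why the question's pipeline returned an empty array, here is a plain-JavaScript analogy (not MongoDB code). The two sample orders and the helper names projectCustId and matchOk are made up for illustration.

```javascript
// Plain-JavaScript analogy: each stage sees only the previous stage's output.
const orders = [
  { request: { headers: { custID: 'A' } }, response: { status: { code: 200 } } },
  { request: { headers: { custID: 'B' } }, response: { status: { code: 500 } } }
];

// "$project" analogue: keep only custID, dropping everything else.
const projectCustId = docs =>
  docs.map(doc => ({ request: { headers: { custID: doc.request.headers.custID } } }));

// "$match" analogue: keep documents whose status code is 200.
const matchOk = docs =>
  docs.filter(doc => doc.response && doc.response.status.code === 200);

// Project first: the status code is gone, so nothing matches.
console.log(matchOk(projectCustId(orders)).length); // 0

// Match first: the filter still sees the field it needs.
console.log(projectCustId(matchOk(orders)).length); // 1
```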
Another thing to get used to is that stages like $match are more important to place at the beginning of a pipeline than field selection. The primary reason is index selection and usage, which can speed things up immensely. Also, a $project followed by a $group is somewhat redundant, since both select fields, and the two are usually best combined where appropriate.
Hence, most optimally, you do:
Order.aggregate(
[
{ "$match" : {
"response.status.code" : 200
}},
{ "$group": {
"_id": "$request.headers.custID", // the grouping key
"otherField": { "$first": "$otherField" },
// and so on for each field to select
}},
{ "$skip": skip },
{ "$limit": limit }
],
function (err, order) {}
);
The main thing to remember about $group is that all fields other than _id (which is the grouping key) require an accumulator to select them, since there can always be multiple occurrences of the values for a grouping key.
In this case we are using $first as an accumulator, which takes the first occurrence within the grouping boundary. Commonly this is used following a $sort, but it does not need to be, as long as you understand the behavior of what is selected.
Other accumulators like $max simply take the largest value of the field among the documents sharing the grouping key, and are therefore independent of the "current record/document", unlike $first or $last. So it all depends on your needs.
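A plain-JavaScript analogy of the two kinds of accumulator (not MongoDB code). The sample rows and the output field names firstNote and maxTotal are made up for illustration.

```javascript
// Plain-JavaScript analogy of $group with $first and $max.
const rows = [
  { custID: 'A', total: 5, note: 'first A' },
  { custID: 'A', total: 9, note: 'second A' },
  { custID: 'B', total: 7, note: 'only B' }
];

const groups = new Map();
for (const row of rows) {
  const group = groups.get(row.custID);
  if (!group) {
    // $first analogue: values come from the first document seen for the key.
    groups.set(row.custID, { _id: row.custID, firstNote: row.note, maxTotal: row.total });
  } else {
    // $max analogue: depends only on the values, not on document order.
    group.maxTotal = Math.max(group.maxTotal, row.total);
  }
}

const grouped = [...groups.values()];
console.log(grouped);
```

For key 'A', firstNote comes from the first row while maxTotal comes from the second, which is why fields chosen with different accumulators need not describe the same source document.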
Of course, you can shortcut the selection in MongoDB 2.6 and later with the $$ROOT variable:
Order.aggregate(
[
{ "$match" : {
"response.status.code" : 200
}},
{ "$group": {
"_id": "$request.headers.custID", // the grouping key
"document": { "$first": "$$ROOT" }
}},
{ "$skip": skip },
{ "$limit": limit }
],
function (err, order) {}
);
Which would take a copy of all fields in the document and place them under the named key ( which is "document" in this case ). It's a shorter way to notate, but of course the resulting document has a different structure, being now all under the one key as sub-fields.
But as long as you understand the basic principles of a "pipeline" and don't exclude data you want to use in later stages by previous stages, then you generally should be okay.
I currently have objects in mongo set up like this for my application (simplified example, I removed some irrelevant fields for clarity here):
{
"_id" : ObjectId("529159af5b508dd71500000a"),
"c" : "somecontent",
"l" : [
{
"d" : "2013-11-24T01:43:11.367Z",
"u" : "User1"
},
{
"d" : "2013-11-24T01:43:51.206Z",
"u" : "User2"
}
]
}
What I would like to do is run a find query to return the objects which have the highest array length under "l" and sort highest->lowest, limit to 25 results. Some objects may have 1 object in the array, some may have 100. I'd like to find out which ones have the most under "l". I'm new to mongo and got everything else to work up until this point, but I just can't figure out the right parameters to get this specific query. Where I'm getting confused is how to handle counting the length of the array, sorting, etc. I could manually code this by parsing everything in the collection, but I'm sure there has to be a way for mongo to do this far more efficiently. I'm not against learning, if anyone knows any resources for more advanced queries or could help me out I'd really be thankful as this is the last piece! :-)
As a side note, node.js and mongo together is amazing and I wish I started using them in conjunction a long time ago.
Use the aggregation framework. Here's how:
db.collection.aggregate( [
{ $unwind : "$l" },
{ $group : { _id : "$_id", len : { $sum : 1 } } },
{ $sort : { len : -1 } },
{ $limit : 25 }
] )
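What those four stages compute can be sketched in plain JavaScript (an analogy, not MongoDB code; the three sample documents are made up). Note that a document whose "l" array is empty produces no unwound records, so it drops out of the result, which matches the default behavior of $unwind.

```javascript
// Plain-JavaScript analogy of $unwind + $group + $sort + $limit.
const docs = [
  { _id: 1, l: [{ u: 'User1' }, { u: 'User2' }] },
  { _id: 2, l: [{ u: 'User1' }] },
  { _id: 3, l: [] }
];

// $unwind analogue: one output record per array element.
const unwound = docs.flatMap(doc => doc.l.map(entry => ({ _id: doc._id, l: entry })));

// $group analogue: count records per _id.
const counts = new Map();
for (const rec of unwound) {
  counts.set(rec._id, (counts.get(rec._id) || 0) + 1);
}

// $sort (descending) and $limit analogue.
const top = [...counts.entries()]
  .map(([id, len]) => ({ _id: id, len }))
  .sort((a, b) => b.len - a.len)
  .slice(0, 25);

console.log(top); // [ { _id: 1, len: 2 }, { _id: 2, len: 1 } ]
```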
There is no easy way to do this with your existing schema. The reason is that there is nothing in MongoDB that lets a find query compute the length of an array. Yes, there is the $size operator, but it only matches arrays of one specific length.
So you cannot sort your find query by the length of the array. The only reasonable way out is to add a field to your schema that holds the length of the array (something like "l_length : 3" in addition to the other fields of every document). The good thing is that you can do this easily by looking at this relevant answer; after that, you just need to make sure to increment or decrement this value whenever you modify the array.
Once you add this field, you can easily sort by it and, moreover, take advantage of indexes.
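One way to keep such a counter in step with the array is to change both in the same update document, sketched below. appendEntryUpdate is a hypothetical helper name and someId is a placeholder; removals would need a matching decrement, which is only correct if exactly one element is actually removed.

```javascript
// Hypothetical helper: builds an update that appends to "l" and bumps
// the l_length counter in the same operation.
function appendEntryUpdate(entry) {
  return {
    $push: { l: entry },
    $inc: { l_length: 1 }
  };
}

const update = appendEntryUpdate({ d: '2013-11-24T01:43:51.206Z', u: 'User2' });

// In the mongo shell this would be applied roughly as:
// db.collection.update({ _id: someId }, update);
// and the sorted query becomes:
// db.collection.find().sort({ l_length: -1 }).limit(25);
```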
There is no straightforward way to do this,
but you can try adding a size field to your documents using $size:
$addFields to add a new field total holding the number of elements in the l array
$sort by total in descending order
$limit to keep the first 25 documents
$project to remove the total field if you don't need it
db.collection.aggregate([
{ $addFields: { total: { $size: "$l" } } },
{ $sort: { total: -1 } },
{ $limit: 25 }
// { $project: { total: 0 } }
])
Assuming I have the following document structures:
> db.logs.find()
{
'id': ObjectId("50ad8d451d41c8fc58000003"),
'name': 'Sample Log 1',
'uploaded_at': ISODate("2013-03-14T01:00:00+01:00"),
'case_id': '50ad8d451d41c8fc58000099',
'tag_doc': {
'group_x': ['TAG-1','TAG-2'],
'group_y': ['XYZ']
}
},
{
'id': ObjectId("50ad8d451d41c8fc58000004"),
'name': 'Sample Log 2',
'uploaded_at': ISODate("2013-03-15T01:00:00+01:00"),
'case_id': '50ad8d451d41c8fc58000099',
'tag_doc': {
'group_x': ['TAG-1'],
'group_y': ['XYZ']
}
}
> db.cases.findOne()
{
'id': ObjectId("50ad8d451d41c8fc58000099"),
'name': 'Sample Case 1'
}
Is there a way to perform a $match in the aggregation framework that retrieves only the latest log for each unique combination of case_id and group_x? I am sure this can be done with multiple $group stages but, as far as possible, I want to limit the number of documents that pass through the pipeline right away via the $match operator. I am thinking of something like the $max operator, except used in $match.
Any help is very much appreciated.
Edit:
So far, I can come up with the following:
db.logs.aggregate(
{$match: {...}}, // some match filters here
{$project: {tag:'$tag_doc.group_x', case:'$case_id', latest:{uploaded_at:1}}},
{$unwind: '$tag'},
{$group: {_id:{tag:'$tag', case:'$case'}, latest: {$max:'$latest'}}},
{$group: {_id:'$_id.tag', total:{$sum:1}}}
)
As I mentioned, what I want can be done with multiple $group stages, but this proves costly when handling a large number of documents. That is why I want to limit the documents as early as possible.
Edit:
I still haven't come up with a good solution, so I am wondering whether the document structure itself is not optimized for my use case. Do I have to change the fields to support what I want to achieve? Suggestions are very much appreciated.
Edit:
I am actually looking for an implementation in MongoDB similar to the one expected in How can I SELECT rows with MAX(Column value), DISTINCT by another column in SQL?, except that it involves two distinct field values. Also, the $match operation is crucial because it makes the resulting set dynamic, with filters ranging from matching tags to a range of dates.
Edit:
Due to the complexity of my use case I tried to use a simple analogy, but this proved to be confusing. The above is now a simplified form of the actual use case. Sorry for the confusion I created.
I have done something similar. It's not possible with $match alone, but it can be done with a single $group stage. The trick is to use a compound (multi-field) key with the correct sort order:
{ user_id: 1, address: "xyz", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }, { user_id: 1, address: "xyz2", date_sent: ISODate("2013-03-14T01:00:00+01:00"), message: "test" }
If I want to group on user_id and address, and I want the message with the latest date, we need to create a key like this:
{ user_id:1, address:1, date_sent:-1 }
Then you are able to run the aggregation without a sort, which is much faster and will work on shards with replicas. If you don't have a key with the correct sort order, you can add a $sort stage, but then you can't use it with shards, because everything is transferred to mongos and the grouping is done there (you will also run into memory limit problems).
db.user_messages.aggregate(
{ $match: { user_id:1 } },
{ $group: {
_id: "$address",
count: { $sum : 1 },
date_sent: { $max : "$date_sent" },
message: { $first : "$message" },
} }
);
It's not documented that it should work like this, but it does. We use it on a production system.
I'd use another collection to 'create' the search results on the fly - as new posts are posted - by upserting a document in this new collection every time a new blog post is posted.
Every new combination of author/tags is added as a new document in this collection, whereas a new post with an existing combination just updates an existing document with the content (or object ID reference) of the new blog post.
Example:
db.searchResult.update(
{'author_id':'50ad8d451d41c8fc58000099', 'tag_doc.tags': ["TAG-1", "TAG-2" ]},
{ $set: { 'Referenceid':ObjectId("5152bc79e8bf3bc79a5a1dd8")}}, // or embed your blog post here
{upsert:true}
)
Hmm, there is no good way of doing this optimally such that you only pick out the latest document of each author. Instead, you will need to pick out all documents, sorted, and then group on author:
db.posts.aggregate([
{$sort: {created_at:-1}},
{$group: {_id: '$author_id', tags: {$first: '$tag_doc.tags'}}},
{$unwind: '$tags'},
{$group: {_id: {author: '$_id', tag: '$tags'}}}
]);
As you said, this is not optimal; however, it is all I have come up with.
If I am honest, if you need to perform this query often it might actually be better to pre-aggregate another collection that already contains the information you need in the form of:
{
_id: {},
author: {},
tag: 'something',
created_at: ISODate(),
post_id: {}
}
And each time you create a new post, you seek out all documents in this unique collection that fulfill an $in query of what you need, and then update/upsert created_at and post_id in that collection. This would be more optimal.
Here you go:
db.logs.aggregate(
{"$sort" : { "uploaded_at" : -1 } },
{"$match" : { ... } },
{"$unwind" : "$tag_doc.group_x" },
{"$group" : { "_id" : { "case" :'$case_id', tag:'$tag_doc.group_x'},
"latest" : { "$first" : "$uploaded_at"},
"Name" : { "$first" : "$name" }, // the source field is lowercase 'name' in the sample documents
"tag_doc" : { "$first" : "$tag_doc"}
}
}
);
You want to avoid $max when you can $sort and take $first, especially if you have an index on uploaded_at, which would allow you to avoid any in-memory sorts and reduce the pipeline processing costs significantly. Obviously, if you have other "data" fields, you would add them along with (or instead of) "Name" and "tag_doc".