Does order matter within a MongoDB $match block having $text search?

I am running an aggregate query whose pipeline contains a $match, a $group, a $project, and then a $sort stage. I am wondering whether placing the $text search block first versus last within my $match block makes any performance difference. Going by the Robo 3T response time metric I am not noticing a difference, and I wanted to confirm whether that observation actually holds.
The query looks something like this:
db.COLLECTION.aggregate([
    {$match: {$text: {$search: 'xyz'}, ...}},
    {$group: {...}},
    {$project: {...}},
    {$sort: {...}}
])
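One way to compare the two orderings more rigorously than Robo 3T's response-time readout is to ask the server for an execution plan for each variant and compare the reported plans and statistics. A minimal sketch, where the extra filter field and grouping key are hypothetical placeholders (older servers may only report queryPlanner details for aggregation explains):

db.COLLECTION.explain("executionStats").aggregate([
    // $text listed first inside $match
    {$match: {$text: {$search: 'xyz'}, otherField: 'value'}},
    {$group: {_id: '$someKey', count: {$sum: 1}}}
])

db.COLLECTION.explain("executionStats").aggregate([
    // $text listed last inside $match
    {$match: {otherField: 'value', $text: {$search: 'xyz'}}},
    {$group: {_id: '$someKey', count: {$sum: 1}}}
])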

Related

Mongoose aggregate pipeline: sorting indexed date in MongoDB is slow

I've been working around this error for some time in my app and was hoping someone could lend a hand finding the error in this aggregation query.
I'm using a docker container running MongoDB shell version v4.2.8. The app uses an Express.js backend with Mongoose middleware to interface with the database.
I want to make an aggregation pipeline that first matches by an indexed field called 'platform_number'. We then sort that by the indexed field 'date' (stored as an ISODate type). The remaining pipeline does not seem to influence the performance; it's just some projections and filtering.
{$sort: {date: -1}} bottlenecks the entire aggregation, even though only around 250 documents are returned. I do have an unindexed key called 'cycle_number' that correlates directly with the 'date' field. Replacing {date: -1} with {cycle_number: -1} speeds up the query, but then I get an out-of-memory error: sorting has a 100 MB RAM cap, and this sort fails even with 250 documents.
A possible solution would be to include the additional option { "allowDiskUse": true }. But before I do, I want to know why 'date' isn't sorting properly in the first place. Another option would be to index 'cycle_number' but again, why does 'date' throw up its hands?
The aggregation pipeline is provided below. It is first a match, followed by the sort and so on. I'm happy to explain what the other functions are doing, but they don't make much difference when I comment them out.
let agg = [ {$match: {platform_number: platform_number}} ] // platform_number is indexed
agg.push({$sort: {date: -1}}) // date is indexed in descending order
if (xaxis && yaxis) {
    agg.push(helper.drop_missing_bgc_keys([xaxis, yaxis]))
    agg.push(helper.reduce_bgc_meas([xaxis, yaxis]))
}
const query = Profile.aggregate(agg)
query.exec(function (err, profiles) {
    if (err) return next(err)
    if (profiles.length === 0) { res.send('platform not found') }
    else {
        res.json(profiles)
    }
})
Once again, I've been tiptoeing around this issue for some time. Solving the issue would be great, but understanding the issue better is also awesome. Thank you for your help!
The query executor is not able to use a different index for the second stage. MongoDB indexes map the key values to the location of documents in the data files.
Once the $match stage has completed, the documents are in the pipeline, so no further index use is possible.
However, if you create a compound index on {platform_number:1, date:-1} the query planner can combine the $match and $sort stages into a single stage that will not require a blocking sort, which should greatly improve the performance of this pipeline.
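A minimal sketch of that suggestion, assuming the Mongoose Profile model maps to a profiles collection and using a hypothetical platform number:

// Compound index covering the equality match and the descending sort.
db.profiles.createIndex({platform_number: 1, date: -1})

// The $match + $sort prefix of the pipeline can now be answered by an index scan,
// so no in-memory (blocking) sort is needed; explain() should no longer show a SORT stage.
db.profiles.aggregate([
    {$match: {platform_number: 12345}},   // hypothetical platform_number
    {$sort: {date: -1}}
    // ... remaining projection/filter stages as before
])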

MongoDB optimal query

My use case is different. I am trying to map it to user and orders for easy understanding.
I have to get the following for a user:
For each department
    For each order type
        delivered count
        unique orders
Unique order count means the user might have ordered the same product more than once, but it must still count as 1. I have the background logic for that and identify duplicates via order ids.
db.getCollection('user_orders').aggregate([{"user_id":123},
{$group: {"_id": {"department":"$department", "order_type":"$order_type"},
"del_count":{$sum:"$del_count"},
"unique_order":{$addToSet:{"unique_order":"$unique_order"}}}},
{$project: {"_id":0,
"department":"$_id.department",
"order_type_name":"$_id.order_type",
"unique_order_count": {$size:"$unique_order"},
"del_count":"$del_count"
}},
{$group: {"_id":"$department",
order_types: {$addToSet:
{"order_type_name":"$order_type_name",
"unique_order_count": "$unique_order_count",
"del_count":"$del_count"
}}}}
])
Sorry for my query formatting.
This query is working absolutely fine. I added the second grouping to bring the responses together for all order types of the same department.
Can I do the same with fewer pipeline stages, or in a more efficient way?
The $project stage appears to be redundant, but removing it is more of a refactoring than a performance improvement. Your simplified pipeline can look like this:
db.getCollection('user_orders').aggregate([
    {$match: {"user_id": 123}},
    {$group: {"_id": {"department": "$department", "order_type": "$order_type"},
              "del_count": {$sum: "$del_count"},
              "unique_order": {$addToSet: {"unique_order": "$unique_order"}}}},
    {$group: {"_id": "$_id.department",
              order_types: {$addToSet: {"order_type_name": "$_id.order_type",
                                        "unique_order_count": {$size: "$unique_order"},
                                        "del_count": "$del_count"}}}}
])

Can an index on a subfield cover queries on projections of that field?

Imagine you have a schema like:
[{
name: "Bob",
naps: [{
time: 2019-05-01T15:35:00,
location: "sofa"
}, ...]
}, ...
]
So lots of people, each with a few dozen naps. You want to find out 'what days do people take the most naps?', so you index naps.time, and then query with:
aggregate([
    {$unwind: "$naps"},
    {$group: {_id: {$dayOfWeek: "$naps.time"}, napsOnDay: {$sum: 1}}}
])
But when doing explain(), mongo tells me no index was used in this query, when clearly the index on the naps.time Date field could have been used. Why is this? How can I get mongo to use the index for a more optimal query?
Indexes store pointers to actual documents, and can only be used when working with a material document (i.e. a document that is actually stored on disk).
$match and $sort do not mutate the actual documents, so indexes can be used in these stages.
In contrast, $unwind, $group, or any other stage that changes the actual document representation basically loses the connection between the index and the material documents.
Additionally, when those stages are processed without $match, you're basically saying that you want to process the whole collection. There is no point in using the index if you want to process the whole collection.
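As a hedged illustration of that last point, the naps.time index can still help if the pipeline starts with a $match that narrows things down while the documents are still in their stored form, e.g. restricting naps to a date range before unwinding (the collection name and dates below are hypothetical):

db.people.aggregate([
    // A leading $match operates on the material documents, so the multikey index
    // on naps.time can be used here to avoid scanning the whole collection.
    {$match: {"naps.time": {$gte: ISODate("2019-05-01"), $lt: ISODate("2019-06-01")}}},
    // From here on the documents are reshaped, so no index can help these stages.
    {$unwind: "$naps"},
    {$group: {_id: {$dayOfWeek: "$naps.time"}, napsOnDay: {$sum: 1}}}
])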

Efficient pagination of MongoDB aggregation?

For efficiency, the Mongo documentation recommends that limit statements immediately follow sort statements, thus ending up with the somewhat nonsensical:
collection.find(f).sort(s).limit(l).skip(p)
I say this is somewhat nonsensical because it seems to say take the first l items, and then drop the first p of those l. Since p is usually larger than l, you'd think you'd end up with no results, but in practice you end up with l results.
Aggregation works more as you'd expect:
collection.aggregate({$unwind: u}, {$group: g},{$match: f}, {$sort: s}, {$limit: l}, {$skip: p})
returns 0 results if p>=l.
collection.aggregate({$unwind: u}, {$group: g}, {$match: f}, {$sort: s}, {$skip: p}, {$limit: l})
works, but the documentation seems to imply that this will fail if the match returns a result set that's larger than working memory. Is this true? If so, is there a better way to perform pagination on a result set returned through aggregation?
Source: the "Changed in version 2.4" comment at the end of this page: http://docs.mongodb.org/manual/reference/operator/aggregation/sort/
In MongoDB, cursor modifier methods (i.e. when using find()) like limit, sort, and skip can be applied in any order => the order does not matter. find() returns a cursor, and the modifiers are applied to it; sort is always done before skip, and skip before limit. In other words, the effective order is: sort -> skip -> limit.
The aggregation framework does not return a DB cursor. Instead, it returns a document containing the results of the aggregation. It works by producing intermediate results at each step of the pipeline, so the order of operations really matters.
I guess MongoDB does not make the order of cursor modifier methods significant because of the way it is implemented internally.
You can't paginate on the result of the aggregation framework because there is only a single document with the results. You can still paginate on a regular query by using skip and limit, but a better practice is to use a range query, because it can make efficient use of an index.
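A rough sketch of such a range ("keyset") pagination for the find() case, reusing the f and l placeholders from the question and assuming the sort is on _id (any indexed, unique sort key works the same way):

// First page: plain indexed sort + limit.
var firstPage = collection.find(f).sort({_id: 1}).limit(l).toArray()
var lastId = firstPage[firstPage.length - 1]._id

// Next page: resume after the last _id seen instead of skipping p documents,
// so MongoDB seeks into the index rather than scanning and discarding skipped documents.
collection.find({$and: [f, {_id: {$gt: lastId}}]}).sort({_id: 1}).limit(l)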
UPDATE:
Since v2.6 Mongo aggregation framework returns a cursor instead of a single document. Compare: v2.4 and v2.6.
The documentation seems to imply that this (aggregation) will fail if the match returns a result set that's larger than working memory. Is this true?
No. You can, for example, aggregate on a collection that is larger than physical memory without even using the $match operator. It might be slow, but it should work. There is no problem if $match returns something that is larger than RAM.
Here are the actual pipeline limits.
http://docs.mongodb.org/manual/core/aggregation-pipeline-limits/
The $match operator on its own does not cause memory problems. As stated in the documentation, $group and $sort are the usual villains: they are cumulative, and might require access to the entire input set before they can produce any output. If they load too much data into physical memory, they will fail.
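If a $group or $sort does hit that limit, the usual escape hatch (the same { "allowDiskUse": true } option mentioned in an earlier question above) is to let those stages spill to temporary files on disk. A minimal sketch using the placeholders from the question:

// allowDiskUse lets memory-hungry stages ($group, $sort) write temporary files
// instead of failing when they exceed the per-stage memory limit.
collection.aggregate(
    [{$unwind: u}, {$group: g}, {$match: f}, {$sort: s}, {$skip: p}, {$limit: l}],
    {allowDiskUse: true}
)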
If so, is there a better way to perform pagination on a result set returned through aggregation?
It has been correctly said that you cannot "paginate" (apply $skip and $limit) on the result of the aggregation, because it is simply a MongoDB document. But you can "paginate" on the intermediate results of the aggregation pipeline.
Using $limit on the pipeline will help keep the result set within the 16 MB bound, the maximum BSON document size. Even if the collection grows, you should be safe.
Problems could arise with $group and, especially, $sort. You can create "sort friendly" indexes to deal with them if they do actually happen. Have a look at the documentation on indexing strategies.
http://docs.mongodb.org/manual/tutorial/sort-results-with-indexes/
Finally, be aware that $skip does not help with performance. On the contrary, it tends to slow down the application, since it forces MongoDB to scan every skipped document to reach the desired point in the collection.
http://docs.mongodb.org/manual/reference/method/cursor.skip/
MongoDB's recommendation that $sort precede $limit is absolutely right: when the two are adjacent, the operation only needs enough memory to keep the top n results.
It's just that the solution you propose doesn't fit your use case, which is pagination.
You can modify your query to get the benefit of this optimization.
collection.aggregate([
{
$unwind: u
},
{
$group: g
},
{
$match: f
},
{
$sort: s
},
{
$limit: l+p
},
{
$skip: p
}
]);
or for find query
collection.find(f).sort(s).limit(l+p).skip(p)
Though, as you can see, with deep pagination the memory use will still grow more and more, even with this optimization.

Apply function and sort in MongoDB without MapReduce

I have an interesting problem. I have a working M/R version of this, but it's not really a viable solution in a small-scale environment since it's too slow and the query needs to be executed in real time.
I would like to iterate over each element in a collection and score it, sort by score descending, limit to the top 10 and return the results to the application.
Here is the function I'd like applied to each document in pseudo code.
// Score one document by summing the weights of its tags.
function scoreDocument(document, someMap) {
    var score = 0;
    document.Tags.forEach(function (tag) {
        score += someMap[tag];   // someMap: { tag -> weight }
    });
    return score;
}
Since your someMap is changing each time, I don't see any alternative other than to score all the documents and return the highest-scoring ones. Whatever method you adopt for this type of operation, you'll have to consider all the documents in the collection, which is going to be slow, and will become more and more costly as the collection you're scanning grows.
One issue with map reduce is that each mongod instance can only run one concurrent map reduce. This is a limitation of the javascript engine, which is single-threaded. Multiple map reduces will be interleaved, but they cannot run concurrently with one another. This means that if you're relying on map reduce for "real-time" uses, that is, if your web page has to run a map reduce to render, you'll eventually hit a limit where page load times become unacceptably slow.
You can work around this by querying all the documents into your application, and doing the scoring, sorting, and limiting in your application code. Queries in MongoDB can run concurrently, unlike map reduce, though of course this means that your application servers will have to do a lot of work.
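A rough sketch of that application-side workaround in shell-style JavaScript, using the foo collection from the aggregation example below and the Tags field from the pseudo code (someMap is whatever weights you have at request time):

// Fetch the candidate documents, score them in application code,
// then sort by score descending and keep the top 10.
var top10 = db.foo.find({}, {Tags: 1}).toArray()
    .map(function (doc) {
        var score = 0;
        (doc.Tags || []).forEach(function (tag) {
            score += someMap[tag] || 0;   // someMap: { tag -> weight }
        });
        return {_id: doc._id, score: score};
    })
    .sort(function (a, b) { return b.score - a.score; })
    .slice(0, 10);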
Finally, if you are willing to wait for MongoDB 2.2 to be released (which should be within a few months), you can use the new aggregation framework in place of map reduce. You'll have to massage the someMap to generate the correct pipeline steps. Here's an example of what this might look like if someMap were {"a": 5, "b": 2}:
db.runCommand({aggregate: "foo",
pipeline: [
{$unwind: "$tags"},
{$project: {
tag1score: {$cond: [{$eq: ["$tags", "a"]}, 5, 0]},
tag2score: {$cond: [{$eq: ["$tags", "b"]}, 3, 0]}}
},
{$project: {score: {$add: ["$tag1score", "$tag2score"]}}},
{$group: {_id: "$_id", score: {$sum: "$score"}}},
{$sort: {score: -1}},
{$limit: 10}
]})
This is a little complicated, and bears explaining:
First, we "unwind" the tags array, so that the following steps in the pipeline process documents where "tags" is a scalar -- the value of the tag from the array -- and all the other document fields (notably _id) are duplicated for each unwound element.
We use a projection operator to convert from tags to named score fields. The $cond/$eq expression for each roughly means (for the tag1score example) "if the value in the document's 'tags' field is equal to 'a', then return 5 and assign that value to a new field tag1score, else return 0 and assign that". This expression would be repeated for each tag/score combination in your someMap. At this point in the pipeline, each document will have N tagNscore fields, but at most one of them will have a non-zero value.
Next we use another projection operator to create a score field whose value is the sum of the tagNscore fields in the document.
Next we group the documents by their _id, and sum up the value of the score field from the previous step across all documents in each group.
We sort by score, descending (i.e. greatest scores first)
We limit to only the top 10 scores.
I'll leave it as an exercise to the reader how to convert someMap into the correct set of projections in step 2, and the correct set of fields to add in step 3.
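For what it's worth, here is a hedged sketch of that exercise: a small (hypothetical) helper that builds the two $project stages of steps 2 and 3 from an arbitrary someMap, following the tagNscore naming used above.

// Build the per-tag $cond projection (step 2) and the summing projection (step 3)
// from a weight map such as {"a": 5, "b": 2}.
function buildScoringStages(someMap) {
    var condProjection = {};
    var scoreFields = [];
    var i = 1;
    Object.keys(someMap).forEach(function (tag) {
        var field = "tag" + i + "score";
        condProjection[field] = {$cond: [{$eq: ["$tags", tag]}, someMap[tag], 0]};
        scoreFields.push("$" + field);
        i += 1;
    });
    return [
        {$project: condProjection},
        {$project: {score: {$add: scoreFields}}}
    ];
}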
This is essentially the same set of steps that your application code or map reduce would go through, but it has two distinct advantages: unlike map reduce, the aggregation framework is fully implemented in C++ and is faster and more concurrent; and unlike pulling all the documents into your application, the aggregation framework works with the data on the server side, saving network load. But like the other two approaches, it will still have to consider each document, and can only limit the result set once the score has been calculated for all of them.