Mongo View Not Showing Same Indexing Speed Improvement As in Targeted Collection - mongodb

I have created a mongo view that basically targets documents on an "accounts" collection, specifically documents where the value of "transactions.amounts.balance" is greater than zero. So that looks like this:
{"transactions.amounts.balance": { $gt : 0 }}
Because results took a long time to return, I added an index on this field in the collection the view works with. Now, when I run this query directly on the collection, the results return much more quickly -- in less than a second, instead of the 9 seconds it took before adding the index.
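For reference, the index that was added presumably looks something like this (index options assumed):

db.accounts.createIndex({ "transactions.amounts.balance": 1 })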
However, I don't seem to be getting the same performance improvement in the mongo view I've created, which, among other things, runs this same query against the same collection.
My understanding is that a view will inherit all of the indexes that have been created on the collection it targets. So, if that's the case, why am I not seeing any kind of performance improvement in the mongo view? Am I missing something?
By the way, when I check the input and output of each stage of my aggregation pipeline, sure enough, this is the stage that takes about 9 seconds to return results:
{ "transactions.amounts.balance" : { "$gt" : 0.0 } }
Why is this query stage so much slower in my view than when run directly on the collection it targets? Is there something else I can do to help speed up the execution of this stage?
Here are the first few steps of the aggregation pipeline in my mongo view:
db.accounts.aggregate(
    // Pipeline
    [
        // Stage 1
        {
            $unwind: {
                "path": "$transactions"
            }
        },
        // Stage 2
        {
            $match: {
                "transactions.amounts.balance": {
                    "$gt": 0.0
                }
            }
        },
        // Stage 3
        {
            $addFields: {
                "openBalance": "$transactions.amounts.balance"
            }
        }
    ]
)
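For context, the view itself would have been defined over this pipeline with something along these lines (the view name is assumed):

db.createView("accountsWithOpenBalance", "accounts", [
    { $unwind: { path: "$transactions" } },
    { $match: { "transactions.amounts.balance": { $gt: 0.0 } } },
    { $addFields: { openBalance: "$transactions.amounts.balance" } }
])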

According to the documentation, $match will only use an index if it is placed at the beginning of the pipeline, with no preceding stages:
If you place a $match at the very beginning of a pipeline, the query
can take advantage of indexes like any other db.collection.find() or
db.collection.findOne().
Since you unwind your documents first, $match won't use the index, which you should also see in the explain() plan.
Depending on your data (specifically, if many documents do not contain any matching entry in the transactions.amounts.balance array), it can help performance to simply duplicate the $match filter and place one copy at the very beginning of the pipeline to eliminate some documents early, as sketched below. In the best case (again, this depends on your data), the number of documents reaching the second $match stage will be low enough that it no longer hurts performance.
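A sketch of that duplicated-$match pipeline, using the collection and field names from the question and assuming the index on "transactions.amounts.balance" exists:

db.accounts.aggregate([
    // Stage 0: placed first so it can use the index to discard accounts with no positive balance at all
    { $match: { "transactions.amounts.balance": { $gt: 0.0 } } },
    // Stage 1: unwind as before
    { $unwind: { path: "$transactions" } },
    // Stage 2: repeat the filter per unwound transaction (this one cannot use the index)
    { $match: { "transactions.amounts.balance": { $gt: 0.0 } } },
    // Stage 3
    { $addFields: { openBalance: "$transactions.amounts.balance" } }
])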

Related

This question is about the $match and $sort optimization in MongoDB.

{
    "_id" : ObjectId("62c3aa311984f666ef75d1n7"),
    "eventCode" : "332",
    "time" : 1657008013000.0,
    "dat" : "61558575921c023a93f81362"
}
This is what a document looks like. Now I need to calculate some values, for which I am using an aggregation pipeline, and the first operators I use are $match and $sort:
{
    $match: {
        dat: { $regex: "^" + eventStat.dat },
        time: {
            $gte: eventStat.time.from,
            $lte: eventStat.time.to,
        },
    },
},
{ $sort: { time: 1 } }
So I am using these two operators first in the pipeline.
Now, the MongoDB documentation says that aggregation will always apply $match before $sort, but in some cases it performs the sort first. I am not sure, but I think that happens when there is an index on the field used in the sort that is not present in the match, and MongoDB decides it is better to sort first.
Here I am using time in both the match and the sort, so I want to know: is there still any case where the sort might happen before the match?
If yes, I read that a dummy $project operator can force it to match first, but what exactly is a dummy $project operator?
Most questions about how the database is executing a query can be answered (or at least further reasoned about) by inspecting the explain plan(s) associated with the operation(s). Let's first address a few of your statements directly before turning to inspect explain plans ourselves.
Now, the MongoDB documentation says that aggregation will always apply $match before $sort
Where does it say this?
In general, all databases are required to provide results that are semantically valid relative to the query that the client issued. This gets mentioned often when SQL is being discussed as it is a "declarative language". This means that users describe what data they want rather than how to retrieve that data.
MongoDB's aggregation framework is a bit less declarative than SQL. Or, said another way, the aggregation framework is a little more descriptive about how to do things. This is because the order in which the stages of a pipeline are defined helps define the semantics of the results. If, for example, one were to $project out a field first and then attempt to use that (no longer present) field in a subsequent stage (such as a $match or $group), MongoDB would not make any adjustments to how it processes the pipeline to make that field available to that later stage. This is because the user specifically requested the removal of that field earlier in the pipeline, which is part of the semantics of the overall pipeline.
Based on this (and another factor that we will talk about next), I would be surprised to see any documentation suggesting that the database always performs a match stage before a sort stage.
but in some cases it performs the sort first ... I think that happens when there is an index on the field used in the sort that is not present in the match, and MongoDB decides it is better to sort first.
Again returning to generalizations about all databases, one of their primary jobs is to return data to clients as efficiently as possible. So as long as the approach to executing the query does not logically change the results based on the semantics expressed by the client, the database can gather the results in any manner that it thinks will be the most effective.
For aggregation specifically, this most commonly means that stages will either get reordered or combined altogether for execution. Some of the changes that the database will attempt to do are outlined on the Aggregation Pipeline Optimization page.
Logically, filtering data and then sorting it yields the same results as sorting the data and then filtering it. So, indeed, one of the optimizations outlined on that page is reordering $match and $sort stages.
The important thing to keep in mind here is mentioned at the very top of that page. The database "attempts to reshape the pipeline for improved performance", but how effective these adjustments are depends on other factors. The biggest factor for many of them is the presence (or absence) of an associated index to support the (reordered) pipeline.
Here I am using time in both the match and the sort, so I want to know: is there still any case where the sort might happen before the match?
Unless you are explicitly forcing the database to use a particular plan (such as by hinting), there is always a chance that it will choose to do something unexpected. Databases are quite good at picking optimal plans, though, and are always improving with each new release, so ideally we would let the system do its work and not try to do that work for the database (with hints or otherwise). In your particular situation, I believe we can design an approach that is highly optimized for both the $match and the $sort, setting them up for success.
If yes, I read that a dummy $project operator can force it to match first, but what exactly is a dummy $project operator?
It sounds like this is also asking about other ways in which we could manually influence plan selection. We are going to stay away from that as it is fragile, not something we should rely on long term, and unnecessary for our purposes anyway.
Inspecting Explain
So what happens if we have an index on { time: 1 } and we run the aggregation? Well, the explain output (on 6.0) shows us the following:
queryPlanner: {
  parsedQuery: {
    '$and': [
      { time: { '$lte': 100 } },
      { time: { '$gte': 0 } },
      { dat: { '$regex': '^ABC' } }
    ]
  },
  ...
  winningPlan: {
    stage: 'FETCH',
    filter: { dat: { '$regex': '^ABC' } },
    inputStage: {
      stage: 'IXSCAN',
      keyPattern: { time: 1 },
      indexBounds: { time: [ '[0, 100]' ] }
      ...
    }
  },
Notice that there is no $sort stage at all. What has happened is that the database realized that it could use the { time: 1 } index to do two things at the same time:
Filter the data according to the range predicates on the time field.
Walk the index in the requested sort order without having to manually do so.
So if we go back to the original question of whether aggregation will perform the match or the sort first, we now see that a third option is for the database to do both at the same time!
At the very least, you should have an index on { time: 1 }.
Ideally you would also include the other field (dat) in a compound index. There is a bit of a wrinkle here in that you are currently applying a regex operator against that field. If the filter were a direct equality match, the guidance would be easy (prepend dat: 1 as the first key in the compound index).
Without knowing more about your situation, it's unclear which of the two compound indexes the database could use more effectively to support this operation. If the regex filter on dat is highly selective, then { dat: 1, time: 1 } will probably be ideal. It will require a manual sort, but that can all be done after scanning the index before retrieving the full documents. If the regex filter on dat is not very selective, then { time: 1, dat: 1 } may be ideal. This would prevent the need to manually sort, but will result in some additional index key scanning.
In either case, examining explain output may be helpful in finding the approach that is best suited for your particular situation.
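As a rough sketch (the collection name "events" and the filter values below are assumptions, not from the question), you could create both candidate indexes and compare the explain output for each:

db.events.createIndex({ dat: 1, time: 1 })
db.events.createIndex({ time: 1, dat: 1 })

// Check which index the winning plan uses, whether a blocking SORT stage appears,
// and how many index keys and documents are examined
db.events.explain("executionStats").aggregate([
    { $match: { dat: { $regex: "^ABC" }, time: { $gte: 0, $lte: 100 } } },
    { $sort: { time: 1 } }
])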

Mongoose aggregate pipeline: sorting indexed date in MongoDB is slow

I've been wrestling with this issue in my app for some time and was hoping someone could lend a hand finding the problem with this aggregation query.
I'm using a docker container running MongoDB shell version v4.2.8. The app uses an Express.js backend with Mongoose middleware to interface with the database.
I want to make an aggregation pipeline that first matches by an indexed field called 'platform_number'. We then sort that by the indexed field 'date' (stored as an ISODate type). The rest of the pipeline does not seem to influence the performance; it's just some projections and filtering.
{$sort: {date: -1}} bottlenecks the entire aggregate, even though only around 250 documents are returned. I do have an unindexed key called 'cycle_number' that correlates directly with the 'date' field. Replacing {date: -1} with {cycle_number: -1} speeds up the query, but then I get an out-of-memory error. Sorting has a 100MB RAM cap, and this sort fails with just 250 documents.
A possible solution would be to include the additional option { "allowDiskUse": true }. But before I do, I want to know why 'date' isn't sorting properly in the first place. Another option would be to index 'cycle_number' but again, why does 'date' throw up its hands?
The aggregation pipeline is provided below. It is first a match, followed by the sort and so on. I'm happy to explain what the other functions are doing, but they don't make much difference when I comment them out.
let agg = [ {$match: {platform_number: platform_number}} ] // indexed number
agg.push({$sort: {date: -1}}) // date is indexed in descending order
if (xaxis && yaxis) {
    agg.push(helper.drop_missing_bgc_keys([xaxis, yaxis]))
    agg.push(helper.reduce_bgc_meas([xaxis, yaxis]))
}
const query = Profile.aggregate(agg)
query.exec(function (err, profiles) {
    if (err) return next(err)
    if (profiles.length === 0) { res.send('platform not found') }
    else {
        res.json(profiles)
    }
})
Once again, I've been tiptoeing around this issue for some time. Solving the issue would be great, but understanding it better is also awesome. Thank you for your help!
The query executor is not able to use a different index for the second stage. MongoDB indexes map the key values to the location of documents in the data files.
Once the $match stage has completed, the documents are in the pipeline, so no further index use is possible.
However, if you create a compound index on {platform_number:1, date:-1} the query planner can combine the $match and $sort stages into a single stage that will not require a blocking sort, which should greatly improve the performance of this pipeline.
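A minimal sketch of that index and one way to verify the plan (the underlying collection name "profiles" and the example platform_number value are assumptions):

db.profiles.createIndex({ platform_number: 1, date: -1 })

// The winning plan should now be an IXSCAN on this index with no blocking SORT stage
db.profiles.explain("executionStats").aggregate([
    { $match: { platform_number: 12345 } },   // example value
    { $sort: { date: -1 } }
])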

How to speed up aggregate queries in MongoDB

I am running examples of aggregate queries similar to this:
https://www.compose.com/articles/aggregations-in-mongodb-by-example/
db.mycollection.aggregate([
    { $match: { "nested.field": "1110" } },
    {
        $group: {
            _id: null,
            total: {
                $sum: "$nested.field"
            },
            average_transaction_amount: {
                $avg: "$nested.field"
            },
            min_transaction_amount: {
                $min: "$nested.field"
            },
            max_transaction_amount: {
                $max: "$nested.field"
            }
        }
    }
]);
One collection that I created has 5,000,000 large JSON documents (around 1,000 key->value pairs, some of them nested).
Before adding an index on one nested field, a count on that field took around 5 minutes.
After adding the index, the count takes less than a second (which is good).
Now I am trying to do a SUM or AVG or any of the others like the example above, and it takes minutes (not seconds).
Is there a way to improve aggregate queries in MongoDB?
Thanks!
Unfortunately, $group currently does not use indexes in MongoDB; only $sort and $match can take advantage of indexes. So the query as you wrote it is as optimized as it can be.
There are a couple of things you could do. For max and min, you could just query them instead of using the aggregation framework: put an index on nested.field, sort by it, and take just one document. The same index supports sorting either ascending or descending, as sketched below.
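A sketch of the find()-based max/min, assuming an index on "nested.field" exists:

// Max: sort descending on the indexed field and take one document
db.mycollection.find({}, { "nested.field": 1 }).sort({ "nested.field": -1 }).limit(1)

// Min: same index, sort ascending
db.mycollection.find({}, { "nested.field": 1 }).sort({ "nested.field": 1 }).limit(1)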
If you have any control over when the data is inserted, and the query is as simple as it looks, you could keep track of the data yourself. You could have a collection in mongo keyed by the "Id" (or whatever you are grouping on) with fields for "total" and "count". You increment them on inserts, and then getting the total and average becomes a fast query (see the sketch below). Not sure if that's an option for your situation, but it's the best you can do.
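A hypothetical sketch of that bookkeeping (the summary collection and field names are illustrative, not from the question):

// On every insert, bump the running total and count for the group key
db.transaction_summaries.updateOne(
    { _id: "1110" },                      // the value you would otherwise $group on
    { $inc: { total: 42.5, count: 1 } },  // add this insert's amount and bump the count
    { upsert: true }
)

// Reading the total and average later is a single cheap lookup
var s = db.transaction_summaries.findOne({ _id: "1110" })
var average = s.total / s.count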
Generally, mongo is super fast. In my opinion, the only place it's not quite as good as SQL is aggregation. The benefits heavily outweigh the struggles to me. I generally maintain separate reporting collections for this kind of situation, as I recommended.

Iterating over distinct items in one field in MongoDB

I have a very large collection (~7M items) in MongoDB, primarily consisting of documents with three fields.
I'd like to be able to iterate over all the unique values for one of the fields, in an expedient manner.
Currently, I'm querying for just that field, and then processing the returned results by iterating on the cursor for uniqueness. This works, but it's rather slow, and I suspect there must be a better way.
I know mongo has the db.collection.distinct() function, but this is limited by the maximum BSON size (16 MB), which my dataset exceeds.
Is there any way to iterate over something similar to the db.collection.distinct(), but using a cursor or some other method, so the record-size limit isn't as much of an issue?
I think something like the map/reduce functionality might be suited to this kind of thing, but I don't really understand the map-reduce paradigm in the first place, so I have no idea what I'm doing. The project I'm working on is partially to learn about working with different database tools, so I'm rather inexperienced.
I'm using PyMongo if it's relevant (I don't think it is). This should be mostly dependent on MongoDB alone.
Example:
For this dataset:
{"basePath" : "foo", "internalPath" : "Neque", "itemhash": "49f4c6804be2523e2a5e74b1ffbf7e05"}
{"basePath" : "foo", "internalPath" : "porro", "itemhash": "ffc8fd5ef8a4515a0b743d5f52b444bf"}
{"basePath" : "bar", "internalPath" : "quisquam", "itemhash": "cf34a8047defea9a51b4a75e9c28f9e7"}
{"basePath" : "baz", "internalPath" : "est", "itemhash": "c07bc6f51234205efcdeedb7153fdb04"}
{"basePath" : "foo", "internalPath" : "qui", "itemhash": "5aa8cfe2f0fe08ee8b796e70662bfb42"}
What I'd like to do is iterate over just the basePath field. For the above dataset, this means I'd iterate over foo, bar, and baz just once each.
I'm not sure if it's relevant, but the DB I have is structured so that while each field is not unique, the aggregate of all three is unique (this is enforced with an index).
The query and filter operation I'm currently using (note: I'm restricting the query to a subset of the items to reduce processing time):
self.log.info("Running path query")
itemCursor = self.dbInt.coll.find({"basePath": pathRE}, fields={'_id': False, 'internalPath': False, 'itemhash': False}, exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
for item in itemCursor:
# print item
items.add(item["basePath"])
self.log.info("total unique items = %s", len(items))
Running the same query with self.dbInt.coll.distinct("basePath") results in OperationFailure: command SON([('distinct', u'deduper_collection'), ('key', 'basePath')]) failed: exception: distinct too big, 16mb cap
Ok, here is the solution I wound up using. I'd add it as an answer, but I don't want to detract from the actual answers that got me here.
reStr = "^%s" % fqPathBase
pathRE = re.compile(reStr)
self.log.info("Running path query")
pipeline = [
{ "$match" :
{
"basePath" : pathRE
}
},
# Group the keys
{"$group":
{
"_id": "$basePath"
}
},
# Output to a collection "tmp_unique_coll"
{"$out": "tmp_unique_coll"}
]
itemCursor = self.dbInt.coll.aggregate(pipeline, allowDiskUse=True)
itemCursor = self.dbInt.db.tmp_unique_coll.find(exhaust=True)
self.log.info("Query complete. Processing")
self.log.info("Query returned %d items", itemCursor.count())
self.log.info("Filtering returned items to require uniqueness.")
items = set()
retItems = 0
for item in itemCursor:
retItems += 1
items.add(item["_id"])
self.log.info("Recieved items = %d", retItems)
self.log.info("total unique items = %s", len(items))
General performance compared to my previous solution is about 2X in terms of wall-clock time. On a query that returns 834273 items, with 11467 uniques:
Original method (retrieve, stuff into a Python set to enforce uniqueness):
real 0m22.538s
user 0m17.136s
sys 0m0.324s
Aggregate pipeline method :
real 0m9.881s
user 0m0.548s
sys 0m0.096s
So while the overall execution time is only ~2X better, the aggregation pipeline is massively more performant in terms of actual CPU time.
Update:
I revisited this project recently, and rewrote the DB layer to use a SQL database, and everything was much easier. A complex processing pipeline is now a simple SELECT DISTINCT(colName) WHERE xxx operation.
Realistically, MongoDB and NoSQL databases in general are very much the wrong type of database for what I'm trying to do here.
From the discussion points so far I'm going to take a stab at this. And I'm also noting that as of writing, the 2.6 release for MongoDB should be just around the corner, good weather permitting, so I am going to make some references there.
Oh and the FYI that didn't come up in chat, .distinct() is an entirely different animal that pre-dates the methods used in the responses here, and as such is subject to many limitations.
And this is finally a solution for 2.6 and up, or any current dev release over 2.5.3.
The alternative for now is to use mapReduce, because the only restriction there is the output size.
Without going into the inner workings of distinct, I'm going to go on the presumption that aggregate is doing this more efficiently [and even more so in the upcoming release].
db.collection.aggregate([
// Group the key and increment the count per match
{$group: { _id: "$basePath", count: {$sum: 1} }},
// Hey you can even sort it without breaking things
{$sort: { count: 1 }},
// Output to a collection "output"
{$out: "output"}
])
So we are using the $out pipeline stage to get the final result, which is over 16MB, into a collection of its own. There you can do what you want with it.
As 2.6 is "just around the corner" there is one more tweak that can be added.
Use allowDiskUse from the runCommand form, where each stage can use disk and not be subject to memory restrictions.
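A rough sketch of that runCommand form (the collection name is assumed; on modern servers you can also pass allowDiskUse directly as an aggregate() option):

db.runCommand({
    aggregate: "mycollection",
    pipeline: [
        { $group: { _id: "$basePath", count: { $sum: 1 } } },
        { $sort: { count: 1 } },
        { $out: "output" }
    ],
    allowDiskUse: true,
    cursor: {}    // required on 3.6+ servers
})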
The main point here is that this is nearly live for production, and the performance will be better than the same operation in mapReduce. So go ahead and play. Install 2.5.5 for your own use now.
A mapReduce, in the current version of Mongo, would avoid the problem of the results exceeding 16MB.
map = function() {
    if (this['basePath']) {
        emit(this['basePath'], 1);
    }
    // if basePath always exists you can just call the emit directly:
    // emit(this.basePath, 1);
};

reduce = function(key, values) {
    return Array.sum(values);
};
For each document the basePath is emitted with a single value representing the count of that value. The reduce simply creates the sum of all the values. The resulting collection would have all unique values for basePath along with the total number of occurrences.
And, as you'll need to store the results to prevent an error, use the out option, which specifies a destination collection:
db.yourCollectionName.mapReduce(
map,
reduce,
{ out: "distinctMR" }
)
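To read the distinct values (and their counts) back out of the output collection, something like:

db.distinctMR.find().forEach(function (doc) {
    print(doc._id + " occurs " + doc.value + " times");
});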
@Neil Lunn's answer could be simplified:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}])
$project filters fields for you. In particular, '_id': 0 filters out the _id field.
Result still too large? Batch it with $limit and $skip:
field = 'basePath' # Field I want
db.collection.aggregate( [{'$project': {field: 1, '_id': 0}}, {'$limit': X}, {'$skip': Y}])
I think the most scalable solution is to perform a query for each unique value. The queries must be executed one after the other, and each query will give you the "next" unique value based on the previous query result. The idea is that the query will return you one single document, that will contain the unique value that you are looking for. If you use the proper projection, mongo will just use the index loaded into memory without having to read from disk.
You can define this strategy using $gt operator in mongo, but you must take into account values like null or empty strings, and potentially discard them using the $ne or $nin operator. You can also extend this strategy using multiple keys, using operators like $gte for one key and $gt for the other.
This strategy should give you the distinct values of a string field in alphabetical order, or distinct numerical values sorted ascendingly.
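A minimal sketch of this strategy in the mongo shell, assuming an index on { basePath: 1 } so each query can be answered from the index:

var last = null;
while (true) {
    var filter = (last === null)
        ? { basePath: { $ne: null } }          // skip missing/null values on the first pass
        : { basePath: { $gt: last } };         // everything after the previously seen value
    var cursor = db.collection.find(filter, { basePath: 1, _id: 0 })
                              .sort({ basePath: 1 })
                              .limit(1);
    if (!cursor.hasNext()) {
        break;                                 // no more distinct values
    }
    last = cursor.next().basePath;
    print(last);                               // process the distinct value here
}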

Efficient pagination of MongoDB aggregation?

For efficiency, the Mongo documentation recommends that limit statements immediately follow sort statements, thus ending up with the somewhat nonsensical:
collection.find(f).sort(s).limit(l).skip(p)
I say this is somewhat nonsensical because it seems to say take the first l items, and then drop the first p of those l. Since p is usually larger than l, you'd think you'd end up with no results, but in practice you end up with l results.
Aggregation works more as you'd expect:
collection.aggregate({$unwind: u}, {$group: g},{$match: f}, {$sort: s}, {$limit: l}, {$skip: p})
returns 0 results if p>=l.
collection.aggregate({$unwind: u}, {$group: g}, {$match: f}, {$sort: s}, {$skip: p}, {$limit: l})
works, but the documentation seems to imply that this will fail if the match returns a result set that's larger than working memory. Is this true? If so, is there a better way to perform pagination on a result set returned through aggregation?
Source: the "Changed in version 2.4" comment at the end of this page: http://docs.mongodb.org/manual/reference/operator/aggregation/sort/
In MongoDB, cursor methods (i.e. when using find()) like limit, sort, and skip can be applied in any order => the order does not matter. A find() returns a cursor to which the modifications are applied. Sort is always done before limit, and skip is done before limit as well. So, in other words, the order is: sort -> skip -> limit.
The aggregation framework does not return a DB cursor. Instead it returns a document with the results of the aggregation. It works by producing intermediate results at each step of the pipeline, and thus the order of operations really matters.
I guess MongoDB does not honor the order of cursor modifier methods because of the way it is implemented internally.
You can't paginate on the result of the aggregation framework because there is only a single document with results. You can still paginate on a regular query by using skip and limit, but a better practice would be to use a range query because of its efficiency in using an index.
UPDATE:
Since v2.6 Mongo aggregation framework returns a cursor instead of a single document. Compare: v2.4 and v2.6.
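A sketch of the range-query ("keyset") pagination mentioned above, assuming an index on the sort field and using _id to define the range (the filter is illustrative, not from the question):

var pageSize = 20;

// First page
var page = db.collection.find({ status: "active" })      // hypothetical filter
                        .sort({ _id: 1 })
                        .limit(pageSize)
                        .toArray();
var lastId = page[page.length - 1]._id;

// Next page: resume after the last _id seen instead of skipping
db.collection.find({ status: "active", _id: { $gt: lastId } })
             .sort({ _id: 1 })
             .limit(pageSize)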
The documentation seems to imply that this (aggregation) will fail if the match returns a result set that's larger than working memory. Is this true?
No. You can, for example, aggregate on a collection that is larger than physical memory without even using the $match operator. It might be slow, but it should work. There is no problem if $match returns something that is larger than RAM.
Here are the actual pipeline limits.
http://docs.mongodb.org/manual/core/aggregation-pipeline-limits/
The $match operator solely does not cause memory problems. As stated in the documentation, $group and $sort are the usual villains. They are cumulative, and might require access to the entire input set before they can produce any output. If they load too much data into physical memory, they will fail.
If so, is there a better way to perform pagination on a result set returned through aggregation?
It has been correctly said that you cannot "paginate" (apply $skip and $limit) on the result of the aggregation, because it is simply a MongoDB document. But you can "paginate" on the intermediate results of the aggregation pipeline.
Using $limit on the pipeline will help on keeping the result set within the 16 MB bounds, the maximum BSON document size. Even if the collection grows, you should be safe.
Problems could arise with $group and, especially, $sort. You can create "sort friendly" indexes to deal with them if they do actually happen. Have a look at the documentation on indexing strategies.
http://docs.mongodb.org/manual/tutorial/sort-results-with-indexes/
Finally, be aware that $skip does not help with performance. On the contrary, it tends to slow down the application, since it forces MongoDB to scan every skipped document to reach the desired point in the collection.
http://docs.mongodb.org/manual/reference/method/cursor.skip/
MongoDB's recommendation that $sort precede $limit is absolutely right, because when this happens MongoDB can optimize the memory required to produce the top n results.
It's just that the solution you propose doesn't fit your use case, which is pagination.
You can modify your query to get the benefit of this optimization.
collection.aggregate([
{
$unwind: u
},
{
$group: g
},
{
$match: f
},
{
$sort: s
},
{
$limit: l+p
},
{
$skip: p
}
]);
or, for the find query:
collection.find(f).sort(s).limit(l+p).skip(p)
Though, as you can see, with deep pagination the memory required will grow more and more, even with this optimization.