What is the difference between COUNT_SCAN and IXSCAN? - mongodb

Whenever I run a count query on MongoDB with explain, I can see two different stages, COUNT_SCAN and IXSCAN. I want to know the difference between them in terms of performance, and how I can improve the query.
The field is indexed.
The following query:
db.collection.explain(true).count({field: 1})
uses COUNT_SCAN, and a query like:
db.collection.explain(true).count({field: {$in: [1, 2]}})
uses IXSCAN.

The short answer: COUNT_SCAN is the most efficient way to get a count, because it reads the value straight from an index, but it can only be performed in certain situations. Otherwise, an IXSCAN is performed, followed by some filtering of documents and a count in memory.
When reading from a secondary, the read concern "available" is used. This concern level doesn't consider orphaned documents in sharded clusters, and so no SHARDING_FILTER stage will be performed. This is when you see COUNT_SCAN.
However, if we use read concern "local", we need to fetch the documents in order to perform the SHARDING_FILTER stage. In this case there are multiple stages to fulfill the query: IXSCAN, then FETCH, then SHARDING_FILTER.
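A rough sketch of how this plays out in the shell (the collection, index and field names here are hypothetical):
db.orders.createIndex({ status: 1 })
// Equality on the indexed field: the count can be answered from the index alone,
// so explain typically reports a COUNT_SCAN stage.
db.orders.explain(true).count({ status: "shipped" })
// An $in (or range) predicate produces multiple index bounds, so explain typically
// reports an IXSCAN stage followed by a count in memory.
db.orders.explain(true).count({ status: { $in: ["shipped", "pending"] } })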

Related

Why do we need an additional LIMIT stage with compound index in Mongo

I am using Mongo 4.2 (stuck with this) and have a collection say "product_data" with documents with the following schema:
_id:"2lgy_itmep53vy"
uIdHash:"2lgys2yxouhug5xj3ms45mluxw5hsweu"
userTS:1494055844000
Case 1: With this, I have the following indexes for the collection:
_id:Regular - Unique
uIdHash: Hashed
I tried to execute
db.product_data.find( {"uIdHash":"2lgys2yxouhug5xj3ms45mluxw5hsweu"}).sort({"userTS":-1}).explain()
and these are the stages in result:
Of course, I realized that it would make sense to have an additional compound index to avoid the Mongo in-memory 'Sort' stage.
Case 2: Now I have added another index alongside the existing ones:
3. {uIdHash:1 , userTS:-1}: Regular and Compound
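For reference, a rough sketch of the command that creates such an index (index options omitted):
db.product_data.createIndex({ uIdHash: 1, userTS: -1 })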
As expected, the result of execution here was able to optimize away the sorting stage:
All good so far. Now I am looking to build pagination on top of this query, so I need to limit the data queried. Hence the query becomes:
db.product_data.find( {"uIdHash":"2lgys2yxouhug5xj3ms45mluxw5hsweu"}).sort({"userTS":-1}).limit(10).explain()
The results for each case are now as follows:
Case 1 Limit Result:
The in-memory sorting does less work (36 instead of 50) and returns the expected number of documents. Fair enough, a good underlying optimization in the stage.
Case 2 Limit Result:
Surprisingly, with the index in use and the data queried, there is an additional Limit stage added to processing!
The doubts now I have are as follows:
Why do we need an additional stage for LIMIT, when we already have 10 documents returned from the FETCH stage?
What would be the impact of this additional stage? Given that I need pagination, shall I stick with Case 1 indexes and not use the last compound index?
The Limit stage tells you that the database is limiting the result set. This means subsequent stages will work with less data.
Your question of "why do we need an additional stage for limit" doesn't quite make sense. You send the query to the database; you do not use (or need) any stages yourself. The database decides how to fulfill the query. If you ask it to limit the result set, it does that, and it communicates that it has done so by reporting a LIMIT stage in the query plan.
The query executor is able to perform some optimizations. One of these is that when there is a limit and no blocking stage (like a sort), when the limit is reached, all of the upstream stages can stop early.
This means that if there were no LIMIT stage, the IXSCAN and FETCH stages would have continued through all 24 matching documents.
There is no discrete LIMIT stage with the non-indexed sort because it is combined with the SORT stage.
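A sketch of how you could verify this with executionStats (the query values are taken from the question; the exact counts will depend on your data):
db.product_data.find({ uIdHash: "2lgys2yxouhug5xj3ms45mluxw5hsweu" }).sort({ userTS: -1 }).limit(10).explain("executionStats")
// With the compound index in place, compare totalKeysExamined and totalDocsExamined with and
// without .limit(10): the limited query should stop after roughly 10 index keys instead of
// walking every matching entry.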

which branches of a mongodb $or query were satisfied to include a document in the set returned?

Say I have a mongo $or query, something like { $or: [query1, query2, ... queryN] }, where each embedded query could be complex. Upon executing the query, a set of documents matching one or more of the embedded queries is returned. I would like to know which of the N embedded queries was satisfied for each document in the returned set, perhaps by adding a new field that I specify, eg. marks, into each returned document that would hold a list of the indexes of whichever of the queries was satisfied. I need this information to indicate how each document was identified in my application's interface.
I realize I could inspect the set once it is returned and determine the queries that were satisfied, but these queries could be arbitrarily complex and expensive to inspect - besides, this must have already been done inside mongo itself while doing the search.
I also realize I could run each of the N queries sequentially and then mark and merge the results into a growing set, but I want to save that overhead by running a single query instead of N queries.
And I realize that Mongo will certainly stop once the first satisfying query is found for each document, so I may not be able to get the complete set, but then I would at least like some assurance that the queries are executed in a certain order, say 1...N, and each document could be marked with its first satisfying index.
Does anyone know if there's a mechanism in Mongo that can do this?
You can use aggregation.
Use $addFields to add a new field for each query.
You could either $match first and then add the fields, or add the fields first and then $match on the added fields.
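A minimal sketch of the idea (the collection name, field names and the two sub-queries are hypothetical stand-ins for your query1 ... queryN):
db.docs.aggregate([
    { $addFields: {
        marks1: { $eq: ["$status", "active"] },   // stands in for query1
        marks2: { $gt: ["$qty", 100] }            // stands in for query2
    } },
    { $match: { $or: [{ marks1: true }, { marks2: true }] } }
])
// Each returned document now carries a boolean per branch, so the application can see
// which of the embedded queries it satisfied.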

Does Mongo search indexed fields first?

I have a query that searches on two fields: one of those fields is indexed, and one is not.
Will Mongo do the right thing by first searching on the indexed field, and only then searching the other non-indexed search parameter?
It depends.
If you have a normal find, then an IXSCAN will be used on the indexed field. You can check this by using explain().
If you have an aggregation query whose first stage matches on the non-indexed field and a later stage uses the indexed field, then the index will not be used. Here, too, you can use explain() to check.
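A rough sketch of both checks (the collection and field names are hypothetical):
db.coll.createIndex({ indexedField: 1 })
// find: explain() will typically show an IXSCAN on indexedField, with the non-indexed
// predicate applied as a filter on the fetched documents.
db.coll.find({ indexedField: "a", otherField: "b" }).explain("executionStats")
// aggregate: inspect the explain output to see whether the index is still used when the
// first $match only touches the non-indexed field.
db.coll.explain().aggregate([
    { $match: { otherField: "b" } },
    { $match: { indexedField: "a" } }
])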
The answer is maybe.
The first time MongoDB sees a particular query shape, it runs a short test.
It determines all of the potential index plans that might be used to service the query, including collection scan.
It then runs each of these in parallel, with a limit of 101 documents, 101 units of work, or 100 milliseconds.
The score for each plan is the number of documents found during the test period, divided by the number of units of work required.
After the test, any plan that completed gets bonus points, and additional bonus points if it completed without needing an in-memory sort.
This plan is then cached to be reused the next time it sees a query with the same shape.
You can run explain with the "allPlansExecution" option to see all of the candidate plans that were considered, and how each compares.
In most cases, it will choose the index as expected, but there are some situations where it doesn't, and explain can be very useful to determine why.
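For example (the collection and field names are hypothetical):
db.coll.find({ indexedField: "a", otherField: "b" }).explain("allPlansExecution")
// The allPlansExecution array in the output lists each candidate plan the planner trialled,
// along with how many keys and documents each examined and returned during the trial.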

Why aggregate+sort is faster than find+sort in mongo?

I'm using mongoose in my project. When the number of documents in my collection becomes bigger, the method of find+sort becomes slower. So I use aggregate+$sort instead. I just wonder why?
Without seeing your data and your query it is difficult to answer why aggregate+sort is faster than find+sort.
But below are some things that hold true for find and aggregate:
Well-indexed data (an index that suits your query) will always yield faster results for your find query.
In an aggregation pipeline, more stages mean more work: the number of operations is directly proportional to the execution time.
With an aggregation pipeline you can create new fields, such as a sum or an average, which is not possible with a find.
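For instance, a sketch of the kind of computed fields only the pipeline can produce (the collection and field names are hypothetical):
db.sales.aggregate([
    { $match: { year: 2020 } },   // an index on year can serve this stage, if one exists
    { $group: { _id: "$region", total: { $sum: "$amount" }, average: { $avg: "$amount" } } }
])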
see this thread for more info
MongoDB {aggregation $match} vs {find} speed

Aggregation framework on full table scan

I know that the aggregation framework is suitable if there is an initial $match stage to limit the collection to be aggregated. However, there may be times when the filtered collection is still large, say around 2 million documents, and the aggregation involves $group. Is the aggregation framework fit to work on such a collection, given a requirement to output results in at most 5 seconds? Currently I work on a single node. By performing the aggregation on a sharded cluster, will there be a significant improvement in performance?
As far as I know, the only limitations are that the result of the aggregation can't exceed 16 MB, since what it returns is a document and that's the size limit for a document in MongoDB, and that you can't use more than 10% of the total memory of the machine. For that reason, $match stages are usually used to reduce the set you work with, or a $project stage to reduce the data per document.
Be aware that in a sharded environment, after $group or $sort stages the aggregation is brought back to the mongos before being sent to the next stage of the pipeline. The mongos could potentially be running on the same machine as your application, and that could hurt your application's performance if not handled correctly.
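A sketch of the shape of such a pipeline (the collection and field names are hypothetical; allowDiskUse lets a large $group or $sort stage spill to disk instead of failing on the memory limit):
db.events.aggregate(
    [
        { $match: { createdAt: { $gte: ISODate("2024-01-01") } } },
        { $group: { _id: "$userId", total: { $sum: "$amount" } } }
    ],
    { allowDiskUse: true }
)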