Is it faster to use with_limit_and_skip=True when counting query results in pymongo - mongodb

I'm doing a query where all I want to know is whether there is at least one row in the collection that matches the query, so I pass limit=1 to find(). All I care about is whether the count() of the returned cursor is > 0. Would it be faster to use count(with_limit_and_skip=True) or just count()? Intuitively it seems to me that I should pass with_limit_and_skip=True, because if there are a whole bunch of matching records then the count could stop at my limit of 1.
Maybe this merits an explanation of how limits and skips work under the covers in mongodb/pymongo.
Thanks!

Your intuition is correct. That's the whole point of the with_limit_and_skip flag.
With with_limit_and_skip=False, count() has to count all the matching documents, even if you use limit=1, which is pretty much guaranteed to be slower.
From the docs:
Returns the number of documents in the results set for this query. Does not take limit() and skip() into account by default - set with_limit_and_skip to True if that is the desired behavior.
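For newer PyMongo releases, where Cursor.count() is deprecated (3.7) and removed (4.0), the same "stop at the first match" idea can be sketched with Collection.count_documents and its limit option. The helper name has_match is hypothetical, and collection is assumed to be a pymongo Collection (or anything exposing the same method):

```python
def has_match(collection, query):
    """Return True if at least one document matches `query`.

    `has_match` is a hypothetical helper; `collection` is assumed to be a
    pymongo Collection (or any object with a compatible count_documents).
    """
    # limit=1 lets the server stop counting after the first matching document.
    return collection.count_documents(query, limit=1) > 0
```

Something like has_match(db.orders, {"product_id": 6}) would then answer the existence question without counting every match.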

Related

Why MongoDB find has same performance as count

I am running tests against my MongoDB and for some reason find has the same performance as count.
Stats:
orders collection size: ~20M,
orders with product_id 6: ~5K
product_id is indexed for improved performance.
Query: db.orders.find({product_id: 6}) vs db.orders.find({product_id: 6}).count()
Result: the orders for the product vs. 5K, each after 0.08ms.
Why isn't count dramatically faster? It could find the positions of the first and last elements using the product_id index.
As the Mongo documentation for count states, calling count is the same as calling find, except that instead of returning the docs it just counts them. To perform this count, it iterates over the cursor. It can't just read the index and determine the number of documents from the first and last value of some ID, especially since the index can be on a field other than the ID (and Mongo IDs are not auto-incrementing). So find and count are basically the same operation, but instead of fetching the documents, count goes over them, tallies them, and returns the total to you.
Also, if you want a faster result, you could use estimatedDocumentCount, which goes straight to the collection's metadata. The cost is that it takes no query filter, so you lose the ability to ask "What number of documents can I expect if I trigger this query?". If you need the count of docs for a specific query, you could use countDocuments, which is a wrapper around an aggregate query. From my knowledge of Mongo, that looks like the fastest way to count query results without calling count. I guess it should be the preferred way, performance-wise, to count docs from now on (it was introduced in version 4.0.3).
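In PyMongo the two shell helpers discussed here correspond to count_documents and estimated_document_count. A minimal sketch of the difference, assuming orders is a pymongo Collection and using a hypothetical helper name order_counts:

```python
def order_counts(orders, product_id):
    """Return (matching, total_estimate) for a hypothetical orders collection.

    `orders` is assumed to be a pymongo Collection or a compatible object.
    """
    # Exact count of the documents that match the filter.
    matching = orders.count_documents({"product_id": product_id})
    # Estimate read from collection metadata; it accepts no filter at all.
    total_estimate = orders.estimated_document_count()
    return matching, total_estimate
```

Only the first call can answer a per-query question; the second always describes the whole collection.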

Could the order of the results change after using Pymongo's rewind() function?

Given that MongoDB query results are returned in the order that they are found, which "may coincide with insertion order (but isn't guaranteed to be) or the order of the index(es) used":
Does this mean that the order of the results could change after using Pymongo's rewind() function?
It seems like rewind() performs another database query, right?
Correct, rewind() performs another database query, as if the first had never happened. If you don't specify any sort order to your results, and if MongoDB had to move some documents (because some changed size, for example) between the first and the second query, you will get them in different order.
If you need your documents in a particular order, use sort.
http://api.mongodb.com/python/current/api/pymongo/cursor.html#pymongo.cursor.Cursor.sort
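A small sketch of that advice, assuming collection is a pymongo Collection and sorting on _id purely for illustration (read_twice is a hypothetical helper name). With an explicit sort on a unique key, the second pass after rewind() should come back in the same order as the first, as long as the data itself has not changed in between:

```python
def read_twice(collection, sort_key="_id"):
    """Iterate a sorted cursor, rewind it, and iterate it again.

    Hypothetical helper; `collection` is assumed to behave like a pymongo
    Collection. rewind() makes the cursor re-run the query on the next read.
    """
    cursor = collection.find().sort(sort_key, 1)  # 1 means ascending
    first = list(cursor)
    cursor.rewind()
    second = list(cursor)
    return first, second
```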

Implementation of limit in mongodb

My collection name is trial and data size is 112mb
My query is,
db.trial.find()
and I have added a limit of 10:
db.trial.find.limit(10).
But the limit is not working; the entire query is running.
Replace
db.trial.find.limit(10)
with
db.trial.find().limit(10)
You also mention that the entire database is being queried. To check, run this:
db.trial.find().limit(10).explain()
It will tell you how many documents it looked at before stopping the query (nscanned). You will see that nscanned will be 10.
The .limit() modifier on its own only "limits" the results of the query that is processed, so it works as designed to "limit" the results returned. In its raw form, with no query filter, nscanned should simply equal the limit you asked for:
db.trial.find().limit(10)
If your intent is to only operate on a set number of documents you can alter this with the $maxScan modifier:
db.trial.find({})._addSpecial( "$maxScan" , 11 )
Which causes the query engine to "give up" after the set number of documents have been scanned. But that should only really matter when there is something meaningful in the query.
If you are actually trying to do "paging", then you are better off using "range" queries with $gt and $lt and cousins to effectively change the range of selection that is done in your query.
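A rough PyMongo-style sketch of that range-based paging, assuming pages are ordered by _id and that the caller remembers the last _id it saw. The helper name next_page and its parameters are made up for illustration:

```python
def next_page(collection, last_id=None, page_size=10):
    """Fetch the page that follows `last_id`, ordered by _id.

    Hypothetical helper; `collection` is assumed to behave like a pymongo
    Collection. Pass last_id=None to get the first page.
    """
    # A $gt range replaces skip(): the query starts right after the last seen _id.
    query = {} if last_id is None else {"_id": {"$gt": last_id}}
    return list(collection.find(query).sort("_id", 1).limit(page_size))
```

Each call passes the _id of the final document from the previous page, so the server never has to walk over and discard the skipped documents.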

MongoDB skip & limit when querying two collections

Let's say I have two collections, A and B, and a single document in A is related to N documents in B. For example, the schemas could look like this:
Collection A:
{id: (int),
propA1: (int),
propA2: (boolean)
}
Collection B:
{idA: (int), # id for document in Collection A
propB1: (int),
propB2: (...),
...
propBN: (...)
}
I want to return properties propB2-BN and propA2 from my API, and only return information where (for example) propA2 = true, propB6 = 42, and propB1 = propA1.
This is normally fairly simple - I query Collection B to find documents where propB6 = 42, collect the idA values from the result, query Collection A with those values, and filter the results with the Collection A documents from the query.
However, adding skip and limit parameters to this seems impossible to do while keeping the behavior users would expect. Naively applying skip and limit to the first query means that, since filtering occurs after the query, fewer than limit documents could be returned. Worse, in some cases no documents could be returned when there are actually still documents in the collection to be read. For example, if the limit was 10 and the first 10 Collection B documents returned pointed to a document in Collection A where propA2 = false, the function would return nothing. Then the user would assume there's nothing left to read, which may not be the case.
A slightly less naive solution is to simply check if the return count is < limit, and if so, repeat the queries until the return count = limit. The problem here is that skip/limit queries where the user would expect exclusive sets of documents returned could actually return the same documents.
I want to apply skip and limit at the mongo query level, not at the API level, because the results of querying collection B could be very large.
MapReduce and the aggregation framework appear to only work on a single collection, so they don't appear to be alternatives.
This seems like something that'd come up a lot in Mongo use - any ideas/hints would be appreciated.
Note that these posts ask similar sounding questions but don't actually address the issues raised here.
Sounds like you already have a solution (2).
You cannot optimize/skip/limit on the first query; depending on the search, you can perhaps do it on the second query.
You will need a loop around it either way, like you write.
I suppose the .skip will always be costly for you, since you will need to fetch all the skipped results and then throw them away to simulate the skip and give the user consistent behavior.
All the logic would have to go to your loop - unless you can match in a clever way to second query (depending on requirements).
Out of curiosity: Given the time passed, you should have a solution by now?!
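The loop described in this answer could look roughly like the sketch below. It is plain Python over iterables: b_docs stands in for a consistently sorted cursor over Collection B already filtered on propB6 = 42, and find_a stands in for a lookup into Collection A by id. Both names, and the helper names, are assumptions for illustration:

```python
from itertools import islice


def matching_pairs(b_docs, find_a):
    """Lazily yield B documents whose related A document passes the filter."""
    for b in b_docs:
        a = find_a(b["idA"])  # hypothetical lookup into Collection A
        if a is not None and a["propA2"] and a["propA1"] == b["propB1"]:
            yield {**b, "propA2": a["propA2"]}


def joined_page(b_docs, find_a, skip, limit):
    """Apply skip/limit after filtering, so pages never overlap or come up short."""
    return list(islice(matching_pairs(b_docs, find_a), skip, skip + limit))
```

The skipped matches are still read and discarded, which is the cost mentioned above, but reading stops as soon as the page is full.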

In Mongodb, how do I get the count of the total results returned, without the limit?

Let's say I put a limit and skip on the MongoDB query... I want to know the total number of results if there were no limit on it.
Of course, I could do this the shitty way...which is to query twice.
In MongoDB the default behavior of count() is to ignore skip and limit and count the number of results in the entire original query. So running count will give you exactly what you want.
Passing a Boolean true to count or calling size instead would give you a count WITH skip or limit.
There is no way to get the count without executing the query twice.
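In PyMongo terms, the two-query approach might be sketched as follows, assuming collection is a pymongo Collection. The helper name page_with_total is hypothetical:

```python
def page_with_total(collection, query, skip, limit):
    """Return (total, docs): the unpaged match count plus one page of results.

    Hypothetical helper; `collection` is assumed to behave like a pymongo
    Collection. This issues two separate queries.
    """
    # Query 1: count every match, ignoring skip and limit.
    total = collection.count_documents(query)
    # Query 2: fetch only the requested page.
    docs = list(collection.find(query).skip(skip).limit(limit))
    return total, docs
```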