Why does MongoDB find have the same performance as count?

I am running tests against my MongoDB and for some reason find has the same performance as count.
Stats:
orders collection size: ~20M,
orders with product_id 6: ~5K
product_id is indexed for improved performance.
Query: db.orders.find({product_id: 6}) vs db.orders.find({product_id: 6}).count()
Result: the orders for the product vs. the count of 5K, both after ~0.08 ms.
Why isn't count dramatically faster? It could find the positions of the first and last matching elements in the product_id index.

As the Mongo documentation for count states, calling count is the same as calling find, but instead of returning the docs, it just counts them. To perform this count, it iterates over the cursor. It can't just read the index and determine the number of documents based on the first and last value of some ID, especially since you can have an index on some other field that isn't an ID (and Mongo IDs are not auto-incrementing). So find and count are basically the same operation; instead of returning the documents, count just iterates over them, sums their number, and returns it to you.
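One way to see this for yourself is to compare the execution stats of the two operations. A minimal sketch in the mongo shell, assuming the orders collection and product_id index from the question; both plans should show an IXSCAN over the same index with a similar totalKeysExamined:
db.orders.find({ product_id: 6 }).explain("executionStats")         // scans the index, returns documents
db.orders.explain("executionStats").count({ product_id: 6 })        // same index scan, returns only a number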
Also, if you want a faster result, you could use estimatedDocumentCount (docs), which goes straight to the collection's metadata. The tradeoff is that you lose the ability to ask "what number of documents can I expect if I trigger this query?". If you need the count of docs for a specific query in a faster way, you could use countDocuments (docs), which is a wrapper around an aggregate query. From my knowledge of Mongo, that is the fastest way to count query results without calling count, and since it was introduced in version 4.0.3 I'd consider it the preferred way, performance-wise, to count docs from now on.
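A minimal sketch of both alternatives in the mongo shell, assuming MongoDB 4.0.3+ and the collection from the question:
db.orders.estimatedDocumentCount()            // reads collection metadata only; no filter allowed
db.orders.countDocuments({ product_id: 6 })   // runs an aggregation; counts matches for a specific query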

Related

Firestore: 1 global index vs 1 index per query, which is better?

I'm working on my app and I just ran into a dilemma regarding the best way to handle indexes for Firestore.
I have a query that searches for publications in a specific community that contain at least one of the tags and fall within a geohash range. The index for that query looks like this:
community Ascending, tag Ascending, location.geohash Ascending
Now, if my user doesn't need to filter by tag, I run the query without the arrayContains(tag) clause, which prompts me to create another index:
community Ascending, location.geohash Ascending
My question is: is it better to create that second index, or to just use the first one and specify all possible tags in arrayContains in the query when the user wants no filter on tag?
Neither is inherently better; it's a typical space vs. time tradeoff.
Adding the extra tags in the query adds some overhead there, but it saves you the (storage) cost for the additional index. So you're trading some small amount of runtime performance for a small amount of space/cost savings.
One thing to check is whether the query with tags can actually run on just the second index, as Firestore may be able to do a zigzag merge join. In that case you could keep only the second, smaller index, avoid the runtime overhead of adding the extra tag clauses, and accept a (similarly small) performance difference on queries where you do specify one or more tags.
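For reference, a minimal sketch of the two query shapes in the v8-style Firestore web SDK (firestore, communityId, tag, rangeStart and rangeEnd are hypothetical names):
// With the tag filter: needs the (community, tag, location.geohash) index
firestore.collection('publications')
  .where('community', '==', communityId)
  .where('tag', 'array-contains', tag)
  .where('location.geohash', '>=', rangeStart)
  .where('location.geohash', '<=', rangeEnd)
// Without the tag filter: prompts the (community, location.geohash) index
firestore.collection('publications')
  .where('community', '==', communityId)
  .where('location.geohash', '>=', rangeStart)
  .where('location.geohash', '<=', rangeEnd)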

Implementation of limit in mongodb

My collection name is trial and the data size is 112 MB.
My query is,
db.trial.find()
and I have added a limit of 10:
db.trial.find.limit(10)
but the limit is not working; the entire collection is being queried.
Replace
db.trial.find.limit(10)
with
db.trial.find().limit(10)
Also, you mention that the entire database is being queried? Run this:
db.trial.find().limit(10).explain()
It will tell you how many documents it looked at before stopping the query (nscanned). You will see that nscanned will be 10.
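For illustration, a hypothetical excerpt of what the legacy (pre-3.0) explain() output looks like; exact field names vary by server version:
{ "cursor" : "BasicCursor", "n" : 10, "nscanned" : 10, ... }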
The .limit() modifier on its own will only "limit" the results of the query that is processed, so it works as designed to "limit" the results returned. In a raw form with no query, though, nscanned should just be the limit you asked for:
db.trial.find().limit(10)
If your intent is to only operate on a set number of documents you can alter this with the $maxScan modifier:
db.trial.find({})._addSpecial( "$maxScan" , 11 )
Which causes the query engine to "give up" after the set number of documents have been scanned. But that should only really matter when there is something meaningful in the query.
If you are actually trying to do "paging" then you are better off using "range" queries with $gt and $lt and cousins to effectively change the range of selection that is done in your query.
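A minimal sketch of that range-based paging pattern, assuming you page on _id (any indexed, consistently sorted field works the same way):
// first page
var page = db.trial.find().sort({ _id: 1 }).limit(10).toArray()
var last = page[page.length - 1]._id
// next page: a range query instead of skip()
db.trial.find({ _id: { $gt: last } }).sort({ _id: 1 }).limit(10)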

MongoDB query: Using Limit together with $near skips few documents

I am currently developing an app which gets a specific number of documents from a collection if their location coordinates fall within a certain range of distance. I am using an active-record library for CodeIgniter, and the query that is generated is as follows:
db.updates.find({locs: { $near: [72.844102008984, 19.130207090604 ], $maxDistance: 5000 }, posted_on : { $lt :1398425538.1942 },}).sort( { posted_on: -1 } ).limit(10).toArray()
The problem I am facing is that the above query skips a few documents which should actually get pulled. But if I remove the limit(10) from the query, the proper documents get pulled.
I am not sure, but does using limit() in MongoDB omit a few results? Or does it limit the results to only the closest (nearest) documents?
P.S. The documents skipped because of the limit are not always the same; random results are generated.
I suspect you are running into problems with the special nature of the $near query. $near performs both a limit() and a sort() on the cursor returning the results -
Specifies a point for which a geospatial query returns the closest documents first. The query sorts the documents from nearest to farthest.
By default, queries that use a 2d index return a limit of 100 documents; however you may use limit() to change the number of results.
http://docs.mongodb.org/manual/reference/operator/query/near/
While the documentation does specifically discuss overriding the limit of 100 with your own limit call
You can further limit the number of results using cursor.limit().
It is silent on adding your own sort(), or on both sorting and overriding the limit at the same time. I suspect you are running into side effects of doing both. Note that it's not incorrect to do both; it just may not produce the results you are looking for. I'd suggest trying the same query using $geoWithin:
http://docs.mongodb.org/manual/reference/operator/query/geoWithin/
$geoWithin does not apply a sort or a limit on the results, so it gives you something of a more raw result set.
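For example, a minimal sketch of the original query rewritten with $geoWithin, assuming a legacy 2d index on locs (with $center the radius is expressed in the same coordinate units as $maxDistance):
db.updates.find({
  locs: { $geoWithin: { $center: [ [72.844102008984, 19.130207090604], 5000 ] } },
  posted_on: { $lt: 1398425538.1942 }
}).sort({ posted_on: -1 }).limit(10)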
Do you have any identical posted_on dates in the system? I recommend sorting by a second key, perhaps _id. If the sort order is non-deterministic, the system may skip documents in a non-deterministic manner. Adding the _id field to your sort order is generally not that expensive if you have an index on the other fields, as the results will already be very close to the correct order, and every collection has an index on _id by default. ("By default, all collections have an index on the _id field, and applications and users may add additional indexes to support important queries and operations." http://docs.mongodb.org/manual/core/index-single/)
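A minimal sketch of that tie-breaking sort applied to the original query:
db.updates.find({
  locs: { $near: [72.844102008984, 19.130207090604], $maxDistance: 5000 },
  posted_on: { $lt: 1398425538.1942 }
}).sort({ posted_on: -1, _id: -1 }).limit(10)   // _id breaks ties between equal posted_on values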

Is it faster to use with_limit_and_skip=True when counting query results in pymongo

I'm doing a query where all I want to know is whether there is at least one document in the collection that matches the query, so I pass limit=1 to find(). All I care about is whether the count() of the returned cursor is > 0. Would it be faster to use count(with_limit_and_skip=True) or just count()? Intuitively it seems like I should pass with_limit_and_skip=True, because if there are a whole bunch of matching records then the count could stop at my limit of 1.
Maybe this merits an explanation of how limits and skips work under the covers in mongodb/pymongo.
Thanks!
Your intuition is correct. That's the whole point of the with_limit_and_skip flag.
With with_limit_and_skip=False, count() has to count all the matching documents, even if you use limit=1, which is pretty much guaranteed to be slower.
From the docs:
Returns the number of documents in the results set for this query. Does not take limit() and skip() into account by default - set with_limit_and_skip to True if that is the desired behavior.
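The same distinction exists in the legacy mongo shell, where cursor.count() takes an applySkipLimit flag analogous to pymongo's with_limit_and_skip; a minimal sketch with a hypothetical collection and filter:
db.things.find({ status: "active" }).limit(1).count()       // ignores the limit: counts every match
db.things.find({ status: "active" }).limit(1).count(true)   // honors the limit: returns at most 1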

mongoDB group by/distinct query

model checkin:
checkin
_id
interest_id
author_id
I've got a collection of checkins (resolved by a simple "find" query).
I'd like to count the number of checkins for each interest.
What makes the task a bit more difficult: two checkins from the same person for the same interest should be counted as one checkin.
AFAIK, group operations in Mongo are performed by map/reduce queries. Should we use one here? The only idea I've got with such an approach is to aggregate the array of users for each interest and then return that array's length.
EDIT: I ended up not using map/reduce at all, although Emily's answer worked fine and quickly.
I have to select only checkins from the last 60 minutes, and there shouldn't be too many results. So I just fetch all of them into the Ruby driver and do all the calculations on the Ruby side. It's a bit slower, but much more scalable and easier to understand.
best,
Roman
Map reduce would probably be the way to go for this and you could get the desired results with two map reduces.
In the first, you could remove duplicate author_id and interest_id pairs.
key would be author_id and interest_id
values would be checkin_id
The second map reduce would then count the deduplicated pairs for each interest, as shown in the sketch below.
key would be interest_id
value would be the checkin count
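A minimal sketch of those two passes in the mongo shell, assuming a checkins collection and hypothetical names for the output collections:
// Pass 1: collapse duplicate (author_id, interest_id) pairs into one document each
db.checkins.mapReduce(
  function () { emit({ author: this.author_id, interest: this.interest_id }, 1); },
  function (key, values) { return 1; },
  { out: "checkin_pairs" }
)
// Pass 2: count the deduplicated pairs per interest
db.checkin_pairs.mapReduce(
  function () { emit(this._id.interest, 1); },   // _id holds the pair key emitted in pass 1
  function (key, values) { return Array.sum(values); },
  { out: "checkins_per_interest" }
)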