mongodb get count without repeating find - mongodb

When performing a query in MongoDb, I need to obtain a total count of all matches, along with the documents themselves as a limited/paged subset.
I can achieve the goal with two queries, but I do not see how to do this with one query. I am hoping there is a mongo feature that is, in some sense, equivalent to SQL_CALC_FOUND_ROWS, as it seems like overkill to have to run the query twice. Any help would be great. Thanks!
EDIT: Here is Java code to do the above.
// find() fetches the 10-document page; count() issues a second,
// separate count command on the server for the full match total
DBCursor cursor = collection.find(searchQuery).limit(10);
System.out.println("total objects = " + cursor.count());

I'm not sure which language you're using, but you can typically call a count method on the cursor that's the result of a find query and then use that same cursor to obtain the documents themselves.
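In the mongo shell, for example, the pattern looks like this (a sketch; `searchQuery` stands for whatever filter you are using, and note that count() still issues a separate server-side command behind the scenes):

```javascript
// one cursor object, but count() triggers a separate count command;
// it ignores limit() unless you call count(true)
var cursor = db.collection.find(searchQuery).limit(10);
var total = cursor.count();   // total matches, ignoring the limit
cursor.forEach(printjson);    // then iterate the 10-document page
```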

It's not only overkill to run the query twice, but there is also the risk of inconsistency. The collection might change between the two queries, or each query might be routed to a different peer in a replica set, which may have different versions of the collection.
The count() function on cursors (in the MongoDB JavaScript shell) actually runs another query. You can see that by typing "cursor.count" (without parentheses) to print its source, so it is no better than running two queries yourself.
In the C++ driver, cursors don't even have a "count" function. There is "itcount", but it only loops over the cursor and fetches all results, which is not what you want (for performance reasons). The Java driver also has "itcount", and there the documentation says that it should be used for testing only.
It seems there is no way to do a "find some and get total count" consistently and efficiently.
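At least not with find. For readers on MongoDB 3.4 or newer: the aggregation framework's $facet stage can return both the page and the total count from a single command, which largely addresses this (a sketch, assuming `searchQuery` is your filter; not available on the older versions discussed above):

```javascript
// one round trip: both the page and the total count come back
// together in a single result document
db.collection.aggregate([
    { $match: searchQuery },
    { $facet: {
        page:  [ { $limit: 10 } ],
        total: [ { $count: "n" } ]
    }}
])
```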

Related

How to efficiently query MongoDB for documents when I know that 95% are not used

I have a collection of ~500M documents.
Every time I execute a query, I receive one or more documents from this collection. Let's say I have a counter for each document, and I increase this counter by 1 whenever the document is returned by a query. After a few months of running the system in production, I discover that the counter of only 5% of the documents is greater than 0 (zero). Meaning, 95% of the documents are never used.
My question is: Is there an efficient way to arrange these documents to speed up query execution, based on the fact that 95% of the documents are never used?
What is the best practice in this case?
If, for example, I add another boolean field named "consumed" to each document and index this field, can I improve the query execution time somehow?
~500M documents: that is quite a solid figure, good job if that's true. So here is how I see a solution to the problem:
If you are willing to re-write/re-factor the app and rebuild the DB, you could use the versioning pattern.
What does that look like?
Imagine you have two collections (or even two databases, if you are using a microservice architecture):
Relevant docs / Irrelevant docs.
Basically, you run find only on the relevant-docs collection (which stores the 5% of your docs that are useful), and only if nothing matches do you fall back to Irrelevant.find(). This pattern also allows you to store old/historical data and manage it via a TTL index or a capped collection.
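A minimal shell sketch of that lookup order (the collection names here are illustrative):

```javascript
// try the small "hot" collection first, fall back to the archive
var doc = db.relevant_docs.findOne(query);
if (doc === null) {
    doc = db.irrelevant_docs.findOne(query);
}
```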
You could also add some Redis magic on top (it uses precisely the same logic); take a look:
This article can also be helpful (as can many others, like this SO question).
But don't try to replace Mongo with Redis; team them up instead.
Using Indexes and .explain()
If, for example, I add another boolean field named "consumed" to each document and index this field, can I improve the query execution time somehow?
Yes, that will deal with your problem. To see it in action, download MongoDB Compass, create this boolean field in your schema (don't forget to add a default value), index the field, and then use the Explain module with some query. But don't forget about compound indexes! If you create an index on a single field, measure the performance by querying only that field.
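The same steps can be done from the shell (collection and field names here are illustrative):

```javascript
// backfill the flag with a default, index it, then compare plans
db.docs.updateMany({}, { $set: { consumed: false } });
db.docs.createIndex({ consumed: 1 });
db.docs.find({ consumed: true }).explain("executionStats");
```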
If your index is being used (and actually speeds queries up), Compass will show you its usage.
To measure the performance of the queries (with and without indexing), use Explain tab.
Actually, all of this can be done without Compass, via .explain() and the index shell commands. But Compass gives a better visual of the process, so it's worth using it, especially since it is now completely free for everyone.

Mongo - How to explain query without executing it

I'm new to Mongo and have searched but don't see a specific answer.
I understand that Mongo's explain method will execute the query in parallel across the possible access plans and choose a winning plan based on execution time.
The Best Practices Guide states "The query plan can be calculated and returned without first having to run the query". I cannot find how to do this in the doc.
So what if even the winning plan takes a very long time to execute before it returns even a small result set, for example sorting a large collection?
I've seen older comments that execution stops after the first 101 documents are returned, but again can't find that in official doc.
So my question is: How to get the access plan without executing the query?
Thanks for your help. I'm using Mongo 3.4.
You can set the verbosity of the explain() function. By default it uses queryPlanner, which (if I'm right) is what you are looking for.
Check the other modes in the official MongoDB documentation about explain()
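Concretely, the three verbosity modes differ in how much of the query they actually run (a sketch; the collection and filter are illustrative):

```javascript
// queryPlanner (the default): plan selection only, query NOT executed
db.users.find({ name: "alice" }).explain("queryPlanner")
// executionStats: runs the winning plan and reports its statistics
db.users.find({ name: "alice" }).explain("executionStats")
// allPlansExecution: also reports trial stats for the rejected plans
db.users.find({ name: "alice" }).explain("allPlansExecution")
```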
MongoDB builds Query Plans by running a limited subset of a query, then caches the winning plan so it can be re-used (until something causes a cache flush).
So the "limited subset" means that it will fetch matching documents up to about 100 docs as a "sample" and go no further; but what the documentation means by
The query plan can be calculated and returned without first having to
run the query
is that you can do an explain before you run the full query. This is good practice because it then populates that Plan Cache for you ahead of time.
To put it simply, if MongoDB notices that one candidate plan is taking longer than a plan that has already completed its trial, it will cease executing it and ditch it in favour of the more efficient plan. You can see this in action by running an allPlansExecution explain on a query where an index is chosen over a collection scan.

Does providing a projection argument to find() limit the data that is added to Mongo's working set?

In Mongo, suppose I have a collection mycollection that has fields a, b, and huge. I very frequently want to perform queries, mapreduce, updates, etc. on a and b, and only very occasionally want to return huge in query results as well.
I know that db.mycollection.find() will scan the entire collection and result in Mongo attempting to add the whole collection to the working set, which may exceed the amount of RAM I have available.
If I instead call db.mycollection.find({}, { a : 1, b : 1 }), will this still result in the whole collection being added to the working set or only the terms of my projection?
MongoDB can use something called covered queries: http://docs.mongodb.org/manual/applications/indexes/#create-indexes-that-support-covered-queries these allow you to load all the values from the index rather than the disk, or memory, if those documents are in memory at the time.
Be warned that you cannot use covered queries on a full table scan; the condition, projection and sort must all be within the index. For example:
db.col.ensureIndex({a:1, b:1});
db.col.find({a:1}, {_id:0, a:1, b:1}).sort({b:1});
would work (the sort is optional; the query is covered with or without it). You can add _id to your index if you intend to return it too.
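You can confirm that a query is actually covered with explain (a sketch; the exact output shape depends on your server version):

```javascript
// older shells report "indexOnly" : true for a covered query
db.col.find({a:1}, {_id:0, a:1, b:1}).explain()
// on newer servers, check instead that the winning plan has an IXSCAN
// with no FETCH stage and that executionStats.totalDocsExamined is 0:
// db.col.find({a:1}, {_id:0, a:1, b:1}).explain("executionStats")
```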
Map Reduce does not support covered queries; there is no way to project only a certain set of fields into the MR, as far as I know (maybe there is some hack I do not know of). Map Reduce only supports a $match-like operator as its input query, with a separate parameter for the sort of the incoming query ( http://docs.mongodb.org/manual/applications/map-reduce/ ).
Note that for updates, I believe only atomic operations ( http://docs.mongodb.org/manual/tutorial/isolate-sequence-of-operations/ ), excluding findAndModify, do not load the document into your working set; however, "believe" is the keyword there.
Considering you need to do both MR and normal find and update on these records, I would strongly recommend you look into why you are paging in so much data and whether you really need to do it that often. It seems like you are trying to do too much processing in short, frequent bursts.
On the other hand, if this is a script which runs every night or something then I would not worry too much about its excessive working set (i.e. score board recalc script).

How to see sort information system.profile (mongodb)?

I enabled profiling mode 2 (all events). My idea is to write a script, that will go through all the queries and execute explain plans and see whether any of the queries do not use indexes. However it's impossible to do so as I don't see any sorting information in the system.profile. Why so?
Thanks
UPDATE:
Imagine you have a users collection, and you have created an index on it: user(name, createdAt). Now I would like to find some users sorted by time. In system.profile the second part of the query (sorting/pagination) isn't saved; however, it's very important to understand which sorting operations were used, as sorting affects performance and index selection.
So my intention was to create a script that will go through all statements in system.profile and execute explain plans and see whether indexes are used or not. If not, I can automatically catch all new/not reliable queries while executing integration tests.
Why did 10gen choose not to include sorting information in system.profile?
I'd suggest if you're interested in the feature, to request it here. (I couldn't find an existing suggestion for this feature). You might get more detail about why it hasn't been included.
I can see how it could be valuable sometimes, although it's recommended that you compare nscanned vs nreturned as a way of knowing whether indexes are being used (and, more importantly, making sure as few documents as possible are scanned).
So, while not perfect, it gives you a quick way of evaluating performance and detecting index usage.
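A rough shell sketch of that heuristic (the field names nscanned/nreturned are those used by older profiler versions; newer servers call the first one docsExamined):

```javascript
// flag profiled ops that scanned far more documents than they returned
db.system.profile.find().forEach(function (op) {
    if (op.nscanned && op.nreturned && op.nscanned > 10 * op.nreturned) {
        printjson({ ns: op.ns, nscanned: op.nscanned,
                    nreturned: op.nreturned, millis: op.millis });
    }
});
```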
You can use the aggregation framework to sort
e.g. slowest operations first
db.system.profile.aggregate([ { $sort : { millis : -1 } } ])
Sorry guys, I found the answer to my own question. Mongo actually has it: it's called orderby, inside the query element. My bad :-)

MongoDB. Use cursor as value for $in in next query

Is there a way to use the cursor returned by the previous query as a value for $in in the next query? For example, something like this:
var users = db.user.find({state:1})
var offers = db.offer.find({user:{$in:users}})
I think this can reduce the traffic between MongoDB and the client in the case where the client doesn't need the user information at all, just the offers. Am I wrong?
Basically you want to do a join between two collections, which Mongo doesn't support. You can reduce the amount of data being transferred from the server by limiting the fields returned from the first query to only the unique user information (i.e. the _id) that you need to query the offers collection.
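In shell terms, that looks something like this (a sketch):

```javascript
// first query projects only _id, so very little data crosses the wire
var userIds = db.user.find({ state: 1 }, { _id: 1 })
                     .map(function (u) { return u._id; });
// then the ids, not the cursor, go into $in
var offers = db.offer.find({ user: { $in: userIds } });
// db.user.distinct("_id", { state: 1 }) is an even terser equivalent
```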
If you really just want to make one query then you should store more information in the offers collection. For example, if you're trying to find offers for active users then you would store the active state of the user in the offers collection.
To work from your comment:
Yes, that's why I used the tag 'join' in the question. The idea is that I can make the first query more complex, using a bunch of fields and regexes, without storing user data in other collections except as references. In these cases I always have to perform two consecutive queries, but transferring the results of the first query is not necessary, neither for me nor for MongoDB itself. I just want to understand whether it can be done now, whether it will be possible in the future, or whether it cannot be implemented for some technical reason.
As far as I understand it, there is no immediate pressure to make this possible. Also, the way cursors are coded at the moment means this would be quite a big change to how they work and are defined; a change big enough to possibly break implementations for other people. It is much like the question of whether to make safe writes the default for inserts and updates in all future drivers: it is recognised that safe should be the default, but flipping it would break implementations for people who expect it the other way around.
It is rather inefficient if you don't require the results of the first query at all; however, since most networks are provisioned with high traffic in mind and traffic is cheap, there hasn't been demand to support chained queries server-side in the cursor.
However, subselects (which this basically is: selecting a set of rows based upon a sub-selection of previous rows) have come up on mongodb-user a couple of times, and there might even be a JIRA ticket for it somewhere; if not, it might be useful to create one.
As for doing it right now: there is no way.