How accurate is MongoDB's estimated count query? - mongodb

The official MongoDB driver offers both a 'count documents' and an 'estimated document count' API. As far as I know, the former command is comparatively resource-intensive, so it's recommended to use the latter in situations where an estimate is good enough.
But how accurate is this estimated document count? Can the count be trusted in a Production environment, or is using the count API recommended when absolute accuracy is needed?

Comparing the two, I find it very difficult to conjure up a scenario in which you'd want to use countDocuments() when estimatedDocumentCount() is an option.
That is, the equivalent form of estimatedDocumentCount() is countDocuments({}), i.e., an empty query filter. The cost of the first function is O(1); the second is O(N), and if N is very large, the cost will be prohibitive.
Both return a count which, in any live deployment of Mongo, is likely to be quite ephemeral anyway, i.e., it's inaccurate the moment you have it, because the collection keeps changing.

Please review the MongoDB documentation for estimatedDocumentCount(). Specifically, it notes that "After an unclean shutdown of a mongod using the Wired Tiger storage engine, count statistics reported by db.collection.estimatedDocumentCount() may be inaccurate." This is because the estimate comes from collection metadata, which can drift from the true count until the next checkpoint; the drift is typically resolved after 60 seconds or so.
In contrast, the MongoDB documentation for countDocuments() states that this method wraps an aggregation that uses a $group stage with a $sum expression to count the matching documents, which makes the count exact.
Thus, if absolute accuracy is essential, use countDocuments(). If all you need is a rough estimate, use estimatedDocumentCount(). The names are accurate to their purpose and should be used accordingly.
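For reference, the two calls look like this in the shell (the collection name here is just an example):
// exact count of documents matching a filter: runs an aggregation under the hood
db.orders.countDocuments({ status: "shipped" })
// exact count of everything: still scans, so the cost grows with collection size
db.orders.countDocuments({})
// fast estimate taken from collection metadata; no filter can be supplied
db.orders.estimatedDocumentCount()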

The main difference is filtering.
count_documents can be filtered like a normal query, whereas estimated_document_count cannot.
If filtering is not part of your use case, then I would use estimated_document_count, since it is much faster.

Related

At how many elements will $in be a slow operation?

In my app, users can block other users. There will be queries where I will need to find
$in: [use_id_x, array_that_contains_all_blocked_user_ids]
At what length of array_that_contains_all_blocked_user_ids will this operation become slow?
If it is expected that a user can block up to 100,000 users, how can I design my schema so that this operation will scale?
I can't find anything about this in the official docs for $in (aggregation).
However, the official docs for the $in query operator state:
"It is recommended that you limit the number of parameters passed to the $in operator to tens of values. Using hundreds of parameters or more can negatively impact query performance."
My interpretation is that the official recommendation is to stay below roughly 100 values. But in the end, I think it depends on what "slow" means to you, and that depends heavily on your actual scenario: system specs, performance requirements, and so on.
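For reference, the two forms of $in look roughly like this in the shell (collection and field names here are hypothetical, purely for illustration):
// hypothetical array of blocked user ids (in practice loaded from the viewer's profile document)
var blockedIds = ["user1", "user2", "user3"];
// query-operator form: match documents whose user_id is one of the blocked ids
db.posts.find({ user_id: { $in: blockedIds } })
// aggregation-expression form from the question: test a single value against an array
db.posts.aggregate([
  { $match: { $expr: { $in: ["$user_id", blockedIds] } } }
])
Either way, the work grows with the size of blockedIds, which is why the documentation suggests keeping it to tens of values.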

Aggregate $lookup so slow

1st stage:
$match
{
  delFlg: false
}
2nd stage:
$lookup
{
  from: "parkings",
  localField: "parking_id",
  foreignField: "_id",
  as: "parking"
}
Using 2 indexes (_id & parking_id).
On the production MongoDB server (Linux), the Compass explain plan shows almost 9000 ms.
On the local MongoDB server, the explain plan shows 1400 ms.
Is there any problem? Why is our production server running so slowly?
Full Explain Plan
Query Performance Summary
Documents returned: 71184
Actual query execution time (ms): 8763
Query used the following indexes: 2 (_id_, delFlg_1)
explainVersion: "1"
stages: Array
serverInfo: Object
serverParameters: Object
command: Object
ok: 1
$clusterTime: Object
operationTime: Timestamp({ t: 1666074389, i: 3 })
Production uses MongoDB 5.0.5; local uses MongoDB 6.0.2.
Unfortunately the explain output that was provided is mostly truncated (eg Array and Object hiding most of the important information). Even without that there should be some meaningful observations that we can make.
As always I suggest starting with more clearly defining what your ultimate goals are. What is considered "so slow", and how quickly would the operation need to return in order to be acceptable for your use case? As outlined below, I doubt there is much room for improvement with the current schema. You may need to rethink the approach entirely if you are looking for improvements that are several orders of magnitude in nature.
Let's consider what this operation is doing:
The database goes to an entry in the { delFlg: 1 } index that has a value of false.
It then goes to FETCH that full document from the collection.
Based on the value of the parking_id field, it performs an IXSCAN of the { _id: 1 } index on the parkings collection. If a matching index entry is discovered then the database will FETCH the full document from the other collection and add it to the new parking array field that is being generated.
This full process then repeats 71,183 more times.
Using 8763ms as the duration, that means that the database performed each full iteration of the work described above more than 8 times per millisecond (8.12 ~= 71184 (iters) / 8763 (ms)). I think that sounds pretty reasonable and is unlikely to be something that you can meaningfully improve on. Scaling up the hardware that the database is running on may provide some incremental improvements, but it is generally costly, not scalable, and likely not worthwhile if you are looking for more substantial improvements.
You also mentioned two different timings and database versions in the question. If I understood correctly, it was mentioned that the (explain) operation takes ~9 seconds on version 5.0 and about 1.4 seconds on version 6.0. While there is far too little information here to say for sure, one reason for that may be associated with improvements for $lookup that were introduced in version 6.0. Indeed from their 7 Big Reasons to Upgrade to MongoDB 6.0 article, they claim the following:
The performance of $lookup has also been upgraded. For instance, if there is an index on the foreign key and a small number of documents have been matched, $lookup can get results between 5 and 10 times faster than before.
This seems to match your situation and observations pretty directly.
Apart from upgrading, what else might improve performance here? Well, if many or most documents in the source collection have { delFlg: false }, then you may wish to get rid of the { delFlg: 1 } index. Indexes provide benefits when the size of the results they retrieve is small relative to the overall size of the data in the collection. But as that percentage grows, the overhead of scanning most of the index plus randomly fetching all of the data quickly becomes less effective than just scanning the full collection directly. It is mentioned in the comments that the invoices collection contains 70k documents, so this predicate on delFlg hardly seems to remove any results at all.
One other thing that really stands out is this statement from the comments: "parkings contains 16 documents". Should the information from those 16 documents just be moved into the invoices documents directly instead? If the two are commonly accessed together and the parkings information doesn't change very often, then combining the two would remove a lot of overhead and further improve performance.
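As a rough sketch (any field names beyond those in the question are hypothetical), the embedded form might look like this:
// before: each invoice only references a parking by id and needs a $lookup at read time
{ _id: ObjectId("..."), delFlg: false, parking_id: ObjectId("...") }
// after: the small, rarely-changing parking fields are embedded directly,
// so the aggregation no longer needs the $lookup stage at all
{
  _id: ObjectId("..."),
  delFlg: false,
  parking: { _id: ObjectId("..."), name: "Lot A", capacity: 120 }
}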
Two additional aspects of explain are worth keeping in mind:
The command does not factor in network time to actually transmit the results to the client. Depending on the size of the documents and the physical topology of the environment, this network time could add another meaningful delay to the operation. Be sure that the client really needs all of the data that is being sent.
Depending on the verbosity, explain will measure the time it takes to run the operation through completion. Since there are no blocking stages in your aggregation pipeline, we would expect your initial batch of documents to be returned to the client in much less time. Apart from networking, that time may be around 12ms (which is approximately (101 docs / 71184 ) * 8763) for 5.0.
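For reference, the full-execution timing discussed above can be gathered from the shell roughly like this (the collection name is taken from the comments; the pipeline mirrors the one in the question):
// measure full execution of the pipeline, including all ~71k iterations
db.invoices.explain("executionStats").aggregate([
  { $match: { delFlg: false } },
  {
    $lookup: {
      from: "parkings",
      localField: "parking_id",
      foreignField: "_id",
      as: "parking"
    }
  }
])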

How to efficiently query MongoDB for documents when I know that 95% are not used

I have a collection of ~500M documents.
Every time when I execute a query, I receive one or more documents from this collection. Let's say I have a counter for each document, and I increase this counter by 1 whenever this document is returned from the query. After a few months of running the system in production, I discover that the counter of only 5% of the documents is greater than 0 (zero). Meaning, 95% of the documents are not used.
My question is: Is there an efficient way to arrange these documents to speed up query execution, based on the fact that 95% of the documents are not used?
What is the best practice in this case?
If, for example, I add another boolean field named "consumed" to each document and index that field, can I improve the query execution time somehow?
~500M documents: that is quite a solid figure, good job if that's true. Here is how I see the solution to the problem:
If you are willing to rewrite/refactor and rebuild the app's database, you could use a versioning pattern.
What does it look like?
Imagine you have two collections (or even two databases, if you are using a microservice architecture):
Relevant docs / Irrelevant docs.
Basically, you run find only against the relevant-docs collection (which stores the 5% of documents you actually use), and only if nothing is found do you fall back to Irrelevant.find(). This pattern also lets you keep old/historical data and manage it via a TTL index or a capped collection.
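A minimal sketch of that fallback pattern (collection and field names are hypothetical):
// "key" and someKey are placeholders for whatever you look documents up by
var someKey = 42;
// query the small "hot" collection first, fall back to the archive only on a miss
var doc = db.relevant_docs.findOne({ key: someKey });
if (!doc) {
  doc = db.irrelevant_docs.findOne({ key: someKey });
}
// optionally expire archived documents automatically with a TTL index
db.irrelevant_docs.createIndex({ lastUsedAt: 1 }, { expireAfterSeconds: 180 * 24 * 60 * 60 });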
You could also add some Redis caching on top of it (which uses precisely the same logic).
This article can also be helpful (as many others, like this SO question)
But don't try to replace Mongo with Redis, team them up instead.
Using Indexes and .explain()
If - for example - I will add another boolean field for each document named "consumed" and index this field. Can I improve the query execution time somehow?
Yes, that will help with your problem. To see it in action, download MongoDB Compass, add this boolean field to your schema (don't forget a default value), index the field, and then use the Explain module with some query. But don't forget about compound indexes! If you index only this one field, measure the performance by querying only this one field.
The result should look something like this:
If your index is actually being used (and actually speeding things up), Compass will show you.
To measure the performance of the queries (with and without indexing), use the Explain tab.
Actually, all of this can be done without Compass itself, via explain() and createIndex() from the shell, but Compass has better visuals for this process, so it's worth using, especially since it is now completely free.
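Done from the shell instead of Compass, that might look roughly like this (the collection name is hypothetical; the partial index is an optional extra beyond what is described above, so that only the ~5% of "hot" documents are kept in the index):
// add the flag with a default value to existing documents (hypothetical collection name)
db.items.updateMany({}, { $set: { consumed: false } });
// index only the documents that are actually consumed
db.items.createIndex(
  { consumed: 1 },
  { partialFilterExpression: { consumed: true } }
);
// verify the index is used and compare timings with and without it
db.items.find({ consumed: true }).explain("executionStats");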

Best way to query entire MongoDB collection for ETL

We want to query an entire live production MongoDB collection (v2.6, around 500GB of data on around 70M documents).
We're wondering what's the best approach for this:
A single query with no filtering to open a cursor and get documents in batches of 5/6k
Iterate with pagination, using a logic of find().limit(5000).skip(currentIteration * 5000)
We're unsure which is the best practice and will yield the best results with minimal impact on performance.
I would go with a mix of 1 and 2 if possible: iterate over your huge dataset in pages, but access those pages by querying on a range instead of skipping over them, as skipping may be costly, as also pointed out by the docs.
The cursor.skip() method is often expensive because it requires the server to walk from the beginning of the collection or index to get the offset or skip position before beginning to return results. As the offset (e.g. pageNumber above) increases, cursor.skip() will become slower and more CPU intensive. With larger collections, cursor.skip() may become IO bound.
So if possible, build your pages on an indexed field and process those batches of data with a corresponding query range.
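A minimal sketch of such range-based paging on the indexed _id field (the batch size and the processing step are placeholders):
// page through the collection by remembering the last _id seen instead of using skip()
var lastId = null;
var pageSize = 5000;
while (true) {
  var filter = lastId ? { _id: { $gt: lastId } } : {};
  var batch = db.yourcoll.find(filter).sort({ _id: 1 }).limit(pageSize).toArray();
  if (batch.length === 0) break;
  // ...process the batch here...
  lastId = batch[batch.length - 1]._id;
}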
The brutal way
Generally speaking, most drivers load batches of documents anyway. So your language's equivalent of
var docs = db.yourcoll.find()
docs.forEach(
  function(doc){
    //whatever
  }
)
will actually just create a cursor initially, and will then, when the current batch is close to exhaustion, load a new batch transparently. So doing this pagination manually while planning to access every document in the collection will have little to no advantage, but it adds the overhead of multiple queries.
As for ETL, manually iterating over the documents to modify and then store them in a new instance does not, under most circumstances, seem reasonable to me, as you would basically be reinventing the wheel.
Alternate approach
Generally speaking, there is no one-size-fits all "best" way. The best way is the one that best fits your functional and non-functional requirements.
When doing ETL from MongoDB to MongoDB, I usually proceed as follows:
ET…
Unless you have very complicated transformations, MongoDB's aggregation framework is a surprisingly capable ETL tool. I use it regularly for that purpose and have yet to find a problem not solvable with the aggregation framework for in-MongoDB ETL. Given the fact that in general each document is processed one by one, the impact on your production environment should be minimal, if noticeable at all. After you have done your transformation, simply use the $out stage to save the results in a new collection.
Even collection spanning transformations can be achieved, using the $lookup stage.
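A hedged sketch of what such an in-MongoDB ETL pipeline might look like (all collection and field names here are made up for illustration):
// extract + transform: join, reshape, then write the result to a new collection with $out
db.source.aggregate([
  { $lookup: { from: "lookup_coll", localField: "refId", foreignField: "_id", as: "ref" } },
  { $unwind: "$ref" },
  { $project: { _id: 1, name: 1, refName: "$ref.name" } },
  { $out: "target_collection" }
])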
…L
After you have done the extract and transform on the old instance, you have several possibilities for loading the data into the new MongoDB instance:
Create a temporary replica set consisting of the old instance, the new instance and an arbiter. Make sure your old instance becomes primary, do the ET part, have the primary step down so your new instance becomes primary, and remove the old instance and the arbiter from the replica set. The advantage is that you let MongoDB's replication mechanics get the data from your old instance to your new instance, without having to worry about partially executed transfers and such. And you can use it the other way around: transfer the data first, make the new instance the primary, remove the other members from the replica set, then perform your transformations and remove the "old" data.
Use db.cloneCollection() (sketched below). The advantage here is that you transfer only the collections you need, at the expense of more manual work.
Use db.cloneDatabase() to copy over the entire DB. Unless you have multiple databases on the original instance, this method has little to no advantage over the replica set method.
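For illustration, the clone helpers mentioned above are run on the new instance and look roughly like this (the host name is made up; these helpers exist only on older MongoDB versions, which matches the v2.6 deployment in the question):
// run on the NEW instance: pull a single collection from the old instance
db.cloneCollection("old-mongo.example.net:27017", "mycollection")
// or copy an entire database from the old instance
db.cloneDatabase("old-mongo.example.net:27017")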
As written, without knowing your exact use cases, transformations and constraints, it is hard to tell which approach makes the most sense for you.
MongoDB 3.4 supports parallel collection scans. I have never tried this myself yet, but it looks interesting to me.
This will not work on sharded clusters. If you have a parallel processing setup, this will definitely speed up the scanning.
Please see the documentation here: https://docs.mongodb.com/manual/reference/command/parallelCollectionScan/
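Roughly, the command is invoked like this (the collection name and cursor count are placeholders; as noted above, it is not available on sharded clusters):
// ask the server for up to 4 cursors that together cover the whole collection
var result = db.runCommand({ parallelCollectionScan: "mycollection", numCursors: 4 });
// result.cursors can then be handed to separate workers and drained in parallel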

MongoDB Using Map Reduce against Aggregation

I have seen this asked a couple of years ago. Since then, MongoDB 2.4 has made multi-threaded Map Reduce available (after the switch to the V8 JavaScript engine), and it has become faster than it was in previous versions, so the argument that it is slow is no longer an issue.
However, I am looking for a scenario where a Map Reduce approach might work better than the Aggregation Framework; in fact, possibly a scenario where the Aggregation Framework cannot work at all but Map Reduce can get the required results.
Thanks,
John
Take a look at this.
The Aggregation Framework's results are stored in a single document, so they are limited to 16 MB; this might not be suitable for some scenarios. With MapReduce there are several output types available, including an entire new collection, so it doesn't have that space limit.
Generally, MapReduce is better when you have to work with large data sets (maybe even the entire collection). Furthermore, it gives you much more flexibility (you write your own aggregation logic) instead of being restricted to a set of pipeline commands.
Currently the Aggregation Framework results can't exceed 16 MB. But, I think more importantly, you'll find that the AF is better suited to "here and now" type queries that are dynamic in nature (for example, where filters are provided at run time by the user).
A MapReduce is preplanned and can be far more complex and produce very large outputs (as it just outputs to a new collection). It has no run-time inputs that you can control. You can add complex object manipulation that simply is not possible (or efficient) with the AF. It's simple to manipulate child arrays (or things that are array-like), for example, in MapReduce, as you're just writing JavaScript, whereas in the AF, things can become very unwieldy and unmanageable.
The biggest issue is that MapReduce results aren't automatically kept up to date, and it's difficult to predict when they'll complete. You'll need to implement your own solution to keep them up to date (unlike some other NoSQL options). Usually, that's just a timestamp of some sort and an incremental MapReduce update (as shown here). You'll possibly need to accept that the data may be somewhat stale and that the jobs will take an unknown length of time to complete.
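A hedged sketch of such a timestamp-based incremental MapReduce (all collection and field names are hypothetical):
// remember when the last run happened so only newer documents are processed
var lastRun = db.mr_meta.findOne({ _id: "orders_by_cust" });
db.orders.mapReduce(
  function () { emit(this.cust_id, this.amount); },          // map
  function (key, values) { return Array.sum(values); },      // reduce
  {
    query: { ts: { $gt: lastRun ? lastRun.ts : new Date(0) } },
    out: { reduce: "orders_by_cust" }  // fold new results into the existing output collection
  }
);
db.mr_meta.updateOne(
  { _id: "orders_by_cust" },
  { $set: { ts: new Date() } },
  { upsert: true }
);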
If you hunt around on StackOverflow, you'll find lots of very creative solutions to solving problems with MongoDB and many solutions use the Aggregation Framework as they're working around limitations of the general query engine in MongoDB and can produce "live/immediate" results. (Some AF pipelines are extremely complex though which may be a concern depending on the developers/team/product).