In my app, users can block other users. There will be queries where I will need to find
$in: [user_id_x, array_that_contains_all_blocked_user_ids]
At what length of array_that_contains_all_blocked_user_ids will this operation become slow?
If a user is expected to be able to block up to 100,000 users, how can I design my schema so that this operation scales?
I can't find anything about this in the official documentation for $in (aggregation).
However, the official documentation for the $in query operator states:
"It is recommended that you limit the number of parameters passed to the $in operator to tens of values. Using hundreds of parameters or more can negatively impact query performance."
My interpretation is that the official recommendation is to stay below 100 values. In the end, though, it depends on what counts as "slow" for you, and that depends heavily on your actual scenario, such as system specs and performance requirements.
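One possible way to stay within the "tens of values" guideline for the query-operator form of $in is to split a long ID list into small batches and issue one filter per batch. This is only a sketch: the field name `author_id` and the batch size of 50 are assumptions for illustration, and whether several small queries beat one large one depends on your workload, so measure before adopting it.

```python
from typing import Any, Dict, Iterator, List, Sequence


def chunked(values: Sequence[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive slices of `values`, each holding at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(values), size):
        yield list(values[start:start + size])


def in_filters(field: str, values: Sequence[Any], size: int = 50) -> List[Dict[str, Any]]:
    """Build one {field: {"$in": [...]}} filter document per batch of values."""
    return [{field: {"$in": batch}} for batch in chunked(values, size)]
```

Each returned filter can be passed to a driver's find() call, for example `collection.find(f)` in PyMongo for every `f` in `in_filters("author_id", blocked_ids)`.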
The official MongoDB driver offers both a 'count' and an 'estimated document count' API. As far as I know, the former is highly memory intensive, so the latter is recommended in situations that call for it.
But how accurate is this estimated document count? Can the count be trusted in a Production environment, or is using the count API recommended when absolute accuracy is needed?
Comparing the two, to me it's very difficult to conjure up a scenario in which you'd want to use countDocuments() when estimatedDocumentCount() was an option.
That is, the equivalent form of estimatedDocumentCount() is countDocuments({}), i.e., an empty query filter. The cost of the first function is O(1); the second is O(N), and if N is very large, the cost will be prohibitive.
Both return a count, which, in a scenario in which Mongo has been deployed, is likely to be quite ephemeral, i.e., it's inaccurate the moment you have it, as the collection changes.
Please review the MongoDB documentation for estimatedDocumentCount(). Specifically, they note that "After an unclean shutdown of a mongod using the Wired Tiger storage engine, count statistics reported by db.collection.estimatedDocumentCount() may be inaccurate." This is due to metadata being used for the count and checkpoint drift, which will typically be resolved after 60 seconds or so.
In contrast, the MongoDB documentation for countDocuments() states that this method is a wrapper that performs a $group aggregation stage to $sum the results set, ensuring absolute accuracy of the count.
Thus, if absolute accuracy is essential, use countDocuments(). If all you need is a rough estimate, use estimatedDocumentCount(). The names are accurate to their purpose and should be used accordingly.
The main difference is filtering.
count_documents can be filtered on like a normal query whereas estimated_document_count cannot be.
If filtering is not part of your use case, then I would use estimated_document_count, since it is much faster.
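To make the trade-off in the answers above concrete, here is a minimal sketch of a helper that picks between the two PyMongo methods. The helper name `count` and its `exact` flag are made up for illustration; `collection` is assumed to be a PyMongo Collection (or anything exposing the same two methods).

```python
from typing import Any, Mapping, Optional


def count(collection: Any,
          query: Optional[Mapping[str, Any]] = None,
          exact: bool = False) -> int:
    """Return a document count, using the metadata-based estimate when allowed."""
    if query or exact:
        # count_documents honours the filter and counts matching documents,
        # so it is the only choice when a filter or an exact figure is needed.
        return collection.count_documents(dict(query or {}))
    # No filter and no accuracy requirement: read the count from metadata.
    return collection.estimated_document_count()
```

With this, `count(coll)` takes the cheap path, while `count(coll, {"status": "active"})` or `count(coll, exact=True)` falls back to count_documents.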
I already read official documentation to get the basic idea on getPlanCache() and hint().
getPlanCache()
Displays the cached query plans for the specified query shape.
The query optimizer only caches the plans for those query shapes that can have more than one viable plan.
Official Documentation: https://docs.mongodb.com/manual/reference/method/PlanCache.getPlansByQuery/
hint()
The $hint operator forces the query optimizer to use a specific index to fulfill the query. Specify the index either by the index name or by document.
Official Documentation: https://docs.mongodb.com/manual/reference/operator/meta/hint/
My Question
If I can make sure the specific collection can cache the plan, I don't need to use hint() to ensure optimized performance. Is that correct?
I already read official documentation to get the basic idea on getPlanCache() and hint().
To be clear: these are troubleshooting aids for investigating query performance. The MongoDB query planner chooses the most efficient plan available based on a measure of "work" involved in executing a given query shape. If there is only a single viable plan, there is no need to cache the plan selection. If there are multiple query plans available for the same query shape, the query planner will periodically evaluate performance and update the cached plan selection if appropriate.
The query plan cache methods allow you to inspect and clear information in the plan cache. Generally you would only want to clear the plan cache while investigating issues in a development/staging environment, as this could have a noticeable effect on a busy deployment.
If I can make sure the specific collection can cache the plan, I don't need to use hint() to ensure optimized performance. Is that correct?
In general you should avoid using hint (outside of testing query plans) as this bypasses the query planner and forces use of the hinted index even if there might be a more efficient index available.
If a specific query is not performing as expected, explain() output is the best starting point for insight into the query planning process. If you're not sure how to optimise a specific query, I'd suggest posting a question on DBA StackExchange including the output of explain(true) (verbose explain) and your MongoDB server version.
For a helpful presentation, see: Reading the .explain() Output - Charlie Swanson (June 2017).
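As a starting point for reading explain() output, the sketch below pulls the index names out of the winning plan. It assumes the classic explain shape, where `queryPlanner.winningPlan` is a tree of stages linked through `inputStage` / `inputStages`; newer server versions may nest the plan differently, so treat the key names as assumptions to check against your own output. The index name in the sample is made up.

```python
from typing import Any, Dict, List


def indexes_in_plan(plan: Dict[str, Any]) -> List[str]:
    """Collect indexName values from a plan stage and all nested input stages."""
    names: List[str] = []
    if "indexName" in plan:
        names.append(plan["indexName"])
    children = list(plan.get("inputStages", []))
    if "inputStage" in plan:
        children.append(plan["inputStage"])
    for child in children:
        names.extend(indexes_in_plan(child))
    return names


def winning_indexes(explain_output: Dict[str, Any]) -> List[str]:
    """Return the indexes used by the winning plan (empty for a collection scan)."""
    return indexes_in_plan(explain_output["queryPlanner"]["winningPlan"])


# Hand-written sample in the assumed shape, not real server output.
sample = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "FETCH",
            "inputStage": {"stage": "IXSCAN", "indexName": "status_1_created_-1"},
        }
    }
}
print(winning_indexes(sample))
```

In PyMongo, the input could come from `collection.find(query).explain()`; comparing the result with and without `.hint(...)` is one way to check whether a hint is changing anything.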
Is it better to create multiple compound indexes to serve the various types of queries?
or
Is it better to
use a single compound index that supports multiple queries (which is hard to analyse and construct, since there are a large number of queries)?
My basic question is: "Will creating multiple compound indexes slow down read/write operations?"
Please suggest a solution.
There isn't any answer that fits all cases, but in general adding the right indexes will give you better performance. You will need fewer reads when accessing data. Maintaining the indexes will cost you some performance; however, if they are correct and actually used, your db will perform better overall. Start with monitoring: mongodb monitoring docs
Indices will slow down writes but speed up reads. A high read-to-write ratio warrants one or more indices on commonly fetched fields (keys). For example, our current system sees 25 writes to 20,000 reads (tps), so indices are beneficial given that wide margin. That being said, be mindful of holding the mongo write lock for as short a time as possible.
"MongoDB uses a readers-writer lock that allows concurrent reads access to a database but gives exclusive access to a single write operation." (mongodb docs)
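On the question of one compound index serving several queries: a compound index can also serve queries on its leading fields (its prefixes). The sketch below only enumerates those prefixes and does a rough check of whether a set of equality fields lines up with one of them. The field names are hypothetical, and the real query planner considers more than this (sort order, range predicates, partial prefix use), so treat it as a thinking aid rather than a planner model.

```python
from typing import Iterable, List, Sequence, Tuple

IndexKey = Tuple[str, int]  # (field name, direction), as passed to create_index


def index_prefixes(keys: Sequence[IndexKey]) -> List[List[IndexKey]]:
    """Return every leading prefix of a compound index key list."""
    return [list(keys[:n]) for n in range(1, len(keys) + 1)]


def matches_prefix(keys: Sequence[IndexKey], fields: Iterable[str]) -> bool:
    """True if the queried fields are exactly the fields of some index prefix."""
    wanted = set(fields)
    return any({name for name, _ in prefix} == wanted
               for prefix in index_prefixes(keys))


# Hypothetical compound index on (status, user_id, created_at descending).
keys = [("status", 1), ("user_id", 1), ("created_at", -1)]
for prefix in index_prefixes(keys):
    print(prefix)
```

Here a query on `status` and `user_id` lines up with a prefix, while a query on `user_id` alone does not, which is the kind of case where a second index might be worth its write cost.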
I have seen this asked a couple of years ago. Since then, MongoDB 2.4 has made multi-threaded Map Reduce available (after the switch to the V8 JavaScript engine), and it has become faster than in previous versions, so the argument that it is slow no longer applies.
However, I am looking for a scenario where a Map Reduce approach might work better than the Aggregation Framework. In fact, possibly a scenario where the Aggregation Framework cannot work at all but Map Reduce can get the required results.
Thanks,
John
Take a look at this.
The Aggregation FW results are stored in a single document, so they are limited to 16 MB: this might not be suitable for some scenarios. With MapReduce there are several output types available, including an entirely new collection, so it doesn't have that space limit.
Generally, MapReduce is better when you have to work with large data sets (maybe the entire collection). Furthermore, it gives much more flexibility (you write your own aggregation logic) instead of being restricted to some pipeline commands.
Currently the Aggregation Framework results can't exceed 16MB. But, I think more importantly, you'll find that the AF is better suited to "here and now" type queries that are dynamic in nature (like filters are provided at run-time by the user for example).
A MapReduce is preplanned and can be far more complex and produce very large outputs (as they just output to a new collection). It has no run-time inputs that you can control. You can add complex object manipulation that simply is not possible (or efficient) with the AF. It's simple to manipulate child arrays (or things that are array like) for example in MapReduce as you're just writing JavaScript, whereas in the AF, things can become very unwieldy and unmanageable.
The biggest issue is that MapReduce results aren't automatically kept up to date, and it's difficult to predict when a job will complete. You'll need to implement your own solution for keeping them up to date (unlike some other NoSQL options). Usually, that's just a timestamp of some sort and an incremental MapReduce update (as shown here). You'll possibly need to accept that the data may be somewhat stale and that the jobs will take an unknown length of time to complete.
If you hunt around on StackOverflow, you'll find lots of very creative solutions to solving problems with MongoDB and many solutions use the Aggregation Framework as they're working around limitations of the general query engine in MongoDB and can produce "live/immediate" results. (Some AF pipelines are extremely complex though which may be a concern depending on the developers/team/product).
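To illustrate the "filters provided at run-time" point about the Aggregation Framework, here is a minimal sketch that assembles a pipeline from user-supplied filters. The function name and the field names are assumptions for illustration; the stages used ($match, $group with $sum) are standard pipeline operators.

```python
from typing import Any, Dict, List


def build_pipeline(filters: Dict[str, Any], group_field: str) -> List[Dict[str, Any]]:
    """Build a count-per-group pipeline, adding a $match only if filters are given."""
    pipeline: List[Dict[str, Any]] = []
    if filters:
        # Filters arrive at run-time, e.g. from a user's search form.
        pipeline.append({"$match": dict(filters)})
    pipeline.append({
        "$group": {"_id": "$" + group_field, "count": {"$sum": 1}}
    })
    return pipeline


# Hypothetical run-time input: only shipped orders, counted per customer.
print(build_pipeline({"status": "shipped"}, "customer_id"))
```

The resulting list can be handed to a driver's aggregate call, for example `collection.aggregate(pipeline)` in PyMongo.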
I'm evaluating MongoDB, coming from Membase/memcached, because I want more flexibility.
Of course Membase is excellent in doing fast (multi)-key lookups.
I like the additional options that MongoDB gives me, but is it also fast in doing multi-key lookups? I've seen the $or and $in operator and I'm sure I can model it with that. I just want to know if it's performant (in the same league) as Membase.
Use case: for example, Lucene/Solr returns 20 product IDs. Look up these product IDs in Couchdb to return the docs or the appropriate fields.
Thanks,
Geert-Jan
For your use case, I'd say it is, from my experience: I hacked some analytics into a database of mine that made a lot of $in queries with thousands of ids and it worked fine (it was a hack). To my surprise, it worked rather well, in the lower millisecond area.
Of course, it's hard to compare this, and -as usual- theory is a bad companion when it comes to performance. I guess the best way to figure it out is to migrate some test data and send some queries to the system.
Use MongoDB's excellent built-in profiler, use $explain, keep the one-index-per-query rule in mind, take a look at the logs, keep an eye on mongostat, and do some benchmarks. This shouldn't take too long and should give you a definite answer. If your queries turn out slow, people here and on the newsgroup probably have some ideas on how to improve the exact query or the indexing.
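For the "look up 20 product IDs" use case, a multi-key lookup can be written as a single $in query on _id. Since $in does not promise results in the order of the input list, the sketch below re-sorts them to match the ranking from the search engine. The helper name is made up, and `collection` is assumed to be a PyMongo Collection (anything with a compatible find() works).

```python
from typing import Any, Dict, List, Sequence


def fetch_in_order(collection: Any, ids: Sequence[Any]) -> List[Dict[str, Any]]:
    """Fetch documents by _id with one $in query, returned in the order of `ids`."""
    cursor = collection.find({"_id": {"$in": list(ids)}})
    by_id = {doc["_id"]: doc for doc in cursor}
    # IDs with no matching document are skipped silently.
    return [by_id[i] for i in ids if i in by_id]
```

Because _id is always indexed, this is the kind of query worth timing with the profiler and explain before comparing it with a key-value store.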
One index per query. It's sometimes thought that queries on multiple
keys can use multiple indexes; this is not the case with MongoDB. If
you have a query that selects on multiple keys, and you want that
query to use an index efficiently, then a compound-key index is
necessary.
http://www.mongodb.org/display/DOCS/Indexing+Advice+and+FAQ#IndexingAdviceandFAQ-Oneindexperquery.
There's more information on that page as well with regard to Indexes.
The bottom line is Mongo will be great if your indexes are in memory and you are indexing on the columns you want to query using composite keys. If you have poor indexing then your performance will suffer as a result. This is pretty much in line with most systems.