I am developing a web application with MongoDB. I have noticed that when I execute the same query repeatedly, it gets faster and eventually converges at a certain point. For example, when querying all documents in the collection, the time per query goes 100000 ms -> 20000 ms -> 9000 ms -> ... -> 500 ms.
What is the reason behind this speed-up, and how can I estimate the convergence point?
There are several reasons; I can give some pointers.
First of all, MongoDB tries to choose the best index for your query. To do so, it uses query plans, but this operation takes time:
If there are no matching entries, the query planner generates
candidate plans for evaluation over a trial period. The query planner
chooses a winning plan, creates a cache entry containing the winning
plan, and uses it to generate the result documents.
https://docs.mongodb.com/manual/core/query-plans/
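For example, assuming a hypothetical collection named orders with a status field, you can check which plan won the trial with explain(), and clear the cache if you want the planner to re-evaluate on the next run:
// Hypothetical collection "orders" and field "status", purely for illustration.
db.orders.find({ status: "shipped" }).explain()   // winningPlan shows IXSCAN (index scan) vs COLLSCAN (full scan)
db.orders.getPlanCache().clear()                  // drop cached plans so plan selection runs again on the next query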
Indexes need to be loaded into memory to speed up performance, and this also takes time. Have a look at the touch command:
https://docs.mongodb.com/manual/reference/command/touch/
The touch command loads data from the data storage layer into memory.
touch can load the data (i.e. documents), indexes, or both documents and indexes.
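As a sketch (the collection name below is a placeholder; note that touch only applies to the MMAPv1 storage engine and has been removed from recent MongoDB releases):
// Pre-load both the documents and the indexes of a hypothetical "records" collection into memory.
// MMAPv1 only; the command was deprecated and later removed.
db.runCommand({ touch: "records", data: true, index: true })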
Another possible reason is that the indexes do not fit into memory. To find out whether this is your case, you can check with totalIndexSize:
https://docs.mongodb.com/manual/reference/method/db.collection.totalIndexSize/#db.collection.totalIndexSize
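For example (collection name is a placeholder):
// Total size, in bytes, of all indexes on the collection; compare it against available RAM.
db.records.totalIndexSize()
// For a per-index breakdown:
db.records.stats().indexSizes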
I would focus on the query-planner side, since MongoDB doesn't always make the best decision for you.
In my opinion, all of this should be carefully evaluated to avoid performance degradation.
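If the planner keeps choosing a poor plan for a particular query, you can also force a specific index with hint() (the index and field names below are hypothetical):
// Force the query to use the { status: 1, created_at: 1 } index instead of the planner's pick.
db.orders.find({ status: "shipped" }).hint({ status: 1, created_at: 1 })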
Good Luck!
Related
I have the following problem related to MongoDB database design. Here is my situation:
I have a collection with about 50k documents (15 kB each),
every document has a dictionary storing data samples,
my query always gets all the data from a document,
every query uses an index,
the collection has only one index (based on a single datetime field),
in most cases, I need to get data from many documents (typically 25 < N < 100),
it is easier for me to perform many SELECT queries rather than a single one,
I have a lot of updates in my database, though far fewer than SELECTs,
I use the WiredTiger engine (the newest version of MongoDB),
the server instance and the web application are on the same machine.
I have two possibilities for making a SELECT query:
perform a single query retrieving all documents I am interested in,
perform N queries, each fetching a single document, where typically 25 < N < 100 (and what about the scenarios where 100 < N < 1k or 1k < N < 10k?)
So the question is: is there any additional overhead when I perform many small queries instead of a single one? In relational databases, making many queries is considered very bad practice, but what about NoSQL? I am asking about general practice: should I avoid that many queries?
In the documentation I read that what matters is not the number of queries but the number of documents searched. Is that true?
Thanks for help ;)
There is a similar question to the one you asked: Is it ok to query mongodb multiple times
IMO, for your use case, i.e. 25 < N < 100, you should definitely go with batching.
In the case of single queries:
looping in a single thread will not suffice; you'll have to make parallel requests, which creates additional overhead,
there is TCP/IP overhead for every request,
there is a certain amount of setup and teardown for each query, creating and exhausting cursors, which adds unnecessary overhead.
As explained in the answer linked above, there appears to be a sweet spot between how many values to batch up and the number of round trips, and it depends on your document type as well.
In broader terms, anything with 10 < N < 1000 should go with batching, with the remaining records forming part of further batches; querying a single document at a time would definitely create unnecessary overhead.
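As a rough sketch of the batched approach (collection and field names are hypothetical), a single $in query replaces N round trips:
// Instead of N separate find() calls, fetch the whole batch in one round trip.
var ids = [ /* the 25-100 _id values you need */ ];
db.samples.find({ _id: { $in: ids } }).toArray()
// For very large N (say 1k-10k), split ids into chunks and issue one $in query per chunk.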
The problem when you perform many small queries instead of one query is network overhead, that is, the network latency of each round trip.
For a single request it may not be much, but if you make many requests like these, or use this technique from the frontend, it will degrade performance.
Also, you may need to preprocess the data yourself, e.g. sorting or aggregating it manually.
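For instance (names are hypothetical), a single range query lets the server do the sorting for you, whereas N separate queries would force you to merge and sort the partial results yourself:
// One round trip; the server applies the sort, ideally backed by the datetime index.
var start = ISODate("2016-01-01"), end = ISODate("2016-01-31");   // placeholder range
db.samples.find({ timestamp: { $gte: start, $lte: end } }).sort({ timestamp: 1 })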
We would like to evaluate the effectiveness of our indexes in a MongoDB-based REST service setup. The idea is to populate a collection with a synthetic dataset (e.g. 10,000,000 documents) and then run a load-injector process performing random REST operations (each one involving a query at the MongoDB layer) to evaluate which indexes are being used and gather statistics about them (e.g. per-index hit rate).
We have considered using the explain() command or indexStats. However, explain() has two problems: 1) it only allows evaluating the effectiveness of a single query, and 2) it is difficult to use in a “black box” environment in which our load-injector process interacts with the REST service on top of MongoDB but not with MongoDB itself. Regarding indexStats, as far as I understand, it shows information about the index structure “on disk” but not about index usage.
So, what is the best way of doing this kind of test? Any procedure description or URL with information on the topic is very welcome.
You should read about performance profiling.
You can turn on the profiling on with:
db.setProfilingLevel(2);
Or, if you don't want too much noise and want to log only queries that took more than e.g. 20 ms:
db.setProfilingLevel(1,20);
You can query the system.profile to get the information about slow queries, e.g. find all operations slower than 30 ms:
db.system.profile.find( { millis : { $gt : 30 } } ).pretty()
You can then manually profile each slow query with explain().
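For example, take the query shape from a system.profile entry and re-run it with execution statistics (collection and field names are placeholders):
// Shows the chosen plan plus totalKeysExamined / totalDocsExamined for the slow query.
db.orders.find({ status: "shipped" }).explain("executionStats")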
For real-time monitoring you can use mongotop and mongostat. You should also consider setting up MMS (free monitoring service). In MMS you can check btree hits/misses and compare them to other relevant statistics.
Edit
You can get the relevant indexCounters data by using the serverStatus command:
db.serverStatus()
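For example, to pull out just that section (indexCounters is only reported by MMAPv1-era builds; newer servers expose similar data elsewhere in serverStatus):
// Btree accesses, hits, misses and missRatio for the whole server.
db.serverStatus().indexCounters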
I have made a test with 10 M rows of data. Each row has 3 integer and 2 string columns. First I import this data into MongoDB, which is a single shard. Then I do a simple "where" query with db.table.find() on a non-indexed column. The query fetches a single row and takes roughly 7 seconds.
On the same hardware I load the same data into an in-memory C# list. I run a while loop scanning all 10 M rows and do a simple equality check to emulate the where query. It takes only around 650 ms, which is much faster than MongoDB.
I have a 32 GB machine, so MongoDB has no problem memory-mapping the table.
Why is MongoDB so much slower? Is it because MongoDB keeps the data in a data structure that is hard to full-scan, or because memory mapping is not the same as keeping data in a variable?
As Remon pointed out you are definitely comparing apples to oranges in this test.
To understand a bit more about what is happening behind the scenes in that table scan, read through the MongoDB internals documentation here (look under the storage model).
There is the concept of extents, which represent contiguous disk space.
Each extent points to a linked list of docs.
Each doc contains the data in BSON format. So now you can imagine how data is retrieved: without an index, every extent and document has to be walked.
The beauty of having an index is aptly shown at the top-right corner of that diagram: MongoDB uses a B-tree structure to navigate, which is pretty fast.
Try changing your test to include some warm-up runs and to use an index.
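A minimal sketch of that change (the field name is a placeholder; older shells use ensureIndex instead of createIndex):
// Index the column used in the "where" filter, then run the query a few times
// so the working set is pulled into memory before you start measuring.
db.table.createIndex({ value: 1 })
for (var i = 0; i < 3; i++) {
    db.table.find({ value: 42 }).toArray()   // warm-up runs
}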
UPDATE: I have done some testing as part of my day job to compare the performance of JBoss Cache (an in-memory Java cache) with MongoDB as an application cache (queries against _id). The results are quite comparable.
Where to start..
First of all, the test is completely apples and oranges. Loading a dataset into memory and doing a fully in-memory scan of it is in no way equivalent to a table scan on any database.
I'm also willing to bet you're running your test on cold data, and MongoDB performance improves dramatically as it swaps hot data into memory. Please note that MongoDB doesn't preemptively swap data into memory; it does so if, and only if, the data is accessed frequently (or at all, depending). Actually, it's more accurate to say the OS does, since MongoDB's storage engine is built on top of MMFs (memory-mapped files).
In short, your test isn't a good test, and the way you're testing MongoDB isn't producing accurate results. You're testing a theoretical best case with your C# equivalent, which on top of that is considerably less complex than the database code.
I'm building an application that stores lots of data per user (possibly in gigabytes).
Something like a request log, so let's say you have the following fields for every record:
customer_id
date
hostname
environment
pid
ip
user_agent
account_id
user_id
module
action
id
response code
response time (range)
and possibly some more.
The good thing is that the usage will be mostly write-only, but when there are reads, I'd like to be able to answer them quickly, in near real time.
Another prediction about the usage pattern is that most of the time people will be looking at the most recent data, and only infrequently query the past, run aggregations, etc., so my guess is that the working set will be much smaller than the whole database, i.e. recent data for most users plus ranges of history for the few users who are doing analytics right now. For the latter case I suppose it's OK for the first query to be slower until it gets the range into memory.
But the problem is that I'm not quite sure how to effectively index the data. The start of the index is clear: it's customer_id and date. But the rest can be used in any combination, and I can't predict the most common ones, at least not with any degree of certainty.
We are currently prototyping this with Mongo. Is there a way to do this effectively in Mongo (in terms of storage/CPU/cost)? The only thing that comes to mind is to try to predict a couple of frequent queries and index them, then just massively shard the data and ensure that each customer's data is spread evenly over the shards, to allow a fast table scan over just the 'customer_id, date' index for the rest of the queries.
P.S. I'm also open to suggestions about db alternatives.
With this limited number of fields, you could potentially just have an index on each of them, or perhaps on each in combination with customer_id. MongoDB is then clever enough to pick the fastest index for each case. If you can fit your whole data set in memory (a few GB is not a lot of data!), then this all really doesn't matter.
You're saying you have gigabytes per user, but that still means you can have an index on the fields, as there are only about a dozen of them. And with that much data, you'll want sharding anyway at some point soon.
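For example (collection name and the exact field set are up to you; the fields are taken from the question), the indexes could look like:
// One index per filterable field, each prefixed with customer_id.
db.requests.createIndex({ customer_id: 1, date: 1 })
db.requests.createIndex({ customer_id: 1, hostname: 1 })
db.requests.createIndex({ customer_id: 1, response_code: 1 })
// ...and so on for the remaining fields you expect to filter on.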
cheers,
Derick
I think your requirements don't really mix well together. You can't have both lots of data and instantaneous ad hoc queries.
If you use a lot of indexes, then your writes will be slow, and you'll need much more RAM.
May I suggest this:
Keep your index on customer_id and date to serve recent data to users, and relax your requirements on either the real-timeliness or the accuracy of aggregate queries.
If you sacrifice accuracy, you will be firing map-reduce jobs every once in a while to precompute queries. Users then may see slightly stale data (or may not, it's historical immutable data, after all).
If you sacrifice speed, then you'll run map-reduce each time (right now it's the only sane way of calculating aggregates on a MongoDB cluster).
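As a very rough sketch of the precompute idea (collection, field and output names are placeholders), a periodic map-reduce job could roll request counts up per customer and day into a separate collection that reads then hit directly:
// Aggregate request counts per customer and day into the "daily_stats" collection.
db.requests.mapReduce(
    function () { emit({ customer: this.customer_id, day: this.date.toISOString().slice(0, 10) }, 1); },
    function (key, values) { return Array.sum(values); },
    { out: { replace: "daily_stats" } }
)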
Hope this helps :)
I'm doing a "within box" geospatial query on a collection of ~40K documents. The query takes ~0.3 s and fetching the documents takes ~0.6 s (there are ~10K documents in the result set).
The documents are fairly small (~100 bytes each) and I limit the result to return only the lat/lon fields.
It seems very slow. Is this about right or am I doing something wrong?
It seems very slow indeed. A roughly equivalent search I did on PostgreSQL, for example, is almost too fast to measure (i.e. probably faster than 1 ms).
I don't know much about MongoDB, but are you certain that the geospatial index is actually turned on? (I ask because in RDBMSs it's easy to define a table with geometrical/geographical columns yet not define the actual indexing appropriately, and so you get roughly the same performance as what you describe).
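For what it's worth, in the mongo shell you can confirm the index with getIndexes(); a 2d index and the corresponding box query would look roughly like this (collection and field names are placeholders):
// A 2d geospatial index on the location field is required for $box queries.
db.places.createIndex({ loc: "2d" })
db.places.getIndexes()   // verify the geospatial index is really there
// The box query itself, projecting only the coordinates.
db.places.find(
    { loc: { $geoWithin: { $box: [ [ -74.0, 40.5 ], [ -73.5, 41.0 ] ] } } },
    { loc: 1, _id: 0 }
)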