Inserts bursts on collection with indexes - mongodb

I have a very big collection with almost every field is indexed (I know sounds like I need to redesign my system, or queries, but lets think it goes as it is). So a lot of data and a lot of indexes. The writes happen in big bursts (millions of inserts immediately, use insertMany). It is fine at the beginning, cause first I write everything and then index, works very fast. The next burst is much slower, than the first, and every next of that is even slower. I know that if I write all the data first then index it would be fast. But I do that in the bursts it gets 100-1000 times slower. Of course, it is something to do with indexes here, looked at the mongod logs, it shows a lot of index file check points taking quite some time. Try to see if there an index building job (MongoDB version 4.2, so background indexing should kick in) that I can kill, did not find one, should there be one?
The only solution I found is to drop off indexing before the burst and build again after, but it sounds very pessimistic approach. Any suggestions in how to delay, supress indexing temporarily? Any parameter I can adjust in DB or insert queries to ease the insertion?

Related

read queries becoming slower the more indexes I add

It seems that the more compound index I add to my collection it gets better to some point and then beyond that the more indexes the slower it becomes.
Is this possible? If so why?
EDITED:
I am referring to read queries. not write queries. I am aware that writes will be slower.
This is the case for any sort of index, not just compound indexes.
In MongoDB (and most databases) a lot of operations are sped up by having an index, at the cost of maintaining each index.
Generally speaking this shouldn't slow down things like a find but it will very much affect insert and update as those change the underlying data and thus requires modifying or rebuilding of each index those changes are linked to.
However, even with inserts and updates an index can help speed up those operations as the query engine can find the documents to update quicker.
In the end it very much a balance as the cost to maintain the indexes, and the space they take up ... can if you were to be overzealous (i.e. creating many, many less used indexes) ... counteract their helpfulness.
For a deeper dive into that, I'd suggest these docs:
https://www.mongodb.com/docs/manual/core/data-model-operations/#std-label-data-model-indexes
https://www.mongodb.com/docs/manual/core/index-creation/
I agree with the information that #Justin Jenkins shared in their answer, as there is absolutely write overhead associated with maintaining indexes. I don't think that answer focuses query performance much though which is what I understand this question to be about. I will give some thoughts about that below, though without additional details about the situation it will necessarily be a little generic.
Although indexes absolutely feel magical at times, they are still just a utility that we make available for the database to use when running operations. Ideally it would never be the case that adding an index would slow down the execution of a query, but unfortunately it can in some circumstances. This is not particularly common which is why it is not often an upfront talking point or concern.
Here are some important considerations:
The database is responsible for figuring out the index(es) that would result in the most efficient execution plan for every arbitrary query that is executed
Indexes are data structures. They take up space in memory when loaded from disk and must be traversed to be read.
The server hosting the database only has finite resources. Every time it uses some of those resources to maintain indexes it reduces the amount of resources available to process queries. It also introduces more possibilities for locking, yielding, or other contention to maintain consistency.
If you are observing a sudden and drastic degradation in query performance, I would tend to suspect a problem associated with the first consideration above. Again while not particularly common, it is possible that the increased number of indexes is now preventing the database from finding the optimal plan. This would be most likely if the query contained an $or operator, but can happen in other situations as well. Be on the lookout for a different index being reported in the winningPlan of the explain output for the query. It would usually happen after a specific number of indexes were created and/or if that new index(es) had a particular definition relevant to the query of interest.
A slower and more linear degradation in performance would seem to be for a different reason, such as the second or third items mentioned above. While memory/cache contention can certainly still degrade performance reasonably quickly, you would not see a shift in the query plans with one of these problems. What can happen here instead is now you have two indexes which (for simplicity) take up twice the amount of space now competing for the same limited space in memory. If what is requested exceeds what is available then the database will have to begin reading useful portions of the indexes (and data) into and out of its cache. This overhead can quickly add up and will result in operations now spending more time waiting for their portion of the index to be made available in memory for reading. I would expect a broader portion of queries to be impacted, though more moderately, in this situation.
In any case, the most actionable broad advice we can give would be for you to review and consolidate your existing indexes. There is a little bit of guidance on the topic here in the documentation. The general idea is that the prefix of the index (the keys at the beginning) are the important ones when it comes to usage for queries. Except for a few special circumstances, a single field index on { A: 1 } is completely redundant if you have a separate compound index on { A: 1, B: 1 }. Since the latter index can support all of the operations that the former one can, the former one (single field index in this example) should be removed.
Ultimately you may have to make some tradeoffs about which indexes to maintain and there may not be a 'perfect' index present for every single query. That's okay. Sometimes it is better to let one query do a little extra scanning when one of its predicate fields is not indexed as opposed to maintaining an entirely separate index. There is a tradeoff here at some point and, as #Justin Jenkins put it, it's important to go too far and become overzealous when creating indexes.

mongodb java driver readFully is slow

db.collection.find in my app that uses mongodb java driver (latest) are super slow. I investigated one of them as follows
// about 300 hundred ids at a time (i've tried lower and higher numbers - no impact
db.users.find({_id : {$in : [1,2,3,4,5,6....]}})
Once I get the cursor I do: cursor.toArray() and then iterate of the results
The toArray operation is extremely slow. On average they take about a minute. IMPORTANT: my database is under very heavy load at all times. This particular collection has over 50mm entries.
I've narrowed down the issue in mongo java driver to com.mongodb.Response - specifically to this line:
final byte [] b = new byte[36];
Bits.readFully(in, b);
Incredibly readFully of just 36 bytes takes over a minute some times!
When I bring own the load on the databases, the improvements are drastic. From about a minute to 5-6 seconds. I mean 5-6 seconds to get 300 documents is still super slow, but definitely better then 1 minute.
What can I do to troubleshoot this further? Are there settings on MondoDB that I need to look at?
What happens
You are loading all of the 300 user documents.
What happens is that the _id index is searched and the respective documents are sent completely to your app. So mongoDB will access it's data files, read the first document and send it to you, then it jumps to the next document and sends it to you and so forth. If you used the cursor, you could start iterating over the returned documents as soon as a number of documents equalling your defined cursor size have been returned, as others will be lazily loaded from the cursor on the server on demand. (Bit of a simplification, but sufficient for answering this question). What you do is to explicitly wait until the index is scanned, the documents are located, sent back to your app and have reached it down to the last byte of the last document. As #wdberkeley (who works for 10gen) correctly pointed out, this is a Very Bad Idea™.
What might cause or intensify the problem
Under heavy load, two things might happen. The more likely is that your _id index isn't in RAM any more, causing thousands, if not millions of reads from disk - which is slow. Much slower than if the indices are kept in RAM (by several orders of magnitude). So it is not the code snippet you mentioned, but the response time of MongoDB which causes the delay. Another option under heavy load is that your disk IO is simply too low or (more likely) the random file read latency is too high. I assume you are using spinning disks plus not enough RAM for a database that size.
What to do to find the cause
Try to find out your index size using the db.users.stats(). I am pretty sure that your index size(s combined) exceed your available RAM.
Measure the disk IO and latency. If you use a GNU/Linux OS, you might want to find out how high your IOwait percentage is. A high percentage shows that your disk latency is too high for the load put on the server. It might even be that your are reaching the disk's IO limits.
Do your queries on a mongo shell. In case they are fast, you can be pretty sure that your toArray call is the cause of the problem.
What to do to resolve the problem
If you have not enough RAM, either scale up or scale out.
If your disk latency or throughput is too high, either scale out or ( better and cheaper in most cases ) use SSDs for storing MongoDB's data.
Use a cursor object to iterate over the documents. This is a better solution in almost every use case I can think of.
Upgrading MongoDB driver to 3.6.4 will fetch the data in no-time.
We have around 2 million documents in our collection and with previous version it was taking around ~3 minutes but after after upgrading to 3.6.4 it took only 5-7 sec.So what I feel is that there is some issue with the old version of mongoDB driver.

MongoDB fast deletion best approach

My application currently use MySQL. In order to support very fast deletion, I organize my data in partitions, according to timestamp. Then when data becomes obsolete, I just drop the whole partition.
It works great, and cleaning up my DB doesn't harm my application performance.
I would want to replace MySQL with MongoDB, and I'm wondering if there's something similiar in MongoDB, or would I just need to delete the records one by one (which, I'm afraid, will be really slow and will make my DB busy, and slow down queries response time).
In MongoDB, if your requirement is to delete data to limit the collection size, you should use a capped collection.
On the other hand, if your requirement is to delete data based on a timestamp, then a TTL index might be exactly what you're looking for.
From official doc regarding capped collections:
Capped collections automatically remove the oldest documents in the collection without requiring scripts or explicit remove operations.
And regarding TTL indexes:
Implemented as a special index type, TTL collections make it possible to store data in MongoDB and have the mongod automatically remove data after a specified period of time.
I thought, even though I am late and an answer has already been accepted, I would add a little more.
The problem with capped collections is that they regularly reside upon one shard in a cluster. Even though, in latter versions of MongoDB, capped collections are shardable they normally are not. Adding to this a capped collection MUST be allocated on the spot, so if you wish to have a long history before clearing the data you might find your collection uses up significantly more space than it should.
TTL is a good answer however it is not as fast as drop(). TTL is basically MongoDB doing the same thing, server-side, that you would do in your application of judging when a row is historical and deleting it. If done excessively it will have a detrimental effect on performance. Not only that but it isn't good at freeing up space to your $freelists which is key to stopping fragmentation in MongoDB.
drop()ing a collection will literally just "drop" the collection on the spot, instantly and gracefully giving that space back to MongoDB (not the OS) giving you absolutely no fragmentation what-so-ever. Not only that but the operation is a lot faster, 90% of the time, than most other alternatives.
So I would stick by my comment:
You could factor the data into time series collections based on how long it takes for data to become historical, then just drop() the collection
Edit
As #Zaid pointed out, even with the _id field capped collections are not shardable.
One solution to this is using TokuMX which supports partitioning:
https://www.percona.com/blog/2014/05/29/introducing-partitioned-collections-for-mongodb-applications/
Advantages over capped collections: capped collections use a fixed amount of space (even when you don't have this much data) and they can't be resized on-the-fly. Partitioned collections usage depends on data; you can add and remove partitions (for newly inserted data) as you see fit.
Advantages over TTL: TTL is slow, it just takes care of removing old data automatically. Partitions are fast - removing data is basically just a file removal.
HOWEVER: after getting acquired by Percona, development of TokuMX appears to have stopped (would love to be corrected on this point). Unfortunately MongoDB doesn't support this functionality and with TokuMX on its way out it looks like we will be stranded without proper solution.

MongoDB - single huge collection of raw data. Split or not?

We collect and store instrumentation data from a large number of hosts.
Our storage is MongoDB - several shards with replicas. Everything is stored in a single large collection.
Each document we insert is a time based observation with some attributes (measurements). The time stamp is the most important attribute because all queries are based on time at least. Documents are never updated, so it's a pure write-in-look-up model. Right now it works reasonably well with several billions of docs.
Now,
We want to grow a bit and hold up to 12 month of data which may amount to a scary trillion+ observations (documents).
I was wandering if dumping everything into a single monstrous collection is the best choice or there is a more intelligent way to go about it.
By more intelligent I mean - use less hardware while still providing fast inserts and (importantly) fast queries.
So I thought about splitting the large collection into smaller pieces hoping to gain memory on indexes, insertion and query speed.
I looked into shards, but sharding by the time stamp sounds like a bad idea because all writes will go into one node canceling the benefits of sharding.
The insert rates are pretty high, so we need sharding to work properly here.
I also thought about creating a new collection every month and then pick up a relevant collection for a user query.
Collections older than 12 month will be either dropped or archived.
There is also an option to create entirely new database every month and do similar rotation.
Other options? Or perhaps one large collection is THE option to grow real big?
Please share your experience and considerations in similar apps.
It really depends on the use-case for your queries.
If it's something that could be aggregated, I would say do this through a scheduled map/reduce function and store the smaller data size in separate collection(s).
If everything should be in the same collection and all data should be queried at the same time to generate the desired results, then you need to go with Sharding. Then depending on the data size for your queries, you could go with an in memory map/reduce or even doing it at the application layer.
As yourself pointed out, Sharding based on time is a very bad idea. It makes all the writes going to one shard, so define your shard key. MongoDB Docs, has a very good explanation on this.
If you can elaborate more on your specific needs for the queries would be easier to suggest something.
Hope it helps.
I think collection on monthly basis will help you to get some boost up but I was wondering why can not you use the hour field of your timestamp for sharding . You can add a column which will hold the HOUR part of time stamp and when you shard against it will be shared nicely as you have repeating hour daily basis. I have not tested it but thought it will may help you
Would suggest to go ahead with single collection, as suggested by #Devesh hour based shard should be fine, Need to take care of the new ' hour Key ' while querying to get better performance.

Prioritize specific long-running operation

I have a mongo collection with a little under 2 million documents in it, and I have a query that I wish to run that will delete around 700.000 of them, based on a Date-field.
The remove query looks something like this:
db.collection.remove({'timestamp': { $lt: ISODate('XXXXX') }})
The exact date is not important in this case, the syntax is correct and I know it will work. However, I also know it's going to take forever (last time we did something similar it took a little under 2 hours).
There is another process inserting and updating records at the same time that I cannot stop. However, as long as those insertions/updates "eventually" get executed, I don't mind them being deferred.
My question is: Is there any way to set the priority of a specific query / operation so that it runs faster / before all the queries sent afterwards? In this case, I assume mongo has to do a lot of swapping data in and out of the database which is not helping performance.
I don't know whether the priority can be fine-tuned, so there might be a better answer.
A simple workaround might be what is suggested in the documentation:
Note: For large deletion operations it may be more effect [sic] to copy the documents that you want to save to a new collection and then use drop() on the original collection.
Another approach is to write a simple script that fetches e.g. 500 elements and then deletes them using $in. You can add some kind of sleep() to throttle the deletion process. This was recommended in the newsgroup.
If you will encounter this problem in the future, you might want to
Use a day-by-day collection so you can simply drop the entire collection once data becomes old enough (this makes aggregation harder), or
use a TTL-Collection where items will time out automatically and don't need to be deleted in a bunch.
If your application needs to delete data older than a certain amount of time i suggest using TTL indexes. Ex (from the mongodb site):
db.log.events.ensureIndex( { "status": 1 }, { expireAfterSeconds: 3600 } )
This works like a capped collection, except data is deleted by time. The biggest win for you is that it works in a background thread, your inserts/updates will be mostly unhurt. I use this technique on a SaaS based product in production, works like a charm.
This may not be your use-case, but i hope that helped.