Why is the oplog decreasing in MongoDB?

What type of query generates a decrease in the oplog? How do I find out which queries are affecting my oplog decrease (when comparing month to month)? Could you help me?

The oplog window is affected by write-heavy workloads, and particularly by mass deletes of large documents, since a single range delete operation is broken up in the oplog into multiple individual delete entries, one per document.
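To see what is actually filling the oplog, you can check the configured window with the replication helpers and then count entries by operation type and namespace. A minimal mongosh sketch; note the aggregation scans the whole unindexed oplog, so it can be slow on a busy member:

    // Report the configured oplog size and the time window it currently covers
    rs.printReplicationInfo()

    // Count oplog entries by operation type and namespace to see what dominates.
    // This scans the entire (unindexed) oplog, so prefer a secondary or quiet hours.
    db.getSiblingDB("local").oplog.rs.aggregate([
      { $group: { _id: { op: "$op", ns: "$ns" }, entries: { $sum: 1 } } },
      { $sort: { entries: -1 } },
      { $limit: 10 }
    ])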

Related

MongoDB Index - Performance considerations on collections with high write-to-read ratio

I have a MongoDB collection with around 100k inserts every day. Each document consumes around 1 MB of space and has a lot of elements.
The data is mainly stored for analytics purposes and read a couple of times each day. I want to speed up the queries by adding indexes to a few fields which are usually used for filtering, but stumbled across this statement in the MongoDB documentation:
Adding an index has some negative performance impact for write operations. For collections with high write-to-read ratio, indexes are expensive since each insert must also update any indexes.
source
I was wondering if this should bother me with 100k inserts vs. a couple of big read operations, and whether it would add a lot of overhead to the insert operations?
If yes, should I separate reads from writes into separate collections and duplicate the data, or are there any other solutions for this?
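One way to answer this empirically is to time a batch of inserts with and without the candidate index. A rough mongosh sketch; the collection and field names here are made up, and the payload size should be scaled toward your ~1 MB documents:

    // Time n test inserts into the given collection
    function timeInserts(coll, n) {
      const docs = [];
      for (let i = 0; i < n; i++) {
        docs.push({ deviceId: i % 100, payload: "x".repeat(1024), createdAt: new Date() });
      }
      const start = Date.now();
      coll.insertMany(docs);
      return Date.now() - start;
    }

    db.benchmark.drop();
    print("no index:   " + timeInserts(db.benchmark, 10000) + " ms");

    db.benchmark.drop();
    db.benchmark.createIndex({ deviceId: 1, createdAt: 1 });
    print("with index: " + timeInserts(db.benchmark, 10000) + " ms");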

How to insert a large amount of data (about 1 million records) per minute into MongoDB?

I want to insert about 1 million records per minute into a single-server MongoDB database. I have indexes on 6 fields. When the database was empty, I could insert data rapidly, in less than a minute, into my collection (using bulk insert and multiprocessing). However, as the size of the data in the collection increased, the insertion speed greatly decreased. Any ideas on how I can handle such data insertion?
(my data is about price changes)
Thanks
Indexes are beneficial for find operations, where they enable fast retrieval of documents stored in the database, but indexes should be created only on those fields which are used as filters for retrieving selected information. Defining too many indexes results in overhead on insert and update operations, since with every insert and update the modified records must be added to each index data structure as well.
Figure out what your bottleneck is and address it.
Is the server CPU or disk bound? Increase CPU speed or add IOPS to disk.
What proportion of time is used for index writes? Remove all indexes and measure the insertion rate at the current data size, then add one index at a time, measuring the insertion rate after each addition (see the sketch below).
Is insertion rate decreasing linearly with data set size growth? Faster or slower?
MongoDB exposes many server statistics; look through them, identify the ones relevant to throughput, and see if you spot any patterns.
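To put numbers on the index question, one rough approach is to sample the server-wide insert counters from serverStatus while your normal insert workload runs, re-adding one index at a time. This is only a sketch: the collection name and index specs below are placeholders for the 6 unnamed fields, and dropping indexes on a production system is obviously disruptive.

    // Sample inserts/sec from the server-wide opcounters while the
    // external insert workload keeps running.
    function insertRate(seconds) {
      const before = db.serverStatus().opcounters.insert;
      sleep(seconds * 1000);               // mongosh built-in sleep
      return (db.serverStatus().opcounters.insert - before) / seconds;
    }

    const coll = db.prices;                // placeholder collection name
    const candidateIndexes = [             // placeholders for the real 6 fields
      { symbol: 1 },
      { ts: -1 },
      { exchange: 1, ts: -1 }
    ];

    coll.dropIndexes();                    // removes everything except _id
    print("baseline: " + insertRate(10) + " inserts/sec");
    for (const spec of candidateIndexes) {
      coll.createIndex(spec);
      print(JSON.stringify(spec) + ": " + insertRate(10) + " inserts/sec");
    }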

What does db.oplog.rs.stats().count mean in MongoDB?

In the local database, you can check oplog-related data by using db.oplog.rs.stats().
But what does the "count" field mean? And I see it's decreasing every second on my db server.
The replica set oplog (oplog.rs) is a capped collection, which means it has a maximum total size for data. The underlying implementation varies by storage engine (e.g. WiredTiger vs MMAPv1), but the conceptual outcome is the same: capped collections make room for new documents by overwriting or expiring the oldest documents in the collection in FIFO order (First In, First Out).
But what does the "count" field mean?
As with any collection, the count information in db.collection.stats() indicates the number of documents currently in the collection.
For an explanation of collection stats output, see collStats in the MongoDB documentation.
Note: The output will vary depending on your version of MongoDB server and storage engine used.
I see it's decreasing every second in my db server.
The count of documents in the oplog will vary over time based on the size of the write operations being applied, so this is expected to fluctuate for an active deployment. For example, single field updates will generally write smaller oplog entries than full document updates. Once your oplog reaches its maximum data size, the count may also decrease as the oldest oplog documents are removed to make room for new oplog entries.
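For example, you can compare the document count against the capped size ceiling directly; a quick mongosh check run against the local database of a replica set member:

    const s = db.getSiblingDB("local").oplog.rs.stats();
    print("documents in oplog: " + s.count);
    print("data size: " + s.size + " of max " + s.maxSize + " bytes");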

Mongo TTL vs Capped collections for efficiency

I’m inserting data into a collection to store user history (about 100 items/second), and querying the last hour of data using the aggregation framework (once a minute).
In order to keep my collection optimal, I'm considering two possible options:
Make a standard collection with a TTL index on the creation date
Make a capped collection and query the last hour of data.
Which would be the more efficient solution, i.e. less demanding on the mongo boxes in terms of I/O, memory usage, CPU, etc.? (I currently have 1 primary and 1 secondary, with a few hidden nodes, in case that makes a difference.)
(I’m OK with adding a bit of a buffer to my capped collection to store 3-4 hours of data on average, and with not getting the full hour of data if users become very busy at certain times.)
Using a capped collection will be more efficient. Capped collections preserve the insertion order of records by not allowing documents to be deleted or updated in ways that increase their size, so MongoDB can always append to the current end of the collection. This makes insertion simpler and more efficient than with a standard collection.
A TTL index requires maintaining an additional index on the TTL field, which needs to be updated with every insert; that is an additional slowdown on inserts (this point is of course irrelevant if you would also add an index on the timestamp when using a capped collection). Also, the TTL is enforced by a background job which runs at regular intervals and consumes resources. The job is low-priority, and MongoDB is allowed to delay it when there are higher-priority tasks to do. That means you cannot rely on the TTL being enforced precisely, so when exact accuracy of the time interval matters, you will have to include the time interval in your query even when you have a TTL set.
The big drawback of capped collections is that it is hard to anticipate how large they really need to be. If your application scales up and you receive far more or far larger documents than anticipated, you will begin to lose data. You should generally only use capped collections for cases where losing older documents prematurely is not a big deal.
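For reference, this is roughly how the two options are set up. A sketch only: the collection name, field name, and the capped size estimate (3-4 hours at 100 docs/second of assumed ~1 KB each is about 1.5 GB) would all need adjusting to your real data:

    // Option 1: standard collection with a TTL index on the creation date.
    // The background TTL job removes expired documents only approximately
    // on time, so the query should still filter on createdAt.
    db.userHistory.createIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 });

    // Option 2: capped collection sized for ~3-4 hours of data; if documents
    // are larger than estimated, the oldest entries are silently dropped sooner.
    db.createCollection("userHistoryCapped", { capped: true, size: 1.5 * 1024 * 1024 * 1024 });

    // Either way, constrain the aggregation to the last hour explicitly:
    db.userHistory.aggregate([
      { $match: { createdAt: { $gte: new Date(Date.now() - 3600 * 1000) } } }
      // ...rest of the pipeline
    ]);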

TTL index on oplog or reducing the size of oplog?

I am using MongoDB with Elasticsearch for my application. Elasticsearch creates its indexes by monitoring the oplog collection. When both applications are running constantly, any changes to the collections in MongoDB are immediately indexed. The only problem I face is that if for some reason I have to delete and recreate the index, it takes ages (2 days) for the indexing to complete.
When I looked at the size of my oplog, its default capacity is 40 GB, and it is holding around 60 million transactions, because of which creating a fresh index takes a long time.
What would be the best way to optimize fresh index creation?
Is it to reduce the size of the oplog so that it holds fewer transactions while still not affecting my replication, or is it possible to create a TTL index (which I failed to do on several attempts) on the oplog?
I am using Elasticsearch with MongoDB via the MongoDB river: https://github.com/richardwilly98/elasticsearch-river-mongodb/.
Any help overcoming the above-mentioned issues is appreciated.
I am not an Elasticsearch pro, but your question:
What would be the best way to optimize fresh index creation?
does apply a little to all who use third-party FTS technologies with MongoDB.
The first thing to note is that if you have A LOT of records then there is no easy way around this unless you are prepared to lose some of them.
The oplog isn't really a good fit for this. Personally, I would look at using a custom script driven by timers against the main collection, or a change table giving you a single place to quickly query for new or updated records.
Unless you are filtering the oplog to get specific records, i.e. inserts, you could be pulling out ALL oplog records, including deletes, collection operations, and even database operations. You could try stripping unneeded records out of your oplog search; however, this creates a new problem: the oplog has no secondary indexes, and you cannot add any.
This means that if you start to read it in a more targeted manner, you will actually be running an unindexed query over those 60 million records, which will result in slow(er) performance.
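For illustration, filtering to just the insert entries for one namespace looks like the query below; the namespace is a placeholder, and the query is unindexed, so it still scans the whole oplog:

    // Pull only insert ("i") entries for a single collection, newest first
    db.getSiblingDB("local").oplog.rs.find(
      { op: "i", ns: "mydb.mycollection" }    // placeholder namespace
    ).sort({ $natural: -1 }).limit(100)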
The fact that you cannot add indexes to the oplog also answers another one of your questions:
is it possible to create a ttl index(which I failed to do on several attempts) on oplog.
Nope.
As for the other one of your questions:
Is it to reduce the size of oplog so that it holds less number of transactions
Yes, but you will have a smaller replication recovery window, and not only that: you will lose records from your "fresh" index, so only a part of your data is actually indexed. I am unsure, from your question, whether this is a problem or not.
You can reduce the oplog for a single secondary member that no other member is syncing from. Look up rs.syncFrom and "Change the Size of the Oplog" in the MongoDB docs.
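On recent MongoDB versions (3.6+) the resize can even be done online; older versions need the multi-step procedure from the docs mentioned above. A hedged sketch, where the hostname and size are placeholders:

    // Repoint a secondary's sync source while you work on another member
    rs.syncFrom("member2.example.net:27017")

    // Shrink (or grow) the oplog online; size is given in megabytes (MongoDB 3.6+)
    db.adminCommand({ replSetResizeOplog: 1, size: 16000 })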