I'm developing an application with Elasticsearch and MongoDB. Elasticsearch uses the MongoDB oplog to index the content via a component called a river.
Is it possible to reset the MongoDB oplog so that all previous entries disappear?
The oplog is for replication and shouldn't be tampered with.
The oplog is a capped collection:
You cannot delete documents from a capped collection. To remove all
records from a capped collection, use the ‘emptycapped’ command. To
remove the collection entirely, use the drop() method.
http://docs.mongodb.org/manual/core/capped-collections/
You might want to use a tailable cursor and tail the oplog in your river.
If your app is going to read the oplog continuously, it would need the ability to start at a particular timestamp (ts) value. Without that ability, if the app (or mongod) had to be restarted for any reason, it would have to re-process all the oplog entries that it had already processed but were still in the oplog. If the app does have the ability to start at a ts value, then just query the oplog for the max value of ts, and use that as the starting point.
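The checkpoint idea above can be sketched in a few lines. This is a minimal illustration only: it models oplog entries as plain dicts and stands in for the BSON Timestamp with a hypothetical (seconds, increment) tuple, so no driver or server is needed. The names latest_ts and entries_after are made up for this example.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical stand-in for a BSON Timestamp:
# (seconds since epoch, ordinal within that second).
Ts = Tuple[int, int]


def latest_ts(entries: Iterable[Dict]) -> Optional[Ts]:
    """Return the largest ts among the entries, or None if there are none."""
    return max((e["ts"] for e in entries), default=None)


def entries_after(entries: Iterable[Dict], checkpoint: Optional[Ts]) -> List[Dict]:
    """Return only the entries newer than the saved checkpoint."""
    if checkpoint is None:
        # No checkpoint saved yet: everything still in the oplog is unprocessed.
        return list(entries)
    return [e for e in entries if e["ts"] > checkpoint]


# Toy oplog contents (assumed shape, reduced to the fields used here).
oplog = [
    {"ts": (1700000000, 1), "op": "i"},
    {"ts": (1700000000, 2), "op": "u"},
    {"ts": (1700000005, 1), "op": "d"},
]

# Suppose the app had persisted this ts before it was restarted.
checkpoint = (1700000000, 2)
pending = entries_after(oplog, checkpoint)
```

After a restart the app would process only pending, then persist latest_ts(pending) as the new checkpoint. Against a real deployment the same comparison would be expressed as a query on the ts field.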
Related
I want to convert the entries in MongoDB's local oplog into actual queries, so that I can execute those queries and get an exact copy of the database.
Is there any package, file, built-in tool, or script for this?
It's not possible to get the exact query from the oplog entry because MongoDB doesn't save the query.
The oplog has an entry for each atomic modification performed. Multi-inserts/updates/deletes performed on the mongo instance using a single query are converted to multiple entries and written to the oplog collection. For example, if we insert 10,000 documents using Bulk.insert(), 10,000 new entries will be created in the oplog collection. Now the same can also be done by firing 10,000 Collection.insertOne() queries. The oplog entries would look identical! There is no way to tell which one actually happened.
Sorry, but that is impossible.
The reason is that the oplog doesn't contain queries. It records only the changes (insert, update, delete) made to the data, and it exists for replication and redo.
Getting an exact copy of a database is called "replication", and that is of course supported by the system.
To "replicate" changes to, for example, a single database or collection, you can use change streams: https://www.mongodb.com/docs/manual/changeStreams/.
You can reconstruct the operations from the oplog. The oplog defines several op types, for instance "i", "u", and "d" for insert, update, and delete. For these types, check the "o" and "o2" fields, which hold the corresponding data and filters.
Then, based on the op type, call the corresponding driver API, such as db.collection.insertOne(), updateOne(), or deleteOne().
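As a rough sketch of that mapping, the function below turns one oplog entry into the name of a PyMongo collection method plus its arguments. It assumes the commonly seen entry shape (op, ns, o, and o2 for updates) and does not recover the original query, only an equivalent per-document operation. The function name oplog_entry_to_call is hypothetical, and newer server versions may record updates in an internal delta format that cannot be passed to the driver as-is; that case is not handled here.

```python
from typing import Any, Dict, Optional, Tuple


def oplog_entry_to_call(entry: Dict[str, Any]) -> Optional[Tuple[str, str, tuple]]:
    """Map an oplog entry to (namespace, driver method name, arguments).

    Returns None for op types this sketch does not handle
    (for example commands or no-ops).
    """
    op = entry["op"]
    ns = entry["ns"]  # "database.collection"

    if op == "i":
        # "o" holds the inserted document.
        return (ns, "insert_one", (entry["o"],))

    if op == "u":
        # "o2" identifies the document, "o" holds the change.
        change = entry["o"]
        if any(key.startswith("$") for key in change):
            # Operator-style change such as {"$set": {...}}.
            return (ns, "update_one", (entry["o2"], change))
        # No update operators: "o" is a full replacement document.
        return (ns, "replace_one", (entry["o2"], change))

    if op == "d":
        # "o" holds the filter (normally the _id) of the removed document.
        return (ns, "delete_one", (entry["o"],))

    return None


# Toy entries with an assumed namespace and documents.
insert_entry = {"op": "i", "ns": "shop.items", "o": {"_id": 1, "qty": 5}}
update_entry = {"op": "u", "ns": "shop.items",
                "o2": {"_id": 1}, "o": {"$set": {"qty": 7}}}
delete_entry = {"op": "d", "ns": "shop.items", "o": {"_id": 1}}

calls = [oplog_entry_to_call(e) for e in (insert_entry, update_entry, delete_entry)]
```

With a real driver, each tuple could be dispatched with getattr on the matching collection object. MongoDB's own tooling for replaying oplog entries is likely a better fit for production use than hand-rolled code like this.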
A MongoDB collection has become slow to return data because it has grown huge over time.
I need to add an index on a few fields and have it take effect in searches immediately, so I'd like clarification on the following:
Is it mandatory to restart MongoDB after indexing?
If yes, then is there any way to add index without restarting the server? I don't want any downtime...
MongoDB does not need to be restarted after indexing.
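However, in MongoDB versions before 4.2 (where the background option still has an effect), the createIndex operation by default blocks reads and writes on the affected database (note that this is the whole database, not only the collection). You can change this behaviour by building the index in background mode, like this: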
db.collectionName.createIndex( { collectionKey: 1 }, { background: true } )
It might seem that your client is blocked when creating the index. The mongo shell session or connection where you are creating the index will block, but if there are more connections to the database, these will still be able to query and operate on the database.
Docs: https://docs.mongodb.com/manual/core/index-creation/
There is no need to restart MongoDB after you add an index!
However, an index can be created in the foreground, which is the default.
What does that mean? The MongoDB documentation states: ‘By default, creating an index on a populated collection blocks all other operations on a database. When building an index on a populated collection, the database that holds the collection is unavailable for read or write operations until the index build completes. Any operation that requires a read or write lock on all databases will wait for the foreground index build to complete’.
For potentially long-running index building operations on standalone deployments, the background option should be used. In that case, the MongoDB database remains available during the index building operation.
To create an index in the background, pass { background: true } as the second argument to createIndex, for example db.collectionName.createIndex( { collectionKey: 1 }, { background: true } ).
In mongo local db, you can check oplog related data by using db.oplog.rs.stats()
But what does the "count" field mean? And I see it's decreasing every second in my db server.
The replica set oplog (oplog.rs) is a capped collection, which means it has a maximum total size for data. The underlying implementation varies by storage engine (eg WiredTiger vs MMAPv1) but the conceptual outcome is the same: capped collections make room for new documents by overwriting or expiring the oldest documents in the collection in FIFO order (First In, First Out).
But what does the "count" field mean?
As with any collection, the count information in db.collection.stats() indicates the number of documents currently in the collection.
For an explanation of collection stats output, see collStats in the MongoDB documentation.
Note: The output will vary depending on your version of MongoDB server and storage engine used.
I see it's decreasing every second in my db server.
The count of documents in the oplog will vary over time based on the size of the write operations being applied, so this is expected to fluctuate for an active deployment. For example, single field updates will generally write smaller oplog entries than full document updates. Once your oplog reaches its maximum data size, the count may also decrease as the oldest oplog documents are removed to make room for new oplog entries.
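To see why the count can drop even while writes keep arriving, here is a small, simplified model of a size-capped FIFO log. It is an illustration only, not how any storage engine actually implements the oplog; the class CappedLog and the byte sizes are made up for the example.

```python
from collections import deque


class CappedLog:
    """Toy size-capped log: evicts the oldest entries once max_bytes is exceeded."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.used = 0
        self.sizes = deque()  # size in bytes of each stored entry, oldest first

    def append(self, size: int) -> None:
        self.sizes.append(size)
        self.used += size
        # Remove the oldest entries (FIFO) until the log fits again.
        while self.used > self.max_bytes and len(self.sizes) > 1:
            self.used -= self.sizes.popleft()

    @property
    def count(self) -> int:
        return len(self.sizes)


log = CappedLog(max_bytes=1000)

# Ten small entries (e.g. single-field updates) fill the log exactly.
for _ in range(10):
    log.append(100)
count_small = log.count

# Two large entries (e.g. full-document updates) each push out five small ones.
for _ in range(2):
    log.append(500)
count_large = log.count
```

The log holds the same number of bytes in both cases, but far fewer documents once the entries get larger, which is the effect described above.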
I am using MongoDB with Elasticsearch for my application. Elasticsearch creates its indexes by monitoring the oplog collection. When both applications are running constantly, any changes to the collections in MongoDB are indexed immediately. The only problem I face is that if, for some reason, I have to delete and recreate the index, it takes ages (2 days) for the indexing to complete.
When I looked at the size of my oplog, its default capacity is 40 GB and it is holding around 60 million transactions, which is why creating a fresh index takes so long.
What would be the best way to optimize fresh index creation?
Is it to reduce the size of the oplog so that it holds fewer transactions without affecting my replication, or is it possible to create a TTL index on the oplog (which I have tried and failed to do several times)?
I am using elasticsearch with mongodb using mongodb river https://github.com/richardwilly98/elasticsearch-river-mongodb/.
Any help to overcome the above mentioned issues is appreciated.
I am not an Elasticsearch pro, but your question:
What would be the best way to optimize fresh index creation?
applies to some extent to everyone who uses third-party full-text search technology with MongoDB.
The first thing to note is that if you have A LOT of records then there is no easy way around this unless you are prepared to lose some of them.
The oplog isn't really a good fit for this. Personally, I would look at either a custom script that runs on a timer against the main collection, or a change table that gives you a single place to quickly query for new or updated records.
Unless you filter the oplog for specific records, e.g. inserts, you could be pulling out ALL oplog records, including deletes, collection operations and even database operations. You could try stripping the unneeded records out of your oplog search; however, this creates a new problem: the oplog has no indexes and no index updating.
This means that if you start reading it in a more selective manner, you will be running an unindexed query over these 60 million records, which will result in slow(er) performance.
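For reference, a filter that strips out the unneeded records might be built like this. It is only a sketch: build_oplog_filter is a hypothetical helper, the namespace is made up, and the resulting dict uses standard MongoDB query operators ($in, $gt) so it could be handed to a find on the oplog collection. As noted above, such a query would still have to scan the oplog.

```python
from typing import Any, Dict, Optional, Sequence


def build_oplog_filter(
    namespace: str,
    ops: Sequence[str] = ("i", "u", "d"),
    after_ts: Optional[Any] = None,
) -> Dict[str, Any]:
    """Build a query document selecting data changes for one collection.

    namespace -- "database.collection", as stored in the oplog "ns" field
    ops       -- op types to keep (inserts, updates, deletes by default)
    after_ts  -- optional checkpoint; only entries with a larger ts match
    """
    query: Dict[str, Any] = {"ns": namespace, "op": {"$in": list(ops)}}
    if after_ts is not None:
        query["ts"] = {"$gt": after_ts}
    return query


# Inserts only, for an assumed namespace.
inserts_only = build_oplog_filter("shop.items", ops=("i",))

# All data changes after some previously saved checkpoint value
# (a plain integer stands in for the real timestamp type here).
since_checkpoint = build_oplog_filter("shop.items", after_ts=12345)
```

Leaving out the command and no-op entries reduces how much the river has to process, even though it does not make the scan itself any cheaper.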
The oplog having no index updating answers another one of your questions:
is it possible to create a ttl index(which I failed to do on several attempts) on oplog.
Nope.
As for the other one of your questions:
Is it to reduce the size of oplog so that it holds less number of transactions
Yes, but you will have a smaller recovery window of replication and not only that but you will lose records from your "fresh" index so only a part of your data is actually indexed. I am unsure, from your question, if this is a problem or not.
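A quick back-of-the-envelope calculation shows how the recovery window scales with oplog size. The 40 GB and 60 million figures come from the question; the write rate of 100 entries per second is purely an assumed number for illustration, and "GB" is taken here to mean GiB.

```python
GIB = 1024 ** 3


def average_entry_bytes(oplog_bytes: float, entry_count: int) -> float:
    """Average size of one oplog entry."""
    return oplog_bytes / entry_count


def window_hours(oplog_bytes: float, avg_entry_bytes: float,
                 entries_per_second: float) -> float:
    """Roughly how many hours of writes fit in the oplog before it wraps."""
    capacity_entries = oplog_bytes / avg_entry_bytes
    return capacity_entries / entries_per_second / 3600


# Figures from the question: 40 GB holding about 60 million entries.
avg = average_entry_bytes(40 * GIB, 60_000_000)

# Assumed write rate (hypothetical): 100 oplog entries per second.
full_window = window_hours(40 * GIB, avg, entries_per_second=100)
half_window = window_hours(20 * GIB, avg, entries_per_second=100)

print(f"average entry: {avg:.0f} bytes")
print(f"40 GB oplog window: {full_window:.1f} h, 20 GB: {half_window:.1f} h")
```

Under these assumptions, halving the oplog halves both the number of entries the river has to replay and the time a secondary can be offline before it needs a full resync.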
You can reduce the oplog size on a single secondary member that no other member is syncing from. Look up rs.syncFrom and "Change the Size of the Oplog" in the MongoDB docs.
I have a capped local MongoDB collection A, and I'd like to replicate it to a MongoDB collection B on a cloud server. I want B to keep the documents that will be deleted from A because of its capped size.
Collection A      replicate     Collection B
--------------   ---------->   --------------
Capped at 50MB                 Infinite size!
local                          cloud server
Is this possible in MongoDB (as it is possible in CouchDB using filters)?
Or should I search for a totally different approach?
Thanks for your advice!
The deletes in a capped collection are not operations, and so they are not replicated via the oplog. Hence, all you need to do is make the collection non-capped on a secondary and it will simply continue to grow as you add data to the capped collection and those ops are replicated. Try something like this:
Add the secondary on the cloud server as normal
Stop that new secondary, restart it outside the set with no replset argument
Drop the capped collection, recreate with same name as a regular non-capped collection
(Optional) Re-import the data manually using mongodump/mongorestore
Restart the secondary with the original replica set parameters
The new normal collection will just keep growing
If you want to delete the collection or make other changes you will need to take the secondary out each time, but otherwise this should behave as you want. I haven't done this myself explicitly, but I have seen this happen accidentally and had to do the reverse :)