MongoDB filtered replication collection - mongodb

I have capped local MongoDB collection A, and I'd like to replicate to a MongoDB collection B on a cloud server. Now, I want to keep the documents from A that will be deleted due to its capped size.
Collection A replicate Collection B
-------------- ----------> --------------
Capped at 50MB Infinite size!
local cloud server
Is this possible in MongoDB (as it is possible in CouchDB using filters)?
Or should I search for a totally different approach?
Thanks for your advice!

The deletes in a capped collection are not operations, and so they are not replicated via the oplog. Hence, all you need to do is make the collection non-capped on a secondary and it will simply continue to grow as you add data to the capped collection and those ops are replicated. Try something like this:
Add the secondary on the cloud server as normal
Stop that new secondary, restart it outside the set with no replset argument
Drop the capped collection, recreate with same name as a regular non-capped collection
(Optional) Re-import the data manually using mongodump/mongorestore
Restart the secondary with the original replica set parameters
The new normal collection will just keep growing
If you want to delete the collection or make other changes you will need to take the secondary out each time, but otherwise this should behave as you want. I haven't done this myself explicitly, but I have seen this happen accidentally and had to do the reverse :)

Related

Mongo dynamic collection creation and locking

I am working on an app where I am looking into creating MongoDB collections on the fly as they are needed.
The app consumes data from a data source and maps the data to a collection. If the collection does not exist, the app:
creates the collection
kicks off appropriate indexes in the background
shards the collection
and then inserts the data into the collection.
While this is going on, other processes will be reading and writing from the database.
Looking at this MongodDB locking FAQ, it appears that the reads and writes in other collections of the database should not be affected by the dynamic collection creation snd setup i.e. they won't end up waiting on a lock we acquired to create the collection.
Question: Is the above assumption correct?
Thank you in advance!
No, when you insert into a collection which does not exist, then a new collection is created automatically. Of course, this new collection does not have any index (apart from _id) and is not sharded.
So, you must ensure the collection is created before the application starts any inserts.
However, it is no problem to create indexes and enable sharding on a collection which contains already some data.
Note, when you enable sharding on an empty collection or a collection with low amount of data, then all data is written initially to the primary shard. You may use sh.splitAt() to pre-split the upcoming data.

Mongo Collection with TTL index not reclaiming disk space

I have a mongo collection with TTL index. I see the documents are getting evicted as expected but i dont see that the disk space is getting reclaimed. Did anyone see this kind of issue?
Let me know if you need more details.
We have discussed this with mongo team and based on all information there are couple of things and its not that easy
If you have already have more space then TTL will delete documents and ideally space should be reclaimed but it will not.
If new documents will come then it will be reused
If size is going to remain constant then you need to run compact command. In case of sharded cluster it will be on each shard.
Other options are create new collection and move your data to newer collection. and once done drop the previous collection
Take backup of this collection and drop collection and then restore it.
After all this things, there is possibility that mongo holds all in memory and you need to restart cluster, once restarted it will release the storage.

use replica to store capped collection in mongodb

Is it possible to create a capped collection in a read only mode at a single replica in MongoDB? There is a main database with replica set and I need to use a capped collection with a AWS queue to listen for new insertions. In order to avoid possible listening overloads in the main database, I was wonder whether is possible to create a capped in one of the replicas.
This is not possible as at now. Also keep in mind that replication's main purpose it to have similar data through out the replica set.

TTL index on oplog or reducing the size of oplog?

I am using mongodb with elasticsearch for my application. Elasticsearch creates indexes by monitioring oplog collection. When both the applications are running constantly then any changes to the collections in mongodb are immediately indexed. The only problem I face is if for some reason I had to delete and recreate the index then it takes ages(2days) for the indexing to complete.
When I was looking at the size of my oplog by default it's capacity is 40gb and its holding around 60million transactions because of which creating a fresh index is taking a long time.
What would be the best way to optimize fresh index creation?
Is it to reduce the size of oplog so that it holds less number of transactions and still not affect my replication or is it possible to create a ttl index(which I failed to do on several attempts) on oplog.
I am using elasticsearch with mongodb using mongodb river https://github.com/richardwilly98/elasticsearch-river-mongodb/.
Any help to overcome the above mentioned issues is appreciated.
I am not a Elastic Search Pro but your question:
What would be the best way to optimize fresh index creation?
Does apply a little to all who use third party FTS techs with MongoDB.
The first thing to note is that if you have A LOT of records then there is no easy way around this unless you are prepared to lose some of them.
The oplog isn't really a good idea for this, you should probably seek out using a custom script using timers in the main collection to do this personally, or a change table giving you a single place to quickly query for new or updated records.
Unless you are filtering the oplog to get specific records, i.e. inserts, then you could be pulling out ALL oplog records including deletes, collection operations and even database operations. So you could try stripping out unneeded records from your oplog search, however, this then creates a new problem; the oplog has no indexes or index updating.
This means that if you start to read in a manner more appropiate you will actually use an unindexed query over these 60 million records. This will result in slow(er) performance.
The oplog having no index updating answers another one of your questions:
is it possible to create a ttl index(which I failed to do on several attempts) on oplog.
Nope.
As for the other one of your questions:
Is it to reduce the size of oplog so that it holds less number of transactions
Yes, but you will have a smaller recovery window of replication and not only that but you will lose records from your "fresh" index so only a part of your data is actually indexed. I am unsure, from your question, if this is a problem or not.
You can reduce the oplog for a single secondary member that no replica is synching from. Look up rs.syncFrom and "Change the Size of the Oplog" in the mongodb docs.

MongoDB very slow deletes

I've got a small replica set of three mongod servers (16GB RAM each, at least 4 CPU cores and real HDDs) and one dedicated arbiter. The replicated data has about 100,000,000 records currently. Nearly all of this data is in one collection with an index on _id (the auto-generated Mongo ID) and date, which is a native Mongo date field. Periodically I delete old records from this collection using the date index, something like this (from the mongo shell):
db.repo.remove({"date" : {"$lt" : new Date(1362096000000)}})
This does work, but it runs very, very slowly. One of my nodes has slower I/O than the other two, having just a single SATA drive. When this node is primary, the deletes run at about 5-10 documents/sec. By using rs.stepDown() I have demoted this slower primary and forced an election to get a primary with better I/O. On that server, I am getting about 100 docs/sec.
My main question is, should I be concerned? I don't have the numbers from before I introduced replication, but I know the delete was much faster. I'm wondering if the replica set sync is causing I/O wait, or if there is some other cause. I would be totally happy with temporarily disabling sync and index updates until the delete statement finishes, but I don't know of any way to do that currently. For some reason, when I disable two of the three nodes, leaving just one node and the arbiter, the remaining node is demoted and writes are impossible (isn't the arbiter supposed to solve that?).
To give you some indication of the general performance, if I drop and recreate the date index, it takes about 15 minutes to scan all 100M docs.
This is happening because even though
db.repo.remove({"date" : {"$lt" : new Date(1362096000000)}})
looks like a single command it's actually operating on many documents - as many as satisfy this query.
When you use replication, every change operation has to be written to a special collection in the local database called oplog.rs - oplog for short.
The oplog has to have an entry for each deleted document and every one of those entries needs to be applied to the oplog on each secondary before it can also delete the same record.
One thing I can suggest that you consider is TTL indexes - they will "automatically" delete documents based on expiration date/value you set - this way you won't have one massive delete and instead will be able to spread the load more over time.
Another suggestion that may not fit you, but it was optimal solution for me:
drop indeces from collection
iterate over all entries of collection and store id's of records to delete into memory array
each time array is big enough (for me it was 10K records), i removed these records by ids
rebuild indeces
It is the fastest way, but it requires stopping the system, which was suitable for me.