I have a large collection in Mongo. Around 1.7 billion records that take up around 5TB of storage space. I no longer need to keep this data indefinitely so I'm looking at options for getting rid of most of the data, preferably based on "createdAt".
I'm wondering what to expect if I add a ttl index to only keep records around for a month at the most. I have the following index currently:
{
    "v" : 1,
    "key" : {
        "createdAt" : 1
    },
    "name" : "createdAt_1",
    "ns" : "someNS.SomeCollection",
    "background" : true
}
How quickly would mongo be able to delete all that data? From what I've read, the ttl process runs every 60 seconds. How much data does it delete each time around?
Adding a TTL index to a large collection like that can really impact performance. If you need to continue querying this collection while creating the TTL index, consider initially creating it with an expireAfterSeconds value so large that no documents would actually be expired yet. Once a TTL index exists, you can later adjust how long documents are kept.
Once you've created that index, you can either manually run queries to delete the old data until you're close to up to date and able to set the final TTL, or lower the expireAfterSeconds value in steps so that you're able to control the performance impact.
(Source: advice from mlab on adding a TTL to a 1TB collection. If you don't need to maintain access to data while removing old documents, completely ignore this advice)
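As a sketch of how that could look in the shell (the values are illustrative, and note that the existing plain createdAt index would likely have to be dropped and rebuilt as a TTL index, since an index's options can't simply be edited in place):
// Rebuild createdAt as a TTL index whose window is so large that nothing expires yet
db.SomeCollection.dropIndex("createdAt_1")
db.SomeCollection.createIndex({ createdAt: 1 }, { expireAfterSeconds: 10 * 365 * 24 * 3600, background: true })

// Later, shrink the window step by step (here: down to 30 days) using collMod
db.runCommand({
    collMod: "SomeCollection",
    index: { keyPattern: { createdAt: 1 }, expireAfterSeconds: 30 * 24 * 3600 }
})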
Timing of the Delete Operation
When you build a TTL index in the background, the TTL thread can begin deleting documents while the index is building. If you build a TTL index in the foreground, MongoDB begins removing expired documents as soon as the index finishes building.
The TTL index does not guarantee that expired data will be deleted immediately upon expiration. There may be a delay between the time a document expires and the time that MongoDB removes the document from the database.
The background task that removes expired documents runs every 60 seconds. As a result, documents may remain in a collection during the period between the expiration of the document and the running of the background task.
Because the duration of the removal operation depends on the workload of your mongod instance, expired data may exist for some time beyond the 60-second period between runs of the background task.
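If you do go the TTL route, one way to watch the monitor's progress (assuming a reasonably recent mongod) is the TTL section of serverStatus:
// passes = runs of the background task so far,
// deletedDocuments = total documents removed by the TTL monitor
db.serverStatus().metrics.ttl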
Related
I’m inserting data into a collection to store user history (about 100 items / second), and querying the last hour of data using the aggregation framework (once a minute)
In order to keep my collection optimal, I'm considering two possible options:
Make a standard collection with a TTL index on the creation date
Make a capped collection and query the last hour of data.
Which would be the more efficient solution? i.e. less demanding on the mongo boxes - in terms of I/O, memory usage, CPU etc. (I currently have 1 primary and 1 secondary, with a few hidden nodes. In case that makes a difference)
(I’m ok with adding a bit of a buffer on my capped collection to store 3-4 hours of data on average, and if users become very busy at certain times not getting the full hour of data)
Using a capped collection will be more efficient. Capped collections preserve insertion order by not allowing documents to be deleted individually or updated in ways that increase their size, so MongoDB can always append to the current end of the collection. This makes insertion simpler and more efficient than with a standard collection.
A TTL index requires MongoDB to maintain an additional index on the TTL field, which must be updated on every insert; that is an extra slowdown on inserts (this point is of course irrelevant if you would also add an index on the timestamp when using a capped collection). Also, the TTL is enforced by a background job which runs at regular intervals and consumes resources. The job is low priority and MongoDB is allowed to delay it when there are higher-priority tasks to do, which means you cannot rely on the TTL being enforced precisely. So when the exact accuracy of the time interval matters, you will have to include the time interval in your query even when you have a TTL set.
The big drawback of capped collections is that it is hard to anticipate how large they really need to be. If your application scales up and you receive many more documents, or much larger ones, than anticipated, you will begin to lose data. You should generally only use capped collections for cases where losing older documents prematurely is not a big deal.
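To make the comparison concrete, here is a rough sketch of the capped-collection route (the collection name, size, and aggregation stages are placeholders, not taken from the question):
// Size the capped collection generously enough to hold several hours of history
db.createCollection("userHistory", { capped: true, size: 1024 * 1024 * 1024 })

// Still filter on the timestamp: the cap only bounds storage, not the time window
db.userHistory.aggregate([
    { $match: { created: { $gte: new Date(Date.now() - 60 * 60 * 1000) } } },
    { $group: { _id: "$userId", events: { $sum: 1 } } }
])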
I heard that Mongo can do it, but I can't find out how.
Can Mongo create collections which will be auto-removed in the future, at a time which I can set up? Or can't Mongo do this magic?
MongoDB cannot auto-remove collections, but it can auto-remove BSON records (documents). You just need to set a TTL (time to live) index on a date field that exists in the record.
You can read more here: MongoDB: Expire Data from Collections by Setting TTL
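For example, a minimal sketch (the collection and field names are placeholders); documents would be removed roughly 30 days after their createdAt value:
db.events.createIndex({ createdAt: 1 }, { expireAfterSeconds: 30 * 24 * 3600 })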
Collections are auto created on the first write operation (insert, upsert, index creation). So this magic is covered.
If your removal is based on time, you could use cron or at to run this little script
mongo yourDBserver/yourDB --eval 'db.yourCollection.drop()'
As Sammaye pointed out, creating indices is a costly operation. I would assume there is something wrong with your data model. For semantically distinguishing documents, I'd rather create a field on them which does that, set either an expiration date or a creation date plus a time frame in which the documents are valid, and use TTL indices to remove those documents.
To use an expiration date, you set a field to an ISODate and create a TTL index with expireAfterSeconds set to 0:
db.yourColl.ensureIndex({"yourExpirationDateField":1},{expireAfterSeconds:0})
In the case you want the documents to be valid for let's say a week after they are created, you would use the following:
db.yourColl.ensureIndex({"yourCreationDate":1},{expireAfterSeconds:604800})
Either way, here is what happens: Once every minute a background thread called TTLMonitor wakes up, gets all TTL indices for the server and starts processing them. It scans the TTL index, looking for the date values, adds the value given for "expireAfterSeconds" and deletes all documents which it determined to be invalid by now. This process takes some time, so don't expect the documents to be deleted on the very second they expire.
The big advantage of that approach: you don't need any triggering logic to be maintained, the deletes are done automagically in the background and you don't put any load on your application. Plus, using an expiration date, you have very granular control over when a document expires.
The drawback? Well, if I had to find one, it would be that you have to insert a creation date for every document, or calculate and insert an expiration date. And you have to send an administrative command to the mongod/mongos once in the application's lifetime...
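To round off the expireAfterSeconds:0 variant, here is a hedged sketch of such an insert (the payload field is made up); with the index from above in place, the document would be removed roughly a week after it is written:
db.yourColl.insert({
    payload: "whatever your document contains",
    yourExpirationDateField: new Date(Date.now() + 7 * 24 * 3600 * 1000)
})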
I have a system where 10s of client machines are sending objects to a single server. The job of the server is to aggregate all the objects (removing duplicates - and there are many) and produce a file every hour of the objects received the previous hour.
I tried MongoDB for this task and it did a good job, but there is the overhead of going over all the records at the end of each hour to produce the file. I am now thinking about gradually building the file as data is received, stopping at the end of the hour, starting a new file, and so on.
I don't need to do any searching or querying of the data, just dropping duplicates based on a key and producing a file of all the data. Also the first time I receive a record, the duplicates come within a maximum of 3 minutes afterwards.
Which system should I use? Do you recommend a different approach?
I would recommend using indexes, even though you state in your comments that you don't like the idea. You can create a unique index on these fields and use it as your method of insertion.
This does, as you rightly point out, produce a full scan; however, whichever race-condition-free route you take (really the only way to ensure there are no duplicates), you will need a full index scan, either by query or by index insertion.
Index insertion is probably the best route here; at the end of the day, the performance difference doesn't really matter.
As for dealing with removing your old records, I would not use a TTL index. Instead it would be much better to just drop your collection when you're ready to receive a new batch. Not only will this be a lot faster, it will also return the collection's space to the $freelists instead of adding the documents removed by the TTL index to a deleted-bucket list, which can cause fragmentation and slow down your system.
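A minimal sketch of that combination (the collection and key names are assumptions):
// De-duplication: a unique index makes duplicate inserts fail on their own
db.hourlyBatch.createIndex({ dedupKey: 1 }, { unique: true })

// Once the hourly file has been produced, throw the whole collection away
db.hourlyBatch.drop()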
Consider this document:
{
    "name" : "a",
    "type" : "b",
    "hourtag": 10,
    "created": ISODate("2014-03-13T06:26:01.238Z")
}
Let's say we set up a unique index on this over name and type plus an hourtag property, whose value you add to the document to represent the hour of the day it was inserted. Also add a created date if there isn't one already, and set another index on that:
db.collection.ensureIndex({ hourtag: 1, name: 1, type: 1 }, { unique: true })
db.collection.ensureIndex({ created: 1 }, { expireAfterSeconds: 7200 })
The second index is defined as a TTL index, with the expireAfterSeconds value set to 2 hours.
So you insert your documents as you go, adding the property for the "current hour" that you are in, and the duplicate items will fail to insert.
At the end of the hour, get all the documents for the "last" hour value and process them.
Using the "TTL" index, the documents you no longer need get cleaned up after their expiry time.
That's the most simple implementation I can think of. Tweak the expiry time to your own needs.
Defining the hourtag first in the index order gives you a simple search, while maintaining your "duplicate" rules.
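As a sketch of the end-of-hour step (the midnight wrap-around handling is my assumption, not part of the answer above):
// Fetch the previous hour's documents; hourtag leads the index, so this is an indexed lookup
var lastHour = (new Date().getHours() + 23) % 24;
db.collection.find({ hourtag: lastHour }).forEach(function (doc) {
    // append doc to the hourly output file here
});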
I am using mongodb with elasticsearch for my application. Elasticsearch creates indexes by monitoring the oplog collection. When both applications are running constantly, any changes to the collections in mongodb are immediately indexed. The only problem I face is that if for some reason I have to delete and recreate the index, it takes ages (2 days) for the indexing to complete.
When I was looking at the size of my oplog, I saw that by default its capacity is 40 GB and it's holding around 60 million transactions, because of which creating a fresh index takes a long time.
What would be the best way to optimize fresh index creation?
Is it to reduce the size of the oplog so that it holds fewer transactions while still not affecting my replication, or is it possible to create a TTL index (which I failed to do on several attempts) on the oplog?
I am using elasticsearch with mongodb using mongodb river https://github.com/richardwilly98/elasticsearch-river-mongodb/.
Any help to overcome the above mentioned issues is appreciated.
I am not an Elasticsearch pro, but your question:
What would be the best way to optimize fresh index creation?
applies a little to everyone who uses third-party FTS technologies with MongoDB.
The first thing to note is that if you have A LOT of records then there is no easy way around this unless you are prepared to lose some of them.
The oplog isn't really a good fit for this; personally, I would look at a custom script driven by timestamps in the main collection, or a change table giving you a single place to quickly query for new or updated records.
Unless you are filtering the oplog to get specific records (i.e. inserts), you could be pulling out ALL oplog records, including deletes, collection operations and even database operations. So you could try stripping out unneeded records from your oplog search; however, this creates a new problem: the oplog has no indexes and no index updating.
This means that if you start reading it in a more selective manner, you will actually be running an unindexed query over these 60 million records, which will result in slow(er) performance.
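For example, a filtered read of the oplog might look like the sketch below (the namespace is a placeholder); note that it is still an unindexed scan of the whole oplog:
// Only insert ("i") entries for one namespace, instead of every oplog record
db.getSiblingDB("local").oplog.rs.find({ op: "i", ns: "yourDB.yourCollection" })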
The oplog having no index updating answers another one of your questions:
is it possible to create a ttl index(which I failed to do on several attempts) on oplog.
Nope.
As for the other one of your questions:
Is it to reduce the size of oplog so that it holds less number of transactions
Yes, but you will have a smaller replication recovery window, and not only that, you will also lose records from your "fresh" index, so only part of your data is actually indexed. I am unsure, from your question, whether this is a problem or not.
You can reduce the oplog for a single secondary member that no replica is synching from. Look up rs.syncFrom and "Change the Size of the Oplog" in the mongodb docs.
I've got a small replica set of three mongod servers (16GB RAM each, at least 4 CPU cores and real HDDs) and one dedicated arbiter. The replicated data has about 100,000,000 records currently. Nearly all of this data is in one collection with an index on _id (the auto-generated Mongo ID) and date, which is a native Mongo date field. Periodically I delete old records from this collection using the date index, something like this (from the mongo shell):
db.repo.remove({"date" : {"$lt" : new Date(1362096000000)}})
This does work, but it runs very, very slowly. One of my nodes has slower I/O than the other two, having just a single SATA drive. When this node is primary, the deletes run at about 5-10 documents/sec. By using rs.stepDown() I have demoted this slower primary and forced an election to get a primary with better I/O. On that server, I am getting about 100 docs/sec.
My main question is, should I be concerned? I don't have the numbers from before I introduced replication, but I know the delete was much faster. I'm wondering if the replica set sync is causing I/O wait, or if there is some other cause. I would be totally happy with temporarily disabling sync and index updates until the delete statement finishes, but I don't know of any way to do that currently. For some reason, when I disable two of the three nodes, leaving just one node and the arbiter, the remaining node is demoted and writes are impossible (isn't the arbiter supposed to solve that?).
To give you some indication of the general performance, if I drop and recreate the date index, it takes about 15 minutes to scan all 100M docs.
This is happening because even though
db.repo.remove({"date" : {"$lt" : new Date(1362096000000)}})
looks like a single command, it's actually operating on many documents - as many as satisfy the query.
When you use replication, every change operation has to be written to a special collection in the local database called oplog.rs - oplog for short.
The oplog has to have an entry for each deleted document and every one of those entries needs to be applied to the oplog on each secondary before it can also delete the same record.
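You can see this for yourself: each removed document ends up as its own delete ("d") entry in the oplog (the database name here is a placeholder):
db.getSiblingDB("local").oplog.rs.count({ op: "d", ns: "yourDB.repo" })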
One thing I can suggest you consider is TTL indexes - they will "automatically" delete documents based on an expiration date/value you set - this way you won't have one massive delete, and will instead be able to spread the load out over time.
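A minimal sketch of that, reusing the existing date field (the 30-day window is an assumption, and the existing plain index on date would have to be dropped first, since the key pattern would otherwise conflict):
// Expired documents are then removed in small batches by the background TTL monitor
db.repo.createIndex({ date: 1 }, { expireAfterSeconds: 30 * 24 * 3600 })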
Another suggestion that may not fit you, but which was the optimal solution for me:
drop the indexes from the collection
iterate over all entries of the collection and store the _ids of the records to delete in an in-memory array
each time the array is big enough (for me it was 10K records), remove those records by _id
rebuild the indexes
It is the fastest way, but it requires stopping the system, which was suitable for me.
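A rough sketch of that batched approach in the shell (the names and batch size follow the description above; the index drop/rebuild steps are omitted):
var batch = [];
db.repo.find({ date: { $lt: new Date(1362096000000) } }, { _id: 1 }).forEach(function (doc) {
    batch.push(doc._id);
    if (batch.length >= 10000) {
        db.repo.remove({ _id: { $in: batch } });
        batch = [];
    }
});
if (batch.length > 0) {
    db.repo.remove({ _id: { $in: batch } });
}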