Best way to drop big collection in MongoDB - mongodb

I have a standalone MongoDB instance, version 3.2. The storage engine is WiredTiger. What is the best way to drop a big collection (>500 GB) so as to minimize the time the exclusive database lock is held? Will there be a time difference between these two solutions?
Remove all documents from the collection, drop the index, then drop the collection
Just drop the collection
Additional information that could be important:
The collection contains about 200,000,000 documents
The collection has only one index, on _id
'_id' looks like {_id : {day: "2018-01-01", key :"someuniquekeybyday"}}

The correct answer is probably that the drop operation is not linear: it takes a few seconds on a 10 GB collection and essentially the same time on a 500 GB collection.
I've dropped a 1 TB collection many times; it took several seconds.
P.S. To offer you something new, not seen in the comments: you have a third option, which is to copy the other collections in this database into a new database and then switch databases in your application.
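A rough sketch of that third option in the mongo shell; the database and collection names (mydb, mydb_new, bigCollection) are made up for illustration, not taken from the question:
var src = db.getSiblingDB("mydb");          // current database (assumed name)
var dst = db.getSiblingDB("mydb_new");      // new database the application will switch to
src.getCollectionNames().forEach(function (collName) {
    if (collName === "bigCollection") return;           // skip the collection being discarded
    src.getCollection(collName).find().forEach(function (doc) {
        dst.getCollection(collName).insert(doc);        // slow but simple document-by-document copy
    });
});
// Once the application points at mydb_new, the old database can be dropped at leisure.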

I have dropped a collection of over 1.4 TB on version 4.0. The operation took less than 1 second.
2022-03-15T01:17:25.688+0000 I REPL [replication-2] Completing collection drop for order.system.drop.1647307045i163t6.feeds with drop optime { ts: Timestamp(1647307045, 163), t: 6 } (notification optime: { ts: Timestamp(1647307045, 163), t: 6 })
As per the documentation, the drop operation obtains a lock on the affected database and blocks all other operations, so the application will see minor latency in database operations for a short period. Before dropping a large collection, make sure to:
remove all reads/writes from the application on that collection
rename it to a temporary collection to avoid any operations during the drop (see the sketch after this list)
choose the lowest-traffic time to drop the temporary collection
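A minimal sketch of the rename-then-drop step in the mongo shell (collection names are made up for illustration); renameCollection is a fast metadata operation, so the application stops touching the big collection immediately:
db.feeds.renameCollection("feeds_to_drop")
// ...later, during the lowest-traffic window:
db.getCollection("feeds_to_drop").drop()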

Slow Upserts with PyMongoDB

I'm trying to insert ~800 million records into MongoDB using PyMongo on a MacBook Air (1.7 GHz i7) with no multi-threading. The documents are structured as below.
The records I'm reading are tuples of the form:
(user_id,imp_date,imp_creative,imp_pid,geo_id)
I'm creating my own _id field based on the user_id in the file I'm reading from.
{_id:user_id,
'imp_date':[array of dates],
'imp_creative':[array of numeric ids],
'imp_pid':[array of numeric ids],
'geo_id':numeric id}
I'm using an upsert with $push to append the date, creative id, and pid to the corresponding arrays:
self.collection.update({'_id': uid},
                       {"$push": {'imp_date': <datevalue>,
                                  'imp_creative': <creative_id>,
                                  'imp_pid': <pid>}},
                       safe=True, upsert=True)
I'm using an upsert with $set to overwrite the geographic location (I only care about the most recent one):
self.collection.update({'_id': uid},
                       {"$set": {'geo_id': <geo id>}},
                       safe=True, upsert=True)
I'm only writing about 1,500 records per second (8,000 if I set safe=False). My question is: what can I do to speed this up further (ideally 20k/second or faster)?
Ideas I can't find a definitive recommendation on:
-Using multiple threads to insert data
-Sharding
-Padding arrays (my arrays grow very slowly, each document array will have an average length of ~4 at the end of the file)
-Turning journaling off
Apologies if I've left out any required information, this is my first post.
1- You could add an index to speed things up; an index would help you find the documents faster, although inserts would become slower (the index has to be updated as well). Whether the improvement in the retrieval phase compensates for the extra time spent updating the index depends on how many records you have in the collection, how many indexes you have, and how complicated those indexes are.
However, in your case you are only querying by _id, so there's not much more you can do with indexes.
2- Are you using two consecutive updates, i.e. one for the $set and one for the $push? If so, then you should definitely use just one:
self.collection.update({'_id': uid},
                       {"$push": {'imp_date': <datevalue>,
                                  'imp_creative': <creative_id>,
                                  'imp_pid': <pid>},
                        "$set": {'geo_id': <geo id>}},
                       safe=True, upsert=True)
3- The update operation is an atomic operation which might lock other queries. If the document you are about to update is not already in RAM but on disk, mongo will have to fetch it from disk first and then update it. If you do a find operation first (which doesn't block, as it's a read-only operation), the document will be in RAM, so the subsequent update (the locking operation) will be faster:
self.collection.find_one({'_id': uid})
self.collection.update({'_id': uid},
                       {"$push": {'imp_date': <datevalue>,
                                  'imp_creative': <creative_id>,
                                  'imp_pid': <pid>},
                        "$set": {'geo_id': <geo id>}},
                       safe=True, upsert=True)
4- If your documents don't grow too much, as you have said, it won't be necessary to worry about the padding factor and reallocation issues. Furthermore, in some recent versions (I can't remember if it was 2.2 or 2.4) collections are created with the powerOfTwo allocation option enabled by default.
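For older MMAPv1 deployments where that default was not yet in effect, the allocation strategy could also be toggled per collection with collMod; a hedged example in the mongo shell (the collection name is illustrative, and the flag is ignored by modern storage engines):
db.runCommand({ collMod: "mycoll", usePowerOf2Sizes: true })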

Partial doc updates to a large mongo collection - how to not lock up the database?

I've got a mongo db instance with a collection in it which has around 17 million records.
I wish to alter the document structure (to add a new attribute to the document) of all 17 million documents, so that I don't have to programmatically deal with different structures, and to make queries easier to write.
I've been told though that if I run an update script to do that, it will lock the whole database, potentially taking down our website.
What is the easiest way to alter the document without this happening? (I don't mind if the update happens slowly, as long as it eventually happens)
The query I'm attempting to do is:
db.history.update(
    { type: { $exists: false } },
    { $set: { type: 'PROGRAM' } },
    { multi: true }
)
You can update the collection in batches (say, half a million documents per batch); this will distribute the load.
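For example, a rough sketch of batching in the mongo shell, using the query from the question; the batch size and sleep are illustrative, not prescriptive:
var batchSize = 10000;                 // tune for your workload; the idea is many small multi-updates
var ids;
do {
    // collect the _ids of the next batch of documents that still lack the field
    ids = db.history.find({ type: { $exists: false } }, { _id: 1 })
                    .limit(batchSize)
                    .map(function (doc) { return doc._id; });
    if (ids.length > 0) {
        db.history.update({ _id: { $in: ids } },
                          { $set: { type: 'PROGRAM' } },
                          { multi: true });
        sleep(1000);                   // give other operations breathing room between batches
    }
    print("updated " + ids.length + " documents in this batch");
} while (ids.length === batchSize);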
I created a collection with 20,000,000 records and ran your query on it. It took ~3 minutes to update on a virtual machine, and I could still read from the db in a separate console.
> for(var i=0;i<20000000;i++){db.testcoll.insert({"somefield":i});}
The locking in mongo is quite lightweight, and it is not going to be held for the whole duration of the update. Think of it like 20000000 separate updates. You can read more here:
http://docs.mongodb.org/manual/faq/concurrency/
You do actually care if your update query is slow, because of the write-lock problem on the database that you are aware of; the two are tightly linked. This is not a simple read query; you really want this write query to be as fast as possible.
Optimizing the "find" part of the update is part of the key here. First, since your collection has millions of documents, it's a good idea to keep field names as short as possible (ideally a single character: type => t). This helps because of the schemaless nature of mongodb collections, where field names are stored in every document.
Second, and more importantly, you need to make your query use a proper index. For that, you need to work around the $exists operator, which is not well optimized (there are actually several ways to do this).
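One possible workaround (a sketch, not the only option): an equality match on null can use an index and matches documents where the field is missing, so after creating an index on type the backfill query becomes index-assisted. Be aware it would also match documents where type was explicitly set to null, if any exist:
db.history.createIndex({ type: 1 })
db.history.update(
    { type: null },                    // matches missing fields (and explicit nulls)
    { $set: { type: 'PROGRAM' } },
    { multi: true }
)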
Third, you can work on the field values themselves. Use http://bsonspec.org/#/specification to estimate the size of the value you want to store, and possibly pick a better choice (in your case, you could replace the 'PROGRAM' string with a numeric constant, for example, and save a few bytes, multiplied by the number of documents touched by each multi-update). The smaller the data you want to write, the faster the operation will be.
A few links to other questions which may inspire you:
Can MongoDB use an index when checking for existence of a field with $exists operator?
Improve querying fields exist in MongoDB

TTL index on oplog or reducing the size of oplog?

I am using MongoDB with Elasticsearch for my application. Elasticsearch creates its indexes by monitoring the oplog collection. When both applications are running constantly, any changes to the collections in MongoDB are indexed immediately. The only problem I face is that if, for some reason, I have to delete and recreate the index, it takes ages (2 days) for the indexing to complete.
When I looked at the size of my oplog, its default capacity is 40 GB and it is holding around 60 million transactions, which is why creating a fresh index takes so long.
What would be the best way to optimize fresh index creation?
Should I reduce the size of the oplog so that it holds fewer transactions while still not affecting my replication, or is it possible to create a TTL index on the oplog (which I have failed to do on several attempts)?
I am using Elasticsearch with MongoDB via the MongoDB river: https://github.com/richardwilly98/elasticsearch-river-mongodb/.
Any help to overcome the above mentioned issues is appreciated.
I am not an Elasticsearch pro, but your question:
What would be the best way to optimize fresh index creation?
does apply, to a degree, to everyone who uses third-party full-text search technologies with MongoDB.
The first thing to note is that if you have A LOT of records then there is no easy way around this unless you are prepared to lose some of them.
The oplog isn't really a good fit for this; you should probably look at using a custom script driven by timestamps in the main collection, or a change table that gives you a single place to quickly query for new or updated records.
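A rough sketch of that timer/polling idea in the mongo shell, assuming each document carries an indexed updatedAt field (my assumption, not something from your setup):
db.mycoll.createIndex({ updatedAt: 1 })               // hypothetical collection and field
var lastRun = ISODate("2014-01-01T00:00:00Z");        // persist this watermark between runs
db.mycoll.find({ updatedAt: { $gt: lastRun } }).forEach(function (doc) {
    // hand each new or updated document to the Elasticsearch indexer here
    printjson(doc._id);
});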
Unless you are filtering the oplog for specific records, i.e. inserts, you could be pulling out ALL oplog records, including deletes, collection operations, and even database operations. So you could try stripping out unneeded records from your oplog query; however, this creates a new problem: the oplog has no secondary indexes, and you cannot add any.
This means that if you start to read it in a more targeted manner, you will actually be running an unindexed query over those 60 million records, which will result in slow(er) performance.
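For illustration, a hedged example of such a filtered oplog read (the namespace is hypothetical); op: "i" restricts it to inserts, but with no indexes it is still a scan of the whole oplog:
db.getSiblingDB("local").oplog.rs.find({ op: "i", ns: "mydb.mycoll" })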
The fact that you cannot index the oplog also answers another one of your questions:
is it possible to create a ttl index(which I failed to do on several attempts) on oplog.
Nope.
As for your other question:
Is it to reduce the size of oplog so that it holds less number of transactions
Yes, but you will have a smaller replication recovery window, and on top of that you will lose records from your "fresh" index, so only part of your data will actually be indexed. I am unsure, from your question, whether this is a problem or not.
You can reduce the oplog size for a single secondary member that no other replica is syncing from. Look up rs.syncFrom and "Change the Size of the Oplog" in the MongoDB docs.

Truncate a collection

How do I truncate a collection in MongoDB or is there such a thing?
Right now I have to delete 6 large collections all at once, so I'm stopping the server, deleting the database files, and then recreating the database and the collections in it. Is there a way to delete the data and leave the collections as they are? The delete operation takes a very long time; I have millions of entries in the collections.
To truncate a collection and keep the indexes, use
db.<collection>.remove({})
You can efficiently drop all data and indexes for a collection with db.collection.drop(). Dropping a collection with a large number of documents and/or indexes will be significantly more efficient than deleting all documents using db.collection.remove({}). The remove() method does the extra housekeeping of updating indexes as documents are deleted, and would be even slower in a replica set environment where the oplog would include entries for each document removed rather than a single collection drop command.
Example using the mongo shell:
var dbName = 'nukeme';
db.getSiblingDB(dbName).getCollectionNames().forEach(function(collName) {
    // Drop all collections except system ones (indexes/profile)
    if (!collName.startsWith("system.")) {
        // Safety hat
        print("WARNING: going to drop ["+dbName+"."+collName+"] in 5s .. hit Ctrl-C if you've changed your mind!");
        sleep(5000);
        // Drop in the target database rather than whatever db the shell is currently using
        db.getSiblingDB(dbName).getCollection(collName).drop();
    }
})
It is worth noting that dropping a collection has different outcomes on storage usage depending on the configured storage engine:
WiredTiger (default storage engine in MongoDB 3.2 or newer) will free the space used by a dropped collection (and any associated indexes) once the drop completes.
MMAPv1 (default storage engine in MongoDB 3.0 and older) will not free up preallocated disk space. This may be fine for your use case; the free space is available for reuse when new data is inserted.
If you are instead dropping the database, you generally don't need to explicitly create the collections as they will be created as documents are inserted.
However, here is an example of dropping and recreating the database with the same collection names in the mongo shell:
var dbName = 'nukeme';
// Save the old collection names before dropping the DB
var oldNames = db.getSiblingDB(dbName).getCollectionNames();
// Safety hat
print("WARNING: going to drop ["+dbName+"] in 5s .. hit Ctrl-C if you've changed your mind!")
sleep(5000)
db.getSiblingDB(dbName).dropDatabase();
// Recreate database with the same collection names
oldNames.forEach(function(collName) {
    db.getSiblingDB(dbName).createCollection(collName);
})
The query below will delete all records in a collection and keep the collection as is:
db.collectionname.remove({})
remove() is deprecated in MongoDB 4.x; you need to use deleteMany() or one of the other newer methods:
db.<collection>.deleteMany({})
There is no equivalent to the "truncate" operation in MongoDB. You can either remove all documents, which has complexity O(n), or drop the collection, which is O(1) but loses your indexes.
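If you take the drop() route, here is a minimal sketch of re-creating what the application needs afterwards (the collection name and index key are purely illustrative):
db.mycoll.drop()
db.mycoll.createIndex({ someField: 1 })    // re-create each index the application relies on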
Create the database and the collections, and then back up the database to BSON files using mongodump:
mongodump --db database-to-use
Then, when you need to drop the database and recreate the previous environment, just use mongorestore:
mongorestore --drop
When you use the mongodump command, the backup is saved in the current working directory, in a folder named dump.
The db.collection.drop() method obtains a write lock on the affected database and will block other operations until it has completed.
I think using the db.collection.remove({}) method is better than db.collection.drop().