Should I be removing tombstones from my Cloudant database? - ibm-cloud

I have a Cloudant database with 4 million documents and 27 million deleted documents ("tombstones"). Is having so many tombstones a problem and, if so, how do I get rid of them?

"Tombstones" occupy space and so contribute to your bill. They also increase the time for new replications to complete or new indexes to build.
So in general it is good practice to periodically remove these tombstones.
The best way to do it is to replicate your database with a filter that leaves deleted documents behind.
Replications are started by creating a document in the _replicator database like so:
{
  "_id": "myfirstreplication",
  "source" : "http://<username1>:<password1>@<account1>.cloudant.com/<sourcedb>",
  "target" : "http://<username2>:<password2>@<account2>.cloudant.com/<targetdb>",
  "selector": {
    "_deleted": {
      "$exists": false
    }
  }
}
where the source is the original database and the target is the new, empty database. The selector is the filter applied to each document before replication: in this case we only want documents without a _deleted attribute, i.e. documents that haven't been deleted.
This replication will result in a brand new database with no tombstones. Point your application to this new database and then delete the old one with the tombstones.
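The effect of the selector can be illustrated in plain JavaScript. This is a hand-rolled local sketch of the "$exists": false check, not Cloudant's actual implementation; Cloudant evaluates the real selector server-side during replication.

```javascript
// Sketch: which documents would a {"_deleted": {"$exists": false}} selector pass?
function passesSelector(doc) {
  // "$exists": false matches documents that do NOT carry the field at all
  return !Object.prototype.hasOwnProperty.call(doc, '_deleted');
}

const docs = [
  { _id: 'a', name: 'live document' },
  { _id: 'b', _deleted: true },          // a tombstone
  { _id: 'c', name: 'another live doc' }
];

const replicated = docs.filter(passesSelector);
console.log(replicated.map(d => d._id)); // only the live documents survive
```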
In this blog post there are other, more complex, scenarios that you may want to consider.

Related

Purge documents in MongoDB without impacting the working set

We have a collection of documents, and each document has a time window associated with it (for example, fields like 'fromDate' and 'toDate'). Once a document has expired (i.e. its toDate is in the past), it isn't accessed by our clients anymore.
We wanted to purge these documents to reduce the number of documents in the collection and thus make our queries faster. However, we later realized that this past data could be important for analyzing how the data changes over time, so we decided to archive it instead of purging it completely. This is what I've come up with.
Let's say we have a "collectionA" which contains these past documents:
1. Query all the past documents in "collectionA" (queries are made on the secondary server).
2. Insert them into a separate collection called "collectionA-archive".
3. Delete the documents from "collectionA" that were successfully inserted into the archive.
4. Delete documents in "collectionA-archive" that meet a certain condition (we do not want to keep a huge archive).
My question here is: even though I'm making the queries on the secondary server, since the insertions happen on the primary, do the documents inserted into the archive collection make it into the primary's working set? The last thing we need is these past documents being held in the primary's RAM, which could affect the performance of our live API.
I know one solution could be to insert the past documents into a separate DB server, but acquiring another server is a bit of a hassle. So I would like to know if this is achievable within one server.
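The four steps above can be sketched as a single in-memory pipeline. This is only a model of the logic; in production each array would be a MongoDB collection accessed through the driver, with the read in step 1 routed to a secondary:

```javascript
// In-memory sketch of the archive pipeline described above.
function archiveExpired(collectionA, archive, now, keepArchiveAfter) {
  // 1. Query the expired documents (toDate in the past)
  const expired = collectionA.filter(d => d.toDate < now);

  // 2. Insert them into the archive collection
  const inserted = [];
  for (const doc of expired) {
    archive.push(doc);
    inserted.push(doc._id);
  }

  // 3. Delete from collectionA only what was successfully archived
  const remaining = collectionA.filter(d => !inserted.includes(d._id));

  // 4. Trim the archive itself so it does not grow without bound
  const trimmedArchive = archive.filter(d => d.toDate >= keepArchiveAfter);

  return { remaining, trimmedArchive };
}

const live = [
  { _id: 1, toDate: 10 },
  { _id: 2, toDate: 50 },
  { _id: 3, toDate: 90 }
];
const { remaining, trimmedArchive } = archiveExpired(live, [], 60, 30);
// _id 1 and 2 are expired; _id 1 is too old even for the archive
```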

Managing transaction documents with MongoDB

Imagine you have millions of users who perform transactions on your platform. Assuming each transaction is a document in your MongoDB collection, millions of documents would be generated every day, exploding your database in no time. I have received the following solutions from friends and family.
Having a TTL index on the documents - this won't work because we need those documents stored somewhere so that they can be retrieved at a later point in time when the user asks for them.
Sharding the collection with the timestamp as the key - this won't help us control the time frame for which the data should be backed up.
I would like to understand and implement a strategy similar to what banks follow. They keep your transactions up to a certain point (e.g. 6 months), after which you have to request them via support or some other channel. I am assuming they follow a hot/cold storage pattern, but I am not completely sure about it.
The entire point is to manage transaction documents and, on a daily basis, back up or move the older records to another place where they can be read from. Any idea how that is possible with MongoDB?
Update: sample document (please note that a few other keys in the document have been redacted):
{
  "_id" : ObjectId("5d2c92d547d273c1329b49f0"),
  "transactionType" : "type_3",
  "transactionTimestamp" : ISODate("2019-07-15T14:51:54.444Z"),
  "transactionValue" : 0.2,
  "userId" : ObjectId("5d2c92f947d273c1329b49f1")
}
First, create a collection where you want to save all the records (following your sample, let's say the entries are stored in a collection named A).
Then take a backup every day at midnight and, once the backup has succeeded, restore it into a collection whose name carries the timestamp.
Once the entries are safely stored in that timestamped collection, you can truncate the original one.
With this approach the live collection stays small, and you still keep every record.
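The daily rollover can be sketched as follows. Collections are modelled as plain arrays on an object here; a real job would use mongodump/mongorestore or renameCollection through the driver, and the collection names are made up for illustration:

```javascript
// Sketch of the daily rollover: move today's transactions into a collection
// named with the date, then empty the live collection.
function rollOver(db, liveName, date) {
  const archiveName = liveName + '-' + date;   // e.g. "A-2019-07-15"
  db[archiveName] = db[liveName];              // "restore the backup" under a timestamped name
  db[liveName] = [];                           // truncate the live collection
  return archiveName;
}

const db = { A: [{ _id: 1 }, { _id: 2 }] };
const name = rollOver(db, 'A', '2019-07-15');
// db['A-2019-07-15'] now holds both documents; db.A is empty
```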

Dropping Mongo collection not clearing disk space

I have a collection with 750,000 documents, it's taking around 7Gb on the disk.
I've dropped the collection, but the files (test.0...test.11) are still on the disk.
If I delete them, then I lose all the collections, not just the one I dropped.
Shouldn't Mongo be deleting them?
Just noticed that the database stats have an error.
{
  "ok" : 0,
  "errmsg" : "Collection [test.loadTest-2016-02-06 15:05:34Z] not found."
}
You have dropped a collection, but not the database containing it. Dropping a collection does not compact the data files, nor does deleting a document. If you really want to reclaim the space, either drop the database entirely and reimport it, or compact it using repairDatabase (see the docs). Beware though: as far as I know, you cannot compact the database online if you only have one node.
If you have a replica set, adding new nodes and removing the old ones is the safest way of compacting the database online. I do that from time to time and it's easy.

Partial doc updates to a large mongo collection - how to not lock up the database?

I've got a mongo db instance with a collection in it which has around 17 million records.
I wish to alter the document structure (to add a new attribute to the document) of all 17 million documents, so that I don't have to programmatically deal with different structures, and so queries are easier to write.
I've been told though that if I run an update script to do that, it will lock the whole database, potentially taking down our website.
What is the easiest way to alter the document without this happening? (I don't mind if the update happens slowly, as long as it eventually happens)
The query I'm attempting to do is:
db.history.update(
  { type : { $exists: false } },
  { $set: { type: 'PROGRAM' } },
  { multi: true }
)
You can update the collection in batches (say, half a million per batch); this will distribute the load.
I created a collection with 20,000,000 records and ran your query on it. It took ~3 minutes to update on a virtual machine, and I could still read from the db in a separate console.
> for(var i=0;i<20000000;i++){db.testcoll.insert({"somefield":i});}
The locking in mongo is quite lightweight, and it is not going to be held for the whole duration of the update. Think of it like 20000000 separate updates. You can read more here:
http://docs.mongodb.org/manual/faq/concurrency/
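The batching advice above can be sketched as a loop that touches only a fixed number of documents per pass. The collection is modelled as an in-memory array here; real code would page through _id ranges with the driver rather than rescan, and the batch size is illustrative:

```javascript
// Sketch: apply { $set: { type: 'PROGRAM' } } to documents missing the field,
// a fixed number at a time, so no single operation touches the whole collection.
function updateInBatches(docs, batchSize) {
  let batches = 0;
  let updatedInBatch;
  do {
    updatedInBatch = 0;
    for (const doc of docs) {
      if (doc.type === undefined) {
        doc.type = 'PROGRAM';
        if (++updatedInBatch === batchSize) break; // yield between batches
      }
    }
    if (updatedInBatch > 0) batches++;
  } while (updatedInBatch === batchSize);
  return batches;
}

const coll = [{ _id: 1 }, { _id: 2, type: 'OTHER' }, { _id: 3 }, { _id: 4 }];
const batches = updateInBatches(coll, 2);
// three documents lacked `type`; with batchSize 2 that takes two batches
```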
You do actually care if your update query is slow, because of the write lock problem on the database you are aware of; the two are tightly linked. This is not a simple read query: you really want this write query to be as fast as possible.
Tuning the "find" part of the update is part of the key here. First, since your collection has millions of documents, it's a good idea to keep field names as short as possible (ideally a single character: type => t). This helps because of the schemaless nature of mongodb collections: the field name is stored inside every single document.
Second, and more importantly, you need to make your query use a proper index. For that you need to work around the $exists operator, which is not optimized (there are actually several ways to do that).
Third, you can work on the field values themselves. Use http://bsonspec.org/#/specification to estimate the size of the value you want to store, and possibly pick a better representation (in your case, you could for example replace the 'PROGRAM' string with a numeric constant and gain a few bytes per document, multiplied by the number of documents touched by each multi update). The smaller the data you want to write, the faster the operation will be.
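The savings can be estimated with some back-of-the-envelope arithmetic, using the element layouts from bsonspec.org (every BSON element is 1 type byte, then the null-terminated field name, then the value; a string value carries an int32 length prefix and a trailing NUL):

```javascript
// Back-of-the-envelope BSON sizing for the two suggestions above.
const elementSize = (name, valueBytes) => 1 + (name.length + 1) + valueBytes;

const stringValue = s => 4 + s.length + 1; // int32 length prefix + bytes + NUL
const int32Value = 4;

const before = elementSize('type', stringValue('PROGRAM')); // "type": "PROGRAM"
const after  = elementSize('t', int32Value);                // "t": <numeric constant>

const perDoc = before - after;
const total = perDoc * 17e6; // the collection has ~17 million documents
console.log(perDoc, 'bytes per document,', total / 1e6, 'MB overall');
// 11 bytes per document, 187 MB overall
```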
A few links to other questions which can inspire you :
Can MongoDB use an index when checking for existence of a field with $exists operator?
Improve querying fields exist in MongoDB

Limit the number of documents in a mongodb collection , without FIFO policy

I'm building an application to handle ticket sales and expect to have really high demand. I want to try using MongoDB with multiple concurrent client nodes serving a node.js website (and gracefully handle failure of clients).
I've read "Limit the number of documents in a collection in mongodb" (which is completely unrelated) and "Is there a way to limit the number of records in certain collection" (but that talks about capped collections, where the new documents overwrite the oldest documents).
Is it possible to limit the number of documents in a collection to some maximum size, and have documents after that limit just be rejected. The simple example is adding ticket sales to the database, then failing if all the tickets are already sold out.
I considered having a NumberRemaining document, which I could atomically decrement until it reaches 0, but that leaves me with a problem if a node crashes between decrementing that number and saving the purchase of the ticket.
Store the tickets in a single MongoDB document. As you can only atomically update one document at a time, keeping all the state in one document avoids the cross-document dependencies that would otherwise call for a traditional transactional database system.
As a document can be up to 16MB, by storing only a ticket_id per entry in a master document, you should be able to store plenty of tickets without needing any extra complex document management. While it could introduce a hot spot, the document likely won't be very large. If it does get large, you could use more than one document (splitting into multiple documents as one "fills", then activating another).
If that doesn't work, 10gen has a pattern that might fit.
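The single-document approach can be sketched like this. In MongoDB the claim would be one findOneAndUpdate with $pop, which is atomic per document; here the master document is just a local object, and the ticket ids are made up:

```javascript
// Sketch: all unsold ticket ids live in one master document, and a sale
// atomically removes one; an empty list means "sold out".
const master = { _id: 'event-42', available: ['t1', 't2', 't3'] };

function claimTicket(doc) {
  if (doc.available.length === 0) return null; // sold out: reject the sale
  return doc.available.pop();                  // models the atomic $pop
}

const sold = [];
let t;
while ((t = claimTicket(master)) !== null) sold.push(t);
// three claims succeed, any further claim returns null: sold out
```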
My only solution so far (I'm hoping someone can improve on this):
Insert documents into an un-capped collection as they arrive. Keep the implicit _id value of ObjectID, which can be sorted and will therefore order the documents by when they were added.
Run all queries ordered by _id and limited to the max number of documents.
To determine whether an insert was "successful", run an additional query that checks that the newly inserted document is within the maximum number of documents.
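The three steps above can be sketched in a few lines. Ids are modelled as plain increasing numbers standing in for ObjectIDs, and the collection is an in-memory array rather than a real one:

```javascript
// Sketch of the insert-then-verify idea: an insert only "counts" if the new
// document falls within the first `max` documents when ordered by _id.
function insertWithLimit(coll, doc, max) {
  coll.push(doc);                                   // inserts always land
  const winners = [...coll]
    .sort((a, b) => a._id - b._id)
    .slice(0, max);                                 // query ordered by _id, limited to max
  return winners.some(w => w._id === doc._id);      // was our insert inside the limit?
}

const coll = [];
const accepted = [1, 2, 3, 4].map(id => insertWithLimit(coll, { _id: id }, 3));
// the first three inserts succeed, the fourth lands outside the limit
```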
My solution was: I use an extra count variable in another collection. This collection has a validation rule that prevents the count from becoming negative; the count should always be a non-negative integer:
"count": { "$gte": 0 }
The algorithm is simple: decrement the count by one. If that succeeds, insert the document. If it fails, there is no space left.
Vice versa for deletion.
You can also use transactions to guard against partial failures (e.g. the count is decremented but the service fails just before the insert operation).
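The counter approach can be sketched as follows. The validation rule is modelled by a plain check here; in MongoDB the decrement would be rejected by the collection validator and the insert would only run after the decrement succeeds:

```javascript
// Sketch: a counter document guards the collection size; an insert is only
// attempted after a successful decrement, mimicking "count": { "$gte": 0 }.
function tryInsert(counter, coll, doc) {
  if (counter.count - 1 < 0) return false; // validation would reject the decrement
  counter.count -= 1;                      // reserve a slot first...
  coll.push(doc);                          // ...then insert the document
  return true;
}

const counter = { count: 2 };
const coll = [];
const results = ['a', 'b', 'c'].map(id => tryInsert(counter, coll, { _id: id }));
// two inserts fit, the third is rejected because the count would go negative
```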