How to remove old revisions of documents in a CouchDB database? - nosql

I have a very large database with several GB of data, and when I try to compact it, the compaction takes more than 12 hours. Is there any other way to delete old revisions? Does _revs_limit help here? I can see that the revs limit of all databases is set to 1000. Does that mean that even after compaction, 1000 revisions will remain in CouchDB?

You cannot delete an old revision of a single document. This is because the old revisions are only used by CouchDB internally for concurrency control and you shouldn't have to worry about these revisions. If you want to remove all old revisions in order to shrink the size of your database, you can run compaction. Note that _revs_limit only controls how many revision identifiers are kept in each document's revision history; the old revision bodies themselves are removed by compaction regardless of that setting.
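If you want to script this, both the revs-limit setting and the compaction trigger are plain HTTP calls. A minimal sketch using Python's requests library, assuming CouchDB on localhost:5984, admin credentials, and a database named mydb (all placeholders, not from the question):

import time
import requests

COUCH = "http://localhost:5984"   # assumed CouchDB address
AUTH = ("admin", "password")      # assumed admin credentials
DB = "mydb"                       # assumed database name

# Optionally lower the number of revision identifiers kept per document.
requests.put(f"{COUCH}/{DB}/_revs_limit", data="100", auth=AUTH).raise_for_status()

# Trigger compaction; this is what actually removes old revision bodies.
resp = requests.post(f"{COUCH}/{DB}/_compact",
                     headers={"Content-Type": "application/json"},
                     auth=AUTH)
resp.raise_for_status()

# Compaction runs in the background; poll the db info until it finishes.
while requests.get(f"{COUCH}/{DB}", auth=AUTH).json().get("compact_running"):
    time.sleep(10)
print("compaction finished")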

Related

Is it possible to issue a durable delete in Aerospike with asinfo using 'truncate'?

I wanted to avoid using Aerospike clients (e.g. for Python) and delete records from a set using the native asinfo command 'truncate', as it is a quick way to do it. But after I restarted Aerospike, all deleted records were back. I saw this question, aerospike: delete all record in a set, but it doesn't answer my question. Neither does this page from the AS docs. It says that a tombstone should be written after a durable delete; do I have to create it manually, or are there some other ways?
UPD:
Thanks to #kporter, who provided the accepted answer below, I was able to look into the differences between the Community and Enterprise editions of Aerospike and found more information on the problem; some may find it helpful as well:
Persisted Delete [Community Edition]
This answer and the whole discussion from AS forum
And this thread
If I understood all of it correctly, the best way to get your records deleted completely in CE is to ensure that they have the right TTL and can expire naturally. And if for some reason you have a lot of old records without a TTL, as in my case, you can issue the truncate command via asinfo and avoid restarting the AS server until the data on the SSD is eventually overwritten. Or just truncate the sets with old records on every restart.
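For reference, the truncate step can be scripted without any Aerospike client library by shelling out to asinfo. A rough sketch, assuming a local node on the default port and a placeholder namespace/set called test/old_records:

import subprocess

NAMESPACE = "test"         # placeholder namespace
SET_NAME = "old_records"   # placeholder set

# Issue the 'truncate' info command against the local node.
cmd = ["asinfo", "-v", f"truncate:namespace={NAMESPACE};set={SET_NAME}"]
result = subprocess.run(cmd, capture_output=True, text=True, check=True)
print(result.stdout.strip())  # the node should answer with "ok"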
I also wonder whether it is possible to wipe the AS storage completely and then restore it from a backup of the already-truncated data as an emergency measure.
UPD1:
So, I was able to wipe the SSD used as Aerospike storage and restore only the needed records from a backup. Here is how I did it (a rough script for the backup and restore steps is sketched after this list):
First, remove the old records from the sets via asinfo and the truncate command; links to the docs are above
Then back up the namespaces you want to keep with asbackup
Stop your AS server; mine was in a Docker container, so I just stopped that container
Zero out the disk that is used as AS storage, mine was /dev/sdb
Create the necessary partitions on this disk
Start the AS server
Restore the data from the backup using asrestore
Useful links: how to remove and clean up an aerospike server installation, AS docs on SSD setup
I am not sure whether this is a good solution for large production setups, but it worked as intended in my case, with only one AS node and the opportunity to stop it for a while.
This way I was able to reduce the size of the data in my AS from 160 GB to 11 GB, and because of that my server now fully restarts in half an hour instead of the approximately eight hours it took before.
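Here is the rough script mentioned above for the asbackup/asrestore parts, assuming a single local node, a placeholder namespace called test, and a backup directory of /backup/test; the disk wipe and repartitioning in between stay manual:

import subprocess

NAMESPACE = "test"           # placeholder namespace
BACKUP_DIR = "/backup/test"  # placeholder backup location

# 1. Back up the namespace after the old sets have been truncated.
subprocess.run(["asbackup", "--namespace", NAMESPACE, "--directory", BACKUP_DIR],
               check=True)

# 2. Stop the server, zero out and repartition the storage device, then start
#    the server again (manual steps, see the list above).

# 3. Restore only the records that made it into the backup; they go back into
#    the namespace recorded in the backup files.
subprocess.run(["asrestore", "--directory", BACKUP_DIR], check=True)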
You can find more information about truncating a set here:
https://www.aerospike.com/docs/operations/manage/sets/
As mentioned there, truncation is not durable in Aerospike Community.
In the Enterprise Edition, truncation is durable and preserves record deletions through a cold-restart. In the Community Edition, similar to record deletes, truncation is not durable, and the deleted records can return through a cold-start.

Archive old data in Postgresql

I'm looking for advice on the process I'm going to use for DB archiving.
I have a database (DB-1) with two very large tables, one holding 25 GB of data and the other 20 GB. This causes major performance issues even though I have indexes.
So we are considering archiving the old data with the process below:
Clone a new database (DB-2) from the existing database (DB-1).
Delete the old data from DB-1 so that it only holds the last 2 years of records. If I need older data, I can connect to DB-2.
Every month, move the old data from DB-1 to DB-2 and delete the moved rows from DB-1.
That is the wrong approach.
What you are looking for is partitioning.
You can create range partitions covering one year each. To remove old data all you need to do is to drop the partition for the year(s) no longer needed.
If you need to keep the data for some reason, you can also just detach the partition from the table. Then the data is still "lying around", but does not show up in the (partitioned) table. You can query the (detached) partition directly to access that data. You could even move that (detached) partition to a slower hard disk to free up space on your fast disks if you have more than one.
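A minimal sketch of the Postgres 11 declarative version of this, wrapped in psycopg2 so it can run as-is; the events table and created_at column are placeholder names, not taken from the question:

import psycopg2

conn = psycopg2.connect("dbname=db1")  # placeholder connection string
conn.autocommit = True
cur = conn.cursor()

# Range-partitioned parent table plus one partition per year.
cur.execute("""
    CREATE TABLE events (
        id         bigint NOT NULL,
        created_at timestamptz NOT NULL,
        payload    text
    ) PARTITION BY RANGE (created_at);
""")
for year in range(2017, 2020):
    cur.execute(f"""
        CREATE TABLE events_{year} PARTITION OF events
        FOR VALUES FROM ('{year}-01-01') TO ('{year + 1}-01-01');
    """)

# Removing an old year is then a metadata-only operation:
cur.execute("ALTER TABLE events DETACH PARTITION events_2017;")  # keep the data around
# or: cur.execute("DROP TABLE events_2017;")                     # discard it entirely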
You might even find that partitioning alone already improves performance, but that depends a lot on your queries.
Note that you should use Postgres 11 for that, as partitioning wasn't that sophisticated in older versions.
You should no doubt upgrade your current version (I'd suggest moving away from the EDB system you are working on now to community-based Postgres 11), but even if you can't upgrade, partitioning is still a much better answer than creating a second database.
By recreating your table as a set of partitions within the same database, you will be able to add and remove data in a much cleaner fashion, and it will make dealing with vacuums much easier. Even in 9.5, you can take advantage of table inheritance to build out partitions: first add partitions for incoming data, then create partitions at various intervals (probably monthly, since you want to run monthly cleanup) and move the data into those partitions. This can be accomplished atomically with a series of INSERT INTO partition SELECT * FROM table WHERE <timestamp> style statements.
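One way to make that monthly move atomic on 9.5 is a data-modifying CTE, so the rows are removed from the parent and inserted into the child partition in a single statement. A sketch with placeholder names (events, events_2019_01, created_at):

import psycopg2

conn = psycopg2.connect("dbname=db1")  # placeholder connection string
cur = conn.cursor()

# Move one month of rows from the parent table into its child partition.
# ONLY restricts the DELETE to rows physically stored in the parent itself.
cur.execute("""
    WITH moved AS (
        DELETE FROM ONLY events
        WHERE created_at >= '2019-01-01' AND created_at < '2019-02-01'
        RETURNING *
    )
    INSERT INTO events_2019_01 SELECT * FROM moved;
""")
conn.commit()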
I suspect you can probably manage this yourself (you need basic SQL and the ability to write simple triggers/functions... here is a link to the 9.5 docs), but if you need help, you can engage with one of the Postgres chat communities, or contact a support company if you want a deeper dive.

Sitecore 8.1 update 2 MongoDB backup

I am using a replica set (2 MongoDB nodes, 1 arbiter) for my Sitecore CD servers.
Assuming all MongoDB data gets flushed to the reporting SQL DB, do we need to take a backup of the MongoDB database on the production CD servers?
If yes, what are the best approach and frequency, considering my application makes moderate use of the analytics features (personalization, campaigns, etc.)?
Unfortunately, your assumption is bad - the MongoDB is the definitive source of analytic data, not the reporting db. The reporting db contains only the aggregate info needed for generating the report (mostly). In fact, if (when) something goes wrong with the SQL DB, the idea is that it is rebuilt from the source MongoDB. Remember: You can't un-add two numbers after you've added them!
Backup vs Replication
A backup is a point-in-time view of the database, whereas replication maintains multiple active copies of the current database. I would advocate for replication over backup for this type of data. Why? Glad you asked!
Currency - under what circumstances would you want to restore a 50 GB MongoDB? What if it was a week old? What if it was a month old? Really the only useful data is current data, and websites are volatile places - log data backups are out of date within an hour. If you personalise on stale data, is that providing a good user experience?
Cost - backing up large datasets is costly in terms of time, storage capacity and compute requirements; backups are also a pain to restore, and the bigger they are, the more likely there's a corruption somewhere.
Run of business
In a production MongoDB environment you really should have 2-3 replicas. That's going to save your arse if one of the boxes dies, which they sometimes do - MongoDB works the disks very hard.
These replicas are self-healing and always current (pretty much), so they are much better than taking backups. The chance that you lose all your replicas at once is really low, except for one particular edge case... upgrades. So a backup is really only protection against hardware failure or data corruption, which, in a multi-instance replica set, is already very effectively handled. Unless you're paranoid, you're never going to use that backup, and it'll cost you plenty to have it.
Sitecore Upgrades
This is the killer edge-case - always make backups (see Back Up and Restore with MongoDB Tools) before running an upgrade because you can corrupt all of your replicas in one motion and you'll want to be able to roll back.
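A minimal pre-upgrade backup sketch with the standard MongoDB tools, driven from Python for convenience; the host, port and dump directory are placeholders:

import subprocess
from datetime import date

DUMP_DIR = f"/backups/sitecore-pre-upgrade-{date.today()}"  # placeholder location

# Dump all databases from one replica set member; --oplog also captures writes
# made during the dump so the snapshot stays consistent.
subprocess.run(["mongodump", "--host", "localhost", "--port", "27017",
                "--oplog", "--out", DUMP_DIR],
               check=True)

# To roll back after a failed upgrade, restore with the matching flag:
#   mongorestore --host localhost --port 27017 --oplogReplay <dump directory>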
Data Trimming (side-note)
You didn't ask this, but at some point you'll be thinking "how the heck can I back up this 170 GB monster db every day? This is ridiculous" - and you'll be right.
There are various schools of thought around how long this data should be persisted for - that's a question only you or your client can answer. I suggest keeping it until there's too much, then make a decision on how much you have to get rid of. Keep as much as you can tolerate.

Sphinx: Real-Time Search w/Expiration?

I am designing a search that will be fed around 50 to 200 GB of text data per day (similar to logs), and it only needs to retain that data for a week or two. This data will be piped in at a constant rate (5,000 documents per second, for example), non-stop, 24 hours a day. After a week or two, a document should drop out of the index, never to be heard from again.
The index should be searchable with free-form text across only 1 field (pretty small in size, around 512 characters max). At most, the schema could have 2 attributes that could be categorized.
The system needs to be indexed in near real-time as data is fed to it. A delay of 15 to 30 seconds is acceptable.
We prefer to stream data into the indexer/service through a constant pipe.
Lastly, a single stand-alone solution is preferred over any type of distributed setup (this will be part of a package to deploy and set up on local machines for testers).
I'm looking closely at Sphinx search engine with RT updates via the API as it checks off most of these. But, I am not seeing an easy way to expire documents after a certain length of time.
I am aware that I could track the IDs and a timestamp and issue a batch DELETE through the Sphinx API. But that creates the issue of tracking large numbers of IDs in a separate datastore, which would need the same 5,000-per-second insert rate, plus deleting them when done.
I also have a concern about Sphinx index fragmentation caused by mass-inserting and mass-deleting in the middle of inserting.
We would really prefer the search engine/indexer to handle the expiration itself.
I think I can use WHERE timestamp < UNIXTIMESTAMP-OF-TWO-WEEKS-AGO as the where clause in the Sphinx API in order to gather the document IDs to delete. The problem with that is that if the system does not stay on top of the deletes, the total number of documents/search results will be in the tens of millions, maybe even billions, after a two-week timeframe if it has to gather a few days' worth of document IDs to delete. That's not a feasible query.
You can actually run
DELETE FROM rt WHERE timestamp < UNIXTIMESTAMP-OF-TWO-WEEKS-AGO
As a query to delete the old documents, which is much simpler :)
You will also need to call OPTIMIZE INDEX from time to time.
Both of these will have to be called on some sort of 'cron' schedule, as they won't run automatically.
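Since searchd exposes a SphinxQL listener that speaks the MySQL protocol (port 9306 by default), the scheduled cleanup can be a few lines run from cron. A sketch assuming an RT index named rt with a timestamp attribute, matching the query above, and using pymysql as the client:

import time
import pymysql  # any MySQL client library can talk to searchd's SphinxQL port

TWO_WEEKS = 14 * 24 * 3600
cutoff = int(time.time()) - TWO_WEEKS

# Connect to searchd's SphinxQL listener, not to an actual MySQL server.
conn = pymysql.connect(host="127.0.0.1", port=9306, user="root")
with conn.cursor() as cur:
    # Drop everything older than two weeks...
    cur.execute("DELETE FROM rt WHERE timestamp < %s", (cutoff,))
    # ...and merge/clean up the resulting disk chunks from time to time.
    cur.execute("OPTIMIZE INDEX rt")
conn.close()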
You might be better off not using Sphinx's DELETE function at all. When writing to RT indexes, as soon as the RAM chunk is full it is written out as a disk chunk, so you end up with a number of disk chunks on disk. The oldest documents will be in the oldest chunks, sequentially.
So to clear out the oldest documents, you could just dispose of the oldest chunks (on a rolling basis).
The problem is that Sphinx does not include a function to delete individual chunks.
You will need to shut down searchd, delete the chunk(s), manipulate the header files and then restart Sphinx. Not an easy process.
But in the more general sense, I'm not sure whether Sphinx will be able to keep up with a continuous stream of 5,000 documents per second (even ignoring deletes for a moment) - Sphinx is generally designed for write-infrequently, read-frequently workloads. It builds a (for the most part) monolithic inverted index. This is great for querying, but very hard to keep updated. It's not great for incremental updates.

Lucene NRT: When to commit?

We're refactoring our Lucene host (Lucene.NET 2.9.2), and are implementing Lucene NRT (Near Realtime).
What is the best time/threshold to commit the changes to disk? Is there a golden rule? If it is when the internal RAM buffer holds a certain amount of data, how do I get its size?
Once a commit happens we update our database, so I'm not that fearful of power failures (once the process starts again, it will reindex those documents that have not been committed).
I have just implemented what sounds like the same scheme in our system. I decided to do a commit when I have over 1000 uncommitted documents. I think the number really depends on how many docs/sec you will be adding. I am also not sure whether I can run the commit on a different thread from the one where I am adding the docs.
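The counting part of that scheme is easy to isolate; here is a sketch (written in Python only for illustration) where the writer object and its add_document/commit methods stand in for whatever IndexWriter wrapper you already have, not a real Lucene.NET API. For the RAM-size criterion from the question, I believe older Lucene builds expose something like IndexWriter.RamSizeInBytes(), but check your Lucene.NET version.

class ThresholdCommitter:
    """Commit the underlying index once enough uncommitted docs pile up."""

    def __init__(self, writer, threshold=1000):
        self.writer = writer        # assumed wrapper around your IndexWriter
        self.threshold = threshold  # tune to your docs/sec rate
        self.uncommitted = 0

    def add(self, doc):
        self.writer.add_document(doc)   # assumed method on the wrapper
        self.uncommitted += 1
        if self.uncommitted >= self.threshold:
            self.writer.commit()        # flush to disk, then update your database
            self.uncommitted = 0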