Is Postgresql Continuous Archiving practical for maintaning a complete database history? - postgresql

If I enable Continuous Archiving from day one, are the resulting logs a practical method of keeping a complete point-in-time history of all database operations? I guess transaction volume will be a factor, so assume less than 1000 transactions a day.

That depends on what you mean by "a complete point-in-time history of all database operations".
A base backup and all of the Write-Ahead Log (WAL) files (also often referred to as the transaction log or xlog) from the start of the backup forward should allow you to recover to any point in time. To minimize recovery time, though, it is a good idea to take a fresh base backup periodically. (Many people do this weekly or monthly, but I've heard of people doing is much less frequently.)
These logs are oriented toward physical storage of the data, not logical statements, so it is currently not possible to determine the SQL statements which generated the xlog. So if you're looking for an audit trail of what happened, it is not currently suitable for that.
There is a team of PostgreSQL developers working on logical replication, to allow broader uses of xlog data, for probable release in version 9.3, which won't be out for over a year. Until then, people use trigger-based logging for such audit trails.

Related

Handling backups of a large table (>1 TB) in Postgres?

I have a 1TB table (X) that is a pain to backup.
The table X contains historical log data that is not often updated after creation. We usually only access a single row at a time, so performance is still very good.
We currently make nightly full logical backups, and exclude X for the sake of backup time and space. We do not need historical backups of X, since the log files from which it is populated are backed up themselves. However, recovery of X by re-processing of the log files would take an unnecessary long time.
I'd like to include X in our backup strategy so that our recovery time can be much faster. It doesn't seem feasible to include X in the nightly logical backup.
Ideally, I'd like a single full backup for X that is updated incrementally (purely to save time).
I lack the experience to investigate solutions alone, and I'm wondering what my options are?
Barman for incremental updates? Partition X? Both?
After doing some more reading, I'm inclined to partition the table and write a nightly script to perform logical backups only on the changed table partitions (and replace the previous backups). However, this strategy may still take a long time during recovery with a pg_restore... Thoughts?
Thanks!
I think using barman with the rsync/SSH + WAL streaming option and performing incremental backups is the best option in your case. Going this way makes your nightly backups easier & less costly, since you don't have to do much logic yourself once you configure barman. I will update this with my blog shortly that details the steps.
Logical backups may not be the right approach for periodic backups when dealing with large databases. When using physical backups even though your backup size is large its more than compensated in the acquisition & restore cost (performance, speed & simplicity).
Thanks
UPDATE (2020-08-27):
Below is a git repo with the end-end demonstration. There are many versions of implementations out there that have done it, but if you wanted to do something from the scratch & keep the image simple (avoiding unnecessary dependencies), please take a look at this implementation,
https://github.com/softwarebrahma/PostgreSQL-Disaster-Recovery-With-Barman
Thanks

Sitecore 8.1 update 2 MongoDB backup

I am using replica set (2 mongo, 1 arbitor) for my Sitecore CD servers.
Assuming all mongo DB data get flushed to Reporting SQL DB; do we need to take backup of MongoDB database on production CD ?
If yes what is best approach and frequency to do it; considering My application is moderately using anaytics feature (Personalization , Campaign etc).
Unfortunately, your assumption is bad - the MongoDB is the definitive source of analytic data, not the reporting db. The reporting db contains only the aggregate info needed for generating the report (mostly). In fact, if (when) something goes wrong with the SQL DB, the idea is that it is rebuilt from the source MongoDB. Remember: You can't un-add two numbers after you've added them!
Backup vs Replication
A backup is a point-in-time view of the database, where replication is multiple active copies of a current database. I would advocate for replication over backup for this type of data. Why? Glad you asked!
Currency - under what circumstance would you want to restore a 50GB MongoDB? What if it was a week old? What if it was a month? Really the only useful data is current data, and websites are volatile places - log data backups are out of date within an hour. If you personalise on stale data is that providing a good user experience?
Cost - backing up large datasets is costly in terms of time, storage capacity and compute requirements; they are also a pain to restore and the bigger they are the more likely there's a corruption somewhere
Run of business
In a production MongoDB environment you really should have 2-3 replicas. That's going to save your arse if one of the boxes dies, which they sometimes do - MongoDB works the disks very hard.
These replicas are self-healing, and always current (pretty-much) so they are much better than taking backups. The chances that you lose all your replicas at once is really low except for one particular edge case... upgrades. So a backup is really only protection against hardware failure or data corruption which, in a multi-instance replica set, is already very effectively handled. Unless you're paranoid, you're never going to use that backup and it'll cost you plenty to have it.
Sitecore Upgrades
This is the killer edge-case - always make backups (see Back Up and Restore with MongoDB Tools) before running an upgrade because you can corrupt all of your replicas in one motion and you'll want to be able to roll back.
Data Trimming (side-note)
You didn't ask this, but at some point you'll be thinking "how the heck can I back up this 170GB monster db every day? this is ridiculous" - and you'll be right.
There are various schools of thought around how long this data should be persisted for - that's a question only you or your client can answer. I suggest keeping it until there's too much, then make a decision on how much you have to get rid of. Keep as much as you can tolerate.

Does Firebird have "Group Commit"

As I understand it, this is where a background thread is responsible for writing transactions to disk in "careful write" order so that the user does not have to wait for the actual writing to disk to occur.
I have seen references to this (e.g. here) from a long time ago relating to interbase but I could not see it mentioned in relation to firebird anywhere.
Using gfix utility you can set FORCED WRITES flag on or off for a database file. When turned on, the server will wait until actual disk write occur. When turned off the server will continue execution leaving to OS to decide when to write data to a disk. Performance gains are up to 3x but then there is a posibility that some data would be written in a wrong order if power failure occurs.
We strongly advice our customers toward using RAID controller with independent power source for a cache memory together with FORCED WRITES = ON.
Based on the comments on this thread and searching online it seems that firebird does not have GROUP COMMIT

Database for long running transactions with huge updates

I build a tool for data extraction and transformation. Typical use case - transactionally processing lots of data.
Numbers are - about 10sec - 5min duration, 200-10000 row updated (long duration caused not by the database itself but by outside services that used during transaction).
There are two types of agents that access database - multiple read agents, and only one write agent (so, there are never multiple concurrent write).
During the transaction:
Read agents should be able to read database and see it in the current state.
Write agent should be able to read database (it does both - read and write during transaction) and see it in the new (not yet committed) state.
Is PostgreSQL a good choice for that type of load? I know it uses MVCC - so it should be ok in general, but is it ok to use long and big transactions extensively?
What other open-source transactional databases may be a good choice (I am not limited to SQL)?
P.S.
I do not know if the sharding may affect the performance. The database will be sharded. For every shard there will be multiple readers and only one writer, but multiple different shards can be written to at the same time.
I know that it's better not to use outside services during transaction, but in that case - it's the goal. The database used as a reliable and consistent index for some heavy, huge, slow and eventually-consistent data processing tool.
Huge disclaimer: as always, only real life test can tell you the truth.
But, I think PostgreSQL will not let you down, if you use most recent version (at least 9.1, better 9.2) and tune it properly.
I have somewhat similar load in my server, but with slightly worse R/W ratio: about 10:1. Transactions range from few milliseconds up to 1 hour (and sometimes even more), and one transaction can insert or update up to 100k rows. Total number of concurrent writers with long transactions can reach 10 and more.
So far so good - I don't really have any serious issues, performance is great (certainly not worse than I expected).
What really helps is that my hot working data set almost fits into available memory.
So, give it a try, it should work great for your load.
Have a look at this link. Maximum transaction size in PostgreSQL
Basically there can be some technical limits on the software side to how large your transaction can be.

SQL vs NoSQL for an inventory management system

I am developing a JAVA based web application. The primary aim is to have inventory for products being sold on multiple websites called channels. We will act as manager for all these channels.
What we need is:
Queues to manage inventory updates for each channel.
Inventory table which has a correct snapshot of allocation on each channel.
Keeping Session Ids and other fast access data in a cache.
Providing a facebook like dashboard(XMPP) to keep the seller updated asap.
The solutions i am looking at are postgres(our db till now in a synchronous replication mode), NoSQL solutions like Cassandra, Redis, CouchDB and MongoDB.
My constraints are:
Inventory updates cannot be lost.
Job Queues should be executed in order and preferably never lost.
Easy/Fast development and future maintenance.
I am open to any suggestions. thanks in advance.
Queues to manage inventory updates for each channel.
This is not necessarily a database issue. You might be better off looking at a messaging system(e.g. RabbitMQ)
Inventory table which has a correct snapshot of allocation on each channel.
Keeping Session Ids and other fast access data in a cache.
session data should probably be put in a separate database more suitable for the task(e.g. memcached, redis, etc)
There is no one-size-fits-all DB
Providing a facebook like dashboard(XMPP) to keep the seller updated asap.
My constraints are:
1. Inventory updates cannot be lost.
There are 3 ways to answer this question:
This feature must be provided by your application. The database can guarantee that a bad record is rejected and rolled back, but not guarantee that every query will get entered.
The app will have to be smart enough to recognize when an error happens and try again.
some DBs store records in memory and then flush memory to disk peridocally, this could lead to data loss in the case of a power failure. (e.g Mongo works this way by default unless you enable journaling. CouchDB always appends to the records(even a delete is a flag appended to the record so data loss is extremely difficult))
Some DBs are designed to be extremely reliable, even if an earthquake, hurricane or other natural disaster strikes, they remain durable. these include Cassandra, Hbase, Riak, Hadoop, etc
Which type of durability are your referring to?
Job Queues should be executed in order and preferably never lost.
Most noSQL solutions prefer to run in parallel. so you have two options here.
1. use a DB that locks the entire table for every query(slower)
2. build your app to be smarter or evented(client side sequential queuing)
Easy/Fast development and future maintenance.
generally, you will find that SQL is faster to develop at first, but changes can be harder to implement
noSQL may require a little more planning, but is easier to do ad hoc queries or schema changes.
The questions you probably need to ask yourself are more like:
"Will I need to have intense queries or deep analysis that a Map/Reduce is better suited to?"
"will I need to my change my schema frequently?
"is my data highly relational? in what way?"
"does the vendor behind my chosen DB have enough experience to help me when I need it?"
"will I need special feature such as GeoSpatial indexing, full text search, etc?"
"how close to realtime will I need my data? will it hurt if I don't see the latest records show up in my queries until 1sec later? what level of latency is acceptable?"
"what do I really need in terms of fail-over"
"how big is my data? will it fit in memory? will it fit on one computer? is each individual record large or small?
"how often will my data change? is this an archive?"
If you are going to have multiple customers(channels?) each with their own inventory schemas, a document based DB might have it's advantages. I remember one time I looked at an ecommerce system with inventory and it had almost 235 tables!
Then again, if you have certain relational data, a SQL solution can really have some advantages too.
I can certainly see how I could build a solution using mongo, couch, riak or orientdb with the given constraints. But as for which is the best? I would try talking directly DB vendors, and maybe watch the nosql tapes
Addressing your constraints:
Most NoSQL solutions give you a configurable tradeoff of consistency vs. performance. In MongoDB, for instance, you can decide how durable a write should be. If you want to, you can force the write to be fsync'ed on all your replica set servers. On the other extreme, you can choose to send the command and don't even wait for the server's response.
Executing job queues in order seems to be an application code issue. I'd say a timestamp in the db and an order by type of query should do for most applications. If you have multiple application servers and your queues need to be perfect, you'd have to use a truly distributed algorithm that provides ordering, but that is not a typical requirement, and it's very tricky indeed.
We've been using MongoDB for some time now, and I'm convinced this gives your app development speed a real boost. There's no big difference in maintenance, maintaining data is a pain either way. Not having a schema gives you added flexibility (lazy migrations), but it's more elaborate and requires some care.
In summary, I'd say you can do it both ways. The NoSQL is more code driven, and transactions and relational integrity are mostly managed by your code. If you're uncomfortable with that, go for a relational DB.
However, if you're data grows huge, you'll have to code some of this logic manually because you probably wouldn't want to do real-time joins on a 10B row database. Still, you can implement that with SQL as well.
A good way to find the boundary for different databases is to consider what you can cache. Data that can be cached and reconstructed at any time are a great way to start introducing a new layer, because there's no big risks there. Also, cached data usually doesn't keep any relations so you're not sacrificing any consistency here.
NoSQL is not correct for this application.
I mean, you can use it sure, but you will end up re-implementing a lot of what SQL offers for you. For example I see a lot of relations there. You also want ACID (although some NoSQL solutions do offer that).
There is no reason you can't use both - keep relational data in relational databases, and non-relational data in key/value stores.