How do I model a PostgreSQL failover cluster with Docker/Kubernetes?

I'm still wrapping my head around Kubernetes and how that's supposed to work. Currently, I'm struggling to understand how to model something like a PostgreSQL cluster with streaming replication, scaling out and automatic failover/failback (pgpool-II, repmgr, pick your poison).
My main problem with the approach is the dual nature of a PostgreSQL instance, configuration-wise -- it's either a master or a cold/warm/hot standby. If I increase the number of replicas, I'd expect them all to come up as standbys, so I'd imagine creating a postgresql-standby replication controller separately from a postgresql-master pod. However, I'd also expect one of those standbys to become a master in case the current master goes down, so it would have to be a common postgresql replication controller after all.
The only idea I've had so far is to put the replication configuration on an external volume and manage the state and state changes outside the containers.
(In the case of PostgreSQL, the configuration would probably already live inside its data directory, which itself is obviously something I'd want on a volume, but that's beside the point.)
Is that the correct approach, or is there some other, cleaner way?

There's an example in OpenShift: https://github.com/openshift/postgresql/tree/master/examples/replica. The principle is the same in pure Kubernetes: it's not using anything truly OpenShift-specific, and you can use the images in plain Docker.

You can give PostDock a try, either with docker-compose or Kubernetes. So far I have tried it in our project with docker-compose, with the topology shown below:
pgmaster (primary node1)  --|
|- pgslave1 (node2)       --|
|  |- pgslave2 (node3)    --|----pgpool (master_slave_mode stream)----client
|- pgslave3 (node4)       --|
|- pgslave4 (node5)       --|
I have tested the following scenarios, and they all work very well:
Replication: changes made at the primary (i.e., master) node will be replicated to all standby (i.e., slave) nodes
Failover: stop the primary node, and a standby node (e.g., node4) will automatically take over the primary role.
Prevention of two primary nodes: resurrect the previous primary node (node1); node4 will continue as the primary node, while node1 will be in sync but as a standby node.
As for the client application, these changes are all transparent. The client just points to the pgpool node, and keeps working fine in all the aforementioned scenarios.
Note: In case you have problems getting PostDock up and running, you could try my forked version of PostDock.
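As a rough smoke test of the scenarios above: the pgmaster service name comes from PostDock's sample compose file, while the pgpool port and credentials are assumptions, so adjust them to your setup.
# bring up the primary, the standbys and pgpool
docker-compose up -d

# the client only ever talks to pgpool (port/user are assumptions)
psql -h localhost -p 5432 -U postgres -c 'SELECT 1'

# simulate a primary failure
docker-compose stop pgmaster

# reconnect: pgpool now routes writes to the automatically promoted standby
psql -h localhost -p 5432 -U postgres -c 'SELECT 1'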
Pgpool-II with Watchdog
A problem with the aforementioned architecture is that pgpool is the single point of failure. So I have also tried enabling Watchdog for pgpool-II with a delegated virtual IP, so as to avoid the single point of failure.
master (primary node1)  --\
|- slave1 (node2)       ---\     / pgpool1 (active)  \
|  |- slave2 (node3)    ----|---|                     |----client
|- slave3 (node4)       ---/     \ pgpool2 (standby) /
|- slave4 (node5)       --/
I have tested the following scenarios, and they all work very well:
Normal scenario: both pgpools start up, with the virtual IP automatically applied to one of them, in my case pgpool1.
Failover: shut down pgpool1. The virtual IP will be automatically applied to pgpool2, which hence becomes active.
Start the failed pgpool: start pgpool1 again. The virtual IP will be kept with pgpool2, and pgpool1 will now work as standby.
As for the client application, these changes are all transparent. The client just points to the virtual IP, and keeps working fine in all the aforementioned scenarios.
You can find this project at my GitHub repository on the watchdog branch.
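To check which pgpool node currently holds the delegated virtual IP, pgpool-II ships the pcp_watchdog_info tool; the addresses below are assumptions (9898 is the default PCP port):
# query the watchdog status from either node
pcp_watchdog_info -h 192.168.0.100 -p 9898 -U pgpool

# the key pgpool.conf settings behind this setup, mirrored on both pgpool nodes:
#   use_watchdog           = on
#   delegate_IP            = '192.168.0.100'   # the virtual IP the client points to
#   wd_hostname            = 'pgpool1'         # this node's own address
#   other_pgpool_hostname0 = 'pgpool2'         # the peer pgpool node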

Kubernetes' StatefulSet is a good base for setting up a stateful service. You will still need some work to configure the correct membership among the PostgreSQL replicas.
The Kubernetes blog has an example: http://blog.kubernetes.io/2017/02/postgresql-clusters-kubernetes-statefulsets.html
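As a minimal sketch of that base (image, labels and storage size are assumptions; a real cluster still needs an operator or a side-car such as Patroni on top for failover), a StatefulSet gives each replica a stable name (postgres-0, postgres-1, ...) and its own volume, which is exactly the stable identity the membership configuration needs:
cat <<'EOF' | kubectl apply -f -
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres      # a matching headless Service is assumed to exist
  replicas: 3                # e.g. postgres-0 as primary, the rest as standbys
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:10
        ports:
        - containerPort: 5432
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:      # one persistent volume per replica
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 10Gi
EOF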

You can look at one of the following open-source PostgreSQL tools:
1. Crunchy Data PostgreSQL Operator
2. Patroni

Related

Postgres failback after failover

We are using streaming replication in a Postgres v10 system. We have managed to get the failover from Primary to Standby to work correctly, but we cannot work out how to fail back to the original setup.
Our setup may be a bit unusual, as each server is in a different data-center. Failover consists of moving not only the primary database to a different data-center, but also the associated client applications. So, for speed/bandwidth reasons, it is important that after a failover we can revert (fail back) to our normal configuration.
Our setup is as follows:
Primary1 ----replicate---> Standby1 ---replicate---> HotStandby2
As stated, we can kill the master and fail over to the following situation:
Standby1 ---replicate---> HotStandby2
We would now like to fail back, so we build a new database server called Primary2 and set it up as a replica of Standby1:
Standby1 ---replicate---> HotStandby2
        \---replicate---> Primary2
After a while Primary2 is in sync with Standby1, so we would like to promote Primary2 to be the active server, and then reconfigure Standby1 to be a replica of Primary2 to get back to the original state:
Primary2 ----replicate---> Standby1 ---replicate---> HotStandby2
It is easy to promote Primary2 to be active, but we cannot find any way of moving Standby1 from an active state back to a standby state.
We have tried:
ensuring that Primary2 is in sync with Standby1
changing the recovery.done back to recovery.conf
adding recovery_target_timeline = 'latest' to the recovery.conf
setting up new replication slots
using pg_rewind
using pg_rewind along with archived WAL files
Most of our attempts result in Standby1 still running as a master, and some strange messages in the log files about diverging timelines.
Does anyone have any ideas, or can you point us to some articles, tutorials, etc.?
Is this even possible in Postgres? Or should we rebuild Standby1, and if so, what about HotStandby2?
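For context, demoting a promoted node back to standby with pg_rewind on v10 generally has the following shape. This is only a sketch: the hostnames and users are assumptions, and pg_rewind requires wal_log_hints = on (or data checksums) on the target, plus a clean shutdown.
# on Standby1, after Primary2 has been promoted
pg_ctl -D "$PGDATA" stop -m fast          # the rewind target must be shut down cleanly

# rewind Standby1's data directory against the new primary
pg_rewind --target-pgdata="$PGDATA" \
          --source-server="host=primary2 user=postgres dbname=postgres"

# recreate recovery.conf (PostgreSQL 10) so the node starts as a standby
cat > "$PGDATA/recovery.conf" <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=primary2 user=replicator'
recovery_target_timeline = 'latest'
EOF

pg_ctl -D "$PGDATA" start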

Replicate via pglogical on a hot_standby setup

I am running two databases (PostgreSQL 9.5.7) in a master/slave setup. My application connects to a pgpool instance which routes to the master database (and to the slave for read-only queries).
Now I am trying to scale out some data to another read-only database instance containing only a few tables.
This works perfectly using pglogical directly on the master database.
However, if the master transitions to slave for some reason, pglogical can no longer replicate because the node is in standby.
I tried the following things:
subscribing on the slave, since it's less likely to go down or be overloaded: can't replicate on a standby node.
subscribing via the pgpool server: pgpool doesn't accept replication connections.
subscribing to both servers: the pglogical config gets replicated along, so I can't give them different node names.
The only thing I can think of now is to write my own TCP proxy which regularly checks the state of the servers and decides which one to subscribe to.
Is there any other/easier way I can solve this?
Am I using the wrong tools, perhaps?
OK, so it seems that there are no solutions for this problem just yet.
Since the data in my logically replicated database is not changing fast, there is no harm if the replication stops for a moment.
Actions on failover could be:
re-subscribe to the promoted master (as sketched below), or
promote the standby node back to master after failover.
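The re-subscribe step can be scripted against pglogical's own functions; the subscription name, database and hostnames below are made up for illustration:
# after failover, drop the old subscription and re-point it at the promoted master
psql -h subscriber -d mydb -c \
  "SELECT pglogical.drop_subscription('my_sub', true);"
psql -h subscriber -d mydb -c \
  "SELECT pglogical.create_subscription(
      subscription_name := 'my_sub',
      provider_dsn := 'host=new-master dbname=mydb user=replicator');"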

ejabberd cluster: Multi-master or Master-slave

So far, what I've come across is this:
Setting up an ejabberd cluster in a master-slave configuration, there would be a single point of failure, and people have experienced issues where, even after fixing the master (if it goes down), the cluster doesn't become operable again. Also, sometimes the ejabberd instances of every slave have to be revisited to get them working properly, or mnesia commands have to be entered again to make the master communicate with the slaves.
Setting up an ejabberd cluster in a multi-master configuration, any of the nodes can be taken out of the cluster without bringing the whole cluster down. Basically, there is no single point of failure, and this is also the way the official ejabberd documentation tells you to do it, via the join_cluster argument exposed in the ejabberdctl script. HOWEVER, in this case all the data is replicated across both nodes, which is a big performance overhead in my opinion.
So it boils down to this.
What is the best/recommended/popular mode in which an ejabberd cluster of 2 nodes should be set up, mostly with respect to performance, but keeping other critical factors (fault tolerance, load balancing) in mind as well?
There is only a single mode in ejabberd. Basically, it works like what you describe as multi-master. Master-slave would basically be the same setup, just without any traffic being sent to the second node by the load-balancing mechanism.
So case 2 is the way to go.
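For reference, joining the second node to the cluster is a one-liner with ejabberdctl; the Erlang node name here is an assumption and must match the first node's actual node name:
# run on the second node while the first node (ejabberd@node1) is up
ejabberdctl join_cluster ejabberd@node1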

How does kubernetes replication controller handle data?

If I set up a replication controller for something like a database, how does it keep the data in the replicas in sync? If one of the replicas goes down, how does it bring it back up with the latest data?
A replication controller ensures that the desired number of pods with the same template are kept running in the system. The replication controller itself does not know anything about what it is running, and doesn't have any special hooks for containers running databases. This means that if you want to run a container with a database with more than one replica, then it is easiest to run a database that can natively do replication and discovery (possibly with the injection of some environment variables).
An alternative is to run a pod with two containers, where one container is a vanilla database, and the second "side-car" container is used to implement the necessary replication / synchronization / master election or whatever extra functionality you need to provide to make the database run in a clustered environment. This is more flexible (you can run a database that wasn't initially designed to run in a clustered environment) but also requires more custom work to make it scale.
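A minimal sketch of that side-car pattern is below; the side-car image is hypothetical, standing in for whatever watches the primary and rewrites the replication config on failover:
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: db-with-sidecar
spec:
  containers:
  - name: database
    image: postgres:10                       # the vanilla database container
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data
  - name: replication-manager                # hypothetical side-car image
    image: example/replication-sidecar:latest
    volumeMounts:
    - name: data                             # shares the data directory so it
      mountPath: /var/lib/postgresql/data    # can manage the replication config
  volumes:
  - name: data
    emptyDir: {}
EOF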

MongoDB share-nothing slaves

I'd like to use mongodb to distribute a cached database to some distributed worker nodes I'll be firing up in EC2 on demand. When a node goes up, a local copy of mongo should connect to a master copy of the database (say, mongomaster.mycompany.com) and pull down a fresh copy of the database. It should continue to replicate changes from the master until the node is shut down and released from the pool.
The requirements are that the master need not know about each individual slave being fired up, nor should the slave have any knowledge of other nodes outside the master (mongomaster.mycompany.com).
The slave should be read-only; the master will be the only node accepting writes (and never from one of these EC2 nodes).
I've looked into replica sets, and this doesn't seem to be possible. I've done something similar to this before with a master/slave setup, but it was unreliable. The master/slave replication was prone to sudden catastrophic failure.
Regarding replica sets: while I don't imagine you could have a set member invisible to the primary (and the other nodes), due to the need for replication, you can tailor a particular node to come pretty close to what you want:
Set the newly-launched node to priority 0 (meaning it cannot become primary)
Set the newly-launched node to 'hidden'
See the MongoDB documentation for more info on priority-0 and hidden members.
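Adding such a node from the primary looks like the sketch below; the new member's hostname and port are assumptions:
# run against the current primary: the new member can never become primary
# (priority: 0) and is invisible to client reads (hidden: true)
mongo --host mongomaster.mycompany.com --eval '
  rs.add({ host: "node.example.com:27017", priority: 0, hidden: true })
'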