I am trying to configure pgpool with PostgreSQL and repmgr, but after configuring it I found it is not working as expected.
When I switch over a standby to primary using repmgr, the switchover works as expected, but the primary connection info in pgpool does not change.
My question is: how can I keep pgpool in sync with repmgr events such as switchover, failover, etc.?
In the screenshot below I ran pcp_watchdog_info, and it shows server1 as leader and server2 and server3 as standby.
But in repmgr, after the switchover, the new primary is server2, and server3 and server4 are standbys.
Why is pgpool still showing server1 as the leader? Though, as I understand the architecture, the watchdog only monitors pgpool status.
Is there any relation between the repmgr primary and the pgpool LEADER?
Need expert opinion, thanks in advance.
If you are making repmgr handle automatic failover, you need to add another step to the follow_command parameter in the repmgr.conf file, as follows:
follow_command='/usr/bin/repmgr standby follow -f /etc/repmgr/15/repmgr.conf --log-to-file --upstream-node-id=%n && /usr/sbin/pcp_attach_node -w -h 172.16.0.100 -U pgpool -p 9898 -n %n'
Where 172.16.0.100 is the VIP of the Pgpool-II cluster.
Why?
Because when the primary node is unavailable and another standby is promoted to be the new primary, repmgr needs to resync the other standby nodes in the cluster with the new primary. It shuts down the PostgreSQL service on each of those standbys before doing this, which detaches the standby node from the Pgpool-II cluster.
So you need to re-attach the node to the Pgpool-II cluster in follow_command.
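After a switchover you can check with pcp_node_info that the standby was re-attached. This is only a sketch, assuming the same VIP, PCP port and PCP user as above; the node id 1 is just a placeholder:
pcp_node_info -w -h 172.16.0.100 -p 9898 -U pgpool -n 1
If the node's status is still shown as down, pcp_attach_node can be run manually with the same connection options.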
We currently have a postgres server on AWS pulling updates from an internal server. We want to stop the replication, but keep the data on both the internal and the AWS server - the internal server is no longer being updated.
Is it just a case of removing the
host replication replicator x.x.x.x/32 md5
line from the pg_hba.conf file on both master and slave and restarting Postgres?
and then running
pg_ctl -D /target_directory promote
on the old slave server to promote the read-only slave to a read/write master.
You should drop the replication slot on the master DB (physical replication):
SELECT pg_drop_replication_slot('your_slot_name');
Then promote the replica server:
/usr/lib/postgresql/12/bin/pg_ctl promote -D /pg_data/12/main/
Edit the pg_hba.conf file, remove the replication line, and run SELECT pg_reload_conf(); on the master.
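If you are not sure of the slot name, you can list the slots on the master first; a minimal check (the exact column list may vary by version):
SELECT slot_name, slot_type, active FROM pg_replication_slots;
The physical slot used by the standby will normally show active = t while the standby is still streaming.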
The primary and slave servers started showing a "requested WAL segment has already been removed" error after I changed the PostgreSQL configs.
So, I decided to restore the backup from the primary server using the following steps:
Shut down the replica server.
Removed PostgreSQL data directory on the replica (/var/lib/postgresql/12/main)
Performed the base backup (sudo -u postgres pg_basebackup -h [PRIMARY_IP] -D /var/lib/postgresql/12/main -U replication -P -v -R), which completed successfully.
Started replica again.
But the error mentioned above is still showing.
Configs on primary and slave:
wal_level = 'replica'
archive_mode = on
archive_command = 'cd .'
max_wal_senders = 48
wal_keep_segments = 50
hot_standby = on
When I ran the pg_basebackup command above, I got these logs:
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 188/92000148 on timeline 1
pg_basebackup: starting background WAL receiver
I'm curious why it says from 188, not 0.
You need a replication slot to keep the primary from removing WAL that is still needed by the standby.
Create the replication slot with pg_create_physical_replication_slot on the primary.
Use the replication slot with the -S option of pg_basebackup.
Make sure primary_slot_name is set in the standby configuration. pg_basebackup's -R option will do that automatically.
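A minimal sketch of the whole procedure, assuming a slot name of standby1_slot (any name works) and the same replication user and data directory as in the question:
-- on the primary
SELECT pg_create_physical_replication_slot('standby1_slot');
# on the replica, with PostgreSQL stopped and the data directory emptied
sudo -u postgres pg_basebackup -h [PRIMARY_IP] -D /var/lib/postgresql/12/main -U replication -P -v -R -S standby1_slot
With -R and -S together, pg_basebackup also records the slot in the generated configuration, so the standby should use it as soon as it starts (check postgresql.auto.conf for primary_slot_name).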
I'm trying to build a custom postgres master and slave setup in Kubernetes.
However, as part of the streaming replication configuration on the master, I have to register the slave's IP address.
echo "host replication replica $IP_SLAVE/24 md5" >> /etc/postgresql/${VERSION}/main/pg_hba.conf
Where the $IP_SLAVE is the ip address of the slave pods.
Likewise, on slave
pg_basebackup -h ${MASTER_HOST} -D /var/lib/postgresql/${VERSION}/main -U replica -v -P
where the $MASTER_HOST is the ip address of the master pod.
That is to say, for every newly created slave replica, the master should always receive the IP address of that slave replica pod.
I'm new to Kubernetes and would like to find a way to do this programmatically if possible. If not, I just want to find out how such a setup is usually done.
The reason I'm doing this is that the existing solutions for such a setup, such as the Helm chart from Bitnami and KubeDB, are not compatible with ARM Kubernetes cluster devices.
Or, if there is an existing solution compatible with ARM, I would be more than happy to know about it.
A PostgreSQL database replication setup has two servers, one master and one slave. For some reason the master's IP address changed, and that address was being used in several places on the slave server. After replacing the old IP address with the new one on the slave server, replication is not working as before. Can someone help resolve this issue?
Following are the steps used in setting up the slave server:
1. Add the master IP address in the pg_hba.conf file for the replication user:
nano /etc/postgresql/11/main/pg_hba.conf
host replication master-IP/24 md5
2. Modify the following lines in the postgresql.conf file of the slave server, where listen_addresses should be the IP of the slave server:
nano /etc/postgresql/11/main/postgresql.conf
listen_addresses = 'localhost,slave-IP'
wal_level = replica
max_wal_senders = 10
wal_keep_segments = 64
3. Take a base backup of the master server by entering its IP:
pg_basebackup -h master-ip -D /var/lib/postgresql/11/main/ -P -U replication --wal-method=fetch
4. Create a recovery file and add the following settings:
standby_mode = 'on'
primary_conninfo = 'host=master-ip port=5432 user=replication password= '
trigger_file = '/tmp/MasterNow'
Below is the error from the log file:
started streaming WAL from primary at A/B3000000 on timeline 2
FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000A000000B3 has already been removed
FATAL: could not connect to the primary server: could not connect to server: Connection timed out
Is the server running on host "master ip" and accepting
TCP/IP connections on port 5432?
record with incorrect prev-link 33018C00/0 at 0/D8E15D18
The standby server was down long enough that the primary server does not have the required transaction log information any more.
There are three remedies:
Set the restore_command parameter in the standby server's recovery configuration to restore WAL segments from the archive (that should be the inverse of archive_command on your primary server), then restart the standby (a configuration sketch follows below).
This is the only option that allows you to recover without rebuilding the standby server from scratch.
Set wal_keep_segments on the primary server high enough that it retains enough WAL to cover the outage.
This won't help you recover now, but it will avoid the problem in the future.
Define a physical replication slot on the primary and put its name in the primary_slot_name parameter in the standby server's recovery configuration.
This won't help you recover now, but it will avoid the problem in the future.
Note: When using replication slots, monitor the replication. Otherwise a standby that is down will lead to WAL segments piling up on the primary, eventually filling up the disk.
All but the first option require that you rebuild your standby with pg_basebackup, because the required WAL information is no longer available.
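For the first option, a minimal sketch of the standby's recovery configuration on PostgreSQL 11 (recovery.conf), assuming the WAL archive is reachable from the standby; the archive path and the connection settings are placeholders carried over from the question:
standby_mode = 'on'
primary_conninfo = 'host=master-ip port=5432 user=replication password= '
restore_command = 'cp /path/to/wal_archive/%f %p'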
host replication master-IP/24 md5
This line is missing a field. The USER field.
listen_addresses = 'localhost,slave-IP'
It is rarely necessary for this to be anything other than '*'. If you don't try to micromanage it, that is one less thing you will need to change. Also, changing wal_keep_segments on the replica doesn't do much unless you are using cascading replication. It needs to be changed on the master.
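A sketch of the corresponding settings, with listen_addresses opened up on the replica and wal_keep_segments set on the master instead (the value 64 is simply carried over from the question):
# postgresql.conf on the slave
listen_addresses = '*'
# postgresql.conf on the master
wal_keep_segments = 64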
pg_basebackup -h master-ip -D /var/lib/postgresql/11/main/ -P -U replication --wal-method=fetch
Did this indicate that it succeeded?
FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000020000000A000000B3 has already been removed
FATAL: could not connect to the primary server: could not connect to server: Connection timed out
Is the server running on host "master ip" and accepting
TCP/IP connections on port 5432?
This is strange. In order to be informed that the file "has already been removed", it necessarily had to have connected. But the next line says it can't connect. It is not unusual to have a misconfiguration that prevents you from connecting, but in that case it wouldn't have been able to connect the first time. Did you change configuration between these two log messages? Is your network connection flaky?
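One way to narrow this down is to repeatedly check basic connectivity from the replica to the primary, for example with pg_isready (host and port taken from the question); intermittent failures would point to a flaky network rather than a pure configuration problem:
pg_isready -h master-ip -p 5432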
I am trying to setup a pgpool loadbalancer for a Postgresql streaming replication cluster.
I am using postgresql-12 and pgpool2-4.1.0 from the Postgresql repo https://apt.postgresql.org/pub/repos/apt/ on Debian 10.2 (latest stable).
I have set up the PostgreSQL cluster with streaming replication using physical slots (not WAL shipping) and everything seems to be working properly. The secondaries connect and replicate data without any issues.
Then I installed the pgpool2-4.1.0 on the same servers. I have made the proper modifications to pgpool.conf according to the pgpool wiki and I have enabled the watchdog process.
When I start pgpool, on all three nodes, I can see that watchdog is working properly, quorum exists and pgpool elects a master (pgpool node) which also enables the virtual IP from the configuration.
I can connect to the postgres backend via pgpool and issue read and write commands successfully.
The problem appears in the pgpool logs; from syslog, I get:
Jan 13 15:10:30 debian10 pgpool[9826]: 2020-01-13 15:10:30: pid 9870: LOG: failed to connect to PostgreSQL server on "pg1:5433", getsockopt() detected error "Connection refused"
Jan 13 15:10:30 debian10 pgpool[9826]: 2020-01-13 15:10:30: pid 9870: LOCATION: pool_connection_pool.c:680
When checking the PID mentioned above, I get the pgpool health check process.
pg1, pg2, pg3 are the database servers listening on all addresses on port 5433; pg1 is the primary.
pgpool listens on 5432.
The database user that is used for the healthcheck is "pgpool", I have verified that I can connect to the database using that user from all hosts on the particular subnet.
When I disable the health check the issue goes away, but that defeats the purpose.
Any ideas?
Turns out it was a name resolution issue involving the /etc/hosts file and postgresql.conf.
Specifically, the /etc/hosts was like this:
vagrant@pg1:~$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 pg1
....
10.10.20.11 pg1
....
And postgresql.conf like this:
....
listen_addresses = 'localhost,10.10.20.11' # what IP address(es) to listen on;
....
So when the health check tried to reach the local node on each machine, it would do so via hostname (pg1, pg2, etc.). With the hosts file above, that resolves to 127.0.1.1, which PostgreSQL doesn't listen on, so the connection would fail (hence the error), and it would then retry with 10.10.20.11, which would succeed. That also explains why there were no errors from health checks of remote hosts.
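A quick way to confirm what the health check sees is to resolve the hostname and try a direct connection to the backend port; just a sketch, assuming the pgpool health check user and the backend port 5433 from the question:
getent hosts pg1
psql -h pg1 -p 5433 -U pgpool -d postgres -c 'SELECT 1'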
I changed the hosts file to the following:
vagrant@pg1:~$ cat /etc/hosts
127.0.0.1 localhost
127.0.1.1 pg1-local
....
10.10.20.11 pg1
....
And the logs are clear.
This is Debian specific, as Red Hat-based distros don't have a
127.0.1.1 hostname
record in their /etc/hosts