How to resolve "error reading result of streaming command"? (PostgreSQL)

I have a master database doing logical replication with a publication and a slave database subscribing to that publication. It is on the slave that I am occasionally getting the following error:
ERROR: error reading result of streaming command:
LOG: logical replication table synchronization worker for subscription ABC, table XYZ
How do I stop the above error from happening?
Here is the relevant portion of the log as text:
2020-11-25 06:50:51.736 UTC [91572] LOG: background worker "logical replication worker" (PID 96504) exited with exit code 1
2020-11-25 06:50:51.740 UTC [96505] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "contractservices_devicekioskrating" has started
2020-11-25 06:50:52.197 UTC [96505] ERROR: error reading result of streaming command:
2020-11-25 06:50:52.200 UTC [91572] LOG: background worker "logical replication worker" (PID 96505) exited with exit code 1
2020-11-25 06:50:52.203 UTC [96506] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "workorders_sectorbranchinformation" has started
2020-11-25 06:50:52.286 UTC [96506] ERROR: error reading result of streaming command:
2020-11-25 06:50:52.288 UTC [91572] LOG: background worker "logical replication worker" (PID 96506) exited with exit code 1
2020-11-25 06:50:52.292 UTC [96507] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "contractservices_kioskstatetransitions" has started
2020-11-25 06:52:14.887 UTC [96339] ERROR: error reading result of streaming command:
2020-11-25 06:52:14.896 UTC [91572] LOG: background worker "logical replication worker" (PID 96339) exited with exit code 1
2020-11-25 06:52:14.900 UTC [96543] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "contractservices_sensordatafeed" has started
2020-11-25 06:52:21.385 UTC [96507] ERROR: error reading result of streaming command:
2020-11-25 06:52:21.393 UTC [91572] LOG: background worker "logical replication worker" (PID 96507) exited with exit code 1
2020-11-25 06:52:21.397 UTC [96547] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "contractservices_sitemappoint" has started
2020-11-25 06:52:21.523 UTC [96547] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "contractservices_sitemappoint" has finished
2020-11-25 06:52:21.528 UTC [96548] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "core_event" has started
2020-11-25 06:55:35.401 UTC [96543] ERROR: error reading result of streaming command:
2020-11-25 06:55:35.408 UTC [91572] LOG: background worker "logical replication worker" (PID 96543) exited with exit code 1
2020-11-25 06:55:35.412 UTC [96642] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "contractservices_doorevents" has started
2020-11-25 06:56:43.633 UTC [96642] ERROR: error reading result of streaming command:
2020-11-25 06:56:43.641 UTC [91572] LOG: background worker "logical replication worker" (PID 96642) exited with exit code 1
2020-11-25 06:56:43.644 UTC [96678] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "workorders_sectorbranchinformation" has started
2020-11-25 06:56:43.776 UTC [96678] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "workorders_sectorbranchinformation" has finished
2020-11-25 06:56:43.782 UTC [96679] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "core_batteryhistory" has started
2020-11-25 06:57:04.166 UTC [96679] ERROR: error reading result of streaming command:
2020-11-25 06:57:04.174 UTC [91572] LOG: background worker "logical replication worker" (PID 96679) exited with exit code 1
2020-11-25 06:57:04.178 UTC [96685] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "contractservices_attendantvisittime" has started
2020-11-25 06:57:06.100 UTC [96685] ERROR: error reading result of streaming command:
2020-11-25 06:57:06.160 UTC [91572] LOG: background worker "logical replication worker" (PID 96685) exited with exit code 1
2020-11-25 06:57:06.164 UTC [96693] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "contractservices_kioskstatetransitions" has started
2020-11-25 06:59:50.375 UTC [96548] ERROR: error reading result of streaming command:
2020-11-25 06:59:50.382 UTC [91572] LOG: background worker "logical replication worker" (PID 96548) exited with exit code 1
2020-11-25 06:59:50.389 UTC [96755] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "contractservices_sensordatafeed" has started
2020-11-25 07:00:56.844 UTC [96693] ERROR: error reading result of streaming command:
2020-11-25 07:00:56.852 UTC [91572] LOG: background worker "logical replication worker" (PID 96693) exited with exit code 1
2020-11-25 07:00:56.856 UTC [96779] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "workorders_wastestream" has started
2020-11-25 07:00:57.391 UTC [96779] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "workorders_wastestream" has finished
2020-11-25 07:00:57.397 UTC [96780] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "core_event" has started
2020-11-25 07:02:39.650 UTC [96755] ERROR: error reading result of streaming command:
2020-11-25 07:02:39.658 UTC [91572] LOG: background worker "logical replication worker" (PID 96755) exited with exit code 1
2020-11-25 07:02:39.662 UTC [96824] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "contractservices_devicekioskrating" has started
2020-11-25 07:02:40.276 UTC [96824] ERROR: error reading result of streaming command:
2020-11-25 07:02:40.279 UTC [91572] LOG: background worker "logical replication worker" (PID 96824) exited with exit code 1
2020-11-25 07:02:40.283 UTC [96825] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "contractservices_kioskstatetransitions" has started
2020-11-25 07:04:07.222 UTC [96825] ERROR: error reading result of streaming command:
2020-11-25 07:04:07.230 UTC [91572] LOG: background worker "logical replication worker" (PID 96825) exited with exit code 1
2020-11-25 07:04:07.234 UTC [96862] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "contractservices_attendantvisit" has started
2020-11-25 07:04:49.971 UTC [96862] ERROR: error reading result of streaming command:
2020-11-25 07:04:49.978 UTC [91572] LOG: background worker "logical replication worker" (PID 96862) exited with exit code 1
2020-11-25 07:04:50.432 UTC [97013] LOG: logical replication table synchronization worker for subscription "snf5_cba_isp_db_staging_app1_srv_sub", table "core_batteryhistory" has started
Despite this error, in PostgreSQL v13.0 the tables on the slave database seem to be replicating okay. However, I would like to resolve this error.
I also tried installing PostgreSQL v13.1 and noticed that I still get this error, and that replication does not work okay there.
I found this post:
https://www.postgresql-archive.org/BUG-16643-PG13-Logical-replication-initial-startup-never-finishes-and-gets-stuck-in-startup-loop-td6156051.html
The poster there (Henry Hinze) said it was a bug and that it was fixed by installing version 13 RC1.
But my experience was the reverse: with PostgreSQL v13.0 it was not getting stuck in the startup loop, but after installing PostgreSQL v13.1 it was.
I can confirm that I am using postgresql version 13.1 as /usr/lib/postgresql/13/bin/postgres -V gives me the following output:
postgres (PostgreSQL) 13.1 (Ubuntu 13.1-1.pgdg18.04+1)
I am using Ubuntu v18.04.
I have uninstalled postgresql completely and reinstalled it and it has not resolved the issue.
The postgresql.conf settings on the slave are the default settings.
The relevant postgresql.conf settings on the master are as follows:
wal_level = logical
checkpoint_timeout = 5min
max_wal_size = 1GB
min_wal_size = 80MB
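For reference, the progress of the synchronization workers can be watched from SQL on both servers; this is only a minimal sketch using the standard v13 views, with no names specific to this setup:
-- on the master: wal senders and the replication slots created by the subscription
SELECT application_name, state, sent_lsn, replay_lsn FROM pg_stat_replication;
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;
-- on the slave: apply/sync workers and per-table state ('d' = copying, 's' = synced, 'r' = ready)
SELECT subname, pid, received_lsn, latest_end_lsn FROM pg_stat_subscription;
SELECT srrelid::regclass AS rel, srsubstate FROM pg_subscription_rel;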

Related

Invalid resource manager ID in primary checkpoint record

I've updated my Airbyte image from 0.35.2-alpha to 0.35.37-alpha.
[running in Kubernetes]
When the system rolled out, the db pod wouldn't terminate and I (a terrible mistake) deleted the pod.
When it came back up, I got this error -
PostgreSQL Database directory appears to contain a database; Skipping initialization
2022-02-24 20:19:44.065 UTC [1] LOG: starting PostgreSQL 13.6 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
2022-02-24 20:19:44.065 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2022-02-24 20:19:44.065 UTC [1] LOG: listening on IPv6 address "::", port 5432
2022-02-24 20:19:44.071 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-02-24 20:19:44.079 UTC [21] LOG: database system was shut down at 2022-02-24 20:12:55 UTC
2022-02-24 20:19:44.079 UTC [21] LOG: invalid resource manager ID in primary checkpoint record
2022-02-24 20:19:44.079 UTC [21] PANIC: could not locate a valid checkpoint record
2022-02-24 20:19:44.530 UTC [1] LOG: startup process (PID 21) was terminated by signal 6: Aborted
2022-02-24 20:19:44.530 UTC [1] LOG: aborting startup due to startup process failure
2022-02-24 20:19:44.566 UTC [1] LOG: database system is shut down
Pretty sure the WAL file is corrupted, but I'm not sure how to fix this.
Warning - there is a potential for data loss
This is a test system, so I wasn't concerned with keeping the latest transactions, and had no backup.
First I overrode the container command to keep the container running but not try to start postgres.
...
spec:
  containers:
  - name: airbyte-db-container
    image: airbyte/db
    command: ["sh"]
    args: ["-c", "while true; do echo $(date -u) >> /tmp/run.log; sleep 5; done"]
...
And spawned a shell on the pod -
kubectl exec -it -n airbyte airbyte-db-xxxx -- sh
Run pg_resetwal
# dry-run first
pg_resetwal --dry-run /var/lib/postgresql/data/pgdata
Success!
pg_resetwal /var/lib/postgresql/data/pgdata
Write-ahead log reset
Then removed the temp command in the container, and postgres started up correctly!

ERROR: invalid logical replication message type "T"

I am getting the error below from PostgreSQL 10.3 logical replication.
Setup
Master (publisher): PostgreSQL 12.3
Logical replica (subscriber): PostgreSQL 10.3
Logs
2021-03-22 13:06:57.332 IST # 25929 LOG: checkpoints are occurring too frequently (22 seconds apart)
2021-03-22 13:06:57.332 IST # 25929 HINT: Consider increasing the configuration parameter "max_wal_size".
2021-03-22 14:34:21.263 IST # 21461 ERROR: invalid logical replication message type "T"
2021-03-22 14:34:21.315 IST # 3184 LOG: logical replication apply worker for subscription "elk_subscription_133" has started
2021-03-22 14:34:21.367 IST # 3184 ERROR: invalid logical replication message type "T"
2021-03-22 14:34:21.369 IST # 25921 LOG: worker process: logical replication worker for subscription 84627 (PID 3184) exited with exit code 1
2021-03-22 14:34:22.259 IST # 25921 LOG: worker process: logical replication worker for subscription 84627 (PID 21461) exited with exit code 1
2021-03-22 14:34:27.281 IST # 3187 LOG: logical replication apply worker for subscription "elk_subscription_133" has started
2021-03-22 14:34:27.311 IST # 3187 ERROR: invalid logical replication message type "T"
2021-03-22 14:34:27.313 IST # 25921 LOG: worker process: logical replication worker for subscription 84627 (PID 3187) exited with exit code 1
2021-03-22 14:34:32.336 IST # 3188 LOG: logical replication apply worker for subscription "elk_subscription_133" has started
2021-03-22 14:34:32.362 IST # 3188 ERROR: invalid logical replication message type "T"
The documentation describes message T:
Truncate
      Byte1('T')
              Identifies the message as a truncate message.
Support for replicating TRUNCATE was added in v11, so your v12 primary sends these messages, but a v10 subscriber does not understand them.
You will have to remove the table from the publication, refresh the subscription, truncate the table manually, add it to the publication and refresh the subscription again.
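A sketch of those steps in SQL, with placeholder publication, subscription and table names (the publication changes run on the primary; the refreshes and the manual TRUNCATE run on the subscriber):
-- on the primary: stop publishing the table
ALTER PUBLICATION mypub DROP TABLE mytable;
-- on the subscriber: pick up the change, then truncate manually
ALTER SUBSCRIPTION mysub REFRESH PUBLICATION;
TRUNCATE mytable;
-- on the primary: publish the table again
ALTER PUBLICATION mypub ADD TABLE mytable;
-- on the subscriber: refresh again so the table is re-synchronized
ALTER SUBSCRIPTION mysub REFRESH PUBLICATION;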
To avoid the problem in the future, either avoid TRUNCATE on the primary or change the publication so that truncations are not published:
ALTER PUBLICATION name SET (publish = 'insert, update, delete');

PostgreSQL 9.4.1 Switchover & Switchback without recover_target_timeline=latest

I have tested different scenarios for doing switchover and switchback in PostgreSQL version 9.4.1.
Scenario 1:- PostgreSQL Switchover and Switchback in 9.4.1
Scenario 2:- Is it mandatory parameter recover_target_timeline='latest' in switchover and switchback in PostgreSQL 9.4.1?
Scenario 3:- On this page
To test scenario 3 I followed the steps below.
1) Stop the application connected to the primary server.
2) Confirm that all applications were stopped and all threads were disconnected from the primary DB.
#192.x.x.129(Primary)
3) Cleanly shut down the primary using
pg_ctl -D $PGDATA stop -m fast
# On the DR (192.x.x.128) side, check the sync status:
postgres=# select pg_last_xlog_receive_location(),pg_last_xlog_replay_location();
-[ RECORD 1 ]-----------------+-----------
pg_last_xlog_receive_location | 4/57000090
pg_last_xlog_replay_location | 4/57000090
4) Stop the DR server (192.x.x.128):
pg_ctl -D $PGDATA stop -mf
pg_log:
2019-12-02 13:16:09 IST LOG: received fast shutdown request
2019-12-02 13:16:09 IST LOG: aborting any active transactions
2019-12-02 13:16:09 IST LOG: shutting down
2019-12-02 13:16:09 IST LOG: database system is shut down
#192.x.x.128(DR)
5) Make the following change on the DR server:
mv recovery.conf recovery.conf_bkp
6) Make the following changes on 192.x.x.129 (Primary):
[postgres@localhost data]$ cat recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=replication password=postgres host=192.x.x.128 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
restore_command = 'cp %p /home/postgres/restore/%f'
trigger_file='/tmp/promote'
7) Start the DR in read-write mode:
pg_ctl -D $PGDATA start
pg_log:
2019-12-02 13:20:21 IST LOG: database system was shut down in recovery at 2019-12-02 13:16:09 IST
2019-12-02 13:20:22 IST LOG: database system was not properly shut down; automatic recovery in progress
2019-12-02 13:20:22 IST LOG: consistent recovery state reached at 4/57000090
2019-12-02 13:20:22 IST LOG: invalid record length at 4/57000090
2019-12-02 13:20:22 IST LOG: redo is not required
2019-12-02 13:20:22 IST LOG: database system is ready to accept connections
2019-12-02 13:20:22 IST LOG: autovacuum launcher started
We can see in the above log that the old primary becomes the DR of the new primary (which was the old DR), and no error appears because the timeline ID on the new primary is the same one that already exists on the new DR.
8) Start the old primary in read-only (standby) mode:
pg_ctl -D $PGDATA start
logs:
2019-12-02 13:24:50 IST LOG: database system was shut down at 2019-12-02 11:14:50 IST
2019-12-02 13:24:51 IST LOG: entering standby mode
cp: cannot stat ‘pg_xlog/RECOVERYHISTORY’: No such file or directory
cp: cannot stat ‘pg_xlog/RECOVERYXLOG’: No such file or directory
2019-12-02 13:24:51 IST LOG: consistent recovery state reached at 4/57000090
2019-12-02 13:24:51 IST LOG: record with zero length at 4/57000090
2019-12-02 13:24:51 IST LOG: database system is ready to accept read only connections
2019-12-02 13:24:51 IST LOG: started streaming WAL from primary at 4/57000000 on timeline 9
2019-12-02 13:24:51 IST LOG: redo starts at 4/57000090
Question 1: In this scenario I have performed only the switchover to show you; using this method we can do both switchover and switchback. But if switchover and switchback already work with this method, why did the PostgreSQL community introduce recovery_target_timeline=latest and apply patches (see this blog: https://www.enterprisedb.com/blog/switchover-switchback-in-postgresql-9-3) from PostgreSQL 9.3 up to the latest version?
Question 2: What does the message cp: cannot stat ‘pg_xlog/RECOVERYHISTORY’: No such file or directory in the above log mean?
Question 3: I want to know which of scenario 1 and scenario 3 is the correct way to do switchover and switchback, since scenario 2 fails with an error because recovery_target_timeline=latest must be used, as all community experts know.
Answers:
1. If you shut down the standby cleanly, then remove recovery.conf and restart it, it will come up, but it has to perform crash recovery (database system was not properly shut down).
The proper way to promote a standby to a primary is to use the trigger file or to run pg_ctl promote (or, from v12 on, the SQL function pg_promote). Then you have no downtime and don't need to perform crash recovery.
Promoting the standby makes it pick a new timeline, so you need recovery_target_timeline = 'latest' if you want the new standby to follow that timeline switch.
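A sketch of both points, reusing the host and trigger file from the question and assuming the rest: promote the standby instead of just removing recovery.conf, and give the server that becomes the new standby a recovery.conf that follows the timeline switch.
# on the standby being promoted (9.4): touch the trigger file, or run
pg_ctl -D $PGDATA promote
# (from v12 on, "SELECT pg_promote();" from SQL does the same)

# recovery.conf on the server that becomes the new standby
standby_mode = 'on'
primary_conninfo = 'user=replication password=postgres host=192.x.x.128 port=5432'
recovery_target_timeline = 'latest'   # follow the timeline switch made by the promotion
trigger_file = '/tmp/promote'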
2. Those cp: cannot stat messages are caused by your restore_command.
3. The method shown in answer 1 above is the correct one.

Postgresql 10 logical replication not working

I installed PostgreSQL 10 using these commands:
$ wget -q https://www.postgresql.org/media/keys/ACCC4CF8.asc -O - | sudo apt-key add -
$ sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main" >> /etc/apt/sources.list.d/pgdg.list'
$ sudo apt-get update
$ sudo apt-get install postgresql postgresql-contrib
On the master server (xx.xxx.xxx.xx), in postgresql.conf I set:
wal_level = logical
On the slave server, in postgresql.conf I set:
wal_level = logical
After that I ran the following queries on the master server:
create table t1 (id integer primary key, val text);
create user replicant with replication;
grant select on t1 to replicant;
insert into t1 (id, val) values (10, 'ten'), (20, 'twenty'), (30, 'thirty');
create publication pub1 for table t1;
And on the slave server:
create table t1 (id integer primary key, val text, val2 text);
create subscription sub1 connection 'dbname=dbsrc user=replicant' publication pub1;
But the problem I am facing is that the tables are not syncing: when I insert a new row on the master server, the slave server does not receive that row.
I am new to PostgreSQL, please help me.
Thanks for your precious time.
Here is my PostgreSQL log from the master server:
2017-10-17 11:06:16.644 UTC [10713] replicant#postgres LOG: starting logical decoding for slot "sub_1"
2017-10-17 11:06:16.644 UTC [10713] replicant#postgres DETAIL: streaming transactions committing after 1/F45EB0C8, reading WAL from 1/F45EB0C8
2017-10-17 11:06:16.645 UTC [10713] replicant#postgres LOG: logical decoding found consistent point at 1/F45EB0C8
2017-10-17 11:06:16.645 UTC [10713] replicant#postgres DETAIL: There are no running transactions.
Here is my slave server postgresql log:
2017-10-17 19:14:45.622 CST [7820] WARNING: out of logical replication worker slots
2017-10-17 19:14:45.622 CST [7820] HINT: You might need to increase max_logical_replication_workers.
2017-10-17 19:14:45.670 CST [7821] WARNING: out of logical replication worker slots
2017-10-17 19:14:45.670 CST [7821] HINT: You might need to increase max_logical_replication_workers.
2017-10-17 19:14:45.680 CST [7822] WARNING: out of logical replication worker slots
2017-10-17 19:14:45.680 CST [7822] HINT: You might need to increase max_logical_replication_workers.
2017-10-17 19:14:50.865 CST [7820] WARNING: out of logical replication worker slots
2017-10-17 19:14:50.865 CST [7820] HINT: You might need to increase max_logical_replication_workers.
2017-10-17 19:14:50.917 CST [7821] WARNING: out of logical replication worker slots
2017-10-17 19:14:50.917 CST [7821] HINT: You might need to increase max_logical_replication_workers.
2017-10-17 19:14:50.928 CST [7822] WARNING: out of logical replication worker slots
2017-10-17 19:14:50.928 CST [7822] HINT: You might need to increase max_logical_replication_workers.
2017-10-17 19:14:55.871 CST [7820] WARNING: out of logical replication worker slots
2017-10-17 19:14:55.871 CST [7820] HINT: You might need to increase max_logical_replication_workers.
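Following the hint, the worker limit can be raised in postgresql.conf on the slave; a minimal sketch (the values are assumptions, and both settings require a restart):
max_logical_replication_workers = 8   # default is 4
max_worker_processes = 16             # logical replication workers come out of this pool; default is 8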
And after increasing max_logical_replication_workers I am getting this:
2017-10-17 19:44:45.898 CST [7987] LOG: logical replication table synchronization worker for subscription "sub2", table "t1" has started
2017-10-17 19:44:45.982 CST [7988] LOG: logical replication table synchronization worker for subscription "myadav_test", table "test_replication" h$
2017-10-17 19:44:45.994 CST [7989] LOG: logical replication table synchronization worker for subscription "sub3", table "t1" has started
2017-10-17 19:44:48.621 CST [7987] ERROR: could not start initial contents copy for table "staging.t1": ERROR: permission denied for schema staging
2017-10-17 19:44:48.623 CST [7962] LOG: worker process: logical replication worker for subscription 20037 sync 20027 (PID 7987) exited with exit co$
2017-10-17 19:44:48.705 CST [7988] ERROR: could not start initial contents copy for table "staging.test_replication": ERROR: permission denied for$
2017-10-17 19:44:48.707 CST [7962] LOG: worker process: logical replication worker for subscription 20025 sync 20016 (PID 7988) exited with exit co$
2017-10-17 19:44:48.717 CST [7989] ERROR: duplicate key value violates unique constraint "t1_pkey"
2017-10-17 19:44:48.717 CST [7989] DETAIL: Key (id)=(10) already exists.
2017-10-17 19:44:48.717 CST [7989] CONTEXT: COPY t1, line 1
2017-10-17 19:44:48.718 CST [7962] LOG: worker process: logical replication worker for subscription 20038 sync 20027 (PID 7989) exited with exit co$
2017-10-17 19:44:51.629 CST [8008] LOG: logical replication table synchronization worker for subscription "sub2", table "t1" has started
2017-10-17 19:44:51.712 CST [8009] LOG: logical replication table synchronization worker for subscription "myadav_test", table "test_replication" h$
2017-10-17 19:44:51.722 CST [8010] LOG: logical replication table synchronization worker for subscription "sub3", table "t1" has started
Now I finally realize that logical replication is working for the postgres database but not for my other database on the same server. I am getting a permission issue on a schema, as shown in the log.
The row changes are applied using the rights of the user who owns the subscription. By default that's the user who created the subscription.
So make sure the subscription is owned by a user with sufficient rights. Grant needed rights to tables, or if you can't be bothered, make the subscription owned by a superuser who has full rights to everything.
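A sketch of the two options, run on the subscriber; the schema name staging and the subscription name sub2 are taken from the log, while sub_owner is a placeholder for whatever role owns the subscription:
-- option 1: give the subscription owner the rights it needs on the target schema
GRANT USAGE ON SCHEMA staging TO sub_owner;
GRANT SELECT, INSERT, UPDATE, DELETE, TRUNCATE ON ALL TABLES IN SCHEMA staging TO sub_owner;
-- option 2: hand the subscription over to a superuser
ALTER SUBSCRIPTION sub2 OWNER TO postgres;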
See:
CREATE SUBSCRIPTION
logical replication - security
logical replication

Postgres Replication with pglogical: ERROR: connection to other side has died

Got this error (on replica) while replicating between 2 Postgres instances:
ERROR: connection to other side has died
Here are the logs on the replica/subscriber:
2017-09-15 20:03:55 UTC [14335-3] LOG: apply worker [14335] at slot 7 generation 109 crashed
2017-09-15 20:03:55 UTC [2961-1732] LOG: worker process: pglogical apply 16384:3661733826 (PID 14335) exited with exit code 1
2017-09-15 20:03:59 UTC [14331-2] ERROR: connection to other side has died
2017-09-15 20:03:59 UTC [14331-3] LOG: apply worker [14331] at slot 2 generation 132 crashed
2017-09-15 20:03:59 UTC [2961-1733] LOG: worker process: pglogical apply 16384:3423246629 (PID 14331) exited with exit code 1
2017-09-15 20:04:02 UTC [14332-2] ERROR: connection to other side has died
2017-09-15 20:04:02 UTC [14332-3] LOG: apply worker [14332] at slot 4 generation 125 crashed
2017-09-15 20:04:02 UTC [2961-1734] LOG: worker process: pglogical apply 16384:2660030132 (PID 14332) exited with exit code 1
2017-09-15 20:04:02 UTC [14350-1] LOG: starting apply for subscription parking_sub
2017-09-15 20:04:05 UTC [14334-2] ERROR: connection to other side has died
2017-09-15 20:04:05 UTC [14334-3] LOG: apply worker [14334] at slot 6 generation 119 crashed
2017-09-15 20:04:05 UTC [2961-1735] LOG: worker process: pglogical apply 16384:394989729 (PID 14334) exited with exit code 1
2017-09-15 20:04:06 UTC [14333-2] ERROR: connection to other side has died
Logs on master/provider:
2017-09-15 23:22:43 UTC [22068-5] repuser#ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:43 UTC [22068-6] repuser#ga-master LOG: could not receive data from client: Connection reset by peer
2017-09-15 23:22:44 UTC [22067-5] repuser#ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:44 UTC [22067-6] repuser#ga-master LOG: could not receive data from client: Connection reset by peer
2017-09-15 23:22:48 UTC [22070-5] repuser#ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:48 UTC [22070-6] repuser#ga-master LOG: could not receive data from client: Connection reset by peer
2017-09-15 23:22:49 UTC [22069-5] repuser#ga-master ERROR: got sequence entry 1 for toast chunk 1703536315 instead of seq 0
2017-09-15 23:22:49 UTC [22069-6] repuser#ga-master LOG: could not receive data from client: Connection reset by peer
Config on master/provider:
archive_mode = on
archive_command = 'cp %p /data/pgdata/wal_archives/%f'
max_wal_senders = 20
wal_level = logical
max_worker_processes = 100
max_replication_slots = 100
shared_preload_libraries = pglogical
max_wal_size = 20GB
Config on the replica/subscriber:
max_replication_slots = 100
shared_preload_libraries = pglogical
max_worker_processes = 100
max_wal_size = 20GB
I have a total of 18 subscriptions for 18 schemas. It seemed to work fine in the beginning, but it quickly deteriorated, and some subscriptions started to bounce between down and replicating statuses, with the error posted above.
Question
What could be the possible causes? Do I need to change my Pg configurations?
Also, I noticed that when replication is going on, the CPU usage on the master/provider is pretty high.
/# ps aux | sort -nrk 3,3 | head -n 5
postgres 18180 86.4 1.0 415168 162460 ? Rs 22:32 19:03 postgres: getaround getaround 10.240.0.7(64106) CREATE INDEX
postgres 20349 37.0 0.2 339428 38452 ? Rs 22:53 0:07 postgres: wal sender process repuser 10.240.0.7(49742) idle
postgres 20351 33.8 0.2 339296 36628 ? Rs 22:53 0:06 postgres: wal sender process repuser 10.240.0.7(49746) idle
postgres 20350 28.8 0.2 339016 44024 ? Rs 22:53 0:05 postgres: wal sender process repuser 10.240.0.7(49744) idle
postgres 20352 27.6 0.2 339420 36632 ? Rs 22:53 0:04 postgres: wal sender process repuser 10.240.0.7(49750) idle
Thanks in advance!
I had a similar problem which was fixed by setting wal_sender_timeout on the master/provider to 5 minutes (the default is 1 minute). PostgreSQL drops the connection when that timeout is exceeded; increasing it seems to have fixed the problem for me.
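For reference, a sketch of that change made from SQL on the master/provider (ALTER SYSTEM writes it to postgresql.auto.conf; editing postgresql.conf and reloading works just as well):
ALTER SYSTEM SET wal_sender_timeout = '5min';  -- default is 1min
SELECT pg_reload_conf();                       -- apply without a restart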