I have a database hosted on AWS RDS. It has been running for close to three years and seemed to work without issues. Without any significant changes (besides writing new records to the DB), connections to it started to hang or time out. Sometimes a connection goes through, but it takes about 30-60 seconds, and then the connection drops shortly after.
In the postgres error log, there is nothing particularly concerning. The most relevant thing I found is:
2022-11-15 14:08:11 UTC::#:[365]:LOG: checkpoint starting: time
2022-11-15 14:08:12 UTC::#:[365]:LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=0.277 s, sync=0.040 s, total=0.838 s; sync files=1, longest=0.040 s, average=0.040 s; distance=65535 kB, estimate=65536 kB
2022-11-15 14:10:37 UTC::#:[368]:WARNING: worker took too long to start; canceled
2022-11-15 14:10:52 UTC::#:[3350]:WARNING: autovacuum worker started without a worker entry
Running telnet to the DB seems to go through fine, and on the AWS dashboard I don't see anything particularly concerning: a normal number of connections and normal CPU/memory usage.
I tried restarting the db instance, upgrading to a newer version of postgres, and restarting the clients, but to no avail.
Any ideas what I can look at, in order to pinpoint the issue?
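For reference, once a connection does go through, a query like the following can show whether backends are stuck waiting (a minimal sketch, nothing RDS-specific about it):
-- look for backends stuck on locks or I/O and for very old connections
SELECT pid, state, wait_event_type, wait_event,
       now() - backend_start AS connection_age
FROM pg_stat_activity
ORDER BY backend_start;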
Summary
We are using max_slot_wal_keep_size, introduced in PostgreSQL 13, to prevent the master from being taken down by a lagging replica. It seems that, in our case, WAL storage wasn't freed up after this limit was exceeded, which resulted in a replication failure. The WAL that, as I believe, should have been freed did not seem to be needed by any other transaction at the time. I wonder how this is supposed to work and why the WAL segments were not removed?
Please find the details below.
Configuration
master & one replica - streaming replication using a slot
~700GB available for pg_wal
max_slot_wal_keep_size = 600GB
min_wal_size = 20GB
max_wal_size = 40GB
default checkpoint_timeout = 5 minutes (no problem with checkpoints)
archiving is on and is catching up well
What happened
Under heavy load (large COPY/INSERT transactions, loading hundreds of GB of data), the replication started falling behind. Available space in pg_wal was shrinking at the same rate as pg_replication_slots.safe_wal_size - as expected. At some point safe_wal_size went negative and streaming stopped working. That wasn't a problem by itself, because the replica started recovering from the WAL archive. I expected that once the slot was lost, WAL would be removed down to max_wal_size. This did not happen, though. It seems that Postgres tried to keep something close to max_slot_wal_keep_size (600GB) of WAL around, in case the replica started catching up again. At no point was there a single transaction that would have required this much WAL to be kept, and archiving wasn't behind either.
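For context, the slot state can be watched with a query along these lines (a sketch; wal_status and safe_wal_size are the PostgreSQL 13 columns of pg_replication_slots):
-- how much WAL the slot is holding back, and whether it is still considered safe
SELECT slot_name, active, restart_lsn, wal_status, safe_wal_size,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;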
Q1: Is it the case that PG will try to keep max_slot_wal_keep_size of WAL available?
Q2: If not, why did PG not remove the excess WAL when it was needed neither by the archiver nor by any transaction running on the system?
The amount of free space in pg_wal was more or less 70GB for most of the time; however at some point, during heavy autovacuuming, it dipped to 0 :( This is when PG crashed (and auto-recovered soon after). After getting back up, there were 11GB left in pg_wal, with no transactions running and no loading. This lasted for hours. During this time the replica finally caught up from the archive and restored the replication with no delay. None of the WAL was removed. I ran a manual CHECKPOINT, but it did not clear any WAL. I finally restarted PostgreSQL, and during the restart pg_wal was finally cleared.
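(For reference, this is the kind of check used to see whether the manual checkpoint freed anything; a sketch, assuming access to pg_ls_waldir(), which by default requires superuser or the pg_monitor role:)
CHECKPOINT;  -- force a checkpoint, which is when old WAL segments get removed or recycled
SELECT count(*) AS wal_files, pg_size_pretty(sum(size)) AS total_size
FROM pg_ls_waldir();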
Q3: Again - why did PG not clear the WAL? This time the WAL was, even more clearly, not needed by any process.
Many thanks!
This was a PostgreSQL bug, and it's fixed. Thanks for reporting!
The fix should be available in 13.4 according to the release notes (look for "Advance oldest required WAL segment").
I am testing logical replication between 2 PostgreSQL 11 databases for use in our production environment (I was able to set it up thanks to this answer - PostgreSQL logical replication - create subscription hangs) and it worked well.
Now I am testing the scripts and procedure that would set it up automatically on the production databases, but I am facing a strange problem with logical replication slots.
I had to restart the logical replica due to some setting changes requiring a restart - which of course could also happen on replicas in the future. But the logical replication slot on the master did not disconnect, and it is still marked active for a certain PID.
I dropped the subscription on the master (I am still only testing) and tried to repeat the whole process with a new logical replication slot, but I am facing a strange situation.
I cannot create a new logical replication slot with the new name. The process attached to the old logical replication slot is still active, showing wait_event_type=Lock and wait_event=transaction.
When I try to use pg_create_logical_replication_slot to create a new logical replication slot, I end up in a similar situation. The new slot is created - I can see it in pg_catalog - but it is marked as active for the PID of the session that issued the command, and the command hangs indefinitely. When I check the processes, I can see this command active with the same wait values, Lock/transaction.
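For reference, which PID holds each slot and what it is waiting on can be seen with a query along these lines (a minimal sketch):
-- map each replication slot to the backend that holds it and its current wait state
SELECT s.slot_name, s.active, s.active_pid,
       a.backend_type, a.state, a.wait_event_type, a.wait_event
FROM pg_replication_slots s
LEFT JOIN pg_stat_activity a ON a.pid = s.active_pid;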
I tried setting the "lock_timeout" parameter in postgresql.conf and reloading the configuration, but it did not help.
Killing that old hanging process would most likely bring down the whole postgres instance, because it is a "walsender" process. It is still visible in the process list with the IP of the replica and the status "idle waiting".
I tried to find some parameter(s) that would force postgres to drop this walsender, but changing wal_keep_segments or wal_sender_timeout did not make any difference. I even tried stopping the replica for a longer time - no effect.
Is there some way to deal with this situation without restarting the whole postgres instance? For example, forcing a timeout for the walsender or for the transaction lock, etc.?
Because if something like this happens in production, I would not be able to use a restart or any other kind of "brute force". Thanks...
UPDATE:
"Walsender" process "died out" after some time but log does not show anything about it so I do not know when exactly it happened. I can only guess it depends on tcp_keepalives_* parameters. Default on Debian 9 is 2 hours to keep idle process. So I tried to set these parameters in postgresql.conf and will see in following tests.
Strangely enough, today everything works without any problems, and no matter how I try, I cannot reproduce yesterday's problems. Maybe some network communication problems in the cloud datacenter were involved - we experienced occasional timeouts in connections to other databases too.
So I really do not know the answer except for "wait until the walsender process on the master dies" - which can most likely be influenced by the tcp_keepalives_* settings. Therefore I recommend setting them to some reasonable values in postgresql.conf, because the OS defaults are usually much too long.
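Something along these lines, for illustration (the exact values are only an assumption - pick what fits your network; editing postgresql.conf and reloading works the same way):
ALTER SYSTEM SET tcp_keepalives_idle = 60;      -- seconds of idle time before the first keepalive probe
ALTER SYSTEM SET tcp_keepalives_interval = 10;  -- seconds between probes
ALTER SYSTEM SET tcp_keepalives_count = 6;      -- failed probes before the connection is considered dead
SELECT pg_reload_conf();                        -- these settings only need a reload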
We actually use these settings on our big analytical databases (set both in PostgreSQL and at the OS level) because of similar problems. Golang and Node.js programs calculating statistics would occasionally fail to recognize that a database session had ended or died, and would hang until the OS closed the connection after 2 hours (the Debian default). All of it always seemed to be connected with network communication problems. With proper tcp_keepalives_* settings, the reaction is much quicker when problems occur.
After the old walsender process on the master dies, you can repeat all the steps and it should work. So it looks like I just had bad luck yesterday...
I am trying to set up logical replication between 2 cloud instances, both with Debian 9 and PG 11.1. The CREATE PUBLICATION command on the master was successful, but when I run CREATE SUBSCRIPTION on the intended logical replica, the command hangs indefinitely.
On the master I can see that the replication slot was created and is active, and I can see a new walsender process created and "waiting". In the log on the master I see these lines:
2019-01-14 14:20:39.924 UTC [8349] repl_user#db LOG: logical decoding found initial starting point at 7B0/6C777D10
2019-01-14 14:20:39.924 UTC [8349] repl_user#db DETAIL: Waiting for transactions (approximately 2) older than 827339177 to end.
But that is all. The command CREATE SUBSCRIPTION never ends.
The master is a db with heavy inserts, on the order of hundreds per minute, but they are all always committed, so there should not be any long-running uncommitted transactions.
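(A quick way to double-check for old open transactions on the master is something like the following sketch:)
-- sessions holding an open transaction or an old snapshot, oldest first
SELECT pid, backend_xid, backend_xmin, xact_start, state, query
FROM pg_stat_activity
WHERE backend_xid IS NOT NULL OR backend_xmin IS NOT NULL
ORDER BY xact_start;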
I tried to google for this problem but did not find anything. What am I missing?
Since the databases are “in the cloud”, you don't know where they really are.
Odds are that they are actually in the same database cluster, which would explain the deadlock you see: CREATE SUBSCRIPTION waits until all concurrent transactions on the cluster that contains the replication source database are finished before it can create its replication slot, but since both databases are in the same cluster, it waits for itself to finish, which obviously won't happen.
The solution is to explicitly create a logical replication slot in the source database and to use that existing slot when you create the subscription.
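A minimal sketch of that approach (the slot, publication, subscription names and the connection string are placeholders):
-- on the source (publisher) database
SELECT pg_create_logical_replication_slot('my_slot', 'pgoutput');
-- on the subscriber: reuse the existing slot instead of letting CREATE SUBSCRIPTION create one
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=source-host dbname=sourcedb user=repl_user'
    PUBLICATION my_pub
    WITH (create_slot = false, slot_name = 'my_slot');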
[RED] restartpoint complete: wrote 3392 buffers (10.9%);
0 transaction log file(s) added, 0 removed, 9 recycled;
write=340.005 s, sync=0.241 s, total=340.257 s;
sync files=86, longest=0.121 s, average=0.002 s
and
[RED] recovery restart point at 2AA/A90FB758 Detail: last completed transaction was at log time 2015-01-21 14:00:18.782442+00
The RED database is a follower of the master database. What do these log entries mean?
These are informational log messages telling you that the server has finished creating a restart point. There should be other messages telling you that the server started creating the restart point.
Restart points are described at http://www.postgresql.org/docs/9.0/static/wal-configuration.html
A restart point is a point in the transaction history at which a future run of this server could restart without needing to use any earlier write-ahead log files, if the server crashes. The system creates restart points periodically, based on how much transaction history accumulates and on the passage of time.
The log_checkpoints parameter in the PostgreSQL configuration file determines whether you get these log messages when restart points are created.
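For example, on versions with ALTER SYSTEM (9.4 and later) it can be toggled without editing the file (a sketch; on older versions set it in postgresql.conf and reload):
ALTER SYSTEM SET log_checkpoints = on;  -- or off to silence these messages
SELECT pg_reload_conf();                -- a reload is enough, no restart needed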
I have noticed that my Google Cloud SQL instance is losing connectivity periodically and it seems to be associated with some read spikes on the Cloud SQL instance. See the attached screenshot for examples.
The instance is mostly idle, but my application recycles the connections in its connection pool every 60 seconds, so this is not a wait_timeout issue; I have verified that the connections are recycled. Also, it occurred twice in 30 minutes, and wait_timeout is 8 hours.
I would suspect a backup process but you can see from the screenshot that no backups have run.
The first outage lasted 17 seconds from the time the connection loss was detected until the connection was reestablished. The second lasted only 5 seconds, but given that my connections sit idle for 60 seconds, the actual downtime could be up to 1:17 and 1:05 respectively. They occurred at 2014-06-05 15:29:08 UTC and 2014-06-05 16:05:32 UTC. The read spikes are not initiated by me; my app remained idle during the issue, so this is some sort of internal Cloud SQL process.
This is not a big deal for my idle app, but it will become a big deal when the app is not idle.
Has anyone else run into this issue? Is this a known issue with Google Cloud SQL? Is there a known fix?
Any help would be appreciated.
Update:
The root cause of the symptoms above has been identified as a restart of the MySQL instance. I did not restart the instance, and the operations section of the web console does not list any events at that time, so now the question becomes: what would cause the instance to restart twice in 30 minutes? Why would a production database instance restart at all?
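(One quick way to confirm a restart from the database side, independently of the console, is the server's uptime counter, for example:)
-- MySQL: seconds since the server started; a small value right after an incident confirms a restart
SHOW GLOBAL STATUS LIKE 'Uptime';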
That was caused by one of our regular releases. Because of the way the update takes place, an instance might be restarted more than once during the push to production.
Was your instance restarted? During a restart, the spinning down/up of an instance will trigger reads and writes.
That may be one reason why you are seeing the read/write activity.