How do I synchronize the subscription when the server terminated unexpectedly - postgresql

I have a publisher and a subscriber. Every so often I get:
ERROR: could not receive data from WAL stream: server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
I can guess why it terminates abnormally: one of the computers turns off. However, when the two computers are connected again, replication doesn't restart automatically.
The only thing that works is to truncate all the tables in the subscription, drop the subscription and publication, and create them again.
I tried looking at the WAL files; they look fine. Not sure what else to do.
(Screenshots of the logs omitted.)

It should not be necessary to re-initialize logical replication just because there was a connection problem. The logical replication slot on the primary will make sure that all required information is retained on the server so that replication can be resumed later on.
Reading your primary's log, it looks like you are just hitting a timeout because there is nothing to replicate. That shouldn't be a problem, but you can set wal_sender_timeout = 0 on the primary to disable the timeout.
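For reference, a minimal sketch of how that could be applied (assuming superuser access on the primary; the change takes effect after a configuration reload):
-- on the primary
ALTER SYSTEM SET wal_sender_timeout = 0;   -- 0 disables the walsender timeout
SELECT pg_reload_conf();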

Related

Postgres: processes terminated after connection break / invalidation

I don't understand some of Postgres's mechanisms, and it makes me quite upset.
I usually use DBeaver as an SQL client to query an external Postgres database. If I run CREATE ... or INSERT ... queries and the connection is then broken or invalidated for some reason, the pid keeps running and finishes the transaction.
But for some more complicated PL/pgSQL functions we wrote (with temp tables, loops, inserts, etc.), breaking the connection always causes the process to terminate (it disappears from the session list just before making the next SQL operation, e.g. inserting a row into a log table). It doesn't matter whether it's the DBeaver editor or the psql command line.
I know that maybe disconnecting is a critical problem that should be eliminated, and maybe I shouldn't expect the process to continue successfully, but I do :) Or at least I'd like to know why it happens and whether it is possible to prevent it.
If the network connection fails, the database server can detect that in two ways:
if it tries to send data to the client, it will figure out pretty quickly that the connection is down
if it tries to receive data from the client, it will only notice when the kernel's TCP keepalive mechanism has determined that the connection is down
When you say that sometimes execution of a function is terminated right away, I would say that is because the function returned data to the client.
In the case where a query keeps running, it is not attempting to return any data yet.
There is no cure for the former, but in PostgreSQL v14 you can prevent the latter by setting client_connection_check_interval. In addition, you have to set the PostgreSQL keepalive parameters so that the dead connection becomes known quickly.
See my article for more.
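For illustration, a rough sketch of the server settings mentioned above (the values are made-up examples; client_connection_check_interval requires PostgreSQL v14 or later):
-- run on the database server; example values only
ALTER SYSTEM SET client_connection_check_interval = '10s';  -- v14+: probe the client connection while a query runs
ALTER SYSTEM SET tcp_keepalives_idle = 60;       -- seconds of inactivity before keepalive probes start
ALTER SYSTEM SET tcp_keepalives_interval = 10;   -- seconds between keepalive probes
ALTER SYSTEM SET tcp_keepalives_count = 3;       -- lost probes before the connection counts as dead
SELECT pg_reload_conf();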

Postgres Logical Replication - Monitor Subscriber Without Accessing the Publisher Server

I would like each subscriber server to monitor its health without accessing the publisher server
1. I use the following query on the publisher to get the lag. Is it possible to compute the lag from the subscriber server as well?
SELECT
slot_name, active, confirmed_flush_lsn, pg_current_wal_lsn(),
(pg_current_wal_lsn() - confirmed_flush_lsn) AS bytes_lag
FROM pg_replication_slots;
If I run the following on the subscriber:
select received_lsn, latest_end_lsn from pg_stat_subscription
I would still need the following from the publisher: select pg_current_wal_lsn();
Is there a way to know the lag without accessing the publisher?
2. I have a duplicate value in one of the tables that caused replication to stop, but
select srsubstate from pg_subscription_rel
shows 'r' for all tables.
How can I know which table is problematic?
How can I know why replication stopped?
3. How can a subscriber know that its logical slot or even the publisher was dropped?
No, you cannot get that information from the subscriber. The subscriber doesn't know what there is to receive that it has not yet received.
To figure out the cause when replication breaks, you have to look at the subscriber's log file. Yes, that is manual activity, but so is conflict resolution.
You will quickly figure out if the replication slot has been dropped, because there will be nasty error messages in the log. This is quite similar to dropped tables.
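To illustrate what the subscriber can see on its own, here is a sketch using the standard catalogs; it won't give you the byte lag, but it does show whether the apply worker is alive, when it last heard from the publisher, and the per-table sync state:
-- on the subscriber: apply worker status
SELECT subname,
       pid IS NOT NULL AS worker_running,
       received_lsn,
       latest_end_lsn,
       last_msg_receipt_time
FROM pg_stat_subscription;
-- on the subscriber: per-table sync state ('r' = ready, i.e. replicating normally)
SELECT s.subname, c.relname, rel.srsubstate
FROM pg_subscription_rel AS rel
JOIN pg_subscription AS s ON s.oid = rel.srsubid
JOIN pg_class AS c ON c.oid = rel.srrelid;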

How to avoid long delay before finally getting "40001 could not serialize access due to concurrent update"

We have a Postgres 12 system running one master and two async hot-standby replica servers, and we use SERIALIZABLE transactions. All the database servers have very fast SSD storage for Postgres and 64 GB of RAM. Clients connect directly to the master server if they cannot accept delayed data for a transaction. Read-only clients that accept data up to 5 seconds old use the replica servers for querying data. Read-only clients use REPEATABLE READ transactions.
I'm aware that because we use SERIALIZABLE transactions Postgres might give us false positive matches and force us to repeat transactions. This is fine and expected.
However, the problem I'm seeing is that randomly a single line INSERT or UPDATE query stalls for a very long time. As an example, one error case was as follows (speaking directly to master to allow modifying table data):
A simple single row insert
insert into restservices (id, parent_id, ...) values ('...', '...', ...);
stalled for 74.62 seconds before finally emitting error
ERROR 40001 could not serialize access due to concurrent update
with error context
SQL statement "SELECT 1 FROM ONLY "public"."restservices" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
We log all queries exceeding 40 ms so I know this kind of stall is rare. Like maybe a couple of queries a day. We average around 200-400 transactions per second during normal load with 5-40 queries per transaction.
After finally getting the above error, the client code automatically released two savepoints, rolled back the transaction and disconnected from database (this cleanup took 2 ms total). It then reconnected to database 2 ms later and replayed the whole transaction from the start and finished in 66 ms including the time to connect to the database. So I think this is not about performance of the client or the master server as a whole. The expected transaction time is between 5-90 ms depending on transaction.
Is there some PostgreSQL connection or master configuration setting that I can use to make PostgreSQL return the error 40001 faster, even if it causes more transactions to be rolled back? Does anybody know if setting
set local statement_timeout='250'
within the transaction has dangerous side effects? According to the documentation at https://www.postgresql.org/docs/12/runtime-config-client.html, "Setting statement_timeout in postgresql.conf is not recommended because it would affect all sessions", but I could set the timeout only for transactions issued by this client, which is able to retry the transaction automatically and very quickly.
Is there anything else to try?
It looks like someone had locked the parent row of the one you were trying to insert. PostgreSQL doesn't know what to do about that until the lock is released, so it blocks. If you failed rather than blocking, and upon failure retried the exact same thing, the same parent row would (most likely) still be locked, so it would just fail again and you would busy-wait. Busy-waiting is not good, so blocking rather than failing is generally a good thing here. It blocks and then unblocks only to fail, but once it does fail, a retry should succeed.
An obvious exception to blocking-better-than-failing is when you can pick a different parent row to retry with, if that makes sense in your context. In that case, maybe the best thing to do is to explicitly lock the parent row with NOWAIT before attempting the insert, as sketched below; that way you can deal with failures in a more nuanced way.
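A rough sketch of that idea, reusing the table and column names from the question (the literal id value is just a placeholder):
-- inside the transaction, before the INSERT; fails immediately with SQLSTATE 55P03
-- instead of blocking if the parent row cannot be locked right away
SELECT 1
FROM restservices
WHERE id = 'parent-id-to-check'   -- placeholder value
FOR KEY SHARE NOWAIT;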
If you must retry with the same parent_id, then I think the only real solution is to figure out who is holding the parent row lock for so long, and fix that. I don't think that setting statement_timeout would be hazardous, but it also wouldn't solve your problem, as you would probably just keep retrying until the lock on the offending row is released. (Setting it on the other session, the one holding the lock, might be helpful, depending on what that session is doing while the lock is held.)
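To identify the session holding the lock, something along these lines might help (run on the master while an insert is stalled; pg_blocking_pids() exists since PostgreSQL 9.6):
-- which sessions are blocking which, and how long the blocker's transaction has been open
SELECT waiting.pid                AS blocked_pid,
       waiting.query              AS blocked_query,
       blocker.pid                AS blocking_pid,
       blocker.query              AS blocking_query,
       blocker.state              AS blocking_state,
       now() - blocker.xact_start AS blocking_xact_age
FROM pg_stat_activity AS waiting
JOIN pg_stat_activity AS blocker
  ON blocker.pid = ANY (pg_blocking_pids(waiting.pid))
WHERE waiting.wait_event_type = 'Lock';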

Service Fabric failed to reconfigure replica

During a load test of the application (with dynamic load reporting services), the whole application stopped working because one replica of a stateful partition gave a warning.
Warning System.RAP IStatefulServiceReplica.ChangeRole(S)Duration Thu, 21 Jul 2016 3:36:03 GMT Infinity 131135817636324745 false false Start Time (UTC): 2016-07-21 13:35:43.632
This happened after a load balancing of the replica; it happened to the 4th replica of the partition even though we only target 3. So even if SF just killed it, the application should be fine (as the primary and 2 other secondaries are up). However, the whole thing jams (from logging I can see at least 10k events still need to be processed, but the whole thing stops).
In the images above you can see the details of the particular replica. The only differences between this replica and the other secondary replicas are in the following values:
Read Status
Write Status
Current Service Operation
Queue Memory Size (in Replication Queue)
First Sequence Number (in Replication Queue)
Last Replication Operation Received Time Utc
Last Copy Operation Received Time Utc
Last Acknowledgement Sent Time Utc
I also find it odd that the Replica Status says Ready and not Reconfiguring anymore, since the read/write status says it is still reconfiguring.
I'm running the newest SDK (2.1.163, released 18-07-2016). I thought the bugfix was in there, but even though it became much harder to reproduce, it still occurred. Does anyone know what might be causing this or how to fix it?
Edit: Screenshot of the failing partition
Edit: Results of debugging, based on the answer of Vaclav (22-7-2016)
After Vaclav's response I started to log everything in RunAsync to determine what was actually causing the problem, i.e. which part of the code did not exit when cancellation was requested. As Vaclav pointed out, the method did not stop when cancellation was requested. However, it seems like the code section in which it gets stuck is native Service Fabric code.
using(ITransaction tx = StateManager.CreateTransaction())
{
await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
await tx.CommitAsync();
}
The queue is a ReliableQueue, the timeout is set to the default 4 seconds, and the cancellation token is the one from RunAsync. After adding logging between each line we got the following pattern:
//pre transaction
using(ITransaction tx = StateManager.CreateTransaction())
{
//pre dequeue
await queue.TryDequeueAsync(tx, _queueTimeout, cancellationToken);
//dequeued
await tx.CommitAsync();
//committed
}
//post transaction
At each line I also logged the value of the cancellation request, and a background task would log when the cancellation request was fired. As a result we got, for example, this:
pre transaction: False
predequeue: False
dequeued: False
CancelationTokenFired: True
The precise location could vary, but the last log line before CancelationTokenFired was always:
pre transaction
predequeue
dequeued
As stated before, this is done on the most recent SDK (18-7-2016), which supposedly had a bug fix for a similar problem. The problem also occurred on the older SDK, and even more frequently back then. But even on the new version it is still reproducible on each run.
This warning means your service isn't exiting RunAsync when a primary replica of your service is changing role during reconfiguration (look at the health warning in your last screenshot). Make sure you honor that cancellation token in every possible code path. This also applies to communication listeners - make sure they are responding to CloseAsync().
Given what you're saying, here's what most likely happened:
We built a new replica on a new node (likely for load balancing). At this point, temporarily, you have 4 replicas until reconfiguration completes.
We attempt to swap primaries to this new replica.
Your current primary is told to change role, which means cancel RunAsync and close communication listeners.
Your current primary isn't completing its role change - RunAsync isn't exiting or your communication listeners aren't closing.
Reconfiguration is stuck waiting for the current primary to finish changing role.
Health warnings are issued.
Once reconfiguration completes, your replica set size will be reduced back to your target of 3.
We won't kill your slow replica because we don't know that your application will be fine - maybe it's taking a long time to safely process valuable data - we don't know. Service Fabric is very paranoid about safety and won't do anything that could possibly cause your service to lose data.
Service Fabric Explorer unfortunately doesn't show the reconfiguring state; it shows you the expected end result. But if you run Get-ServiceFabricPartition in PowerShell, it will show you the reconfiguring state of the partition.
I've seen this a lot and have been banging my head against a brick wall for some time.
However, check out the latest release - 5.1.163 and 2.1.163 - it appears to have solved the issue for me.

Concerns about ZooKeeper's lock recipe

While reading ZooKeeper's recipe for locks, I got confused. It seems that this recipe for distributed locks cannot guarantee that "at any snapshot in time no two clients think they hold the same lock". But since ZooKeeper is so widely adopted, if there were such a mistake in the reference documentation, someone would have pointed it out long ago, so what did I misunderstand?
Quoting the recipe for distributed locks:
Locks
Fully distributed locks that are globally synchronous, meaning at any snapshot in time no two clients think they hold the same lock. These can be implemented using ZooKeeper. As with priority queues, first define a lock node.
Call create( ) with a pathname of "locknode/guid-lock-" and the sequence and ephemeral flags set.
Call getChildren( ) on the lock node without setting the watch flag (this is important to avoid the herd effect).
If the pathname created in step 1 has the lowest sequence number suffix, the client has the lock and the client exits the protocol.
The client calls exists( ) with the watch flag set on the path in the lock directory with the next lowest sequence number.
if exists( ) returns false, go to step 2. Otherwise, wait for a notification for the pathname from the previous step before going to step 2.
Consider the following case:
Client1 successfully acquired the lock (in step 3), with ZooKeeper node "locknode/guid-lock-0";
Client2 created node "locknode/guid-lock-1", failed to acquire the lock, and is now watching "locknode/guid-lock-0";
Later, for some reason (say, network congestion), Client1 fails to send a heartbeat message to the ZooKeeper cluster on time, but Client1 is still working away, mistakenly assuming that it still holds the lock.
But, ZooKeeper may think Client1's session is timed out, and then
delete "locknode/guid-lock-0",
send a notification to Client2 (or maybe send the notification first?),
but cannot send a "session timeout" notification to Client1 in time (say, due to network congestion).
Client2 gets the notification, goes to step 2, and finds that the only node is "locknode/guid-lock-1", which it created itself; thus, Client2 assumes it holds the lock.
But at the same time, Client1 assumes it holds the lock.
Is this a valid scenario?
The scenario you describe could arise. Client 1 thinks it has the lock, but in fact its session has timed out, and Client 2 acquires the lock.
The ZooKeeper client library will inform Client 1 that its connection has been disconnected (though the client doesn't know the session has expired until it reconnects to the server), so the client code can assume that its lock has been lost if it has been disconnected for too long. But the thread which uses the lock needs to check periodically that the lock is still valid, which is inherently racy.
...But, ZooKeeper may think Client1's session is timed out, and then...
From the ZooKeeper documentation:
The removal of a node will only cause one client to wake up since each node is watched by exactly one client. In this way, you avoid the herd effect.
There is no polling or timeouts.
So I don't think the problem you describe arises. It looks to me as though there could be a risk of hanging locks if something happens to the clients that created them, but the scenario you describe should not arise.
From the Packt book ZooKeeper Essentials:
If there was a partial failure in the creation of the znode due to connection loss, it's possible that the client won't be able to correctly determine whether it successfully created the child znode. To resolve such a situation, the client can store its session ID in the znode data field or even as a part of the znode name itself. As a client retains the same session ID after a reconnect, it can easily determine whether the child znode was created by it by looking at the session ID.