(Databricks) - ERROR: canceling statement due to conflict with recovery - postgresql

I have a scheduled job that raised an error today:
Error on Databricks
org.postgresql.util.PSQLException: ERROR: canceling statement due to conflict with recovery
Is there a way to solve this on Databricks? How? Thanks!!
I tried to find something in the configs, but didn't have any success.

If you are querying a secondary installation (replica) and there are conflicting changes that need to be replayed from the primary, there are two options:
Delay replaying the changes (replication will fall further behind)
Cancel the running query that conflicts with the changes
PostgreSQL will do (1) until it hits a configured timeout and then do (2) so your replication doesn't fall too far behind.
https://www.postgresql.org/docs/current/hot-standby.html#HOT-STANDBY-CONFLICT
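The timeout in question is governed by the standby settings max_standby_streaming_delay and max_standby_archive_delay, so the fix (if any) lives on the PostgreSQL replica rather than in Databricks. A minimal sketch of inspecting and raising it, assuming you can change the standby's configuration directly (on managed services this is usually a parameter-group setting instead):
-- Run on the standby (replica); the 15min value is only an example.
SHOW max_standby_streaming_delay;   -- how long WAL replay waits before cancelling conflicting queries
SHOW max_standby_archive_delay;     -- same limit for WAL restored from the archive
ALTER SYSTEM SET max_standby_streaming_delay = '15min';
SELECT pg_reload_conf();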

Related

ADF Dataflow stuck in progress and fails with the errors below

ADF pipeline DF task is stuck in progress. It was working seamlessly for the last couple of months, but suddenly the dataflow gets stuck in progress and times out after a certain time. We are using a managed virtual network IR. I am using a ForEach loop to run the dataflow for multiple entities in parallel, and it always randomly gets stuck on the last entity.
What can I try to resolve this?
Error in Dev Environment:
Error code: 4508
Spark cluster not found
Error in Prod Environment:
Error code: 5000
Failure type: User configuration issue
Details:
[plugins.*** ADF.adf-ir-001 WorkspaceType:<ADF> CCID:<f289c067-7c6c-4b49-b0db-783e842a5675>] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
I tried the steps below:
Changed the IR configuration
Tried DF retry and retry interval
Tried the ForEach loop one batch at a time instead of 4 batches in parallel. None of the above troubleshooting steps worked. These pipelines have been running for the last 3-4 months without a single failure; suddenly they started failing consistently 3 days ago. The dataflow always gets stuck in progress randomly on a different entity and eventually times out, throwing the above errors.
Error Code 4508 Spark cluster not found.
This error can occur for two reasons:
The first is that the debug session gets closed before the dataflow finishes its transformation; in this case the recommendation is to restart the debug session.
The second is a resource problem or an outage in that particular region.
Error code 5000 Failure type User configuration issue Details [plugins.*** ADF.adf-ir-001 WorkspaceType: CCID:] [Monitoring] Livy Endpoint=[https://hubservice1.eastus.azuresynapse.net:8001/api/v1.0/publish/815b62a1-7b45-4fe1-86f4-ae4b56014311]. Livy Id=[0] Job failed during run time with state=[dead].
This is typically a transient error ("Livy job state dead caused by unknown error"). The dataflow uses a Spark cluster at the backend, and this error is generated by that Spark cluster. To get more information about the error, check the StdOut of the Spark pool execution.
The backend cluster may be experiencing a network problem, a resource problem, or an outage.
If the error persists, my suggestion is to raise a Microsoft support ticket.

AWS DMS task fails to retrieve tables

I'm trying to migrate existing data and replicate ongoing changes.
The source database is PostgreSQL, managed by AWS.
The target is Kafka.
I'm facing the issue below.
Last Error No tables were found at task initialization. Either the selected table(s) or schemas(s) no longer exist or no match was found for the table selection pattern(s). If you would like to start a Task that does not initially capture any tables, set Task Setting FailOnNoTablesCaptured to false and restart task. Stop Reason FATAL_ERROR Error Level FATAL
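Not a fix for the task setting itself, but the first possibility the error names (the selected tables or schemas no longer exist, or nothing matches the selection pattern) can be checked directly on the source. A rough sketch, assuming your selection rules target the public schema:
-- Run against the source PostgreSQL database; 'public' is only an example schema.
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_type = 'BASE TABLE'
  AND table_schema = 'public'
ORDER BY table_schema, table_name;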

How to check about Redshift maintenance windows

As usual, our Redshift maintenance window is set for Saturday morning, and we got several errors during that maintenance window.
* Query Processing Error at 5:07:01 AM
[Amazon](500051) ERROR processing query/statement. Error: Query execution failed
[SQL State=HY000, DB Errorcode=500051]
* Connection Error at 5:07:27.79 AM
[Amazon](500150) Error setting/closing connection: Connection refused: connect.
I guess that's due to Redshift internal maintenance.
May I ask how to find evidence on Redshift to prove that? I checked svl_qlog with aborted=1, but couldn't find conclusive proof.
And is there any way to make the maintenance window skip times when a user session is running?
--
Thanks to the useful information from Schepo and Bill, we were able to prove that the connection error was due to a reboot during the Redshift maintenance window.
We also checked the Redshift events in the console to see exactly what time the Redshift reboot started and ended.
Probably the best way to check whether the connection errors were due to Redshift maintenance is to check the Maintenance tab in your cluster configuration. In my example cluster, it's some time between 06:30 and 07:00 am every Wednesday.
There's no way to stop maintenance from happening while user sessions are connected, although you do have the option of deferring all maintenance for up to 45 days if you need to (follow the Edit button on the same screen).
For evidence, you can check the audit log of past maintenance events by looking in the AWS Config service under the "timeline" of your cluster. Follow the View Config Timeline button to open AWS Config for that cluster. In my example, that showed the exact time (08:49:20) of one past maintenance window.
Another way to document that the maintenance window was used is to check the "healthy" dashboard metric on the console or in CloudWatch. If the cluster went unhealthy and then returned to healthy during the maintenance window, it is very likely that AWS performed an update on the systems.
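To tie query-level evidence to that reboot time, the svl_qlog check from the question can be narrowed to the maintenance window itself. A rough sketch, with placeholder timestamps you would replace with the reboot window found in the console events:
-- Run on the Redshift cluster; adjust the time range to your maintenance window.
SELECT userid, query, starttime, endtime, aborted, substring
FROM svl_qlog
WHERE aborted = 1
  AND starttime BETWEEN '2023-06-03 05:00:00' AND '2023-06-03 06:00:00'
ORDER BY starttime;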

PostgreSQL logical replication - create subscription hangs

I am trying to set up logical replication between 2 cloud instances, both with Debian 9 and PG 11.1. The command CREATE PUBLICATION on the master was successful, but when I run CREATE SUBSCRIPTION on the intended logical replica, the command hangs indefinitely.
On the master I can see that the replication slot was created and is active, I can see a new walsender process created and "waiting", and in the master's log I see these lines:
2019-01-14 14:20:39.924 UTC [8349] repl_user#db LOG: logical decoding found initial starting point at 7B0/6C777D10
2019-01-14 14:20:39.924 UTC [8349] repl_user#db DETAIL: Waiting for transactions (approximately 2) older than 827339177 to end.
But that is all. The command CREATE SUBSCRIPTION never ends.
The master is a DB with heavy inserts, hundreds per minute, but they are all always committed, so there should not be any long-running uncommitted transactions.
I tried to google for this problem but did not find anything. What am I missing?
Since the databases are “in the cloud”, you don't know where they really are.
Odds are that they are actually in the same database cluster, which would explain the deadlock you see: CREATE SUBSCRIPTION waits until all concurrent transactions on the cluster that contains the replication source database are finished before it can create its replication slot, but since both databases are in the same cluster, it waits for itself to finish, which obviously won't happen.
The solution is to explicitly create a logical replication slot in the source database and to use that existing slot when you create the subscription.
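A rough sketch of that workaround, with placeholder names for the slot, publication, subscription, and connection string:
-- On the source (publisher) database: create the slot up front with the pgoutput plugin.
SELECT pg_create_logical_replication_slot('my_sub_slot', 'pgoutput');
-- On the subscriber: reuse that slot instead of letting CREATE SUBSCRIPTION create one.
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=source-host dbname=db user=repl_user'
    PUBLICATION my_pub
    WITH (create_slot = false, slot_name = 'my_sub_slot');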

Replication on PostgreSQL pauses when querying and replication are happening simultaneously

Postgres follows MVCC rules, so any query that is run on a table doesn't conflict with the writes that happen on the table. The query returns the result based on the snapshot taken at the point of running the query.
Now I have a master and a slave. The slave is used by analysts to run queries and perform analysis. When the slave is replicating and analysts are running their queries simultaneously, I can see the replication lag for a long time. If the queries are long-running, the replication lags for a long duration, and if the number of writes on the master happens to be pretty high, then I end up losing the WAL files and replication can no longer proceed. I just have to spin up another slave. Why does this happen? How do I allow queries and replication to happen simultaneously on Postgres? Is there any parameter setting that I can apply to make this happen?
The replica can't apply more WAL from the master because the master might've overwritten data blocks still needed by queries running on the replica, queries that are older than any still running on the master. The replica needs older row versions than the master does. It's exactly because of MVCC that this pause is necessary.
You probably set a high max_standby_streaming_delay to avoid "canceling statement due to conflict with recovery" errors.
If you turn hot_standby_feedback on, the replica can instead tell the master to keep those rows. But the master can't clean up free space as efficiently then, and it might run out of space in pg_xlog if the standby gets way too far behind.
See PostgreSQL manual: Handling Query Conflicts.
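A minimal sketch of that trade-off on the replica, assuming you can change its configuration (the delay value is only an example):
-- On the replica: ask the primary to keep row versions that queries here still need,
-- and bound how long WAL replay waits for conflicting queries before cancelling them.
ALTER SYSTEM SET hot_standby_feedback = on;
ALTER SYSTEM SET max_standby_streaming_delay = '5min';
SELECT pg_reload_conf();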
As for the WAL retention part: enable WAL archiving and a restore_command for your standbys. You should really be using it anyway, for point-in-time recovery. PgBarman now makes this easy with the barman get-wal command. If you don't want WAL archiving, you can instead set your replica servers up to use a replication slot to connect to the master, so the master knows to retain the WAL they need indefinitely. Of course, that can cause the master to run out of space in pg_xlog and stop running, so you need to monitor more closely if you do that.
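For the replication-slot variant, a rough sketch with a placeholder slot name (on PG 11 and older the standby-side setting lives in recovery.conf, on PG 12+ in postgresql.conf):
-- On the master: create a physical slot so WAL needed by this standby is retained.
SELECT pg_create_physical_replication_slot('analytics_standby');
-- On the standby, point at the slot:
--   primary_slot_name = 'analytics_standby'
-- Monitor retained WAL so pg_xlog/pg_wal doesn't fill up.
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;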