How to resolve ReadWithUncertainityInterval in cockroach database with springboot - postgresql

While performing write/read operations in cockroach database with springboot, we are getting below error intermittently. Any solutions here is appreciated. Thanks
Caused by:
org.postgresql.util.PSQLException: ERROR: restart transaction: TransactionRetryWithProtoRefreshError: ReadWithinUncertaintyIntervalError: read at time 1640760553.619962171,0 encountered previous write with future timestamp

The documentation states that this error is a sign of contention. It also suggests four ways of solving it:
Be prepared to retry on uncertainty (and other) errors, as described in client-side retry handling.
Use historical reads with SELECT ... AS OF SYSTEM TIME.
Design your schema and queries to reduce contention. For more information about how contention occurs and how to avoid it, see Understanding and avoiding transaction contention. In particular, if you are able to send all of the statements in your transaction in a single batch, CockroachDB can usually automatically retry the entire transaction for you.
If you trust your clocks, you can try lowering the --max-offset option to cockroach start, which provides an upper limit on how long a transaction can continue to restart due to uncertainty.
Did you already try these?

Related

RDS Postgres "canceling statement due to conflict with recovery" [duplicate]

I'm getting the following error when running a query on a PostgreSQL db in standby mode. The query that causes the error works fine for 1 month but when you query for more than 1 month an error results.
ERROR: canceling statement due to conflict with recovery
Detail: User query might have needed to see row versions that must be removed
Any suggestions on how to resolve? Thanks
No need to touch hot_standby_feedback. As others have mentioned, setting it to on can bloat master. Imagine opening transaction on a slave and not closing it.
Instead, set max_standby_archive_delay and max_standby_streaming_delay to some sane value:
# /etc/postgresql/10/main/postgresql.conf on a slave
max_standby_archive_delay = 900s
max_standby_streaming_delay = 900s
This way queries on slaves with a duration less than 900 seconds won't be cancelled. If your workload requires longer queries, just set these options to a higher value.
Running queries on hot-standby server is somewhat tricky — it can fail, because during querying some needed rows might be updated or deleted on primary. As a primary does not know that a query is started on secondary it thinks it can clean up (vacuum) old versions of its rows. Then secondary has to replay this cleanup, and has to forcibly cancel all queries which can use these rows.
Longer queries will be canceled more often.
You can work around this by starting a repeatable read transaction on primary which does a dummy query and then sits idle while a real query is run on secondary. Its presence will prevent vacuuming of old row versions on primary.
More on this subject and other workarounds are explained in Hot Standby — Handling Query Conflicts section in documentation.
There's no need to start idle transactions on the master. In postgresql-9.1 the
most direct way to solve this problem is by setting
hot_standby_feedback = on
This will make the master aware of long-running queries. From the docs:
The first option is to set the parameter hot_standby_feedback, which prevents
VACUUM from removing recently-dead rows and so cleanup conflicts do not occur.
Why isn't this the default? This parameter was added after the initial
implementation and it's the only way that a standby can affect a master.
As stated here about hot_standby_feedback = on :
Well, the disadvantage of it is that the standby can bloat the master,
which might be surprising to some people, too
And here:
With what setting of max_standby_streaming_delay? I would rather
default that to -1 than default hot_standby_feedback on. That way what
you do on the standby only affects the standby
So I added
max_standby_streaming_delay = -1
And no more pg_dump error for us, nor master bloat :)
For AWS RDS instance, check http://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Appendix.PostgreSQL.CommonDBATasks.html
The table data on the hot standby slave server is modified while a long running query is running. A solution (PostgreSQL 9.1+) to make sure the table data is not modified is to suspend the replication and resume after the query:
select pg_xlog_replay_pause(); -- suspend
select * from foo; -- your query
select pg_xlog_replay_resume(); --resume
I'm going to add some updated info and references to #max-malysh's excellent answer above.
In short, if you do something on the master, it needs to be replicated on the slave. Postgres uses WAL records for this, which are sent after every logged action on the master to the slave. The slave then executes the action and the two are again in sync. In one of several scenarios, you can be in conflict on the slave with what's coming in from the master in a WAL action. In most of them, there's a transaction happening on the slave which conflicts with what the WAL action wants to change. In that case, you have two options:
Delay the application of the WAL action for a bit, allowing the slave to finish its conflicting transaction, then apply the action.
Cancel the conflicting query on the slave.
We're concerned with #1, and two values:
max_standby_archive_delay - this is the delay used after a long disconnection between the master and slave, when the data is being read from a WAL archive, which is not current data.
max_standby_streaming_delay - delay used for cancelling queries when WAL entries are received via streaming replication.
Generally, if your server is meant for high availability replication, you want to keep these numbers short. The default setting of 30000 (milliseconds if no units given) is sufficient for this. If, however, you want to set up something like an archive, reporting- or read-replica that might have very long-running queries, then you'll want to set this to something higher to avoid cancelled queries. The recommended 900s setting above seems like a good starting point. I disagree with the official docs on setting an infinite value -1 as being a good idea--that could mask some buggy code and cause lots of issues.
The one caveat about long-running queries and setting these values higher is that other queries running on the slave in parallel with the long-running one which is causing the WAL action to be delayed will see old data until the long query has completed. Developers will need to understand this and serialize queries which shouldn't run simultaneously.
For the full explanation of how max_standby_archive_delay and max_standby_streaming_delay work and why, go here.
It might be too late for the answer but we face the same kind of issue on the production.
Earlier we have only one RDS and as the number of users increases on the app side, we decided to add Read Replica for it. Read replica works properly on the staging but once we moved to the production we start getting the same error.
So we solve this by enabling hot_standby_feedback property in the Postgres properties.
We referred the following link
https://aws.amazon.com/blogs/database/best-practices-for-amazon-rds-postgresql-replication/
I hope it will help.
Likewise, here's a 2nd caveat to #Artif3x elaboration of #max-malysh's excellent answer, both above.
With any delayed application of transactions from the master the follower(s) will have an older, stale view of the data. Therefore while providing time for the query on the follower to finish by setting max_standby_archive_delay and max_standby_streaming_delay makes sense, keep both of these caveats in mind:
the value of the follower as a standby / backup diminishes
any other queries running on the follower may return stale data.
If the value of the follower for backup ends up being too much in conflict with hosting queries, one solution would be multiple followers, each optimized for one or the other.
Also, note that several queries in a row can cause the application of wal entries to keep being delayed. So when choosing the new values, it’s not just the time for a single query, but a moving window that starts whenever a conflicting query starts, and ends when the wal entry is finally applied.

How to resolve slow downs induced by HADR_FLAGS = STANDBY_RECV_BLOCKED?

We had a severe slow down of our applications in our HADR environment. We are seeing the following when we run db2pd -hadr:
HADR_FLAGS = STANDBY_RECV_BLOCKED
STANDBY_RECV_BUF_PERCENT = 100
STANDBY_SPOOL_PERCENT = 100
These recovered later and seem better now with STANDBY_SPOOL_PERCENT coming down gradually. Can you please help understand the implications of the values of the above parameters and what needs to be done to ensure we don't get into such a situation?
This isssue is most likely triggered by a peak amount of transactions occuring on the primary. The standby receive buffer and spool got saturated. Unless you are running with configuration parameter HADR_SYNCMODE in SUPERASYNC mode, you could fall into this situation. The slow down on application was induced by the primary waiting for an acknowledgement from the standby that it had received the log file, but since its spool and receive buffers were full at the time, the standby was delaying this acknowledgement.
You could consider setting HADR_SYNCMODE to SUPERASYNC, but this would also mean that the system will be more vulnerable to data loss should there be a failure on the primary. To manage these temporary peaks, you can make either of the following configuration changes:
Increase the size of the log receive buffer on the standby database
by modifying the value of the DB2_HADR_BUF_SIZE registry
variable.
Enable log spooling on a standby database by setting the
HADR_SPOOL_LIMIT
For further details, you can refer to HADR Performance Guide

Postgres connection should be always on? or connect before running each query?

I am debating if I should keep my postgres connection always on, and check/re-connect before running query. Or I should connect it before run each query and close the connection as soon as it is done. Thanks!
As long as the Postgres server isn't totally jammed with connections (i.e. this is not an app that will be creating a gigantic number of perpetual connections), I don't think it's a problem to maintain the connection. I would also recommend checking the connection and handling reconnects prior to each query however. Many libraries offer ways to do this. For example, with MyBatis (Java), you can have it issue a test query each time, which can be specified. I use the lightweight SELECT 1 for this.
I would say the key thing to consider is to keep the connection idle in transaction for as little time as possible, as when that happens, it can have a variety of different impacts on performance (such as slowing down other queries, preventing high-turnover tables from being vacuumed in a timely manner, etc.). This is not to say that any time spent in idle in transaction is automatically bad, but it should be considered and minimized where possible. (e.g. if you have some calculations that are going several minutes to run, make sure to either commit or rollback prior to doing those (which one would depend on context).
If you're doing a bunch of SELECTs, and don't have anything you need to commit, I would recommend doing a rollback to help keep the idle in transaction states to a minimum.
I just realized the postgres connection string has a bunch of setting for the connection pooling, for example:
User ID=root;Password=myPassword;Host=localhost;Port=5432;Database=myDataBase;
Pooling=true;Min Pool Size=0;Max Pool Size=100;Connection Lifetime=0;
So in my code, I can just close the connection after the command finished execution. But behind the scene, the connection is actually still alive and stored in the connection pool to be used again.

Postgres: Concurrent queries in a connection

In Postgres is there a limitation of having just one executing query per connection (and other queries in the connection wait for the first to complete before they start)? I think I am seeing this in one driver so I want to be sure this is a db behavior and not a specific driver limitation.
In Postgres is there a limitation of having just one executing query per connection
Yes. PostgreSQL doesn't let you suspend and resume transactions, nor does it support background (asynchronous) queries on the server back-end.
You can still run multiple concurrent queries, you just need one connection per concurrent query. You can use threads (one thread per connection) but it's usually better to use asynchronous query interfaces in your client library.
Without knowing what you're trying to achieve, and what what programming language (and thus what client library) you're using it's hard to offer more detailed advice.

wait for transactional replication in ADO.NET or TSQL

My web app uses ADO.NET against SQL Server 2008. Database writes happen against a primary (publisher) database, but reads are load balanced across the primary and a secondary (subscriber) database. We use SQL Server's built-in transactional replication to keep the secondary up-to-date. Most of the time, the couple of seconds of latency is not a problem.
However, I do have a case where I'd like to block until the transaction is committed at the secondary site. Blocking for a few seconds is OK, but returning a stale page to the user is not. Is there any way in ADO.NET or TSQL to specify that I want to wait for the replication to complete? Or can I, from the publisher, check the replication status of the transaction without manually connecting to the secondary server.
[edit]
99.9% of the time, The data in the subscriber is "fresh enough". But there is one operation that invalidates it. I can't read from the publisher every time on the off chance that it's become invalid. If I can't solve this problem under transactional replication, can you suggest an alternate architecture?
There's no such solution for SQL Server, but here's how I've worked around it in other environments.
Use three separate connection strings in your application, and choose the right one based on the needs of your query:
Realtime - Points directly at the one master server. All writes go to this connection string, and only the most mission-critical reads go here.
Near-Realtime - Points at a load balanced pool of subscribers. No writes go here, only reads. Used for the vast majority of OLTP reads.
Delayed Reporting - In your environment right now, it's going to point to the same load-balanced pool of subscribers, but down the road you can use a technology like log shipping to have a pool of servers 8-24 hours behind. These scale out really well, but the data's far behind. It's great for reporting, search, long-term history, and other non-realtime needs.
If you design your app to use those 3 connection strings from the start, scaling is a lot easier, especially in the case you're experiencing.
You are describing a synchronous mirroring situation. Replication cannot, by definition, support your requirement. Replication must wait for a transaction to commit before reading it from the log and delivering it to the distributor and from there to the subscriber, which means replication by definition has a window of opportunity for data to be out of sync.
If you have a requirement an operation to read the authorithative copy of the data, then you should make that decission in the client and ensure you read from the publisher in that case.
While you can, in threory, validate wether a certain transaction was distributed to the subscriber or not, you should not base your design on it. Transactional replication makes no latency guarantee, by design, so you cannot rely on a 'perfect day' operation mode.