Serious issue with balance update in Aurora Postgres Serverless - postgresql

I am using Aurora Serverless v10.x.
Webhook data is streamed into AWS SQS, which triggers a Lambda function to process it.
NOTE: the Lambda is set to "concurrency=1" to ensure only one invocation runs at any time.
When the incoming data is just one record every few seconds, the balance update is fine.
But when the input is a batch of 5 records (just 5), I am seeing some weird updates to the balance field.
Has anyone faced such an issue? Any tips?
The critical SQL is a single statement, shown below. It takes 20 milliseconds per execution, yet the next transaction is not able to see the updated balance and update it correctly.
with new_balance as (
    update accounts
    set account_balance = account_balance - 1.0,
        reserved_balance = reserved_balance - 1.0
    where iban = '49'
    returning account_balance as new_acct_balance
)
insert into ledger_transactions (iban, end_to_end_txn_id, amount, balance)
select '49', 'b946733af9', 1.0, new_acct_balance
from new_balance;
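For comparison, the same logic written as an explicit transaction that locks the account row first would look roughly like the sketch below (this is just a sketch, not the code I am running; the table and column names are the ones from the statement above):

begin;

-- lock the account row so a concurrent transaction cannot read a stale balance
select account_balance
from accounts
where iban = '49'
for update;

update accounts
set account_balance = account_balance - 1.0,
    reserved_balance = reserved_balance - 1.0
where iban = '49';

insert into ledger_transactions (iban, end_to_end_txn_id, amount, balance)
select '49', 'b946733af9', 1.0, account_balance
from accounts
where iban = '49';

commit;

With the lock taken explicitly, any concurrent writer on the same IBAN has to wait for the commit before it can read the balance, which makes the ordering easier to reason about while debugging.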
What have I tried so far:
1. I have tried all transaction isolation levels (read committed, repeatable read, serializable, and read uncommitted), both at the database level and at the transaction level (see the sketch after this list).
2. Tried delaying the SQS queue so it delivers one payload every 5 seconds. Weirdly, the insert timestamps of the rows in the ledger_transactions table show them arriving in rapid succession; the 5-second gap I expected to see between rows if SQS really delivered one message every 5 seconds is not there.
3. Wrote the above SQL as several individual statements.
4. If I create a "temp_account_balance" table, force-insert the incoming data into it, and compute and populate account_balance from there, that works fine. But this is unnecessary overkill.

Related

Issue in creating refreshed stream

I have a use case that can be described as follows:
A dump that is generated each day at 4 am.
An online stream that runs from 12 am for 24 hours.
We use the dump as a lookup: any content that exists in both the online stream and the dump gets a special offer. But our proposed solution is limited: we created a stream that joins the lookup dump stream with the online stream over 24 hours, and there is a gap because the dump is not ready before 4 am, so the join finds nothing during those 4 hours, and if we widen the window period we lose each day's refreshed data.
Any help?
This question was cross-posted on the ksqlDB GitHub page: https://github.com/confluentinc/ksql/issues/7935
Copying my reply from there:
This is tricky... You need to understand that joins have temporal semantics. For a stream-table join, this means a stream record joins against the table "version" that is valid at the event-time of the stream record. Thus, if your table updates are timestamped at 4 am, all stream records from 12 am to 4 am happen before the table update and are not eligible to "see" those updates.
And even if you could change the timestamp of the table updates to 12 am, the issue is that ksqlDB does not support a GRACE PERIOD for stream-table joins yet (we are already working on this though)...
The bottom line might be that, to work around the current ksqlDB limitations, you would need to change your upstream ingestion to "wait" until all table updates have been published before you publish the stream updates... Not sure if this is possible in your end-to-end setup.
To learn more about temporal join semantics, check out this Kafka Summit talk: https://www.confluent.io/events/kafka-summit-europe-2021/temporal-joins-in-kafka-streams-and-ksqldb/

How to avoid long delay before finally getting "40001 could not serialize access due to concurrent update"

We have a Postgres 12 system running one master and two async hot-standby replica servers, and we use SERIALIZABLE transactions. All the database servers have very fast SSD storage for Postgres and 64 GB of RAM. Clients connect directly to the master server if they cannot accept delayed data for a transaction. Read-only clients that accept data up to 5 seconds old use the replica servers for querying data. Read-only clients use REPEATABLE READ transactions.
I'm aware that because we use SERIALIZABLE transactions, Postgres might give us false positive serialization failures and force us to repeat transactions. This is fine and expected.
However, the problem I'm seeing is that randomly a single-row INSERT or UPDATE query stalls for a very long time. As an example, one error case was as follows (speaking directly to the master to allow modifying table data):
A simple single row insert
insert into restservices (id, parent_id, ...) values ('...', '...', ...);
stalled for 74.62 seconds before finally emitting error
ERROR 40001 could not serialize access due to concurrent update
with error context
SQL statement "SELECT 1 FROM ONLY "public"."restservices" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
We log all queries exceeding 40 ms, so I know this kind of stall is rare, maybe a couple of queries a day. We average around 200-400 transactions per second during normal load, with 5-40 queries per transaction.
After finally getting the above error, the client code automatically released two savepoints, rolled back the transaction and disconnected from the database (this cleanup took 2 ms total). It then reconnected to the database 2 ms later and replayed the whole transaction from the start, finishing in 66 ms including the time to connect to the database. So I think this is not about the performance of the client or the master server as a whole. The expected transaction time is between 5-90 ms depending on the transaction.
Is there some PostgreSQL connection or master configuration setting that I can use to make PostgreSQL return the 40001 error faster, even if it causes more transactions to be rolled back? Does anybody know if setting
set local statement_timeout='250'
within the transaction has dangerous side effects? According to the documentation (https://www.postgresql.org/docs/12/runtime-config-client.html), "Setting statement_timeout in postgresql.conf is not recommended because it would affect all sessions", but I could set the timeout only for transactions issued by this client, which is able to automatically retry the transaction very fast.
Is there anything else to try?
It looks like some other session had locked the parent row of the row you were trying to insert. PostgreSQL doesn't know what to do about that until the lock is released, so it blocks. If you failed rather than blocked, and upon failure retried the exact same thing, the same parent row would (most likely) still be locked, so it would just fail again and you would busy-wait. Busy-waiting is not good, so blocking rather than failing is generally a good thing here. It blocks and then unblocks only to fail, but once it does fail, a retry should succeed.
An obvious exception to blocking-being-better-than-failing is when, on retry, you can pick a different parent row to retry with, if that makes sense in your context. In that case, maybe the best thing to do is to explicitly lock the parent row with NOWAIT before attempting the insert; a sketch follows. That way you can deal with failures in a more nuanced way.
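A rough sketch of that approach, assuming the parent row lives in restservices itself (as the error context suggests) and using placeholder bind parameters:

begin;

-- take the same lock strength the FK check uses, but fail immediately with
-- SQLSTATE 55P03 (lock_not_available) if another session holds a conflicting lock
select 1 from restservices where id = :parent_id for key share nowait;

insert into restservices (id, parent_id /* , ... */)
values (:id, :parent_id /* , ... */);

commit;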
If you must retry with the same parent_id, then I think the only real solution is to figure out who is holding the parent row lock for so long, and fix that. I don't think that setting statement_timeout would be hazardous, but it also wouldn't solve your problem, as you would probably just keep retrying until the lock on the offending row is released. (Setting it on the other session, the one holding the lock, might be helpful, depending on what that session is doing while the lock is held.)

EF Core request cannot wake up Azure SQL (serverless SKU) database and times out

I'm using EF Core in one of my apps to query an Azure SQL database. It's the serverless SKU, which scales down to zero (goes to sleep) after 1 hour of inactivity.
Now, in that app there is a scheduled function that queries the database at certain points in time, which often falls in a period where the DB is sleeping. To compensate for this, I'm using the following in DbContext.cs:
optionsBuilder.UseSqlServer(connection, opt => opt.EnableRetryOnFailure(
    maxRetryCount: 20,
    maxRetryDelay: TimeSpan.FromSeconds(30),
    errorNumbersToAdd: null
));
If the delay is evenly distributed, that averages 15 s per retry; with 20 retries that means a timeout after roughly 5 minutes.
I thought this should be plenty, since when querying a sleeping database with SSMS it usually takes well under 1 minute to get going. However, this is not the case: the functions regularly time out and the queries fail.
Is there a better way to deal with this than just increasing the timeout even further? Should 5 minutes really not be enough?
Cheers
I think I got it working now. The above EF Core snippet covers command timeout occurrences, but since the database was sleeping when the request arrived, it was really a connection timeout issue. I fixed it by adding Connect Timeout=120 to the connection string itself.
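For illustration, the change looks roughly like this (server, database, and credentials are placeholders; the retry options are the ones from the snippet above):

// Connect Timeout raises how long the client waits for the initial connection
// (the default is 15 seconds) while the paused serverless database resumes.
var connection = "Server=tcp:myserver.database.windows.net,1433;" +
                 "Database=mydb;User ID=myuser;Password=<secret>;" +
                 "Connect Timeout=120;";

optionsBuilder.UseSqlServer(connection, opt => opt.EnableRetryOnFailure(
    maxRetryCount: 20,
    maxRetryDelay: TimeSpan.FromSeconds(30),
    errorNumbersToAdd: null
));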

How to monitor async streaming replica delay from the slave?

We have a system with PostgreSQL 12.x where all changes are written to the master database server, and two read-only streaming async replicas are used to offload read-only transactions that can tolerate a slight delay.
Because the async replica may in some cases lag behind the master, we need a way to query the replication latency (delay). We do not want to contact the master server for this, so one obvious approach is to query the delay from the replica server:
select
(extract(epoch from now()) - extract(epoch from last_msg_send_time)) * 1000 as delay_ms
from pg_stat_wal_receiver;
However, it seems that pg_stat_wal_receiver has no data on our slave machines. It does have one row, but only the pid column has data and every other column is empty. The documentation is unclear about the details, but could it be that pg_stat_wal_receiver has data only for a sync streaming replica?
Is there a way to figure out the streaming delay of an async replica? I'm hoping this is just some kind of configuration error rather than "this is not supported".
All the server machines are running PostgreSQL 12.2, but the client machines are still running the PostgreSQL 9.5 client library, in case it makes a difference.
I think I can answer the question about the missing columns of pg_stat_wal_receiver. To read the rest of the columns, you need to log in as a superuser or as a role that has been granted the pg_read_all_stats role (see the example after the quoted comment).
This behavior is documented in the source code of walreceiver.c, in the implementation of pg_stat_get_wal_receiver, which says:
...
/*
* Only superusers and members of pg_read_all_stats can see details.
* Other users only get the pid value to know whether it is a WAL
* receiver, but no details.
*/
...
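For completeness, granting that role looks like this (the role name is a placeholder; since a standby is read-only, the GRANT has to be run on the primary and then replicates to the standbys):

-- run as a superuser on the primary; "monitoring_user" is a placeholder role
grant pg_read_all_stats to monitoring_user;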
I don't understand why the pg_stat_wal_receiver view does not have data, but here's a workaround for the missing latency data:
select now() - pg_last_xact_replay_timestamp() as replication_lag;
or if you want the lag as milliseconds (plain number):
select round(extract(epoch from (now() - pg_last_xact_replay_timestamp())*1000)) as replication_lag_ms;
Note that this uses function pg_last_xact_replay_timestamp() (emphasis mine):
Get time stamp of last transaction replayed during recovery. This is
the time at which the commit or abort WAL record for that transaction
was generated on the primary. If no transactions have been replayed
during recovery, this function returns NULL. Otherwise, if recovery is
still in progress this will increase monotonically. If recovery has
completed then this value will remain static at the value of the last
transaction applied during that recovery. When the server has been
started normally without recovery the function returns NULL.
However, it seems that async streaming replication does advance this timestamp continuously when the system is under normal load (active writing on the master). It's still unclear whether this timestamp stops increasing when the master has no changes but streaming replication is active.
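One common way to guard against that (an assumption on my part, not something I have verified on this setup) is to report zero lag whenever the standby has already replayed everything it has received:

-- if receive and replay LSNs match, the standby is caught up, and a growing
-- now() - pg_last_xact_replay_timestamp() just reflects an idle primary
select case
         when pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() then 0
         else round(extract(epoch from (now() - pg_last_xact_replay_timestamp())) * 1000)
       end as replication_lag_ms;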

Kafka Streams topology with windowing doesn't trigger state changes

I am building the following Kafka Streams topology (pseudo code):
gK = builder.stream().groupByKey();
g1 = gK.windowedBy(TimeWindows.of("PT1H")).reduce().mapValues().toStream().mapValues().selectKey();
g2 = gK.reduce().mapValues();
g1.leftJoin(g2).to();
As you can see, this is a diamond-like topology that starts at a single input topic and ends at a single output topic, with messages flowing through two parallel branches that are eventually joined together at the end. One branch applies (tumbling?) windowing, the other does not. Both branches work on the same key (apart from the WindowedKey intermediately introduced by the windowing).
The timestamps of my messages are event-time, i.e. they are picked from the message body by my custom configured TimestampExtractor implementation. The actual timestamps in my messages are several years in the past.
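For reference, such an extractor in Kafka Streams 2.4 looks roughly like the sketch below; the payload class and its timestamp field here are placeholders, not my actual classes:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Sketch of an event-time extractor; MyEvent and its field are stand-ins.
public class BodyTimestampExtractor implements TimestampExtractor {

    // Stand-in for the real message payload class.
    public static class MyEvent {
        public long eventTimeMillis;
    }

    @Override
    public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
        Object value = record.value();
        if (value instanceof MyEvent) {
            return ((MyEvent) value).eventTimeMillis; // event-time from the message body
        }
        return record.timestamp(); // fall back to the record's own timestamp
    }
}

It is wired in via the StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG property.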
That all works well at first sight in my unit tests with a couple of input/output messages and in the runtime environment (with real Kafka).
The problem seems to come when the number of messages starts being significant (e.g. 40K).
My failing scenario is the following:
1. ~40K records with the same key get uploaded into the input topic first.
2. ~40K updates come out of the output topic, as expected.
3. Another ~40K records, all with the same key but a different key than in step 1, get uploaded into the input topic.
4. Only ~100 updates come out of the output topic, instead of the expected new ~40K updates. There is nothing special about those ~100 updates; their contents seem to be right, but only for certain time windows. For other time windows there are no updates, even though the flow logic and input data should definitely generate 40K records. In fact, when I swap the datasets in steps 1 and 3, I get exactly the same situation: ~40K updates coming from the second dataset and the same ~100 from the first.
I can easily reproduce this issue locally in unit tests using TopologyTestDriver (but only with larger numbers of input records).
In my tests, I've tried disabling caching with StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG. Unfortunately, that didn't make any difference.
UPDATE
I tried both reduce() calls and aggregate() calls. The issue persists in both cases.
Something else I'm noticing is that, both with StreamsConfig.TOPOLOGY_OPTIMIZATION set to StreamsConfig.OPTIMIZE and without it, the mapValues() handler gets called in the debugger before the preceding reduce() (or aggregate()) handlers, at least the first time. I didn't expect that.
I tried both join() and leftJoin(); unfortunately, same result.
In the debugger, the second portion of the data doesn't trigger the reduce() handler in the "left" branch at all, but it does trigger the reduce() handler in the "right" branch.
With my configuration, if the number of records in each dataset is 100, the problem doesn't manifest itself and I get the 200 output messages I expect. When I raise the number to 200 per dataset, I get fewer than the expected 400 messages out.
So it seems that "old" windows somehow get dropped and new records for those old windows get ignored by the stream.
There is a window retention setting, but with the default value that I use I expected windows to retain their state and stay active for at least 12 hours (which far exceeds the duration of my unit test run).
I tried amending the left reducer with the following window store config:
Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),
        Duration.ofHours(1),
        false))
Still no difference in the results.
The same issue persists even with only the single "left" branch, without the "right" branch and without join(). It seems the problem is in the window retention settings of my setup. The timestamps (event-time) of my input records span 2 years, and the second dataset starts again from the beginning of those 2 years. This place in Kafka Streams is where the second dataset's records get dropped:
https://github.com/apache/kafka/blob/trunk/streams/src/main/java/org/apache/kafka/streams/state/internals/InMemoryWindowStore.java#L125
The Kafka Streams version is 2.4.0. I am also using Confluent dependencies version 5.4.0.
My questions are:
What could be the reason for such behaviour?
Did I miss anything in my stream topology?
Is such a topology expected to work at all?
After some debugging time I found the reason for my problem.
My input datasets contain records with timestamps that span 2 years. When I load the first dataset, the "observed" time of my stream gets set to the maximum timestamp from that input dataset.
Uploading the second dataset, which starts with records whose timestamps are 2 years before the new observed time, causes the stream internals to drop those messages. This can be seen if you set the Kafka logging level to TRACE.
So, to fix my problem I had to configure the retention and grace period for my windows:
instead of
.windowedBy(TimeWindows.of(windowSize))
I have to specify
.windowedBy(TimeWindows.of(windowSize).grace(Duration.ofDays(5 * 365)))
Also, I had to explicitly configure reducer storage settings as:
Materialized.as(
    Stores.inMemoryWindowStore(
        "rollup-left-reduce",
        Duration.ofDays(5 * 365),
        windowSize,
        false))
That's it, the output is as expected.