How to timeout idle user sessions in Redshift? - postgresql

I am looking for a way to terminate user sessions that have been inactive or open for an arbitrary amount of time in Redshift. I noticed that in STV_SESSIONS I have a large number of sessions open, often for the same user, sometimes having been initialized days earlier. While I understand that this might be a symptom of a larger issue with the way some things close out of Redshift, I was hoping for a configurable timeout solution.
In the AWS documentation I found PG_TERMINATE_BACKEND (http://docs.aws.amazon.com/redshift/latest/dg/PG_TERMINATE_BACKEND.html), but I was hoping for a more automatic solution.

The WLM timeout only times out queries, not sessions. From the documentation:
Timeout (ms)
The maximum time, in milliseconds, queries can run before being canceled. If a read-only query, such as a SELECT statement, is canceled due to a WLM timeout, WLM attempts to route the query to the next matching queue based on the WLM Queue Assignment Rules. If the query doesn't match any other queue definition, the query is canceled; it is not assigned to the default queue. For more information, see WLM Query Queue Hopping. WLM timeout doesn’t apply to a query that has reached the returning state. To view the state of a query, see the STV_WLM_QUERY_STATE system table.
JSON property: max_execution_time

You can use the Workload Management (WLM) configuration in Amazon Redshift, where you can set the user group, query group, and timeout. You can group similar users together, assign the group a name, and set the timeout for that group. This is how I do it: set up the query queues based on your priorities, then set the concurrency level and the timeout (in ms) for the user group.
For more information, you can refer to AWS documentation.
Source - Workload Management
- Configuring Workload Management
It's pretty easy and straightforward.
If I’ve made a bad assumption please comment and I’ll refocus my answer.

You can use the idle session timeout feature that was recently introduced in Redshift. It can be set both when creating a user (CREATE USER) and afterwards (ALTER USER). Look up the SESSION TIMEOUT parameter.
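A minimal sketch of how such a statement could be generated from a script. It assumes the SESSION TIMEOUT limit is given in seconds and that the allowed range is 60 to 1,728,000 seconds (that range is from memory, so check it against the current ALTER USER reference). The helper name and the user name are hypothetical.

```python
import re

# Assumed bounds for SESSION TIMEOUT, in seconds (1 minute to 20 days).
# Verify these against the current Redshift documentation.
MIN_TIMEOUT_S = 60
MAX_TIMEOUT_S = 1_728_000

# Only accept plain identifiers so the name can be interpolated safely.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def session_timeout_sql(user: str, timeout_s: int) -> str:
    """Build an ALTER USER statement that sets an idle session timeout."""
    if not _IDENTIFIER.match(user):
        raise ValueError(f"unexpected user name: {user!r}")
    if not MIN_TIMEOUT_S <= timeout_s <= MAX_TIMEOUT_S:
        raise ValueError(f"timeout out of range: {timeout_s}")
    return f"ALTER USER {user} SESSION TIMEOUT {timeout_s};"


# Example: time out idle sessions of a (hypothetical) user after 30 minutes.
print(session_timeout_sql("report_user", 1800))
```

The generated statement still has to be run by a superuser or the owner of the account, through whatever SQL client you already use.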

Related

Discrepancy between Redshift data api DescribeStatement status and console status

I'm loading data into Redshift, which usually takes about an hour when successful but sometimes seems to time out at random. I continue to get a "STARTED" status from DescribeStatement calls for my query, but when I look in the console it says the query was ABORTED and rolled back via an "Undoing 1 transactions on table ..." statement. I'm not finding any errors in STL_LOAD_ERRORS related to the query, or anything useful in STL_UTILITYTEXT for that transaction, though the STL_UNDONE view does show the rollback.
I would've expected DescribeStatement to update with "FAILED" or "ABORTED" status when this occurred but that doesn't seem to be the case. Any idea what is causing the load to fail without any errors? Is there a way to catch/handle this via redshift data api? I'm currently thinking of checking STL_UNDONE after a specified time but was hoping there's a better solution.
Statement timeout seems like a likely cause. What you are describing sounds like the connection closed out from under the executing statement. There are a number of places where this timeout can come from, but common ones are the cluster configuration and the WLM configuration.
Another possibility is a network timeout. Database connections stay open for the entirety of the session, but while a statement is in flight there is no activity on the connection. Some network equipment sees this, assumes that something is wrong, and closes the connection, which closes the session, which aborts the transaction in flight.
If your issue is caused by the connection closing, you may be able to line things up in stl_sessions. There is info in there about timeouts, and you can also see whether the time the session closes matches the time the query commands abort.
Just one area that could be causing your issue but is more common than people think.
After escalating to AWS support, it was confirmed that there was a bug on their end, related to Data API autoscaling protocols that sometimes scaled down without waiting for outstanding tasks to complete. There's a temporary fix in place to prevent this while they implement a long-term solution, which should hopefully be rolled out by the end of this month, June 2022.
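For what it's worth, the "check after a specified time" idea from the question could be sketched as a polling loop with an overall deadline. This is only a sketch: get_status is a hypothetical callable that you would back with a DescribeStatement call, and "TIMED_OUT" is a sentinel invented here, not a Data API status.

```python
import time
from typing import Callable

# Statuses after which the Data API statement will not change any more.
TERMINAL = {"FINISHED", "FAILED", "ABORTED"}


def wait_for_statement(
    get_status: Callable[[], str],
    timeout_s: float,
    poll_s: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until a terminal status is seen or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            # The status never left a non-terminal state. The caller should
            # verify out of band, for example by checking STL_UNDONE.
            return "TIMED_OUT"
        sleep(poll_s)
```

Injecting clock and sleep keeps the loop testable without waiting in real time. On "TIMED_OUT" the caller can fall back to the STL_UNDONE check mentioned in the question.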

How to avoid long delay before finally getting "40001 could not serialize access due to concurrent update"

We have a Postgres 12 system running one master and two async hot-standby replica servers, and we use SERIALIZABLE transactions. All the database servers have very fast SSD storage for Postgres and 64 GB of RAM. Clients connect directly to the master server if they cannot accept delayed data for a transaction. Read-only clients that accept data up to 5 seconds old use the replica servers for querying data. Read-only clients use REPEATABLE READ transactions.
I'm aware that because we use SERIALIZABLE transactions Postgres might give us false positive matches and force us to repeat transactions. This is fine and expected.
However, the problem I'm seeing is that randomly a single line INSERT or UPDATE query stalls for a very long time. As an example, one error case was as follows (speaking directly to master to allow modifying table data):
A simple single row insert
insert into restservices (id, parent_id, ...) values ('...', '...', ...);
stalled for 74.62 seconds before finally emitting error
ERROR 40001 could not serialize access due to concurrent update
with error context
SQL statement "SELECT 1 FROM ONLY "public"."restservices" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
We log all queries exceeding 40 ms so I know this kind of stall is rare. Like maybe a couple of queries a day. We average around 200-400 transactions per second during normal load with 5-40 queries per transaction.
After finally getting the above error, the client code automatically released two savepoints, rolled back the transaction and disconnected from database (this cleanup took 2 ms total). It then reconnected to database 2 ms later and replayed the whole transaction from the start and finished in 66 ms including the time to connect to the database. So I think this is not about performance of the client or the master server as a whole. The expected transaction time is between 5-90 ms depending on transaction.
Is there some PostgreSQL connection or master configuration setting that I can use to make PostgreSQL return the error 40001 faster, even if it causes more transactions to be rolled back? Does anybody know if setting
set local statement_timeout='250'
within the transaction has dangerous side-effects? According to the documentation https://www.postgresql.org/docs/12/runtime-config-client.html "Setting statement_timeout in postgresql.conf is not recommended because it would affect all sessions" but I could set the timeout only for transactions by this client that's able to automatically retry the transaction very fast.
Is there anything else to try?
It looks like someone had the parent row to the one you were trying to insert locked. PostgreSQL doesn't know what to do about that until the lock is released, so it blocks. If you failed rather than blocking, and upon failure retried the exact same thing, the same parent row would (most likely) still be locked and so would just fail again, and you would busy-wait. Busy-waiting is not good, so blocking rather than failing is generally a good thing here. It blocks and then unblocks only to fail, but once it does fail a retry should succeed.
An obvious exception to blocking being better than failing is when you can pick a different parent row on retry, if that makes sense in your context. In this case, maybe the best thing to do is to explicitly lock the parent row with NOWAIT before attempting the insert. That way you can perhaps deal with failures in a more nuanced way.
If you must retry with the same parent_id, then I think the only real solution is to figure out who is holding the parent row lock for so long, and fix that. I don't think that setting statement_timeout would be hazardous, but it also wouldn't solve your problem, as you would probably just keep retrying until the lock on the offending row is released. (Setting it on the other session, the one holding the lock, might be helpful, depending on what that session is doing while the lock is held.)
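The client-side replay described in the question can be sketched as a small retry loop. This is a minimal sketch, not tied to any driver: SerializationFailure is a hypothetical stand-in for whatever exception your driver raises for SQLSTATE 40001, and the backoff values are arbitrary.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SerializationFailure(Exception):
    """Hypothetical stand-in for a driver error carrying SQLSTATE 40001."""

    sqlstate = "40001"


def run_with_retry(
    txn: Callable[[], T],
    max_attempts: int = 5,
    base_delay_s: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Replay the whole transaction when it fails with SQLSTATE 40001."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return txn()
        except SerializationFailure:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter, so that retries against a
            # row that is still locked do not turn into a busy-wait.
            delay = base_delay_s * (2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.5))
    raise AssertionError("unreachable")
```

Here txn is expected to open the transaction, do all of its work, and commit, so that each attempt starts from scratch. As noted above, this does not shorten the wait on the lock itself; it only keeps the replay logic in one place.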

EF core request cannot wake-up Azure Sql (serverless sku) database and times out

I'm using EF Core with one of my apps to query an Azure Sql database. It's the serverless sku, that scales down to zero (goes to sleep) after 1h of inactivity.
Now, in that app there is a scheduled function that queries the database at certain points in time. This often happens while the DB is sleeping. To compensate for this, I'm using the following in DbContext.cs:
optionsBuilder.UseSqlServer(connection, opt => opt.EnableRetryOnFailure(
    maxRetryCount: 20,
    maxRetryDelay: TimeSpan.FromSeconds(30),
    errorNumbersToAdd: null
));
If the delay is evenly distributed, that results in an avg of 15s, with 20 retries => timeout after 5mins.
I thought this should be plenty, since when querying a sleeping database with SSMS it usually takes well under 1 min to get going. However, this is not the case: the functions regularly time out and the queries fail.
Is there a better way to deal with this than increasing the timeout even further? Should 5 mins really not be enough?
Cheers
I think I got it working now. The EF Core snippet above applies to command timeouts. However, since the database was sleeping during the request, it was actually a connection timeout issue. I fixed this by adding Connect Timeout=120 to the connection string itself.
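If the connection string comes from configuration and you want to enforce the timeout in a deployment script, a small helper along these lines could do it. This is a rough sketch: it assumes a plain semicolon-separated string without quoted values containing semicolons, it treats "Connection Timeout" and "Timeout" as synonyms of "Connect Timeout" (an assumption about SqlClient keyword aliases), and the server and database names are made up.

```python
def with_connect_timeout(conn_str: str, seconds: int) -> str:
    """Return conn_str with its Connect Timeout set to `seconds`."""
    if seconds < 0:
        raise ValueError("timeout must be non-negative")
    # Keywords assumed to mean the same thing in SqlClient connection strings.
    aliases = {"connect timeout", "connection timeout", "timeout"}
    parts = []
    for part in conn_str.split(";"):
        if not part.strip():
            continue  # skip empty segments such as the one after a trailing ';'
        key = part.split("=", 1)[0].strip().lower()
        if key not in aliases:
            parts.append(part.strip())
    parts.append(f"Connect Timeout={seconds}")
    return ";".join(parts) + ";"


# Example with a made-up server and database name.
print(with_connect_timeout(
    "Server=tcp:example.database.windows.net;Database=app;Connect Timeout=30;",
    120,
))
```

Any existing timeout entry is dropped and a single Connect Timeout is appended, so the result does not depend on which duplicate key the driver would have honoured.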

How to monitor async streaming replica delay from the slave?

We have a system with PostgreSQL 12.x where all changes are being written to master database server and two read-only streaming async replicas are used to reduce load from the master server for read-only transactions that can deal with slight delay.
Because the async replica may be delayed from the master in some cases we need a method to query the latency (delay) for the replication. We do not want to contact the master server to do this so one obvious way to do this is to query delay from the replica server:
select
(extract(epoch from now()) - extract(epoch from last_msg_send_time)) * 1000 as delay_ms
from pg_stat_wal_receiver;
However, it seems that pg_stat_wal_receiver has no data for our slave machines. It does have one row, but only the pid column has data and every other column is empty. The documentation is unclear about the details, but could it be that pg_stat_wal_receiver has data only for a sync streaming replica?
Is there a way to figure out streaming delay of async replica? I'm hoping this is just some kind of configuration error instead of "this is not supported".
All the server machines are running PostgreSQL 12.2 but the client machines are still running PostgreSQL 9.5 client library in case it makes a difference.
I think I can answer the question about the missing columns of pg_stat_wal_receiver. To read the rest of the columns, you need to log in as a superuser or as a login role that has been granted the pg_read_all_stats role.
This behavior is documented in the source code of walreceiver.c, where the implementation of pg_stat_get_wal_receiver says:
...
/*
* Only superusers and members of pg_read_all_stats can see details.
* Other users only get the pid value to know whether it is a WAL
* receiver, but no details.
*/
...
I don't understand why the view pg_stat_wal_receiver does not have data, but here's a workaround for the missing latency data:
select now() - pg_last_xact_replay_timestamp() as replication_lag;
or if you want the lag as milliseconds (plain number):
select round(extract(epoch from (now() - pg_last_xact_replay_timestamp())*1000)) as replication_lag_ms;
Note that this uses function pg_last_xact_replay_timestamp() (emphasis mine):
Get time stamp of last transaction replayed during recovery. This is
the time at which the commit or abort WAL record for that transaction
was generated on the primary. If no transactions have been replayed
during recovery, this function returns NULL. Otherwise, if recovery is
still in progress this will increase monotonically. If recovery has
completed then this value will remain static at the value of the last
transaction applied during that recovery. When the server has been
started normally without recovery the function returns NULL.
However, it seems that async streaming replication does advance this timestamp continuously when the system is under normal load (active writing on the master). It's still unclear whether this timestamp stops advancing if the master has no changes but the streaming replication is active.
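On the client side, the same calculation and the NULL case can be handled along these lines. This is a minimal sketch: the two timestamps are assumed to come from now() and pg_last_xact_replay_timestamp() on the replica, and the 5000 ms threshold is a hypothetical default, not something PostgreSQL defines.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def replication_lag_ms(now: datetime, last_replay: Optional[datetime]) -> Optional[int]:
    """Lag in whole milliseconds, or None when the replay timestamp is NULL."""
    if last_replay is None:
        # Nothing replayed yet, or the server was started without recovery.
        return None
    return round((now - last_replay).total_seconds() * 1000)


def is_stale(lag_ms: Optional[int], max_lag_ms: int = 5000) -> bool:
    """Treat an unknown lag as stale so callers can fall back to the master."""
    return lag_ms is None or lag_ms > max_lag_ms


# Example with fixed timestamps: the replica is 1250 ms behind.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
lag = replication_lag_ms(now, now - timedelta(milliseconds=1250))
print(lag, is_stale(lag))
```

Since it is unclear how the timestamp behaves while the master is idle, a large value here may mean either real lag or simply no recent writes.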

How to set wait timeout for a queue in Amazon redshift?

I know there is a WLM timeout, which fires when a query 'executes' for longer than that time. But can I set a timeout for the amount of time a query waits in the queue?
You can control the amount of time a query spends waiting in the queue indirectly by specifying the statement_timeout configuration parameter at the session or cluster level, in addition to the max_execution_time parameter at the WLM level. If both the WLM timeout (max_execution_time) and statement_timeout are specified, the shorter timeout is used. In this case, the maximum time a query can wait in the queue is statement_timeout minus max_execution_time.
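Taking the rule above at face value, the arithmetic looks like this. It is only a worked example of that formula, with both values in milliseconds; the function name and the numbers are made up, and it says nothing about how Redshift accounts for the time internally.

```python
def max_queue_wait_ms(statement_timeout_ms: int, max_execution_time_ms: int) -> int:
    """Upper bound on queue wait under the rule described above."""
    if statement_timeout_ms <= 0 or max_execution_time_ms <= 0:
        raise ValueError("both timeouts must be positive")
    # If statement_timeout is the shorter of the two, nothing is left
    # over for waiting in the queue.
    return max(0, statement_timeout_ms - max_execution_time_ms)


# statement_timeout of 90 s with a WLM timeout of 60 s
# leaves at most 30 s of queue wait.
print(max_queue_wait_ms(90_000, 60_000))
```

So to allow roughly N ms of queueing under this rule, set statement_timeout to max_execution_time plus N.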
You can modify your WLM configuration to create separate queues for the queries on the basis of time they require to run and at runtime, you can route queries to the queues according to user groups or query groups. Hope that is what you want.