We recently saw a few queries "idle in transaction" for quite some time
pid | usename | state | duration | application_name | wait_event | wait_event_type
------+---------+---------------------+----------+------------------+------------+----------------
31620 | results | idle in transaction | 12:52:23 | bin/rails | |
That is almost 13 hours idle in transaction.
Any idea what causes them to get stuck in idle, or how to dig deeper? We did notice some OOM errors for background jobs.
There are also a lot of "idle" queries, but thanks for the comments, those seem to be fine:
In postgresql "idle in transaction" with all locks granted #LaurenzAlbe was pointing out the idle session timeout configuration option as a band-aid, but I'd rather understand this issue than hide it.
thanks!
PS: our application is ruby on rails and we use a mix of active record and custom SQL
EDIT: original title was "idle in transaction", the queries are actually just idle most of the time and not in transaction, sorry about that
EDIT #2: found the 13 hour idle in transaction process
These sessions are actually all idle, so they are no problem.
idle is significantly different from idle in transaction: the latter is an open transaction that holds locks and blocks VACUUM, the first is harmless.
The OOM errors must have a different reason.
You should configure the machine so that
shared_buffers + max_connections * work_mem <= available RAM
Related
I have a long running idle query that is not automatically terminated.
I have set both the max timeouts to 2h (very long I know)
> select name,setting from pg_settings where name='statement_timeout' OR name='idle_in_transaction_session_timeout';
name | setting
-------------------------------------+---------
idle_in_transaction_session_timeout | 7200000
statement_timeout | 7200000
However I have this idle query (not idle_in_transaction) that is leftover from an application that crashed
> SELECT pid, age(clock_timestamp(), query_start), state, usename, query
FROM pg_stat_activity
WHERE query NOT ILIKE '%pg_stat_activity%'
ORDER BY query_start desc;
17117 | 02:11:40.795487 | idle | ms1-user | select distinct ....
Postgres 11.13 running on AWS Aurora
Can anyone explain why/what's missing?
As the name suggests, idle_in_transaction_session_timeout does not terminate idle sessions, but sessions that are "idle in transaction". For the latter, you can use idle_session_timeout introduced in PostgreSQL v14.
In your case, the problem are the TCP keepalive settings. With the default keepalive settings on Linux, it takes the server around 2 hours and 14.5 minutes to figure out that the other end of the connection is no longer there. So wait a few minutes more :^)
If you want to reduce the time, you can set the PostgreSQL parameters tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count if Amazon allows you to do that. If they don't, complain.
I'm running the following query:
select *
from pg_stat_activity
and getting:
dataid dataname username waitevent query
16384 my_db pos12 clientRead query_1
16384 my_db pos12 clientRead query_2
16384 my_db pos12 clientRead query_3
where query_1, query_2, query_3 are some select queries
There are no applications which running and querying the database.
Does it mean that those queries are old stuck queries?
How can I know if there are old stuck queries?
Do stuck queries occupied DB resources?
Sessions that are in state ClientRead are idle, waiting for the next command from the client.
Whether that is a problem from the database side or not depends on the state:
if the state is idle, the session is not consuming any resources (save for some little memory) and is quite harmless
if the state is idle in transaction for a longer time, the sessions hold locks and block database maintenance – this would be an application bug that has to be fixed there
At any rate, if these are all your sessions, then the database is idle, and if anything is stuck, it is in the client side.
I support an application hosted by a small business, web-based ROR app using pgsql database on the backend.
Postgres is setup for replication to an off-site standby server, which as far as I can tell is working fine, when I query the remote server it shows that it's in recovery, etc.
From the 'master' server:
postgres=# table pg_stat_replication ;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start
| state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+---------+------------------+----------------+-----------------+-------------+-----------------------
--------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
18660 | 1281085 | rep | postgresql2 | 192.168.81.155 | | 43824 | 2017-05-07 11:42:43.15
0057-04 | streaming | 3/B5243418 | 3/B5243418 | 3/B5243418 | 3/B5243150 | 1 | sync
(1 row)
...and on the 'slave':
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)
postgres=# select now() - pg_last_xact_replay_timestamp() AS replication_delay;
replication_delay
-------------------
01:02:14.885511
(1 row)
I understand the process involved should I have to promote my remote slave DB to the role of master, but the problem I seem to have is that on 2 or 3 occasions now the network link to the remote slave server has gone down, and the application completely "freezes up" (e.g. page loads but will not allow users to logon), despite the fact that the master DB is still up and running. I have wal archiving enabled to make sure that when something like this happens the data is preserved until the link is restored and the transaction logs can be sent...but I don't understand why my master pgsql instance seems to lockup because the slave instance goes offline...kind of defeats the entire concept of replication, so I assume I must be doing something wrong?
The most likely explanation is that you are using synchronous replication with just two nodes.
Is synchronous_standby_names set on the master server?
If the only synchronous standby server is not available, no transaction can commit on the master, and data modifying transactions will “hang”, which would explain the behaviour you observe.
For synchronous replication you need at lest two slaves.
1 S postgres 5038 876 0 80 0 - 11962 sk_wai 09:57 ? 00:00:00 postgres: postgres my_app ::1(45035) idle
1 S postgres 9796 876 0 80 0 - 11964 sk_wai 11:01 ? 00:00:00 postgres: postgres my_app ::1(43084) idle
I see a lot of them. We are trying to fix our connection leak. But meanwhile, we want to set a timeout for these idle connections, maybe max to 5 minute.
It sounds like you have a connection leak in your application because it fails to close pooled connections. You aren't having issues just with <idle> in transaction sessions, but with too many connections overall.
Killing connections is not the right answer for that, but it's an OK-ish temporary workaround.
Rather than re-starting PostgreSQL to boot all other connections off a PostgreSQL database, see: How do I detach all other users from a postgres database? and How to drop a PostgreSQL database if there are active connections to it? . The latter shows a better query.
For setting timeouts, as #Doon suggested see How to close idle connections in PostgreSQL automatically?, which advises you to use PgBouncer to proxy for PostgreSQL and manage idle connections. This is a very good idea if you have a buggy application that leaks connections anyway; I very strongly recommend configuring PgBouncer.
A TCP keepalive won't do the job here, because the app is still connected and alive, it just shouldn't be.
In PostgreSQL 9.2 and above, you can use the new state_change timestamp column and the state field of pg_stat_activity to implement an idle connection reaper. Have a cron job run something like this:
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'regress'
AND pid <> pg_backend_pid()
AND state = 'idle'
AND state_change < current_timestamp - INTERVAL '5' MINUTE;
In older versions you need to implement complicated schemes that keep track of when the connection went idle. Do not bother; just use pgbouncer.
In PostgreSQL 9.6, there's a new option idle_in_transaction_session_timeout which should accomplish what you describe. You can set it using the SET command, e.g.:
SET SESSION idle_in_transaction_session_timeout = '5min';
In PostgreSQL 9.1, the idle connections with following query. It helped me to ward off the situation which warranted in restarting the database. This happens mostly with JDBC connections opened and not closed properly.
SELECT
pg_terminate_backend(procpid)
FROM
pg_stat_activity
WHERE
current_query = '<IDLE>'
AND
now() - query_start > '00:10:00';
if you are using postgresql 9.6+, then in your postgresql.conf you can set
idle_in_transaction_session_timeout = 30000 (msec)
There is a timeout on broken connections (i.e. due to network errors), which relies on the OS' TCP keepalive feature. By default on Linux, broken TCP connections are closed after ~2 hours (see sysctl net.ipv4.tcp_keepalive_time).
There is also a timeout on abandoned transactions, idle_in_transaction_session_timeout and on locks, lock_timeout. It is recommended to set these in postgresql.conf.
But there is no timeout for a properly established client connection. If a client wants to keep the connection open, then it should be able to do so indefinitely. If a client is leaking connections (like opening more and more connections and never closing), then fix the client. Do not try to abort properly established idle connections on the server side.
A possible workaround that allows to enable database session timeout without an external scheduled task is to use the extension pg_timeout that I have developped.
Another option is set this value "tcp_keepalives_idle". Check more in documentation https://www.postgresql.org/docs/10/runtime-config-connection.html.
What does it mean when a PostgreSQL process is "idle in transaction"?
On a server that I'm looking at, the output of "ps ax | grep postgres" I see 9 PostgreSQL processes that look like the following:
postgres: user db 127.0.0.1(55658) idle in transaction
Does this mean that some of the processes are hung, waiting for a transaction to be committed? Any pointers to relevant documentation are appreciated.
The PostgreSQL manual indicates that this means the transaction is open (inside BEGIN) and idle. It's most likely a user connected using the monitor who is thinking or typing. I have plenty of those on my system, too.
If you're using Slony for replication, however, the Slony-I FAQ suggests idle in transaction may mean that the network connection was terminated abruptly. Check out the discussion in that FAQ for more details.
As mentioned here: Re: BUG #4243: Idle in transaction it is probably best to check your pg_locks table to see what is being locked and that might give you a better clue where the problem lies.