Postgres - How to debug/trace 'Idle in transaction' connection - postgresql

I am using Postgres for one of my applications, and sometimes (not very frequently) one of the connections goes into the <IDLE> in transaction state and keeps holding its locks. Other connections then wait on those locks, which ultimately causes my application to hang.
The following is the output from the pg_stat_activity view for that process:
select * from pg_stat_activity
24081 | db | 798 | 16384 | db | | 10.112.61.218 | | 59034 | 2013-09-12 23:46:05.132267+00 | 2013-09-12 23:47:31.763084+00 | 2013-09-12 23:47:31.763534+00 | f | <IDLE> in transaction
This indicates that PID=798 is in the <IDLE> in transaction state. I found the client process on the web server using the client_port (59034) from the output above:
sudo netstat -apl | grep 59034
tcp 0 0 ip-10-112-61-218.:59034 db-server:postgresql ESTABLISHED 23843/pgbouncer
I know that something in my application code is causing the connection to hang (I killed one of the running application cron jobs and that freed the locks), but I am not able to trace it.
It does not happen often, and I can't find definite reproduction steps because it only occurs on the production server.
I would like advice on how to trace such an idle connection, e.g. getting the last executed query or some kind of traceback to identify which part of the code is causing this issue.

If you upgrade to 9.2 or higher, the pg_stat_activity view will show you what the most recent query executed was for idle in transaction connections.
select * from pg_stat_activity \x\g\x
...
waiting | f
state | idle in transaction
query | select count(*) from pg_class ;
You can also (even in 9.1) look in pg_locks to see what locks are being held by the idle in transaction process. If it only holds locks on very commonly used objects, this might not narrow things down much, but a lock on a rarely used object could tell you exactly where in your code to look.
If you are stuck with 9.1, you can perhaps use the debugger to get all but the first 22 characters of the query (the first 22 are overwritten by the <IDLE> in transaction\0 message). For example:
(gdb) printf "%s\n", ((MyBEEntry->st_activity)+22)
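The pg_stat_activity and pg_locks checks described above can be combined in one query. The following is a minimal sketch, not a definitive tool: it assumes PostgreSQL 9.2 or later (where the view has the pid, state and query columns; on 9.1 the equivalents are procpid and current_query) and any DB-API 2.0 connection such as one from psycopg2. The helper name idle_in_transaction_locks is hypothetical.

```python
# Sketch: list sessions that are idle in transaction together with the locks they hold.
# Assumes PostgreSQL 9.2+ column names and a DB-API 2.0 connection object.

IDLE_IN_TRANSACTION_LOCKS_SQL = """
SELECT a.pid,
       a.client_addr,
       a.client_port,
       a.xact_start,
       a.state_change,
       a.query              AS last_query,
       l.locktype,
       l.mode,
       l.granted,
       l.relation::regclass AS locked_relation
FROM pg_stat_activity AS a
JOIN pg_locks AS l ON l.pid = a.pid
WHERE a.state = 'idle in transaction'
ORDER BY a.xact_start
"""


def idle_in_transaction_locks(conn):
    """Return one dict per (session, lock) pair for idle-in-transaction sessions."""
    cur = conn.cursor()
    try:
        cur.execute(IDLE_IN_TRANSACTION_LOCKS_SQL)
        columns = [col[0] for col in cur.description]
        return [dict(zip(columns, row)) for row in cur.fetchall()]
    finally:
        cur.close()
```

Running this periodically (for example from a cron job) and logging the rows gives a record of the last query and the locked relations at the moment a session gets stuck. When the client connects through pgbouncer, as in the question, client_port identifies the pooler rather than the application process.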

Related

Postgres crashes when selecting from view

I have a view in Postgres with the following definition:
CREATE VIEW participant_data_view AS
SELECT participant_profile.*,
"user".public_id, "user".created, "user".status, "user".region,"user".name, "user".email, "user".locale,
(SELECT COUNT(id) FROM message_log WHERE message_log.target_id = "user".id AND message_log.type = 'diary') AS diary_reminder_count,
(SELECT SUM(pills) FROM "order" WHERE "order".user_id = "user".id AND "order".status = 'complete') AS pills
FROM participant_profile
JOIN "user" ON "user".id = participant_profile.id
;
The view creation works just fine. However, when I query the view with SELECT * FROM participant_data_view, Postgres crashes with:
10:24:46.345 WARN HikariPool-1 - Connection org.postgresql.jdbc.PgConnection#172d19fe marked as broken because of SQLSTATE(08006), ErrorCode(0) c.z.h.p.ProxyConnection
org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend.
This question suggests to me that an internal assertion might be causing the crash.
If I remove the diary_reminder_count field from the view definition, the select works just fine.
What am I doing wrong? How can I fix my view, or change it so I can query the same data in a different way?
Note that creating the view works just fine, it only crashes when querying it.
I tried running explain (analyze) select * from participant_data_view; from the IntelliJ query console, which only returns
[2020-12-08 11:13:56] [08006] An I/O error occurred while sending to the backend.
[2020-12-08 11:13:56] java.io.EOFException
Running the same statement in psql returns:
my-database=# explain (analyze) select * from participant_data_view;
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.
The log file contains:
2020-12-08 10:24:01.383 CET [111] LOG: server process (PID 89670) was terminated by signal 9: Killed: 9
2020-12-08 10:24:01.383 CET [111] DETAIL: Failed process was running: select "public"."participant_data_view"."id", "public"."participant_data_view"."study_number", <snip many other fields>,
"public"."participant_data_view"."diary_reminder_count", "public"."participant
2020-12-08 10:24:01.383 CET [111] LOG: terminating any other active server processes
In all likelihood, the Linux kernel out-of-memory killer killed your query because the system ran out of memory.
Either restrict the number of database sessions (for example with a connection pool) or reduce work_mem.
It is usually a good idea to set vm.overcommit_memory = 2 in the kernel and tune vm.overcommit_ratio appropriately.
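To check how often this happens, the server log can be scanned for the postmaster message shown in the question. The sketch below is only an illustration; the function names are hypothetical, and the pattern is taken from the log line quoted above. Signal 9 is consistent with the out-of-memory killer, but a manual kill -9 produces the same message, so the kernel log still has to confirm the cause.

```python
import re

# Matches the postmaster message reporting a backend killed by a signal, e.g.
# "LOG: server process (PID 89670) was terminated by signal 9: Killed: 9"
_SIGNAL_RE = re.compile(r"server process \(PID (\d+)\) was terminated by signal (\d+)")


def find_killed_backends(log_lines):
    """Yield (pid, signal) for each backend the log reports as killed by a signal."""
    for line in log_lines:
        match = _SIGNAL_RE.search(line)
        if match:
            yield int(match.group(1)), int(match.group(2))


def sigkill_victims(log_lines):
    """Return the PIDs of backends that were terminated by signal 9 (SIGKILL)."""
    return [pid for pid, sig in find_killed_backends(log_lines) if sig == 9]
```

For example, passing an open log file object to sigkill_victims returns the PIDs of every backend that was killed this way.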

Executed CREATE INDEX CONCURRENTLY statement and the session was disconnected due to a timeout error, but pg_stat_activity shows it still running

I ran CREATE INDEX on a big table from pgAdmin, and after a while I lost the connection to the server, so the execution window in pgAdmin closed. When I reconnected and checked pg_stat_activity, the CREATE INDEX statement was still shown in the running (active) state. Is the index actually being created, or is it stuck somewhere?
The client disconnected with this error:
cancelling statement due to statement timeout
After I reconnected to the server, pg_stat_activity showed:
31937 "edsadmin" "09:54:44.280176" "CREATE INDEX CONCURRENTLY idx_src_record_date
ON pcd_t.l_esd_detail_report USING btree
(src_record_date COLLATE pg_catalog."default")
TABLESPACE pg_default;"
I'm really confused about whether the index is being created or not.
Late answer, but relevant nonetheless.
Based on my anecdotal experience right now, the index creation persists beyond the lifetime of the client.
I did the following:
database=> CREATE INDEX CONCURRENTLY IX_myIndex on table(column, column2);
It paused for a few minutes, then:
SSL SYSCALL error: EOF detected
The connection to the server was lost. Attempting reset: Succeeded.
psql (14.2 (Debian 14.2-1.pgdg110+1), server 10.18)
Immediately after reconnecting, I ran:
database=> \d table
Table "public.table"
Column | Type | Collation | Nullable | Default
-----------------+-----------------------------+-----------+----------+---------
... blah blah blah ...
Indexes:
"IX_myIndex" btree (column, column2) INVALID
I waited a few more minutes, then checked again:
database=> \d table
Table "public.table"
Column | Type | Collation | Nullable | Default
-----------------+-----------------------------+-----------+----------+---------
... blah blah blah ...
Indexes:
"IX_myIndex" btree (column, column2)
So, your index is likely fine.
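The INVALID marker that \d prints comes from the indisvalid flag in the pg_index catalog, so the same check can be scripted instead of read by eye. This is a minimal sketch assuming a DB-API 2.0 connection whose driver uses the %s parameter style (psycopg2 does); the helper name index_is_valid is hypothetical.

```python
# Sketch: report whether an index is marked valid in the pg_index catalog.
# Assumes a DB-API 2.0 connection with the "%s" parameter style (e.g. psycopg2).

INDEX_VALIDITY_SQL = """
SELECT c.relname AS index_name, i.indisvalid, i.indisready
FROM pg_index AS i
JOIN pg_class AS c ON c.oid = i.indexrelid
WHERE c.relname = %s
"""


def index_is_valid(conn, index_name):
    """Return the index's indisvalid flag, or None if no index has that name.

    The name is matched exactly as stored in pg_class and is not
    schema-qualified, so the first match is used if several schemas share it.
    """
    cur = conn.cursor()
    try:
        cur.execute(INDEX_VALIDITY_SQL, (index_name,))
        row = cur.fetchone()
        return None if row is None else bool(row[1])
    finally:
        cur.close()
```

If the flag stays false after the CREATE INDEX CONCURRENTLY statement has disappeared from pg_stat_activity, the build failed or was cancelled, and the leftover invalid index has to be dropped and created again.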

How to monitor a deadlock in DB2

I am following this link and trying to simulate the deadlock issue:
http://www.dba-db2.com/2012/06/how-to-monitor-a-deadlock-in-db2.html
My commands ran successfully.
After that I simulated a deadlock error through the DbVisualizer tool. However, I didn't see any file being generated in the path.
Can someone point out my mistake?
Also, when I read back the old 0000000.evt file, it shows the following:
EVENT LOG HEADER
Event Monitor name: DB2DETAILDEADLOCK
Server Product ID: SQL10059
Version of event monitor data: 12
Byte order: BIG ENDIAN
Number of nodes in db2 instance: 1
Codepage of database: 1208
Territory code of database: 1
Server instance name: db2inst1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Database Name: MYDB
Database Path: /db2home/db2inst1/NODE0000/SQL00003/MEMBER0000/
First connection timestamp: 01/29/2018 10:00:17.694784
Event Monitor Start time: 01/29/2018 10:00:18.951331
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Database Name: MYDB
Database Path: /db2home/db2inst1/NODE0000/SQL00003/MEMBER0000/
First connection timestamp: 01/29/2018 10:12:54.382936
Event Monitor Start time: 01/29/2018 10:12:54.697223
--------------------------------------------------------------------------
Does this mean there was no deadlock?
Works correctly for me (Linux, Db2 v11.1). Here are some command lines with annotations. You need suitable authorisation/privilege for each command. I was using the instance owner account.
Disable default db2detaildeadlock monitor first and then create your own:
$ db2 "set event monitor db2detaildeadlock state=0"
DB20000I The SQL command completed successfully.
$
$ db2 "create event monitor dlmon for deadlocks write to file '/tmp'"
DB20000I The SQL command completed successfully.
$
$ db2 "set event monitor dlmon state=1"
DB20000I The SQL command completed successfully.
$
Generate a deadlock, and make sure you see SQLCODE -911 with reason code 2.
If you don't see reason code 2, then you don't have a deadlock. You might have a lock timeout instead, and timeouts don't get recorded in the deadlock monitor.
Here I show the victim of the deadlock getting notified of rollback and you can see the correct reason code:
$ db2 +c "select * from db2inst1.dlk where a=4 with rr"
SQL0911N The current transaction has been rolled back because of a deadlock
or timeout. Reason code "2". SQLSTATE=40001
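The reason-code check described above can be automated when the error text is captured from a script or an application log. This is a small sketch with a hypothetical helper name. It relies on the documented SQL0911N reason codes 2 (deadlock) and 68 (lock timeout) and reports anything else as "other".

```python
import re

# Extracts the reason code from an SQL0911N message; the message may wrap
# across lines, as it does in the CLP output quoted above.
_REASON_RE = re.compile(r'SQL0911N.*?Reason\s+code\s+"(\d+)"', re.DOTALL)

# Reason codes for SQL0911N: 2 = deadlock, 68 = lock timeout.
ROLLBACK_REASONS = {"2": "deadlock", "68": "lock timeout"}


def classify_sql0911(message):
    """Return 'deadlock', 'lock timeout', 'other', or None if not an SQL0911N message."""
    match = _REASON_RE.search(message)
    if match is None:
        return None
    return ROLLBACK_REASONS.get(match.group(1), "other")
```

Only messages classified as "deadlock" should be expected to appear in the deadlock event monitor output.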
Investigate the monitor output with db2evmon and view the resulting file:
$ db2evmon -db mydb -evm dlmon > /tmp/db2evmon.dlmon.1
Reading /tmp/00000000.evt ...
$ view /tmp/db2evmon.dlmon.1
...<snip>
...
3) Deadlock Event ...
Deadlock ID: 2
Number of applications deadlocked: 2
Deadlock detection time: 01/03/2018 09:06:39.019854
Rolled back Appl participant no: 2
Rolled back Appl Id: *LOCAL.db2inst1.180301090546
Rolled back Appl seq number: 00001
Rolled back Appl handle: 11872
...<snip>

Postgres BDR replication stopped - replication slot inactive

Our Postgres BDR database system stopped replicating data between the nodes.
When I checked with pg_xlog_location_diff, I noticed a growing backlog of retained WAL in the replication slot.
SELECT slot_name, database, active, pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
WHERE plugin = 'bdr';
slot_name | database | active | retained_bytes
-----------------------------------------+--------------+--------+----------------
bdr_26702_6275336279642079463_1_20305__ | ourdatabase | f | 32253352
I also noticed that the slot is marked as active=false.
SELECT * FROM pg_replication_slots;
-[ RECORD 1 ]+----------------------------------------
slot_name | bdr_26702_6275336279642079463_1_20305__
plugin | bdr
slot_type | logical
datoid | 26702
database | ourdatabase
active | f
xmin |
catalog_xmin | 8041
restart_lsn | 0/5F0C6C8
I increased the Postgres logging level, but the only messages I see in the log are:
LOCATION: LogicalIncreaseRestartDecodingForSlot, logical.c:886
DEBUG: 00000: updated xmin: 1 restart: 0
LOCATION: LogicalConfirmReceivedLocation, logical.c:958
DEBUG: 00000: failed to increase restart lsn: proposed 0/7DCE6F8, after 0/7DCE6F8, current candidate 0/7DCE6F8, current after 0/7DCE6F8, flushed up to 0/7DCE6F8
Please let me know if you have an idea how I can re-activate the replication slot and allow the replication to resume.
Unless you have a really huge amount of data, I see no reason not to recreate the replication from scratch: stop the slave, delete the slot on the master, delete the data directory on the slave, create a new slot (with the same name to avoid further changes on the slave), and run pg_basebackup.
You can find a good tutorial here.
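For reference, the retained_bytes figure in the question is a difference between two LSNs, and the same arithmetic can be done offline on values copied from pg_replication_slots or from the log. This is a minimal sketch with hypothetical function names. It assumes the LSN format of PostgreSQL 9.3 and later, where the two hexadecimal halves form a single 64-bit byte position.

```python
def lsn_to_int(lsn):
    """Convert an LSN such as '0/5F0C6C8' to its 64-bit integer byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lsn_diff(newer, older):
    """Return the number of WAL bytes between two LSNs (newer minus older)."""
    return lsn_to_int(newer) - lsn_to_int(older)


# Example with values from the question: the LSN reported in the debug log
# versus the slot's restart_lsn.
retained = lsn_diff("0/7DCE6F8", "0/5F0C6C8")
```

A value that keeps growing between checks means the slot is still holding WAL back, whichever way the replication is eventually repaired.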

Postgres crashes for long query

My Postgres server crashes on a long query. It is postgresql-9.3.2 on 64-bit Debian 7, and I use the default configuration throughout. Could anyone suggest what the problem could be? Thanks.
--part1:
SELECT r1.f2 as b, r1.e as l
FROM r r8,r r7,r r6,r r5,r r4,r r3,r r2,r r1
WHERE
r1.f2=r2.f1 AND
r1.f2=r3.f1 AND
r1.f2=r4.f1 AND
r1.f1=r5.f2 AND
r1.f1=r8.f1 AND
r2.f1=r3.f1 AND
r2.f1=r4.f1 AND
r2.f2=r6.f2 AND
r2.f2=r7.f1 AND
r3.f1=r4.f1 AND
r3.f2=r7.f2 AND
r3.f2=r8.f2 AND
r4.f2=r5.f1 AND
r4.f2=r6.f1 AND
r5.f1=r6.f1 AND
r5.f2=r8.f1 AND
r6.f2=r7.f1 AND
r7.f2=r8.f2 AND
r1.d=1 AND
r2.d=1 AND
r3.d=2 AND
r4.d=2 AND
r5.d=2 AND
r6.d=2 AND
r7.d=2 AND
r8.d=2
-- part2
group by r1.f2,r1.e
having
calc_empty_a() AND
calc_empty_b();
In the query, calc_empty_a() and calc_empty_b() are just empty boolean functions (they return true), so they should not cause a problem.
If I run the query in the client, the server crashes. There is no useful information in the log (please refer to the error info at the end of the post).
If I run the part 1 query, the query works well.
If I first run the part 1 query, then I run the whole query, it works well.
If I reduce the number of r tables in the query, for example keeping only r1 to r6 in the FROM clause and deleting the predicates on r7 and r8 while keeping part 2's GROUP BY and HAVING clauses, the query still works well.
If the query has only one empty function in the HAVING clause, it also works well, but it crashes if there are two functions.
The following query works well
SELECT r1.f2 as b, r1.f1 as a , r1.e as e
FROM r r8,r r7,r r6,r r5,r r4,r r3,r r2,r r1
WHERE
r1.f2=r2.f1 AND
r1.f2=r3.f1 AND
r1.f2=r4.f1 AND
r1.f1=r5.f2 AND
r1.f1=r8.f1 AND
r2.f1=r3.f1 AND
r2.f1=r4.f1 AND
r2.f2=r6.f2 AND
r2.f2=r7.f1 AND
r3.f1=r4.f1 AND
r3.f2=r7.f2 AND
r3.f2=r8.f2 AND
r4.f2=r5.f1 AND
r4.f2=r6.f1 AND
r5.f1=r6.f1 AND
r5.f2=r8.f1 AND
r6.f2=r7.f1 AND
r7.f2=r8.f2
group by r1.f2,r1.f1, r1.e
having
calc_empty_a() AND
calc_empty_a();
I have set the OS to use strict overcommit mode:
sysctl -w vm.overcommit_memory=2
Error info:
at client
The connection to the server was lost. Attempting reset: Succeeded.
at server
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2014-11-07 16:47:03 GMT
LOG: database system was not properly shut down; automatic recovery in progress
LOG: redo starts at 0/2126C98
LOG: record with zero length at 0/21A9D98
LOG: redo done at 0/21A9D68
LOG: last completed transaction was at log time 2014-11-07 16:47:26.844406+00
LOG: autovacuum launcher started
LOG: database system is ready to accept connections