Extract/separate duration value from PostgreSQL log in ELK

I use ELK (Elasticsearch, Logstash, Kibana) to monitor logs. Filebeat ships the PostgreSQL logs to Logstash, and they are displayed in Kibana.
I enabled log_duration in postgresql.conf and durations are logged correctly.
For example, the PostgreSQL log looks like this:
2023-01-11 06:17:09.754 EST [19751] user#books LOG: duration: 0.014 ms execute <unnamed>: SET SESSION CHARACTERISTICS AS TRANSACTION READ ONLY
2023-01-11 06:17:09.755 EST [19751] user#books LOG: duration: 0.016 ms bind S_1: BEGIN
2023-01-11 06:17:09.755 EST [19751] user#books LOG: duration: 0.006 ms execute S_1: BEGIN
2023-01-11 06:17:09.756 EST [19751] user#books LOG: duration: 0.488 ms parse <unnamed>: select * from books
But in Kibana the duration value is not a separate field. It is displayed as part of the message field, so aggregations and other operations on it are not possible.
How can I extract the duration value from the message into its own field?
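The extraction can be sketched as a regular expression with named groups. The block below is a minimal Python illustration, not a Logstash configuration. The group names (duration_ms, step, query and so on) are assumptions chosen for this example, and the pattern assumes the log line layout shown in the sample lines above. In Logstash the same idea is normally expressed with a grok filter that captures the number into its own numeric field.

```python
import re

# Matches lines such as:
#   2023-01-11 06:17:09.756 EST [19751] user#books LOG: duration: 0.488 ms parse <unnamed>: select * from books
# The group names are illustrative, not fields defined by Filebeat or Logstash.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) (?P<tz>\S+) "
    r"\[(?P<pid>\d+)\] (?P<session>\S+) LOG: +"
    r"duration: (?P<duration_ms>\d+(?:\.\d+)?) ms"
    r"(?: +(?P<step>statement|parse|bind|execute)(?: (?P<name>[^:]+))?: (?P<query>.*))?$"
)


def parse_duration_line(line):
    """Return a dict of fields for a duration log line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # Convert to numbers so the values can be aggregated (avg, max, percentiles).
    fields["duration_ms"] = float(fields["duration_ms"])
    fields["pid"] = int(fields["pid"])
    return fields


sample = ("2023-01-11 06:17:09.756 EST [19751] user#books LOG: "
          "duration: 0.488 ms parse <unnamed>: select * from books")
parsed = parse_duration_line(sample)
print(parsed["duration_ms"], parsed["step"], parsed["query"])
```

The important part is that the duration ends up as a number rather than as text, since Kibana can only aggregate numeric fields.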

Related

How to debug very slow CREATE DATABASE statement in postgres 12

All other database operations run at full speed.
This happens on several hosts and in Ubuntu 22.04 LXC containers that we have otherwise been using very successfully for a couple of years.
Turning fsync off doesn't make any difference.
Disk I/O and processor utilisation are minimal, so that isn't it.
I tried logging with debug level 3 but could not find anything.
Listing the processes from SQL and from Linux just shows processes quietly waiting for something, but I cannot find out what exactly.
The template1 database is virtually empty.
2022-10-31 14:10:02.743 UTC [1086558] exodus#exodus LOG: duration: 17249.532 ms statement: CREATE DATABASE xo_dict WITH ENCODING='UTF8'
2022-10-31 14:10:11.569 UTC [1090734] exodus#exodus LOG: duration: 8010.033 ms statement: DROP DATABASE exodus2b
2022-10-31 14:10:13.359 UTC [1086558] exodus#exodus LOG: duration: 9596.890 ms statement: DROP DATABASE xo_dict
2022-10-31 14:10:15.076 UTC [1090734] exodus#exodus LOG: duration: 3491.147 ms statement: DROP DATABASE exodus3b
2022-10-31 14:10:32.291 UTC [1093962] exodus#exodus LOG: duration: 15510.507 ms statement: CREATE DATABASE exodus2b WITH ENCODING='UTF8'
2022-10-31 14:10:52.174 UTC [1093962] exodus#exodus LOG: duration: 19864.597 ms statement: CREATE DATABASE exodus3b WITH ENCODING='UTF8' TEMPLATE exodus2b
2022-10-31 14:10:52.932 UTC [1093962] exodus#exodus LOG: duration: 740.990 ms statement: DROP DATABASE exodus2b
2022-10-31 14:10:55.849 UTC [1093962] exodus#exodus LOG: duration: 2129.943 ms statement: DROP DATABASE exodus3b
2022-10-31 14:11:13.755 UTC [1102944] exodus#exodus LOG: duration: 17885.511 ms statement: CREATE DATABASE exodus2b WITH ENCODING='UTF8'
2022-10-31 14:11:43.537 UTC [1102944] exodus#exodus LOG: duration: 29769.648 ms statement: CREATE DATABASE exodus3b WITH ENCODING='UTF8'
2022-10-31 14:21:33.410 UTC [1247048] exodus#exodus LOG: duration: 15115.960 ms statement: CREATE DATABASE xo_dict WITH ENCODING='UTF8'

Postgres.exe crashes and tears down all apps, recovers and is running again

I'm running an application with about 20 processes connected to a Postgres DB (10.0) on Windows Server 2016.
For about a month I have been getting unexpected crashes of postgres.exe.
To isolate the problem I extended the logging by setting log_min_duration_statement = 0
This creates a more detailed log file. What I can see is:
LOG: server process (PID xxxxx) was terminated by exception 0xFFFFFFFF
DETAIL: Failed process was running: COMMIT
HINT: See C include file "ntstatus.h" for a description of the hexadecimal value.
Then it tears down all 20 processes like this:
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
HINT: In a moment you should be able to reconnect to the database and repeat your command.
Then DB recovers:
LOG: all server processes terminated; reinitializing
LOG: database system was interrupted; last known up at 2021-06-11 18:17:18 CEST
DB enters recovery mode
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
FATAL: the database system is in recovery mode
LOG: database system was not properly shut down; automatic recovery in progress
...
LOG: redo starts at 1B2/33319E58
FATAL: the database system is in recovery mode
LOG: invalid record length at 1B2/33D29930: wanted 24, got 0
LOG: redo done at 1B2/33D29908
LOG: last completed transaction was at log time 2021-06-11 18:21:39.830526+02
FATAL: the database system is in recovery mode
...
FATAL: the database system is in recovery mode
LOG: database system is ready to accept connections
Now it's running again like normal
I can trace the crashed PID xxxxx to a postgres.exe backend serving one of the 20 application processes. It's not always the same one. This happens about every 5-10 days.
Can anybody give me some advice on how to track down the reason for this crash?
Extensions used:
oracle_fdw 2.0.0, PostgreSQL 10.0, Oracle client 11.2.0.3.0, Oracle server 11.2.0.2.0
Crashdump:
Followed the link :
https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Windows
Although the postgres user has "full control" of the crashdump folder in the security tab, nothing is written to it. The folder stays empty.
Follow-up on the comment by @Laurenz Albe:
The COMMIT is not the reason for the crash. It is the last successfully executed command of the session. This is explained by the following example:
Process gets a job and starts to do its job
2021-06-15 16:27:51.100 CEST [25604] LOG: duration: 0.061 ms statement: DISCARD ALL
2021-06-15 16:27:51.100 CEST [25604] LOG: duration: 0.012 ms statement: BEGIN
2021-06-15 16:27:51.100 CEST [25604] LOG: duration: 0.015 ms statement: SET TRANSACTION ISOLATION LEVEL READ COMMITTED
now a lot of action going on within session 25604
and among others the oracle foreign datawrapper
2021-06-15 16:28:13.792 CEST [25604] LOG: duration: 0.016 ms execute <unnamed>: FETCH ALL FROM "<unnamed portal 689>"
finishes action successfully (data of the transaction in the database)
2021-06-15 16:28:13.823 CEST [25604] LOG: duration: 0.059 ms statement: COMMIT
a lot of action is going in different sessions
among others the oracle foreign datawrapper
More than 7 minutes afterwards the next job is requested and now postgres.exe crashes
2021-06-15 16:36:01.524 CEST [17904] LOG: server process (PID 25604) was terminated by exception 0xFFFFFFFF
The process does not do DISCARD ALL, BEGIN and SET TRANSACTION ISOLATION LEVEL READ COMMITTED
It crashes immediately
My Conclusion:
"the possibly corrupted shared memory" was initiated by one of the processes before. Meaning between the last successful COMMIT and the new request.
That's a 7 minutes time span where the problem occurs.
Some feedback on this conclusion?
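The time span in this conclusion can be read off the log mechanically. The following is a minimal Python sketch under the assumption that the log lines have the layout shown above (timestamp, time zone, PID in brackets, then LOG:). The function name idle_gap_before_crash is made up for this illustration. It reports the crashed PID, the last message that backend logged, and the number of seconds between that message and the crash report.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts above: "<date> <time> <tz> [<pid>] LOG: <message>"
ENTRY_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) \S+ \[(\d+)\] LOG: +(.*)$"
)
CRASH_RE = re.compile(r"server process \(PID (\d+)\) was terminated")


def idle_gap_before_crash(lines):
    """Return (crashed_pid, last_message, gap_seconds) for the first crash report found."""
    last_seen = {}  # pid -> (timestamp, message) of the latest entry per backend
    for line in lines:
        entry = ENTRY_RE.match(line)
        if entry is None:
            continue
        ts = datetime.strptime(entry.group(1), "%Y-%m-%d %H:%M:%S.%f")
        pid, message = int(entry.group(2)), entry.group(3)
        crash = CRASH_RE.search(message)
        if crash:
            crashed_pid = int(crash.group(1))
            if crashed_pid in last_seen:
                prev_ts, prev_message = last_seen[crashed_pid]
                return crashed_pid, prev_message, (ts - prev_ts).total_seconds()
            return crashed_pid, None, None
        last_seen[pid] = (ts, message)
    return None


log = [
    "2021-06-15 16:28:13.792 CEST [25604] LOG: duration: 0.016 ms execute <unnamed>: FETCH ALL FROM \"<unnamed portal 689>\"",
    "2021-06-15 16:28:13.823 CEST [25604] LOG: duration: 0.059 ms statement: COMMIT",
    "2021-06-15 16:36:01.524 CEST [17904] LOG: server process (PID 25604) was terminated by exception 0xFFFFFFFF",
]
print(idle_gap_before_crash(log))
```

For the excerpt above the gap between the COMMIT and the crash report is 467.701 seconds, which is the "more than 7 minutes" window described.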

PostgreSQL 9.4.1 Switchover & Switchback without recovery_target_timeline=latest

I have tested different scenarios for switchover and switchback in PostgreSQL 9.4.1.
Scenario 1:- PostgreSQL Switchover and Switchback in 9.4.1
Scenario 2:- Is the parameter recovery_target_timeline='latest' mandatory for switchover and switchback in PostgreSQL 9.4.1?
Scenario 3:- On this page
To test scenario 3 I followed the steps below.
1) Stop the application connected to the primary server.
2) Confirm that all applications were stopped and all threads were disconnected from the primary DB.
#192.x.x.129(Primary)
3) Cleanly shut down the primary using
pg_ctl -D $PGDATA stop -mf
#DR(192.x.x.128) side check sync status:
postgres=# select pg_last_xlog_receive_location(),pg_last_xlog_replay_location();
-[ RECORD 1 ]-----------------+-----------
pg_last_xlog_receive_location | 4/57000090
pg_last_xlog_replay_location | 4/57000090
4) Stop the DR server. DR (192.x.x.128)
pg_ctl -D $PGDATA stop -mf
pg_log:
2019-12-02 13:16:09 IST LOG: received fast shutdown request
2019-12-02 13:16:09 IST LOG: aborting any active transactions
2019-12-02 13:16:09 IST LOG: shutting down
2019-12-02 13:16:09 IST LOG: database system is shut down
#192.x.x.128(DR)
5) Make the following change on the DR server.
mv recovery.conf recovery.conf_bkp
6) Make changes on 192.x.x.129 (Primary):
[postgres#localhost data]$ cat recovery.conf
standby_mode = 'on'
primary_conninfo = 'user=replication password=postgres host=192.x.x.128 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
restore_command = 'cp %p /home/postgres/restore/%f'
trigger_file='/tmp/promote'
7) Start DR in read-write mode:
pg_ctl -D $PGDATA start
pg_log:
2019-12-02 13:20:21 IST LOG: database system was shut down in recovery at 2019-12-02 13:16:09 IST
2019-12-02 13:20:22 IST LOG: database system was not properly shut down; automatic recovery in progress
2019-12-02 13:20:22 IST LOG: consistent recovery state reached at 4/57000090
2019-12-02 13:20:22 IST LOG: invalid record length at 4/57000090
2019-12-02 13:20:22 IST LOG: redo is not required
2019-12-02 13:20:22 IST LOG: database system is ready to accept connections
2019-12-02 13:20:22 IST LOG: autovacuum launcher started
(END)
We can see in the log above that the old primary is now the DR of the new primary (which was the old DR), and no error is shown because the timeline ID on the new primary is the same one that already exists on the new DR.
8) Start Primary in read-only mode:
pg_ctl -D $PGDATA start
logs:
2019-12-02 13:24:50 IST LOG: database system was shut down at 2019-12-02 11:14:50 IST
2019-12-02 13:24:51 IST LOG: entering standby mode
cp: cannot stat ‘pg_xlog/RECOVERYHISTORY’: No such file or directory
cp: cannot stat ‘pg_xlog/RECOVERYXLOG’: No such file or directory
2019-12-02 13:24:51 IST LOG: consistent recovery state reached at 4/57000090
2019-12-02 13:24:51 IST LOG: record with zero length at 4/57000090
2019-12-02 13:24:51 IST LOG: database system is ready to accept read only connections
2019-12-02 13:24:51 IST LOG: started streaming WAL from primary at 4/57000000 on timeline 9
2019-12-02 13:24:51 IST LOG: redo starts at 4/57000090
(END)
Question 1:- In this scenario I have performed only the switchover, to show you. Using this method we can do both switchover and switchback. But if switchover and switchback work with this method, why did the PostgreSQL community introduce recovery_target_timeline=latest and apply patches (see blog: https://www.enterprisedb.com/blog/switchover-switchback-in-postgresql-9-3) from PostgreSQL 9.3 to the latest version?
Question 2:- What does cp: cannot stat ‘pg_xlog/RECOVERYHISTORY’: No such file or directory in the log above mean?
Question 3:- I want to know which of scenario 1 and scenario 3 is the correct way to do switchover and switchback, because scenario 2 fails with an error since we must use recovery_target_timeline=latest, as all community experts know.
Answers:
1. If you shut down the standby cleanly, then remove recovery.conf and restart it, it will come up, but has to perform crash recovery (database system was not properly shut down).
The proper way to promote a standby to a primary is by using the trigger file or running pg_ctl promote (or, from v12 on, by running the SQL function pg_promote). Then you have no down time and don't need to perform crash recovery.
Promoting the standby will make it pick a new time line, so you need recovery_target_timeline = 'latest' if you want the new standby to follow that time line switch.
2. That is caused by your restore_command: it tries to copy from %p into the archive directory, whereas it should copy the archived file %f to %p.
3. The method shown in 1. above is the correct one.
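The sync check in step 3 compares pg_last_xlog_receive_location() with pg_last_xlog_replay_location(). A location such as 4/57000090 is a 64-bit WAL position written as two hexadecimal halves, so the comparison can be done numerically. The following is a minimal Python sketch of that check. The function names lsn_to_int and replay_lag_bytes are made up for this illustration.

```python
def lsn_to_int(lsn):
    """Convert a WAL location such as '4/57000090' to a 64-bit integer."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def replay_lag_bytes(receive_lsn, replay_lsn):
    """Bytes of WAL that were received by the standby but not yet replayed."""
    return lsn_to_int(receive_lsn) - lsn_to_int(replay_lsn)


# Values taken from the psql output in step 3.
lag = replay_lag_bytes("4/57000090", "4/57000090")
print("standby has replayed everything it received" if lag == 0 else f"{lag} bytes behind")
```

A lag of zero only says that the standby has replayed what it received. It does not by itself prove that the primary had sent all of its WAL before shutting down.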

PostgreSQL PITR not working properly

I am trying to restore a PostgreSQL database to a point in time.
When I use only restore_command in recovery.conf, it works fine.
restore_command = 'cp /var/lib/pgsql/pg_log_archive/%f %p'
When I am using the recovery_target_time parameter, it is not restoring to the target time.
restore_command = 'cp /var/lib/pgsql/pg_log_archive/%f %p'
recovery_target_time='2018-06-05 06:43:00.0'
Below is the log file content:
2018-06-05 07:31:39.166 UTC [22512] LOG: database system was interrupted; last known up at 2018-06-05 06:35:52 UTC
2018-06-05 07:31:39.664 UTC [22512] LOG: starting point-in-time recovery to 2018-06-05 06:43:00+00
2018-06-05 07:31:39.671 UTC [22512] LOG: restored log file "00000005.history" from archive
2018-06-05 07:31:39.769 UTC [22512] LOG: restored log file "00000005000000020000008F" from archive
2018-06-05 07:31:39.816 UTC [22512] LOG: redo starts at 2/8F000028
2018-06-05 07:31:39.817 UTC [22512] LOG: consistent recovery state reached at 2/8F000130
2018-06-05 07:31:39.818 UTC [22510] LOG: database system is ready to accept read only connections
2018-06-05 07:31:39.912 UTC [22512] LOG: restored log file "000000050000000200000090" from archive
2018-06-05 07:31:39.996 UTC [22512] LOG: recovery stopping before abort of transaction 9525, time 2018-06-05 06:45:02.088502+00
2018-06-05 07:31:39.996 UTC [22512] LOG: recovery has paused
I am trying to restore the database instance to 06:43:00. Why is it recovering up to 06:45:02?
EDIT
In the first scenario recovery.conf was renamed to recovery.done, but this didn't happen in the second scenario.
What could be the reason for this?
You forgot to set
recovery_target_action = 'promote'
After point-in-time-recovery, recovery_target_action determines how PostgreSQL will proceed.
The default value is pause which means that PostgreSQL will do nothing and wait for you to tell it how to proceed.
To complete recovery, connect to the database and run
SELECT pg_wal_replay_resume();
It seems that there has been no database activity logged between 06:43:00 and 06:45:02. Observe that the log says recovery stopping before abort of transaction 9525.
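Why the reported stop time is later than the target can be shown with a simplified model: recovery compares the target against the timestamps of commit and abort records, and stops at the first such record that is later than the target. The Python sketch below is only that model, not PostgreSQL's implementation. The function name first_record_after is made up, and the 06:40:10 record is a hypothetical value added for illustration; only the 06:45:02.088502 record comes from the log above.

```python
from datetime import datetime


def first_record_after(target, record_times):
    """Return the earliest commit/abort timestamp strictly after target, or None."""
    later = [t for t in sorted(record_times) if t > target]
    return later[0] if later else None


target = datetime(2018, 6, 5, 6, 43, 0)
record_times = [
    datetime(2018, 6, 5, 6, 40, 10),        # hypothetical earlier commit, replayed
    datetime(2018, 6, 5, 6, 45, 2, 88502),  # abort of transaction 9525, from the log
]
stop = first_record_after(target, record_times)
print("recovery stops before the record at", stop)
```

Because the record at 06:45:02 is not applied, the restored state is the one that was valid at 06:43:00, even though the log message mentions the later time.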

Can't connect to postgresql server after moving database files

I want to move my postgresql databases to an external hard drive (HDD 2TB USB 3.0). I copied the whole directory:
/var/lib/postgresql/9.4/main/
to the external drive, preserving permissions, with a command (run by the user postgres):
$ rsync -aHAX /var/lib/postgresql/9.4/main/* new_dir_path
The first run of this command was interrupted, but on the second attempt I copied everything (basically one database of size 800 GB). In the file
/etc/postgresql/9.4/main/postgresql.conf
I changed the line
data_directory = '/var/lib/postgresql/9.4/main'
to point to the new location. I restarted the postgresql service, and when I run psql as the user postgres, I get:
psql: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
I didn't change any other settings. There is no pidfile 'postmaster.pid' in the new location (or in the old one). When I run the command
$ /usr/lib/postgresql/9.4/bin/postgres --single -D /etc/postgresql/9.4/main -P -d 1
I get
2017-03-16 20:47:39 CET [2314-1] DEBUG: mmap with MAP_HUGETLB failed, huge pages disabled: Cannot allocate memory
2017-03-16 20:47:39 CET [2314-2] NOTICE: database system was shut down at 2017-03-16 20:01:23 CET
2017-03-16 20:47:39 CET [2314-3] DEBUG: checkpoint record is at 647/4041B3A0
2017-03-16 20:47:39 CET [2314-4] DEBUG: redo record is at 647/4041B3A0; shutdown TRUE
2017-03-16 20:47:39 CET [2314-5] DEBUG: next transaction ID: 1/414989450; next OID: 112553
2017-03-16 20:47:39 CET [2314-6] DEBUG: next MultiXactId: 485048384; next MultiXactOffset: 1214064579
2017-03-16 20:47:39 CET [2314-7] DEBUG: oldest unfrozen transaction ID: 259446705, in database 12141
2017-03-16 20:47:39 CET [2314-8] DEBUG: oldest MultiXactId: 476142442, in database 12141
2017-03-16 20:47:39 CET [2314-9] DEBUG: transaction ID wrap limit is 2406930352, limited by database with OID 12141
2017-03-16 20:47:39 CET [2314-10] DEBUG: MultiXactId wrap limit is 2623626089, limited by database with OID 12141
2017-03-16 20:47:39 CET [2314-11] DEBUG: starting up replication slots
2017-03-16 20:47:39 CET [2314-12] DEBUG: oldest MultiXactId member is at offset 1191132700
2017-03-16 20:47:39 CET [2314-13] DEBUG: MultiXact member stop limit is now 1191060352 based on MultiXact 476142442
PostgreSQL stand-alone backend 9.4.9
backend>
but I don't know how to interpret this output. When I revert the changes in the postgresql.conf file, everything works fine. Interestingly, a few months ago I moved the database in the same way, but to a local directory, and it worked.
I use postgresql-9.4 and debian-jessie.
Thanks for your help!
UPDATE
Content of the log file:
$ cat /var/log/postgresql/postgresql-9.4-main.log
2017-03-14 17:07:16 CET [13822-2] LOG: received fast shutdown request
2017-03-14 17:07:16 CET [13822-3] LOG: aborting any active transactions
2017-03-14 17:07:16 CET [13827-3] LOG: autovacuum launcher shutting down
2017-03-14 17:07:16 CET [13824-1] LOG: shutting down
2017-03-14 17:07:16 CET [13824-2] LOG: database system is shut down
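One thing that may be worth ruling out after copying to an external drive is whether the new directory still satisfies the server's startup checks. PostgreSQL 9.4 refuses to start unless the data directory is owned by the user running the server and has no group or world access (mode 0700). The following Python sketch checks those conditions and the presence of PG_VERSION. The function name check_data_directory is made up for this illustration, the script is assumed to be run as the postgres user, and nothing in the question confirms that this is the actual cause.

```python
import os
import stat


def check_data_directory(path):
    """Return a list of problems that would keep PostgreSQL 9.4 from using path."""
    if not os.path.isdir(path):
        return [f"{path} is not a directory"]
    problems = []
    if not os.path.isfile(os.path.join(path, "PG_VERSION")):
        problems.append("PG_VERSION is missing, so this does not look like a data directory")
    info = os.stat(path)
    mode = stat.S_IMODE(info.st_mode)
    if mode & 0o077:
        # Any group or world permission bit makes the server refuse to start.
        problems.append(f"permissions are {mode:o}, expected 700")
    if info.st_uid != os.geteuid():
        # Run this as the postgres user so the comparison is meaningful.
        problems.append("directory is not owned by the user running this check")
    return problems


if __name__ == "__main__":
    # Replace with the value of data_directory from postgresql.conf.
    for problem in check_data_directory("/var/lib/postgresql/9.4/main"):
        print(problem)
```

File systems that cannot store Unix ownership and permissions, which is common on external drives, would fail this check no matter which rsync options were used.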