We have hot standby replication configured for our PostgreSQL database; the master handles reads and writes, and the replica is a read-only server.
When I fetch the vacuum and analyze statistics on both the master and the replica (slave) using the following query,
######### vacuum and analyze stats #########
SELECT relname, last_autovacuum, last_autoanalyze, autovacuum_count, autoanalyze_count FROM pg_stat_user_tables;
I get the following data on the slave server:
MyTestDB=# SELECT relname,last_autovacuum,last_autoanalyze,autovacuum_count,autoanalyze_count FROM pg_stat_user_tables;
relname | last_autovacuum | last_autoanalyze | autovacuum_count | autoanalyze_count
-----------------------------------+-----------------+------------------+------------------+-------------------
Table1 | | | 0 | 0
Table2 | | | 0 | 0
My question is: does VACUUM/ANALYZE apply to the slave server? If so, shouldn't it show some stats, such as autovacuum_count and autoanalyze_count?
Note: as per this thread in the Postgres forum, the effects of VACUUM and ANALYZE are automatically replicated to the slave.
VACUUM and ANALYZE won't run on standby servers (the results of these operations on the primary are replicated along with all other data), so you will never see statistics that show that they have run.
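A quick way to confirm this is to run a similar query on the primary, where these counters are maintained (standard columns from pg_stat_user_tables):
-- run on the primary; the standby will never populate these counters
SELECT relname, n_dead_tup, last_autovacuum, last_autoanalyze, autovacuum_count, autoanalyze_count
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;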
I have a long-running idle query that is not automatically terminated.
I have set both of the relevant timeouts to 2 h (very long, I know):
> select name,setting from pg_settings where name='statement_timeout' OR name='idle_in_transaction_session_timeout';
name | setting
-------------------------------------+---------
idle_in_transaction_session_timeout | 7200000
statement_timeout | 7200000
However, I have this idle query (not idle in transaction) that is left over from an application that crashed:
> SELECT pid, age(clock_timestamp(), query_start), state, usename, query
FROM pg_stat_activity
WHERE query NOT ILIKE '%pg_stat_activity%'
ORDER BY query_start desc;
17117 | 02:11:40.795487 | idle | ms1-user | select distinct ....
Postgres 11.13 running on AWS Aurora
Can anyone explain why/what's missing?
As the name suggests, idle_in_transaction_session_timeout does not terminate idle sessions, but sessions that are "idle in transaction". For plain idle sessions you can use idle_session_timeout, introduced in PostgreSQL v14.
In your case, the problem is the TCP keepalive settings. With the default keepalive settings on Linux, it takes the server around 2 hours and 14.5 minutes to figure out that the other end of the connection is no longer there. So wait a few minutes more :^)
If you want to reduce the time, you can set the PostgreSQL parameters tcp_keepalives_idle, tcp_keepalives_interval and tcp_keepalives_count, if Amazon allows you to do that. If they don't, complain.
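For illustration only (the values are made up, and it is not certain that Aurora exposes these parameters outside its parameter groups), the change could look like this:
-- send the first keepalive after 60 s of inactivity, then probe every 10 s, give up after 6 probes
ALTER SYSTEM SET tcp_keepalives_idle = 60;
ALTER SYSTEM SET tcp_keepalives_interval = 10;
ALTER SYSTEM SET tcp_keepalives_count = 6;
SELECT pg_reload_conf();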
We recently saw a few queries "idle in transaction" for quite some time
pid | usename | state | duration | application_name | wait_event | wait_event_type
------+---------+---------------------+----------+------------------+------------+----------------
31620 | results | idle in transaction | 12:52:23 | bin/rails | |
That is almost 13 hours idle in transaction.
Any idea what causes them to get stuck in idle, or how to dig deeper? We did notice some OOM errors for background jobs.
There are also a lot of "idle" queries, but thanks to the comments, those seem to be fine.
In "postgresql 'idle in transaction' with all locks granted", @LaurenzAlbe pointed out the idle session timeout configuration option as a band-aid, but I'd rather understand this issue than hide it.
thanks!
PS: our application is Ruby on Rails and we use a mix of Active Record and custom SQL.
EDIT: the original title was "idle in transaction"; the queries are actually just idle most of the time and not in a transaction, sorry about that.
EDIT #2: found the 13 hour idle in transaction process
These sessions are actually all idle, so they are no problem.
"idle" is significantly different from "idle in transaction": the latter is an open transaction that holds locks and blocks VACUUM, while the former is harmless.
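If you want to spot the harmful ones yourself, a query along these lines against pg_stat_activity shows only sessions that are idle inside an open transaction:
-- only sessions that hold an open transaction while doing nothing
SELECT pid, usename, xact_start, state_change, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;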
The OOM errors must have a different reason.
You should configure the machine so that
shared_buffers + max_connections * work_mem <= available RAM
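As a rough, illustrative check of that rule of thumb, you can look at the current values (the units differ per parameter, so convert before adding them up):
-- e.g. with 16 GB RAM, shared_buffers = 4GB and max_connections = 100,
-- work_mem should stay below roughly (16 - 4) GB / 100, i.e. about 120 MB
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'max_connections');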
I have xlog questions that I'm not sure about.
1) I have two servers that were once slaves. How can I know if they were slaves of the same master? Is it possible to check whether they were split from the same source in the past? I know pg_rewind can check this, but is it possible to easily check it without running pg_rewind in dry-run mode?
2) Is it true that if pg_last_xlog_replay_location is empty this server was never a slave?
3) Is it possible to know from the database itself to which master the slave is connected? I know to get this info from the recovery.conf or from the process attributes, but is it written in some system tables as well?
Thanks
Avi
were slaves of the same master
Indirectly. You can compare SELECT xmin, ctid, oid, datname FROM pg_database. Of course, dropping and recreating the postgres and template databases will change those, so this is very unreliable. But if you check them and find that ALL identifiers match, there's a good chance that the databases have the same source; see the query sketch below.
A more reliable and sophisticated method is comparing the timeline history file. E.g., check whether both ex-slaves have the same timeline (4 in the case below):
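A sketch of that comparison, to be run on both ex-slaves and compared by eye:
-- xmin/ctid/oid are system identifiers; identical values on both servers hint at a common origin
SELECT xmin, ctid, oid, datname
FROM pg_database
ORDER BY oid;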
-bash-4.2$ psql -d 'dbname=replication replication=true sslmode=require' -U replica -h 1.1.1.1 -c 'IDENTIFY_SYSTEM'
Password for user replica:
systemid | timeline | xlogpos
---------------------+----------+--------------
9999384298900975599 | 4 | F79/275B2328
(1 row)
You can check the timeline history:
-bash-4.2$ psql -d 'dbname=replication replication=true sslmode=require' -U replica -h 1.1.1.1 -c 'TIMELINE_HISTORY 4'
Password for user replica:
filename | content
------------------+------------------------------------------------------
00000004.history | 1 9E/C3000090 no recovery target specified+
| +
| 2 C1/5A000090 no recovery target specified+
| +
| 3 A52/DB2F98B8 no recovery target specified+
|
(1 row)
If both servers have the same timeline and the same xlog position at which the timeline was created, you can say with much reliability, I believe, that they came from the same source.
empty pg_last_xlog_replay_location
I would say so. It was never a slave and was never recovered from WALs. At least I don't know how to reset pg_last_xlog_replay_location on a promoted master...
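So a simple check is (on PostgreSQL 9.x; from version 10 on, the function is called pg_last_wal_replay_lsn):
-- returns NULL if no WAL has ever been replayed on this server
SELECT pg_last_xlog_replay_location();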
system tables to tell to which master the slave is connected
Nothing suitable comes to my mind. If you are a superuser, you can read recovery.conf even without shell access; if you're not, you probably would not be able to select from such a view...
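For example, as a superuser you could read the file directly (assuming recovery.conf lives in the data directory, which is the default before PostgreSQL 12):
-- superuser only; the path is relative to the data directory
SELECT pg_read_file('recovery.conf');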
I support an application hosted by a small business, a web-based Ruby on Rails app using a PostgreSQL database on the backend.
Postgres is set up for replication to an off-site standby server, which as far as I can tell is working fine; when I query the remote server it shows that it's in recovery, etc.
From the 'master' server:
postgres=# table pg_stat_replication ;
pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start
| state | sent_location | write_location | flush_location | replay_location | sync_priority | sync_state
-------+----------+---------+------------------+----------------+-----------------+-------------+-----------------------
--------+-----------+---------------+----------------+----------------+-----------------+---------------+------------
18660 | 1281085 | rep | postgresql2 | 192.168.81.155 | | 43824 | 2017-05-07 11:42:43.15
0057-04 | streaming | 3/B5243418 | 3/B5243418 | 3/B5243418 | 3/B5243150 | 1 | sync
(1 row)
...and on the 'slave':
postgres=# select pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)
postgres=# select now() - pg_last_xact_replay_timestamp() AS replication_delay;
replication_delay
-------------------
01:02:14.885511
(1 row)
I understand the process involved should I have to promote my remote slave DB to the role of master. The problem I seem to have is that on 2 or 3 occasions now, the network link to the remote slave server has gone down and the application completely "freezes up" (e.g. a page loads but will not allow users to log on), despite the fact that the master DB is still up and running.
I have WAL archiving enabled to make sure that when something like this happens the data is preserved until the link is restored and the transaction logs can be sent, but I don't understand why my master pgsql instance seems to lock up because the slave instance goes offline. That kind of defeats the entire concept of replication, so I assume I must be doing something wrong?
The most likely explanation is that you are using synchronous replication with just two nodes.
Is synchronous_standby_names set on the master server?
If the only synchronous standby server is not available, no transaction can commit on the master, and data modifying transactions will “hang”, which would explain the behaviour you observe.
For synchronous replication you need at least two slaves.
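To verify, and (if you accept the weaker durability guarantee) to stop commits from waiting on the standby, something like this should work on the master:
-- empty means no synchronous standby is required
SHOW synchronous_standby_names;
-- disable waiting for a synchronous standby (weakens durability guarantees)
ALTER SYSTEM SET synchronous_standby_names = '';
SELECT pg_reload_conf();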
I'm attempting to performance-test distributed joins on Citus 5.0. I have a master and two worker nodes, and a few hash distributed tables that behave as expected with the default config. I need to use the task tracker executor to test queries that require repartitioning.
However, after setting citus.task_executor_type to task-tracker, all queries involving distributed tables fail. For example:
postgres=# SET citus.task_executor_type TO "task-tracker";
SET
postgres=# SELECT 1 FROM distrib_mcuser_car LIMIT 1;
ERROR: failed to execute job 39
DETAIL: Too many task tracker failures
Setting citus.task_executor_type in postgresql.conf has the same effect.
Is there some other configuration change I'm missing that's necessary to switch the task executor?
EDIT, more info:
PostGIS is installed on all nodes
postgres_fdw is installed on the master
All other configuration is pristine
All of the tables so far were distributed like:
SELECT master_create_distributed_table('table_name', 'id', 'hash');
SELECT master_create_worker_shards('table_name', 8, 2);
The schema for distrib_mcuser_car is fairly large, so here's a more simple example:
postgres=# \d+ distrib_test_int
Table "public.distrib_test_int"
Column | Type | Modifiers | Storage | Stats target | Description
--------+---------+-----------+---------+--------------+-------------
num | integer | | plain | |
postgres=# select * from distrib_test_int;
ERROR: failed to execute job 76
DETAIL: Too many task tracker failures
The task-tracker executor assigns tasks (queries on shards) to a background worker running on the worker node, which connects to localhost to run the task. If your superuser requires a password when connecting to localhost, then the background worker will be unable to connect. This can be resolved by adding a .pgpass file on the worker nodes for connecting to localhost.
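A hypothetical ~/.pgpass entry on each worker could look like this (the file must be mode 0600; user name, port and password are placeholders):
# host:port:database:username:password
localhost:5432:*:postgres:your_password_here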
You can modify the authentication settings and let the workers connect to the master without password checks by changing pg_hba.conf.
Add the following lines to the master's pg_hba.conf:
host all all [worker 1 ip]/32 trust
host all all [worker 2 ip]/32 trust
And the following lines to worker-1's pg_hba.conf:
host all all [master ip]/32 trust
host all all [worker 2 ip]/32 trust
And the following to worker-2's pg_hba.conf:
host all all [master ip]/32 trust
host all all [worker 1 ip]/32 trust
This is only intended for testing; DO NOT USE this for a production system without taking the necessary security precautions.
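Whichever approach you choose, pg_hba.conf is only re-read on reload, so after editing it run this on each node (or send SIGHUP / pg_ctl reload):
SELECT pg_reload_conf();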