Pgpool directing write queries to standby postgres database?

I've got two CentOS 6.5 servers running Postgres 9.4.10 with repmgr and pgpool 3.4.5. For the most part they seem to work fine, but during some tests I get errors in the logs like:
< 2017-01-24 18:47:14.588 GMT >STATEMENT: SELECT obj.* FROM MYSCHEMA.clusterobjects obj INNER JOIN MYSCHEMA.objecttypes objtype ON obj.objecttypes_id = objtype.id AND objtype.objecttype = $1 WHERE obj.objectid = $2 FOR UPDATE
< 2017-01-24 18:47:19.585 GMT >ERROR: cannot execute SELECT FOR UPDATE in a read-only transaction
This is happening on the second node, which is in standby and therefore shouldn't have any write queries directed to it.
It's happened more than a few times, but it's rather inconsistent: you can run the same tests on the same environment with no issue, and so far I've had no luck reproducing the problem in Vagrant (though that environment has a tendency to fall over for other reasons).
I'm wondering if this is related to the white/black function list: do we need to add anything else to that?
white_function_list = ''
black_function_list = 'nextval,setval'
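For context, in streaming-replication mode pgpool load-balances plain SELECTs to the standby, and black_function_list is how it learns that a SELECT calling a given function actually writes; any user-defined write function should be listed there too. A sketch, where my_write_fn is a hypothetical function that performs writes:
black_function_list = 'nextval,setval,my_write_fn'
That said, pgpool is normally expected to send SELECT ... FOR UPDATE to the primary on its own, so the function lists may not fully explain the error above.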

Related

OPENJSON not available on SQL Server 2016 with compatibility level set to 130

I am attempting to use OPENJSON in a database that is running on SQL Server 2016, and I get the following error when running this simple test query (which works fine on a different 2016 database):
select * from OPENJSON('{ "test": "test" }')
Invalid object name 'OPENJSON'.
I know about the compatibility level settings, but that doesn't seem to be the issue on this database. Its compatibility level is already set to 130. This particular database was migrated from an old 2008R2 database. Is there something else we need to do to access the OPENJSON function?
EDIT
As a test, I created a new empty database on the same database server, and the above query works fine. So the database server doesn't seem to be the issue, it's something related to the one database we migrated.
If it matters, I'm connected as the SA account.
The database's reported compatibility level is 130, from both the SSMS GUI and from running SELECT compatibility_level FROM sys.databases WHERE name = 'MyDB';
For no reason other than to test, I ran ALTER DATABASE MyDB SET COMPATIBILITY_LEVEL = 130, and now everything works. I don't know what would cause that.
EDIT
I think I know what caused the weird behavior now. The database that was exhibiting this behavior is a dev database that is re-created every morning via replication from the prod database. After some more research, it turns out the prod database was not set to compatibility level 130, but rather 100. I'm thinking that when replication occurred and restored the dev DB from the prod log files, even though the dev DB was set to 130, something was mismatched between the two. I've since upped prod to 130 as well, and all should be good going forward.
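As a recap in code (MyDB stands in for the affected database, as in the question):
-- The reported level looked right all along...
SELECT compatibility_level FROM sys.databases WHERE name = 'MyDB';
-- ...but explicitly re-applying it is what made OPENJSON work here
ALTER DATABASE MyDB SET COMPATIBILITY_LEVEL = 130;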

bdr_init_copy hangs indefinitely

Fairly new to PostgreSQL, but I have to get replication set up. I settled on BDR, and it works fine in the local demo, but on distributed machines it starts to get problematic, mostly because I have no real clue what the hell I am doing, and I cry myself to sleep pining for MySQL. I've gotten BDR working across multiple servers, almost. When I run:
SELECT bdr.bdr_node_join_wait_for_ready();
on the joining nodes it hangs. This happens on both DB2 and DB3; DB1 returns a valid response. Researching this I came across the bdr_init_copy command, which apparently does everything I have been doing by hand, and then some, so I tried that out. Now, when I run:
/usr/lib/postgresql/9.4/bin/bdr_init_copy -d "host=192.168.1.10 dbname=demo3" --local-dbname="host=192.168.1.23 dbname=demo3" -n db2 -D bdr_data
I get
bdr_init_copy: starting ...
Getting remote server identification ...
Detected 2 BDR database(s) on remote server
Updating BDR configuration on the remote node:
demo2: creating replication slot ...
demo2: creating node entry for local node ...
demo3: creating replication slot ...
demo3: creating node entry for local node ...
Creating base backup of the remote node...
63655/63655 kB (100%), 1/1 tablespace
Creating restore point on remote node ...
Bringing local node to the restore point ...
And it sits there. I am assuming the same cause is behind both issues. As far as I can tell there are no log entries created on the local node (db2), but the following is present on the remote (db1):
2016-10-12 22:38:43 UTC [20808-1] postgres#demo2 LOG: logical decoding found consistent point at 0/5001F00
2016-10-12 22:38:43 UTC [20808-2] postgres#demo2 DETAIL: There are no running transactions.
2016-10-12 22:38:43 UTC [20808-3] postgres#demo2 STATEMENT: SELECT pg_create_logical_replication_slot('bdr_17163_6340711416785871202_2_17163__', 'bdr');
2016-10-12 22:38:43 UTC [20811-1] postgres#demo3 LOG: logical decoding found consistent point at 0/5002090
2016-10-12 22:38:43 UTC [20811-2] postgres#demo3 DETAIL: There are no running transactions.
2016-10-12 22:38:43 UTC [20811-3] postgres#demo3 STATEMENT: SELECT pg_create_logical_replication_slot('bdr_17939_6340711416785871202_2_17939__', 'bdr');
2016-10-12 22:38:44 UTC [20812-1] postgres#demo3 LOG: restore point "bdr_6340711416785871202" created at 0/50022A8
2016-10-12 22:38:44 UTC [20812-2] postgres#demo3 STATEMENT: SELECT pg_create_restore_point('bdr_6340711416785871202')
Any help out there?
Right, I just had this issue and none of the other forums were any help. Some of them even say things like it's okay for the new node to report its status as "o" while the other nodes report the new server's status as "i", because "this is just a bug and it's fine". It's NOT OKAY. The new server could receive replication updates, but no primary updates were possible on the new server.
The key to solving this problem is to crank up the logging on the server you are joining to (not the new one). In the new server's logs you might see things like:
08006: could not receive data from client: Connection reset by peer
which is not very helpful and will have you checking firewalls, etc. The real money shot comes from the source server's logs, when they say something like:
no free replication state could be found for 11, increase max_replication_slots
What has probably happened is that you either have too many servers in your cluster for the default settings or, more likely, there is some junk left over from old hosts.
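Before cleaning up, it can help to confirm the symptom from SQL. A sketch (run the first query on each node and the others on the join target; the status letters are the ones discussed above):
-- 'r' = ready is the healthy end state for every node
SELECT node_name, node_status FROM bdr.bdr_nodes;
-- compare the configured slot count against those actually in use
SHOW max_replication_slots;
SELECT count(*) FROM pg_replication_slots;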
You need to clean things up ... ON EVERY SERVER IN THE EXISTING CLUSTER (NB!). Start by getting a list of things on the existing cluster:
select * from bdr.bdr_nodes order by node_sysid;
THEN, check the following:
select conn_sysid,conn_dboid from bdr.bdr_connections order by conn_sysid;
.. if you see old entries (whose conn_sysid doesn't match any node_sysid from the first query) then delete them
eg. delete from bdr.bdr_connections where conn_sysid='<from-first-query>';
select * from pg_replication_slots order by slot_name;
.. if you see old entries that don't contain an active sysid then delete
.. NB, use the function, DO NOT do a "delete from"
eg. select pg_drop_replication_slot('bdr_17213_6574566740899221664_1_17213__');
select * from pg_replication_identifier order by riname;
.. if you see old entries that don't contain an active sysid then delete
.. NB, use the function, DO NOT do a "delete from"
select pg_replication_identifier_drop('bdr_6443767151306784833_1_17210_17213_');
With any luck, after you've done this on every node, you will see your new server's BDR status go to 'r'. As you clean up each host, you should notice that the "08006: could not receive data from client: Connection reset by peer" entries matching the conn_sysid of the host you've just cleaned up stop appearing. Good luck

Postgresql large table update slows down

I run an update on a large table (e.g. 8 GB). It is a simple update of 3 fields in the table. I had no problems running it under PostgreSQL 9.1; it would take 40-60 minutes, but it worked. I run the same query on a 9.4 database (freshly created, not upgraded) and it starts the update fine but then slows down. It uses only ~2% CPU, the level of I/O is 4-5 MB/s, and it just sits there. No locks, no other queries or connections, just this single update SQL on the server.
The SQL is below. The "lookup" table has 12 records. The lookup can return only one row; it breaks a discrete scale (SMALLINT, -32768 .. +32767) into non-overlapping regions. The "src" and "dest" tables have ~60 million records each.
UPDATE dest SET
    field1 = src.field1,
    field2 = src.field2,
    field3_id = (SELECT lookup.id FROM lookup WHERE src.value BETWEEN lookup.min AND lookup.max)
FROM src
WHERE dest.id = src.id;
I thought my disk had slowed down, but I can copy 1 GB files in parallel with the query execution and they copy fast, at >40 MB/s, and I have only one disk (it is a VM with iSCSI media). All other disk operations are not impacted; there is plenty of I/O bandwidth. At the same time PostgreSQL is just sitting there doing very little, running very slowly.
I have 2 virtualized Linux servers, one running PostgreSQL 9.1 and the other running 9.4. Both servers have close to identical PostgreSQL configurations.
Has anyone else had similar experience? I am running out of ideas. Help.
Edit
The query "ran" for 20 hours I had to kill the connections and restart the server. Surprisingly it didn't kill the connection via query:
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid() AND datname = current_database();
and the server produced the following log:
2015-05-21 12:41:53.412 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:41:53.438 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:41:53.438 EDT STATEMENT: UPDATE <... this is 60,000,000 record table update statement>
Also server restart took long time, producing the following log:
2015-05-21 12:43:36.730 EDT LOG: received fast shutdown request
2015-05-21 12:43:36.730 EDT LOG: aborting any active transactions
2015-05-21 12:43:36.730 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:43:36.734 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:43:36.747 EDT LOG: autovacuum launcher shutting down
2015-05-21 12:44:36.801 EDT LOG: received immediate shutdown request
2015-05-21 12:44:36.815 EDT WARNING: terminating connection because of crash of another server process
2015-05-21 12:44:36.815 EDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
"The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory" - is this an indication of a bug in PostgreSQL?
Edit
I tested 9.1, 9.3 and 9.4. Neither 9.1 nor 9.3 experiences the slowdown; 9.4 consistently slows down on large transactions. I noticed that when a transaction starts, htop indicates high CPU and the process status is "R" (running). Then it gradually changes to low CPU usage and status "D" for disk (see screenshot). My biggest question is why 9.4 is different from 9.1 and 9.3. I have a dozen servers and this effect is observed across the board.
Thanks everyone for the help. No matter how much I tried to emphasize the difference in performance between identically configured 9.4 and previous versions, no one seemed to pay attention to it.
The problem was solved by disabling transparent huge pages:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
Here are some resources I found helpful in researching the issue:
* https://dba.stackexchange.com/questions/32890/postgresql-pg-stat-activity-shows-commit/34169#34169
* https://lwn.net/Articles/591723/
* https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
I'd suspect a lot of disk seeking: 5 MB/s is just about right for very random I/O on an ordinary (spinning) hard drive.
As you are basically replacing all of your rows, I'd try setting the dest table's fillfactor to about 45% (ALTER TABLE dest SET (fillfactor = 45);) and then running CLUSTER dest USING dest_pkey;. This would allow updated row versions to be placed in the same disk sectors.
Additionally, CLUSTER src USING src_pkey; can also help, so that both tables have their data in the same physical order on disk.
Also remember to VACUUM dest; after every update that large, so old row versions can be reused in subsequent updates.
Your old server probably evolved its effective fillfactor naturally over multiple updates. On the new server the table is packed at 100%, so updated rows have to be placed at the end of the table.
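Putting those suggestions together as one sketch (dest_pkey and src_pkey are assumed primary-key index names, following PostgreSQL's default naming):
ALTER TABLE dest SET (fillfactor = 45);  -- leave free space in each page for updated row versions
CLUSTER dest USING dest_pkey;            -- rewrite dest in primary-key order to apply the fillfactor
CLUSTER src USING src_pkey;              -- keep both tables in the same physical order
VACUUM dest;                             -- reclaim dead row versions after each large update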
If only a few of the target rows actually change, you can avoid generating new row versions by using IS DISTINCT FROM, as in the rewrite below. This can prevent a lot of useless disk traffic.
UPDATE dest SET
    field1 = src.field1,
    field2 = src.field2,
    field3_id = lu.id
FROM src
JOIN lookup lu ON src.value BETWEEN lu.min AND lu.max
WHERE dest.id = src.id
-- avoid generating unnecessary row versions
AND (dest.field1 IS DISTINCT FROM src.field1
     OR dest.field2 IS DISTINCT FROM src.field2
     OR dest.field3_id IS DISTINCT FROM lu.id
);

ActiveRecord Postgresql Clear Connections Doesn't seem to actually clear them all the time

We're attempting a new mechanism for clearing out our db in between test runs, as we're having a hell of a time getting Capybara tests to work with the shared connection strategy.
Our basic idea with PostgreSQL is to create a snapshot of the db right after db:test:prepare. Then, in between feature specs, we disconnect from the db, drop it, re-create it from the snapshot, and reconnect.
It seems to work most of the time, but there are a few test runs that somehow maintain the connection. Our setup looks roughly like this:
# spec/spec_helper.rb
RSpec.configure do |config|
  config.before(:each, type: :feature) do
    DatabaseCleaner.strategy = nil
  end

  config.after(:each, type: :feature) do
    ActiveRecord::Base.clear_all_connections!
    `dropdb -Uroot our_apps_test_db`
    `createdb -Uroot -T test_template our_apps_test_db`
    ActiveRecord::Base.establish_connection(ActiveRecord::Base.configurations['test'])
  end
end
What we've been finding, however, is that on some runs the rspec process is still holding onto a connection. If we peek into the Postgres connections just after the ActiveRecord::Base.clear_all_connections! call, like so:
puts `psql -c "select datname, application_name from pg_stat_activity;"`
At times (only a few) we see the following:
datname | application_name
----------------------------+-----------------------------------------------------------------
our_apps_test_db | /Users/dev/.rvm/gems/ruby-2.0.0-p353#influitive/bin/rspec
I'm wondering if there's anyone out there with a bit more knowledge about ActiveRecord and its connection handling than we have. We're rather stumped on this one.
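One possible belt-and-braces step (an assumption on my part, not from the question; it requires PostgreSQL 9.2+, where pg_stat_activity has a pid column) is to terminate any straggler backends server-side before the dropdb call:
-- Run against a maintenance database such as postgres, not the one being dropped
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'our_apps_test_db'
  AND pid <> pg_backend_pid();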

How to close idle connections in PostgreSQL automatically?

Some clients connect to our PostgreSQL database but leave their connections open.
Is it possible to tell PostgreSQL to close those connections after a certain amount of inactivity?
TL;DR
IF you're using a PostgreSQL version >= 9.2
THEN use the solution I came up with
IF you don't want to write any code
THEN use arqnid's solution
IF you don't want to write any code
AND you're using a PostgreSQL version >= 14
THEN use Laurenz Albe's solution
For those who are interested, here is the solution I came up with, inspired by Craig Ringer's comment:
(...) use a cron job to look at when the connection was last active (see pg_stat_activity) and use pg_terminate_backend to kill old ones.(...)
The chosen solution comes down to this:
First, we upgrade to Postgresql 9.2.
Then, we schedule a thread to run every second.
When the thread runs, it looks for any old inactive connections.
A connection is considered inactive if its state is either idle, idle in transaction, idle in transaction (aborted) or disabled.
A connection is considered old if its state stayed the same during more than 5 minutes.
There are additional threads that do the same as above; however, those threads connect to the database as different users.
We leave at least one connection open for each application connected to our database (hence the rank() function).
This is the SQL query run by the thread:
WITH inactive_connections AS (
SELECT
pid,
rank() over (partition by client_addr order by backend_start ASC) as rank
FROM
pg_stat_activity
WHERE
-- Exclude the thread owned connection (ie no auto-kill)
pid <> pg_backend_pid()
AND
-- Exclude known applications connections
application_name !~ '(?:psql)|(?:pgAdmin.+)'
AND
-- Include connections to the same database the thread is connected to
datname = current_database()
AND
-- Include connections using the same thread username connection
usename = current_user
AND
-- Include inactive connections only
state in ('idle', 'idle in transaction', 'idle in transaction (aborted)', 'disabled')
AND
-- Include old connections (found with the state_change field)
current_timestamp - state_change > interval '5 minutes'
)
SELECT
pg_terminate_backend(pid)
FROM
inactive_connections
WHERE
rank > 1 -- Leave one connection for each application connected to the database
If you are using PostgreSQL >= 9.6 there is an even easier solution. Let's suppose you want to terminate connections that have been idle in a transaction for more than 5 minutes; just run the following:
alter system set idle_in_transaction_session_timeout='5min';
In case you don't have access as superuser (example on Azure cloud), try:
SET SESSION idle_in_transaction_session_timeout = '5min';
But the latter will only work for the current session, which is most likely not what you want.
To disable the feature,
alter system set idle_in_transaction_session_timeout=0;
or
SET SESSION idle_in_transaction_session_timeout = 0;
(by the way, 0 is the default value).
If you use ALTER SYSTEM, you must reload the configuration for the change to take effect, and the change is persistent: you won't have to re-run the query if, for example, the server restarts.
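The reload itself can be done from SQL, assuming sufficient privileges:
SELECT pg_reload_conf();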
To check the feature status:
show idle_in_transaction_session_timeout;
Connect through a proxy like PgBouncer which will close connections after server_idle_timeout seconds.
From PostgreSQL v14 on, you can set the idle_session_timeout parameter to automatically disconnect client sessions that are idle.
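A minimal sketch of that, assuming superuser access on v14 or later:
ALTER SYSTEM SET idle_session_timeout = '5min';
SELECT pg_reload_conf();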
If you use AWS with PostgreSQL >= 9.6, you have to do the following:
Create custom parameter group
go to RDS > Parameter groups > Create parameter group
Select the version of PSQL that you use, name it 'customParameters' or whatever and add description 'handle idle connections'.
Change the idle_in_transaction_session_timeout value
Fortunately it will create a copy of the default AWS group so you only have to tweak the things that you deem not suitable for your use-case.
Now click on the newly created parameter group and search 'idle'.
The default value of 'idle_in_transaction_session_timeout' is 24 hours (86400000 milliseconds). Divide this by 24 to get one hour (3600000 ms), then divide 3600000 by 4, 6 or 12 for a timeout of 15, 10 or 5 minutes respectively (or, equivalently, multiply the desired number of minutes by 60000, e.g. 300000 for 5 minutes).
Assign the group
Last, but not least, change the group:
go to RDS, select your DB and click on 'Modify'.
Now under 'Database options' you will find 'DB parameter group', change it to the newly created group.
You can then decide if you want to apply the modifications immediately (beware of downtime).
I had the problem of connections being denied because too many clients were connected, on a PostgreSQL 12 server on Ubuntu 18 (but not on similar projects using the earlier 9.6 and 10 versions).
I wonder if those settings
tcp_keepalives_idle
tcp_keepalives_interval
could be more relevant than
idle_in_transaction_session_timeout
idle_in_transaction_session_timeout indeed closes only connections that sit idle inside an open transaction, not inactive connections whose statements terminated correctly...
The documentation says these socket-level settings have no effect for connections made via Unix-domain sockets, but they could work for TCP connections.
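If you want to experiment with the keepalive route, here is a sketch with purely illustrative values (both settings are in seconds):
ALTER SYSTEM SET tcp_keepalives_idle = 300;     -- start sending probes after 5 minutes of silence
ALTER SYSTEM SET tcp_keepalives_interval = 60;  -- then probe once a minute
SELECT pg_reload_conf();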
For PostgreSQL 13 and earlier, you can use my extension pg_timeout.