bdr_init_copy hangs indefinitely - postgresql

Fairly new to PostgreSQL, but I have to get replication set up. I settled on BDR, and it works fine in the local demo, but on distributed machines it starts to get problematic, mostly because I have no real clue what the hell I am doing, and I cry myself to sleep pining for MySQL. I've gotten BDR working across multiple servers, almost. When I run:
SELECT bdr.bdr_node_join_wait_for_ready();
on the joining nodes, it hangs. This happens on both DB2 and DB3; DB1 returns a valid response. Researching this, I came across the bdr_init_copy command, which apparently does everything I have been doing by hand, and then some, so I tried that out. Now, when I run:
/usr/lib/postgresql/9.4/bin/bdr_init_copy -d "host=192.168.1.10 dbname=demo3" --local-dbname="host=192.168.1.23 dbname=demo3" -n db2 -D bdr_data
I get
bdr_init_copy: starting ...
Getting remote server identification ...
Detected 2 BDR database(s) on remote server
Updating BDR configuration on the remote node:
demo2: creating replication slot ...
demo2: creating node entry for local node ...
demo3: creating replication slot ...
demo3: creating node entry for local node ...
Creating base backup of the remote node...
63655/63655 kB (100%), 1/1 tablespace
Creating restore point on remote node ...
Bringing local node to the restore point ...
And it sits there. I am assuming that the same cause is behind both issues. As far as I can tell, there are no log entries created on the local node (db2), but the following is present on the remote (db1):
2016-10-12 22:38:43 UTC [20808-1] postgres#demo2 LOG: logical decoding found consistent point at 0/5001F00
2016-10-12 22:38:43 UTC [20808-2] postgres#demo2 DETAIL: There are no running transactions.
2016-10-12 22:38:43 UTC [20808-3] postgres#demo2 STATEMENT: SELECT pg_create_logical_replication_slot('bdr_17163_6340711416785871202_2_17163__', 'bdr');
2016-10-12 22:38:43 UTC [20811-1] postgres#demo3 LOG: logical decoding found consistent point at 0/5002090
2016-10-12 22:38:43 UTC [20811-2] postgres#demo3 DETAIL: There are no running transactions.
2016-10-12 22:38:43 UTC [20811-3] postgres#demo3 STATEMENT: SELECT pg_create_logical_replication_slot('bdr_17939_6340711416785871202_2_17939__', 'bdr');
2016-10-12 22:38:44 UTC [20812-1] postgres#demo3 LOG: restore point "bdr_6340711416785871202" created at 0/50022A8
2016-10-12 22:38:44 UTC [20812-2] postgres#demo3 STATEMENT: SELECT pg_create_restore_point('bdr_6340711416785871202')
Any help out there?

Right, I just had this issue and none of the other forums were any help. Some of them even say things like it is okay for the new node to report its status as "o" and the other nodes to report the new server's status as "i" because "this is just a bug and it's fine". It's NOT OKAY. The new server could receive replication updates, but no primary updates were possible on the new server.
The key to solving this problem is to crank up the logging on the server you are joining to (not the new one). In the new server's logs, you might see things like "08006: could not receive data from client: Connection reset by peer", which is not very helpful and will have you checking firewalls, etc. The real money shot will come from the source server's logs, when they say something like:
no free replication state could be found for 11, increase max_replication_slots
What has probably happened is that you either have too many servers in your cluster for the default settings or, more likely, there is some junk left over from old hosts.
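If it really is a capacity problem rather than leftover junk, checking and raising the setting is quick. A minimal sketch, assuming the BDR-patched PostgreSQL 9.4 (ALTER SYSTEM exists there; the new value only takes effect after a restart, and 20 is just an example value):
show max_replication_slots;
alter system set max_replication_slots = 20;  -- size this to the number of nodes/databases in the cluster
-- restart PostgreSQL afterwards for the change to take effect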
You need to clean things up ... ON EVERY SERVER IN THE EXISTING CLUSTER (NB!). Start by getting a list of things on the existing cluster:
select * from bdr.bdr_nodes order by node_sysid;
THEN, check the following:
select conn_sysid,conn_dboid from bdr.bdr_connections order by conn_sysid;
.. if you see old entries (that don't contain node_sysid from the first query) then delete
eg. delete from bdr.bdr_connections where conn_sysid='<from-first-query>';
select * from pg_replication_slots order by slot_name;
.. if you see old entries that don't contain an active sysid then delete
.. NB, use the function, DO NOT do a "delete from"
eg. select pg_drop_replication_slot('bdr_17213_6574566740899221664_1_17213__');
select * from pg_replication_identifier order by riname;
.. if you see old entries that don't contain an active sysid then delete
.. NB, use the function, DO NOT do a "delete from"
select pg_replication_identifier_drop('bdr_6443767151306784833_1_17210_17213_');
With any luck, after you've done this on every node, you will see your new server's BDR status go to 'r'. As you clean up each host, you should notice that the "08006: could not receive data from client: Connection reset by peer" log entries matching the conn_sysid of the server you've just cleaned up stop appearing. Good luck!
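To keep an eye on progress, a quick check (a sketch; node_name and node_status are the column names as in BDR 0.9/1.0):
select node_name, node_status from bdr.bdr_nodes order by node_name;
-- every server should eventually report node_status = 'r' for the new node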

Related

Why are there always at least 10 sessions for a PostgreSQL database? Why can't they be terminated?

Original aim: rename a database using ALTER DATABASE via psql.
Problem: the rename fails due to other sessions accessing the target database.
・All terminals/applications I am aware of have been closed.
・Querying pg_stat_activity shows that there are 10 processes (= sessions?) accessing the db.
・The username for each session is the same user I have been using for psql and for some local Phoenix and Django apps. The client_addr is also localhost for all of them.
・When I use pg_terminate_backend on any of the pids, another process gets spawned immediately.
・After restarting my PC, 10 processes are spawned again.
Concern: As I can't account for these 10 processes that I can't get rid of, I think I'm misunderstanding how postgres works somewhere.
Question: Why are 10 sessions/processes connected to this particular one of my databases, and why can't I terminate them using pg_terminate_backend?
Note: In the Phoenix project I set up recently, I set the pool_size of the Repo config to 10, which makes me think it's related... but I'm pretty sure that project isn't running in any way.
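A minimal sketch for identifying what is holding the sessions (assuming a database named mydb; the wait_event column exists on 9.6 and later, and a connection pool usually announces itself via application_name):
select pid, usename, application_name, client_addr, state, wait_event
from pg_stat_activity
where datname = 'mydb';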
Update - Solved
As a_horse_with_no_name suggested, by doing the following I was able to put a stop to the 10 mystery sessions:
(1) Prevent login of the user responsible for the sessions (identifiable by querying `pg_stat_activity`) by doing `alter user ... with nologin`.
(2) Run pg_terminate_backend on each of the sessions' pids (see the sketch below).
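Roughly, the two steps look like this (a sketch; app_user and mydb are placeholders, substitute the role and database you see in pg_stat_activity):
alter role app_user with nologin;  -- (1) stop new connections from being opened
select pg_terminate_backend(pid)
from pg_stat_activity
where usename = 'app_user' and datname = 'mydb';  -- (2) kick the existing sessions
-- ... run the ALTER DATABASE ... RENAME TO ... here ...
alter role app_user with login;  -- re-enable logins afterwards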
After those steps I was able to rename the database.
The remaining puzzle is how those sessions got into that state in the first place... from the contents of pg_stat_activity, the wait_event value for each was ClientRead.
From this post, it seems that the application may have been forcibly stopped halfway through a transaction or something, leaving postgres hanging.

Postgres: some process is cleaning all my table records

I have Postgres installed on a CentOS machine, and another application is using Postgres to save data.
For some time now, and I can't find the reason, all the database tables become empty on the weekends.
I have been searching a lot to try to find some clues about the reason for that behaviour, but the logs are not giving me that info.
I am pretty sure the application is not executing anything to clean the records; my thoughts are pointing to some process on the Postgres side, for some reason.
The pg_log only shows this warning the day it happens:
HINT: Consider increasing the configuration parameter "checkpoint_segments".
LOG: checkpoints are occurring too frequently (11 seconds apart)
Apart from that I have no other clues.
Performing a VACUUM ANALYZE VERBOSE says there is no dead data, so it has nothing to delete.
Can you tell me what I should look at to find the reason? Could it be some Postgres process doing it?
LOG: checkpoints are occurring too frequently (11 seconds apart)
This log message should also include all the information log_line_prefix tells it to include. So you should set log_line_prefix to include more information, like application name (voluntarily supplied by the client), database username, and host name/IP from which the connection came.
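For example (a sketch; the exact escapes are a matter of taste, and alter system needs PostgreSQL 9.4 or later - on older versions edit postgresql.conf directly):
-- %m = timestamp, %p = pid, %a = application_name, %u = user, %d = database, %h = client host
alter system set log_line_prefix = '%m [%p] app=%a user=%u db=%d host=%h ';
select pg_reload_conf();  -- log_line_prefix only needs a reload, not a restart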
But perhaps more directly at issue, if things are connecting to your database and doing things you don't understand or approve of, it is time to change your passwords.
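Changing a password is a one-liner per role (a sketch; my_app_user is a placeholder):
alter role my_app_user with password 'a-new-strong-password';
-- and review pg_hba.conf for overly permissive entries while you are at it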

How to exit out of database recovery mode (currently locked in read-only mode)

A slave database was set up some time ago for the purpose of backing up or replicating a remote database. However, I can no longer write to the database using a Delphi-based ETL (the ETL works for another database pair, but to date has never been used for this particular pair). The replication database was set up by somebody else who has since left the company. I am reasonably sure this has been set up as a replication database; however, the employee who has since left told me that replication never worked, for unrelated reasons. Using the ETL we can (using SQL queries) read from the one database and write back to the replication database, or should be able to, as it is currently read-only.
I have tried:
Maintenance such as VACUUM
Attempt to drop tables and the entire database
Restore a full backup from the master database
None of these work, and I am told the database is read-only.
I have looked at postgresql.conf and see that hot_standby is checked, so I think (but am not 100% certain) that the database is in some sort of replication mode (I've never touched replication as supported by Postgres, so I wouldn't know).
I have checked permissions in pg_hba.conf and see there are some credentials in there for replication. I am not sure whether this activates "replication mode" for the database, or simply means these credentials are for replication only.
I have been through months' worth of log files (this has not been working since our IT department upgraded the entire network about 5 months ago). I see the log file contents shown below, repeated over and over with nothing else for months. Note that the IP address shown below is listed in the pg_hba.conf file, so the credentials are valid.
The database is in recovery mode, as I have found by using:
select pg_is_in_recovery();
This explains to me why it's read only, but why can I not restore databases, or just simply dump the entire database and start again (it's a backup so losing/restoring it is not an issue)?
I was tempted to try modifying the recovery.conf file (which exists) but I read/believe that once recovery has been initiated (which in my case it has) modifying the file will have no effect.
I'm using a legacy version of Postgres: 9.2.9
Any help here would be greatly appreciated, as I have been working solidly on this for more than a day now.
Log File entry (sample):
FATAL: could not connect to the primary server:
FATAL: no pg_hba.conf entry for replication connection from host "192.168.20.2", user "postgres", SSL off
FATAL: could not connect to the primary server: server closed the connection unexpectedly
This probably means the server terminated abnormally before or while processing the request.
A couple of options would work for me:
Convert the database from being a read-only replication database, to a standard read/write database or
Dump/drop the entire database so I can create a new one with write capabilities.
It looks like the two database clusters have been set up for replication, but configuration changes on one of the machines broke the replication (changed pg_hba.conf on the primary, changed IP addresses, …).
Here is how to reach either of your desired solutions:
Bringing the standby out of recovery mode: Run
/path/to/pg_ctl promote -D /path/to/data/directory
on the standby as operating system user postgres.
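A quick sanity check afterwards (reusing the query from the question) is that recovery mode is over and the database accepts writes:
select pg_is_in_recovery();  -- should now return false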
Nuking the standby: Remove the data directory on the standby with rm -rf (or the equivalent on your operating system). Kill all PostgreSQL processes.
Then use initdb to create a new database cluster in the same location.

Pgpool directing write queries to standby postgres database?

I've got two CentOS 6.5 servers running Postgres 9.4.10 with repmgr and pgpool 3.4.5. For the most part they seem to work fine, but with some tests I get errors in the logs like:
< 2017-01-24 18:47:14.588 GMT >STATEMENT: SELECT obj.* FROM MYSCHEMA.clusterobjects obj INNER JOIN MYSCHEMA.objecttypes objtype ON obj.objecttypes_id = objtype.id AND objtype.objecttype = $1 WHERE obj.objectid = $2 FOR UPDATE
< 2017-01-24 18:47:19.585 GMT >ERROR: cannot execute SELECT FOR UPDATE in a read-only transaction
This is happening on the second node, which is in standby and therefore shouldn't have any write queries directed to it.
It's happened more than a few times, but it's rather inconsistent: you can run the same tests on the same environment with no issue, and so far I've had no luck reproducing the issue in Vagrant (but that has a tendency to fall over for other reasons).
I'm wondering if this is related to the white/black function list; do we need to add anything else to that?
white_function_list = ''
black_function_list = 'nextval,setval'

Postgresql large table update slows down

I run an update on a large table (e.g. 8 GB). It is a simple update of 3 fields in the table. I had no problems running it under PostgreSQL 9.1; it would take 40-60 minutes, but it worked. I run the same query on a 9.4 database (freshly created, not upgraded) and it starts the update fine but then slows down. It uses only ~2% CPU, the level of IO is 4-5 MB/s, and it just sits there. There are no locks, no other queries or connections, just this single update SQL on the server.
The SQL is below. The "lookup" table has 12 records. The lookup can return only one row; it breaks a discrete scale (SMALLINT, -32768 .. +32767) into non-overlapping regions. The "src" and "dest" tables are ~60 million records each.
UPDATE dest SET
field1 = src.field1,
field2 = src.field2,
field3_id = (SELECT lookup.id FROM lookup WHERE src.value BETWEEN lookup.min AND lookup.max)
FROM src
WHERE dest.id = src.id;
I thought my disk had slowed down, but I can copy 1 GB files in parallel with the query execution and it runs fast at >40 MB/s, and I have only one disk (it is a VM with iSCSI media). All other disk operations are not impacted; there is plenty of IO bandwidth. At the same time, PostgreSQL is just sitting there doing very little, running very slowly.
I have 2 virtualized linux servers, one runs postgresql 9.1 and another runs 9.4. Both servers have close to identical postgresql configuration.
Has anyone else had a similar experience? I am running out of ideas. Help.
Edit
The query "ran" for 20 hours I had to kill the connections and restart the server. Surprisingly it didn't kill the connection via query:
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid <> pg_backend_pid() AND datname = current_database();
and the server produced the following log:
2015-05-21 12:41:53.412 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:41:53.438 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:41:53.438 EDT STATEMENT: UPDATE <... this is 60,000,000 record table update statement>
Also, the server restart took a long time, producing the following log:
2015-05-21 12:43:36.730 EDT LOG: received fast shutdown request
2015-05-21 12:43:36.730 EDT LOG: aborting any active transactions
2015-05-21 12:43:36.730 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:43:36.734 EDT FATAL: terminating connection due to administrator command
2015-05-21 12:43:36.747 EDT LOG: autovacuum launcher shutting down
2015-05-21 12:44:36.801 EDT LOG: received immediate shutdown request
2015-05-21 12:44:36.815 EDT WARNING: terminating connection because of crash of another server process
2015-05-21 12:44:36.815 EDT DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
"The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory" - is this an indication of a bug in PostgreSQL?
Edit
I tested 9.1, 9.3 and 9.4. Both 9.1 and 9.3 don't experience the slowdown. 9.4 consistently slows down on large transactions. I noticed that when a transaction starts, the htop monitor indicates high CPU and the process status is "R" (running). Then it gradually changes to low CPU usage and status "D" - disk (see screenshot). My biggest question is why 9.4 is different from 9.1 and 9.3. I have a dozen servers and this effect is observed across the board.
Thanks everyone for the help. No matter how much I tried to emphasize the difference in performance between identical configurations of 9.4 and the previous versions, no one seemed to pay attention to that.
The problem was solved by disabling transparent huge pages:
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
Here are some resources I found helpful in researching the issue:
* https://dba.stackexchange.com/questions/32890/postgresql-pg-stat-activity-shows-commit/34169#34169
* https://lwn.net/Articles/591723/
* https://blogs.oracle.com/linux/entry/performance_issues_with_transparent_huge
I'd suspect a lot of disk seeking - 5 MB/s is just about right for very random IO on an ordinary (spinning) hard drive.
As you basically replace all your rows, I'd try setting the dest table's fillfactor to about 45% (alter table dest set (fillfactor=45);) and then cluster dest using dest_pkey;. This would allow updated row versions to be placed in the same disk sector.
Additionally, cluster src using src_pkey;, so both tables have their data in the same physical order on disk, can also help.
Also remember to vacuum dest; after every update that large, so old row versions can be reused in subsequent updates.
Your old server's table probably evolved its fillfactor naturally during multiple updates. On the new server it is packed 100%, so updated rows have to be placed at the end.
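Putting those steps together, a minimal sketch (assuming the primary-key indexes are called dest_pkey and src_pkey - adjust to your actual index names):
alter table dest set (fillfactor = 45);  -- leave free space on each page for updated row versions
cluster dest using dest_pkey;            -- rewrites the table in index order, honouring the new fillfactor; takes an exclusive lock
cluster src using src_pkey;              -- optional: keep src in the same physical order
-- ... run the big UPDATE ...
vacuum dest;                             -- make the old row versions reusable before the next bulk update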
If only a few of the target rows actually change, you can avoid generating new row versions by using IS DISTINCT FROM. This can prevent a lot of useless disk traffic.
UPDATE dest SET
    field1 = src.field1,
    field2 = src.field2,
    field3_id = lu.id
FROM src
JOIN lookup lu ON src.value BETWEEN lu.min AND lu.max
WHERE dest.id = src.id
-- avoid generating unnecessary row versions
AND (dest.field1 IS DISTINCT FROM src.field1
     OR dest.field2 IS DISTINCT FROM src.field2
     OR dest.field3_id IS DISTINCT FROM lu.id
    );
;