After adding a node to test scaling and then removing that node with Cloudbreak, the ambari-server service won't restart.
The error at launch is:
DB configs consistency check failed. Run "ambari-server start --skip-database-check" to skip. You may try --auto-fix-database flag to attempt to fix issues automatically. If you use this "--skip-database-check" option, do not make any changes to your cluster topology or perform a cluster upgrade until you correct the database consistency issues. See /var/log/ambari-server/ambari-server-check-database.log for more details on the consistency issues.
Looking at the logs doesn't tell me much more. I tried restarting Postgres; sometimes it works, maybe 1 time in 10 (how is that even possible?).
Rather than just restarting Postgres over and over, I dug a bit deeper.
I opened the ambari database to look inside:
sudo su - postgres
psql ambari -U ambari -W -p 5432
(password is bigdata)
and when I looked in the tables topology_logical_request, topology_request and topology_hostgroup, I saw that the cluster had never registered a remove request, only the adding request:
ambari=> select * from topology_logical_request;
id | request_id | description
----+------------+-----------------------------------------------------------
1 | 1 | Logical Request: Provision Cluster 'sentelab-perf'
62 | 51 | Logical Request: Scale Cluster 'sentelab-perf' (+1 hosts)
Check the IDs to delete (track down all requests tied to the add-node operation) and delete them (the order matters):
delete from topology_hostgroup where id = 51;
delete from topology_logical_request where id = 62;
delete from topology_request where id = 51;
Close with \q, restart ambari-server, and it works!
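For anyone repeating this, it is worth confirming first which rows actually belong to the scale request; assuming the same IDs as above, a quick check looks like:
select * from topology_request where id = 51;
select * from topology_hostgroup where id = 51;
select * from topology_logical_request where id = 62;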
I'm still a relative newbie to PostgreSQL, so pardon me if this is simple ignorance.
I've set up an active/read-only Pacemaker cluster of PostgreSQL 9.4 per the Cluster Labs documentation.
I'm trying to verify that both databases are indeed in sync, so I'm dumping on both hosts and diffing the output. The command I'm using is:
pg_dump -U myuser mydb > dump-node-1.sql
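In full, the comparison looks roughly like this (the dump is run locally on each host; the second filename is just an example):
pg_dump -U myuser mydb > dump-node-1.sql   # on the master
pg_dump -U myuser mydb > dump-node-2.sql   # on the standby
diff dump-node-1.sql dump-node-2.sql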
Pacemaker shows the database status as 'sync' and querying Postgres directly also seems to indicate the sync is good... (Host .59 is my read-only standby node)
psql -c "select client_addr,sync_state from pg_stat_replication;"
+---------------+------------+
| client_addr | sync_state |
+---------------+------------+
| 192.16.111.59 | sync |
+---------------+------------+
(1 row)
However, when I do a dump on the read-only host, I end up with all my tables having 'public.' prepended to their names. So table foo on the master node dumps as 'foo', whereas on the read-only node it dumps as 'public.foo'. I don't understand why this is happening... I had done a similar setup with a PostgreSQL 9.2 cluster and didn't see this issue. I don't have tables in the public schema on the master node...
Hope someone can help me understand what is going on.
Much appreciated!
Per a_horse_with_no_name, the security updates in 9.4.18 changed the way the dump is written compared to 9.4.15. I didn't catch that one node was still running an older version. The command that identified the problem was his suggestion to run:
psql -c "select version();"
I have a few xlog questions that I'm not sure about:
1) I have two servers that were once slaves. How can I know if they were slaves of the same master? Is it possible to check if they were split from the same source in the past? I know pg_rewind knows how to check this, but is it possible to check it easily without running pg_rewind in dry-run mode?
2) Is it true that if pg_last_xlog_replay_location is empty this server was never a slave?
3) Is it possible to know from the database itself to which master the slave is connected? I know to get this info from the recovery.conf or from the process attributes, but is it written in some system tables as well?
Thanks
Avi
were slaves of the same master
Indirectly. You can compare select xmin, ctid, oid, datname from pg_database. Of course, dropping and recreating the postgres and template databases will change those values, so this is not very reliable, but if you check them and find that ALL identifiers match, there is a good chance that the databases have the same source.
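For example, run this on each ex-slave and compare the output line by line:
psql -c "select xmin, ctid, oid, datname from pg_database order by datname;"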
A more reliable and sophisticated method is comparing the timeline history file. E.g., if both ex-slaves are on the same timeline (4 in the example below):
-bash-4.2$ psql -d 'dbname=replication replication=true sslmode=require' -U replica -h 1.1.1.1 -c 'IDENTIFY_SYSTEM'
Password for user replica:
systemid | timeline | xlogpos
---------------------+----------+--------------
9999384298900975599 | 4 | F79/275B2328
(1 row)
you can check the timeline history:
-bash-4.2$ psql -d 'dbname=replication replication=true sslmode=require' -U replica -h 1.1.1.1 -c 'TIMELINE_HISTORY 4'
Password for user replica:
filename | content
------------------+------------------------------------------------------
00000004.history | 1 9E/C3000090 no recovery target specified+
| +
| 2 C1/5A000090 no recovery target specified+
| +
| 3 A52/DB2F98B8 no recovery target specified+
|
(1 row)
If both servers have the same timeline and the same xlog position at which each timeline was created, you can say with reasonable confidence, I believe, that they came from the same source.
empty pg_last_xlog_replay_location
I would say so. Such a server was never a slave and was never recovered from WAL. At least, I don't know of a way to reset pg_last_xlog_replay_location on a promoted master...
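For reference, the check itself is just:
psql -c "select pg_last_xlog_replay_location();"
On a server that has never replayed any WAL it comes back empty (NULL).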
system tables to tell to which master the slave is connected
Nothing suitable comes to mind. If you are a superuser, you can read recovery.conf even without shell access; if you're not, you probably would not be allowed to select from such a view anyway...
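For example, as superuser (the path is relative to the data directory):
select pg_read_file('recovery.conf');
and look for the primary_conninfo line, which contains the master's address.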
Fairly new to PostgreSQL, but I have to get replication set up. I settled on BDR, and it works fine in the local demo, but on distributed machines it starts to get problematic, mostly because I have no real clue what the hell I am doing, and I cry myself to sleep pining for MySQL. I've gotten BDR working across multiple servers, almost. When I run:
SELECT bdr.bdr_node_join_wait_for_ready();
on the joining nodes, it hangs. This happens on both DB2 and DB3; DB1 returns a valid response. Researching this, I came across the bdr_init_copy command, which apparently does everything I have been doing by hand, and then some, so I tried that out. Now, when I run:
/usr/lib/postgresql/9.4/bin/bdr_init_copy -d "host=192.168.1.10 dbname=demo3" --local-dbname="host=192.168.1.23 dbname=demo3" -n db2 -D bdr_data
I get
bdr_init_copy: starting ...
Getting remote server identification ...
Detected 2 BDR database(s) on remote server
Updating BDR configuration on the remote node:
demo2: creating replication slot ...
demo2: creating node entry for local node ...
demo3: creating replication slot ...
demo3: creating node entry for local node ...
Creating base backup of the remote node...
63655/63655 kB (100%), 1/1 tablespace
Creating restore point on remote node ...
Bringing local node to the restore point ...
And it just sits there. I am assuming that both issues have the same cause. As far as I can tell, there are no log entries created on the local node (db2), but the following is present on the remote (db1):
2016-10-12 22:38:43 UTC [20808-1] postgres#demo2 LOG: logical decoding found consistent point at 0/5001F00
2016-10-12 22:38:43 UTC [20808-2] postgres#demo2 DETAIL: There are no running transactions.
2016-10-12 22:38:43 UTC [20808-3] postgres#demo2 STATEMENT: SELECT pg_create_logical_replication_slot('bdr_17163_6340711416785871202_2_17163__', 'bdr');
2016-10-12 22:38:43 UTC [20811-1] postgres#demo3 LOG: logical decoding found consistent point at 0/5002090
2016-10-12 22:38:43 UTC [20811-2] postgres#demo3 DETAIL: There are no running transactions.
2016-10-12 22:38:43 UTC [20811-3] postgres#demo3 STATEMENT: SELECT pg_create_logical_replication_slot('bdr_17939_6340711416785871202_2_17939__', 'bdr');
2016-10-12 22:38:44 UTC [20812-1] postgres#demo3 LOG: restore point "bdr_6340711416785871202" created at 0/50022A8
2016-10-12 22:38:44 UTC [20812-2] postgres#demo3 STATEMENT: SELECT pg_create_restore_point('bdr_6340711416785871202')
Any help out there?
Right, I just had this issue and none of the other forums were any help. Some of them even say things like it is okay for the new node to report its status as "o" and the other nodes to report the new server's status as "i", because "this is just a bug and it's fine". It's NOT OKAY: the new server could receive replication updates, but no primary updates were possible on the new server.
The key to solving this problem is to crank up the logging on the server you are joining to (not the new one). In the new server's logs you might see things like "08006: could not receive data from client: Connection reset by peer", which is not very helpful and will have you checking firewalls, etc. The real money shot will come from the source server's logs, when they say something like: "no free replication state could be found for 11, increase max_replication_slots".
What's probably happened is that you either have too many servers in your cluster for the default settings or, more likely, there is some junk left over from old hosts.
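If it really is just the limit, raising it is a one-line change in postgresql.conf on the node that logged the error (20 is only an example value, and the change needs a restart):
max_replication_slots = 20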
You need to clean things up ... ON EVERY SERVER IN THE EXISTING CLUSTER (NB!). Start by getting a list of things on the existing cluster:
select * from bdr.bdr_nodes order by node_sysid;
THEN, check the following:
select conn_sysid,conn_dboid from bdr.bdr_connections order by conn_sysid;
.. if you see old entries (ones whose conn_sysid does not match a node_sysid from the first query), then delete them,
e.g. delete from bdr.bdr_connections where conn_sysid='<from-first-query>';
select * from pg_replication_slots order by slot_name;
.. if you see old entries that don't contain an active sysid then delete
.. NB, use the function, DO NOT do a "delete from"
eg. select pg_drop_replication_slot('bdr_17213_6574566740899221664_1_17213__');
select * from pg_replication_identifier order by riname;
.. if you see old entries that don't contain an active sysid then delete
.. NB, use the function, DO NOT do a "delete from"
select pg_replication_identifier_drop('bdr_6443767151306784833_1_17210_17213_');
With any luck, after you've done this on every node, you will see your new server's BDR status go to 'r'. As you clean up each host, you should notice that the "08006: could not receive data from client: Connection reset by peer" log entries matching the conn_sysid of the server you've just cleaned up stop happening. Good luck!
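To keep an eye on progress while you clean up, a simple status check on the new node is enough (the column is named node_status in the BDR versions I've used, and it should end up as 'r'):
select node_sysid, node_status from bdr.bdr_nodes;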
Postgres-XL not working as expected.
I have configured a Postgres-XL cluster as below:
GTM running on node3
GTM_Proxy running on node2 and node1
Co-ordinators and datanodes running on node2 and node1.
When I try to do any operation while connected to the database directly, I get the error below, which is expected anyway:
postgres=# create table test(eno integer);
ERROR: cannot execute CREATE TABLE in a read-only transaction
But when I log in via the coordinator, I get the error below:
postgres=# \l+
ERROR: Could not begin transaction on data node.
In postgresql.log, I can see the errors below. Any idea what needs to be done?
2016-06-26 20:20:29.786 AEST,"postgres","postgres",3880,"192.168.87.130:45479",576fabb5.f28,1,"SET",2016-06-26 20:17:25 AEST,2/31,0,ERROR,22023,"node ""coord1_3878"" does not exist",,,,,,"SET global_session TO coord1_3878;SET parentPGXCPid TO 3878;",,,"pgxc"
2016-06-26 20:20:47.180 AEST,"postgres","postgres",3895,"192.168.87.131:45802",576fac7d.f37,1,"SELECT",2016-06-26 20:20:45 AEST,3/19,0,LOG,00000,"No nodes altered. Returning",,,,,,"SELECT pgxc_pool_reload();",,,"psql"
2016-06-26 20:21:12.147 AEST,"postgres","postgres",3897,"192.168.87.131:45807",576fac98.f39,1,"SET",2016-06-26 20:21:12 AEST,3/22,0,ERROR,22023,"node ""coord1_3741"" does not exist",,,,,,"SET global_session TO coord1_3741;SET parentPGXCPid TO 3741;",,,"pgxc"
Postgres-XL version - 9.5r1.1
psql (PGXL 9.5r1.1, based on PG 9.5.3 (Postgres-XL 9.5r1.1))
Any idea for this?
Seems like you haven't really configured pgxc_ctl well. Just type in
prepare config minimal
in the pgxc_ctl command line, which will generate a generic pgxc_ctl.conf file that you can modify accordingly.
Then you can follow the official Postgres-XL documentation to add nodes from the pgxc_ctl command line, as John H suggested.
I have managed to fix my issue:
1) Used the source from the git repository, XL9_5_STABLE branch (https://git.postgresql.org/gitweb/?p=postgres-xl.git;a=summary). The source tarball they provide at http://www.postgres-xl.org/download/ did not work for me.
2) Used pgxc_ctl as mentioned above. I was getting "Could not obtain a transaction ID from GTM" because, when adding the GTM, I had used localhost instead of the IP, i.e.
add gtm master gtm localhost 20001 $dataDirRoot/gtm
instead of the correct
add gtm master gtm 10.222.1.49 20001 $dataDirRoot/gtm
I'm attempting to performance-test distributed joins on Citus 5.0. I have a master and two worker nodes, and a few hash distributed tables that behave as expected with the default config. I need to use the task tracker executor to test queries that require repartitioning.
However, after setting citus.task_executor_type to task-tracker, all queries involving distributed tables fail. For example:
postgres=# SET citus.task_executor_type TO "task-tracker";
SET
postgres=# SELECT 1 FROM distrib_mcuser_car LIMIT 1;
ERROR: failed to execute job 39
DETAIL: Too many task tracker failures
Setting citus.task_executor_type in postgresql.conf has the same effect.
Is there some other configuration change I'm missing that's necessary to switch the task executor?
EDIT, more info:
PostGIS is installed on all nodes
postgres_fdw is installed on the master
All other configuration is pristine
All of the tables so far were distributed like:
SELECT master_create_distributed_table('table_name', 'id', 'hash');
SELECT master_create_worker_shards('table_name', 8, 2);
The schema for distrib_mcuser_car is fairly large, so here's a simpler example:
postgres=# \d+ distrib_test_int
Table "public.distrib_test_int"
Column | Type | Modifiers | Storage | Stats target | Description
--------+---------+-----------+---------+--------------+-------------
num | integer | | plain | |
postgres=# select * from distrib_test_int;
ERROR: failed to execute job 76
DETAIL: Too many task tracker failures
The task-tracker executor assigns tasks (queries on shards) to a background worker running on the worker node, which connects to localhost to run the task. If your superuser requires a password when connecting to localhost, then the background worker will be unable to connect. This can be resolved by adding a .pgpass file on the worker nodes for connecting to localhost.
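For example, a minimal ~/.pgpass on each worker, owned by the OS user the database runs as (the user name and password here are placeholders), with permissions set to 0600:
localhost:5432:*:postgres:yourpassword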
You can also modify the authentication settings and let the nodes connect to each other without password checks by changing pg_hba.conf.
Add the following lines to the master's pg_hba.conf:
host all all [worker 1 ip]/32 trust
host all all [worker 2 ip]/32 trust
And the following lines to worker-1's pg_hba.conf:
host all all [master ip]/32 trust
host all all [worker 2 ip]/32 trust
And the following to worker-2's pg_hba.conf:
host all all [master ip]/32 trust
host all all [worker 1 ip]/32 trust
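After editing pg_hba.conf on each node, reload the configuration so the new rules take effect, e.g.:
psql -c "select pg_reload_conf();"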
This is only intended for testing; DO NOT USE this on a production system without taking the necessary security precautions.