Percona cluster automatic fail-over not working - CentOS

I'm deploying Percona XtraDB Cluster and am stuck on automatic fail-over. When I stop node 2, the server status in the database is not updated automatically.
This is the log:
2017-06-23 23:37:29 MySQL_Monitor.cpp:1126:monitor_ping(): [ERROR] Server 192.168.1.11:3306 missed 3 heartbeats, shunning it and killing all the connections
2017-06-23 23:37:39 MySQL_Monitor.cpp:1126:monitor_ping(): [ERROR] Server 192.168.1.11:3306 missed 3 heartbeats, shunning it and killing all the connections
2017-06-23 23:37:49 MySQL_Monitor.cpp:1126:monitor_ping(): [ERROR] Server 192.168.1.11:3306 missed 3 heartbeats, shunning it and killing all the connections
2017-06-23 23:37:59 MySQL_Monitor.cpp:1126:monitor_ping(): [ERROR] Server 192.168.1.11:3306 missed 3 heartbeats, shunning it and killing all the connections
2017-06-23 23:38:09 MySQL_Monitor.cpp:1126:monitor_ping(): [ERROR] Server 192.168.1.11:3306 missed 3 heartbeats, shunning it and killing all the connections
2017-06-23 23:38:19 MySQL_Monitor.cpp:1126:monitor_ping(): [ERROR] Server 192.168.1.11:3306 missed 3 heartbeats, shunning it and killing all the connections
2017-06-23 23:38:29 MySQL_Monitor.cpp:1126:monitor_ping(): [ERROR] Server 192.168.1.11:3306 missed 3 heartbeats, shunning it and killing all the connections
2017-06-23 23:38:39 MySQL_Monitor.cpp:1126:monitor_ping(): [ERROR] Server 192.168.1.11:3306 missed 3 heartbeats, shunning it and killing all the connections
2017-06-23 23:38:49 MySQL_Monitor.cpp:1126:monitor_ping(): [ERROR] Server 192.168.1.11:3306 missed 3 heartbeats, shunning it and killing all the connections
Thanks.
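Those "shunning" messages appear to come from ProxySQL's monitor module. A quick way to see how ProxySQL currently classifies each backend (a minimal sketch, assuming the default admin interface on port 6032 with admin/admin credentials; adjust to your setup) is to query the runtime tables:
# connect to the ProxySQL admin interface (port and credentials are assumptions)
mysql -u admin -padmin -h 127.0.0.1 -P 6032 \
  -e "SELECT hostgroup_id, hostname, port, status FROM runtime_mysql_servers;"
A node that keeps missing heartbeats should show up here as SHUNNED; if it never returns to ONLINE after the node is healthy again, the monitoring credentials and the health-check configuration are worth reviewing.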

Related

Need help connecting to Libera Chat

I am having trouble connecting to libera.chat and irc.libera.chat using Konversation Version 1.8.21123 on Jammy Jellyfish (fully updated). I have worked through the steps given on https://userbase.kde.org/Konversatio...tication#step5 and still cannot connect. The repeating log is shown below.
[12:44] [Info] Looking for server irc.libera.chat (port 6697)...
[12:44] [Info] Server found, connecting...
[12:44] [Info] Negotiating capabilities with server...
[12:44] [Notice] -lithium.libera.chat- *** Checking Ident
[12:44] [Notice] -lithium.libera.chat- *** Looking up your hostname...
[12:44] [Notice] -lithium.libera.chat- *** Couldn't look up your hostname
[12:45] [Notice] -lithium.libera.chat- *** No Ident response
[12:45] [Capabilities] account-notify away-notify chghost extended-join multi-prefix sasl=PLAIN,ECDSA-NIST256P-CHALLENGE,EXTERNAL tls account-tag cap-notify echo-message server-time solanum.chat/identify-msg solanum.chat/oper solanum.chat/realhost
[12:45] [Info] Requesting capabilities: account-notify away-notify chghost extended-join multi-prefix sasl cap-notify server-time
[12:45] [Info] SASL capability acknowledged by server, attempting SASL PLAIN authentication...
[12:45] [Error] SASL authentication attempt failed.
[12:45] [Info] Closing capabilities negotiation.
[12:45] [Error] Connection to server irc.libera.chat (port 6697) lost: The TLS/SSL connection has been closed.
[12:45] [Info] Trying to reconnect to irc.libera.chat (port 6697) in 10 seconds.
[12:45] [Info] Looking for server irc.libera.chat (port 6697)... <-- Log repeats from this line.
Is there something blatant that I have overlooked?
Is there some web page that I need to visit in order to register my ident/hostname/whatever (!)?
Stuart

Error restoring Rancher: This cluster is currently Unavailable; areas that interact directly with it will not be available until the API is ready

I am trying to back up and restore a Rancher server (single-node install), as described here.
After taking the backup, I turned off the Rancher server node, ran a new Rancher container on a new node (in the same network, but with another IP address), and then restored using the backup file.
After restoring, I logged in to the Rancher UI and it showed the error mentioned in the title.
So I checked the logs of the Rancher server, which showed the following:
2019-10-05 16:41:32.197641 I | http: TLS handshake error from 127.0.0.1:38388: EOF
2019-10-05 16:41:32.202442 I | http: TLS handshake error from 127.0.0.1:38380: EOF
2019-10-05 16:41:32.210378 I | http: TLS handshake error from 127.0.0.1:38376: EOF
2019-10-05 16:41:32.211106 I | http: TLS handshake error from 127.0.0.1:38386: EOF
2019/10/05 16:42:26 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019/10/05 16:44:34 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019/10/05 16:48:50 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019-10-05 16:50:19.114475 I | mvcc: store.index: compact 75951
2019-10-05 16:50:19.137825 I | mvcc: finished scheduled compaction at 75951 (took 22.527694ms)
2019-10-05 16:55:19.120803 I | mvcc: store.index: compact 76282
2019-10-05 16:55:19.124813 I | mvcc: finished scheduled compaction at 76282 (took 2.746382ms)
After that, I checked the logs of the master nodes and found that the Rancher agent still tries to connect to the old Rancher server (the old IP address), not the new one, which makes the cluster unavailable.
How can I fix this?
You need to re-register the node in Rancher using the following steps.
Update the server-url in Rancher by going to Global -> Settings -> server-url.
This should be the full URL, including https://.
Then use this script to re-register the node in Rancher: https://github.com/mattmattox/cluster-agent-tool
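If you prefer to re-point the agent by hand instead of using the script, a minimal sketch (assuming the standard cattle-cluster-agent deployment in the cattle-system namespace of the downstream cluster, with a placeholder URL) looks like this:
# point the downstream cluster agent at the new Rancher server URL (URL is a placeholder);
# changing the env var triggers a new rollout of the agent pods
kubectl -n cattle-system set env deployment/cattle-cluster-agent CATTLE_SERVER=https://<new-rancher-url>
Once the agent reconnects to the new server URL, the cluster should leave the Unavailable state.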

ZooKeeper Mesos configuration issue

I'm following this guide to configure a Mesos cluster with 3 master nodes and 3 slave nodes. However, when I start ZooKeeper on the master nodes I get the following error log:
2017-07-05 09:46:18,568 - INFO [main:FileSnap#83] - Reading snapshot /var/lib/zookeeper/version-2/snapshot.100000016
2017-07-05 09:46:18,606 - ERROR [main:FileTxnSnapLog#210] - Parent /mesos/log_replicas missing for /mesos/log_replicas/0000000002
2017-07-05 09:46:18,607 - ERROR [main:QuorumPeer#453] - Unable to load database on disk
java.io.IOException: Failed to process transaction type: 1 error: KeeperErrorCode = NoNode for /mesos/log_replicas
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:153)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /mesos/log_replicas
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:211)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:151)
... 6 more
2017-07-05 09:46:18,610 - ERROR [main:QuorumPeerMain#89] - Unexpected exception, exiting abnormally
java.lang.RuntimeException: Unable to run quorum server
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:454)
at org.apache.zookeeper.server.quorum.QuorumPeer.start(QuorumPeer.java:409)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.runFromConfig(QuorumPeerMain.java:151)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:111)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
Caused by: java.io.IOException: Failed to process transaction type: 1 error: KeeperErrorCode = NoNode for /mesos/log_replicas
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:153)
at org.apache.zookeeper.server.ZKDatabase.loadDataBase(ZKDatabase.java:223)
at org.apache.zookeeper.server.quorum.QuorumPeer.loadDataBase(QuorumPeer.java:417)
... 4 more
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /mesos/log_replicas
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:211)
at org.apache.zookeeper.server.persistence.FileTxnSnapLog.restore(FileTxnSnapLog.java:151)
... 6 more
When the slaves are started they obviously cannot discover the masters, since they cannot connect to ZooKeeper. The slaves give this error:
I0705 09:33:43.593530 25710 provisioner.cpp:410] Provisioner recovery complete
I0705 09:33:43.593668 25710 slave.cpp:5970] Finished recovery
W0705 09:33:53.529522 25717 group.cpp:494] Timed out waiting to connect to ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration
I0705 09:33:53.530243 25717 group.cpp:510] ZooKeeper session expired
W0705 09:34:03.532635 25710 group.cpp:494] Timed out waiting to connect to ZooKeeper. Forcing ZooKeeper session (sessionId=0) expiration
I0705 09:34:03.533331 25710 group.cpp:510] ZooKeeper session expired
Any ideas how to troubleshoot this?
Reinstalling the master nodes solved the first problem.
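A lighter-weight alternative to a full reinstall, sketched here under the assumption that the ZooKeeper data directory is the /var/lib/zookeeper path shown in the log (service name may differ per package), is to clear only the corrupted snapshot/transaction-log directory on the affected node and let it resync from the rest of the quorum:
# stop ZooKeeper, move the corrupted data aside, then start it again so it resyncs from the quorum
sudo service zookeeper stop
sudo mv /var/lib/zookeeper/version-2 /var/lib/zookeeper/version-2.bak
sudo service zookeeper start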
I still had the second problem, where the slaves could not find ZooKeeper. The documentation seems to indicate that slaves can discover the master nodes automatically, but that was not working for me. However, once I pointed the slaves at the ZooKeeper nodes in /etc/mesos/zk, it started working.
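For reference, a minimal sketch of that file on each slave (hostnames assumed to match the three masters from the guide) is a single zk:// URL listing all ZooKeeper nodes:
# /etc/mesos/zk on every slave: one zk:// URL naming all ZooKeeper servers and the /mesos znode
echo "zk://master1:2181,master2:2181,master3:2181/mesos" | sudo tee /etc/mesos/zk
# restart the slave so it picks up the new setting (service name may differ per package)
sudo service mesos-slave restart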

Cygnus does not reconnect to the Kafka broker

I am using the cygnus-kafka connector. When the connection between Cygnus and ZooKeeper is lost, Cygnus cannot reconnect to ZooKeeper once the connection is back; I need to restart it before it is able to reconnect.
Any ideas why Cygnus is not able to reconnect to the Kafka broker after the connection has been lost once?
This is the error that I got:
time=2016-11-30T11:29:26.254Z | lvl=WARN | corr=2a924ba4-b6f0-11e6-8836-fa163e68f7a2 | trans=ce766745-ae85-415a-a6f3-0bed9f121e79 | srv=service| subsrv=/servicepath | function=run | comp=cygnusagent | msg=org.apache.zookeeper.ClientCnxn$SendThread[1185] : Session 0x0 for server kafkaServerIp/kafkaServerIp:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:856)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1154)
time=2016-11-30T11:29:28.211Z | lvl=WARN | corr=2a924ba4-b6f0-11e6-8836-fa163e68f7a2 | trans=ce766745-ae85-415a-a6f3-0bed9f121e79 | srv=service| subsrv=/servicepath | function=processNewBatches | comp=cygnusagent | msg=com.telefonica.iot.cygnus.sinks.NGSISink[439] : Unable to connect to zookeeper server within timeout: 10000
Thanks!
The problem is that the connection from Cygnus to Kafka is permanent, for efficiency reasons. However, a check for a connection reset by peer is missing in the code. I'll fix it ASAP so that it is ready for the next version release (1.7.0) by the end of January (of course, it will be available in the master branch once fixed, much sooner).
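In the meantime, a quick sketch for checking from the Cygnus host whether ZooKeeper is reachable again (assuming nc is available, and using ZooKeeper's four-letter-word commands against the host name from the log) is:
# ask ZooKeeper whether it is serving requests; a healthy server answers "imok"
echo ruok | nc kafkaServerIp 2181
If the server answers but Cygnus still does not reconnect, restarting Cygnus remains the workaround until the fix lands.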

Mesos masters keep restarting

I have 3 Mesos masters running version 0.26.0, set up with a quorum of 2. When I start them, they keep restarting even before I bring up any frameworks or slaves.
Here are the errors I'm seeing:
F0322 19:36:56.009903 51459 master.cpp:1368] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
E0322 19:37:18.300568 41095 process.cpp:1911] Failed to shutdown socket with fd 26: Transport endpoint is not connected
There's no firewall running.
I start them with supervisord and the following command:
/usr/sbin/mesos-master --cluster=int --log_dir=/var/log/mesos/int --quorum=2 --port=5050 --work_dir=/tmp/mesos/work/int --zk=zk://intMesosMaster01:2181,intMesosMaster02:2181,intMesosMaster03:2181/mesos
ZooKeeper is up and running fine with 3 nodes. It's in use by other projects and has no issues at all with them.
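The "Failed to perform fetch within 1mins" error often points at the masters' replicated log not reaching a quorum, so a quick sanity check (a sketch, assuming the standard ZooKeeper CLI and nc are available on a master) is to confirm the masters actually register under /mesos and can reach each other on port 5050:
# check that the masters have registered their info under the /mesos znode
zkCli.sh -server intMesosMaster01:2181 ls /mesos
# check master-to-master connectivity on the libprocess port
nc -zv intMesosMaster02 5050
nc -zv intMesosMaster03 5050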