mesos masters keep restarting - apache-zookeeper

I have 3 mesos masters with version 0.26.0 setup with a quorum of 2. When I start them, they keep restarting even before I turn up any frameworks or slaves.
Here's the errors I'm seeing:
F0322 19:36:56.009903 51459 master.cpp:1368] Recovery failed: Failed to recover registrar: Failed to perform fetch within 1mins
E0322 19:37:18.300568 41095 process.cpp:1911] Failed to shutdown socket with fd 26: Transport endpoint is not connected
There's no firewall running.
I start them with supervisord and the following command:
/usr/sbin/mesos-master --cluster=int --log_dir=/var/log/mesos/int --quorum=2 --port=5050 --work_dir=/tmp/mesos/work/int --zk=zk://intMesosMaster01:2181,intMesosMaster02:2181,intMesosMaster03:2181/mesos
Zookeeper is up and running fine with 3 nodes. It's in use for other projects and has no issues at all with them.

Related

Hyperledger fabric chaincode connection with peer getting dropped

I have a hyperledger fabric network version 2.4.4 running on Kubernetes. The peers and other components are running behind istio ingress. The chaincode is running on dind (docker-in-docker) container and connects to peer through its URL. The problem is the chaincode connection is being dropped after few minutes. Below are the logs:
2022-07-14T04:31:13.057Z info [c-api:lib/handler.js] [assetschannel-ddc183b4] Calling chaincode Invoke() succeeded. Sending COMPLETED message back to peer
2022-07-14T04:33:04.197Z error [c-api:lib/handler.js] Chat stream with peer - on error: %j "Error: 14 UNAVAILABLE: Connection dropped\n at Object.callErrorFromStatus (/usr/local/src/node_modules/#grpc/grpc-js/build/src/call.js:31:26)\n at Object.onReceiveStatus (/usr/local/src/node_modules/#grpc/grpc-js/build/src/client.js:391:49)\n at Object.onReceiveStatus (/usr/local/src/node_modules/#grpc/grpc-js/build/src/client-interceptors.js:328:181)\n at /usr/local/src/node_modules/#grpc/grpc-js/build/src/call-stream.js:187:78\n at processTicksAndRejections (node:internal/process/task_queues:78:11)"
I did set the following environment variables in the peer pod to keep the connection alive:
CORE_CHAINCODE_KEEPALIVE: 60000
CORE_PEER_KEEPALIVE_CLIENT_INTERVAL: 600s
CORE_PEER_KEEPALIVE_CLIENT_TIMEOUT: 2s
CORE_PEER_KEEPALIVE_DELIVERYCLIENT_INTERVAL: 20s
CORE_PEER_KEEPALIVE_MININTERVAL: 15s
but this did not resolve the issue.
Any suggestions would be appreciated.
It appears to be an issue with aws elb. The idle timeout was set to 60s which was breaking the connection between chaincode and peer when there was no communication between them. Increasing this time fixed the issue.

Why is Zookeeper not re-electing new leader in Apache Nifi Cluster?

Following is my architecture
2 Servers:
Server 1: running Apache Nifi + Zookeeper (Not embedded)
Server 2: running Apache Nifi + Zookeeper (Not embedded)
To test failovers, I close down the Server that has been selected as Cluster Coordinator
In this case, zookeeper should automatically elect the remaining one server as leader. But it keeps failing and goes into continuous loop of trying to connect to the first server
Zookeeper Logs in Server 2 when leader (Server 1) went down:
2019-10-22 18:44:01,135 [myid:2] - WARN [NIOWorkerThread-2:NIOServerCnxn#370] - Exception causing close of session 0x0: ZooKeeperServer not running
2019-10-22 18:44:02,925 [myid:2] - WARN [NIOWorkerThread-3:NIOServerCnxn#370] - Exception causing close of session 0x0: ZooKeeperServer not running
2019-10-22 18:44:03,320 [myid:2] - WARN [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager#677] -
Cannot open channel to 1 at election address ec2-server-1.compute-1.amazonaws.com/172.xx.x.x:3888
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
Server 2 Config files:
zoo.cfg
tickTime=2000
initLimit=5
syncLimit=2
dataDir=/home/ec2-user/zookeeper
clientPort=2181
server.1=ec2-server-1.compute-1.amazonaws.com:2888:3888
server.2=0.0.0.0:2888:3888
nifi.properties
nifi.cluster.is.node=true
nifi.cluster.node.address=ec2-server-2.compute-1.amazonaws.com
nifi.cluster.node.protocol.port=8082
nifi.cluster.flow.election.max.wait.time=2 mins
nifi.cluster.flow.election.max.candidates=1
# zookeeper properties, used for cluster management #
nifi.zookeeper.connect.string=localhost:2181
nifi.zookeeper.root.node=/nifi
Server 1 Config files:
zoo.cfg
tickTime=2000
initLimit=5
syncLimit=2
dataDir=/home/ec2-user/zookeeper
clientPort=2181
server.1=0.0.0.0:2888:3888
server.2=ec2-server-2.compute-1.amazonaws.com:2888:3888
nifi.properties
nifi.cluster.is.node=true
nifi.cluster.node.address=ec2-server-1.compute-1.amazonaws.com
nifi.cluster.node.protocol.port=8082
nifi.cluster.flow.election.max.wait.time=2 mins
nifi.cluster.flow.election.max.candidates=1
# zookeeper properties, used for cluster management #
nifi.zookeeper.connect.string=localhost:2181
nifi.zookeeper.root.node=/nifi
What am I doing wrong?
You need at least three nodes to be able to handle the failure of one node.
Check the Admin guide:
Clustered (Multi-Server) Setup For reliable ZooKeeper service, you
should deploy ZooKeeper in a cluster known as an ensemble. As long as
a majority of the ensemble are up, the service will be available.
Because Zookeeper requires a majority, it is best to use an odd number
of machines. For example, with four machines ZooKeeper can only handle
the failure of a single machine; if two machines fail, the remaining
two machines do not constitute a majority. However, with five machines
ZooKeeper can handle the failure of two machines.
A simpler explanation here also

Error restoring Rancher: This cluster is currently Unavailable; areas that interact directly with it will not be available until the API is ready

I am trying to backup and restore rancher server (single node install), as the described here.
After backup, I tried to turn off the rancher server node, and I run a new rancher container on a new node (in the same network, but another ip address), then I restored using the backup file.
After restoring, I logged in to the rancher UI and it showed the error below:
So, I checked the logs of the rancher server and it showed as below:
2019-10-05 16:41:32.197641 I | http: TLS handshake error from 127.0.0.1:38388: EOF
2019-10-05 16:41:32.202442 I | http: TLS handshake error from 127.0.0.1:38380: EOF
2019-10-05 16:41:32.210378 I | http: TLS handshake error from 127.0.0.1:38376: EOF
2019-10-05 16:41:32.211106 I | http: TLS handshake error from 127.0.0.1:38386: EOF
2019/10/05 16:42:26 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019/10/05 16:44:34 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019/10/05 16:48:50 [ERROR] ClusterController c-4pgjl [user-controllers-controller] failed with : failed to start user controllers for cluster c-4pgjl: failed to contact server: Get https://192.168.94.154:6443/api/v1/namespaces/kube-system?timeout=30s: waiting for cluster agent to connect
2019-10-05 16:50:19.114475 I | mvcc: store.index: compact 75951
2019-10-05 16:50:19.137825 I | mvcc: finished scheduled compaction at 75951 (took 22.527694ms)
2019-10-05 16:55:19.120803 I | mvcc: store.index: compact 76282
2019-10-05 16:55:19.124813 I | mvcc: finished scheduled compaction at 76282 (took 2.746382ms)
After that, I checked logs of the master nodes, I found that the rancher agent still tries to connect to the old rancher server (old ip address), not as the new one, so it makes the cluster not available.
How can I fix this?
You need to re-register the node in Rancher using the following steps.
Update the server-url in Rancher by going to Global -> Settings -> server-url
This should be the full URL with https://
Then use this script to re-register the node in Rancher https://github.com/mattmattox/cluster-agent-tool

Mesos-marthon cluster issue Could not determine the current leader

I am new in Mesos and Marathon services. I have setup 3 master and 3 slave server as per www.digitalocean.com. Configured as it is in master servers as well as slaves. Finally I done setup of Mesos, Marathon, Zookeeper and Chronos. Mesos is able to listing with 5050, Marathon is 8080 and Chronos 4400. After few hours my Marthon instances are showing like Error 503
HTTP ERROR: 503
Problem accessing /. Reason:
Could not determine the current leader
Powered by Jetty:// 9.3.z-SNAPSHOT.
But mesos is working fine. Every time i am facing this problem and if i restart the marathon service and zookeeper service its working fine.
Marathon
Jun 15 06:19:20 master3 marathon[1054]: INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(192.168.4.78:8080 (mesosphere.marathon.api.LeaderProxyFilter$:qtp522188921-35)
Jun 15 06:19:20 master3 marathon[1054]: INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(192.168.4.78:8080 (mesosphere.marathon.api.LeaderProxyFilter$:qtp522188921-35)
Zookeeper
2016-06-15 03:41:13,797 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.4.78:38339
2016-06-15 03:41:13,798 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running

mongodb keyFile between replicas throws Permission denied

I have a single node ReplicaSet with auth activated, a root user and a keyFile I've created with this tutorial, I also have two more mongod processes in the same server in different ports (37017 and 47017) and the same replSet name, but when I try to add the secondaries in the mongo shell connected to PRIMARY with rs.add("172.31.48.41:37017") I get:
{
"ok" : 0,
"errmsg" : "Quorum check failed because not enough voting nodes responded; required 2 but only the following 1 voting nodes responded: 172.31.48.41:27017; the following nodes did not respond affirmatively: 172.31.48.41:37017 failed with Failed attempt to connect to 172.31.48.41:37017; couldn't connect to server 172.31.48.41:37017 (172.31.48.41), connection attempt failed",
"code" : 74
}
Then I went to the mongod process log of the PRIMARY and found out this:
2015-05-19T20:53:59.848-0400 I REPL [conn51] replSetReconfig admin command received from client
2015-05-19T20:53:59.848-0400 W NETWORK [conn51] Failed to connect to 172.31.48.41:37017, reason: errno:13 Permission denied
2015-05-19T20:53:59.848-0400 I REPL [conn51] replSetReconfig config object with 2 members parses ok
2015-05-19T20:53:59.849-0400 W NETWORK [ReplExecNetThread-0] Failed to connect to 172.31.48.41:37017, reason: errno:13 Permission denied
2015-05-19T20:53:59.849-0400 W REPL [ReplicationExecutor] Failed to complete heartbeat request to 172.31.48.41:37017; Location18915 Failed attempt to connect to 172.31.48.41:37017; couldn't connect to server 172.31.48.41:37017 (172.31.48.41), connection attempt failed
2015-05-19T20:53:59.849-0400 E REPL [conn51] replSetReconfig failed; NodeNotFound Quorum check failed because not enough voting nodes responded; required 2 but only the following 1 voting nodes responded: 172.31.48.41:27017; the following nodes did not respond affirmatively: 172.31.48.41:37017 failed with Failed attempt to connect to 172.31.48.41:37017; couldn't connect to server 172.31.48.41:37017 (172.31.48.41), connection attempt failed
And the log of the mongod that should become SECONDARY shows nothing, the last two lines are:
2015-05-19T20:48:36.584-0400 I REPL [initandlisten] Did not find local replica set configuration document at startup; NoMatchingDocument Did not find replica set configuration document in local.system.replset
2015-05-19T20:48:36.591-0400 I NETWORK [initandlisten] waiting for connections on port 37017
It's clear that I cannot rs.initiate() in this node because it will self vote to be PRIMARY and that would create a conflict, so the line that states "Did not find local replica set configuration document at startup" is to be ignores as far as I know.
So I would think that the permission should be ok since I'm using the same key file in every mongod process and the replSet is the same in every config file, and that's all the tutorial states to be needed, but obviously something is missing.
Any ideas? Is this a bug?
If you are using ec2 instances and ip 27017 port in security group for both instances, just add a secondary instance port. It worked for me.