Flink: HA mode killing leading jobmanager terminating standby jobmanagers

Flink: HA mode killing leading jobmanager terminating standby jobmanagers - apache-zookeeper

I am trying to get Flink to run in HA mode using Zookeeper, but when I try to test it by killing the leader JobManager all my standby jobmanagers get killed too.
So instead of a standby jobmanager taking over as the new Leader, they all get killed which isn't supposed to happen.
My setup:
4 servers, 3 of those servers have Zookeeper running, but only 1 server will host all the JobManagers.
ad011.local: Zookeeper + Jobmanagers
ad012.local: Zookeeper + Taskmanager
ad013.local: Zookeeper
ad014.local: nothing interesting
My masters file looks like this:
ad011.local:8081
ad011.local:8082
ad011.local:8083
My flink-conf.yaml:
jobmanager.rpc.address: ad011.local
blob.server.port: 6130,6131,6132
jobmanager.heap.mb: 512
taskmanager.heap.mb: 128
taskmanager.numberOfTaskSlots: 4
parallelism.default: 2
taskmanager.tmp.dirs: /var/flink/data
metrics.reporters: jmx
metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
metrics.reporter.jmx.port: 8789,8790,8791
high-availability: zookeeper
high-availability.zookeeper.quorum: ad011.local:2181,ad012.local:2181,ad013.local:2181
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.path.cluster-id: /cluster-one
high-availability.storageDir: /var/flink/recovery
high-availability.jobmanager.port: 50000,50001,50002
When I run flink by using start-cluster.sh script I see my 3 JobManagers running, and going to the WebUI they all point to ad011.local:8081, which is the leader. Which is okay I guess?
I then try to test the failover by killing the leader using kill and then all my other standby JobManagers stop too.
This is what I see in my standby JobManager logs:
2017-09-29 08:08:41,590 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink#ad011.local:50002/user/jobmanager.
2017-09-29 08:08:41,590 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService#72d546c8.
2017-09-29 08:08:41,598 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting with JobManager akka.tcp://flink#ad011.local:50002/user/jobmanager on port 8083
2017-09-29 08:08:41,598 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-09-29 08:08:41,645 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink#ad011.local:50000/user/jobmanager:f7dc2c48-dfa5-45a4-a63e-ff27be21363a.
2017-09-29 08:08:41,651 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-09-29 08:08:41,722 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Received leader address but not running in leader ActorSystem. Cancelling registration.
2017-09-29 09:26:13,472 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink#ad011.local:50000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-09-29 09:26:14,274 INFO org.apache.flink.runtime.jobmanager.JobManager - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2017-09-29 09:26:14,284 INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:6132
Any help would be appreciated.

Solved it by running my cluster using ./bin/start-cluster.sh instead of using service files (which calls the same script), the service file kills the other jobmanagers apparently.

Related

Kafka and Zookeeper not working give an error (Kafka shutting down and INFO ZooKeeper audit is disabled despite enabling it)

I am a beginner and I have to use Kafka for data transfer into/from Hadoop FS (or any other application, not just through put or copyFromLocal commands),kafka needs zookeeper as well, I enabled Zooekeeper audit logging but I still get errors.
For Kafka, when I want to start it:
JMX_PORT=8004 bin/kafka-server-start.sh config/server.properties
I get the error:
[2022-02-16 13:56:45,939] INFO shutting down (kafka.server.KafkaServer)
[2022-02-16 13:56:46,114] INFO App info kafka.server for 0 unregistered (org.apache.kafka.common.utils.AppInfoParser)
[2022-02-16 13:56:46,133] INFO shut down completed (kafka.server.KafkaServer)
[2022-02-16 13:56:46,133] ERROR Exiting Kafka. (kafka.Kafka$)
[2022-02-16 13:56:46,165] INFO shutting down (kafka.server.KafkaServer)
And when I want to start Zookeeper using the command:
bin/zookeeper-server-start.sh config/zookeeper.properties
I get the following (and it gets stuck on it):
[2022-02-16 14:03:13,954] INFO zookeeper.request_throttler.shutdownTimeout = 10000 (org.apache.zookeeper.server.RequestThrottler)
[2022-02-16 14:03:13,955] INFO PrepRequestProcessor (sid:0) started, reconfigEnabled=false (org.apache.zookeeper.server.PrepRequestProcessor)
[2022-02-16 14:03:14,136] INFO Using checkIntervalMs=60000 maxPerMinute=10000 maxNeverUsedIntervalMs=0 (org.apache.zookeeper.server.ContainerManager)
[2022-02-16 14:03:14,138] INFO ZooKeeper audit is disabled. (org.apache.zookeeper.audit.ZKAuditProvider)
Does anyone know how to work this out? I enabled audit logging but still. Same problem.

Zookeeper CLI isn't "stuck"; it's waiting for connections.
Open a new terminal and start Kafka
Alternatively, you could use Docker Compose / Kubernetes, if you think your host / local JVM is causing issues

Can't change kafka broker-id in Incubator Helm chart?

I have one Zookeeper server (say xx.xx.xx.xxx:2181) running on one GCP Compute Instance VM separately.
I have 3 GKE clusters all in different regions on which I am trying to install Kafka broker nodes so that all nodes connect to one Zookeeper server(xx.xx.xx.xxx:2181).
I installed the Zookeeper server on the VM following this guide with zookeeper properties looking like below:
dataDir=/tmp/data
clientPort=2181
maxClientCnxns=0
initLimit=5
syncLimit=2
tickTime=2000
# list of servers
server.1=0.0.0.0:2888:3888
I am using this Incubator Helm Chart to deploy the brokers on GKE clusters.
As per the README.md I am trying to install with the below command:
helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
helm install --name my-kafka \
--set replicas=1,zookeeper.enabled=false,configurationOverrides."broker\.id"=1,configurationOverrides."zookeeper\.connect"="xx.xx.xx.xxx:2181" \
incubator/kafka
Error
When I deploy using any of the above ways described above on all of the three GKE Clusters, only one of the brokers gets connected to the Zookeeper server and the other two pods just restart infinitely.
When I check the Zookeeper log (on the VM), it looks something like below:
...
[2019-10-30 14:32:30,930] INFO Accepted socket connection from /xx.xx.xx.xxx:54978 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2019-10-30 14:32:30,936] INFO Client attempting to establish new session at /xx.xx.xx.xxx:54978 (org.apache.zookeeper.server.ZooKeeperServer)
[2019-10-30 14:32:30,938] INFO Established session 0x100009621af0057 with negotiated timeout 6000 for client /xx.xx.xx.xxx:54978 (org.apache.zookeeper.server.ZooKeeperServer)
[2019-10-30 14:32:32,335] INFO Got user-level KeeperException when processing sessionid:0x100009621af0057 type:create cxid:0xc zxid:0x422 txntype:-1 reqpath:n/a Error Path:/config/users Error:KeeperErrorCode = NodeExists for /config/users (org.apache.zookeeper.server.PrepRequestProcessor)
[2019-10-30 14:32:34,472] INFO Got user-level KeeperException when processing sessionid:0x100009621af0057 type:create cxid:0x14 zxid:0x424 txntype:-1 reqpath:n/a Error Path:/brokers/ids/0 Error:KeeperErrorCode = NodeExists for /brokers/ids/0 (org.apache.zookeeper.server.PrepRequestProcessor)
[2019-10-30 14:32:35,126] INFO Processed session termination for sessionid: 0x100009621af0057 (org.apache.zookeeper.server.PrepRequestProcessor)
[2019-10-30 14:32:35,127] INFO Closed socket connection for client /xx.xx.xx.xxx:54978 which had sessionid 0x100009621af0057 (org.apache.zookeeper.server.NIOServerCnxn)
[2019-10-30 14:36:49,123] INFO Expiring session 0x100009621af003b, timeout of 6000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
...
I am sure I have created firewall rules to open necessary ports and that is not a problem because one of the broker nodes is able to connect (the one who reaches first).
To me, this seems like the borkerID are not getting changed for some reason and that is the reason why Zookeeper is rejecting the connections.
I say this because kubectl logs pod/my-kafka-n outputs something like below:
...
[2019-10-30 19:56:24,614] INFO [SocketServer brokerId=0] Shutdown completed (kafka.network.SocketServer)
...
[2019-10-30 19:56:24,627] INFO [KafkaServer id=0] shutting down (kafka.server.KafkaServer)
...
As we can see above says brokerId=0 for all of the pods in all the 3 clusters.
However, when I do kubectl exec -ti pod/my-kafka-n -- env | grep BROKER, I can see the environment variable KAFKA_BROKER_ID is changed to 1, 2 and 3 for different brokers as I set.
What am I doing wrong? What is the correct way to change the kafka-broker id or to make all brokers connect to one Zookeeper instance?

make all brokers connect to one Zookeeper instance?
Seems like you are doing that okay via the configurationOverrides option. That'll deploy all pods with the same configuration.
That being said, the broker ID should not be the same per pod. If you inspect the statefulset YAML, it appears that the broker ID is calculated based on the POD_NAME variable
Sidenote
3 GKE clusters all in different regions on which I am trying to install Kafka broker nodes so that all nodes connect to one Zookeeper server
It's not clear to me how you would able to deploy to 3 sepearate clusters in one API call. But, this architecture isn't recommended by Kafka, Zookeeper, or Kubernetes communities unless these regions are "geographically close"

Why is Zookeeper not re-electing new leader in Apache Nifi Cluster?

Following is my architecture
2 Servers:
Server 1: running Apache Nifi + Zookeeper (Not embedded)
Server 2: running Apache Nifi + Zookeeper (Not embedded)
To test failovers, I close down the Server that has been selected as Cluster Coordinator
In this case, zookeeper should automatically elect the remaining one server as leader. But it keeps failing and goes into continuous loop of trying to connect to the first server
Zookeeper Logs in Server 2 when leader (Server 1) went down:
2019-10-22 18:44:01,135 [myid:2] - WARN [NIOWorkerThread-2:NIOServerCnxn#370] - Exception causing close of session 0x0: ZooKeeperServer not running
2019-10-22 18:44:02,925 [myid:2] - WARN [NIOWorkerThread-3:NIOServerCnxn#370] - Exception causing close of session 0x0: ZooKeeperServer not running
2019-10-22 18:44:03,320 [myid:2] - WARN [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager#677] -
Cannot open channel to 1 at election address ec2-server-1.compute-1.amazonaws.com/172.xx.x.x:3888
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
Server 2 Config files:
zoo.cfg
tickTime=2000
initLimit=5
syncLimit=2
dataDir=/home/ec2-user/zookeeper
clientPort=2181
server.1=ec2-server-1.compute-1.amazonaws.com:2888:3888
server.2=0.0.0.0:2888:3888
nifi.properties
nifi.cluster.is.node=true
nifi.cluster.node.address=ec2-server-2.compute-1.amazonaws.com
nifi.cluster.node.protocol.port=8082
nifi.cluster.flow.election.max.wait.time=2 mins
nifi.cluster.flow.election.max.candidates=1
# zookeeper properties, used for cluster management #
nifi.zookeeper.connect.string=localhost:2181
nifi.zookeeper.root.node=/nifi
Server 1 Config files:
zoo.cfg
tickTime=2000
initLimit=5
syncLimit=2
dataDir=/home/ec2-user/zookeeper
clientPort=2181
server.1=0.0.0.0:2888:3888
server.2=ec2-server-2.compute-1.amazonaws.com:2888:3888
nifi.properties
nifi.cluster.is.node=true
nifi.cluster.node.address=ec2-server-1.compute-1.amazonaws.com
nifi.cluster.node.protocol.port=8082
nifi.cluster.flow.election.max.wait.time=2 mins
nifi.cluster.flow.election.max.candidates=1
# zookeeper properties, used for cluster management #
nifi.zookeeper.connect.string=localhost:2181
nifi.zookeeper.root.node=/nifi
What am I doing wrong?

You need at least three nodes to be able to handle the failure of one node.
Check the Admin guide:
Clustered (Multi-Server) Setup For reliable ZooKeeper service, you
should deploy ZooKeeper in a cluster known as an ensemble. As long as
a majority of the ensemble are up, the service will be available.
Because Zookeeper requires a majority, it is best to use an odd number
of machines. For example, with four machines ZooKeeper can only handle
the failure of a single machine; if two machines fail, the remaining
two machines do not constitute a majority. However, with five machines
ZooKeeper can handle the failure of two machines.
A simpler explanation here also

Zookeeper unable to talk to new Kafka broker

In an attempt to reduce the storage on my AWS instance I decided to launch a new, smaller instance and setup Kafka again from scratch using the Ansible playbook we had from before. I then terminated the old, larger instance and took its IP address that it and the other brokers were using and put it on my new instance.
When tailing my Zookeeper logs however I'm receiving this error -
2018-04-13 14:17:34,884 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker#810] - Connection broken for id 1, my id = 2, error =
java.net.SocketException: Socket closed
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:153)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at java.net.SocketInputStream.read(SocketInputStream.java:211)
at java.io.DataInputStream.readInt(DataInputStream.java:387)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:795)
2018-04-13 14:17:34,885 [myid:2] - WARN [RecvWorker:1:QuorumCnxManager$RecvWorker#813] - Interrupting SendWorker
2018-04-13 14:17:34,884 [myid:2] - WARN [SendWorker:1:QuorumCnxManager$SendWorker#727] - Interrupted while waiting for message on queue
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2017)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2095)
at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:389)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:879)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.access$500(QuorumCnxManager.java:65)
at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:715)
I double checked and all 3 Kafka broker IP addresses are correctly listed in these location and I restarted all their services to be safe.
/etc/hosts
/etc/kafka/config/server.properties
/etc/zookeeper/conf/zoo.cfg
/etc/filebeat/filebeat.yml

Mesos-marthon cluster issue Could not determine the current leader

I am new in Mesos and Marathon services. I have setup 3 master and 3 slave server as per www.digitalocean.com. Configured as it is in master servers as well as slaves. Finally I done setup of Mesos, Marathon, Zookeeper and Chronos. Mesos is able to listing with 5050, Marathon is 8080 and Chronos 4400. After few hours my Marthon instances are showing like Error 503
HTTP ERROR: 503
Problem accessing /. Reason:
Could not determine the current leader
Powered by Jetty:// 9.3.z-SNAPSHOT.
But mesos is working fine. Every time i am facing this problem and if i restart the marathon service and zookeeper service its working fine.
Marathon
Jun 15 06:19:20 master3 marathon[1054]: INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(192.168.4.78:8080 (mesosphere.marathon.api.LeaderProxyFilter$:qtp522188921-35)
Jun 15 06:19:20 master3 marathon[1054]: INFO Waiting for consistent leadership state. Are we leader?: false, leader: Some(192.168.4.78:8080 (mesosphere.marathon.api.LeaderProxyFilter$:qtp522188921-35)
Zookeeper
2016-06-15 03:41:13,797 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#197] - Accepted socket connection from /192.168.4.78:38339
2016-06-15 03:41:13,798 - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running