Can't change kafka broker-id in Incubator Helm chart? - kubernetes

I have one Zookeeper server (say xx.xx.xx.xxx:2181) running separately on a GCP Compute Engine VM.
I have 3 GKE clusters, all in different regions, on which I am trying to install Kafka broker nodes so that all of them connect to the one Zookeeper server (xx.xx.xx.xxx:2181).
I installed the Zookeeper server on the VM following this guide, with the zookeeper properties shown below:
dataDir=/tmp/data
clientPort=2181
maxClientCnxns=0
initLimit=5
syncLimit=2
tickTime=2000
# list of servers
server.1=0.0.0.0:2888:3888
I am using this Incubator Helm Chart to deploy the brokers on GKE clusters.
As per the README.md, I am trying to install with the command below:
helm repo add incubator http://storage.googleapis.com/kubernetes-charts-incubator
helm install --name my-kafka \
--set replicas=1,zookeeper.enabled=false,configurationOverrides."broker\.id"=1,configurationOverrides."zookeeper\.connect"="xx.xx.xx.xxx:2181" \
incubator/kafka
Error
When I deploy in the way described above on all three GKE clusters, only one of the brokers gets connected to the Zookeeper server; the other two pods just restart endlessly.
When I check the Zookeeper log (on the VM), it looks something like below:
...
[2019-10-30 14:32:30,930] INFO Accepted socket connection from /xx.xx.xx.xxx:54978 (org.apache.zookeeper.server.NIOServerCnxnFactory)
[2019-10-30 14:32:30,936] INFO Client attempting to establish new session at /xx.xx.xx.xxx:54978 (org.apache.zookeeper.server.ZooKeeperServer)
[2019-10-30 14:32:30,938] INFO Established session 0x100009621af0057 with negotiated timeout 6000 for client /xx.xx.xx.xxx:54978 (org.apache.zookeeper.server.ZooKeeperServer)
[2019-10-30 14:32:32,335] INFO Got user-level KeeperException when processing sessionid:0x100009621af0057 type:create cxid:0xc zxid:0x422 txntype:-1 reqpath:n/a Error Path:/config/users Error:KeeperErrorCode = NodeExists for /config/users (org.apache.zookeeper.server.PrepRequestProcessor)
[2019-10-30 14:32:34,472] INFO Got user-level KeeperException when processing sessionid:0x100009621af0057 type:create cxid:0x14 zxid:0x424 txntype:-1 reqpath:n/a Error Path:/brokers/ids/0 Error:KeeperErrorCode = NodeExists for /brokers/ids/0 (org.apache.zookeeper.server.PrepRequestProcessor)
[2019-10-30 14:32:35,126] INFO Processed session termination for sessionid: 0x100009621af0057 (org.apache.zookeeper.server.PrepRequestProcessor)
[2019-10-30 14:32:35,127] INFO Closed socket connection for client /xx.xx.xx.xxx:54978 which had sessionid 0x100009621af0057 (org.apache.zookeeper.server.NIOServerCnxn)
[2019-10-30 14:36:49,123] INFO Expiring session 0x100009621af003b, timeout of 6000ms exceeded (org.apache.zookeeper.server.ZooKeeperServer)
...
I am sure I have created firewall rules to open the necessary ports, so that is not the problem, because one of the broker nodes (whichever reaches Zookeeper first) is able to connect.
To me, this looks like the broker IDs are not getting changed for some reason, and that is why Zookeeper rejects the connections.
I say this because kubectl logs pod/my-kafka-n outputs something like the following:
...
[2019-10-30 19:56:24,614] INFO [SocketServer brokerId=0] Shutdown completed (kafka.network.SocketServer)
...
[2019-10-30 19:56:24,627] INFO [KafkaServer id=0] shutting down (kafka.server.KafkaServer)
...
As we can see above, it says brokerId=0 for the pods in all 3 clusters.
However, when I do kubectl exec -ti pod/my-kafka-n -- env | grep BROKER, I can see that the environment variable KAFKA_BROKER_ID is changed to 1, 2 and 3 on the different brokers, as I set it.
What am I doing wrong? What is the correct way to change the kafka-broker id or to make all brokers connect to one Zookeeper instance?

make all brokers connect to one Zookeeper instance?
Seems like you are doing that okay via the configurationOverrides option. That'll deploy all pods with the same configuration.
That being said, the broker ID should not be the same per pod. If you inspect the StatefulSet YAML, it appears that the broker ID is calculated from the POD_NAME variable at container start, which overrides anything you pass through configurationOverrides. Since each of your three clusters runs its own StatefulSet with replicas=1, every pod is named my-kafka-0 and therefore computes broker.id=0, which would explain the NodeExists for /brokers/ids/0 errors in your Zookeeper log.
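For reference, this is roughly what that startup command looks like in the chart's statefulset.yaml (paraphrased, not verbatim; the exact lines vary by chart version):
# Paraphrased excerpt from the incubator/kafka statefulset.yaml.
# The pod's ordinal suffix (my-kafka-0 -> 0) becomes the broker id at startup,
# clobbering any KAFKA_BROKER_ID injected via configurationOverrides.
command:
- sh
- -exc
- |
  export KAFKA_BROKER_ID=${POD_NAME##*-} && \
  exec /etc/confluent/docker/run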
Sidenote
3 GKE clusters all in different regions on which I am trying to install Kafka broker nodes so that all nodes connect to one Zookeeper server
It's not clear to me how you would be able to deploy to 3 separate clusters in one API call. But this architecture isn't recommended by the Kafka, Zookeeper, or Kubernetes communities unless the regions are "geographically close".

Related

Kafka and Zookeeper not working give an error (Kafka shutting down and INFO ZooKeeper audit is disabled despite enabling it)

I am a beginner and I have to use Kafka to transfer data into/from Hadoop FS (or any other application, not just via the put or copyFromLocal commands). Kafka needs Zookeeper as well; I enabled Zookeeper audit logging, but I still get errors.
When I want to start Kafka:
JMX_PORT=8004 bin/kafka-server-start.sh config/server.properties
I get the error:
[2022-02-16 13:56:45,939] INFO shutting down (kafka.server.KafkaServer)
[2022-02-16 13:56:46,114] INFO App info kafka.server for 0 unregistered (org.apache.kafka.common.utils.AppInfoParser)
[2022-02-16 13:56:46,133] INFO shut down completed (kafka.server.KafkaServer)
[2022-02-16 13:56:46,133] ERROR Exiting Kafka. (kafka.Kafka$)
[2022-02-16 13:56:46,165] INFO shutting down (kafka.server.KafkaServer)
And when I want to start Zookeeper using the command:
bin/zookeeper-server-start.sh config/zookeeper.properties
I get the following (and it gets stuck on it):
[2022-02-16 14:03:13,954] INFO zookeeper.request_throttler.shutdownTimeout = 10000 (org.apache.zookeeper.server.RequestThrottler)
[2022-02-16 14:03:13,955] INFO PrepRequestProcessor (sid:0) started, reconfigEnabled=false (org.apache.zookeeper.server.PrepRequestProcessor)
[2022-02-16 14:03:14,136] INFO Using checkIntervalMs=60000 maxPerMinute=10000 maxNeverUsedIntervalMs=0 (org.apache.zookeeper.server.ContainerManager)
[2022-02-16 14:03:14,138] INFO ZooKeeper audit is disabled. (org.apache.zookeeper.audit.ZKAuditProvider)
Does anyone know how to work this out? I enabled audit logging, but the problem persists.
The Zookeeper server isn't "stuck"; it's a foreground process waiting for connections.
Open a new terminal and start Kafka there.
Alternatively, you could use Docker Compose / Kubernetes if you think your host / local JVM is causing issues.
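For example, a minimal single-broker Docker Compose sketch (assuming the Confluent images; names and versions here are illustrative):
version: "3"
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.0.1
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181      # the port Kafka connects to
  kafka:
    image: confluentinc/cp-kafka:7.0.1
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_BROKER_ID: 1
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1    # single broker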

Why is Zookeeper not re-electing new leader in Apache Nifi Cluster?

The following is my architecture:
2 Servers:
Server 1: running Apache Nifi + Zookeeper (Not embedded)
Server 2: running Apache Nifi + Zookeeper (Not embedded)
To test failover, I shut down the server that has been selected as Cluster Coordinator.
In this case, Zookeeper should automatically elect the one remaining server as leader, but it keeps failing and goes into a continuous loop of trying to connect to the first server.
Zookeeper Logs in Server 2 when leader (Server 1) went down:
2019-10-22 18:44:01,135 [myid:2] - WARN [NIOWorkerThread-2:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2019-10-22 18:44:02,925 [myid:2] - WARN [NIOWorkerThread-3:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
2019-10-22 18:44:03,320 [myid:2] - WARN [QuorumPeer[myid=2](plain=/0:0:0:0:0:0:0:0:2181)(secure=disabled):QuorumCnxManager@677] -
Cannot open channel to 1 at election address ec2-server-1.compute-1.amazonaws.com/172.xx.x.x:3888
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
Server 2 Config files:
zoo.cfg
tickTime=2000
initLimit=5
syncLimit=2
dataDir=/home/ec2-user/zookeeper
clientPort=2181
server.1=ec2-server-1.compute-1.amazonaws.com:2888:3888
server.2=0.0.0.0:2888:3888
nifi.properties
nifi.cluster.is.node=true
nifi.cluster.node.address=ec2-server-2.compute-1.amazonaws.com
nifi.cluster.node.protocol.port=8082
nifi.cluster.flow.election.max.wait.time=2 mins
nifi.cluster.flow.election.max.candidates=1
# zookeeper properties, used for cluster management #
nifi.zookeeper.connect.string=localhost:2181
nifi.zookeeper.root.node=/nifi
Server 1 Config files:
zoo.cfg
tickTime=2000
initLimit=5
syncLimit=2
dataDir=/home/ec2-user/zookeeper
clientPort=2181
server.1=0.0.0.0:2888:3888
server.2=ec2-server-2.compute-1.amazonaws.com:2888:3888
nifi.properties
nifi.cluster.is.node=true
nifi.cluster.node.address=ec2-server-1.compute-1.amazonaws.com
nifi.cluster.node.protocol.port=8082
nifi.cluster.flow.election.max.wait.time=2 mins
nifi.cluster.flow.election.max.candidates=1
# zookeeper properties, used for cluster management #
nifi.zookeeper.connect.string=localhost:2181
nifi.zookeeper.root.node=/nifi
What am I doing wrong?
You need at least three nodes to be able to handle the failure of one node.
Check the Admin guide:
Clustered (Multi-Server) Setup
For reliable ZooKeeper service, you should deploy ZooKeeper in a cluster known as an ensemble. As long as a majority of the ensemble are up, the service will be available. Because ZooKeeper requires a majority, it is best to use an odd number of machines. For example, with four machines ZooKeeper can only handle the failure of a single machine; if two machines fail, the remaining two machines do not constitute a majority. However, with five machines ZooKeeper can handle the failure of two machines.
A simpler explanation is available here as well.
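For example, a minimal three-node ensemble config would look like the sketch below (hostnames are illustrative); each node also needs a myid file in its dataDir containing its own server number:
# zoo.cfg for a three-node ensemble (tolerates one failed node)
tickTime=2000
initLimit=5
syncLimit=2
dataDir=/var/lib/zookeeper
clientPort=2181
# 2888 = follower-to-leader connections, 3888 = leader election
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888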

Kafka brokers not starting up

I previously had a 2-broker Kafka 0.10.1 cluster running correctly on my development servers with Zookeeper 3.3.6.
I recently tried upgrading the brokers to the latest Kafka, 2.3.0, but they didn't start.
Not much has changed in the configuration.
Can anybody point me to what could possibly be going wrong? Why are the brokers not starting?
Changed server.properties on broker server 1
broker.id=1
log.dirs=/mnt/kafka_2.11-2.3.0/logs
zookeeper.connect=local1:2181,local2:2181
listeners=PLAINTEXT://local1:9092
advertised.listeners=PLAINTEXT://local1:9092
Changed server.properties on broker server 2
broker.id=2
log.dirs=/mnt/kafka_2.11-2.3.0/logs
zookeeper.connect=local1:2181,local2:2181
listeners=PLAINTEXT://local2:9092
advertised.listeners=PLAINTEXT://local2:9092
NOTE:
1. Zookeeper is running on both servers.
2. The Kafka znodes, namely /brokers, /brokers/ids, /consumers etc., are getting created.
3. Nothing is getting registered under /brokers/ids. The Zookeeper CLI get /brokers/ids returns
[]
4. The command lsof -i tcp:9092 returns
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 18290 cass 118u IPv6 52133 0t0 TCP local2:9092 (LISTEN)
5. logs/server.log has no errors logged.
6. No more logs are getting appended to server.log.
Server logs
[2019-07-01 10:56:14,534] INFO Starting log flusher with a default period of 9223372036854775807 ms. (kafka.log.LogManager)
[2019-07-01 10:56:14,801] INFO Awaiting socket connections on local2:9092. (kafka.network.Acceptor)
[2019-07-01 10:56:14,829] INFO [SocketServer brokerId=1] Created data-plane acceptor and processors for endpoint : EndPoint(local2,9092,ListenerName(PLAINTEXT),PLAINTEXT) (kafka.network.SocketServer)
[2019-07-01 10:56:14,830] INFO [SocketServer brokerId=1] Started 1 acceptor threads for data-plane (kafka.network.SocketServer)
[2019-07-01 10:56:14,850] INFO [ExpirationReaper-1-Produce]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2019-07-01 10:56:14,851] INFO [ExpirationReaper-1-Fetch]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2019-07-01 10:56:14,851] INFO [ExpirationReaper-1-DeleteRecords]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2019-07-01 10:56:14,852] INFO [ExpirationReaper-1-ElectPreferredLeader]: Starting (kafka.server.DelayedOperationPurgatory$ExpiredOperationReaper)
[2019-07-01 10:56:14,860] INFO [LogDirFailureHandler]: Starting (kafka.server.ReplicaManager$LogDirFailureHandler)
[2019-07-01 10:56:14,892] INFO Creating /brokers/ids/1 (is it secure? false) (kafka.zk.KafkaZkClient)
From the docs regarding ZooKeeper:
Stable version
The current stable branch is 3.4 and the latest release of that branch is 3.4.9.
Upgrading Zookeeper to the latest version, 3.5.5, helped, and the Kafka brokers started correctly.
It would have been great if the docs had stated the incompatibility with the previous Zookeeper version.
PS: Answer added to help anyone stuck with a similar issue caused by the Zookeeper version.
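If you want to confirm which version a running Zookeeper node is on before and after upgrading, the stat four-letter command reports it (on Zookeeper 3.5+, four-letter commands may need to be whitelisted via 4lw.commands.whitelist):
echo stat | nc localhost 2181 | head -n 1
# e.g. Zookeeper version: 3.5.5-..., built on ...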

Flink: HA mode killing leading jobmanager terminating standby jobmanagers

I am trying to get Flink to run in HA mode using Zookeeper, but when I test it by killing the leading JobManager, all my standby JobManagers get killed too.
So instead of a standby JobManager taking over as the new leader, they all get killed, which isn't supposed to happen.
My setup:
4 servers, 3 of those servers have Zookeeper running, but only 1 server will host all the JobManagers.
ad011.local: Zookeeper + Jobmanagers
ad012.local: Zookeeper + Taskmanager
ad013.local: Zookeeper
ad014.local: nothing interesting
My masters file looks like this:
ad011.local:8081
ad011.local:8082
ad011.local:8083
My flink-conf.yaml:
jobmanager.rpc.address: ad011.local
blob.server.port: 6130,6131,6132
jobmanager.heap.mb: 512
taskmanager.heap.mb: 128
taskmanager.numberOfTaskSlots: 4
parallelism.default: 2
taskmanager.tmp.dirs: /var/flink/data
metrics.reporters: jmx
metrics.reporter.jmx.class: org.apache.flink.metrics.jmx.JMXReporter
metrics.reporter.jmx.port: 8789,8790,8791
high-availability: zookeeper
high-availability.zookeeper.quorum: ad011.local:2181,ad012.local:2181,ad013.local:2181
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.path.cluster-id: /cluster-one
high-availability.storageDir: /var/flink/recovery
high-availability.jobmanager.port: 50000,50001,50002
When I run Flink using the start-cluster.sh script, I see my 3 JobManagers running, and going to the WebUI they all point to ad011.local:8081, which is the leader. Which is okay, I guess?
I then try to test the failover by killing the leader using kill, and then all my other standby JobManagers stop too.
This is what I see in my standby JobManager logs:
2017-09-29 08:08:41,590 INFO org.apache.flink.runtime.jobmanager.JobManager - Starting JobManager at akka.tcp://flink@ad011.local:50002/user/jobmanager.
2017-09-29 08:08:41,590 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService@72d546c8.
2017-09-29 08:08:41,598 INFO org.apache.flink.runtime.webmonitor.WebRuntimeMonitor - Starting with JobManager akka.tcp://flink@ad011.local:50002/user/jobmanager on port 8083
2017-09-29 08:08:41,598 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-09-29 08:08:41,645 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@ad011.local:50000/user/jobmanager:f7dc2c48-dfa5-45a4-a63e-ff27be21363a.
2017-09-29 08:08:41,651 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService.
2017-09-29 08:08:41,722 INFO org.apache.flink.runtime.clusterframework.standalone.StandaloneResourceManager - Received leader address but not running in leader ActorSystem. Cancelling registration.
2017-09-29 09:26:13,472 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@ad011.local:50000] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2017-09-29 09:26:14,274 INFO org.apache.flink.runtime.jobmanager.JobManager - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2017-09-29 09:26:14,284 INFO org.apache.flink.runtime.blob.BlobServer - Stopped BLOB server at 0.0.0.0:6132
Any help would be appreciated.
Solved it by running my cluster with ./bin/start-cluster.sh directly instead of via service files (which call the same script); the service file apparently kills the other JobManagers.
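For anyone hitting the same thing: if the service file is a systemd unit, a plausible culprit is systemd's default KillMode=control-group, which on stop terminates every process in the unit's cgroup, not just the main one. A hypothetical unit snippet illustrating the setting:
[Service]
Type=forking
ExecStart=/opt/flink/bin/start-cluster.sh
# The default (KillMode=control-group) would kill every process in the unit's
# cgroup, taking the standby JobManagers down together with the leader.
KillMode=process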

testing kafka consumer and producer failed on connection

I have been trying to test a Kafka installation and, following the guide, created a producer and a consumer. When trying to retrieve a message, I get the following error:
WARN Session 0x0 for server null, unexpected error, closing socket connection and
attempting reconnect (org.apache.zookeeper.ClientCnxn)
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1146)
[2014-03-04 18:01:20,628] INFO Terminate ZkClient event thread. (org.I0Itec.zkclient.ZkEventThread)
[2014-03-04 18:01:21,315] INFO Opening socket connection to server kafka-test/192.xxxxxx.110:2182 (org.apache.zookeeper.ClientCnxn)
[2014-03-04 18:01:21,418] INFO Session: 0x0 closed (org.apache.zookeeper.ZooKeeper)
Exception in thread "main" org.I0Itec.zkclient.exception.ZkTimeoutException: Unable to connect to zookeeper server within timeout: 6000
at org.I0Itec.zkclient.ZkClient.connect(ZkClient.java:880)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:98)
at org.I0Itec.zkclient.ZkClient.<init>(ZkClient.java:84)
at kafka.consumer.ZookeeperConsumerConnector.connectZk(ZookeeperConsumerConnector.scala:151)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:112)
at kafka.consumer.ZookeeperConsumerConnector.<init>(ZookeeperConsumerConnector.scala:123)
at kafka.consumer.Consumer$.create(ConsumerConnector.scala:89)
at kafka.consumer.ConsoleConsumer$.main(ConsoleConsumer.scala:178)
at kafka.consumer.ConsoleConsumer.main(ConsoleConsumer.scala)
[2014-03-04 18:01:21,419] INFO EventThread shut down (org.apache.zookeeper.ClientCnxn)
Kafka
Looks like you're not connecting to Zookeeper correctly. I'm not sure of your setup (multi-machine, VMs, containers) so it's hard to say what's wrong. From the debug output I see the following line hinting at your expected Zookeeper IP:
[2014-03-04 18:01:21,315] INFO Opening socket connection to server kafka-test/192.xxxxxx.110:2182 (org.apache.zookeeper.ClientCnxn)
Kafka looks for Zookeeper at the address specified by the zookeeper.connect configuration property in the $KAFKA_HOME/config/server.properties file. Be sure to edit that before starting Kafka. Also, try giving the actual public IP of your Zookeeper instance, not just 127.0.0.1 as that solves a lot of confusion if you're running in containers. In your case it looks like it would be:
zookeeper.connect=192.xxxxxx.110:2182
Also relevant to the Kafka config: if you're running on AWS or operating in a container, don't forget to update the following two configuration properties to make sure clients that connect to Kafka see the correct public IP (a combined sketch follows the list):
advertised.host.name
advertised.port
and Kafka sees the correct internal IP
host.name
port
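Putting those together, a sketch of the relevant section of server.properties for an older (0.8-era) broker like the one in this question; the IPs here are placeholders:
# internal address the broker binds to
host.name=10.0.0.5
port=9092
# public address advertised to clients
advertised.host.name=54.xx.xx.xx
advertised.port=9092
# where the broker finds Zookeeper
zookeeper.connect=192.xxxxxx.110:2182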
Zookeeper
Zookeeper has some gotchas when setting it up as well. On your Zookeeper instance, don't forget to edit the server configuration property in the zoo.cfg file (usually in /etc/zookeeper/conf) to point to the correct IP for your Zookeeper instance. In your case, probably the following:
server.1=192.xxxxxx.110:2888:3888
Those last two ports (2888 and 3888) are only needed if you're running a Zookeeper cluster; they're used for followers to connect to the leader and for leader election, respectively, so be sure to open them in your firewall if you have multiple Zookeeper servers.
Check your zookeeper connection with telnet command:
telnet 192.xxxxxx.110 2181
You probably get an error, in which case check that the process is running:
ps -ef | grep "zookeeper.properties"
If it's not running, start it from the Kafka home directory:
bin/zookeeper-server-start.sh config/zookeeper.properties &
Something is wrong with your Zookeeper configuration. Make sure your Zookeeper is up and running. The default port it runs on is 2181.
A bit more info and some code would be useful, I believe.
I hit the same issue, and the problem was the max client connections property in the Zookeeper config.
If you see something like "maxClientCnxns = 20" in the config file in /etc/zookeeper/conf, comment it out and restart Zookeeper.
You may also check whether all the available connections have been exhausted. If you are using an API to connect to ZK, make sure you close the connection when you're done.
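One way to see how many connections are currently open, and from which clients, is the cons four-letter command:
# lists every client connection to this Zookeeper node
echo cons | nc 192.xxxxxx.110 2181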
I also hit this problem. When I shut down the firewall on the ZK node, it worked.