Testing "fail-over" on Kafka - apache-kafka

Set-up 1:
OS: Windows 10
ZooKeeper
3 ZooKeeper instances downloaded from Apache (tested with v3.5.6 and v3.4.14):
(1) apache-zookeeper-3.5.6-bin_1
(2) apache-zookeeper-3.5.6-bin_2 (Copy of 1)
(3) apache-zookeeper-3.5.6-bin_3 (Copy of 1)
zoo.cfg:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zookeeper_3.4.14_1
clientPort=2181
admin.serverPort=10081
server.1=localhost:2881:3881
server.2=localhost:2882:3882
server.3=localhost:2883:3883
4lw.commands.whitelist=*
zoo.cfg:
...
dataDir=/tmp/zookeeper_3.4.14_2
clientPort=2182
admin.serverPort=10082
...
zoo.cfg:
...
dataDir=/tmp/zookeeper_3.4.14_3
clientPort=2183
admin.serverPort=10083
...
A myid file in each dataDir, with the values 1, 2 and 3 respectively
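For reference, a minimal way to create those myid files in a Unix-style shell (paths taken from the dataDir values above; on Windows, creating the files in an editor avoids stray whitespace):
echo 1 > /tmp/zookeeper_3.4.14_1/myid
echo 2 > /tmp/zookeeper_3.4.14_2/myid
echo 3 > /tmp/zookeeper_3.4.14_3/myid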
Kafka
2 Kafka instances:
(1) kafka_2.12-2.3.0_1
(2) kafka_2.12-2.3.0_2 (Copy of 1)
server.properties:
...
broker.id=1
listeners=PLAINTEXT://:9091
log.dirs=/tmp/kafka-logs-1
zookeeper.connect=localhost:2181,localhost:2182,localhost:2183
...
server.properties:
...
broker.id=2
listeners=PLAINTEXT://:9092
log.dirs=/tmp/kafka-logs-2
zookeeper.connect=localhost:2181,localhost:2182,localhost:2183
...
Spring
spring-boot-starter-* 2.2.0.RELEASE
spring-kafka-2.3.1.RELEASE
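For the client side to be able to fail over at all, the Spring application must list both brokers; a minimal application.properties sketch (the property is Spring Boot's spring-kafka auto-configuration key; that the OP configured it this way is an assumption):
spring.kafka.bootstrap-servers=localhost:9091,localhost:9092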
=====================================================================
Set-up 2:
Same as Set-up 1, the only difference being that instead of using the ZooKeeper downloaded from Apache, I am using the ZooKeeper that comes bundled with Kafka.
=====================================================================
Issue
The issue is that when I bring one Kafka broker down:
=> Set-up 1 will not fail over, meaning that when I produce a message, the Kafka broker that is still up does not receive the message
=> Set-up 2 will fail over, meaning that when I produce a message, the Kafka broker that is still up does receive the message
Do you see anything wrong with Set-up 1?
P.S. If you need more details, I am happy to provide them.

In case it helps someone:
I had 2 Kafka instances, but my replication factor for any topic created was 1 (due to my misunderstanding/misinterpretation of its meaning).
What this means is that, at topic-creation time, each topic will be created on either Kafka-1 or Kafka-2, not both. As such, when I tried to fail over, the fail-over would fail depending on which topic I was writing to and which Kafka instance was brought down.
In short, if you have X Kafka instances, the topic replication factor needs to be X (the same goes for offsets.topic.replication.factor).
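A minimal sketch of the fix for the two-broker setup above (topic name and partition count are illustrative; use the .bat scripts under bin\windows on Windows):
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 2 --partitions 3 --topic test-failover
and, in each broker's server.properties, before the cluster is first started (the internal offsets topic is created once and keeps its original replication factor):
offsets.topic.replication.factor=2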

Related

Replication factor: 3 larger than available brokers: 1 when starting the kafka

I have a Kafka installation on my Mac from last year, which has many topics within the system. Now I have upgraded ZooKeeper and Kafka to the latest versions.
Running ZooKeeper is successful:
zookeeper-server-start /usr/local/etc/kafka/zookeeper.properties
Then a broker:
kafka-server-start /usr/local/etc/kafka/server.properties
However, it comes up with the error:
INFO [Admin Manager on Broker 0]: Error processing create topic request CreatableTopic(name='_confluent-license', numPartitions=1, replicationFactor=3, assignments=[], configs=[CreateableTopicConfig(name='cleanup.policy', value='compact'), CreateableTopicConfig(name='min.insync.replicas', value='2')]) (kafka.server.AdminManager)
org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1.
How would I solve it?
A Confluent enterprise license is stored in the _confluent-command topic. This topic is created by default and contains the license that corresponds to the license key supplied through the confluent.license property. So when you start the Kafka server, it tries to create the topic with a replication factor of 3, but there is only 1 broker available, so it fails.
Set the confluent.topic.replication.factor property to 1 in the /usr/local/etc/kafka/server.properties file.
@Pardeep's answer worked for me, but in my case there were more replication factors to set (I'm using Confluent 6.2.1):
confluent.balancer.topic.replication.factor=1
confluent.durability.topic.replication.factor=1
confluent.license.topic.replication.factor=1
confluent.tier.metadata.replication.factor=1
transaction.state.log.replication.factor=1
offsets.topic.replication.factor=1
You can use findstr (on Windows) or grep (on a Unix-based OS) to extract them all from the console output:
kafka-server-start /usr/local/etc/kafka/server.properties | findstr "replication.factor"
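The grep form on a Unix-based OS would be analogous (same idea, different filter command):
kafka-server-start /usr/local/etc/kafka/server.properties | grep "replication.factor"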

Kafka Unable to create a basic 3 node cluster on 3 separate VM's

I'm using Kafka version 2.11-2.3.1 and ZooKeeper 3.4.10-3, and I am able to get each of 3 Ubuntu 18.04 VMs running everything fine separately with basic configs, allowing me to create a topic, produce data, and consume data with the Kafka-provided kafka-topics.sh, kafka-console-producer.sh, and kafka-console-consumer.sh.
Can anyone please help me set up the configs of ZooKeeper and Kafka so that I can connect the 3 into a cluster? I thought all I had to do was set the configs as shown below, then use the kafka-topics.sh shell script to create a topic with replication factor 3, and then use kafka-console-producer.sh and kafka-console-consumer.sh, and it would work. But I am getting lots of different errors. Does anyone have a guide/tutorial I can follow?
In zoo.cfg the only settings I have changed from the defaults are:
For server 1 ...
server.1=0.0.0.0:2888:3888
server.2=myserver2.com.au:2888:3888
server.3=myserver3.com.au:2888:3888
... and similarly for servers 2 and 3
In Kafka's server.properties, the only settings I have changed from the defaults are:
For server 1 ...
broker.id=1
zookeeper.connect=0.0.0.0:2181,myserver2.com.au:2181,myserver3.com.au:2181
... and similarly for servers 2 and 3 (broker.id is set to 2 and 3 respectively and the zookeeper.connect servers are set appropriately).
Then on myserver1.com.au I ran:
/usr/local/kafka/bin/kafka-topics.sh --create --zookeeper \
  "localhost:2181,myserver2.com.au:2181,myserver3.com.au:2181" \
  --replication-factor 3 --partitions 1 --topic test
and then tried ...
/usr/local/kafka/bin/kafka-console-producer.sh --broker-list \
  "localhost:9092,myserver2.com.au:9092,myserver3.com.au:9092" \
  --topic test
But it fails with numerous errors. One is ...
org.apache.kafka.common.KafkaException: Should not set log start
offset on partition test's local replica 1 without attempting
to delete records of the log
another is...
ERROR [KafkaApi-0] Error when handling request: clientId=0, correlationId=0,
api=UPDATE_METADATA, body={controller_id=0,controller_epoch=1,
broker_epoch=155,topic_states=[{t...
java.lang.IllegalStateException: Epoch 155 larger than current broker epoch 24
another is ...
ERROR Error when sending message to topic test4 with key: null, value: 1 bytes with
error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Topic test not present in
metadata after 60000 ms.
another is ...
ERROR Error when sending message to topic test with key: null, value: 1 bytes with
error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server
is not the leader for that topic-partition.
I can't seem to find anything useful on the internet about these errors and I've tried quite a few combinations of things but am really struggling, so any help would be appreciated!

filebeat-kafka:WARN producer/broker/0 maximum request accumulated, waiting for space

When Filebeat outputs data to Kafka, there are many warning messages in the Filebeat log:
..
*WARN producer/broker/0 maximum request accumulated, waiting for space
*WARN producer/broker/0 maximum request accumulated, waiting for space
..
There is nothing special in my Filebeat config:
..
output.kafka:
  hosts: ["localhost:9092"]
  topic: "log-oneday"
..
I have also updated these socket settings in Kafka:
...
socket.send.buffer.bytes=10240000
socket.receive.buffer.bytes=10240000
socket.request.max.bytes=1048576000
queued.max.requests=1000
...
but it did not work.
Is there something I am missing? Or do I have to increase those numbers further?
Besides, no errors or exceptions are found in the Kafka server log.
Does any expert have an idea about this?
Thanks
Apparently you have only one partition in your topic. Try increasing the number of partitions for the topic. See the links below for more information.
More Partitions Lead to Higher Throughput
https://www.confluent.io/blog/how-to-choose-the-number-of-topicspartitions-in-a-kafka-cluster/
https://kafka.apache.org/documentation/#basic_ops_modify_topic
Try the following command (replacing the host, topic name, and partition count with your particular values):
bin/kafka-topics.sh --zookeeper zk_host:port/chroot --alter --topic my_topic_name --partitions 40
You need to configure 3 things:
Brokers
Filebeat Kafka output
Consumer
Here is an example (change the paths according to your environment).
Broker configuration:
# open kafka server configuration file
vim /opt/kafka/config/server.properties
# add this line
# The largest record batch size allowed by Kafka.
message.max.bytes=100000000
# restart kafka service
systemctl restart kafka.service
Filebeat Kafka output:
output.kafka:
  ...
  max_message_bytes: 100000000
Consumer configuration:
# larger than message.max.bytes on the broker
max.partition.fetch.bytes=200000000
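For instance, when testing with the console consumer, the property can be passed directly on the command line (a sketch; topic and server are illustrative):
kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic log-oneday --consumer-property max.partition.fetch.bytes=200000000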

Kafka rolling restart: Data is lost

As part of our current Kafka cluster, high-availability (HA) testing is being done. The objective is: while a producer job is pushing data to a particular partition of a topic, all the brokers in the Kafka cluster are restarted sequentially (stop the first broker, restart it, and after the first broker comes up, do the same for the second broker, and so on). The producer job pushes around 7 million records over about 30 minutes while this test is going on. At the end of the job, it was noticed that around 1000 records were missing.
Below are the specifics of our Kafka cluster (kafka_2.10-0.8.2.0):
- 3 Kafka brokers, each with 2 × 100 GB mounts
The topic was created with:
- Replication factor of 3
- min.insync.replicas=2
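(For reference, a topic like that could be created with something along these lines — a sketch, not the asker's actual command; topic name and partition count are illustrative:
kafka-topics.sh --create --zookeeper ZK1:2181 --replication-factor 3 --partitions 8 --topic events --config min.insync.replicas=2)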
server.properties:
broker.id=1
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.dirs=/drive1,/drive2
num.partitions=1
num.recovery.threads.per.data.dir=1
log.flush.interval.messages=10000
log.retention.hours=1
log.segment.bytes=1073741824
log.retention.check.interval.ms=1800000
log.cleaner.enable=false
zookeeper.connect=ZK1:2181,ZK2:2181,ZK3:2181
zookeeper.connection.timeout.ms=10000
advertised.host.name=XXXX
auto.leader.rebalance.enable=true
auto.create.topics.enable=false
queued.max.requests=500
delete.topic.enable=true
controlled.shutdown.enable=true
unclean.leader.election=false
num.replica.fetchers=4
controller.message.queue.size=10
Producer.properties (async producer with the new producer API):
bootstrap.servers=broker1:9092,broker2:9092,broker3:9092
acks=all
buffer.memory=33554432
compression.type=snappy
batch.size=32768
linger.ms=5
max.request.size=1048576
block.on.buffer.full=true
reconnect.backoff.ms=10
retry.backoff.ms=100
key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
Can someone share any info about Kafka clusters and HA to ensure that data will not be lost while rolling-restarting Kafka brokers?
Also, here is my producer code. This is a fire-and-forget kind of producer; we are not handling failures explicitly as of now. It works fine for millions of records; I am seeing the problem only when Kafka brokers are restarted as explained above.
import org.apache.kafka.clients.producer.ProducerRecord;

// Fire-and-forget: send() is asynchronous and no callback is registered,
// so failed sends go unnoticed.
public void sendMessage(List<byte[]> messages, String destination, Integer partition, String kafkaDBKey) {
    for (byte[] message : messages) {
        producer.send(new ProducerRecord<byte[], byte[]>(destination, partition, kafkaDBKey.getBytes(), message));
    }
}
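For what it's worth, a minimal sketch of how send failures could at least be surfaced, by passing a Callback to send() (this reuses the names from the snippet above; it needs the org.apache.kafka.clients.producer.Callback and RecordMetadata imports, and the logging is illustrative, not the asker's code):
producer.send(new ProducerRecord<byte[], byte[]>(destination, partition, kafkaDBKey.getBytes(), message),
    new Callback() {
        @Override
        public void onCompletion(RecordMetadata metadata, Exception exception) {
            if (exception != null) {
                // Surface the failed record instead of dropping it silently.
                System.err.println("Send failed for key " + kafkaDBKey + ": " + exception);
            }
        }
    });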
By increasing the default retries value from 0 to 4000 on the producer side, we are able to send data successfully without losing any.
retries=4000
Due to this setting, there is a possibility of sending the same message twice, and messages can arrive out of sequence by the time the consumer receives them (the second message might reach the consumer before the first). But for our current problem that is not an issue, and ordering is handled on the consumer side to ensure everything is in order.
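A minimal producer.properties sketch combining the answer's setting with the question's acks configuration (the in-flight cap is my addition, not part of the original answer: capping in-flight requests per connection at 1 is the standard way to avoid the reordering-on-retry described above, at some cost to throughput, and may not be available in very old 0.8.2-era clients):
acks=all
retries=4000
max.in.flight.requests.per.connection=1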

Kafka QuickStart, advertised.host.name gives kafka.common.LeaderNotAvailableException

I am able to get a simple one-node Kafka (kafka_2.11-0.8.2.1) working locally on one Linux machine, but when I try to run a producer remotely I'm getting some confusing errors.
I'm following the quickstart guide at http://kafka.apache.org/documentation.html#quickstart. I stopped the Kafka processes and deleted all the ZooKeeper & Kafka files in /tmp. I am on a local 10.0.0.0/24 network NAT-ed with an external IP address, so I modified server.properties to tell ZooKeeper how to broadcast my external address, as per https://medium.com/@thedude_rog/running-kafka-in-a-hybrid-cloud-environment-17a8f3cfc284:
advertised.host.name=MY.EXTERNAL.IP
Then I'm running this:
$ bin/zookeeper-server-start.sh config/zookeeper.properties
--> ...
$ export KAFKA_HEAP_OPTS="-Xmx256M -Xms128M" # small test server!
$ bin/kafka-server-start.sh config/server.properties
--> ...
I opened up the firewall for my producer on the remote machine, and created a new topic and verified it:
$ bin/kafka-topics.sh --create --zookeeper MY.EXTERNAL.IP:2181 --replication-factor 1 --partitions 1 --topic test123
--> Created topic "test123".
$ bin/kafka-topics.sh --list --zookeeper MY.EXTERNAL.IP:2181
--> test123
However, the producer I'm running remotely gives me errors:
$ bin/kafka-console-producer.sh --broker-list MY.EXTERNAL.IP:9092 --topic test123
--> [2015-06-16 14:41:19,757] WARN Property topic is not valid (kafka.utils.VerifiableProperties)
My Test Message
--> [2015-06-16 14:42:43,347] WARN Error while fetching metadata [{TopicMetadata for topic test123 ->
No partition metadata for topic test123 due to kafka.common.LeaderNotAvailableException}] for topic [test123]: class kafka.common.LeaderNotAvailableException (kafka.producer.BrokerPartitionInfo)
--> (repeated several times)
(I disabled the whole firewall to make sure that wasn't the problem.)
The stdout errors in the Kafka startup are repeated: [2015-06-16 20:42:42,768] INFO Closing socket connection to /MY.EXTERNAL.IP. (kafka.network.Processor)
And the controller.log gives me this, several times:
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
at kafka.controller.RequestSendThread.liftedTree1$1(ControllerChannelManager.scala:132)
at kafka.controller.RequestSendThread.doWork(ControllerChannelManager.scala:131)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
[2015-06-16 20:44:08,128] INFO [Controller-0-to-broker-0-send-thread], Controller 0 connected to id:0,host:MY.EXTERNAL.IP,port:9092 for sending state change requests (kafka.controller.RequestSendThread)
[2015-06-16 20:44:08,428] WARN [Controller-0-to-broker-0-send-thread], Controller 0 epoch 1 fails to send request Name:LeaderAndIsrRequest;Version:0;Controller:0;ControllerEpoch:1;CorrelationId:7;ClientId:id_0-host_null-port_9092;Leaders:id:0,host:MY.EXTERNAL.IP,port:9092;PartitionState:(test123,0) -> (LeaderAndIsrInfo:(Leader:0,ISR:0,LeaderEpoch:0,ControllerEpoch:1),ReplicationFactor:1),AllReplicas:0) to broker id:0,host:MY.EXTERNAL.IP,port:9092. Reconnecting to broker. (kafka.controller.RequestSendThread)
Running this seems to indicate that there is a leader at 0:
$ ./bin/kafka-topics.sh --zookeeper MY.EXTERNAL.IP:2181 --describe --topic test123
--> Topic:test123 PartitionCount:1 ReplicationFactor:1 Configs:
Topic: test123 Partition: 0 Leader: 0 Replicas: 0 Isr: 0
I reran this test and my server.log indicates that there is a leader at 0:
...
[2015-06-16 21:58:04,498] INFO 0 successfully elected as leader (kafka.server.ZookeeperLeaderElector)
[2015-06-16 21:58:04,642] INFO Registered broker 0 at path /brokers/ids/0 with address MY.EXTERNAL.IP:9092. (kafka.utils.ZkUtils$)
[2015-06-16 21:58:04,670] INFO [Kafka Server 0], started (kafka.server.KafkaServer)
[2015-06-16 21:58:04,736] INFO New leader is 0 (kafka.server.ZookeeperLeaderElector$LeaderChangeListener)
I see this error in the logs when I send a message from the producer:
[2015-06-16 22:18:24,584] ERROR [KafkaApi-0] error when handling request Name: TopicMetadataRequest; Version: 0; CorrelationId: 7; ClientId: console-producer; Topics: test123 (kafka.server.KafkaApis)
kafka.admin.AdminOperationException: replication factor: 1 larger than available brokers: 0
at kafka.admin.AdminUtils$.assignReplicasToBrokers(AdminUtils.scala:70)
I assume this means that the broker can't be found for some reason? I'm confused about what this means...
For recent versions of Kafka (0.10.0 as of this writing), you don't want to use advertised.host.name at all. In fact, even the documentation states that advertised.host.name is already deprecated. Moreover, Kafka will use this not only as the "advertised" host name for the producers/consumers, but for other brokers as well (in a multi-broker environment)... which is kind of a pain if you're using a different (perhaps internal) DNS for the brokers... and you really don't want to get into the business of adding entries to the individual /etc/hosts files of the brokers (ew!)
So, basically, you would want the brokers to use the internal name, but use the external FQDNs for the producers and consumers only. To do this, you will update advertised.listeners instead.
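A minimal server.properties sketch of that split (listener names, hosts, and ports are illustrative assumptions, and the named-listener syntax requires a broker new enough to support it, roughly 0.10.2+ — check against your version):
listeners=INTERNAL://0.0.0.0:9092,EXTERNAL://0.0.0.0:9093
advertised.listeners=INTERNAL://broker1.internal:9092,EXTERNAL://broker1.example.com:9093
listener.security.protocol.map=INTERNAL:PLAINTEXT,EXTERNAL:PLAINTEXT
inter.broker.listener.name=INTERNAL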
Set advertised.host.name to a host name, not an IP address. The default is to return a FQDN using getCanonicalHostName(), but this is only best-effort and falls back to an IP. See the Java docs for getCanonicalHostName().
The trick is to get that host name to always resolve to the correct IP. For small environments I usually setup all of the hosts with all of their internal IPs in /etc/hosts. This way all machines know how to talk to each other over the internal network, by name. In fact, configure your Kafka clients by name now too, not by IP. If managing all the /etc/hosts files is a burden then setup an internal DNS server to centralize it, but internal DNS should return internal IPs. Either of these options should be less work than having IP addresses scattered throughout various configuration files on various machines.
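For example, a sketch of such an /etc/hosts entry set on every machine (host names and internal IPs are illustrative):
10.0.0.11  kafka1
10.0.0.12  kafka2
10.0.0.13  kafka3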
Once everything is communicating by name all that's left is to configure external DNS with the external IPs and everything just works. This includes configuring Kafka clients with the server names, not IPs.
So to summarize, the solution to this was to add a route via NAT so that the machine can access its own external IP address.
ZooKeeper uses the address it finds in advertised.host.name both to tell clients where to find the broker and to communicate with the broker itself. The error that gets reported doesn't make this very clear, and it's confusing because a client has no problem opening a TCP connection.
Taking a cue from the above: for my single node (while still learning) I modified the server.properties file, setting "advertised.host.name" to the value 127.0.0.1. So finally it looks like this:
advertised.host.name=127.0.0.1
While starting the producer it still shows a warning, but now it is at least working, and I can see messages coming through perfectly on the consumer terminal.
On the machine where Kafka is installed, check that it is up and running. The error says 0 brokers are available, which means Kafka is not up and running.
On a Linux machine you can use the netstat command to check whether the service is listening:
netstat -an | grep port_kafka_is_listening   (default is 9092)
conf/server.properties:
host.name
DEPRECATED: only used when listeners is not set. Use listeners instead. Hostname of the broker. If this is set, it will only bind to this address. If it is not set, it will bind to all interfaces.