Kafka: Partitions reassignment not happening - apache-kafka

I am getting an error while migrating data between Kafka brokers.
I am using the kafka-reassign-partitions tool to reassign partitions to a different broker without any throttling (because throttling didn't work with the command below). There were around 400 partitions across 50 topics.
Apache Kafka 1.1.0
Confluent Docker Image tag : 4.1.0
Command:
kafka-reassign-partitions --zookeeper IP:2181 --reassignment-json-file proposed.json --execute --throttle 100000000
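For reference, proposed.json follows the standard reassignment JSON format; the topic name and broker id below are only placeholders, not the actual plan:
{"version":1,
 "partitions":[
   {"topic":"TOPIC","partition":4,"replicas":[4]},
   {"topic":"TOPIC","partition":2,"replicas":[4]}
 ]}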
After some time, I am able to see the below error continuously on the target broker.
[2019-09-21 11:24:07,625] INFO [ReplicaFetcher replicaId=4, leaderId=0, fetcherId=0] Error sending fetch request (sessionId=514675011, epoch=INITIAL) to node 0: java.io.IOException: Connection to 0 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler)
[2019-09-21 11:24:07,626] WARN [ReplicaFetcher replicaId=4, leaderId=0, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=4, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={TOPIC-4=(offset=4624271, logStartOffset=4624271, maxBytes=1048576), TOPIC-2=(offset=1704819, logStartOffset=1704819, maxBytes=1048576), TOPIC-8=(offset=990485, logStartOffset=990485, maxBytes=1048576), TOPIC-1=(offset=1696764, logStartOffset=1696764, maxBytes=1048576), TOPIC-7=(offset=991507, logStartOffset=991507, maxBytes=1048576), TOPIC-5=(offset=988660, logStartOffset=988660, maxBytes=1048576)}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=514675011, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 0 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:220)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:43)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:146)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Zookeeper status:
ls /admin/reassign_partitions
[]
I am using t2.medium EC2 instances with 120 GB gp2 EBS volumes.
I am able to connect to the zookeeper from all brokers.
[zk: localhost:2181(CONNECTED) 3] ls /brokers/ids
[0, 1, 2, 3]
I am using IP address for all brokers, so DNS mismatch is also not the case.
Also, I am not able to see any topic scheduled for reassignment in zookeeper.
[zk: localhost:2181(CONNECTED) 2] ls /admin/reassign_partitions
[]
Interestingly, I can see data piling up for partitions that are not listed above, but the partitions listed in the error are not being migrated as of now.
I am using confluent kafka docker image.
Kafka Broker Setting:
https://gist.github.com/ethicalmohit/cd44f580356ca02250760a307d90b54d

If you can give us some more details on your topology, maybe we can understand the problem better.
Some thoughts:
- Can you connect via zookeeper-cli at kafka-0:2181? Does kafka-0 resolve to the correct host?
- If a reassignment is in progress, you either have to stop it manually by deleting the appropriate key in ZooKeeper (warning: this may leave some topics or partitions broken), or you have to wait for the job to finish. Can you monitor the ongoing reassignment and share some information about it (see the verify command sketched below)?
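A minimal way to check progress, assuming the same proposed.json that was passed to --execute:
kafka-reassign-partitions --zookeeper IP:2181 --reassignment-json-file proposed.json --verify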

This was solved by increasing the value of replica.socket.receive.buffer.bytes on all destination brokers.
After changing this parameter and restarting the brokers, I was able to see data in the above-mentioned partitions.
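A sketch of how this could be set: with the Confluent Docker images, broker properties are usually passed as KAFKA_-prefixed environment variables, and the buffer size below is only an example (the default is much smaller; tune it for your network):
KAFKA_REPLICA_SOCKET_RECEIVE_BUFFER_BYTES=1048576
or, directly in server.properties:
replica.socket.receive.buffer.bytes=1048576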

Related

Kafka unexpectedly shutting down. License topic could not be created

On Kafka startup multiple messages are logged to kafka/logs/kafkaServer.out and contain:
INFO [Admin Manager on Broker 0]: Error processing create topic
request CreatableTopic(name='_confluent-license', numPartitions=1,
replicationFactor=3, assignments=[],
configs=[CreateableTopicConfig(name='cleanup.policy',
value='compact'), CreateableTopicConfig(name='min.insync.replicas',
value='2')]) (kafka.server.AdminManager)
After approximately 15 minutes Kafka shuts down, and the following is written to kafka/logs/kafkaServer.out:
org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1.
[2020-12-08 04:04:15,951] ERROR [KafkaServer id=0] Fatal error during KafkaServer startup. Prepare to shutdown
(kafka.server.KafkaServer)
org.apache.kafka.common.errors.TimeoutException: License topic could not be created
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException:
Replication factor: 3 larger than available brokers: 1.
[2020-12-08 04:04:15,952] INFO [KafkaServer id=0] shutting down (kafka.server.KafkaServer)
It appears Kafka shuts down because the replication factor is set to 3 for the topic _confluent-license. I'm not creating the topic _confluent-license; is it created as part of Kafka startup for the licensing check?
In an attempt to fix this, I've modified /v5.5.0/etc/kafka/server.properties so that the replication factor is 1 for internal topics:
############################# Internal Topic Settings #############################
# The replication factor for the group metadata internal topics "__consumer_offsets" and "__transaction_state"
# For anything other than development testing, a value greater than 1 is recommended to ensure availability, such as 3.
offsets.topic.replication.factor=1
transaction.state.log.replication.factor=1
instead of 3 :
#offsets.topic.replication.factor=3
#transaction.state.log.replication.factor=3
But this does not fix the issue and the same logs are generated. The replication factor of __consumer_offsets is still 3. How can I reduce the replication factor of the topic _confluent-license from 3 to 1? Or could there be another issue that is causing Kafka to shut down?
You should change the property confluent.license.topic.replication.factor; by default it is 3.
(kafka.server.KafkaServer)
org.apache.kafka.common.errors.TimeoutException: License topic could not be created
Caused by: org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 3 larger than available brokers: 1.
[2020-12-08 04:04:15,952] INFO [KafkaServer id=0] shutting down (kafka.server.KafkaServer)
The above error is due to the license topic having a default replication factor of 3. It can be set to 1 with confluent.license.topic.replication.factor if you have only one broker. The documentation for it is here.
[2020-12-08 07:46:02,241] ERROR Error checking or creating metrics topic (io.confluent.metrics.reporter.ConfluentMetricsReporter)
org.apache.kafka.common.errors.InvalidReplicationFactorException: Replication factor: 2 larger than available brokers: 1.
The above error is due to the Confluent Metrics Reporter being enabled. The replication factor for the metrics topic defaults to 3 and can be set to 1 with confluent.metrics.reporter.topic.replicas if you have just one broker. The documentation for it is here.
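For a single-broker setup the relevant server.properties entries would look roughly like this (a sketch based only on the two properties named above):
confluent.license.topic.replication.factor=1
confluent.metrics.reporter.topic.replicas=1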
The property confluent.license.topic.replication.factor is not working as expected.
Instead, we can try one of these properties: confluent.license.replication.factor=1 or confluent.license.admin.replication.factor=1 if the number of brokers is 1. These properties are not in the documentation.

Kafka throws java.nio.channels.ClosedChannelException

When I try to consume messages from the Kafka server, which is hosted in EC2, with the Kafka console tool (v0.9.0.1; I think this uses the old consumer APIs),
I get the following exception.
How can I overcome this?
#./kafka-console-consumer.sh --zookeeper zookeeper1.xx.com:2181 --topic MY_TOPIC --from-beginning
[2016-04-06 14:34:58,219] WARN Fetching topic metadata with correlation id 0 for topics [Set(MY_TOPIC)] from broker [BrokerEndPoint(1014,kafka3.xx.com,9092)] failed (kafka.client.ClientUtils$)
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:110)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:75)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:74)
at kafka.producer.SyncProducer.send(SyncProducer.scala:119)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:59)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:94)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
[2016-04-06 14:34:58,222] WARN Fetching topic metadata with correlation id 0 for topics [Set(MY_TOPIC)] from broker [BrokerEndPoint(1013,kafka22.xx.com,9092)] failed (kafka.client.ClientUtils$)
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:110)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:75)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:74)
at kafka.producer.SyncProducer.send(SyncProducer.scala:119)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:59)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:94)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:63)
The reason for the original ClosedChannelException was a DNS issue on my side, which I solved by editing my local hosts file.
I was able to solve this issue by setting advertised.host.name in the config file.
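For example, in server.properties (the hostname below is only a placeholder for the address clients and other brokers can actually reach):
advertised.host.name=kafka1.xx.com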
This is actually a WARNING - no big deal. Maybe your topic is corrupted? Try recreating the topic.

When does kafka change leader?

My services that work with Kafka had been running for a year without any spontaneous leader changes.
But for the last 2 weeks this has started happening quite often.
The Kafka log for that:
[2015-09-27 15:35:14,826] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions [myTopic] (kafka.server.ReplicaFetcherManager)
[2015-09-27 15:35:14,830] INFO Truncating log myTopic-0 to offset 11520979. (kafka.log.Log)
[2015-09-27 15:35:14,845] WARN [Replica Manager on Broker 2]: Fetch request with correlation id 713276 from client ReplicaFetcherThread-0-2 on partition [myTopic,0] failed due to Leader not local for partition [myTopic,0] on broker 2 (kafka.server.ReplicaManager)
[2015-09-27 15:35:14,857] WARN [Replica Manager on Broker 2]: Fetch request with correlation id 256685 from client mirrormaker-1 on partition [myTopic,0] failed due to Leader not local for partition [myTopic,0] on broker 2 (kafka.server.ReplicaManager)
[2015-09-27 15:35:20,171] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions [myTopic,0] (kafka.server.ReplicaFetcherManager)
What can cause a leader switch? If there is information about this somewhere in the Kafka documentation, please just point me to the link; I've failed to find it.
System configuration
kafka version: kafka_2.10-0.8.2.1
os: Red Hat Enterprise Linux Server release 6.5 (Santiago)
server.properties (differs from default):
broker.id=001
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.flush.interval.messages=10000
log.flush.interval.ms=1000
log.retention.bytes=-1
controlled.shutdown.enable=true
auto.create.topics.enable=false
It appears the leader broker is down for that partition. It might be that the data directory (log.dirs) configured in server.properties is out of space and the broker cannot accommodate more data.
Also, what is the replication factor of the topic and the size of the broker cluster?
I am assuming you have one topic and one partition with a replication factor of 2, which is not a good configuration for optimal Kafka performance and consumers.
Your logs are not clear enough to explain the leader switch. The main issue may be that your topic has only one leader because it has only one partition, and that single log file is getting bigger day by day. Kafka internally does rebalancing at some level (details not confirmed), and that could be the reason for your leader switch, but I am not sure.
Also, your second log line says some of the log was truncated. Can you please go through the logs in detail and check whether this happens only after truncation?
As you mentioned, you have already checked your Kafka log directory files and their sizes. Please run describe when the issue occurs; a leader switch will be reflected there as well. Or, if you can set up a dashboard that displays the leader over time, it will be easier to find the root cause.
bin/kafka-topics.sh --describe --zookeeper Zookeeperhost:Port --topic TopicName
Suggestion: I suggest you create a new topic with more partitions (read the Kafka documentation to get a good idea of the optimum number of partitions) and start writing to it. Or you can check how to add partitions to the current topic, as sketched below.
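For instance, partitions can be added to an existing topic with the topics tool (this only adds partitions and does not redistribute existing data; the count of 3 is just an example):
bin/kafka-topics.sh --zookeeper Zookeeperhost:Port --alter --topic TopicName --partitions 3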
Last thing: is the leader switch causing issues for your clients, or are you only worried about the warnings?

Why isn't kafka continuing to work on fail of one of the brokers?

I am under the impression that with two brokers and sync turned on, my Kafka setup should keep working even if one of the brokers fails.
To test it I made a new topic named topicname. Its description is as follows:
Topic:topicname PartitionCount:1 ReplicationFactor:1 Configs:
Topic: topicname Partition: 0 Leader: 0 Replicas: 0 Isr: 0
Then I ran producer.sh and consumer.sh in the following way:
bin/kafka-console-producer.sh --broker-list localhost:9092,localhost:9095 sync --topic topicname
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic topicname --from-beginning
While both brokers were running I saw messages being received properly by the consumer, but when I killed one of the broker instances with the kill command, the consumer stopped showing me any new messages. Instead, it showed the following error message:
WARN [ConsumerFetcherThread-console-consumer-57116_ip-<internalipvalue>-1438604886831-603de65b-0-0], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 865; ClientId: console-consumer-57116; ReplicaId: -1; MaxWait: 100 ms; MinBytes: 1 bytes; RequestInfo: [topicname,0] -> PartitionFetchInfo(9,1048576). Possible cause: java.nio.channels.ClosedChannelException (kafka.consumer.ConsumerFetcherThread)
[2015-08-03 12:29:36,341] WARN Fetching topic metadata with correlation id 1 for topics [Set(topicname)] from broker [id:0,host:<hostname>,port:9092] failed (kafka.client.ClientUtils$)
java.nio.channels.ClosedChannelException
at kafka.network.BlockingChannel.send(BlockingChannel.scala:100)
at kafka.producer.SyncProducer.liftedTree1$1(SyncProducer.scala:73)
at kafka.producer.SyncProducer.kafka$producer$SyncProducer$$doSend(SyncProducer.scala:72)
at kafka.producer.SyncProducer.send(SyncProducer.scala:113)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:58)
at kafka.client.ClientUtils$.fetchTopicMetadata(ClientUtils.scala:93)
at kafka.consumer.ConsumerFetcherManager$LeaderFinderThread.doWork(ConsumerFetcherManager.scala:66)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:60)
I had a similar problem; setting the producer config topic.metadata.refresh.interval.ms to -1 (or whatever value is suitable for you) solved the issue for me.
In my case I had 3 brokers (a multi-broker setup on my local machine) and created the topic with 3 partitions and a replication factor of 2.
Test setup:
Before setting the producer config:
With 3 brokers running, I killed one of the brokers after the producer started. The local Zookeeper updated the ISR and topic metadata (removed the downed broker as leader), but the producer did not pick this up (maybe due to the default 10-minute refresh time), so message sends ended up failing with send exceptions.
After setting the producer config (-1 in my case):
With 3 brokers running, I killed one of the brokers after the producer started. The local Zookeeper updated the ISR info (removed the downed broker as leader), the producer refreshed the new ISR/topic metadata, and message sends did not fail.
-1 makes it refresh the topic metadata on every failed attempt, so you may want to reduce the refresh time to something reasonable instead.
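With the old Scala producer this is an ordinary producer property, e.g. a line in producer.properties (a sketch; -1 as described above, or pick a modest interval that fits your failover expectations):
topic.metadata.refresh.interval.ms=-1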
I think there are two things that can make your consumer stop working after a broker goes down in a Kafka HA cluster:
--replication-factor should be bigger than 1 for your topic, so every topic partition has at least one backup.
The replication factor for Kafka's internal topics should also be bigger than 1:
offsets.topic.replication.factor = 3
transaction.state.log.replication.factor = 3
transaction.state.log.min.isr = 2
These two modifications keep my producer and consumer working after a broker shutdown (5 brokers, and every broker went down once).
You can see in the topic description that you posted that your topic has only a single replica.
With a single replica there is no fault tolerance, and if broker 0 (the broker that holds the replica) goes away, the topic will be unavailable.
Create a topic with more replicas (with --replication-factor 3) to have fault tolerance in case of crashes.
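A minimal sketch using the 0.8-era tooling shown in the question (topic name and partition count are placeholders):
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 3 --partitions 1 --topic topicname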
I ran into the same problem even when using a topic with a replication factor of 2.
Setting the following property on the producer worked for me:
metadata.max.age.ms (Kafka 0.8.2.1)
Otherwise, my producer was waiting for 1 minute by default to fetch the new leader and start contacting it.
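In the producer configuration this is just one more entry; the value below (in milliseconds) is only an illustration:
metadata.max.age.ms=5000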
For a topic with replication factor N, Kafka tolerates up to N-1 server failures. E.g. a replication factor of 3 will allow you to handle up to 2 server failures.

Kafka 0.8 All Good & rocks! .... Kafka 0.7 not able to make it happen

Kafka 0.8 works great. I am able to use the CLI as well as write my own producers/consumers!
Checking Zookeeper... and I see all the topics and partitions created successfully for 0.8.
Kafka 0.7 does not work!
Why Kafka 0.7? I am using the Kafka spout from Storm, which is made for Kafka 0.7.
First I just want to run a CLI-based producer/consumer for Kafka 0.7, which I am unable to do. I carry out the following steps:
I delete all the topics/partitions etc. in Zookeeper that were created by my Kafka 0.8 setup.
I change dataDir in zoo.cfg to point to a different location.
Now I start the Kafka 0.7 server. It starts successfully. However, I don't know why it re-registers the broker topics I deleted.
Now I start the Kafka producer:
bin/kafka-console-producer.sh --zookeeper localhost:2181 --topic topicime
& it starts successfully:
[2013-06-28 14:06:05,521] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
[2013-06-28 14:06:05,606] INFO Creating async producer for broker id = 0 at 0:0 (kafka.producer.ProducerPool)
Time to send some messages & oops I get this error:
[2013-06-28 14:07:19,650] INFO Disconnecting from 0:0 (kafka.producer.SyncProducer)
[2013-06-28 14:07:19,653] ERROR Connection attempt to 0:0 failed, next attempt in 1 ms (kafka.producer.SyncProducer)
java.net.ConnectException: Connection refused
at sun.nio.ch.Net.connect0(Native Method)
at sun.nio.ch.Net.connect(Net.java:364)
at sun.nio.ch.Net.connect(Net.java:356)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:623)
at kafka.producer.SyncProducer.connect(SyncProducer.scala:173)
at kafka.producer.SyncProducer.getOrMakeConnection(SyncProducer.scala:196)
at kafka.producer.SyncProducer.send(SyncProducer.scala:92)
at kafka.producer.SyncProducer.multiSend(SyncProducer.scala:135)
at kafka.producer.async.DefaultEventHandler.send(DefaultEventHandler.scala:58)
at kafka.producer.async.DefaultEventHandler.handle(DefaultEventHandler.scala:44)
at kafka.producer.async.ProducerSendThread.tryToHandle(ProducerSendThread.scala:116)
at scala.collection.immutable.Stream.foreach(Stream.scala:254)
at kafka.producer.async.ProducerSendThread.processEvents(ProducerSendThread.scala:70)
at kafka.producer.async.ProducerSendThread.run(ProducerSendThread.scala:41)
Note that Zookeeper is already running.
Any help would really be appreciated.
EDIT:
I don't even see the topic being created in zookeeper. I am running the following command:
bin/kafka-console-producer.sh --zookeeper localhost:2181 --topic topicime
After the command everything is fine & I get the following message:
[2013-06-28 14:30:17,614] INFO Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x13f805c6673004b, negotiated timeout = 6000 (org.apache.zookeeper.ClientCnxn)
[2013-06-28 14:30:17,615] INFO zookeeper state changed (SyncConnected) (org.I0Itec.zkclient.ZkClient)
[2013-06-28 14:30:17,700] INFO Creating async producer for broker id = 0 at 0:0 (kafka.producer.ProducerPool)
However, now when I type a string to send, I get the above error (Connection refused!).
INFO Disconnecting from 0:0 (kafka.producer.SyncProducer)
The above line has the error hidden in it: 0:0 is not a valid host and port. The solution is to explicitly set the host IP to be registered in Zookeeper by setting the "hostname" property in server.properties.
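A rough sketch of the relevant server.properties entry for Kafka 0.7 (the address is a placeholder for the broker's real IP or a hostname that resolves everywhere):
hostname=10.0.0.12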
Consider checking out the storm-kafka fork, available at https://github.com/wurstmeister/storm-kafka-0.8-plus
I'm installing it right now for our servers =).