kafka broker restart in clean state - apache-kafka

We are using Apache Kafka 1.0. We stopped one of the Kafka brokers in our cluster for an activity, and its disk was fully wiped. We added it back with the same broker id, so it started syncing with the other brokers. We observed the following once it started syncing:
In the application, producing to Kafka mostly worked fine. However, we saw a massive spike in consumer offset commit failures across consumers, with error messages like the one below:
[Consumer clientId=abcasdfsadf, groupId=service_name_group_id] Offset commit failed on partition topic-name-1 at offset 2770664: The request timed out.
During this period, the brokers that were already running also logged the following very frequently (25-30k times in an hour):
[2023-02-01 20:03:36,739] WARN Attempting to send response via channel for which there is no open connection, connection id broker-source-ip:broker-source-port-remote-ip:remote-port-1203082 (kafka.network.Processor)
Another observation was that the Kafka cluster seemed to expand and shrink the ISR quite frequently for some topics (it did eventually get in sync). Once the restarted broker was in sync, all the errors seemed to go away. Network and disk I/O were higher than usual, but the machine the broker runs on has more bandwidth available.
Has anyone encountered a similar issue before, and what could be the cause?
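To quantify how often the warning quoted above appears, and to line that up with the ISR changes and commit failures, the broker log can be bucketed per hour. The following is a minimal sketch, assuming the log lines have the timestamp and message layout shown in the question; the function name count_warnings_per_hour is made up for illustration.

```python
import re
from collections import Counter

# Matches the timestamp prefix and the warning text quoted in the question.
# Group 1 captures the date and hour, e.g. "2023-02-01 20".
WARN_RE = re.compile(
    r"^\[(\d{4}-\d{2}-\d{2} \d{2}):\d{2}:\d{2},\d{3}\] WARN "
    r"Attempting to send response via channel for which there is no open connection"
)


def count_warnings_per_hour(lines):
    """Return a Counter mapping 'YYYY-MM-DD HH' to the number of matching WARN lines."""
    counts = Counter()
    for line in lines:
        match = WARN_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Passing an open log file to the function (for example the broker's server.log, wherever it lives on your hosts) gives one count per hour, which can then be compared with the time window in which the wiped broker was catching up.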

Related

Kafka - broker partitions not in-sync after restart

We use 3-node Kafka clusters running 2.7.0 with a quite high number of topics and partitions. Almost all topics have only 1 partition and a replication factor of 3, which gives us roughly:
topics: 7325
partitions total in cluster (including replicas): 22110
Brokers are relatively small, with:
6 vCPU
16 GB memory
500 GB in /var/lib/kafka occupied by partition data
As you can imagine, because we have 3 brokers and a replication factor of 3, the data is spread very evenly across brokers. Under normal circumstances, each broker leads a very similar (practically the same) number of partitions, and the number of partitions per broker is equal.
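As a rough check of what these numbers mean per broker, the arithmetic can be written out. This is only a back-of-the-envelope sketch based on the figures quoted above; the variable names are illustrative.

```python
# Figures taken from the question.
topics = 7325
partition_replicas_total = 22110  # all replicas across the cluster
brokers = 3
replication_factor = 3

# Distinct partitions, ignoring replication.
partitions = partition_replicas_total // replication_factor

# With replication factor 3 on 3 brokers, every broker hosts one replica
# of every partition.
replicas_per_broker = partition_replicas_total // brokers

# With evenly balanced leadership, each broker leads about a third.
leaders_per_broker = partitions / brokers

# Partitions beyond one per topic, contributed by the few multi-partition topics.
extra_partitions = partitions - topics

print(partitions, replicas_per_broker, round(leaders_per_broker), extra_partitions)
```

So each broker holds about 7370 partition replicas and normally leads roughly a third of them, which is the load a restarted broker has to reload and re-register.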
Before doing a rolling restart yesterday, everything was in sync. We stopped the process and started it again after 1 minute. It took about 10 minutes to synchronize with ZooKeeper and start listening on its port.
After logging 'Kafka server started', nothing happens. There is no CPU, memory or disk activity. The partition data is visible on the data disk. There have been no messages in the log for more than 1 day since the process booted up.
We've tried restarting the ZooKeeper cluster (one node at a time). We've tried restarting the broker again. It has now been 24 hours since the last restart and there is still no change.
The broker itself reports that it leads 0 partitions. Leadership for all partitions moved to the other brokers, and they report that everything located on this broker is not in sync.
I'm aware that the number of partitions per broker far exceeds the recommendation, but I'm still confused by the lack of any activity or log messages. Any ideas what should be checked further? It looks like something is stuck somewhere. I checked the Kafka ACLs and there are no block messages related to the broker username.
I tried another restart in DEBUG mode, and it seems there is some problem with metadata. These two messages repeat constantly:
[2022-05-13 16:33:25,688] DEBUG [broker-1-to-controller-send-thread]: Controller isn't cached, looking for local metadata changes (kafka.server.BrokerToControllerRequestThread)
[2022-05-13 16:33:25,688] DEBUG [broker-1-to-controller-send-thread]: No controller defined in metadata cache, retrying after backoff (kafka.server.BrokerToControllerRequestThread)
It is also impossible to fetch topic metadata with kcat when I specify this broker as the bootstrap server.

Kafka Streams Apps Threads fail transaction and are fenced and restarted after Kafka broker restart

We are noticing Streams Apps threads fail transactions during rolling restarts of our Kafka Brokers. The transaction failure causes stream thread fencing which in turn causes a restart of the thread and re-balancing. The re-balancing causes some delay in processing. Our goal is to make broker restarts as smooth as possible and prevent processing delays as much as possible.
For our rolling broker restarts we use the controlled.shutdown.enable=true configuration, and before each restart we wait for all partitions to be in sync across all replicas.
For our Streams Apps we have properly configured group.instance.id and an appropriate session.timeout.ms so that rolling restarts of the streams apps themselves are smooth and without re-balances.
From the Kafka Streams app logs I have identified a sequence of events leading up to the fencing:
1. Broker starts shutting down.
2. App logs an error producing to a topic due to NOT_LEADER_OR_FOLLOWER.
3. App heartbeats fail because the group coordinator is the restarting broker.
4. App discovers a new group coordinator (this bounces a bit between the restarting broker and live brokers).
5. App stabilizes.
6. Broker starts up again.
7. App fails to send a fetch request to the starting broker due to FETCH_SESSION_ID_NOT_FOUND.
8. App discovers the starting broker as transaction coordinator.
9. App transaction fails due to one of two reasons:
   InvalidProducerEpochException: Producer attempted to produce with an old epoch.
   ProducerFencedException: There is a newer producer with the same transactionalId which fences the current one
10. Stream threads end up in a fatal error state, get fenced and are restarted, which causes a rebalance.
What could be causing the two exceptions that make stream thread transactions fail? My intuition is that the broker starting up is assigned as transaction coordinator before it has synced its transaction state with the in-sync brokers. This could explain why that broker knows about old epochs or different transactional ids.
How can we further identify what is going wrong here and how it can be improved?
You can set request.timeout.ms in Kafka Streams, which makes the Streams API wait for a longer period of time. Only if the Kafka broker is not up within that period will it throw an exception, which can be handled with a ProductionExceptionHandler as described in "Handling exceptions in Kafka Streams".

Kafka : Failed to update metadata after 60000 ms with only one broker down

We have a kafka producer configured as -
metadata.broker.list=broker1:9092,broker2:9092,broker3:9092,broker4:9092
serializer.class=kafka.serializer.StringEncoder
request.required.acks=1
request.timeout.ms=30000
batch.num.messages=25
message.send.max.retries=3
producer.type=async
compression.codec=snappy
The replication factor is 3 and the total number of partitions is currently 108.
Rest of the properties are default.
This producer was running absolutely fine. Then, for some reason, one of the brokers went down, and our producer started to log:
"Failed to update metadata after 60000 ms". There was nothing else in the log besides this error. At intervals, a few requests were getting blocked, even though the producer was async.
The issue was resolved when the broker was up and running again.
What can be the reason for this? As I understand it, one broker being down should not affect the system as a whole.
Posting the answer for anyone who might face this issue:
The cause is an older version of the Kafka producer. Kafka producers take the bootstrap servers as a list. In older versions, producers try to connect to all the servers in round-robin fashion when fetching metadata. So if one of the brokers is down, the requests going to that server fail and this message appears.
Solution:
Upgrade to a newer producer version.
Alternatively, reduce the metadata.fetch.timeout.ms setting. This ensures the main thread is not blocked for long and send fails quickly. The default value is 60000 ms. This is not needed in newer versions.
Note: the Kafka send method blocks until the producer is able to write to the buffer.
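The round-robin behaviour described in this answer can be illustrated with a toy model. This is not the producer client's code; it is only a sketch of the described behaviour, assuming each metadata request goes to the next server in the list. The function name and the attempt count are made up for illustration.

```python
from itertools import cycle


def simulate_metadata_requests(brokers, down, attempts):
    """Toy model: send each metadata request to the next broker in round-robin
    order and count how many requests land on a broker that is down."""
    failures = 0
    rotation = cycle(brokers)
    for _ in range(attempts):
        if next(rotation) in down:
            failures += 1
    return failures


# Broker list from the question, with one broker assumed to be down.
brokers = ["broker1:9092", "broker2:9092", "broker3:9092", "broker4:9092"]
failed = simulate_metadata_requests(brokers, down={"broker3:9092"}, attempts=100)
print(failed)
```

Under this model, one in four metadata requests hits the unavailable broker, which fits the symptom of some requests blocking intermittently while others succeed.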
I got the same error because I forgot to create the topic. Once I created the topic the issue was resolved.

Kafka broker occasionally takes much longer than usual to load logs on startup

We are observing that Kafka brokers occasionally take much more time than usual to load logs on startup. Much longer in this case means 40 minutes instead of at most 1 minute. This happens during a rolling restart following the procedure described by Confluent, and after the broker reported that controlled shutdown was successful.
Kafka Setup
Confluent Platform 5.5.0
Kafka Version 2.5.0
3 Replicas (minimum 2 in sync)
Controlled broker shutdown enabled
1TB of AWS EBS for Kafka log storage
Other potentially useful information
We make extensive use of Kafka Streams
We use exactly-once processing and transactional producers/consumers
Observations
It is not always the same broker that takes a long time.
It does not only occur when the broker is the active controller.
A log partition that loads quickly (15 ms) can take a long time (9549 ms) to load on the same broker a day later.
We experienced this issue before on Kafka 2.4.0 but after upgrading to 2.5.0 it did not occur for a few weeks.
Does anyone have an idea what could be causing this? Or what additional information would be useful to track down the issue?
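One way to gather more information is to pull the per-partition load times out of the broker startup log and compare them across restarts. The sketch below assumes startup log lines of the form "[Log partition=<topic-partition>, dir=...] Completed load of log with ... in <N> ms"; that format is an assumption about the broker's log output and should be checked against your own logs. The function name and threshold are illustrative.

```python
import re

# Assumed line shape (verify against your broker logs):
# "... [Log partition=orders-0, dir=/data] Completed load of log with ... in 9549 ms ..."
LOAD_RE = re.compile(
    r"partition=(?P<partition>[^,\]]+).*Completed load of log.* in (?P<ms>\d+) ms"
)


def slow_log_loads(lines, threshold_ms=1000):
    """Return (partition, load_ms) pairs at or above threshold_ms, slowest first."""
    slow = []
    for line in lines:
        match = LOAD_RE.search(line)
        if match and int(match.group("ms")) >= threshold_ms:
            slow.append((match.group("partition"), int(match.group("ms"))))
    return sorted(slow, key=lambda item: item[1], reverse=True)
```

Running this over the startup logs of a fast and a slow restart of the same broker would show whether the extra time is concentrated in a few partitions or spread across all of them.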

Apache kafka storm, persistence during maintenance

I have Ubuntu 14.04 LTS. I use a Node.js->Kafka->Storm->MongoDB chain. During initial development everything went well, and messages ended up stored in MongoDB.
For Kafka, I have one ZooKeeper and broker0 on kafka1, and broker1 on kafka2. For Storm, ZooKeeper, Nimbus, and DRPC are located on storm1. The supervisor and workers are located on storm2.
Now the question is what happens when I update storm1 and storm2. I stopped all processes on storm1 and storm2, assuming Kafka would buffer the messages from Node.js. After I restarted both storm1 and storm2 and redeployed the topology, I found that the messages sent while storm1 and storm2 were down are lost. So it appears that Kafka does not persist messages during the Storm maintenance period.
As I understand it, Kafka should remember the index of the last message for which it received an acknowledgement.
In short, how can I prevent messages from being lost while Storm is under maintenance?