Kafka: How to deal with corrupted __consumer_offsets?

I'm running a 3-node Kafka/ZooKeeper cluster (with Logstash as a consumer). This error is spammed in the logs on all 3 nodes:
org.apache.kafka.common.errors.CorruptRecordException: Record size 0 is less than the minimum record overhead (14)
[2022-07-06 13:42:59,538] ERROR [ReplicaFetcher replicaId=23, leaderId=24, fetcherId=0] Found invalid messages during fetch for partition __consumer_offsets-3 offset 169799163 (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.CorruptRecordException: Record size 0 is less than the minimum record overhead (14)
[2022-07-06 13:42:59,538] ERROR [ReplicaFetcher replicaId=23, leaderId=24, fetcherId=0] Found invalid messages during fetch for partition __consumer_offsets-20 offset 124408988 (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.CorruptRecordException: Record size 0 is less than the minimum record overhead (14)
[2022-07-06 13:43:00,694] ERROR [ReplicaFetcher replicaId=23, leaderId=24, fetcherId=0] Found invalid messages during fetch for partition __consumer_offsets-9 offset 171772074 (kafka.server.ReplicaFetcherThread)
These 3 offsets are always the same and never change. It seems like partitions 3, 9 and 20 are stuck indefinitely because of bad offsets.
Additional info: __consumer_offsets is set to 2 replicas and 50 partitions in total.
Any idea how I can fix this? Restarting Kafka/the cluster does not change anything.
Thanks in advance!
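For reference, one way to confirm what the follower is choking on is to dump the suspect segment with the kafka-dump-log tool on the affected broker (a diagnostic sketch, not a fix; the log directory and segment file name below are assumptions and depend on your log.dirs and on which segment contains offset 169799163):
kafka-dump-log.sh --files /var/lib/kafka/data/__consumer_offsets-3/00000000000169790000.log --deep-iteration
The output lists each record batch with its base offset and size, so the batch containing the zero-length record that triggers the CorruptRecordException should stand out.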

Related

kafka + This server is not the leader for that topic-partition

I have a 5-broker Kafka 0.10 cluster. The replication factor is 3, and this is a production cluster.
The broker IDs are 101, 102, 103, 104 and 105.
After a couple of months during which the cluster was fine, we observed the following logs in the Kafka server.log.
From the log we can see many lines of the 'This server is not the leader for that topic-partition' exception.
The topic kopa.thrn.bvff has 100 partitions,
and we can see that all 100 partitions are balanced, so there is no need to run kafka-reassign-partitions.
What may be the possible reason?
Please help me.
[2023-01-19 11:53:37,434] ERROR [ReplicaFetcherThread-0-101], Error for partition [kopa.thrn.bvff,78] to broker 101:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2023-01-19 11:53:37,434] ERROR [ReplicaFetcherThread-0-101], Error for partition [kopa.thrn.bvff,23] to broker 101:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2023-01-19 11:53:37,434] ERROR [ReplicaFetcherThread-0-101], Error for partition [kopa.thrn.bvff,63] to broker 101:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2023-01-19 11:53:37,434] ERROR [ReplicaFetcherThread-0-101], Error for partition [kopa.thrn.bvff,98] to broker 101:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
[2023-01-19 11:53:37,434] ERROR [ReplicaFetcherThread-0-101], Error for partition [kopa.thrn.bvff,3] to broker 101:org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition. (kafka.server.ReplicaFetcherThread)
Each partition in Kafka is managed by a leader broker and follower brokers. Since you have replication factor 3, each partition will have one leader and 2 followers.
When a Kafka producer produces data, it connects to the leader and writes the data there; the followers then copy the data from the leader.
Now, the partition leader can be reassigned based on the leader's availability: if the leader is unavailable for some time for any reason in a distributed environment (busy CPU, network partition, etc.), Kafka will run a leader election for the partition to elect a new leader.
You can see which broker is the leader and which are the followers with the topic describe command.
In your case, the partition leader has changed due to some unavailability of the previous leader. If you have Kafka metrics, you should be able to see those leader election events for the partition. In a distributed environment it is hard to ensure that one broker remains the leader forever.
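For example, the describe command mentioned above looks like this (a sketch; on a 0.10-era cluster the tool still talks to ZooKeeper, and the zk1:2181 address is an assumption):
kafka-topics.sh --zookeeper zk1:2181 --describe --topic kopa.thrn.bvff
The Leader column shows the current leader of each partition and Isr the in-sync replicas; if the leader shown there is no longer broker 101 for the listed partitions, the NotLeaderForPartitionException lines are just the replica fetchers catching up with the new leadership.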

mysql table record not being consumed by Kafka

I just started learning Kafka and I am running Kafka 2.13-2.8.0 on Windows Server 2012 R2. I started ZooKeeper using the following:
zookeeper-server-start.bat ../../config/zookeeper.properties
I started kafka using the following:
kafka-server-start.bat ../../config/server.properties
I started a connector with the following:
connect-standalone.bat ../../config/connect-standalone.properties ../../config/mysql.properties
The content of my mysql.properties file is as follows:
name=test-source-mysql-jdbc-autoincrement
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
tasks.max=1
connection.url=jdbc:mysql://127.0.0.1:3306/DBName?user=username&password=userpassword
mode=incrementing
incrementing.column.name=id
topic.prefix=test-mysql-jdbc-
I started a consumer with and without a partition option:
kafka-console-consumer.bat --topic test-mysql-jdbc-groups --bootstrap-server localhost:9092 --from-beginning [--partition 0]
Everything seemingly started without issues, but when I add a record to my MySQL table called groups, I do not see it in my consumer. I checked all the various logs. The only error messages I saw were in state-change.log, and they looked like the following:
ERROR [Broker id=0] Ignoring StopReplica request (delete=true) from controller 0 with correlation id 5 epoch 1 for partition mytopic-2 as the local replica for the partition is in an offline log directory (state.change.logger)
ERROR [Broker id=0] Ignoring StopReplica request (delete=true) from controller 0 with correlation id 5 epoch 1 for partition mytopic-1 as the local replica for the partition is in an offline log directory (state.change.logger)
ERROR [Broker id=0] Ignoring StopReplica request (delete=true) from controller 0 with correlation id 5 epoch 1 for partition mytopic-0 as the local replica for the partition is in an offline log directory (state.change.logger)
ERROR [Broker id=0] Received LeaderAndIsrRequest with correlation id 1 from controller 0 epoch 2 for partition mytopic-0 (last update controller epoch 1) but cannot become follower since the new leader -1 is unavailable. (state.change.logger)
ERROR [Broker id=0] Received LeaderAndIsrRequest with correlation id 1 from controller 0 epoch 2 for partition mytopic-1 (last update controller epoch 1) but cannot become follower since the new leader -1 is unavailable. (state.change.logger)
ERROR [Broker id=0] Received LeaderAndIsrRequest with correlation id 1 from controller 0 epoch 2 for partition mytopic-2 (last update controller epoch 1) but cannot become follower since the new leader -1 is unavailable. (state.change.logger)
I also noticed this message in ZooKeeper:
INFO Expiring session timeout of exceeded (org.apache.zookeeper.server.ZooKeeperServer)
Could anyone please give me pointers as to what I could be doing wrong? Thanks.
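Two quick checks can narrow this down (a sketch, assuming the broker is on localhost:9092, the standalone Connect worker is on its default REST port 8083, and curl is available on the box):
kafka-topics.bat --bootstrap-server localhost:9092 --list
curl http://localhost:8083/connectors/test-source-mysql-jdbc-autoincrement/status
The first confirms whether the expected test-mysql-jdbc-groups topic actually exists; the second reports whether the JDBC source connector and its task are RUNNING or FAILED, including the stack trace of any task failure. The 'offline log directory' errors above also point at the broker's log.dirs: if that directory has been deleted or locked, the broker marks it offline and the partitions stored there become unavailable.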

Why FETCH_SESSION_ID_NOT_FOUND in Kafka?

I am getting lots and lots of FETCH_SESSION_ID_NOT_FOUND:
INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=2] Node 1 was unable to process the fetch request with (sessionId=1229568311, epoch=511): FETCH_SESSION_ID_NOT_FOUND. (org.apache.kafka.clients.FetchSessionHandler)
INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=5] Node 1 was unable to process the fetch request with (sessionId=136816338, epoch=504): FETCH_SESSION_ID_NOT_FOUND. (org.apache.kafka.clients.FetchSessionHandler)
INFO [ReplicaFetcher replicaId=2, leaderId=0, fetcherId=2] Node 0 was unable to process the fetch request with (sessionId=311282207, epoch=569): FETCH_SESSION_ID_NOT_FOUND. (org.apache.kafka.clients.FetchSessionHandler)
...
I read "Kafka: Continuously getting FETCH_SESSION_ID_NOT_FOUND" and "How to check the actual number of incremental fetch session cache slots used in Kafka cluster?".
For now, we only have lag metrics from Burrow.
My questions:
1. Can someone explain why I am getting so many FETCH_SESSION_ID_NOT_FOUND messages? What do they mean? I did not get them before.
Are some consumers sending too many requests, or are the leaders constantly being re-elected?
I have no idea. Could someone give me more details?
2. If it is because some consumers send too many requests, how can I identify these consumers?
Thanks
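As a side note on question 2: the brokers expose their incremental fetch session cache over JMX (the FetchSessionCache metrics group introduced along with fetch sessions), so you can watch whether the cache is full or being evicted. A sketch using the bundled JmxTool, assuming JMX is enabled on the broker and port 9999 matches its JMX_PORT:
kafka-run-class.sh kafka.tools.JmxTool --jmx-url service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi --object-name kafka.server:type=FetchSessionCache,name=NumIncrementalFetchSessions --reporting-interval 5000
A value pinned at the cache-slot limit, together with a non-zero IncrementalFetchSessionEvictionsPerSec, indicates sessions are being evicted and re-created, which is what produces the FETCH_SESSION_ID_NOT_FOUND lines.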

Kafka: Continuously getting FETCH_SESSION_ID_NOT_FOUND

I am continuously getting FETCH_SESSION_ID_NOT_FOUND. I'm not sure why it's happening. Can anyone please help me understand what the problem is and what the impact on consumers and brokers will be?
Kafka Server Log:
INFO [2019-10-18 12:09:00,709] [ReplicaFetcherThread-1-8][] org.apache.kafka.clients.FetchSessionHandler - [ReplicaFetcher replicaId=6, leaderId=8, fetcherId=1] Node 8 was unable to process the fetch request with (sessionId=258818904, epoch=2233): FETCH_SESSION_ID_NOT_FOUND.
INFO [2019-10-18 12:09:01,078] [ReplicaFetcherThread-44-10][] org.apache.kafka.clients.FetchSessionHandler - [ReplicaFetcher replicaId=6, leaderId=10, fetcherId=44] Node 10 was unable to process the fetch request with (sessionId=518415741, epoch=4416): FETCH_SESSION_ID_NOT_FOUND.
INFO [2019-10-18 12:09:01,890] [ReplicaFetcherThread-32-9][] org.apache.kafka.clients.FetchSessionHandler - [ReplicaFetcher replicaId=6, leaderId=9, fetcherId=32] Node 9 was unable to process the fetch request with (sessionId=418200413, epoch=3634): FETCH_SESSION_ID_NOT_FOUND.
Kafka Consumer Log:
12:29:58,936 INFO [FetchSessionHandler:383] [Consumer clientId=bannerGroupMap#87e2af7cf742#test, groupId=bannerGroupMap#87e2af7cf742#test] Node 8 was unable to process the fetch request with (sessionId=1368981303, epoch=60): FETCH_SESSION_ID_NOT_FOUND.
12:29:58,937 INFO [FetchSessionHandler:383] [Consumer clientId=bannerGroupMap#87e2af7cf742#test, groupId=bannerGroupMap#87e2af7cf742#test] Node 3 was unable to process the fetch request with (sessionId=1521862194, epoch=59): FETCH_SESSION_ID_NOT_FOUND.
12:29:59,939 INFO [FetchSessionHandler:383] [Consumer clientId=zoneGroupMap#87e2af7cf742#test, groupId=zoneGroupMap#87e2af7cf742#test] Node 7 was unable to process the fetch request with (sessionId=868804875, epoch=58): FETCH_SESSION_ID_NOT_FOUND.
12:30:06,952 INFO [FetchSessionHandler:383] [Consumer clientId=creativeMap#87e2af7cf742#test, groupId=creativeMap#87e2af7cf742#test] Node 3 was unable to process the fetch request with (sessionId=1135396084, epoch=58): FETCH_SESSION_ID_NOT_FOUND.
12:30:12,965 INFO [FetchSessionHandler:383] [Consumer clientId=creativeMap#87e2af7cf742#test, groupId=creativeMap#87e2af7cf742#test] Node 6 was unable to process the fetch request with (sessionId=1346340004, epoch=56): FETCH_SESSION_ID_NOT_FOUND.
Cluster Details:
Brokers: 13 (1 broker: 14 cores & 36 GB memory)
Kafka cluster version: 2.0.0
Kafka Java client version: 2.0.0
Number of topics: ~15.
Number of consumers: 7K (all independent; all partitions of a topic are manually assigned to a consumer, and each consumer consumes all partitions of a single topic only)
This is not an error; it's INFO, and it's telling you that you are connected but it can't fetch a session ID because there's none to fetch.
It's normal to see this message and the flushing message in the log.
Increase the value of max.incremental.fetch.session.cache.slots. The default value is 1K; in my case I increased it to 10K and that fixed it.
I first increased it from 1K to 2K, and in a second step from 2K to 4K, and as long as the limit was not exhausted, the error did not appear.
As it seemed to me like a session leak by a certain unidentified consumer, I haven't tried the 10K limit yet, but after reading Hrishikesh Mishra's answer, I definitely will. Increasing the limit also decreased the frequency of the error, so the question of identifying the individual consumer groups that open an excessive number of incremental fetch sessions, raised in "How to check the actual number of incremental fetch session cache slots used in Kafka cluster?", may be irrelevant in the end.
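For reference, this is a broker-side setting in server.properties and needs to be applied to every broker, followed by a restart (a minimal sketch; 10000 is simply the value mentioned above, not a recommendation):
# server.properties
max.incremental.fetch.session.cache.slots=10000
Keep in mind the cache exists to bound broker memory, so raising it trades memory for fewer session evictions.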

cluster no response due to replication

I found this in my server.log:
[2016-03-29 18:24:59,349] INFO Scheduling log segment 3773408933 for log g17-4 for deletion. (kafka.log.Log)
[2016-03-29 18:24:59,349] INFO Scheduling log segment 3778380412 for log g17-4 for deletion. (kafka.log.Log)
[2016-03-29 18:24:59,403] WARN [ReplicaFetcherThread-3-4], Replica 2 for partition [g17,4] reset its fetch offset from 3501121050 to current leader 4's start offset 3501121050 (kafka.server.ReplicaFetcherThread)
[2016-03-29 18:24:59,403] ERROR [ReplicaFetcherThread-3-4], Current offset 3781428103 for partition [g17,4] out of range; reset offset to 3501121050 (kafka.server.ReplicaFetcherThread)
[2016-03-29 18:25:27,816] INFO Rolled new log segment for 'g17-12' in 1 ms. (kafka.log.Log)
[2016-03-29 18:25:35,548] INFO Rolled new log segment for 'g18-10' in 2 ms. (kafka.log.Log)
[2016-03-29 18:25:35,707] INFO Partition [g18,10] on broker 2: Shrinking ISR for partition [g18,10] from 2,4 to 2 (kafka.cluster.Partition)
[2016-03-29 18:25:36,042] INFO Partition [g18,10] on broker 2: Expanding ISR for partition [g18,10] from 2 to 2,4 (kafka.cluster.Partition)
The replica's offset is larger than the leader's, so the replica's data will be deleted and then copied again from the leader.
But while it is copying, the cluster is very slow; some Storm topologies fail because they get no response from Kafka.
How do I prevent this problem from occurring?
How do I slow down the replication rate while the replica is catching up?
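If the brokers are (or can be upgraded to) 0.10.1 or newer, one option is the replication quota added in KIP-73, which caps how fast followers catch up. A sketch, where the zk1:2181 address, broker id 2 (repeat for the leader, broker 4 in the log above) and the 10 MB/s rate are assumptions to adapt:
kafka-configs.sh --zookeeper zk1:2181 --alter --entity-type brokers --entity-name 2 --add-config leader.replication.throttled.rate=10485760,follower.replication.throttled.rate=10485760
kafka-configs.sh --zookeeper zk1:2181 --alter --entity-type topics --entity-name g17 --add-config "leader.replication.throttled.replicas=*,follower.replication.throttled.replicas=*"
Remember to remove the throttle with --delete-config once the replica has caught up; otherwise a follower that falls behind again may never get back into the ISR.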