Why FETCH_SESSION_ID_NOT_FOUND in Kafka? - apache-kafka

I got lots of lots of FETCH_SESSION_ID_NOT_FOUND
INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=2] Node 1 was unable to process the fetch request with (sessionId=1229568311, epoch=511): FETCH_SESSION_ID_NOT_FOUND. (org.apache.kafka.clients.FetchSessionHandler)
INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=5] Node 1 was unable to process the fetch request with (sessionId=136816338, epoch=504): FETCH_SESSION_ID_NOT_FOUND. (org.apache.kafka.clients.FetchSessionHandler)
INFO [ReplicaFetcher replicaId=2, leaderId=0, fetcherId=2] Node 0 was unable to process the fetch request with (sessionId=311282207, epoch=569): FETCH_SESSION_ID_NOT_FOUND. (org.apache.kafka.clients.FetchSessionHandler)
...
I read Kafka: Continuously getting FETCH_SESSION_ID_NOT_FOUND and How to check the actual number of incremental fetch session cache slots used in Kafka cluster?
For now, we just got metrics about lags from Burrow.
My questions:
1, Can someone explain why I got so many FETCH_SESSION_ID_NOT_FOUND? what does it mean? I did not get them before.
Some consumers send too many requests? or the leaders always re-elected?
I have no idea. Could some give me more details?
2, If it is because some consumers send too many requests, how to identify these consumers?
Thanks

Related

Apache Flink, Kafka consumer - org.apache.kafka.clients.producer.internals.Sender retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION

We are using Apache flink kafka consumer to consume the payload . We are facing the delay in processing intermittently. We have added the logs in our business logic and everything looks good. But keep on getting the below error.
[kafka-producer-network-thread | producer-44] WARN org.apache.kafka.clients.producer.internals.Sender - [Producer clientId=producer-44] Got error produce response with correlation id 82 on topic-partition topicname-ingress-0, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION. Error Message: Disconnected from node 0
[kafka-producer-network-thread | producer-44] WARN org.apache.kafka.clients.producer.internals.Sender - [Producer clientId=producer-44] Received invalid metadata error in produce request on partition topicnamae-ingress-0 due to org.apache.kafka.common.errors.NetworkException: Disconnected from node 0. Going to request metadata update now

Kafka : How to deal with corrupted __consumer_offset?

I'm running a 3 nodes kafka/zookeeper cluster (with logstash as a consumer). I get this error spammed in the logs in all the 3 nodes :
org.apache.kafka.common.errors.CorruptRecordException: Record size 0 is less than the minimum record overhead (14)
[2022-07-06 13:42:59,538] ERROR [ReplicaFetcher replicaId=23, leaderId=24, fetcherId=0] Found invalid messages during fetch for partition __consumer_offsets-3 offset 169799163 (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.CorruptRecordException: Record size 0 is less than the minimum record overhead (14)
[2022-07-06 13:42:59,538] ERROR [ReplicaFetcher replicaId=23, leaderId=24, fetcherId=0] Found invalid messages during fetch for partition __consumer_offsets-20 offset 124408988 (kafka.server.ReplicaFetcherThread)
org.apache.kafka.common.errors.CorruptRecordException: Record size 0 is less than the minimum record overhead (14)
[2022-07-06 13:43:00,694] ERROR [ReplicaFetcher replicaId=23, leaderId=24, fetcherId=0] Found invalid messages during fetch for partition __consumer_offsets-9 offset 171772074 (kafka.server.ReplicaFetcherThread)
This 3 offets are always the same and never changes. It seems like partitions 3, 9 and 20 are stuck indefinitely cause of bad offsets.
Additionnal infos : __consumer_offset is set to 2 replicas and 50 partitions in total.
Any idea how can i fix this ? Rebooting kafka/the cluster does not do anything.
Thanks in advance !

Received invalid metadata error in produce request on partition topic-0 due to org.apache.kafka.common.errors.NotLeaderForPartitionException

We use spring kafka stream producer to produce data to kafka topic. when we did resiliency test, we got the the below error.
`2020-08-28 16:18:35.536 WARN [,,,] 26 --- [ad | producer-3] o.a.k.clients.producer.internals.Sender : [Producer clientId=producer-3] Received invalid metadata error in produce request on partition topic1-0 due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now
 log: 2020-08-28 16:18:35.536 WARN [,,,] 26 --- [ad | producer-3] o.a.k.clients.producer.internals.Sender : [Producer clientId=producer-3] Got error produce response with correlation id 80187 on topic-partition topic1-0, retrying (4 attempts left). Error: NOT_LEADER_FOR_PARTITION
[Producer clientId=producer-3] Received invalid metadata error in produce request on partition topic1-0 due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now.
The warn should be coming only for the period of time we are running resiliency(broker down/up testing) but these warning happening even after the resiliency test period and happening only for the particular partition(here topic1-0). all the other partitions are working fine.`
this is the producer config we have:
spring.cloud.stream.kafka.binder.requiredAcks=all spring.cloud.stream.kafka.binder.configuration.retries=5 spring.cloud.stream.kafka.binder.configuration.metadata.max.age.ms=3000 spring.cloud.stream.kafka.binder.configuration.max.in.flight.requests.per.connection=1 spring.cloud.stream.kafka.binder.configuration.retry.backoff.ms=10000
we have retry config too and it is retrying to get the proper metadata which you can see the above log but keep getting the same warning for that particular partition. Our kafka team also analyzing this issue. I checked google for any solution but nothing i could find to be useful.
is there any config or anything else missing?
please help me.
Thanks in advance.
This error comes when Kafka is down. Restarting Kafka worked for me! :)

Kafka: Continuously getting FETCH_SESSION_ID_NOT_FOUND

I am continuously getting FETCH_SESSION_ID_NOT_FOUND. I'm not sure why its happening. Can anyone please me here what is the problem and what will be the impact on consumers and brokers.
Kafka Server Log:
INFO [2019-10-18 12:09:00,709] [ReplicaFetcherThread-1-8][] org.apache.kafka.clients.FetchSessionHandler - [ReplicaFetcher replicaId=6, leaderId=8, fetcherId=1] Node 8 was unable to process the fetch request with (sessionId=258818904, epoch=2233): FETCH_SESSION_ID_NOT_FOUND.
INFO [2019-10-18 12:09:01,078] [ReplicaFetcherThread-44-10][] org.apache.kafka.clients.FetchSessionHandler - [ReplicaFetcher replicaId=6, leaderId=10, fetcherId=44] Node 10 was unable to process the fetch request with (sessionId=518415741, epoch=4416): FETCH_SESSION_ID_NOT_FOUND.
INFO [2019-10-18 12:09:01,890] [ReplicaFetcherThread-32-9][] org.apache.kafka.clients.FetchSessionHandler - [ReplicaFetcher replicaId=6, leaderId=9, fetcherId=32] Node 9 was unable to process the fetch request with (sessionId=418200413, epoch=3634): FETCH_SESSION_ID_NOT_FOUND.
Kafka Consumer Log:
12:29:58,936 INFO [FetchSessionHandler:383] [Consumer clientId=bannerGroupMap#87e2af7cf742#test, groupId=bannerGroupMap#87e2af7cf742#test] Node 8 was unable to process the fetch request with (sessionId=1368981303, epoch=60): FETCH_SESSION_ID_NOT_FOUND.
12:29:58,937 INFO [FetchSessionHandler:383] [Consumer clientId=bannerGroupMap#87e2af7cf742#test, groupId=bannerGroupMap#87e2af7cf742#test] Node 3 was unable to process the fetch request with (sessionId=1521862194, epoch=59): FETCH_SESSION_ID_NOT_FOUND.
12:29:59,939 INFO [FetchSessionHandler:383] [Consumer clientId=zoneGroupMap#87e2af7cf742#test, groupId=zoneGroupMap#87e2af7cf742#test] Node 7 was unable to process the fetch request with (sessionId=868804875, epoch=58): FETCH_SESSION_ID_NOT_FOUND.
12:30:06,952 INFO [FetchSessionHandler:383] [Consumer clientId=creativeMap#87e2af7cf742#test, groupId=creativeMap#87e2af7cf742#test] Node 3 was unable to process the fetch request with (sessionId=1135396084, epoch=58): FETCH_SESSION_ID_NOT_FOUND.
12:30:12,965 INFO [FetchSessionHandler:383] [Consumer clientId=creativeMap#87e2af7cf742#test, groupId=creativeMap#87e2af7cf742#test] Node 6 was unable to process the fetch request with (sessionId=1346340004, epoch=56): FETCH_SESSION_ID_NOT_FOUND.
Cluster Details:
Broker: 13 (1 Broker : 14 cores & 36GB memory)
Kafka cluster version: 2.0.0
Kafka Java client version: 2.0.0
Number topics: ~15.
Number of consumers: 7K (all independent and manually assigned all partitions of a topic to a consumers. One consumer is consuming all partitions from a topic only)
This is not an error, it's INFO and it's telling you that you are connected but it can't fetch a session id because there's none to fetch.
It's normal to see this message and the flushing message in the log.
Increase the value of max.incremental.fetch.session.cache.slots. The default value is 1K, in my case I have increased it to 10K and it fixed.
I have increased it at first from 1K to 2K, and in the second step from 2K to 4K, and as long as the limit was not exhausted, there was no appearance of error:
As it seemed to me like a session leak by certain unidentified consumer, I didn't try 10K limit yet, but reading Hrishikesh Mishra's answer, I definitely will. Because, increasing the limit also decreased the frequency of error, so the question of identifying individual consumer groups that are opening excessive number of incremental fetch sessions, mentioned here How to check the actual number of incremental fetch session cache slots used in Kafka cluster? , may be irrelevant in the end.

Kafka: Partitions reassignment not happening

I am getting an error while migrating data between Kafka brokers.
I am using kafka-reassignment tool to reassign partitions to a different broker without any throttling(because it didn't worked with the below command.). There were around 400 partitions of 50 topics.
Apache Kafka 1.1.0
Confluent Docker Image tag : 4.1.0
Command:
kafka-reassign-partitions --zookeeper IP:2181 --reassignment-json-file proposed.json --execute —throttle 100000000
After some time, I am able to see the below error continuously on the target broker.
[2019-09-21 11:24:07,625] INFO [ReplicaFetcher replicaId=4, leaderId=0, fetcherId=0] Error sending fetch request (sessionId=514675011, epoch=INITIAL) to node 0: java.io.IOException: Connection to 0 was disconnected before the response was read. (org.apache.kafka.clients.FetchSessionHandler)
[2019-09-21 11:24:07,626] WARN [ReplicaFetcher replicaId=4, leaderId=0, fetcherId=0] Error in response for fetch request (type=FetchRequest, replicaId=4, maxWait=500, minBytes=1, maxBytes=10485760, fetchData={TOPIC-4=(offset=4624271, logStartOffset=4624271, maxBytes=104
8576), TOPIC-2=(offset=1704819, logStartOffset=1704819, maxBytes=1048576), TOPIC-8=(offset=990485, logStartOffset=990485, maxBytes=1048576), TOPIC-1=(offset=1696764, logStartOffset=1696764, maxBytes=1048576), TOPIC-7=(offset=991507, logStartOffset=991507, maxBytes=10485
76), TOPIC-5=(offset=988660, logStartOffset=988660, maxBytes=1048576)}, isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=514675011, epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
java.io.IOException: Connection to 0 was disconnected before the response was read
at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:96)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:220)
at kafka.server.ReplicaFetcherThread.fetch(ReplicaFetcherThread.scala:43)
at kafka.server.AbstractFetcherThread.processFetchRequest(AbstractFetcherThread.scala:146)
at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
Zookeeper status:
ls /admin/reassign_partitions
[]
I am using t2.medium type EC2 instances and gp2 type EBS volumes with 120GB size.
I am able to connect to the zookeeper from all brokers.
[zk: localhost:2181(CONNECTED) 3] ls /brokers/ids [0, 1, 2, 3]
I am using IP address for all brokers, so DNS mismatch is also not the case.
Also, I am not able to see any topic scheduled for reassignment in zookeeper.
[zk: localhost:2181(CONNECTED) 2] ls /admin/reassign_partitions
[]
Interestingly, I can see data is pilling up for the partitions which are not listed above. But the partitions listed in the error are not getting migrated as of now.
I am using confluent kafka docker image.
Kafka Broker Setting:
https://gist.github.com/ethicalmohit/cd44f580356ca02250760a307d90b54d
If you can give us some more details on your topology maybe we can understand better the problem.
Some thoughts:
- Can you connect via zookeeper-cli at kafka-0:2181 ? kafka-0 resolves to the correct host ?
- If reassignment is in progress either you have to manual stop this by deleting the appropriate key in zookeeper (warning, this may make some topic or partition broken) either you have to wait for this job to finish. Can you monitor the ongoing reassignment and give some info about that ?
This has been solved by increasing the value of replica.socket.receive.buffer.bytes in all destination brokers.
After changing the above parameter and restarting broker. I was able to see the data in above-mentioned partitions.