Kafka: Continuously getting FETCH_SESSION_ID_NOT_FOUND - apache-kafka

I am continuously getting FETCH_SESSION_ID_NOT_FOUND and I'm not sure why it's happening. Can anyone please tell me what the problem is and what the impact on consumers and brokers will be?
Kafka Server Log:
INFO [2019-10-18 12:09:00,709] [ReplicaFetcherThread-1-8][] org.apache.kafka.clients.FetchSessionHandler - [ReplicaFetcher replicaId=6, leaderId=8, fetcherId=1] Node 8 was unable to process the fetch request with (sessionId=258818904, epoch=2233): FETCH_SESSION_ID_NOT_FOUND.
INFO [2019-10-18 12:09:01,078] [ReplicaFetcherThread-44-10][] org.apache.kafka.clients.FetchSessionHandler - [ReplicaFetcher replicaId=6, leaderId=10, fetcherId=44] Node 10 was unable to process the fetch request with (sessionId=518415741, epoch=4416): FETCH_SESSION_ID_NOT_FOUND.
INFO [2019-10-18 12:09:01,890] [ReplicaFetcherThread-32-9][] org.apache.kafka.clients.FetchSessionHandler - [ReplicaFetcher replicaId=6, leaderId=9, fetcherId=32] Node 9 was unable to process the fetch request with (sessionId=418200413, epoch=3634): FETCH_SESSION_ID_NOT_FOUND.
Kafka Consumer Log:
12:29:58,936 INFO [FetchSessionHandler:383] [Consumer clientId=bannerGroupMap#87e2af7cf742#test, groupId=bannerGroupMap#87e2af7cf742#test] Node 8 was unable to process the fetch request with (sessionId=1368981303, epoch=60): FETCH_SESSION_ID_NOT_FOUND.
12:29:58,937 INFO [FetchSessionHandler:383] [Consumer clientId=bannerGroupMap#87e2af7cf742#test, groupId=bannerGroupMap#87e2af7cf742#test] Node 3 was unable to process the fetch request with (sessionId=1521862194, epoch=59): FETCH_SESSION_ID_NOT_FOUND.
12:29:59,939 INFO [FetchSessionHandler:383] [Consumer clientId=zoneGroupMap#87e2af7cf742#test, groupId=zoneGroupMap#87e2af7cf742#test] Node 7 was unable to process the fetch request with (sessionId=868804875, epoch=58): FETCH_SESSION_ID_NOT_FOUND.
12:30:06,952 INFO [FetchSessionHandler:383] [Consumer clientId=creativeMap#87e2af7cf742#test, groupId=creativeMap#87e2af7cf742#test] Node 3 was unable to process the fetch request with (sessionId=1135396084, epoch=58): FETCH_SESSION_ID_NOT_FOUND.
12:30:12,965 INFO [FetchSessionHandler:383] [Consumer clientId=creativeMap#87e2af7cf742#test, groupId=creativeMap#87e2af7cf742#test] Node 6 was unable to process the fetch request with (sessionId=1346340004, epoch=56): FETCH_SESSION_ID_NOT_FOUND.
Cluster Details:
Brokers: 13 (each broker: 14 cores, 36 GB memory)
Kafka cluster version: 2.0.0
Kafka Java client version: 2.0.0
Number of topics: ~15
Number of consumers: 7K (all independent; each consumer is manually assigned all partitions of a single topic and consumes only from that topic)

This is not an error. It is logged at INFO level, and it tells you that the client is connected but the broker could not find the fetch session ID it was sent, because no such session exists on the broker.
It's normal to see this message and the flushing message in the log.

Increase the value of max.incremental.fetch.session.cache.slots. The default value is 1K; in my case I increased it to 10K and that fixed it.
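To see why the slot limit matters with about 7K consumers, here is a toy model. It is only a sketch under stated assumptions: it treats the session cache as a plain LRU keyed by client, which is not the broker's real eviction logic (the broker applies additional rules), and the function name `count_session_misses` is made up for illustration.

```python
from collections import OrderedDict


def count_session_misses(num_clients, cache_slots, rounds=3):
    """Toy model: every client fetches once per round, in a fixed order.

    The cache is a plain LRU keyed by client id (an assumption, not the
    broker's actual policy). A miss after the first round stands in for a
    FETCH_SESSION_ID_NOT_FOUND response.
    """
    cache = OrderedDict()
    misses = 0
    for round_no in range(rounds):
        for client in range(num_clients):
            if client in cache:
                cache.move_to_end(client)   # session still cached, refresh it
                continue
            if round_no > 0:
                misses += 1                 # session was evicted since the last fetch
            cache[client] = True            # (re)create the session
            if len(cache) > cache_slots:
                cache.popitem(last=False)   # evict the least recently used session
    return misses


if __name__ == "__main__":
    for slots in (1000, 10000):
        print(slots, count_session_misses(num_clients=7000, cache_slots=slots))
```

In this model, 7,000 clients against 1,000 slots miss on every fetch after the first round, while 10,000 slots produce no misses at all. That matches the observation that raising the limit above the number of active sessions makes the message go away.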

I first increased it from 1K to 2K, and then from 2K to 4K. As long as the limit was not exhausted, the error did not appear.
Since it looked to me like a session leak caused by some unidentified consumer, I have not tried the 10K limit yet, but after reading Hrishikesh Mishra's answer I definitely will. Increasing the limit also reduced the frequency of the error, so the question of identifying the individual consumer groups that open an excessive number of incremental fetch sessions (raised in "How to check the actual number of incremental fetch session cache slots used in Kafka cluster?") may turn out to be irrelevant.

Related

Received invalid metadata error in produce request on partition topic-0 due to org.apache.kafka.common.errors.NotLeaderForPartitionException

We use a Spring Kafka stream producer to produce data to a Kafka topic. When we ran a resiliency test, we got the error below.
2020-08-28 16:18:35.536 WARN [,,,] 26 --- [ad | producer-3] o.a.k.clients.producer.internals.Sender : [Producer clientId=producer-3] Received invalid metadata error in produce request on partition topic1-0 due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now
2020-08-28 16:18:35.536 WARN [,,,] 26 --- [ad | producer-3] o.a.k.clients.producer.internals.Sender : [Producer clientId=producer-3] Got error produce response with correlation id 80187 on topic-partition topic1-0, retrying (4 attempts left). Error: NOT_LEADER_FOR_PARTITION
[Producer clientId=producer-3] Received invalid metadata error in produce request on partition topic1-0 due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now.
The warnings should appear only while we are running the resiliency test (taking brokers down and up), but they keep occurring after the test period, and only for one particular partition (here topic1-0). All the other partitions are working fine.
This is the producer config we have:
spring.cloud.stream.kafka.binder.requiredAcks=all
spring.cloud.stream.kafka.binder.configuration.retries=5
spring.cloud.stream.kafka.binder.configuration.metadata.max.age.ms=3000
spring.cloud.stream.kafka.binder.configuration.max.in.flight.requests.per.connection=1
spring.cloud.stream.kafka.binder.configuration.retry.backoff.ms=10000
We also have a retry config, and the producer retries to get the proper metadata, as you can see in the log above, but we keep getting the same warning for that particular partition. Our Kafka team is also analyzing this issue. I searched Google for a solution but could not find anything useful.
Is there any config or anything else missing?
Please help me.
Thanks in advance.
This error comes when Kafka is down. Restarting Kafka worked for me! :)

Why FETCH_SESSION_ID_NOT_FOUND in Kafka?

I got lots and lots of FETCH_SESSION_ID_NOT_FOUND messages:
INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=2] Node 1 was unable to process the fetch request with (sessionId=1229568311, epoch=511): FETCH_SESSION_ID_NOT_FOUND. (org.apache.kafka.clients.FetchSessionHandler)
INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=5] Node 1 was unable to process the fetch request with (sessionId=136816338, epoch=504): FETCH_SESSION_ID_NOT_FOUND. (org.apache.kafka.clients.FetchSessionHandler)
INFO [ReplicaFetcher replicaId=2, leaderId=0, fetcherId=2] Node 0 was unable to process the fetch request with (sessionId=311282207, epoch=569): FETCH_SESSION_ID_NOT_FOUND. (org.apache.kafka.clients.FetchSessionHandler)
...
I have read "Kafka: Continuously getting FETCH_SESSION_ID_NOT_FOUND" and "How to check the actual number of incremental fetch session cache slots used in Kafka cluster?".
For now, the only metrics we have are the lag metrics from Burrow.
My questions:
1. Can someone explain why I am getting so many FETCH_SESSION_ID_NOT_FOUND messages? What does it mean? I did not get them before.
Are some consumers sending too many requests? Or are the leaders constantly being re-elected?
I have no idea. Could someone give me more details?
2. If it is because some consumers send too many requests, how can I identify these consumers?
Thanks
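As a first step toward seeing who is affected, the log lines themselves can be tallied. The sketch below assumes the message format shown in the logs above; the regular expression and the `summarize` helper are illustrative, not part of Kafka. Note that it only shows which clients or replica fetchers lost their session on which broker, not which client is filling the cache.

```python
import re
from collections import Counter

# Matches the FetchSessionHandler message format seen in the logs above.
PATTERN = re.compile(
    r"\[(?P<kind>ReplicaFetcher|Consumer) (?P<who>[^\]]*)\] "
    r"Node (?P<node>\d+) was unable to process the fetch request with "
    r"\(sessionId=(?P<session>\d+), epoch=(?P<epoch>\d+)\): "
    r"FETCH_SESSION_ID_NOT_FOUND"
)


def summarize(lines):
    """Count FETCH_SESSION_ID_NOT_FOUND messages per broker node and per reporter."""
    per_node = Counter()
    per_reporter = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match is None:
            continue  # unrelated log line
        per_node[int(match.group("node"))] += 1
        per_reporter[(match.group("kind"), match.group("who"))] += 1
    return per_node, per_reporter


if __name__ == "__main__":
    sample = [
        "INFO [ReplicaFetcher replicaId=2, leaderId=1, fetcherId=2] Node 1 was unable to process the fetch request with (sessionId=1229568311, epoch=511): FETCH_SESSION_ID_NOT_FOUND. (org.apache.kafka.clients.FetchSessionHandler)",
        "INFO [ReplicaFetcher replicaId=2, leaderId=0, fetcherId=2] Node 0 was unable to process the fetch request with (sessionId=311282207, epoch=569): FETCH_SESSION_ID_NOT_FOUND. (org.apache.kafka.clients.FetchSessionHandler)",
    ]
    nodes, reporters = summarize(sample)
    print(dict(nodes))
    print(dict(reporters))
```

If the counts are spread evenly over all brokers and all clients, that points toward a cache that is simply too small rather than toward one misbehaving consumer.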

Subset of stream's changelog and repartition partitions not available as broker is down - how stream should behave?

My setup consists of 3 kafka brokers (2.11-1.1.1), a single ZK and a java service that is using the Streams API.
The java service is consuming from topic A, performs a persistent stream operation (backed up by a changelog and a repartition streams topic) and writes to topic B. EOS semantics are enabled.
Given that the changelog and repartition topics have a replication factor of 1, how should the Streams Java app behave when one of my brokers is down (e.g. in my DEV env the disk is full on only one broker)? Will the stream continue to consume even if 1/3 of the changelog and repartition partitions are not reachable?
EDIT: Also given that topics A, B and __consumer_offsets have RF=3.
In my java service logs I see:
2019-01-04 09:14:38,787 UTC WARN kafka-producer-network-thread | trsb-app-nonprod.snapshot-14fa12b2-ac15-4ecc-8729-8f6c4a0034a7-StreamThread-2-0_4-producer org.apache.kafka.clients.NetworkClient warn | [Producer clientId=trsb-app-nonprod.snapshot-14fa12b2-ac15-4ecc-8729-8f6c4a0034a7-StreamThread-2-0_4-producer, transactionalId=trsb-app-nonprod.snapshot-0_4] Connection to node 1 could not be established. Broker may not be available.
2019-01-04 09:14:38,797 UTC WARN kafka-producer-network-thread | trsb-app-nonprod.snapshot-14fa12b2-ac15-4ecc-8729-8f6c4a0034a7-StreamThread-2-1_10-producer org.apache.kafka.clients.NetworkClient warn | [Producer clientId=trsb-app-nonprod.snapshot-14fa12b2-ac15-4ecc-8729-8f6c4a0034a7-StreamThread-2-1_10-producer, transactionalId=trsb-app-nonprod.snapshot-1_10] Connection to node 1 could not be established. Broker may not be available.
And nothing is consumed.
In both working broker logs I see:
[2019-01-04 13:56:56,449] WARN Resetting first dirty offset of trsb-app-nonprod.snapshot-store.invoices-changelog-43 to log start offset 99 since the checkpointed offset 95 is invalid. (kafka.log.LogCleanerManager$)
[2019-01-04 13:56:56,449] WARN Resetting first dirty offset of trsb-app-nonprod.snapshot-store.invoices-changelog-40 to log start offset 103 since the checkpointed offset 100 is invalid. (kafka.log.LogCleanerManager$)
Since you are using exactly once semantics, a minimum of 3 brokers are needed to continue processing, so your app would not continue to process if one of the brokers went down. Read here (see processing.guarantee section) for more info regarding this:
https://kafka.apache.org/10/documentation/streams/developer-guide/config-streams.html#id25
The stream continues to consume, but because the state store update, depending on the message key, may not be pushable to its corresponding changelog partition, some keys will fail and their transactions will fail and be rolled back. As a result, the first key consumed from topic A whose state store push fails will block its partition until the broker is up again. This is because the state store push is part of the EOS transaction.

Kafka mirror maker duplicates when DCs are isolated

We have 5 kafka 1.0.0 clusters:
4 of them are made of 3 nodes and are in different regions in the world
the last one is made of 5 nodes and is an aggregate only cluster.
We are using MirrorMaker (later referenced as MM) to read from the regional clusters and copy the data in the aggregate cluster in our HQ datacenter.
As we were not sure where to run it, we currently have two setups in our prod environment:
MM in the region: reading locally and pushing to aggregate cluster in remote data-center (DC), before committing offsets locally. I tend to call this the push mode (pushing the data)
MM in the DC of the aggregate cluster: reading remotely the data, writing it locally before committing the offsets on remote DC.
What happened is that the entire DC hosting our aggregate cluster became completely isolated from a network point of view. In both cases, we got duplicated records in our aggregate cluster.
Push mode = MM local to the regional cluster, pushing data to remote aggregate cluster
MM started to throw errors like this:
WARN [Producer clientId=producer-1] Got error produce response with correlation id 674364 on topic-partition <topic>-4, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
then:
WARN [Producer clientId=producer-1] Connection to node 1 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
which is ok so far because of idempotence.
But finally we got errors like:
ERROR Error when sending message to topic debug_sip_callback-delivery with key: null, value: 1640 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for <topic>-4: 30032 ms has passed since batch creation plus linger time
ERROR Error when sending message to topic <topic> with key: null, value: 1242 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
java.lang.IllegalStateException: Producer is closed forcefully.
causing MM to stop, and I think this is the problem causing duplicates (I need to dig into the code, but it could be that it lost information about idempotence and on restart resumed from the previously committed offsets).
Pull mode = MM local to the aggregate cluster, pulling data from remote regional cluster
MM instances (with logs at INFO level in this case) started seeing the broker as dead:
INFO [Consumer clientId=mirror-maker-region1-agg-0, groupId=mirror-maker-region1-agg] Marking the coordinator kafka1.region1.internal:9092 (id: 2147483646 rack: null) dead (org.apache.kafka.clients.consumer.internals.AbstractCoordinator)
At the same time on the broker side, we got:
INFO [GroupCoordinator 1]: Member mirror-maker-region1-agg-0-de2af312-befb-4af7-b7b0-908ca8ecb0ed in group mirror-maker-region1-agg has failed, removing it from the group (kafka.coordinator.group.GroupCoordinator)
...
INFO [GroupCoordinator 1]: Group mirror-maker-region1-agg with generation 42 is now empty (__consumer_offsets-2) (kafka.coordinator.group.GroupCoordinator)
Later on MM side, a lot of:
WARN [Consumer clientId=mirror-maker-region1-agg-0, groupId=mirror-maker-region1-agg] Connection to node 2 could not be established. Broker may not be available. (org.apache.kafka.clients.NetworkClient)
and finally when network came back:
ERROR [Consumer clientId=mirror-maker-region1-agg-0, groupId=mirror-maker-region1-agg] Offset commit failed on partition <topic>-dr-8 at offset 382424879: The coordinator is not aware of this member. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator)
i.e., it could not commit in region1 the offsets written on agg because of the rebalancing. And it resumed after rebalance from previously successfully committed offset causing duplicates.
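The mechanism described here, writing to the target first and committing source offsets afterwards, can be shown with a toy model. This is only a sketch: the function `mirror_with_failed_commit` and the record names are hypothetical, and nothing here uses the real MirrorMaker or Kafka client APIs.

```python
def mirror_with_failed_commit(source, batch_size):
    """Toy model of copy-then-commit mirroring with one rejected commit."""
    target = []
    committed = 0  # last offset committed on the source side

    # First batch: the write to the target succeeds...
    batch = source[committed:committed + batch_size]
    target.extend(batch)
    # ...but the commit is rejected (the member was removed from the group),
    # so `committed` stays where it was.

    # After the rebalance, consumption resumes from the last committed offset.
    while committed < len(source):
        batch = source[committed:committed + batch_size]
        target.extend(batch)
        committed += len(batch)  # this time the commit succeeds
    return target


if __name__ == "__main__":
    print(mirror_with_failed_commit(["r0", "r1", "r2", "r3"], batch_size=2))
```

With a batch size of 2, the first two records end up in the target twice. Producer idempotence does not help here, because the repeated records are sent as new messages after the restart rather than as retries of the same batch.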
Configuration
Our MM instances are configured like this:
For our consumer:
bootstrap.servers=kafka1.region1.internal:9092,kafka2.region1.internal:9092,kafka3.region1.internal:9092
group.id=mirror-maker-region-agg
auto.offset.reset=earliest
isolation.level=read_committed
For our producer:
bootstrap.servers=kafka1.agg.internal:9092,kafka2.agg.internal:9092,kafka3.agg.internal:9092,kafka4.agg.internal:9092,kafka5.agg.internal:9092
compression.type=none
request.timeout.ms=30000
max.block.ms=60000
linger.ms=15000
max.request.size=1048576
batch.size=32768
buffer.memory=134217728
retries=2147483647
max.in.flight.requests.per.connection=1
acks=all
enable.idempotence=true
Any idea how we can get the "only once" delivery on top of exactly once in case of 30 min isolated DCs?

When does kafka change leader?

My services that work with Kafka had been running for a year without any spontaneous leader changes.
But in the last 2 weeks they have started to happen quite often.
Kafka log on that:
[2015-09-27 15:35:14,826] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions [myTopic] (kafka.server.ReplicaFetcherManager)
[2015-09-27 15:35:14,830] INFO Truncating log myTopic-0 to offset 11520979. (kafka.log.Log)
[2015-09-27 15:35:14,845] WARN [Replica Manager on Broker 2]: Fetch request with correlation id 713276 from client ReplicaFetcherThread-0-2 on partition [myTopic,0] failed due to Leader not local for partition [myTopic,0] on broker 2 (kafka.server.ReplicaManager)
[2015-09-27 15:35:14,857] WARN [Replica Manager on Broker 2]: Fetch request with correlation id 256685 from client mirrormaker-1 on partition [myTopic,0] failed due to Leader not local for partition [myTopic,0] on broker 2 (kafka.server.ReplicaManager)
[2015-09-27 15:35:20,171] INFO [ReplicaFetcherManager on broker 2] Removed fetcher for partitions [myTopic,0] (kafka.server.ReplicaFetcherManager)
What can cause a leader switch? If this is covered somewhere in the Kafka documentation, please just point me to the link. I have failed to find it.
System configuration
kafka version: kafka_2.10-0.8.2.1
os: Red Hat Enterprise Linux Server release 6.5 (Santiago)
server.properties (differs from default):
broker.id=001
socket.send.buffer.bytes=1048576
socket.receive.buffer.bytes=1048576
socket.request.max.bytes=104857600
log.flush.interval.messages=10000
log.flush.interval.ms=1000
log.retention.bytes=-1
controlled.shutdown.enable=true
auto.create.topics.enable=false
It appears that the lead broker for that partition is down. It might be that the data directory (log.dirs) configured in server.properties is out of space and the broker cannot accommodate more data.
Also, what is the replication factor of the topic, and how many brokers are in the cluster?
I am assuming you have one topic with one partition and a replication factor of 2, which is not a good configuration for optimal Kafka performance and consumers.
Your logs are not clear enough to explain the leader switch. A major issue with your topic may be that it has only one leader because it has only one partition. The single log for that partition grows in size day by day. Kafka internally does rebalancing at some level (details are not confirmed), which may be the reason for your leader switch, but I am not sure.
Also, the 2nd log line says that the log is being truncated. Can you please go through the logs in detail and check whether this happens only after truncation?
As you mentioned, you have already checked your Kafka log directory files and their size. Please run the describe command when the issue occurs; the leader switch will show up there as well. Alternatively, if you can set up a dashboard that displays the leader over time, it will be easier to find the root cause.
bin/kafka-topics.sh --describe --zookeeper Zookeeperhost:Port --topic TopicName
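If a dashboard is not available, two saved snapshots of the describe output can be compared with a small script. This is a sketch only: it assumes each partition line contains "Topic: ... Partition: ... Leader: ..." in that order, the sample text in the demo is made up rather than real output from this cluster, and the helper names are hypothetical.

```python
import re

# Assumed line shape: "Topic: myTopic  Partition: 0  Leader: 2  Replicas: 2,1  Isr: 2,1"
LINE = re.compile(r"Topic:\s*(\S+)\s+Partition:\s*(\d+)\s+Leader:\s*(-?\d+)")


def leaders(describe_output):
    """Map (topic, partition) -> leader broker id for one --describe snapshot."""
    result = {}
    for line in describe_output.splitlines():
        match = LINE.search(line)
        if match:
            topic, partition, leader = match.groups()
            result[(topic, int(partition))] = int(leader)
    return result


def leader_changes(before, after):
    """Return {(topic, partition): (old_leader, new_leader)} for changed leaders."""
    old, new = leaders(before), leaders(after)
    return {tp: (old[tp], new[tp]) for tp in old if tp in new and old[tp] != new[tp]}


if __name__ == "__main__":
    # Hypothetical snapshots, for illustration only.
    snapshot_1 = "Topic: myTopic\tPartition: 0\tLeader: 2\tReplicas: 2,1\tIsr: 2,1"
    snapshot_2 = "Topic: myTopic\tPartition: 0\tLeader: 1\tReplicas: 2,1\tIsr: 1"
    print(leader_changes(snapshot_1, snapshot_2))
```

Running the describe command on a schedule and feeding consecutive snapshots to `leader_changes` gives a timeline of switches that can be matched against the broker logs.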
Suggestion: create a new topic with more partitions (read the Kafka documentation to get a good idea of the optimum number of partitions) and start writing to it. Or you can check how to increase the number of partitions for the current topic.
Last thing: is the leader switch causing issues in your clients, or are you only worried about the warnings?