Kafka INVALID_FETCH_SESSION_EPOCH - apache-kafka

We run a Kafka Streams application, built with Spring Cloud Stream Kafka, against a Kafka broker setup. Although it seems to run fine, we get the following message in our log:
2019-02-21 22:37:20,253 INFO kafka-coordinator-heartbeat-thread | anomaly-timeline org.apache.kafka.clients.FetchSessionHandler [Consumer clientId=anomaly-timeline-56dc4481-3086-4359-a8e8-d2dae12272a2-StreamThread-1-consumer, groupId=anomaly-timeline] Node 2 was unable to process the fetch request with (sessionId=1290440723, epoch=2089): INVALID_FETCH_SESSION_EPOCH.
I searched the internet but there is not much information on this error. I guessed that it could have something to do with a difference in time settings between the broker and the consumer, but both machines have the same timeserver settings.
Any idea how this can be resolved?

Fetch sessions were introduced by KIP-227 in the 1.1.0 release: https://cwiki.apache.org/confluence/display/KAFKA/KIP-227%3A+Introduce+Incremental+FetchRequests+to+Increase+Partition+Scalability
Replica followers (and consumers, as in your log) repeatedly fetch messages from the partition leader. To avoid sending the full metadata for all partitions on every request, only the partitions that changed are sent within the same fetch session.
Looking at Kafka's code, we can see one case in which this error is returned:
if (session.epoch != expectedEpoch) {
  info(s"Incremental fetch session ${session.id} expected epoch $expectedEpoch, but " +
    s"got ${session.epoch}. Possible duplicate request.")
  new FetchResponse(Errors.INVALID_FETCH_SESSION_EPOCH, new FetchSession.RESP_MAP, 0, session.id)
} else {
src: https://github.com/axbaretto/kafka/blob/ab2212c45daa841c2f16e9b1697187eb0e3aec8c/core/src/main/scala/kafka/server/FetchSession.scala#L493
In general, if you don't have thousands of partitions and, at the same time, this doesn't happen very often, then it shouldn't worry you.
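To make the epoch check above more concrete, here is a minimal sketch in plain Java. It is a simplified model, not Kafka's actual implementation: the class FetchSessionSketch, its methods, and the assumption that a new session expects epoch 1 next are all made up for illustration. Only the error names mirror Kafka's error codes.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical model of a broker-side fetch session cache.
// It only illustrates the epoch comparison; it is not Kafka's real code.
class FetchSessionSketch {
    static final String NONE = "NONE";
    static final String INVALID_FETCH_SESSION_EPOCH = "INVALID_FETCH_SESSION_EPOCH";
    static final String FETCH_SESSION_ID_NOT_FOUND = "FETCH_SESSION_ID_NOT_FOUND";

    // sessionId -> epoch the broker expects on the next incremental request
    private final Map<Integer, Integer> nextExpectedEpoch = new HashMap<>();

    // Assumption: after the initial full fetch, the next expected epoch is 1.
    void openSession(int sessionId) {
        nextExpectedEpoch.put(sessionId, 1);
    }

    String handleFetch(int sessionId, int requestEpoch) {
        Integer expected = nextExpectedEpoch.get(sessionId);
        if (expected == null) {
            return FETCH_SESSION_ID_NOT_FOUND;
        }
        if (expected.intValue() != requestEpoch) {
            // Duplicate or out-of-order request: the client falls back to a full fetch.
            return INVALID_FETCH_SESSION_EPOCH;
        }
        nextExpectedEpoch.put(sessionId, expected + 1);
        return NONE;
    }

    public static void main(String[] args) {
        FetchSessionSketch broker = new FetchSessionSketch();
        broker.openSession(1290440723);
        System.out.println(broker.handleFetch(1290440723, 1)); // epoch matches
        System.out.println(broker.handleFetch(1290440723, 1)); // same epoch sent twice
    }
}
```

In this model a resent request with an already-used epoch is rejected, which matches the "Possible duplicate request" hint in the broker log line.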

It seems this might be caused by KAFKA-8052, which was fixed in Kafka 2.3.0.

Indeed, you can get this message when log rolling or retention-based deletion occurs, as zen pointed out in the comments. It's not a problem if it doesn't happen all the time. If it does, check your log.roll and log.retention configurations.

Updating the client to version 2.3 (the same version as the broker) fixed it for me.

In our case, the root cause was an incompatibility between the Kafka broker and the client. If your cluster is behind the client version, you might see all kinds of odd problems such as this.
Our Kafka broker is on 1.x.x and our Kafka consumer was on 2.x.x. As soon as we downgraded our spring-cloud-dependencies to Finchley.RELEASE, the problem was solved.
dependencyManagement {
    imports {
        mavenBom "org.springframework.cloud:spring-cloud-dependencies:Finchley.RELEASE"
    }
}

Related

Records associated to a Kafka batch listener are not consumed for some partitions after several rebalances (resiliency testing)

Some weeks ago my project was updated to use Kafka 3.2.1 instead of the version that comes with Spring Boot 2.7.3 (3.1.1). We made this upgrade to avoid an issue in Kafka Streams: illegal state and illegal argument exceptions were not reaching the uncaught exception handler.
On the consumer side, we also moved to the cooperative sticky assignor.
In parallel, we started some resiliency tests and began to have issues with Kafka records that are no longer consumed on some partitions when using a Kafka batch listener. The issue occurred after several rebalances caused by the test (the deployment is in Kubernetes, and we stopped some pods, microservices and broker instances). The issue is not present on every listener. The Kafka brokers and microservices are up and running.
During our investigations:
We enabled Kafka events, and we can clearly see that the consumer is started.
We can see in the logs that the partitions that are not consuming events are assigned.
Debug logging has been enabled on the KafkaMessageListenerContainer. We see a lot of occurrences of Receive: 0 records and Commit list: {}.
Are there any blocking points to using Kafka 3.2.1 with Spring Boot 2.7.3 / Spring Kafka 2.8.8?
Any help or advice is more than welcome to progress our investigations.
Multiple listeners are defined, and the retry seems to be fired from another listener (shared error handler?).
This is a known bug, fixed in the next release:
https://github.com/spring-projects/spring-kafka/issues/2382
https://github.com/spring-projects/spring-kafka/commit/3de1e89ba697ead04de171cfa35273bb0daddbe6
A temporary workaround is to give each container its own error handler.

Kafka Connect Skipping Messages due to Confluent Interceptor

I am seeing the following message in my Connect log:
WARN Monitoring Interceptor skipped 2294 messages with missing or invalid timestamps for topic TEST_TOPIC_1. The messages were either corrupted or using an older message format. Please verify that all your producers support timestamped messages and that your brokers and topics are all configured with log.message.format.version, and message.format.version >= 0.10.0 respectively. You may also experience this if you are consuming older messages produced to Kafka prior to any of those changes taking place. (io.confluent.monitoring.clients.interceptor.MonitoringInterceptor)
I have changed my Kafka broker configuration to this:
KAFKA_INTER_BROKER_PROTOCOL_VERSION: 0.11.0
KAFKA_LOG_MESSAGE_FORMAT_VERSION: 0.11.0
I am guessing this reduces my overall producer throughput, and I am currently load testing to check.
PS:
I don't want to remove the Confluent interceptor because it helps me monitor throughput and consumer lag.
CONNECT_PRODUCER_INTERCEPTOR_CLASSES: "io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor"
CONNECT_CONSUMER_INTERCEPTOR_CLASSES: "io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor"
Is there any way to not skip those messages? I am using pepperbox to produce the messages, and they don't have a timestamp:
{
"messageId":{{SEQUENCE("messageId", 1, 1)}},
"messageBody":"{{RANDOM_ALPHA_NUMERIC("abcedefghijklmnopqrwxyzABCDEFGHIJKLMNOPQRWXYZ", 2700)}}",
"messageCategory":"{{RANDOM_STRING("Finance", "Insurance", "Healthcare", "Shares")}}",
"messageStatus":"{{RANDOM_STRING("Accepted","Pending","Processing","Rejected")}}"
}
Thanks in advance!
Look at the Kafka version in the pepperbox pom, and you'll see it's using Kafka 0.9.
Timestamps were added to Kafka as of 0.10.0.
As the warning says: "Please verify that all your producers support timestamped messages".
Recompile the project against a newer Kafka version, and all produced records will automatically have a timestamp and therefore not be skipped.
Or use a different tool, such as JMeter or Kafka Connect Datagen.

Exactly once in kafka streams- not working

I am testing exactly-once processing in Kafka Streams by shutting down multiple brokers.
But when I restart the brokers, the same message is produced multiple times on the outbound topic.
I am using Confluent version 6.1.0.
processing.guarantee is set to exactly_once_beta.
acks is set to all.
Can anyone please help me understand whether I am missing any configuration?
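For reference, the setup described above could be written down roughly as follows. This is only a sketch using java.util.Properties with plain string keys; the application id, the broker address, and the class name are hypothetical. The second method reflects an assumption on my part about one thing worth checking: a consumer that reads the outbound topic with the default isolation level also sees records from aborted transactions, which can look like duplicates.

```java
import java.util.Properties;

// Sketch of the configuration described in the question, as plain key/value pairs.
// The application id and broker address are hypothetical placeholders.
class ExactlyOnceConfigSketch {

    static Properties streamsProps() {
        Properties props = new Properties();
        props.setProperty("application.id", "eos-test-app");      // hypothetical
        props.setProperty("bootstrap.servers", "broker1:9092");   // hypothetical
        props.setProperty("processing.guarantee", "exactly_once_beta");
        // Producer settings inside Kafka Streams take the "producer." prefix.
        props.setProperty("producer.acks", "all");
        return props;
    }

    // Assumption: the duplicates are observed by a separate consumer on the outbound topic.
    static Properties outboundConsumerProps() {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker1:9092");   // hypothetical
        props.setProperty("isolation.level", "read_committed");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsProps());
        System.out.println(outboundConsumerProps());
    }
}
```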

Kafka : Failed to update metadata after 60000 ms with only one broker down

We have a kafka producer configured as -
metadata.broker.list=broker1:9092,broker2:9092,broker3:9092,broker4:9092
serializer.class=kafka.serializer.StringEncoder
request.required.acks=1
request.timeout.ms=30000
batch.num.messages=25
message.send.max.retries=3
producer.type=async
compression.codec=snappy
The replication factor is 3 and the total number of partitions is currently 108.
The rest of the properties are left at their defaults.
This producer was running absolutely fine. Then, for some reason, one of the brokers went down, and our producer started to log:
"Failed to update metadata after 60000 ms". Nothing else was in the log. At intervals, a few requests were getting blocked, even though the producer was async.
The issue was resolved when the broker was up and running again.
What can be the reason for this? As I understand it, one broker being down should not affect the system as a whole.
Posting the answer for anyone who might face this issue:
The cause is an older version of the Kafka producer. Kafka producers take the bootstrap servers as a list. In older versions, producers try to connect to all of the listed servers in round-robin fashion to fetch metadata. So if one of the brokers is down, requests going to that server fail and this message appears.
Solution:
Upgrade to a newer producer version.
Alternatively, reduce the metadata.fetch.timeout.ms setting. This ensures the calling thread is not blocked for long and send fails sooner. The default value is 60000 ms. This is not needed in newer versions.
Note: the Kafka send method blocks until the producer is able to write to its buffer.
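The timeout workaround above could look roughly like this. It is a sketch that only builds the properties, assuming an older Java producer in which the metadata.fetch.timeout.ms key still exists; the class name and the 5000 ms value are illustrative, and the broker list is taken from the question.

```java
import java.util.Properties;

// Sketch: producer properties with a shorter metadata fetch timeout.
// Assumes an older Java producer that still honours metadata.fetch.timeout.ms.
class ProducerTimeoutSketch {

    static Properties producerProps(long metadataFetchTimeoutMs) {
        Properties props = new Properties();
        // Broker list from the question; list several brokers so one outage is tolerated.
        props.setProperty("bootstrap.servers",
                "broker1:9092,broker2:9092,broker3:9092,broker4:9092");
        // Default is 60000 ms; a lower value makes send() fail sooner instead of blocking.
        props.setProperty("metadata.fetch.timeout.ms", Long.toString(metadataFetchTimeoutMs));
        return props;
    }

    public static void main(String[] args) {
        // 5000 ms is an illustrative value, not a recommendation.
        System.out.println(producerProps(5000L));
    }
}
```

The trade-off is that a lower timeout turns a long block into a quick failure, so the caller has to handle the failed send itself.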
I got the same error because I forgot to create the topic. Once I created the topic the issue was resolved.

Consume from Kafka 0.10.x topic using Storm 0.10.x (KafkaSpout)

I am not sure if this is the right question to ask in this forum. We were consuming from a Kafka topic with Storm using the Storm KafkaSpout connector, and it was working fine until now. Now we are supposed to connect to a new Kafka cluster running the upgraded version 0.10.x from the same Storm environment, which runs version 0.10.x.
From the Storm documentation (http://storm.apache.org/releases/1.1.0/storm-kafka-client.html) I can see that Storm 1.1.0 is compatible with Kafka 0.10.x onwards and supports the new Kafka consumer API. But in that case I won't be able to run the topology on my end (please correct me if I am wrong).
Is there any workaround for this?
I have seen that even though the new Kafka consumer API has removed the ZooKeeper dependency, we can still consume messages using the old kafka-console-consumer.sh by passing the --zookeeper flag instead of the new --bootstrap-server flag (recommended). I ran this command using Kafka 0.9 and was able to consume from a topic hosted on Kafka 0.10.x.
When we try to connect, we get the exception below:
java.lang.RuntimeException: java.lang.RuntimeException: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /brokers/topics/mytopic/partitions
at storm.kafka.DynamicBrokersReader.getBrokerInfo(DynamicBrokersReader.java:81) ~[stormjar.jar:?]
at storm.kafka.trident.ZkBrokerReader.<init>(ZkBrokerReader.java:42) ~[stormjar.jar:?]
But we are able to connect to the remote ZK server, and we validated that the path exists:
./zkCli.sh -server remoteZKServer:2181
[zk: remoteZKServer:2181(CONNECTED) 5] ls /brokers/topics/mytopic/partitions
[3, 2, 1, 0]
As we can see above, it gives the expected output, since the topic has 4 partitions.
At this point we have the following questions:
1) Is it at all possible to connect to Kafka 0.10.x using Storm version 0.10.x? Has anyone tried this?
2) Even if we are able to consume, do we need to make any code changes in order to retrieve the message offset in case of a topology shutdown/restart? I am asking because we will be passing the ZK cluster details instead of the broker info, as supported in the old KafkaSpout version.
We are running out of options here; any pointers would be highly appreciated.
UPDATE:
We are able to connect to and consume from the remote Kafka topic when running the topology locally from Eclipse. To make sure Storm does not use the in-memory ZK, we used the overloaded constructor LocalCluster("zkServer",port). It works fine and we can see the data coming in. This led us to conclude that version compatibility might not be the issue here.
However, we still have no luck when the topology is deployed to the cluster.
We have verified the connectivity from the Storm box to the ZK servers.
The znode also seems fine.
At this point we really need some pointers: what could possibly be wrong, and how do we debug it? We have never worked with Kafka 0.10.x before, so we are not sure what exactly we are missing.
We would really appreciate some help and suggestions.
Storm 0.10.x is compatible with Kafka 0.10.x. We can still use the old KafkaSpout, which depends on the ZooKeeper-based offset storage mechanism.
The connection loss exception occurred because we were trying to reach a remote Kafka cluster that did not accept connections from our end. We needed to open a specific firewall port so that the connection could be established. It seems that when running a topology in cluster mode, all the supervisor nodes must be able to talk to ZooKeeper, so the firewall has to be open for each one of them.
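One way to check the firewall point above is to run a small TCP connectivity test on every supervisor node. The sketch below uses only the Java standard library; the class name ZkReachabilityCheck is made up for illustration, and the default host and port are the ones from the zkCli.sh command in the question. It only tests that a TCP connection can be opened, not that ZooKeeper itself is healthy.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Sketch: check whether this machine can open a TCP connection to a ZooKeeper node.
// Run it on each supervisor node; a false result suggests a firewall or routing problem.
class ZkReachabilityCheck {

    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, and unresolved host names.
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults taken from the question; override with: java ZkReachabilityCheck host port
        String host = args.length > 0 ? args[0] : "remoteZKServer";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;
        System.out.println(host + ":" + port + " reachable: " + canConnect(host, port, 3000));
    }
}
```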