How to recover lost message in kafka consumer

I'm writing an application in Apache Camel. I consume messages from a Kafka topic via the Camel Kafka component and dump them into a database for recovery in case of a crash or restart. Below is the Camel URI:
kafka:?autoCommitEnable=false&groupId=r&keySerializerClass=org.apache.kafka.common.serialization.StringSerializer&serializerClass=org.apache.kafka.common.serialization.StringSerializer&topic=
My use case: I have consumed some message(s) from Kafka, but the application crashed before I could dump them into the database. How do I get all the lost messages with the same consumer group ID after restarting the application?
Thanks

How do I get all the lost messages with the same consumer group ID after restarting the application?
Kafka stores the consumer offset for you, provided that your application commits offsets. So when you restart the application, it will consume messages starting from the last offset stored in Kafka.
You could set autoCommitEnable=true, or rely on the commit that Camel performs itself.
I found this: https://github.com/apache/camel/blob/camel-2.18.2/components/camel-kafka/src/main/java/org/apache/camel/component/kafka/KafkaConsumer.java.
It contains this piece of code:
if (endpoint.getConfiguration().isAutoCommitEnable() != null
        && !endpoint.getConfiguration().isAutoCommitEnable()) {
    long partitionLastoffset = partitionRecords.get(partitionRecords.size() - 1).offset();
    consumer.commitSync(Collections.singletonMap(
            partition, new OffsetAndMetadata(partitionLastoffset + 1)));
}
So Camel takes care of committing the offset even when autoCommitEnable is set to false.
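The effect of committing "last offset + 1" only after a record has been handled can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not Camel or Kafka client code: the class CommitAfterProcessDemo, its list-backed log and its committedOffset field are hypothetical stand-ins for a partition and for the offset Kafka stores for the consumer group.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of one partition and one consumer group.
// It is NOT Kafka or Camel code; it only mirrors the commit semantics described above.
class CommitAfterProcessDemo {
    final List<String> log = List.of("m0", "m1", "m2", "m3");
    final List<String> database = new ArrayList<>();
    long committedOffset = 0; // next offset to read, as stored for the group

    // Consume from the committed offset; "crash" (return early) before handling crashAt.
    void run(long crashAt) {
        for (long offset = committedOffset; offset < log.size(); offset++) {
            if (offset == crashAt) {
                return; // simulated crash: record fetched but neither stored nor committed
            }
            database.add(log.get((int) offset)); // 1. store the record
            committedOffset = offset + 1;         // 2. only then commit "last offset + 1"
        }
    }

    public static void main(String[] args) {
        CommitAfterProcessDemo demo = new CommitAfterProcessDemo();
        demo.run(2);  // crashes before m2 is stored
        demo.run(-1); // restart with the same group: resumes at the committed offset
        System.out.println(demo.database); // prints [m0, m1, m2, m3]
    }
}
```

Because the commit happens after the store, a crash between the two steps leads to a record being delivered again rather than lost, so the database write should tolerate duplicates.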

Related

KSQL | Consumer lag | confluent cloud |

I am using Kafka on Confluent Cloud as a message queue in our ecosystem. There are 2 topics, A and B.
Messages in B arrive a little later than the corresponding messages in A are published (with a delay of about 30 seconds).
I am joining these 2 topics using KSQL. The KSQL server is deployed on-premises and is connected to Confluent Cloud. In KSQL I join these 2 topics as streams on a common identifier, say requestId, and create a new stream C. C is the joined stream.
At times, stream C shows lag: it has not processed messages from A and B.
This lag is visible in the Confluent Cloud UI. When I log in to the KSQL server I can see the following error, and after a restart of the KSQL server everything works fine. This happens intermittently, every 2-3 days.
Here is the configuration of the KSQL server, which is deployed on-premises.
# A comma separated list of the Confluent Cloud broker endpoints
bootstrap.servers=${bootstrap_servers}
ksql.internal.topic.replicas=3
ksql.streams.replication.factor=3
ksql.logging.processing.topic.replication.factor=3
listeners=http://0.0.0.0:8088
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="${bootstrap_auth_key}" password="${bootstrap_secret_key}";
# Schema Registry specific settings
ksql.schema.registry.basic.auth.credentials.source=USER_INFO
ksql.schema.registry.basic.auth.user.info=${schema_registry_auth_key}:${schema_registry_secret_key}
ksql.schema.registry.url=${schema_registry_url}
# Additional settings
ksql.streams.producer.delivery.timeout.ms=2147483647
ksql.streams.producer.max.block.ms=9223372036854775807
ksql.query.pull.enable.standby.reads=false
#ksql.streams.num.standby.replicas=3 // TODO if we need HA 1+1
#num.standby.replicas=3
# Automatically create the processing log topic if it does not already exist:
ksql.logging.processing.topic.auto.create=true
# Automatically create a stream within KSQL for the processing log:
ksql.logging.processing.stream.auto.create=true
compression.type=snappy
ksql.streams.state.dir=${base_storage_directory}/kafka-streams
Error message in the KSQL server logs:
[2020-11-25 14:08:49,785] INFO stream-thread [_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-2] State transition from RUNNING to PARTITIONS_ASSIGNED (org.apache.kafka.streams.processor.internals.StreamThread:220)
[2020-11-25 14:08:49,790] ERROR [Consumer clientId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-3-consumer, groupId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0] Offset commit failed on partition yes01-0 at offset 32606388: The coordinator is not aware of this member. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1185)
[2020-11-25 14:08:49,790] ERROR [Consumer clientId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-3-consumer, groupId=_confluent-ksql-default_query_CSAS_WINYES01QUERY_0] Offset commit failed on partition yes01-0 at offset 32606388: The coordinator is not aware of this member. (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1185)
[2020-11-25 14:08:49,790] WARN stream-thread [_confluent-ksql-default_query_CSAS_WINYES01QUERY_0-04b1e77c-e2ba-4511-b7fd-1882f63796e5-StreamThread-3] Detected that the thread is being fenced. This implies that this thread missed a rebalance and dropped out of the consumer group. Will close out all assigned tasks and rejoin the consumer group. (org.apache.kafka.streams.processor.internals.StreamThread:572)
org.apache.kafka.streams.errors.TaskMigratedException: Consumer committing offsets failed, indicating the corresponding thread is no longer part of the group; it means all tasks belonging to this thread should be migrated.
at org.apache.kafka.streams.processor.internals.TaskManager.commitOffsetsOrTransaction(TaskManager.java:1009)
at org.apache.kafka.streams.processor.internals.TaskManager.commit(TaskManager.java:962)
at org.apache.kafka.streams.processor.internals.StreamThread.maybeCommit(StreamThread.java:851)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:714)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:551)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:510)
Caused by: org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be completed since the group has already rebalanced and assigned the partitions to another member. This means that the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time message processing. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:1251)
at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator$OffsetCommitResponseHandler.handle(ConsumerCoordinator.java:1158)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1132)
at org.apache.kafka.clients.consumer.internals.AbstractCoordinator$CoordinatorResponseHandler.onSuccess(AbstractCoordinator.java:1107)
at org.apache.kafka.clients.consumer.internals.RequestFuture$1.onSuccess(RequestFuture.java:206)
at org.apache.kafka.clients.consumer.internals.RequestFuture.fireSuccess(RequestFuture.java:169)
at org.apache.kafka.clients.consumer.internals.RequestFuture.complete(RequestFuture.java:129)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient$RequestFutureCompletionHandler.fireCompletion(ConsumerNetworkClient.java:602)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.firePendingCompletedRequests(ConsumerNetworkClient.java:412)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:297)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:236)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:215)
Edit:
At the time of this exception, I verified that the KSQL server had enough RAM and CPU.

Artemis - How to avoid TransactionRolledBackException for Non-Transactional session

I use live/backup with shared storage, and I use a non-transacted JMS session. I always send one message, and I always receive one message, acknowledge it, and receive the second message only after the first has been acknowledged successfully.
I got this exception in my non-transacted session:
Execution of JMS message listener failed. Caused by: [javax.jms.TransactionRolledBackException - AMQ219030: The transaction was rolled back on failover to a backup server]
javax.jms.TransactionRolledBackException: AMQ219030: The transaction was rolled back on failover to a backup server
at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.rollbackOnFailover(ClientSessionImpl.java:904)
at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.commit(ClientSessionImpl.java:927)
at org.apache.activemq.artemis.jms.client.ActiveMQMessage.acknowledge(ActiveMQMessage.java:719)
It happens because the session was marked as "rollbackOnly". I got into this state after the following steps:
1) I use Spring JMS. The consumer session works 24/7 (an infinite loop of session.receive())
2) The master node crashed, then the master node was restarted
3) After recovery (a couple of hours later), I sent a message to the queue. The consumer read the message and threw an exception on acknowledge (because the session was marked as rollback-only)
4) I read the message again (this is not very bad for my task), but the redelivery count was not increased
My consumer code:
public void onMessage(Message message) {
    if (redeliveryCount(message) > 0) {
        processAsDublicate(message); // Never invoked here, which breaks my business logic.
    }
}
I migrated from another broker and hoped not to have to change the client logic.
Question:
How can I avoid TransactionRolledBackException for a non-transactional session? If this is not possible, should I change the consumer code?
Thank you in advance
UPDATE AFTER ANSWER:
https://github.com/apache/activemq-artemis/tree/2.14.0/examples/features/ha/replicated-failback
This example is not suitable for my case: I don't have non-acknowledged messages. I got into this state after the following steps: 1) restart server 2) consume message 3) acknowledge message
We use one broker for ~30 applications (24/7), ~200 consumers in total.
For example, on the weekend we restart the JMS broker.
Will all consumers start getting this exception after consuming new messages?
(They don't have non-acknowledged messages.)

The TransactionRolledBackException is expected, as you can see in the replicated-failback example.
To prevent a consumer from processing the same message more than once, an idempotent consumer must be implemented. For example, Apache Camel provides an Idempotent Consumer component that works with any JMS provider; see: http://camel.apache.org/idempotent-consumer.html
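The idea of an idempotent consumer can be sketched without Camel as follows. This is a minimal illustration, assuming each message carries an ID that stays the same across redeliveries (for JMS, the JMSMessageID would be a candidate). The class IdempotentHandler and its fields are hypothetical names, and the in-memory set stands in for whatever durable store a real deployment would need.

```java
import java.util.HashSet;
import java.util.Set;

// Minimal idempotent-consumer sketch; all names here are hypothetical.
class IdempotentHandler {
    private final Set<String> processedIds = new HashSet<>();
    int processedCount = 0;

    // Returns true if the message was processed, false if it was recognized as a duplicate.
    boolean handle(String messageId, String body) {
        if (!processedIds.add(messageId)) {
            return false; // already seen: skip, without relying on the redelivery count
        }
        processedCount++; // placeholder for the real business logic
        return true;
    }

    public static void main(String[] args) {
        IdempotentHandler handler = new IdempotentHandler();
        System.out.println(handler.handle("ID:1", "payload")); // first delivery
        System.out.println(handler.handle("ID:1", "payload")); // redelivery after failover
    }
}
```

With this approach the duplicate is detected from the message ID itself, so it does not matter whether the broker increased the redelivery count after the failover.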

Kafka streams fail on decoding timestamp metadata inside StreamTask

We got strange errors from Kafka Streams while starting the app:
java.lang.IllegalArgumentException: Illegal base64 character 7b
at java.base/java.util.Base64$Decoder.decode0(Base64.java:743)
at java.base/java.util.Base64$Decoder.decode(Base64.java:535)
at java.base/java.util.Base64$Decoder.decode(Base64.java:558)
at org.apache.kafka.streams.processor.internals.StreamTask.decodeTimestamp(StreamTask.java:985)
at org.apache.kafka.streams.processor.internals.StreamTask.initializeTaskTime(StreamTask.java:303)
at org.apache.kafka.streams.processor.internals.StreamTask.initializeMetadata(StreamTask.java:265)
at org.apache.kafka.streams.processor.internals.AssignedTasks.initializeNewTasks(AssignedTasks.java:71)
at org.apache.kafka.streams.processor.internals.TaskManager.updateNewAndRestoringTasks(TaskManager.java:385)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:769)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:698)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:671)
and, as a result, an error about the failed stream: ERROR KafkaStreams - stream-client [xxx] All stream threads have died. The instance will be in error state and should be closed.
According to the code inside org.apache.kafka.streams.processor.internals.StreamTask, the failure happened due to an error in decoding timestamp metadata (StreamTask.decodeTimestamp()). It happened on prod, and we can't reproduce it on stage.
What could be the root cause of such errors?
Extra info: our app uses Kafka Streams and consumes messages from several Kafka brokers using the same application.id and state.dir (we are actually switching from one broker to another, but for some period we are connected to both brokers, so we have two Kafka Streams instances, one per broker). As I understand it, the consumer group lives on the broker side (so it shouldn't be a problem), but the state dir is on the client side. Maybe some race condition occurred due to using the same state.dir for two Kafka Streams instances? Could that be the root cause?
We use kafka-streams v.2.4.0, kafka-clients v.2.4.0, Kafka Broker v.1.1.1, with the following configs:
default.key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
default.value.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
default.timestamp.extractor: org.apache.kafka.streams.processor.WallclockTimestampExtractor
default.deserialization.exception.handler: org.apache.kafka.streams.errors.LogAndContinueExceptionHandler
commit.interval.ms: 5000
num.stream.threads: 1
auto.offset.reset: latest

Finally, we figured out the root cause of the corrupted metadata for some consumer groups.
It was one of our internal monitoring tools (written with pykafka) that corrupted the metadata of temporarily inactive consumer groups.
The metadata was not Base64-encoded and contained invalid data like the following: {"consumer_id": "", "hostname": "monitoring-xxx"}.
To see exactly what is stored in the consumer metadata, we can use the following code:
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.Consumer;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
Map<String, Object> config = Map.of("group.id", "...", "bootstrap.servers", "...");
String topicName = "...";
Consumer<byte[], byte[]> kafkaConsumer = new KafkaConsumer<byte[], byte[]>(config, new ByteArrayDeserializer(), new ByteArrayDeserializer());
Set<TopicPartition> topicPartitions = kafkaConsumer.partitionsFor(topicName).stream()
        .map(partitionInfo -> new TopicPartition(topicName, partitionInfo.partition()))
        .collect(Collectors.toSet());
kafkaConsumer.committed(topicPartitions).forEach((key, value) ->
        System.out.println("Partition: " + key + " metadata: " + (value != null ? value.metadata() : null)));
Several options to fix already corrupted metadata:
Change the consumer group to a new one. Note that you might lose or duplicate messages depending on the offset reset policy (latest or earliest), so in some cases this option might not be acceptable.
Overwrite the metadata manually (the timestamp is encoded according to the logic inside StreamTask.decodeTimestamp()):
Map<TopicPartition, OffsetAndMetadata> updatedTopicPartitionToOffsetMetadataMap = kafkaConsumer.committed(topicPartitions).entrySet().stream()
        .collect(Collectors.toMap(Map.Entry::getKey, (entry) -> new OffsetAndMetadata((entry.getValue()).offset(), "AQAAAXGhcf01")));
kafkaConsumer.commitSync(updatedTopicPartitionToOffsetMetadataMap);
Alternatively, specify the metadata as Af//////////, which means NO_TIMESTAMP in Kafka Streams.
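To check what a metadata string such as the two above contains, the encoding can be reproduced with the JDK alone. The sketch below assumes the layout is one version byte (value 1) followed by an 8-byte big-endian timestamp, Base64-encoded. That layout is consistent with both strings quoted above, but it is an internal detail of Kafka Streams that may differ between versions. The class PartitionTimeMetadata is a hypothetical helper, not part of Kafka.

```java
import java.nio.ByteBuffer;
import java.util.Base64;

// Hypothetical helper mirroring the assumed metadata layout:
// 1 version byte + 8-byte timestamp, Base64-encoded.
class PartitionTimeMetadata {
    static final byte MAGIC_BYTE = 1;

    static String encode(long partitionTime) {
        ByteBuffer buffer = ByteBuffer.allocate(9);
        buffer.put(MAGIC_BYTE);
        buffer.putLong(partitionTime);
        return Base64.getEncoder().encodeToString(buffer.array());
    }

    // Throws IllegalArgumentException for input that is not valid Base64,
    // for example a JSON string starting with '{' (character 0x7b).
    static long decode(String metadata) {
        ByteBuffer buffer = ByteBuffer.wrap(Base64.getDecoder().decode(metadata));
        byte version = buffer.get();
        if (version != MAGIC_BYTE) {
            throw new IllegalArgumentException("Unknown version byte: " + version);
        }
        return buffer.getLong();
    }

    public static void main(String[] args) {
        System.out.println(decode("AQAAAXGhcf01")); // prints 1587551534389
        System.out.println(decode("Af//////////")); // prints -1
        System.out.println(encode(-1L));            // prints Af//////////
    }
}
```

Under this assumption, AQAAAXGhcf01 is simply an epoch-millisecond timestamp from April 2020, and passing the JSON written by the monitoring tool to decode fails with the same "Illegal base64 character 7b" message as in the stack trace.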

Spring Kafka Stream - Unacknowledged message with no error

I am using @StreamListener to consume Kafka messages.
I have set autoCommitOffset to false and autoCommitOnError to false.
I also send every message that still fails after maxAttempt tries to a DLQ topic. I have a question that came up while testing the changes.
What will happen if I neither acknowledge the consumed message nor throw an error? Will Kafka send the message again automatically after some time?
When I throw an error, replay kicks in and it retries up to my maxAttempt configuration, and then the failed message goes to the DLQ topic.
Let me know whether Kafka supports retry if the consumer neither throws an error nor acknowledges the message.

What will happen if I neither acknowledge the consumed message nor throw an error? Will Kafka send the message again automatically after some time?
No; not unless you process no further messages, and even then, you will only get a redelivery after you restart the application.
Kafka doesn't "acknowledge" discrete messages; it just stores the last processed offset within a partition.
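The consequence of storing a single position per partition, rather than one acknowledgement per message, can be shown with a tiny model. This is an illustrative sketch only: CommittedPosition and its methods are hypothetical names and do not correspond to any Kafka or Spring API.

```java
// Hypothetical model: one committed position per partition and consumer group,
// not an acknowledgement flag per message.
class CommittedPosition {
    private long committed = 0; // offset of the next record to deliver after a restart

    // "Acknowledging" the record at the given offset stores offset + 1 as the new position.
    void ack(long offset) {
        committed = offset + 1;
    }

    long positionAfterRestart() {
        return committed;
    }

    public static void main(String[] args) {
        CommittedPosition partition = new CommittedPosition();
        partition.ack(0);
        // The record at offset 1 is consumed but never acknowledged.
        System.out.println(partition.positionAfterRestart()); // prints 1
        partition.ack(2);
        System.out.println(partition.positionAfterRestart()); // prints 3
    }
}
```

As long as nothing after offset 1 is acknowledged, a restart resumes at offset 1 and the record is delivered again. Once offset 2 is acknowledged, the stored position moves past offset 1 and that record is not redelivered.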

Cannot Restart Kafka Consumer Application, Failing due to OffsetOutOfRangeException

Currently, my Kafka Consumer streaming application is manually committing the offsets into Kafka with enable.auto.commit set to false.
When I tried restarting the application, it failed with the following exception:
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions:{partition-12=155555555}
Assuming the above error occurs because the message is no longer present (deleted due to the retention period), I tried the following:
I disabled manual commit and enabled auto commit (enable.auto.commit=true and auto.offset.reset=earliest).
It still fails with the same error:
org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions:{partition-12=155555555}
Please suggest ways to restart the job so that it can successfully read from an offset whose message is still present.

You are trying to read offset 155555555 from partition 12 of the topic named partition, but most probably it has already been deleted due to your retention policy.
You can either use the Kafka Streams Application Reset Tool to reset your Kafka Streams application's internal state, so that it can reprocess its input data from scratch:
$ bin/kafka-streams-application-reset.sh
Option (* = required)           Description
---------------------           -----------
* --application-id <id>         The Kafka Streams application ID (application.id)
--bootstrap-servers <urls>      Comma-separated list of broker urls with format: HOST1:PORT1,HOST2:PORT2
                                  (default: localhost:9092)
--intermediate-topics <list>    Comma-separated list of intermediate user topics
--input-topics <list>           Comma-separated list of user input topics
--zookeeper <url>               Format: HOST:POST
                                  (default: localhost:2181)
or start your consumer using a fresh consumer group ID.
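How a committed offset that has fallen out of the retained range interacts with the reset policy can be modeled as below. This is a simplified sketch of the decision, not the Kafka client's implementation: OffsetResetModel, resolve and the example log boundaries are hypothetical.

```java
// Simplified, hypothetical model of choosing a start position for a partition.
class OffsetResetModel {
    enum Policy { EARLIEST, LATEST, NONE }

    // logStart: first offset still retained; logEnd: offset of the next record to be written.
    static long resolve(long committed, long logStart, long logEnd, Policy policy) {
        if (committed >= logStart && committed <= logEnd) {
            return committed; // committed offset is still valid
        }
        switch (policy) {
            case EARLIEST:
                return logStart;
            case LATEST:
                return logEnd;
            default:
                throw new IllegalStateException(
                        "Offset out of range with no configured reset policy: " + committed);
        }
    }

    public static void main(String[] args) {
        // Assumed example: retention has removed everything below offset 200000000.
        System.out.println(resolve(155555555L, 200000000L, 300000000L, Policy.EARLIEST));
    }
}
```

In this model the failure from the question only appears when the effective policy is NONE, so it is worth checking which auto.offset.reset value the consumer actually ends up running with.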

I met the same problem; I use the package org.apache.spark.streaming.kafka010 in my application. In the beginning, I suspected that the auto.offset.reset strategy had no effect, but when I read the description of the method fixKafkaParams in the object KafkaUtils, I found that the configuration had been overwritten. I guess the reason it tweaks the configuration ConsumerConfig.AUTO_OFFSET_RESET_CONFIG for executors is to keep the offsets obtained by the driver and the executors consistent.