I am trying to figure out whether there is any way to send failed records to a Dead Letter topic in Spring Boot Kafka in batch mode.
I don't want the records to be sent in duplicate, since the consumer works in batches and some of the records have already been processed.
I saw this link on spring-kafka consumer batch error handling with Spring Boot version 2.3.7.
I thought about a use case of stopping the container and starting it again without using a DLT, but the duplication issue would come up again in batch mode.
@Gary Russell, can you please provide a small code example for batch error handling?
The RecoveringBatchErrorHandler was added in spring-kafka version 2.5 (which comes with Boot 2.3).
The listener must throw an exception to indicate which record in the batch failed (either the complete record, or the index in the list).
Offsets for the records before the failed one are committed and the failed record can be retried and/or sent to the dead letter topic.
See https://docs.spring.io/spring-kafka/docs/current/reference/html/#recovering-batch-eh
There is a small example there.
The RetryingBatchErrorHandler was added in 2.3.7, but it sends the entire batch to the dead letter topic, which is typically not what you want (hence we added the RecoveringBatchErrorHandler).
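For reference, here is a minimal sketch of the pattern from the documentation; the topic name, back-off values, and bean wiring are assumptions for illustration, not taken from your project:

import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.ConsumerFactory;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.BatchListenerFailedException;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.RecoveringBatchErrorHandler;
import org.springframework.stereotype.Component;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
class BatchDltConfig {

    @Bean
    ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory(
            ConsumerFactory<String, String> consumerFactory,
            KafkaTemplate<String, String> template) {
        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(consumerFactory);
        factory.setBatchListener(true);
        // After the back-off is exhausted, the failed record is published to "<topic>.DLT";
        // offsets for the records before it are committed.
        factory.setBatchErrorHandler(new RecoveringBatchErrorHandler(
                new DeadLetterPublishingRecoverer(template), new FixedBackOff(1000L, 2L)));
        return factory;
    }
}

@Component
class MyBatchListener {

    @KafkaListener(topics = "myTopic", containerFactory = "kafkaListenerContainerFactory")
    public void listen(List<ConsumerRecord<String, String>> records) {
        for (int i = 0; i < records.size(); i++) {
            try {
                process(records.get(i));
            }
            catch (Exception e) {
                // Tell the error handler which record in the batch failed.
                throw new BatchListenerFailedException("processing failed", e, i);
            }
        }
    }

    private void process(ConsumerRecord<String, String> record) {
        // business logic
    }
}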
Currently I have set up 2 separate connectors running the JDBC Sink Connector to ingest topics produced by the producer into the database. Sometimes I see errors in the logs, which cause the produced messages to fail to be stored in the database.
The errors I constantly see are:
Caused by: org.apache.kafka.common.errors.SerializationException: Error retrieving Avro schema for id 11
Caused by: io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException: Subject 'topic-io..models.avro.Topic' not found; error code: 404
Which is true, because TopicRecordName is not supposed to be directed toward this topic but toward another topic that I directed it to; it is just supposed to be directed toward models.avro.Topic.
I was wondering, if this happens constantly, whether there is a way to re-ingest those produced records/messages into the database after they were produced. For example, if messages were produced during 12am-1am and some kind of error showed up in the logs so those messages failed to be consumed during that timeframe, can the configuration or offsets restore them by re-ingesting them into the database? The error is due to the schema registry failing to read/link to the correct schema. It failed because it read the incorrect worker file, since one of my worker files has value.converter.value.subject.name.strategy=io.confluent.kafka.serializers.subject.TopicRecordNameStrategy while the other connector does not read that subject name strategy.
Currently, I set consumer.auto.offset.reset=earliest to start reading messages.
Is there a way to get those data back, for example into a file, so that I can restore them? I am deploying to production, and data must be consumed into the database at all times without any errors.
Rather than mess with the consumer group offsets, which would eventually cause correctly processed data to be consumed again and duplicated, you could use the dead letter queue configuration to send error records to a new topic, which you'd need to monitor and consume before the topic retention drops the events.
https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues/
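As a rough sketch of the approach described in that post (the DLQ topic name and replication factor below are placeholders, not from your setup), the error-handling properties on the sink connector would look something like:
errors.tolerance=all
errors.log.enable=true
errors.log.include.messages=true
errors.deadletterqueue.topic.name=dlq-jdbc-sink
errors.deadletterqueue.topic.replication.factor=1
errors.deadletterqueue.context.headers.enable=true
With errors.tolerance=all the task keeps running past records it cannot deserialize, and those records land on the DLQ topic (with the failure reason in the headers) so you can fix the subject-name strategy and replay them later.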
one of my worker file have a [different config]
This is why configuration management software is important. Don't modify one server in a distributed system without a process that updates them all. Ansible/Terraform are the most common options if you're not running the connectors in Kubernetes.
I am trying to understand how to handle failed consumer records, and how we know there is a record failure. What I am seeing is that when record processing fails in the consumer with a runtime exception, the consumer keeps retrying. But when the next record is available to process, it commits the offset of the latest record, which is expected. My question is how we know about the failed record. In older messaging systems, failed messages are rolled back to queues and processing stops there; then we know the queue is down and we can take action.
I can record the failed record in some DB table, but what happens if this recording fails?
I can move failures to error/dead letter queues, but again, what happens if this move fails?
I am using Kafka 2.6 with Spring Boot 2.3.4. Any help would be appreciated.
Sounds like you would need to disable auto commits and manually commit the offsets yourself once your scope of "successfully processed" is achieved. If you include external processes like a database, then you will also need to increase the Kafka client timeouts so it doesn't think the consumer is dead while waiting on error logging/handling.
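As a rough illustration of that idea (the topic, group, and timeout values below are assumptions), a plain consumer with auto-commit disabled only commits an offset once the record has been fully handled:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ManualCommitConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Give slow error handling (DB writes, error-topic sends) room before a rebalance kicks in.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    try {
                        process(record);
                        // Commit only after successful processing (commit offset + 1).
                        consumer.commitSync(Collections.singletonMap(
                                new TopicPartition(record.topic(), record.partition()),
                                new OffsetAndMetadata(record.offset() + 1)));
                    }
                    catch (Exception e) {
                        // Record the failure (DB table, error topic) and move on; the failed
                        // offset itself is never committed here. If you need redelivery instead,
                        // seek back to record.offset() or stop the consumer at this point.
                        handleFailure(record, e);
                    }
                }
            }
        }
    }

    private static void process(ConsumerRecord<String, String> record) {
        // business logic
    }

    private static void handleFailure(ConsumerRecord<String, String> record, Exception e) {
        // write to a DB table or an error topic; rethrow if you want to stop the loop
    }
}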
Given,
I have a Flink job that reads from an ActiveMQ source and writes to a MySQL database, keyed on an identifier. I have enabled checkpoints for this job every one second and pointed the checkpoints to a MinIO instance, and I verified that the checkpoints are working for the job id. I deploy this job on OpenShift (Kubernetes underneath) and can scale it up/down as and when required.
Problem
When the job is deployed (rolling), or the job goes down due to a bug/error, and there were unconsumed messages in ActiveMQ or messages unacknowledged by Flink (but already written to the database), then when the job recovers (or a new job is deployed) it processes already-processed messages, resulting in duplicate records being inserted into the database.
Question
Shouldn't the checkpoints help the job recover from where it left off?
Should I take a checkpoint before I (rolling) deploy a new job?
What happens if the job quits with an error or a cluster failure?
As the job id keeps changing on every deployment, how does the recovery happen?
Edit: As I cannot expect idempotency from the database, to avoid duplicates being saved into the database (exactly-once), can I write a database-specific (upsert) query that updates if the given record is present and inserts if not?
JDBC currently only supports at-least-once, meaning you can get duplicate messages upon recovery. There is currently a draft to add exactly-once support, which would probably be released with 1.11.
Shouldn't the checkpoints help the job recover from where it left off?
Yes, but the time between the last successful checkpoint and recovery could produce the observed duplicates. I gave a more detailed answer on a somewhat related topic.
Should I take a checkpoint before I (rolling) deploy a new job?
Absolutely. You should actually use cancel with savepoint, which is the only reliable way to change the topology. Additionally, cancel with savepoint avoids any duplicates in the data, as it gracefully shuts down the job.
What happens if the job quits with an error or a cluster failure?
It should automatically restart (depending on your restart settings). It would use the latest checkpoint for recovery. That would most certainly result in duplicates.
As the job id keeps changing on every deployment, how does the recovery happen?
You usually point explicitly to the same checkpoint directory (on S3?).
As I cannot expect idempotency from the database, is upsert the only way to achieve Exactly-Once processing?
Currently, I do not see a way around it. It should change with 1.11.
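For the upsert idea from the edit, a minimal JDBC sketch against MySQL could look like this; the table and column names are made up for illustration, and event_id is assumed to be the unique key your job keys on, so that a replayed record updates the existing row instead of inserting a duplicate:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class UpsertWriter {

    // Replaying the same event_id updates the row instead of creating a duplicate.
    private static final String UPSERT_SQL =
            "INSERT INTO events (event_id, payload) VALUES (?, ?) "
            + "ON DUPLICATE KEY UPDATE payload = VALUES(payload)";

    public void write(String jdbcUrl, String user, String password,
                      String eventId, String payload) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             PreparedStatement ps = conn.prepareStatement(UPSERT_SQL)) {
            ps.setString(1, eventId);
            ps.setString(2, payload);
            ps.executeUpdate();
        }
    }
}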
I am using Spring Boot 2.1.1.RELEASE and Spring Cloud Greenwich.RC2, and the managed version of spring-cloud-stream-binder-kafka is 2.1.0.RC4. The Kafka version is 1.1.0. I have set the following properties, as the messages should not be consumed if there is an error.
spring.cloud.stream.bindings.input.group=consumer-gp-1
...
spring.cloud.stream.kafka.bindings.input.consumer.autoCommitOnError=false
spring.cloud.stream.kafka.bindings.input.consumer.enableDlq=false
spring.cloud.stream.bindings.input.consumer.max-attempts=3
spring.cloud.stream.bindings.input.consumer.back-off-initial-interval=1000
spring.cloud.stream.bindings.input.consumer.back-off-max-interval=3000
spring.cloud.stream.bindings.input.consumer.back-off-multiplier=2.0
....
There are 20 partitions in the Kafka topic and Kerberos is used for authentication (not sure if this is relevant).
The Kafka consumer calls a web service for every message it processes, and if the web service is unavailable I expect the consumer to try to process the message 3 times before it moves on to the next message. So for my test I disabled the web service, and therefore none of the messages could be processed correctly. From the logs I can see that this is happening.
After a while I stopped and then restarted the Kafka consumer (the web service was still disabled). I was expecting that after the restart, the consumer would attempt to process the messages that were not successfully processed the first time around. From the logs (I printed out each message with its fields) after the restart of the Kafka consumer, I couldn't see this happening. I thought the partitions might be influencing something, but I checked the logs and all 20 partitions were assigned to this single consumer.
Is there a property I have missed? I thought the expected behaviour when I restart the consumer the second time is that the Kafka broker would pass the records that were not successfully processed to the consumer again.
Thanks
Parameters working as expected. See comment.
I am using Oracle Fusion Middleware 12.1.3, WebLogic Server 12.1.3, and OSB 12.1.3.
I have created 1 connection factory and one topic. I have one producer sending messages to the topic and 3 consumers (subscribers).
I have also set redelivery failure settings (retry 3 times every half hour), so that in case of a connection error or network issue the messages will be written back to the topic and retried.
But I want to make sure that the messages are retried in the same order they were received.
For example, there are 3 messages in the topic (message 1, message 2, message 3) and one of the subscribers is not able to consume a message, so the message is pending in the topic and will be retried.
But I want it to be retried in the same order, i.e. message 1, message 2, and message 3.
Is there any specific setting in WebLogic or OSB to achieve this behaviour?
Have you tried using the WebLogic-specific Unit-of-Order feature?
What Is Message Unit-Of-Order?
Message Unit-of-Order is a WebLogic Server value-added feature that enables a stand-alone message producer, or a group of producers acting as one, to group messages into a single unit with respect to the processing order. This single unit is called a Unit-of-Order and requires that all messages from that unit be processed sequentially in the order they were created.
You can configure it programmatically for more control, or administratively (via the WLS console, attaching one to connection factories, etc.) if you don't have control over the messages produced.
For more info about how to attach the JMS headers to enable it, you might find this site helpful.
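If you go the programmatic route, a minimal sketch would be something like the following; the JNDI names and the unit name are placeholders, and I believe the WLMessageProducer extension exposes setUnitOfOrder, but check the Javadoc for your WebLogic version:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.Topic;
import javax.naming.InitialContext;
import weblogic.jms.extensions.WLMessageProducer;

public class OrderedPublisher {

    public void publish() throws Exception {
        InitialContext ctx = new InitialContext();
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("jms/MyConnectionFactory");
        Topic topic = (Topic) ctx.lookup("jms/MyTopic");

        Connection conn = cf.createConnection();
        try {
            Session session = conn.createSession(false, Session.AUTO_ACKNOWLEDGE);
            MessageProducer producer = session.createProducer(topic);

            // Messages sent under the same Unit-of-Order name are processed
            // (and redelivered) sequentially in the order they were produced.
            ((WLMessageProducer) producer).setUnitOfOrder("my-unit-of-order");

            producer.send(session.createTextMessage("message 1"));
            producer.send(session.createTextMessage("message 2"));
            producer.send(session.createTextMessage("message 3"));
        }
        finally {
            conn.close();
        }
    }
}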