Flink KafkaProducer sends duplicate messages in exactly-once mode when a checkpoint is restored - streaming

I am writing a test case for Flink's two-phase commit; below is an overview.
"sink kafka" is an exactly-once Kafka producer. "sink step" is a MySQL sink extending the two-phase commit sink. "sink compare" is also a MySQL sink extending the two-phase commit sink, and it occasionally throws an exception to simulate a checkpoint failure.
When a checkpoint fails and the job restores, I find that the MySQL two-phase commit works fine, but the Kafka consumer reads from the offset of the last successful checkpoint and the Kafka producer produces those messages again, even though it had already produced them before the checkpoint failed.
How can I avoid duplicate messages in this case?
Thanks for the help.
env:
flink 1.9.1
java 1.8
kafka 2.11
kafka producer code:
dataStreamReduce.addSink(new FlinkKafkaProducer<>(
        "flink_output",
        new KafkaSerializationSchema<Tuple4<String, String, String, Long>>() {
            @Override
            public ProducerRecord<byte[], byte[]> serialize(Tuple4<String, String, String, Long> element, @Nullable Long timestamp) {
                UUID uuid = UUID.randomUUID();
                JSONObject jsonObject = new JSONObject();
                jsonObject.put("uuid", uuid.toString());
                jsonObject.put("key1", element.f0);
                jsonObject.put("key2", element.f1);
                jsonObject.put("key3", element.f2);
                jsonObject.put("indicate", element.f3);
                return new ProducerRecord<>("flink_output", jsonObject.toJSONString().getBytes(StandardCharsets.UTF_8));
            }
        },
        kafkaProps,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE
)).name("sink kafka");
checkpoint settings:
StreamExecutionEnvironment executionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment();
executionEnvironment.enableCheckpointing(10000);
executionEnvironment.getCheckpointConfig().setTolerableCheckpointFailureNumber(0);
executionEnvironment.getCheckpointConfig().setPreferCheckpointForRecovery(true);
mysql sink:
dataStreamReduce.addSink(
        new TwoPhaseCommitSinkFunction<Tuple4<String, String, String, Long>, Connection, Void>(
                new KryoSerializer<>(Connection.class, new ExecutionConfig()), VoidSerializer.INSTANCE) {

            int count = 0;
            Connection connection;

            @Override
            protected void invoke(Connection transaction, Tuple4<String, String, String, Long> value, Context context) throws Exception {
                if (count > 10) {
                    throw new Exception("compare test exception.");
                }
                PreparedStatement ps = transaction.prepareStatement(
                        " insert into test_two_step_compare(slot_time, key1, key2, key3, indicate) " +
                        " values(?, ?, ?, ?, ?) " +
                        " ON DUPLICATE KEY UPDATE indicate = indicate + values(indicate) "
                );
                ps.setString(1, context.timestamp().toString());
                ps.setString(2, value.f0);
                ps.setString(3, value.f1);
                ps.setString(4, value.f2);
                ps.setLong(5, value.f3);
                ps.execute();
                ps.close();
                count += 1;
            }

            @Override
            protected Connection beginTransaction() throws Exception {
                LOGGER.error("compare in begin transaction");
                try {
                    if (connection.isClosed()) {
                        throw new Exception("mysql connection closed");
                    }
                } catch (Exception e) {
                    LOGGER.error("mysql connection is error: " + e.toString());
                    LOGGER.error("reconnect mysql connection");
                    String jdbcURI = "jdbc:mysql://";
                    Class.forName("com.mysql.jdbc.Driver");
                    Connection connection = DriverManager.getConnection(jdbcURI);
                    connection.setAutoCommit(false);
                    this.connection = connection;
                }
                return this.connection;
            }

            @Override
            protected void preCommit(Connection transaction) throws Exception {
                LOGGER.error("compare in pre Commit");
            }

            @Override
            protected void commit(Connection transaction) {
                LOGGER.error("compare in commit");
                try {
                    transaction.commit();
                } catch (Exception e) {
                    LOGGER.error("compare Commit error: " + e.toString());
                }
            }

            @Override
            protected void abort(Connection transaction) {
                LOGGER.error("compare in abort");
                try {
                    transaction.rollback();
                } catch (Exception e) {
                    LOGGER.error("compare abort error." + e.toString());
                }
            }

            @Override
            protected void recoverAndCommit(Connection transaction) {
                super.recoverAndCommit(transaction);
                LOGGER.error("compare in recover And Commit");
            }

            @Override
            protected void recoverAndAbort(Connection transaction) {
                super.recoverAndAbort(transaction);
                LOGGER.error("compare in recover And Abort");
            }
        })
        .setParallelism(1).name("sink compare");

I'm not quite sure I understand the question correctly:
When a checkpoint fails and the job restores, I find that the MySQL two-phase commit works fine, but the Kafka producer will read the offset from the last success and produce messages even though it had already done so before the checkpoint failed.
The Kafka producer is not reading any data. So I'm assuming your whole pipeline rereads old offsets and produces duplicates. If so, you need to understand how Flink ensures exactly-once.
Periodic checkpoints are created to have a consistent state in case of failure.
These checkpoints contain the offset of the last successfully read record at the time of the checkpoint.
Upon recovery, Flink rereads all records from the offset stored in the last successful checkpoint. Thus, the same records that were processed between the last checkpoint and the failure are replayed.
The replayed records will restore the state right before the failure.
It will produce duplicate outputs originating from the replayed input records.
It is the responsibility of the sinks to ensure that no duplicates are effectively written to the target system.
For the last point, there are two options:
Only output data when a checkpoint has been written, so that no effective duplicates can ever appear in the target. This naive approach is very general (independent of the sink) but adds the checkpointing interval to the latency.
Let the sink deduplicate the output.
The latter option is used for the Kafka sink. It uses Kafka transactions to deduplicate data. To avoid duplicates on the consumer side, you need to ensure the consumer is not reading uncommitted data, as mentioned in the documentation. Also make sure your transaction timeout is large enough that data is not discarded between failure and recovery.
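To make that concrete, here is a minimal sketch of the two settings meant above; the broker address and group id are placeholders and not from the question. The reader of flink_output needs isolation.level=read_committed, and the kafkaProps handed to FlinkKafkaProducer need a transaction.timeout.ms that comfortably covers the gap between failure and recovery (and stays below the broker's transaction.max.timeout.ms).
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

public class ExactlyOnceKafkaConfig {

    // Producer properties passed to FlinkKafkaProducer (the kafkaProps above).
    public static Properties producerProps() {
        Properties kafkaProps = new Properties();
        kafkaProps.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Must cover checkpoint interval plus expected recovery time,
        // and must not exceed the broker's transaction.max.timeout.ms.
        kafkaProps.setProperty(ProducerConfig.TRANSACTION_TIMEOUT_CONFIG, String.valueOf(15 * 60 * 1000));
        return kafkaProps;
    }

    // Properties for the downstream consumer that reads flink_output.
    public static Properties consumerProps() {
        Properties props = new Properties();
        props.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "flink-output-reader");     // placeholder
        // Without read_committed the consumer also sees records of open/aborted transactions,
        // which show up as "duplicates" after a failed checkpoint.
        props.setProperty(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        return props;
    }
}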

Related

KafkaStreams - how to handle corrupted data

I have a question about what to do in a scenario where a KafkaStreams service is consuming a message and an uncaught exception is thrown during processing. At the moment I have implemented an UncaughtExceptionHandler which closes and cleans up the old streams and starts a new one, but it starts consuming the same message again, which ends in an infinite loop of restarting...
In the end, should I check the type of error and somehow commit this message to stop processing this error-prone message?
Is that possible from this setUncaughtExceptionHandler method?
Regards
If you like, you can handle such messages as poison pills, skip them using the following piece of code, and move forward.
public final class PoisonPillEvent {

    public static void ignoreEvents(String topic, String poisonedEventMessage, Consumer<?, ?> consumer) {
        final Pattern offsetPattern = Pattern.compile("\\w*offset*\\w[ ]\\d+");
        final Pattern partitionPattern = Pattern.compile("\\w*" + topic + "*\\w[-]\\d+");
        try {
            // Parse the event to get the partition number and offset, in order to `seek` past the poison pill.
            Matcher mPart = partitionPattern.matcher(poisonedEventMessage);
            Matcher mOff = offsetPattern.matcher(poisonedEventMessage);
            mPart.find();
            Integer partition = Integer.parseInt(mPart.group().replace(topic + "-", ""));
            mOff.find();
            Long offset = Long.parseLong(mOff.group().replace("offset ", ""));
            consumer.seek(new TopicPartition(topic, partition), offset + 1);
        } catch (Exception ex) {
            LOGGER.error("Unable to seek past bad message. {}", ex.toString());
        }
    }
}
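As a side note, if the failure in your case happens already at deserialization time, Kafka Streams also has a built-in way to skip such records instead of parsing exception messages. A minimal sketch of that alternative (application id and bootstrap servers are placeholders, and this is not part of the answer above):
import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;

public class SkipCorruptRecordsConfig {

    public static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        // Log and skip records that cannot be deserialized instead of killing the stream thread.
        props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
                LogAndContinueExceptionHandler.class);
        return props;
    }
}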

Spring cloud Kafka does infinite retry when it fails

Currently, I am having an issue where one of the consumer functions throws an error which makes Kafka retry the records again and again.
@Bean
public Consumer<List<RuleEngineSubject>> processCohort() {
    return personDtoList -> {
        for (RuleEngineSubject subject : personDtoList)
            processSubject(subject);
    };
}
This is the consumer; the processSubject call throws a custom error which causes it to fail.
processCohort-in-0:
  destination: internal-process-cohort
  consumer:
    max-attempts: 1
    batch-mode: true
    concurrency: 10
  group: process-cohort-group
The above is my binder for Kafka.
Currently, I am attempting to retry 2 times and then send to a dead letter queue, but I have been unsuccessful and am not sure which approach to take.
I have tried to implement a custom handler that handles the error when it fails, but it does not retry again, and I am not sure how to send to a dead letter queue.
@Bean
ListenerContainerCustomizer<AbstractMessageListenerContainer<?, ?>> customizer() {
    return (container, dest, group) -> {
        if (group.equals("process-cohort-group")) {
            container.setBatchErrorHandler(new BatchErrorHandler() {
                @Override
                public void handle(Exception thrownException, ConsumerRecords<?, ?> data) {
                    data.records(dest).forEach(r -> {
                        System.out.println(r.value());
                    });
                    System.out.println("failed payload='{}'" + thrownException.getLocalizedMessage());
                }
            });
        }
    };
}
This stops the infinite retry but does not send to a dead letter queue. Can I get suggestions on how to retry two times and then send to a dead letter queue? From my understanding, a batch listener does not know how to recover when there is an error; could someone help shed light on this?
Retry 15 times, then publish it to the topicname.DLT topic:
@Bean
public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
    ConcurrentKafkaListenerContainerFactory<String, String> factory =
            new ConcurrentKafkaListenerContainerFactory<>();
    factory.setCommonErrorHandler(
            new DefaultErrorHandler(
                    new DeadLetterPublishingRecoverer(kafkaTemplate()), kafkaBackOffPolicy()));
    factory.setConsumerFactory(kafkaConsumerFactory());
    return factory;
}

@Bean
public ExponentialBackOffWithMaxRetries kafkaBackOffPolicy() {
    var exponentialBackOff = new ExponentialBackOffWithMaxRetries(15);
    exponentialBackOff.setInitialInterval(Duration.ofMillis(500).toMillis());
    exponentialBackOff.setMultiplier(2);
    exponentialBackOff.setMaxInterval(Duration.ofSeconds(2).toMillis());
    return exponentialBackOff;
}
You need to configure a suitable error handler in the listener container; you can disable retry and the DLQ in the binding and use a DeadLetterPublishingRecoverer instead. See the answer to "Retry max 3 times when consuming batches in Spring Cloud Stream Kafka Binder".
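Applied to the binder configuration from the question, a sketch of that could look roughly like the following; the bean name, the injected KafkaOperations, and the back-off values are assumptions, and the types require a reasonably recent spring-kafka (2.8+):
import org.springframework.cloud.stream.config.ListenerContainerCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.kafka.core.KafkaOperations;
import org.springframework.kafka.listener.AbstractMessageListenerContainer;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Bean
ListenerContainerCustomizer<AbstractMessageListenerContainer<?, ?>> dltCustomizer(
        KafkaOperations<Object, Object> kafkaTemplate) {
    // After the retries are exhausted, publish the failed record to internal-process-cohort.DLT.
    DeadLetterPublishingRecoverer recoverer = new DeadLetterPublishingRecoverer(kafkaTemplate);
    // FixedBackOff(intervalMs, maxRetries): 2 retries after the initial attempt.
    DefaultErrorHandler errorHandler = new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 2L));
    return (container, destination, group) -> {
        if ("process-cohort-group".equals(group)) {
            container.setCommonErrorHandler(errorHandler);
        }
    };
}
Keep max-attempts: 1 in the binding so the binder's own retry does not wrap around this error handler; note that with a batch listener the whole batch is redelivered on each retry.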

Kafka commitAsync Retries with Commit Order

I'm reading through Kafka the Definitive Guide and in the chapter on Consumers there is a blurb on "Retrying Async Commits":
A simple pattern to get commit order right for asynchronous retries is to use a monotonically increasing sequence number. Increase the sequence number every time you commit and add the sequence number at the time of the commit to the commitAsync callback. When you're getting ready to send a retry, check if the commit sequence number the callback got is equal to the instance variable; if it is, there was no newer commit and it is safe to retry. If the instance sequence number is higher, don't retry because a newer commit was already sent.
A quick example by the author would have been great here for dense folks like me. I am particularly unclear about the portion I bolded above.
Can anyone shed light on what this means or even better provide a toy example demonstrating this?
Here is what I think it is, but I have to be humble, I could be wrong:
try {
    AtomicInteger atomicInteger = new AtomicInteger(0);
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(5);
        for (ConsumerRecord<String, String> record : records) {
            System.out.format("offset: %d\n", record.offset());
            System.out.format("partition: %d\n", record.partition());
            System.out.format("timestamp: %d\n", record.timestamp());
            System.out.format("timeStampType: %s\n", record.timestampType());
            System.out.format("topic: %s\n", record.topic());
            System.out.format("key: %s\n", record.key());
            System.out.format("value: %s\n", record.value());
        }
        consumer.commitAsync(new OffsetCommitCallback() {
            // Sequence number captured when this commit was sent.
            private final int marker = atomicInteger.incrementAndGet();

            @Override
            public void onComplete(Map<TopicPartition, OffsetAndMetadata> offsets,
                                   Exception exception) {
                if (exception != null) {
                    // Only retry if no newer commit has been sent in the meantime.
                    if (marker == atomicInteger.get()) consumer.commitAsync(this);
                } else {
                    // Commit succeeded, nothing to retry.
                }
            }
        });
    }
} catch (WakeupException e) {
    // ignore for shutdown
} finally {
    consumer.commitSync(); // block
    consumer.close();
    System.out.println("Closed consumer and we are done");
}

Kafka Transactional Producer

I am using Kafka 2 and I was going through the following link.
https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging
Below is my sample code for Transactional producer.
My code:
public void runProducer(final int sendMessageCount) throws Exception {
final Producer<Long, String> producer = createProducer();
producer.initTransactions();
final long time = System.currentTimeMillis();
try {
producer.beginTransaction();
for (long index = time; index < (time + sendMessageCount); index++) {
final ProducerRecord<Long, String> record =
new ProducerRecord<>(TOPIC, index,
"Test " + index);
// send returns Future
producer.send(record).get();
}
producer.commitTransaction();
}
catch (ProducerFencedException | OutOfOrderSequenceException | AuthorizationException e) {
e.printStackTrace();
// We can't recover from these exceptions, so our only option is to close the producer and exit.
producer.close();
}
catch (final KafkaException e) {
e.printStackTrace();
// For all other exceptions, just abort the transaction and try again.
producer.abortTransaction();
}
finally {
producer.flush();
producer.close();
}
}
Questions:
Do we need to call endTransaction after commitTransaction ?
Do we need to call sendOffsetsToTransaction? What will happen if I don't include this?
How does it work when we deploy the same code to multiple servers with the same transactionId? Do we need a separate transactionId for each instance? Say machine1 crashes after beginTransaction() and after sending a few records; how does machine2 with the same transactionId recover?
Machine1 is using transactionId "test" and it crashed after beginTransaction() and after producing a few records. When the same instance comes up, how does it resume the same transaction? We will actually start again from initTransactions() and beginTransaction().
How does it work for a topic that was not involved in transactions before and is involved in them now? If I start a new consumer group with isolation.level=read_committed, will it read the messages that were committed before transactions were used? Will a consumer with read_uncommitted see the messages that were aborted by a transaction?

Consumer.poll() returns new records even without committing offsets?

If I have enable.auto.commit=false and I call consumer.poll() without calling consumer.commitAsync() afterwards, why does consumer.poll() return new records the next time it's called?
Since I did not commit my offset, I would expect poll() to start from the last committed offset, which should return the same records again.
I'm asking because I'm trying to handle failure scenarios during my processing. I was hoping that without committing the offset, poll() would return the same records again so I can re-process the failed records.
public class MyConsumer implements Runnable {

    @Override
    public void run() {
        while (true) {
            ConsumerRecords<String, LogLine> records = consumer.poll(Long.MAX_VALUE);
            for (ConsumerRecord record : records) {
                try {
                    // process record
                    consumer.commitAsync();
                } catch (Exception e) {
                }
                /**
                 * If an exception happens above, I was expecting poll to return the same records
                 * so I can re-process the record that caused the exception.
                 */
            }
        }
    }
}
The starting offset of a poll is not decided by the broker but by the consumer. The consumer tracks the last received offset and asks for the next batch of messages during the next poll.
Offset commits come into play when a consumer stops or fails and another instance that is not aware of the last consumed offset picks up consumption of a partition.
KafkaConsumer has pretty extensive Javadoc that is well worth a read.
The consumer will read from the last committed offset if it gets rebalanced (that is, if any consumer leaves the group or a new consumer is added), so handling deduplication is not straightforward in Kafka. You have to store the last processed offset in an external store and, when a rebalance happens or the app restarts, seek to that offset and start processing, or check the messages against some unique key in a database to find out whether they are duplicates.
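To make that concrete, here is a minimal sketch of rewinding the same consumer instance to its last committed offsets after a processing failure, so that the next poll() returns the unprocessed records again. The processing step is a placeholder, and Kafka's own committed offsets are used here instead of an external store, which is a simplification:
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class RewindOnFailure {

    static void pollOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        try {
            for (ConsumerRecord<String, String> record : records) {
                // process the record here; any exception falls through to the rewind below
            }
            consumer.commitSync(); // commit only after the whole batch was processed
        } catch (Exception e) {
            // Rewind every assigned partition to its last committed offset (or offset 0 if
            // nothing was committed yet), so the next poll() delivers the same records again.
            for (TopicPartition tp : consumer.assignment()) {
                OffsetAndMetadata committed = consumer.committed(tp);
                consumer.seek(tp, committed == null ? 0L : committed.offset());
            }
        }
    }
}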
I would like to share some code showing how you can solve this in Java.
The approach is that you poll the records, try to process them, and if an exception occurs, you seek back to the minimum polled offset of each topic partition. After that, you do the commitAsync().
public class MyConsumer implements Runnable {

    @Override
    public void run() {
        while (true) {
            List<ConsumerRecord<String, LogLine>> records = StreamSupport
                    .stream(consumer.poll(Long.MAX_VALUE).spliterator(), true)
                    .collect(Collectors.toList());
            boolean exceptionRaised = false;
            for (ConsumerRecord<String, LogLine> record : records) {
                try {
                    // process record
                } catch (Exception e) {
                    exceptionRaised = true;
                    break;
                }
            }
            if (exceptionRaised) {
                Map<TopicPartition, Long> offsetMinimumForTopicAndPartition = records
                        .stream()
                        .collect(Collectors.toMap(r -> new TopicPartition(r.topic(), r.partition()),
                                ConsumerRecord::offset,
                                Math::min
                        ));
                for (Map.Entry<TopicPartition, Long> entry : offsetMinimumForTopicAndPartition.entrySet()) {
                    consumer.seek(entry.getKey(), entry.getValue());
                }
            }
            consumer.commitAsync();
        }
    }
}
With this setup, you poll the messages again and again until you successfully process all messages of one poll.
Please note that your code should be able to handle a poison pill; otherwise, it will get stuck in an endless loop.