Fault Tolerance of FlinkKafkaConsumer in HiBench - apache-kafka

I am running some experiments to test the fault tolerance capabilities of Apache Flink. I am currently using the HiBench framework with the WordCount micro benchmark implemented for Flink.
I noticed that if I kill a TaskManager during an execution, the state of the Flink operators is recovered after the automatic "redeploy", but many (all?) of the tuples sent from the benchmark to Kafka are missed (they are stored in Kafka but never received by Flink).
It seems that after the recovery, the FlinkKafkaConsumer (the benchmark uses FlinkKafkaConsumer08), instead of starting to read from the last offset read before the failure, starts reading from the latest available one (losing all the events sent during the failure).
Any suggestion?
Thanks!

The problem was with the HiBench framework itself and with the latest version of Flink.
I had to update the version of Flink in the benchmark in order to use the "setStartFromGroupOffsets()" method in the Kafka consumer.
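For reference, here is a minimal sketch of what the consumer setup looks like once the newer Flink version is in place; the broker address, group id and topic name below are placeholders, not values taken from HiBench:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder broker address
props.setProperty("group.id", "hibench-wordcount");       // placeholder consumer group

FlinkKafkaConsumer08<String> consumer =
    new FlinkKafkaConsumer08<>("input-topic", new SimpleStringSchema(), props); // placeholder topic
consumer.setStartFromGroupOffsets(); // start from the offsets committed for the group (ignored when restoring from a checkpoint, where the offsets stored in the state win)

DataStream<String> stream = env.addSource(consumer);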

Related

Can new flink Kafka consumer (KafkaSource) start from the old FlinkKafkaConsumer's Savepoint/checkpoint?

I have a job which is running with the old Flink Kafka consumer (FlinkKafkaConsumer). Now I want to migrate it to KafkaSource, but I am not sure what the impact of this migration will be. I want my job to start from the latest successful checkpoint taken by the old FlinkKafkaConsumer. Is that possible? If it is not possible, what is the right way for me to migrate the Kafka consumer?
Assuming the same configuration, the two should be able to be used interchangeably as long as the group id configured for the new source matches the one used by your earlier implementation. You can use this in conjunction with OffsetsInitializer.committedOffsets() to ensure that you continue reading from the offsets that were previously committed for that group:
KafkaSource.<YourExampleClass>builder()
    ...
    .setGroupId("your-previous-group-id")
    .setStartingOffsets(OffsetsInitializer.committedOffsets())
While the two should just work, it's worth noting that your specific pipeline, and how it uses parallelism, could reveal some of the differences between FlinkKafkaConsumer and the newer KafkaSource:
the KafkaSource behaves differently than FlinkKafkaConsumer in the case where the number of Kafka partitions is smaller than the parallelism of Flink's Kafka Source operator.
How to upgrade from FlinkKafkaConsumer to KafkaSource has been included in the release notes for Flink 1.14, when the FlinkKafkaConsumer was deprecated. You can find it at https://nightlies.apache.org/flink/flink-docs-release-1.14/release-notes/flink-1.14/#deprecate-flinkkafkaconsumer

KafkaStreams processing guarantee exactly_once and exactly_once_beta difference

The question is simple, what is the difference between those two guarantees in Kafka Streams?
processing.guarantee: exactly_once / exactly_once_beta
Docs says
Using "exactly_once" requires broker version 0.11.0 or newer, while using "exactly_once_beta" requires broker version 2.5 or newer. Note that if exactly-once processing is enabled, the default for parameter commit.interval.ms changes to 100ms.
But there's nothing about difference.
When you configure exactly_once_beta, transaction processing will be done using a new implementation, enabling better performance as the number of producers increases.
Note however that a two-step migration will be necessary if you have been using exactly_once with an earlier Kafka version.
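For context, here is a minimal sketch of where this setting goes in a Streams application; the application id, broker address and topology are placeholders:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-streams-app");     // placeholder application id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder broker address
// exactly_once needs brokers >= 0.11.0; exactly_once_beta needs brokers >= 2.5
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_BETA);
KafkaStreams streams = new KafkaStreams(topology, props); // topology built elsewhere
streams.start();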

Will flink resume from the last offset after executing yarn application kill and running again?

I use FlinkKafkaConsumer to consume kafka and enable checkpoint. Now I'm a little confused on the offset management and checkpoint mechanism.
I already know that Flink will start reading partitions from the consumer group's committed offsets.
https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-start-position-configuration
and that the offsets are stored in the checkpoints on a remote file system.
https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-and-fault-tolerance
What happens if I stop the application by executing yarn application -kill appid
and then run the start command like ./bin/flink run ...?
Will Flink get the offsets from the checkpoint or from the group id managed by Kafka?
If you run the job again without specifying a savepoint (i.e. without the -s option of $ bin/flink run -s :savepointPath [:runArgs]), Flink will try to get the offsets of your consumer group from Kafka (in older versions from ZooKeeper). But you will lose all other state of your Flink job (which might be ignorable if you have a stateless Flink job).
I must admit that this behaviour is quite confusing. By default, starting a job without a savepoint is like starting from zero. As far as I know, only the implementation of the Kafka source differs from that behaviour. If you want to change that behaviour, you can use the explicit start-position methods of the FlinkKafkaConsumer[08/09/10], e.g. setStartFromEarliest() or setStartFromLatest(), instead of the default setStartFromGroupOffsets(). This is described here: Kafka Consumers Start Position Configuration
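As a sketch (topic and properties are placeholders, consumer construction as in the linked documentation), the start position is chosen explicitly on the consumer:

FlinkKafkaConsumer08<String> myConsumer =
    new FlinkKafkaConsumer08<>("my-topic", new SimpleStringSchema(), properties); // placeholders
myConsumer.setStartFromGroupOffsets(); // default: start from the offsets committed for the group
// myConsumer.setStartFromEarliest();  // ignore committed offsets, read each partition from the beginning
// myConsumer.setStartFromLatest();    // ignore committed offsets, read only records arriving from now on

Note that these methods only affect the start position of a fresh start; when the job is restored from a checkpoint or savepoint, the offsets stored in that state take precedence.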
It might be worth having a closer look at the documentation of flink: What is a savepoint and how does it differ from checkpoints.
In a nutshell
Checkpoints:
The primary purpose of Checkpoints is to provide a recovery mechanism in case of unexpected job failures. A Checkpoint’s lifecycle is managed by Flink
Savepoints:
Savepoints are created, owned, and deleted by the user. Their use-case is for planned, manual backup and resume
There are currently ongoing discussions on how to "unify" savepoints and checkpoints. You can find a lot of technical details here: Flink Improvement Proposal 47 (FLIP-47): Checkpoints vs Savepoints

Test kafka and flink integration flow

I would like to test Kafka / Flink integration with FlinkKafkaConsumer011 and FlinkKafkaProducer011 for example.
The process will be :
read from kafka topic with Flink
some manipulation with Flink
write into another kafka topic with Flink
With a string example it would be: read a string from the input topic, convert it to uppercase, and write it into a new topic.
The question is how to test the flow ?
When I say test this is Unit/Integration test.
Thanks!
Flink documentation has a little doc on how you can write unit/integration tests for your transformation operators: link. The doc also has a little section about testing checkpointing and state handling, and about using AbstractStreamOperatorTestHarness.
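For the uppercase example above, a stateless MapFunction can be covered by a plain JUnit test without any Flink runtime; the class and method names here are made up for illustration:

// Hypothetical operator under test: uppercases every incoming string.
public class UppercaseMapper implements MapFunction<String, String> {
    @Override
    public String map(String value) {
        return value.toUpperCase();
    }
}

@Test
public void mapsToUppercase() throws Exception {
    assertEquals("HELLO KAFKA", new UppercaseMapper().map("hello kafka"));
}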
However, I think you are more interested in end-to-end integration testing (including testing sources and sinks). For that, you can start a Flink mini cluster. Here is a link to an example code that starts a Flink mini cluster: link.
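The skeleton of such a test usually starts with a shared mini cluster rule along these lines (cluster sizing is arbitrary; the rest of the test builds a job against this cluster and asserts on its output):

@ClassRule
public static MiniClusterWithClientResource flinkCluster =
    new MiniClusterWithClientResource(
        new MiniClusterResourceConfiguration.Builder()
            .setNumberSlotsPerTaskManager(2)  // arbitrary sizing for a local test
            .setNumberTaskManagers(1)
            .build());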
You can also launch a Kafka broker within a JVM and use it for your testing purposes. Flink's Kafka connector does that for its integration tests. Here is sample code starting the Kafka server: link.
If you are running locally, you can use a simple generator app to generate messages for your source Kafka topic (there are many available; you can generate messages continuously or at a configured interval). Here is an example of how you can set Flink's job global parameters when running locally: Kafka010Example.
Another alternative is to create an integration environment (vs. production) to run your end-to-end testing. You will be able to get a real feel of how your program will behave in a production-like environment. It is always advised to have a complete parallel testing environment, including test source/sink Kafka topics.

Implement Kafka Streams Processor in .Net?

Is that possible?
The official .Net client confluent-kafka-dotnet only seems to provide consumer and producer functionality.
And (from what I remember looking into Kafka streams quite a while back) I believe Kafka Streams processors always run on the JVMs that run Kafka itself. In that case, it would be principally impossible.
Yes, it is possible to re-implement Apache Kafka's Streams client library (a Java library) in .NET. But at the moment no such ready-to-use Kafka Streams implementation for .NET exists.
And (from what I remember looking into Kafka streams quite a while back) I believe Kafka Streams processors always run on the JVMs that run Kafka itself. In that case, it would be principally impossible.
No, Kafka Streams "processors" as you call them do not run in (the JVMs of) the Kafka brokers, which would be server-side.
Instead, the Kafka Streams client library is used to implement client-side Java/Scala/Clojure/... applications for stream processing. These applications talk to the Kafka brokers (which form the Kafka cluster) over the network.
As of May 2020 there seems to be a project in the making to support Kafka Streams in .NET:
https://github.com/LGouellec/kafka-stream-net
As per their roadmap they are now in early beta and intend to get to v1 by the end of the year or the beginning of the next.