Can new flink Kafka consumer (KafkaSource) start from the old FlinkKafkaConsumer's Savepoint/checkpoint?

I have a job running with the old Flink Kafka consumer (FlinkKafkaConsumer). Now I want to migrate it to KafkaSource, but I am not sure what the impact of this migration will be. I want my job to start from the latest successful checkpoint taken by the old FlinkKafkaConsumer. Is that possible? If it is not possible, what is the right way to migrate the Kafka consumer?

Assuming the same configuration, the two should be usable interchangeably as long as the group-id you configure for the new consumer matches the one used by your earlier implementation. You can use this in conjunction with OffsetsInitializer.committedOffsets() to ensure that you continue reading from the offsets that were previously committed:
KafkaSource<YourExampleClass> source = KafkaSource.<YourExampleClass>builder()
    ...
    .setGroupId("your-previous-group-id")
    .setStartingOffsets(OffsetsInitializer.committedOffsets())
    .build();
While the two should just work, it's worth noting that your specific pipeline, and how it uses parallelism, could reveal some of the differences between FlinkKafkaConsumer and the newer KafkaSource:
the KafkaSource behaves differently than FlinkKafkaConsumer in the case where the number of Kafka partitions is smaller than the parallelism of Flink's Kafka Source operator.

How to upgrade from FlinkKafkaConsumer to KafkaSource has been included in the release notes for Flink 1.14, when the FlinkKafkaConsumer was deprecated. You can find it at https://nightlies.apache.org/flink/flink-docs-release-1.14/release-notes/flink-1.14/#deprecate-flinkkafkaconsumer
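For reference, a rough sketch of what the stop-and-restart part of that migration can look like on the command line (savepoint directory, job id, and jar name are placeholders; the release notes linked above remain the authoritative reference):
# stop the old FlinkKafkaConsumer job with a final savepoint
bin/flink stop --savepointPath hdfs:///flink/savepoints <old-job-id>
# start the rebuilt job that uses KafkaSource; --allowNonRestoredState lets Flink
# skip the state of the removed FlinkKafkaConsumer operator in the savepoint
bin/flink run --allowNonRestoredState -s <savepointPath> your-upgraded-job.jar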

Related

Can we update a consumer offset in kafka 0.10?

I am using older version of Kafka, 0.10. Is there any way by which I can update the consumer offset for a topic to an arbitrary number?
In Kafka 0.10, I don't think there was a tool to easily update a consumer offset.
You basically have 2 options:
Use the tools from a more recent Kafka version. Nowadays, consumer offsets can be updated using either the kafka-consumer-groups.sh tool or the AdminClient (only in trunk at the moment, it will be in Kafka 2.5).
Write a small application that starts a consumer and calls commitSync() to update its consumer offsets, like in ConsumerGroupCommand.resetOffsets().
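A minimal sketch of that second option with the Java consumer API (broker address, group, topic, partition, and target offset are placeholders; run it while the rest of the consumer group is stopped so the new offset is not immediately overwritten):
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class OffsetResetter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group"); // the consumer group whose offset you want to change
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition tp = new TopicPartition("my-topic", 0);
            consumer.assign(Collections.singletonList(tp));
            // commitSync with an explicit offset map overwrites the stored offset for this group
            consumer.commitSync(Collections.singletonMap(tp, new OffsetAndMetadata(42L)));
        }
    }
}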

Will flink resume from the last offset after executing yarn application kill and running again?

I use FlinkKafkaConsumer to consume Kafka and have enabled checkpointing. Now I'm a little confused about the offset management and the checkpoint mechanism.
I already know that Flink will, by default, start reading partitions from the consumer group's committed offsets.
https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-start-position-configuration
and that the offsets will be stored in checkpoints on a remote file system.
https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html#kafka-consumers-and-fault-tolerance
What happens if I stop the application by executing yarn application -kill appid
and then run the start command like ./bin/flink run ...?
Will Flink get the offsets from the checkpoint or from the group-id managed by Kafka?
If you run the job again without specifying a savepoint ($ bin/flink run -s :savepointPath [:runArgs]), Flink will try to get the offsets of your consumer group from Kafka (in older versions from ZooKeeper). But you will lose all other state of your Flink job (which might be ignorable if you have a stateless Flink job).
I must admit that this behaviour is quite confusing. By default, starting a job without a savepoint is like starting from zero. As far as I know, only the implementation of the Kafka source differs from that behaviour. If you want to change it, you can configure a different start position on the FlinkKafkaConsumer[08/09/10], e.g. setStartFromEarliest() or setStartFromLatest() instead of the default setStartFromGroupOffsets(). This is described here: Kafka Consumers Start Position Configuration
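For illustration, a rough sketch of how those start-position options look on the old consumer (topic name, schema, properties, and env are placeholders from your own job):
FlinkKafkaConsumer010<String> consumer =
    new FlinkKafkaConsumer010<>("my-topic", new SimpleStringSchema(), properties);
// default: start from the offsets committed for the consumer group in Kafka
consumer.setStartFromGroupOffsets();
// alternatives that ignore the committed group offsets:
// consumer.setStartFromEarliest();
// consumer.setStartFromLatest();
env.addSource(consumer);
// note: these settings do not apply when the job is restored from a savepoint/checkpoint,
// in which case the offsets stored in that snapshot take precedence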
It might be worth having a closer look at the documentation of flink: What is a savepoint and how does it differ from checkpoints.
In a nutshell
Checkpoints:
The primary purpose of Checkpoints is to provide a recovery mechanism in case of unexpected job failures. A Checkpoint’s lifecycle is managed by Flink
Savepoints:
Savepoints are created, owned, and deleted by the user. Their use-case is for planned, manual backup and resume
There are currently ongoing discussions on how to "unify" savepoints and checkpoints. You can find a lot of technical details here: FLIP-47: Checkpoints vs Savepoints
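To make the difference concrete, savepoints are triggered and passed back explicitly on the command line, while checkpoints are taken automatically according to the job's checkpoint configuration. A rough sketch (job id, directory, and jar name are placeholders):
# trigger a savepoint for a running job; Flink prints the resulting savepoint path
bin/flink savepoint <jobId> hdfs:///flink/savepoints
# later, resume a (possibly updated) job from that savepoint
bin/flink run -s <savepointPath> your-job.jar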

Flink kafka - Flink job not sending messages to different partitions

I have the below configuration:
One kafka topic with 2 partitions
One zookeeper instance
One kafka instance
Two consumers with same group id
Flink job snippet:
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName, new SimpleStringSchema(), props));
Scenario 1:
I have written a Flink job (producer) in Eclipse that reads a file from a folder and puts the messages on a Kafka topic.
When I run this code from Eclipse, it works fine.
For example: if I place a file with 100 records, Flink sends some messages to partition 1 and some to partition 2, and hence both consumers get some messages.
Scenario 2:
When I create a jar of the above code and run it on the Flink server, Flink sends all the messages to a single partition and hence only one consumer gets all the messages.
I want the behaviour of scenario 1 using the jar created in scenario 2.
For Flink Kafka producers, add null (cast to FlinkKafkaPartitioner) as the last parameter:
speStream.addSink(new FlinkKafkaProducer011(
kafkaTopicName,
new SimpleStringSchema(),
props,
(FlinkKafkaPartitioner) null)
);
The short explanation is that this stops Flink from using the default partitioner, FlinkFixedPartitioner. With the default turned off, Kafka is allowed to distribute the data amongst its partitions as it sees fit. If it is NOT turned off, then each parallel subtask/task slot of the sink that uses the FlinkKafkaProducer will only ever write to one partition.
If you neither provide a FlinkKafkaPartitioner nor explicitly say to use Kafka's partitioner, a FlinkFixedPartitioner will be used, meaning that all events from one task end up in the same partition.
To use Kafka's partitioner, use this constructor:
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName, new SimpleStringSchema(), props, Optional.empty()));
The difference between running from the IDE (Eclipse) and on the Flink cluster is probably due to a different setup for parallelism or partitioning within Flink.
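If neither the fixed partitioner nor Kafka's own partitioning fits, you can also plug in a custom FlinkKafkaPartitioner via the same Optional-based constructor. A rough sketch (the round-robin logic and class name are purely illustrative, and the base-class signature should be double-checked against your connector version):
import org.apache.flink.streaming.connectors.kafka.partitioner.FlinkKafkaPartitioner;

// cycles through all partitions of the target topic instead of pinning one per subtask
public class RoundRobinPartitioner<T> extends FlinkKafkaPartitioner<T> {
    private int next = -1;

    @Override
    public int partition(T record, byte[] key, byte[] value, String targetTopic, int[] partitions) {
        next = (next + 1) % partitions.length;
        return partitions[next];
    }
}

// wiring it into the sink:
speStream.addSink(new FlinkKafkaProducer011<String>(
    kafkaTopicName, new SimpleStringSchema(), props, Optional.of(new RoundRobinPartitioner<String>())));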

Increase number of partitions in a Kafka topic from a Kafka client

I'm a new user of Apache Kafka and I'm still getting to know the internals.
In my use case, I need to increase the number of partitions of a topic dynamically from the Kafka Producer client.
I found other similar questions about increasing the number of partitions, but they use the ZooKeeper configuration. My KafkaProducer has only the Kafka broker config, not the ZooKeeper config.
Is there any way I can increase the number of partitions of a topic from the Producer side? I'm running Kafka version 0.10.0.0.
As of Kafka 0.10.0.1 (latest release): As Manav said it is not possible to increase the number of partitions from the Producer client.
Looking ahead (next releases): In an upcoming version of Kafka, clients will be able to perform some topic management actions, as outlined in KIP-4. A lot of the KIP-4 functionality is already completed and available in Kafka's trunk; the code in trunk as of today allows clients to create and delete topics. But unfortunately, for your use case, increasing the number of partitions is still not possible -- it is in scope for KIP-4 (see Alter Topics Request) but not completed yet.
TL;DR: The next versions of Kafka will allow you to increase the number of partitions of a Kafka topic, but this functionality is not yet available.
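For readers on later Kafka releases: once the KIP-4 style admin APIs shipped, increasing partitions from a Java client became possible through the AdminClient. A rough sketch, assuming a recent kafka-clients dependency (much newer than the 0.10.0.0 discussed above) and a topic named my-topic:
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class AddPartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            // grow my-topic to 4 partitions; partition counts can only be increased, never decreased
            admin.createPartitions(
                Collections.singletonMap("my-topic", NewPartitions.increaseTo(4))).all().get();
        }
    }
}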
It is not possible to increase the number of partitions from the Producer client.
Is there any specific use case why you cannot use the broker to achieve this?
But my kafkaProducer has only the Kafka broker config, but not the zookeeper config.
I don't think any client will let you change the broker config. You can only access (read) the server-side config at most.
Your producer can provide different keys for its ProducerRecords; the broker would then place them in different partitions. For example, if you need two partitions, use keys "abc" and "xyz".
This can be done in version 0.9 as well.
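A minimal sketch of that keyed-producer approach (broker address, topic, and keys are just examples; records with the same key always land in the same partition, while different keys are hashed across the available partitions):
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // keys "abc" and "xyz" hash to (very likely) different partitions of the topic
            producer.send(new ProducerRecord<>("my-topic", "abc", "payload for abc"));
            producer.send(new ProducerRecord<>("my-topic", "xyz", "payload for xyz"));
        }
    }
}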

In Storm, how to migrate offsets to store in Kafka?

I've been having all sorts of instabilities related to Kafka and offsets: things like workers crashing on startup with exceptions related to invalid offsets, and other things I don't understand.
I read that it is recommended to migrate offsets to be stored in Kafka instead of Zookeeper. I found the below in the Kafka documentation:
Migrating offsets from ZooKeeper to Kafka: Kafka consumers in earlier releases store their offsets by default in ZooKeeper. It is possible to migrate these consumers to commit offsets into Kafka by following these steps:
1. Set offsets.storage=kafka and dual.commit.enabled=true in your consumer config.
2. Do a rolling bounce of your consumers and then verify that your consumers are healthy.
3. Set dual.commit.enabled=false in your consumer config.
4. Do a rolling bounce of your consumers and then verify that your consumers are healthy.
A roll-back (i.e., migrating from Kafka back to ZooKeeper) can also be performed using the above steps if you set offsets.storage=zookeeper.
http://kafka.apache.org/documentation.html#offsetmigration
But, again, I don't understand what this is instructing me to do. I don't see anywhere in my topology config where I configure where offsets are stored. Is it buried in the cluster yaml?
Any advice on if storing offsets in Kafka, rather than Zookeeper, is a good idea? And how I can perform this change?
At the time of this writing Storm's Kafka spout (see documentation/README at https://github.com/apache/storm/tree/master/external/storm-kafka) only supports managing consumer offsets in ZooKeeper. That is, all current Storm versions (up to 0.9.x and including 0.10.0 Beta) still rely on ZooKeeper for storing such offsets. Hence you should not perform the ZK->Kafka offset migration you referenced above because Storm isn't compatible yet.
You will need to wait until the Storm project -- specifically, its Kafka spout -- supports managing consumer offsets via Kafka (instead of ZooKeeper). And yes, in general it is better to store consumer offsets in Kafka rather than ZooKeeper, but alas Storm isn't there yet.
Update November 2016:
The situation in Storm has improved in the meantime. There's now a new, second Kafka spout that is based on Kafka's new 0.10 consumer client, which stores consumer offsets in Kafka (and not in ZooKeeper): https://github.com/apache/storm/tree/master/external/storm-kafka-client.
However, at the time I am writing this, there are still several issues being reported by users on the storm-user mailing list (such as "Urgent help! kafka-spout stops fetching data after running for a while"), so I'd use this new Kafka spout with care, and only after thorough testing.