Flink Kafka: Flink job not sending messages to different partitions

I have the below configuration:
One Kafka topic with 2 partitions
One ZooKeeper instance
One Kafka instance
Two consumers with the same group id
Flink job snippet:
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName, new SimpleStringSchema(), props));
Scenario 1:
I have written a Flink job (producer) in Eclipse which reads a file from a folder and puts the messages on the Kafka topic.
When I run this code from Eclipse, it works fine.
For example: if I place a file with 100 records, Flink sends some messages to partition 1 and some to partition 2, and hence both consumers get some messages.
Scenario 2:
When I create a jar of the above code and run it on the Flink server, Flink sends all the messages to a single partition, and hence only one consumer gets all the messages.
I want the behaviour of scenario 1 using the jar created in scenario 2.

For Flink-Kafka Producers, add "null" as the last parameter.
speStream.addSink(new FlinkKafkaProducer011(
kafkaTopicName,
new SimpleStringSchema(),
props,
(FlinkKafkaPartitioner) null)
);
The short explanation is that this stops Flink from using its default partitioner, FlinkFixedPartitioner, and lets Kafka distribute the data among its partitions as it sees fit. If it is NOT turned off, then each parallel sink subtask (task slot) that uses the FlinkKafkaProducer writes to only one partition.
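If you additionally want the records keyed, so that Kafka's own key-based partitioning decides the target partition, here is a minimal sketch using a KeyedSerializationSchema; the key-extraction logic (taking the first CSV field of the record) is just an assumption for illustration:
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchema;

speStream.addSink(new FlinkKafkaProducer011<String>(
    kafkaTopicName,
    new KeyedSerializationSchema<String>() {
        @Override
        public byte[] serializeKey(String element) {
            // assumption: the first CSV field of the record is used as the Kafka key
            return element.split(",")[0].getBytes(StandardCharsets.UTF_8);
        }
        @Override
        public byte[] serializeValue(String element) {
            return element.getBytes(StandardCharsets.UTF_8);
        }
        @Override
        public String getTargetTopic(String element) {
            return null; // null falls back to kafkaTopicName
        }
    },
    props,
    Optional.empty())); // no Flink partitioner: Kafka hashes the key to pick the partition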

If you do not provide a FlinkKafkaPartitioner, and do not explicitly say to use Kafka's partitioner, the FlinkFixedPartitioner will be used, meaning that all events from one task end up in the same partition.
To use Kafka's partitioner, use this constructor:
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName, new SimpleStringSchema(), props, Optional.empty()));
The difference between running from the IDE and running the jar on the Flink cluster is probably due to a different parallelism or partitioning setup within Flink.
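To rule that out, you can pin the parallelism explicitly so the job behaves the same in the IDE and on the cluster; a minimal sketch (the value 2 is just an example):
// the cluster picks up parallelism.default from flink-conf.yaml, which usually differs from the IDE default
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(2); // job-wide

// or only for the Kafka sink:
speStream
    .addSink(new FlinkKafkaProducer011(kafkaTopicName, new SimpleStringSchema(), props, Optional.empty()))
    .setParallelism(2);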

Related

Migrate Kafka Topic to new Cluster (and impact on Druid)

I am ingesting data into Druid from a Kafka topic. Now I want to migrate my Kafka topic to a new Kafka cluster. What are the possible ways to do this without duplication of data and without downtime?
I have considered the following possible ways to migrate the topic to the new Kafka cluster.
Manual Migration:
Create a topic with the same configuration in the new Kafka cluster.
Stop pushing data to the old Kafka cluster.
Start pushing data to the new cluster.
Stop consuming from the old cluster.
Start consuming from the new cluster.
Produce data in both Kafka clusters:
Create a topic with the same configuration in the new Kafka cluster.
Start producing messages to both Kafka clusters.
Change the Kafka topic configuration in Druid.
Reset Kafka topic offset in Druid.
Start consuming from the new cluster.
After successful migration, stop producing in the old Kafka cluster.
Use Mirror Maker 2:
MM2 creates the Kafka topic in the new cluster.
Start replicating data between the clusters.
Move producer and consumer to the new Kafka cluster.
The problems with this approach:
Druid manages the Kafka topic's offsets in its metadata.
MM2 will create two topics with the same name (with a prefix) in the new cluster.
Does Druid support topic names with a regex?
Note: Druid manages Kafka topic offsets in its metadata.
Druid Version: 0.22.1
Old Kafka Cluster Version: 2.0
Maybe a slight modification of your option 1:
Start publishing to the new cluster.
Wait for the current supervisor to catch up on all the data in the old topic.
Suspend the supervisor (see the sketch after these steps for the suspend/terminate calls). This forces all the tasks to write and publish their segments. Wait for all the tasks for this supervisor to succeed. This is where the "downtime" starts. All of the data ingested so far is still queryable while we switch to the new cluster; new data accumulates in the new cluster but is not yet ingested into Druid.
All the offset information for the current datasource is stored in the metadata storage. Delete those records using
delete from druid_dataSource where datasource={name}
Terminate the current supervisor.
Submit the new spec with the new topic and new server information.
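Suspending and terminating the supervisor in the steps above go through the Overlord's supervisor API; a minimal sketch in Java, assuming the Overlord is reachable at http://overlord:8090 and the supervisor id is "my-datasource" (both placeholders):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SupervisorMigration {
    public static void main(String[] args) throws Exception {
        String overlord = "http://overlord:8090";   // placeholder Overlord URL
        String supervisorId = "my-datasource";      // placeholder supervisor id
        HttpClient client = HttpClient.newHttpClient();

        // Suspend: forces the running tasks to publish their segments and stop
        HttpRequest suspend = HttpRequest.newBuilder()
                .uri(URI.create(overlord + "/druid/indexer/v1/supervisor/" + supervisorId + "/suspend"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        System.out.println(client.send(suspend, HttpResponse.BodyHandlers.ofString()).body());

        // ... delete the druid_dataSource offset rows in the metadata store here ...

        // Terminate the supervisor before submitting the new spec
        HttpRequest terminate = HttpRequest.newBuilder()
                .uri(URI.create(overlord + "/druid/indexer/v1/supervisor/" + supervisorId + "/terminate"))
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
        System.out.println(client.send(terminate, HttpResponse.BodyHandlers.ofString()).body());
    }
}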
You can follow these steps:
1- On your new cluster, create your new topic (the same name or a new name, it doesn't matter).
2- Change your app config to send messages to the new Kafka cluster.
3- Wait until Druid has consumed all messages from the old Kafka cluster; you can verify this by checking the supervisor's lag and offset info (see the sketch after these steps).
4- Suspend the task, and wait for the tasks to publish their segments and exit successfully.
5- Edit Druid's datasource, make sure useEarliestOffset is set to true, and change the info to consume from the new Kafka cluster (and the new topic name if it isn't the same).
6- Save the schema and resume the task. Druid will hit a wall when checking the offsets, because it cannot find them in the new Kafka cluster, and will then start from the beginning.
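For step 3, the lag/offset check can be scripted against the supervisor status endpoint; a minimal sketch (Overlord URL and supervisor id are placeholders, and the exact lag fields in the returned JSON vary by Druid version):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SupervisorLagCheck {
    public static void main(String[] args) throws Exception {
        String overlord = "http://overlord:8090";   // placeholder
        String supervisorId = "my-datasource";      // placeholder
        HttpRequest status = HttpRequest.newBuilder()
                .uri(URI.create(overlord + "/druid/indexer/v1/supervisor/" + supervisorId + "/status"))
                .GET()
                .build();
        String body = HttpClient.newHttpClient()
                .send(status, HttpResponse.BodyHandlers.ofString())
                .body();
        // Inspect the reported offsets/lag in the JSON before suspending the task
        System.out.println(body);
    }
}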
Options 1 and 2 will have downtime and you will lose all data in the existing topic.
Option 2 cannot guarantee you won't lose data or generate duplicates as you try to send messages to multiple clusters together.
There will be no way to migrate the Druid/Kafka offset data to the new cluster without at least trying MM2. You say you can reset the offsets in option 2, so why not do the same with option 3? I haven't used Druid, but it should be able to support consuming from multiple topics, with a pattern or not. With option 3, you don't need to modify any producer code until you are satisfied with the migration process.

NiFi: Create a Kafka topic with a specific number of partitions/replication factor using PublishKafka

I am using Apache NiFi version 1.10.0. I have put some data into Kafka from NiFi using the PublishKafka_2_0 processor. I have three Kafka brokers running alongside NiFi. I am getting the data from NiFi, but the topic that is created has a replication factor of 1 and 1 partition.
How can I change the default values of the replication factor and partitions when creating a new topic through PublishKafka? In other words, I want the processor to create new topics with partitions=3 and replication-factor=3 instead of 1.
I understand that this can be changed after the topic is created, but I would like it to be done dynamically at creation time.
If I understand your setup correctly, you are relying on the client side for topic creation, i.e. topics are created when NiFi attempts to produce/consume/fetch metadata for a non-existent topic. In this scenario, Kafka will use num.partitions and default.replication.factor settings for a new topic that are defined in broker config. (Kafka defaults to 1 for both.) Currently, updating these values in server.properties is the only way to control auto-created topics' configuration.
KIP-487 is being worked on to allow producers to control topic creation (as opposed to it being a server-side, one-size-fits-all verdict), but even in that implementation there is no plan for the client to control the number of partitions or the replication factor.
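If relying on broker-side auto-creation is not a hard requirement, you can also create the topic explicitly before PublishKafka writes to it, for example with the Kafka AdminClient; a minimal sketch (broker addresses and topic name are placeholders):
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092,broker3:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3 (requires at least 3 brokers)
            NewTopic topic = new NewTopic("my-topic", 3, (short) 3); // placeholder topic name
            admin.createTopics(Collections.singleton(topic)).all().get();
        }
    }
}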

Two Kafka consumers in the same group and one partition

I have a Kafka topic with one partition. When I created the consumer I missed mentioning the group.id, and there are 2 different consumers for the same partition.
I can receive messages, and my Flink source extends FlinkKafkaConsumer011.
Ideally only one consumer should receive the messages since it is the same group, but in my case both consumers are receiving messages, and I am not sure why.
To test, I restarted one of my jobs, and the newly started consumer does not pick up from where it left off, because the other consumer is committing the offsets, it seems.
Flink 1.6.0
Kafka 0.11.2

How exactly does the Apache NiFi ConsumeKafka_1_0 processor work?

I have a NiFi cluster, and Kafka is also installed there.
I created one topic with 5 partitions and started consuming that topic with one group id, so that each partition gets unique messages.
Now I created 5 ConsumeKafka_1_0 processors with the intent of getting unique messages on each consumer side. But only 2 of the ConsumeKafka_1_0 processors are consuming all the messages; the rest are sitting idle.
Then I started 5 command-line Kafka consumers, and I was able to see that all the partitions were getting messages and could consume them from the command-line consumers in a round-robin fashion.
I also tried describing the Kafka group, and what I saw was that only 2 of the NiFi ConsumeKafka_1_0 processors are consuming from all 5 partitions and the rest are idle; see the snapshot.
Would you please let me know what I am doing wrong here with the NiFi consumer processor?
Note: I used NiFi version 1.5 and Kafka version 1.0.
I've written this article which explains how the integration with Kafka works:
https://bryanbende.com/development/2016/09/15/apache-nifi-and-apache-kafka
The Apache Kafka client (used by NiFi) is what assigns partitions to the consumers.
Typically if you had a 5 node NiFi cluster, with 1 ConsumeKafka processor on the canvas with 1 concurrent task, then each node would be consuming 1 partition.
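For reference, partition assignment works the same for a plain Java consumer: run five copies of something like the sketch below with the same group.id against a 5-partition topic and each instance should be assigned one partition (broker address, topic and group id are placeholders):
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092");   // placeholder
        props.put("group.id", "my-group");               // same group id on every instance
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
            for (int i = 0; i < 100; i++) {
                // poll(long) works on the 1.0 client; newer clients also offer poll(Duration)
                ConsumerRecords<String, String> records = consumer.poll(1000);
                for (ConsumerRecord<String, String> r : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n", r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}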

Increase number of partitions in a Kafka topic from a Kafka client

I'm a new user of Apache Kafka and I'm still getting to know the internals.
In my use case, I need to increase the number of partitions of a topic dynamically from the Kafka Producer client.
I found other similar questions about increasing the number of partitions, but they use the ZooKeeper configuration. My KafkaProducer has only the Kafka broker config, not the ZooKeeper config.
Is there any way I can increase the number of partitions of a topic from the Producer side? I'm running Kafka version 0.10.0.0.
As of Kafka 0.10.0.1 (the latest release at the time of writing): as Manav said, it is not possible to increase the number of partitions from the Producer client.
Looking ahead (next releases): in an upcoming version of Kafka, clients will be able to perform some topic management actions, as outlined in KIP-4. A lot of the KIP-4 functionality is already completed and available in Kafka's trunk; the code in trunk as of today allows clients to create and delete topics. But unfortunately, for your use case, increasing the number of partitions is still not possible yet -- it is in scope for KIP-4 (see Alter Topics Request) but is not completed yet.
TL;DR: The next versions of Kafka will allow you to increase the number of partitions of a Kafka topic, but this functionality is not yet available.
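In later Kafka releases this landed in the AdminClient as createPartitions; a minimal sketch against a newer client, with placeholder broker address and topic name:
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewPartitions;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        try (AdminClient admin = AdminClient.create(props)) {
            // Grow "my-topic" to 4 partitions in total (partitions can only be increased, never decreased)
            admin.createPartitions(Collections.singletonMap("my-topic", NewPartitions.increaseTo(4)))
                 .all().get();
        }
    }
}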
It is not possible to increase the number of partitions from the Producer client.
Any specific use case why you cannot use the broker to achieve this?
"But my kafkaProducer has only the Kafka broker config, but not the zookeeper config."
I don't think any client will let you change the broker config. You can only read the server-side config, at most.
Your producer can provide different keys for its ProducerRecords; the partitioner will then place them in different partitions based on the key hash. For example, if you need two partitions, use keys such as "abc" and "xyz".
This can be done in version 0.9 as well.
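A minimal sketch of the keyed-record idea (broker address and topic are placeholders); note that the partition is derived from a hash of the key, so distinct keys usually, but not necessarily, land in distinct partitions:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // placeholder
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with key "abc" always go to one partition, records with key "xyz" to another
            // (assuming the two keys hash to different partitions).
            producer.send(new ProducerRecord<>("my-topic", "abc", "payload-1"));
            producer.send(new ProducerRecord<>("my-topic", "xyz", "payload-2"));
        }
    }
}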