I'm looking for a way to distribute messages between two Kafka topics. The original topic has 20 partitions with 1,000,000 messages per partition. I want a new topic with 1,000 partitions and to spread the messages across that wider partition range.
1T -> 20P -> 1000000 messages per partition (total 20m/topic)
2T -> 1000P -> 20000 messages per partition (total 20m/topic)
Is it possible to do that in Kafka (via topic mirroring or some other technique)?
You could use MirrorMaker (version 1), which comes with Kafka. This tool is mainly used to replicate data from one data center to another and is built on the assumption that the topic names stay the same in both clusters.
However, you can provide a custom MessageHandler that renames the topic.
package org.xxx.java;

import java.util.Collections;
import java.util.List;

import kafka.consumer.BaseConsumerRecord;
import kafka.tools.MirrorMaker;

import org.apache.kafka.clients.producer.ProducerRecord;

/**
 * An example implementation of MirrorMakerMessageHandler that renames the topic.
 */
public class TopicRenameHandler implements MirrorMaker.MirrorMakerMessageHandler {

    private final String newName;

    // MirrorMaker passes the value of --message.handler.args to this constructor
    public TopicRenameHandler(String newName) {
        this.newName = newName;
    }

    public List<ProducerRecord<byte[], byte[]>> handle(BaseConsumerRecord record) {
        // Deliberately no explicit partition here: leaving it unset lets the producer's
        // default partitioner spread the records across all partitions of the new topic
        return Collections.singletonList(
                new ProducerRecord<byte[], byte[]>(newName, record.key(), record.value()));
    }
}
I used the following dependencies in my pom.xml file
<properties>
    <kafka.version>2.5.0</kafka.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka-clients</artifactId>
        <version>${kafka.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.13</artifactId>
        <version>${kafka.version}</version>
    </dependency>
</dependencies>
Compile the code above and make sure to add the resulting jar to the CLASSPATH:
export CLASSPATH=$CLASSPATH:/.../target/MirrorMakerRenameTopics-1.0.jar
Now, together with some basic consumer.properties
bootstrap.servers=localhost:9092
client.id=mirror-maker-consumer
group.id=mirror-maker-rename-topic
auto.offset.reset=earliest
and producer.properties
bootstrap.servers=localhost:9092
client.id=mirror-maker-producer
you can call the kafka-mirror-maker as below
kafka-mirror-maker --consumer.config /path/to/consumer.properties \
--producer.config /path/to/producer.properties \
--num.streams 1 \
--whitelist="topicToBeRenamed" \
--message.handler org.xxx.java.TopicRenameHandler \
--message.handler.args "newTopicName"
Please note the following two caveats with this approach:
Since you are changing the number of partitions, the ordering of messages within the new topic may differ from the old topic. By default, Kafka partitions messages by their key (a small sketch of that behaviour follows below).
MirrorMaker will not copy the existing offsets of the old topic but will write new ones, so there will be (almost) no relation between the offsets of the old topic and the offsets of the new topic.
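To make the first caveat concrete, here is a minimal sketch (not part of the original answer) of what the producer's default partitioner does for a record with a non-null key: it hashes the serialized key with murmur2 and takes the result modulo the partition count, so the same key can land on a different partition number once the partition count changes. Utils.murmur2 and Utils.toPositive come from kafka-clients; the key and the partition counts are just examples.

import org.apache.kafka.common.utils.Utils;

public class PartitionForKey {

    // Approximates the producer's default partitioning for keyed records
    static int partitionForKey(byte[] keyBytes, int numPartitions) {
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    public static void main(String[] args) {
        byte[] key = "user-42".getBytes();
        System.out.println(partitionForKey(key, 20));    // partition in the old 20-partition topic
        System.out.println(partitionForKey(key, 1000));  // partition in the new 1000-partition topic
    }
}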
Related
I have a Spring Boot based microservice, and I use Kafka to produce/consume data from different systems.
Now my question is: I have two different topics, and based on the topic there are two different consumer classes to consume the data.
How do I define multiple consumer properties in the application.yml file?
I configured one consumer in application.yml like below:
spring:
  kafka:
    consumer:
      bootstrapservers: http://199.968.98.101:9092
      group-id: groupid-QA-02
      auto-offset-reset: latest
      key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
      value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
I am using @KafkaListener in my consumer classes. Here is an example of a consumer method I used in the code:
@KafkaListener(topics = "${app.topic.b2b_tf_ta_req}", groupId = "${app.topic.groupoId}")
public void consume(String message) throws Exception {
}
As far as I know, bootstrap-servers accepts a comma-separated list of servers,
i.e. if you set it to server1:9092,server2:9092, Kafka should connect to all of them.
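For illustration only (the class name is made up and the server addresses are the example values from above), this is the plain kafka-clients equivalent of such a configuration; Spring builds essentially the same consumer properties from the spring.kafka.consumer.* entries:

import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class MultiBrokerConsumerExample {

    public static void main(String[] args) {
        Properties props = new Properties();
        // Comma-separated list: the consumer can bootstrap from any of these brokers
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "server1:9092,server2:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "groupid-QA-02");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.close();
    }
}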
How can I list the LAG offset for a specific topic and group ID?
For now I would like to write a tool in Java that shows the current offset and LAG for a Kafka topic and group ID.
This tool should be usable as a library, so the Kafka offset-checker command line is not my solution.
Moreover, I don't want to start a consumer for doing this, because that consumer would join the group and be assigned partitions, which is intrusive.
Can anyone tell me whether there is a tool/API that I can use for this?
My Kafka version is:
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.10</artifactId>
    <version>0.9.0.1</version>
</dependency>
I am using Kafka with Flink.
In a simple program, I used Flink's FlinkKafkaConsumer09 and assigned a group id to it.
According to Kafka's behavior, when I run 2 consumers on the same topic with the same group.id, it should work like a message queue. I think it's supposed to work like this:
if 2 messages are sent to Kafka, the two Flink programs together would process the 2 messages (let's say 2 lines of output in total).
But the actual result is that each program receives both messages.
I have tried the consumer client that comes with the Kafka server download. It worked in the documented way (2 messages processed in total).
I tried to use 2 Kafka consumers in the same main function of a Flink program: 4 messages processed in total.
I also tried to run 2 instances of Flink and assigned each of them the same Kafka consumer program: 4 messages.
Any ideas?
This is the output I expect:
1> Kafka and Flink2 says: element-65
2> Kafka and Flink1 says: element-66
Here's the wrong output I always get:
1> Kafka and Flink2 says: element-65
1> Kafka and Flink1 says: element-65
2> Kafka and Flink2 says: element-66
2> Kafka and Flink1 says: element-66
And here is the segment of code:
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    ParameterTool parameterTool = ParameterTool.fromArgs(args);

    DataStream<String> messageStream = env.addSource(
            new FlinkKafkaConsumer09<>(parameterTool.getRequired("topic"),
                    new SimpleStringSchema(), parameterTool.getProperties()));

    messageStream.rebalance().map(new MapFunction<String, String>() {
        private static final long serialVersionUID = -6867736771747690202L;

        @Override
        public String map(String value) throws Exception {
            return "Kafka and Flink1 says: " + value;
        }
    }).print();

    env.execute();
}
I have tried to run it twice and also in another way:
creating 2 DataStreams and calling env.execute() for each one in the main function.
There was a quite similar question on the Flink user mailing list today, but I can't find the link to post it here. So here is part of the answer:
"Internally, the Flink Kafka connectors don’t use the consumer group
management functionality because they are using lower-level APIs
(SimpleConsumer in 0.8, and KafkaConsumer#assign(…) in 0.9) on each
parallel instance for more control on individual partition
consumption. So, essentially, the “group.id” setting in the Flink
Kafka connector is only used for committing offsets back to ZK / Kafka
brokers."
Maybe that clarifies things for you.
Also, there is a blog post about working with Flink and Kafka that may help you (https://data-artisans.com/blog/kafka-flink-a-practical-how-to).
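As an illustration of the quoted explanation, here is a minimal sketch (topic, group and partition names are made up) of the difference between group management via subscribe() and manual assignment via assign() in the plain KafkaConsumer API; the Flink connector uses the latter:

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class AssignVsSubscribe {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group"); // with assign(), only used when committing offsets
        props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        // subscribe(): the group coordinator balances partitions across all
        // consumers that share the same group.id
        KafkaConsumer<byte[], byte[]> subscribing = new KafkaConsumer<>(props);
        subscribing.subscribe(Collections.singletonList("my-topic"));

        // assign(): the client picks its partitions itself; no group management,
        // so two processes doing this will both read partition 0
        KafkaConsumer<byte[], byte[]> assigning = new KafkaConsumer<>(props);
        assigning.assign(Collections.singletonList(new TopicPartition("my-topic", 0)));

        subscribing.close();
        assigning.close();
    }
}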
Since there is not much use of the group.id of the Flink Kafka consumer other than committing offsets to ZooKeeper, is there any way of monitoring offsets for the Flink Kafka consumer? I can see there is a way [with the help of consumer groups / the consumer offset checker] for console consumers, but not for Flink Kafka consumers.
We want to see how far our Flink Kafka consumer is lagging behind the Kafka topic size [the total number of messages in the topic at a given point in time]; it is fine to have this at partition level.
My objective is to set up a high-throughput cluster using Kafka as the source and Flink as the stream processing engine. Here's what I have done.
I have set up a 2-node cluster with the following configuration on the master and the workers.
Master flink-conf.yaml
jobmanager.rpc.address: <MASTER_IP_ADDR> #localhost
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 256
taskmanager.heap.mb: 512
taskmanager.numberOfTaskSlots: 50
parallelism.default: 100
Worker flink-conf.yaml
jobmanager.rpc.address: <MASTER_IP_ADDR> #localhost
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 512 #256
taskmanager.heap.mb: 1024 #512
taskmanager.numberOfTaskSlots: 50
parallelism.default: 100
The slaves file on the Master node looks like this:
<WORKER_IP_ADDR>
localhost
The flink setup on both nodes is in a folder which has the same name. I start up the cluster on the master by running
bin/start-cluster-streaming.sh
This starts up the task manager on the Worker node.
My input source is Kafka. Here is the snippet.
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

DataStreamSource<String> stream = env.addSource(
        new KafkaSource<String>(kafkaUrl, kafkaTopic, new SimpleStringSchema()));

stream.addSink(stringSinkFunction);
env.execute("Kafka stream");
Here is my Sink function
public class MySink implements SinkFunction<String> {

    private static final long serialVersionUID = 1L;

    public void invoke(String arg0) throws Exception {
        processMessage(arg0);
        System.out.println("Processed Message");
    }
}
Here are the Flink Dependencies in my pom.xml.
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-core</artifactId>
    <version>0.9.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients</artifactId>
    <version>0.9.0</version>
</dependency>
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka</artifactId>
    <version>0.9.0</version>
</dependency>
Then I run the packaged jar with this command on the master
bin/flink run flink-test-jar-with-dependencies.jar
However, when I insert messages into the Kafka topic, all of them are accounted for on the Master node alone (via the debug messages in the invoke method of my SinkFunction implementation).
In the Job manager UI I am able to see 2 Task managers as below:
The dashboard also looks like this:
Questions:
Why are the worker nodes not getting the tasks?
Am I missing some configuration?
When reading from a Kafka source in Flink, the maximum degree of parallelism of the source task is limited by the number of partitions of the given Kafka topic. A Kafka partition is the smallest unit that a Flink source task can consume. If there are more partitions than source tasks, then some tasks will consume multiple partitions.
Consequently, in order to supply input to all 100 of your tasks, you should ensure that your Kafka topic has at least 100 partitions.
If you cannot change the number of partitions of your topic, then it is also possible to initially read from Kafka with a lower degree of parallelism using the setParallelism method. Alternatively, you can use the rebalance method, which shuffles your data across all available tasks of the following operation.
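As a rough sketch of the two options, reusing the KafkaSource snippet from the question (the parallelism value is just an example, assuming a 20-partition topic and the configured default parallelism of 100):

// Option 1: cap the source parallelism at the number of Kafka partitions
DataStreamSource<String> stream = env.addSource(
        new KafkaSource<String>(kafkaUrl, kafkaTopic, new SimpleStringSchema()));
stream.setParallelism(20); // e.g. the topic has 20 partitions

// Option 2: rebalance() after the source, so the records are redistributed
// round-robin across all tasks of the following operator, which runs at the
// default parallelism of 100
stream.rebalance().addSink(stringSinkFunction);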
In Kafka 0.8beta a topic can be created using a command like below as mentioned here
bin/kafka-create-topic.sh --zookeeper localhost:2181 --replica 2 --partition 3 --topic test
The above command will create a topic named "test" with 3 partitions and 2 replicas per partition.
Can I do the same thing using Java ?
So far, what I have found is that using Java we can create a producer as seen below:
Producer<String, String> producer = new Producer<String, String>(config);
producer.send(new KeyedMessage<String, String>("mytopic", msg));
This will create a topic named "mytopic" with the number of partitions specified by the "num.partitions" attribute, and start producing.
But is there a way to define the partitions and replication as well? I couldn't find any such example. If we can't, does that mean we always need to create topics with the desired partitions and replication beforehand and then use the producer to produce messages to that topic? For example, would it be possible to create "mytopic" the same way but with a different number of partitions (overriding the num.partitions attribute)?
Note: My answer covers Kafka 0.8.1+, i.e. the latest stable version available as of April 2014.
Yes, you can create a topic programmatically via the Kafka API, and yes, you can specify the desired number of partitions as well as the replication factor for the topic.
Note that the recently released Kafka 0.8.1+ provides a slightly different API than Kafka 0.8.0 (which was used by Biks in his linked reply). I added a code example for creating a topic in Kafka 0.8.1+ to my reply to the question "How can we create a topic in Kafka from the IDE using API" that Biks was referring to above.
import java.util.Properties;

import kafka.admin.AdminUtils;
import kafka.utils.ZKStringSerializer$;
import kafka.utils.ZkUtils;

import org.I0Itec.zkclient.ZkClient;
import org.I0Itec.zkclient.ZkConnection;

String zkConnect = "localhost:2181";
ZkClient zkClient = new ZkClient(zkConnect, 10 * 1000, 8 * 1000, ZKStringSerializer$.MODULE$);
ZkUtils zkUtils = new ZkUtils(zkClient, new ZkConnection(zkConnect), false);

// topic-level configuration overrides; an empty Properties object means broker defaults
Properties topicConfig = new Properties();

AdminUtils.createTopic(zkUtils, topic.getTopicName(), topic.getPartitionCount(),
        topic.getReplicationFactor(), topicConfig);

zkClient.close();