Copy data between kafka topics using kafka connectors - apache-kafka

I'm new to Kafka, and I need to copy data from one Kafka topic to another. I'm wondering what the possible ways to do so are. The ways I can think of are the following:
Kafka consumer + Kafka producer
Kafka streams
Kafka sink connector + producer
Kafka consumer + source connector
My question is: is it possible to use two Kafka connectors in between, e.g. a sink connector + a source connector? If so, could you please provide some good examples, or some hints on how to do so?
Thanks in advance!

All the methods you listed are possible. Which one is best really depends on how much control you want over the process and whether it's a one-off operation or something you want to keep running.
Kafka Streams offers an easy way to flow one topic into another via the DSL.
You could do something like this (demo code, obviously not for production!):
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

final Serde<byte[]> bytesSerdes = Serdes.ByteArray();
final StreamsBuilder builder = new StreamsBuilder();

// Read the input topic as raw bytes and write every record straight back out.
KStream<byte[], byte[]> input = builder.stream(
        "input-topic",
        Consumed.with(bytesSerdes, bytesSerdes)
);
input.to("output-topic", Produced.with(bytesSerdes, bytesSerdes));

final KafkaStreams streams = new KafkaStreams(builder.build(), props);
try {
    streams.start();
    Thread.sleep(60000L); // let the demo run for a minute
} catch (Exception e) {
    e.printStackTrace();
} finally {
    streams.close();
}
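If you just want a one-off copy with full control, the plain consumer + producer approach from your list is also straightforward. Here is a minimal sketch, assuming a recent (2.0+) Kafka client and byte-array (de)serializers so records are forwarded unchanged; the topic names, group id, and broker address are placeholders:

Properties consumerProps = new Properties();
consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "topic-copy"); // placeholder group id
consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

Properties producerProps = new Properties();
producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
     KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
    consumer.subscribe(Collections.singletonList("input-topic"));
    // Runs until the process is stopped.
    while (true) {
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<byte[], byte[]> record : records) {
            // Forward key and value as-is to the target topic.
            producer.send(new ProducerRecord<>("output-topic", record.key(), record.value()));
        }
    }
}

This gives you complete control over error handling and offset commits, but you have to run and supervise the process yourself, which is exactly the kind of thing Streams or Connect otherwise handle for you.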

Related

Flink kafka consumer fetch messages from specific partition

We want to achieve parallelism while reading messages from Kafka, hence we wanted to specify a partition number in FlinkKafkaConsumer. Instead, it reads messages from all partitions in Kafka rather than from a specific partition. Below is sample code:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "Message-Test-Consumers");
properties.setProperty("partition", "1"); //not sure about this syntax.
FlinkKafkaConsumer<String> kafkaConsumer = new FlinkKafkaConsumer<String>("EventLog", new SimpleStringSchema(), properties);
Please suggest a better option to get the parallelism.
I don't believe there is a mechanism to restrict which partitions Flink will read from. Nor do I see how this would help you achieve your goal of reading from the partitions in parallel, which Flink does regardless.
The Flink Kafka source connector reads from all available partitions, in parallel. Simply set the parallelism of the Kafka source to whatever parallelism you desire, keeping in mind that the effective parallelism cannot exceed the number of partitions. In this way, each instance of Flink's Kafka source connector will read from one or more partitions. You can also configure the Kafka consumer to automatically discover new partitions that may be created while the job is running.
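To make that concrete, here is a minimal sketch (the parallelism of 4 and the discovery interval are placeholders; the partition-discovery property key is the one I believe Flink's Kafka consumer uses, so double-check it against your connector version):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("group.id", "Message-Test-Consumers");
// Optionally let the consumer pick up partitions that are created while the job runs.
properties.setProperty("flink.partition-discovery.interval-millis", "60000");

FlinkKafkaConsumer<String> kafkaConsumer =
        new FlinkKafkaConsumer<>("EventLog", new SimpleStringSchema(), properties);

// Each of the 4 source subtasks is assigned one or more of the topic's partitions.
DataStream<String> stream = env.addSource(kafkaConsumer).setParallelism(4);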

Use the same topic as a source more than once with Kafka Streams DSL

Is there a way to use the same topic as the source for two different processing routines, when using Kafka Streams DSL?
StreamsBuilder streamsBuilder = new StreamsBuilder();
// use the topic as a stream
streamsBuilder.stream("topic")...
// use the same topic as a source for KTable
streamsBuilder.table("topic")...
return streamsBuilder.build();
The naive implementation above throws a TopologyException at runtime: Invalid topology: Topic topic has already been registered by another source. This is totally valid if we dive into the underlying Processor API. Is using the Processor API directly the only way out?
UPDATE:
The closest alternative I've found so far:
StreamsBuilder streamsBuilder = new StreamsBuilder();
final KStream<Object, Object> stream = streamsBuilder.stream("topic");
// use the topic as a stream
stream...
// create a KTable from the KStream
stream.groupByKey().reduce((oldValue, newValue) -> newValue)...
return streamsBuilder.build();
Reading the same topic as a stream and as a table is semantically questionable IMHO. Streams model immutable facts, while a changelog topic that you would read into a KTable models updates.
If you want to use a single topic in multiple streams, you can reuse the same KStream object multiple times (it's semantically like a broadcast):
KStream stream = ...
stream.filter();
stream.map();
Also compare https://issues.apache.org/jira/browse/KAFKA-6687 (there are plans to remove this restriction; I doubt we will allow using one topic as both a KStream and a KTable at the same time though, compare my comment above).
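To make the reuse concrete, here is a minimal sketch (the topic names and the lambdas are made-up placeholders) where the same KStream feeds two independent processing branches:

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> stream = builder.stream("topic");

// Branch 1: keep only non-empty values and write them to one output topic.
stream.filter((key, value) -> value != null && !value.isEmpty())
      .to("filtered-topic");

// Branch 2: independently transform every record and write it to another topic.
stream.mapValues(value -> value.toUpperCase())
      .to("uppercased-topic");

// Both branches see every record from "topic", just like a broadcast.
Topology topology = builder.build();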
Yes, you can, but for that you need to have multiple StreamsBuilder instances:
StreamsBuilder streamsBuilder1 = new StreamsBuilder();
streamsBuilder1.stream("topic");
StreamsBuilder streamsBuilder2 = new StreamsBuilder();
streamsBuilder2.table("topic");
Topology topology1 = streamsBuilder1.build();
Topology topology2 = streamsBuilder2.build();
KafkaStreams kafkaStreams1 = new KafkaStreams(topology1, streamsConfig1);
KafkaStreams kafkaStreams2 = new KafkaStreams(topology2, streamsConfig2);
Also make sure that you use different application.id values for each StreamsConfig.
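For completeness, a minimal sketch of the two configs (the application ids and broker address are placeholders); the distinct application.id values give the two KafkaStreams instances separate consumer groups and state directories:

Properties streamsConfig1 = new Properties();
streamsConfig1.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-as-stream-app");
streamsConfig1.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

Properties streamsConfig2 = new Properties();
streamsConfig2.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-as-table-app");
streamsConfig2.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");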

Reading latest data from Kafka broker in Apache Flink

I want to receive the latest data from Kafka in my Flink program, but Flink is reading the historical data instead.
I have set auto.offset.reset to latest as shown below, but it did not work:
properties.setProperty("auto.offset.reset", "latest");
The Flink program receives the data from Kafka using the code below:
// getting the stream from Kafka and assigning timestamps and watermarks
DataStream<JoinedStreamEvent> raw_stream = environment
        .addSource(new FlinkKafkaConsumer09<JoinedStreamEvent>("test", new JoinSchema(), properties))
        .assignTimestampsAndWatermarks(new IngestionTimeExtractor<>());
I was following the discussion on https://issues.apache.org/jira/browse/FLINK-4280, which suggests adding the source in the way shown below:
Properties props = new Properties();
...
FlinkKafkaConsumer kafka = new FlinkKafkaConsumer("topic", schema, props);
kafka.setStartFromEarliest();
kafka.setStartFromLatest();
kafka.setEnableCommitOffsets(boolean); // if true, commits on checkpoint if checkpointing is enabled, otherwise, periodically.
kafka.setForwardMetrics(boolean);
...
env.addSource(kafka)
I did the same; however, I was not able to access setStartFromLatest():
FlinkKafkaConsumer09 kafka = new FlinkKafkaConsumer09<JoinedStreamEvent>( "test", new JoinSchema(),properties);
What should I do to receive the latest values that are being sent to Kafka rather than receiving values from history?
The problem was solved by creating a new group id named test1 for both the sender and the consumer, and keeping the topic name the same (test).
Now I am wondering: is this the best way to solve this issue? Because every time I would need to give a new group id.
Is there some way I can just read the data that is currently being sent to Kafka?
I believe this could work for you. It did for me. Modify the properties and your Kafka topic.
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    Properties properties = new Properties();
    properties.setProperty("bootstrap.servers", "ip:port");
    properties.setProperty("zookeeper.connect", "ip:port");
    properties.setProperty("group.id", "your-group-id");

    DataStream<String> stream = env
            .addSource(new FlinkKafkaConsumer09<>("your-topic", new SimpleStringSchema(), properties));

    stream.writeAsText("your-path", FileSystem.WriteMode.OVERWRITE)
            .setParallelism(1);

    env.execute();
}
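If you can use a Flink release where the setStartFrom* methods exist on the Kafka consumer (they were added to the connector around Flink 1.3, which is what FLINK-4280 tracked), calling setStartFromLatest() is the direct way to skip history regardless of the group id. A minimal sketch, reusing the schema and properties from the question:

FlinkKafkaConsumer09<JoinedStreamEvent> kafka =
        new FlinkKafkaConsumer09<>("test", new JoinSchema(), properties);

// Start from the latest record in every partition, ignoring committed offsets and history.
kafka.setStartFromLatest();

DataStream<JoinedStreamEvent> raw_stream = environment.addSource(kafka)
        .assignTimestampsAndWatermarks(new IngestionTimeExtractor<>());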

Apache Storm: Support of topic wildcards in Kafka spout

We have a topology that has multiple Kafka spout tasks. Each spout task is supposed to read a subset of messages from a set of Kafka topics. Topics have to be subscribed using a wildcard such as AAA.BBB.*. The expected behaviour is that all spout tasks collectively consume all messages in all of the topics that match the wildcard, and each message is routed to only a single spout task (ignore failure scenarios). Is this currently supported?
Perhaps you could use the DynamicBrokersReader class:
Map conf = new HashMap();
...
conf.put("kafka.topic.wildcard.match", true);
wildCardBrokerReader = new DynamicBrokersReader(conf, connectionString, masterPath, "AAA.BBB.*");
List<GlobalPartitionInformation> partitions = wildCardBrokerReader.getBrokerInfo();
...
for (GlobalPartitionInformation eachTopic : partitions) {
    StaticHosts hosts = new StaticHosts(eachTopic);
    SpoutConfig spoutConfig = new SpoutConfig(hosts, eachTopic.topic, zkRoot, id);
    KafkaSpout spout = new KafkaSpout(spoutConfig);
}
... // Wrap those created spout instances into a container
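One possible way to do the final "wrap into a container" step, assuming each spout gets a distinct id, is to register the spouts on a TopologyBuilder inside the loop (the spout ids and parallelism are placeholders, and the bolt wiring is left out):

TopologyBuilder builder = new TopologyBuilder();
int i = 0;
for (GlobalPartitionInformation eachTopic : partitions) {
    StaticHosts hosts = new StaticHosts(eachTopic);
    SpoutConfig spoutConfig = new SpoutConfig(hosts, eachTopic.topic, zkRoot, id + "-" + i);
    // One spout per matched topic; downstream bolts can subscribe to each spout id.
    builder.setSpout("kafka-spout-" + i, new KafkaSpout(spoutConfig), 1);
    i++;
}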

What are the differences between Kafka and MapR streams from coding perspective?

What are the differences between Kafka and MapR Streams from a coding perspective? I need to implement MapR Streams in the future, but currently I only have access to Kafka. Is exploring Kafka right now useful, so that I can easily pick up MapR Streams once I get access?
As such, there is no big difference between the Kafka and MapR Streams APIs in terms of coding.
But there are some differences in terms of configuration and API arguments:
Kafka supports both the Receiver and the Direct approach (in Spark Streaming), but MapR Streams supports only the Direct approach.
The auto.offset.reset configuration value for reading data from the start is smallest in Kafka, but in MapR Streams it is earliest.
The Kafka API supports passing the key and value decoder classes as method arguments, but in the MapR Streams API you have to configure them in the Kafka params map via the key.deserializer and value.deserializer keys.
Example of the Direct approach for the Kafka and MapR Streams API calls to receive the DStream:
Kafka API:
// setting the topic.
HashSet<String> topicsSet = new HashSet<String>(Arrays.asList("myTopic"));
// setting the broker list.
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
// To read the messages from start.
kafkaParams.put("auto.offset.reset", "smallest");
// creating the DStream
JavaPairInputDStream<byte[], byte[]> kafkaStream = KafkaUtils.createDirectStream(
        streamingContext,
        byte[].class, byte[].class,
        DefaultDecoder.class, DefaultDecoder.class,
        kafkaParams, topicsSet);
MapR Stream API:
// setting the topic.
HashSet<String> topicsSet = new HashSet<String>(Arrays.asList("myTopic"));
// setting the broker list.
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
// To read the messages from start.
kafkaParams.put("auto.offset.reset", "earliest");
// setting up the key and value deserializer
kafkaParams.put("key.deserializer", StringDeserializer.class.getName());
kafkaParams.put("value.deserializer", ByteArrayDeserializer.class.getName());
// creating the DStream
JavaPairInputDStream<byte[], byte[]> kafkaStream = KafkaUtils.createDirectStream(
        streamingContext,
        byte[].class, byte[].class,
        kafkaParams, topicsSet);
I hope the above explanation helps you understand the differences between the Kafka and MapR Streams APIs.
Thanks,
Hokam
www.streamanalytix.com
I haven't used MapR Streams (since it is not open source), but my understanding is that they cloned the Kafka 0.9 Java API. So, if you are using Kafka 0.9 clients, it should be pretty similar (but you need to use their client, not Apache's).
In addition, note that clients in other languages will not be available, and other Apache projects that use different APIs (notably Spark Streaming) will require special MapR-compatible versions.