What are the differences between Kafka and MapR Streams from a coding perspective? - apache-kafka

What are the differences between Kafka and MapR Streams from a coding perspective? I need to implement MapR Streams in the future, but currently I only have access to Kafka. Is exploring Kafka right now useful, so that I can easily pick up MapR Streams once I get access?

There is no big difference between the Kafka and MapR Streams APIs in terms of coding.
But there are some differences in configuration and API arguments:
Kafka supports both the Receiver and Direct approaches, while MapR Streams supports only the Direct approach.
The offset reset configuration value for reading data from the start is smallest in Kafka, but earliest in MapR Streams.
The Kafka API supports passing the key and value deserializers as method arguments, but in the MapR Streams API you have to configure them in the Kafka params map under the key.deserializer and value.deserializer keys.
Example of Direct approach for Kafka and MapR Stream API calls to receive the DStream:
Kafka API:
// setting the topic.
HashSet<String> topicsSet = new HashSet<String>(Arrays.asList("myTopic"));
// setting the broker list.
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
// To read the messages from start.
kafkaParams.put("auto.offset.reset", "smallest");
// creating the DStream
JavaPairInputDStream<byte[], byte[]> kafkaStream = KafkaUtils.createDirectStream(
        streamingContext,
        byte[].class, byte[].class,
        DefaultDecoder.class, DefaultDecoder.class,
        kafkaParams, topicsSet);
MapR Stream API:
// setting the topic.
HashSet<String> topicsSet = new HashSet<String>(Arrays.asList("myTopic"));
// setting the broker list.
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
// To read the messages from start.
kafkaParams.put("auto.offset.reset", "earliest");
// setting up the key and value deserializer
kafkaParams.put("key.deserializer", StringDeserializer.class.getName());
kafkaParams.put("value.deserializer", ByteArrayDeserializer.class.getName());
// creating the DStream
JavaPairInputDStream<byte[], byte[]> kafkaStream = KafkaUtils.createDirectStream(
        streamingContext,
        byte[].class, byte[].class,
        kafkaParams, topicsSet);
I hope the above explanation helps you in understanding the differences between the Kafka and MapR Streams APIs.
Thanks,
Hokam
www.streamanalytix.com

I haven't used MapR Streams (since it is not open source), but my understanding is that they cloned the Kafka 0.9 Java API. So, if you are using Kafka 0.9 clients, it should be pretty similar (but you need to use their client, not Apache's).
In addition, note that clients in other languages will not be available. And other Apache projects that use different APIs (notably Spark Streaming) will require special MapR compatible versions.

Related

Copy data between Kafka topics using Kafka connectors

I'm new to Kafka, and now I need to copy data from one Kafka topic to another. I'm wondering what the possible ways to do this are. The ways I can think of are the following:
Kafka consumer + Kafka producer
Kafka Streams
Kafka sink connector + producer
Kafka consumer + source connector
My question is: is it possible to use two Kafka connectors in between, e.g. a sink connector + a source connector? If so, could you please provide me with some good examples, or some hints on how to do so?
Thanks in advance!
All the methods you listed are possible. Which one is best really depends on the control you want over the process, and on whether it's a one-off operation or something you want to keep running.
Kafka Streams offers an easy way to flow one topic into another via the DSL.
You could do something like (demo code obviously not for production!):
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "streams-wordcount");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

final Serde<byte[]> bytesSerdes = Serdes.ByteArray();
final StreamsBuilder builder = new StreamsBuilder();

KStream<byte[], byte[]> input = builder.stream(
        "input-topic",
        Consumed.with(bytesSerdes, bytesSerdes)
);
input.to("output-topic", Produced.with(bytesSerdes, bytesSerdes));

final KafkaStreams streams = new KafkaStreams(builder.build(), props);
try {
    streams.start();
    Thread.sleep(60000L);
} catch (Exception e) {
    e.printStackTrace();
} finally {
    streams.close();
}
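For the plain Kafka consumer + Kafka producer option, a minimal sketch could look like the following (again demo code; the broker address, group id, and topic names are assumptions, and offset management, error handling, and shutdown are left out):
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "localhost:9092");
consumerProps.put("group.id", "topic-copy");
consumerProps.put("auto.offset.reset", "earliest");
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer");

Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "localhost:9092");
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");

try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(consumerProps);
     KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(producerProps)) {
    consumer.subscribe(Collections.singletonList("input-topic"));
    while (true) {
        // read a batch from the source topic and forward each record unchanged
        ConsumerRecords<byte[], byte[]> records = consumer.poll(1000L);
        for (ConsumerRecord<byte[], byte[]> record : records) {
            producer.send(new ProducerRecord<>("output-topic", record.key(), record.value()));
        }
    }
}
This gives you the most control, but you also have to take care of committing offsets and handling failures yourself, which Kafka Streams does for you in the version above.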

Use the same topic as a source more than once with Kafka Streams DSL

Is there a way to use the same topic as the source for two different processing routines, when using Kafka Streams DSL?
StreamsBuilder streamsBuilder = new StreamsBuilder();
// use the topic as a stream
streamsBuilder.stream("topic")...
// use the same topic as a source for KTable
streamsBuilder.table("topic")...
return streamsBuilder.build();
The naive implementation above throws a TopologyException at runtime: Invalid topology: Topic topic has already been registered by another source. This is totally valid if we dive into the underlying Processor API. Is using that API the only way out?
UPDATE:
The closest alternative I've found so far:
StreamsBuilder streamsBuilder = new StreamsBuilder();
final KStream<Object, Object> stream = streamsBuilder.stream("topic");
// use the topic as a stream
stream...
// create a KTable from the KStream
stream.groupByKey().reduce((oldValue, newValue) -> newValue)...
return streamsBuilder.build();
Reading the same topic as a stream and as a table is semantically questionable IMHO. Streams model immutable facts, while a changelog topic that you would read into a KTable models updates.
If you want to use a single topic in multiple streams, you can reuse the same KStream object multiple times (it's semantically like a broadcast):
KStream stream = ...
stream.filter(...);
stream.map(...);
Also compare https://issues.apache.org/jira/browse/KAFKA-6687 (there are plans to remove this restriction, but I doubt we will allow using one topic as a KStream and a KTable at the same time; compare my comment above).
Yes, you can, but for that you need to have multiple StreamsBuilder instances:
StreamsBuilder streamsBuilder1 = new StreamsBuilder();
streamsBuilder1.stream("topic");
StreamsBuilder streamsBuilder2 = new StreamsBuilder();
streamsBuilder2.table("topic");
Topology topology1 = streamsBuilder1.build();
Topology topology2 = streamsBuilder2.build();
KafkaStreams kafkaStreams1 = new KafkaStreams(topology1, streamsConfig1);
KafkaStreams kafkaStreams2 = new KafkaStreams(topology2, streamsConfig2);
Also make sure that you have a different application.id value for each StreamsConfig, as in the sketch below.
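For example, a minimal sketch of two such configs, differing only in application.id (the ids and broker address here are just placeholders), which would be passed to the two KafkaStreams instances above as streamsConfig1 and streamsConfig2:
Properties streamsConfig1 = new Properties();
streamsConfig1.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-as-stream-app"); // placeholder id
streamsConfig1.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker

Properties streamsConfig2 = new Properties();
streamsConfig2.put(StreamsConfig.APPLICATION_ID_CONFIG, "topic-as-table-app");  // placeholder id
streamsConfig2.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder broker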

How can I know that I have consumed all of a Kafka Topic?

I am using Flink v1.4.0. I am consuming data from a Kafka topic using a Kafka Flink consumer, as per the code below:
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
// only required for Kafka 0.8
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");

FlinkKafkaConsumer08<String> myConsumer =
        new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties);
myConsumer.setStartFromEarliest();     // start from the earliest record possible
myConsumer.setStartFromLatest();       // start from the latest record
myConsumer.setStartFromGroupOffsets(); // the default behaviour

DataStream<String> stream = env.addSource(myConsumer);
...
Is there a way of knowing whether I have consumed the whole of the Topic? How can I monitor the offset? (Is that an adequate way of confirming that I have consumed all the data from within the Kafka Topic?)
Since Kafka is typically used with continuous streams of data, consuming "all" of a topic may or may not be a meaningful concept. I suggest you look at the documentation on how Flink exposes Kafka's metrics, which includes this explanation:
The difference between the committed offset and the most recent offset in each partition is called the consumer lag. If the Flink topology is consuming the data slower from the topic than new data is added, the lag will increase and the consumer will fall behind. For large production deployments we recommend monitoring that metric to avoid increasing latency.
So, if the consumer lag is zero, you're caught up. That said, you might wish to be able to compare the offsets yourself, but I don't know of an easy way to do that.
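If you do want to compare the offsets yourself, one rough sketch (outside of Flink, with a plain Kafka consumer; the broker address, group id, and topic name are assumptions) is to compare the group's committed offset with the end offset of each partition:
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "test");
props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    List<TopicPartition> partitions = new ArrayList<>();
    for (PartitionInfo info : consumer.partitionsFor("topic")) {
        partitions.add(new TopicPartition(info.topic(), info.partition()));
    }
    Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
    for (TopicPartition tp : partitions) {
        OffsetAndMetadata committed = consumer.committed(tp); // null if nothing committed yet
        long lag = endOffsets.get(tp) - (committed == null ? 0L : committed.offset());
        System.out.println(tp + " lag: " + lag);
    }
}
If the lag is zero for every partition, the group has consumed everything that is currently in the topic.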
Kafka is used as a streaming source, and a stream does not have an end.
If I'm not wrong, Flink's Kafka connector pulls data from a topic every X milliseconds, because all Kafka consumers are active consumers; Kafka does not notify you if there is new data inside a topic.
So, in your case, just set a timeout; if you don't read data in that time, you have read all of the data inside your topic.
Anyway, if you need to read a finite batch of data, you can use some of Flink's windows, or introduce some kind of marker messages inside your Kafka topic to delimit the start and the end of the batch, as sketched below.
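For example, a minimal sketch of producing such a marker with a plain Kafka producer (the topic name and marker value are just placeholders):
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // a well-known value that the Flink job can watch for to detect the end of the batch
    producer.send(new ProducerRecord<>("topic", "control", "END_OF_BATCH"));
}

// on the Flink side, the marker records can be filtered out (or used to trigger a window), e.g.:
// stream.filter(value -> !"END_OF_BATCH".equals(value))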

Multiple Kafka streams for Spark Streaming

How do I input Kafka streams from two different topics (like view_event and click_event)? How can we do this in Java so that both streams are received in parallel? Till now I have been trying this:
JavaPairInputDStream<String, GenericData.Record> stream_1 =
        KafkaUtils.createDirectStream(ssc, String.class, GenericData.Record.class, StringDecoder.class,
                GenericDataRecordDecoder.class, props, topicsSet_1);

JavaPairInputDStream<String, GenericData.Record> stream_2 =
        KafkaUtils.createDirectStream(ssc, String.class, GenericData.Record.class, StringDecoder.class,
                GenericDataRecordDecoder.class, props, topicsSet_2);
But I observed some weirdness in the output: sometimes only stream_1 is received, and sometimes only stream_2 is received.
Also, I am using connector version 0.8.2.1, and it would be better if you could provide a sample code example.

Can Apache Kafka send non-string messages through a topic?

An MLlib model is trained somewhere, and I want to send it somewhere else. When I try to send it through a Kafka topic like this:
val model = LogisticRegressionModel.load(sc, "/PATH/To/Model")
val producer = new Producer[String, LogisticRegressionModel](config)
val data = new KeyedMessage[String, LogisticRegressionModel](topic2, key, model)
producer.send(data)
producer.close()
I would encounter an error like this:
org.apache.spark.mllib.classification.LogisticRegressionModel cannot be cast to java.lang.String
So, is it possible for Kafka to send non-string messages through a topic?
You can send non-string messages to a Kafka topic using the Kafka producer. From version 0.9.0 it's better to use the Java client instead of the Scala client.
All you need to do is specify the correct key and value serializers in the Properties, like below.
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");