Multiple Kafka streams for Spark Streaming

How can I consume Kafka streams from two different topics (such as view_event and click_event) in Java so that both streams are received in parallel? So far I have tried this:
JavaPairInputDStream<String, GenericData.Record> stream_1 =
        KafkaUtils.createDirectStream(ssc, String.class, GenericData.Record.class,
                StringDecoder.class, GenericDataRecordDecoder.class, props, topicsSet_1);
JavaPairInputDStream<String, GenericData.Record> stream_2 =
        KafkaUtils.createDirectStream(ssc, String.class, GenericData.Record.class,
                StringDecoder.class, GenericDataRecordDecoder.class, props, topicsSet_2);
But I have observed some weirdness in the output: sometimes only stream_1 is received, and at other times only stream_2. Also, I am using connector version 0.8.2.1, and it would be better if you could provide a sample code example.
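One approach worth trying (a sketch only, assuming the same 0.8 direct-stream API, props, decoders, and topic sets shown above) is to either create a single direct stream over both topics, or keep the two streams and union them before processing:

// Sketch; assumes ssc, props, topicsSet_1, topicsSet_2, stream_1 and stream_2 from the snippet above,
// plus imports for java.util.HashSet, java.util.Set and org.apache.spark.streaming.api.java.JavaPairDStream.

// Option 1: one direct stream over both topics.
Set<String> allTopics = new HashSet<>(topicsSet_1);
allTopics.addAll(topicsSet_2);
JavaPairInputDStream<String, GenericData.Record> combined =
        KafkaUtils.createDirectStream(ssc, String.class, GenericData.Record.class,
                StringDecoder.class, GenericDataRecordDecoder.class, props, allTopics);

// Option 2: keep the two streams and union them so both are processed in the same pipeline.
JavaPairDStream<String, GenericData.Record> unioned = stream_1.union(stream_2);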

Related

Kafka consumers (same group-id) stuck always reading from the same partition

I have 2 consumers running with the same group-id, reading from a topic with 3 partitions and parsing messages with KafkaAvroDeserializer. The consumers have these settings:
def avroConsumerSettings[T <: SpecificRecordBase](schemaRegistry: String, bootstrapServer: String, groupId: String)(implicit
    actorSystem: ActorSystem): ConsumerSettings[String, T] = {
  val kafkaAvroSerDeConfig = Map[String, Any](
    AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG -> schemaRegistry,
    KafkaAvroDeserializerConfig.SPECIFIC_AVRO_READER_CONFIG -> true.toString
  )
  val kafkaAvroDeserializer = new KafkaAvroDeserializer()
  kafkaAvroDeserializer.configure(kafkaAvroSerDeConfig.asJava, false)
  val deserializer = kafkaAvroDeserializer.asInstanceOf[Deserializer[T]]
  ConsumerSettings(actorSystem, new StringDeserializer, deserializer)
    .withBootstrapServers(bootstrapServer)
    .withGroupId(groupId)
    .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
    .withProperty(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true")
}
I tried sending a malformed message to test error handling, and now my consumer is stuck (it always retries reading from the same partition, because I'm using RestartSource.onFailuresWithBackoff). What is strange to me is that if I run another consumer, it gets stuck as well, because it again reads from the same partition where the unreadable message is (AFAIK two consumers with the same group-id cannot read from the same partition).
Can someone help me understand what I am doing wrong?
When you restart the Kafka source after a failure, that results in a new consumer being created; eventually the consumer in the failed source is declared dead by Kafka, triggering a rebalance. In that rebalance, there are no external guarantees of which consumer in the group will be assigned which partition. This would explain why your other consumer in the group reads that partition.
The issue here, with a poison message derailing consumption, is a major reason I've developed a preference to treat keys and values from Kafka as blobs by using the ByteArrayDeserializer and do the deserialization myself in the stream. That gives me the ability to record that there was a malformed message in the topic (e.g. by logging it; producing the message to a dead-letter topic for later inspection can also work) and move on by committing the offset. Scala's Either is particularly good for carrying either the decoded value or the malformed message downstream to the committer.
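A minimal sketch of that pattern using the plain Kafka consumer API rather than Alpakka (the topic name, group id, and the deserialize/process helpers are assumptions; it assumes a reasonably recent kafka-clients where poll(Duration) exists):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BlobConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group");
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", StringDeserializer.class.getName());
        // Values are consumed as raw bytes; deserialization happens in application code.
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));
            while (true) {
                ConsumerRecords<String, byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, byte[]> record : records) {
                    try {
                        MyEvent event = deserialize(record.value()); // hypothetical Avro/JSON decode
                        process(event);                              // hypothetical business logic
                    } catch (Exception e) {
                        // Poison message: log it (or produce it to a dead-letter topic) and move on.
                        System.err.println("Malformed record at " + record.topic()
                                + "-" + record.partition() + "@" + record.offset());
                    }
                }
                consumer.commitSync(); // commit so the malformed record is not re-read on restart
            }
        }
    }

    // Hypothetical helpers, stand-ins for real deserialization and processing.
    static MyEvent deserialize(byte[] bytes) { return new MyEvent(); }
    static void process(MyEvent e) { }
    static class MyEvent { }
}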

How can I know that I have consumed all of a Kafka Topic?

I am using Flink v1.4.0. I am consuming data from a Kafka topic using the Flink Kafka consumer, as per the code below:
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
// only required for Kafka 0.8
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "test");
DataStream<String> stream = env
        .addSource(new FlinkKafkaConsumer08<>("topic", new SimpleStringSchema(), properties));

final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
FlinkKafkaConsumer08<String> myConsumer = new FlinkKafkaConsumer08<>(...);
myConsumer.setStartFromEarliest();     // start from the earliest record possible
myConsumer.setStartFromLatest();       // start from the latest record
myConsumer.setStartFromGroupOffsets(); // the default behaviour
DataStream<String> stream = env.addSource(myConsumer);
...
Is there a way of knowing whether I have consumed the whole of the Topic? How can I monitor the offset? (Is that an adequate way of confirming that I have consumed all the data from within the Kafka Topic?)
Since Kafka is typically used with continuous streams of data, consuming "all" of a topic may or may not be a meaningful concept. I suggest you look at the documentation on how Flink exposes Kafka's metrics, which includes this explanation:
The difference between the committed offset and the most recent offset in each partition is called the consumer lag. If the Flink topology is consuming the data slower from the topic than new data is added, the lag will increase and the consumer will fall behind. For large production deployments we recommend monitoring that metric to avoid increasing latency.
So, if the consumer lag is zero, you're caught up. That said, you might wish to be able to compare the offsets yourself, but I don't know of an easy way to do that.
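For reference, one way to compare the offsets yourself is with the plain Kafka consumer API (a sketch, not taken from the answer; it assumes a 0.10.1+ client where endOffsets() exists, and that the group you inspect actually commits its offsets to Kafka):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LagCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "test"); // the group whose progress we want to inspect
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo p : consumer.partitionsFor("topic")) {
                partitions.add(new TopicPartition(p.topic(), p.partition()));
            }
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
            for (TopicPartition tp : partitions) {
                OffsetAndMetadata committed = consumer.committed(tp);
                long lag = endOffsets.get(tp) - (committed == null ? 0 : committed.offset());
                System.out.println(tp + " lag=" + lag); // lag of 0 on every partition => caught up
            }
        }
    }
}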
Kafka is used as a streaming source, and a stream does not have an end.
If I'm not wrong, Flink's Kafka connector pulls data from a topic every X milliseconds, because all Kafka consumers are active consumers; Kafka does not notify you when there is new data in a topic.
So, in your case, just set a timeout: if you don't read data within that time, you have read all of the data in your topic.
Anyway, if you need to read a finite batch of data, you can use some of Flink's windows or introduce some kind of markers in your Kafka topic to delimit the start and the end of the batch.
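A sketch of that timeout idea with the plain Kafka consumer API rather than Flink (the topic name, group id, and 10-second idle timeout are assumptions):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class DrainTopic {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "drain");
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        long idleTimeoutMs = 10_000; // assume "no data for 10s" means the topic is drained
        long lastRecordAt = System.currentTimeMillis();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("topic"));
            while (System.currentTimeMillis() - lastRecordAt < idleTimeoutMs) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (!records.isEmpty()) {
                    lastRecordAt = System.currentTimeMillis(); // reset the idle clock on new data
                    records.forEach(r -> System.out.println(r.value()));
                }
            }
        }
    }
}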

Can Apache Kafka send non-string messages through a topic?

An MLlib model is trained somewhere, and I want it to be sent somewhere else. When I try to send it through a Kafka topic like this:
val model = LogisticRegressionModel.load(sc, "/PATH/To/Model")
val producer = new Producer[String, LogisticRegressionModel](config)
val data = new KeyedMessage[String, LogisticRegressionModel](topic2, key, model)
producer.send(data)
producer.close()
I would encounter an error like this:
org.apache.spark.mllib.classification.LogisticRegressionModel cannot be cast to java.lang.String
So, is it possible for Kafka to send non-string messages through a topic?
You can send non-string messages to a Kafka topic using the Kafka producer. Since version 0.9.0 it is better to use the Java client instead of the Scala client.
All you need to do is specify the correct key and value serializers in the Properties, like below:
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

Kafka's new 0.10 API doesn't provide a list of stream and consumer objects per topic

Previously I was using the 0.8 API. When you pass it a list of topics, it returns a map of streams (one entry per topic). This allows me to spawn a separate thread and assign each topic's stream to it. Since there is too much data in each topic, spawning a separate thread helps with multitasking.
// 0.8 code sample
Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap =
        consumer.createMessageStreams(topicCountMap);
I want to upgrade to 0.10. I checked the KafkaStreams and KafkaConsumer classes. A KafkaConsumer object takes config properties and provides a subscribe method that takes a list of topics, but its return type is void. I cannot find a way to get a handle to each topic.
KafkaConsumer consumer = new KafkaConsumer(props);
consumer.subscribe(topicsList);
consumer.poll(long ms)
KafkaStreams on the other hand seems to have the same problem.
KStreamBuilder builder = new KStreamBuilder();
String [] topics = new String[] {"topic1", "topic2"};
KStream<byte[], byte[]> source = builder.stream(stringSerde, stringSerde, topics);
KafkaStreams streams = new KafkaStreams(builder, props);
streams.start();
There is a source.foreach() method available, but it is a stream over all topics. Anyone, any ideas?
First, using a multi-threaded consumer is tricky, so the pattern you employed in 0.8 is hopefully well designed :)
The best practice is to use a single-threaded consumer, so there is "no need" to separate different topics if a single consumer subscribes to a list of topics at once. Nevertheless, when consuming a record, the record object tells you which topic it originates from (it carries this metadata). Thus, you could theoretically dispatch a record to a different thread for the actual processing, according to its topic (even if this is not recommended!).
Kafka scales out via partitions; thus, if a single-threaded consumer is not able to handle the load, you should start multiple consumers (as a consumer group) to scale out your processing capacity.
A more general question: if you want to process data per topic, why not use multiple consumers, each subscribing to a single topic?
Last but not least, the Kafka Streams API introduced in Apache Kafka 0.10 is a new stream processing library; it must not be confused with the 0.8 KafkaStream class (hint: there is no "s"). The two are completely unrelated to each other.
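A sketch of the per-topic dispatch mentioned above, using the 0.10 consumer API's ConsumerRecords.records(topic) to separate records by topic after a single poll (the topic names and group id are assumptions, and printing stands in for real per-topic worker threads):

import java.util.Arrays;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class PerTopicDispatch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "per-topic-dispatch");
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        List<String> topics = Arrays.asList("topic1", "topic2");
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(topics);
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(100); // 0.10-era signature; newer clients use poll(Duration)
                for (String topic : topics) {
                    // records.records(topic) yields only this topic's records from the poll;
                    // these could be handed to a per-topic worker thread if you really want to.
                    for (ConsumerRecord<byte[], byte[]> record : records.records(topic)) {
                        System.out.println(topic + " @ " + record.offset());
                    }
                }
            }
        }
    }
}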

What are the differences between Kafka and MapR Streams from a coding perspective?

What are the differences between Kafka and MapR Streams from a coding perspective? I need to implement MapR Streams in the future, but currently I only have access to Kafka. Is exploring Kafka right now useful, so that I can easily pick up MapR Streams once I get access?
As such, there is no big difference between the Kafka and MapR Streams APIs in terms of coding.
But there are some differences in terms of configuration and API arguments:
Kafka supports both the Receiver and Direct approaches, but MapR Streams supports only the Direct approach.
The offset-reset configuration value for reading data from the start is "smallest" in Kafka, but "earliest" in MapR Streams.
The Kafka API supports passing the key and value deserializer classes as method arguments, but in the MapR Streams API you have to configure them in the Kafka params map under the key.deserializer and value.deserializer keys.
Examples of Direct-approach Kafka and MapR Streams API calls to receive the DStream:
Kafka API:
// setting the topic.
HashSet<String> topicsSet = new HashSet<String>(Arrays.asList("myTopic"));
// setting the broker list.
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
// To read the messages from start.
kafkaParams.put("auto.offset.reset", "smallest");
// creating the DStream
JavaPairInputDStream<byte[], byte[]> kafkaStream = KafkaUtils.createDirectStream(streamingContext, byte[].class, byte[].class, DefaultDecoder.class, DefaultDecoder.class, kafkaParams, topicsSet);
MapR Stream API:
// setting the topic.
HashSet<String> topicsSet = new HashSet<String>(Arrays.asList("myTopic"));
// setting the broker list.
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
// To read the messages from start.
kafkaParams.put("auto.offset.reset", "earliest");
// setting up the key and value deserializer
kafkaParams.put("key.deserializer", StringDeserializer.class.getName());
kafkaParams.put("value.deserializer", ByteArrayDeserializer.class.getName());
// creating the DStream
JavaPairInputDStream<byte[], byte[]> kafkaStream = KafkaUtils.createDirectStream(streamingContext, byte[].class, byte[].class, kafkaParams, topicsSet);
I hope the above explanation helps you understand the differences between the Kafka and MapR Streams APIs.
I haven't used MapR Streams (since it is not open source), but my understanding is that they cloned the Kafka 0.9 Java API. So, if you are using Kafka 0.9 clients, it should be pretty similar (but you need to use their client, not Apache's).
In addition, note that clients in other languages will not be available, and other Apache projects that use different APIs (notably Spark Streaming) will require special MapR-compatible versions.