Can Apache Kafka send non-string messages through a topic? - scala

An MLlib model is trained in one place and I want to send it somewhere else. When I try to send it through a Kafka topic like this:
val model = LogisticRegressionModel.load(sc, "/PATH/To/Model")
val producer = new Producer[String, LogisticRegressionModel](config)
val data = new KeyedMessage[String, LogisticRegressionModel](topic2, key, model)
producer.send(data)
producer.close()
I would encounter an error like this:
org.apache.spark.mllib.classification.LogisticRegressionModel cannot be cast to java.lang.String
So, is it possible for Kafka to send non-string messages through a topic?

You can send non-string messages to a Kafka topic using the Kafka producer. From version 0.9.0 it is better to use the Java client instead of the Scala client.
All you need to do is specify the correct key and value serializers in the Properties, like below:
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

Related

how does the producer communicate with the registry and what does it send to the registry

I'm trying to understand this by reading the documentation, but maybe because I'm not an advanced programmer I don't really follow it.
For example, take this example from the documentation:
https://docs.confluent.io/current/schema-registry/serdes-develop/serdes-protobuf.html#protobuf-serializer
Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
          "org.apache.kafka.common.serialization.StringSerializer");
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
          "io.confluent.kafka.serializers.protobuf.KafkaProtobufSerializer");
props.put("schema.registry.url", "http://127.0.0.1:8081");

Producer<String, MyRecord> producer = new KafkaProducer<String, MyRecord>(props);

String topic = "testproto";
String key = "testkey";
OtherRecord otherRecord = OtherRecord.newBuilder()
        .setOtherId(123).build();
MyRecord myrecord = MyRecord.newBuilder()
        .setF1("value1").setF2(otherRecord).build();
ProducerRecord<String, MyRecord> record =
        new ProducerRecord<String, MyRecord>(topic, key, myrecord);
producer.send(record).get();
producer.close();
I see here that you define the schema registry URL, and then somehow the producer knows to contact the registry and provide some metadata about the messages.
Now I would like to understand better how this actually works and what is exchanged between the producer and the registry (or is it Kafka that contacts the registry)?
Anyway, my question is: imagine I have a record that is in Protobuf format,
and I'm putting that Protobuf into Kafka on a certain topic.
Now I want to activate the schema registry, so would the producer just send the proto definition to the schema registry?
Does the producer get the metadata definition directly from the record?
Would it try to update it on every new message put into the queue? Would this not increase the latency a bit when pushing data to Kafka?
Sorry if these are all very basic questions, but I'm just trying to get the bigger picture and I'm missing this piece.
Thanks for any information, and sorry if this is already clear from the documentation.
(I need this so that I can use ksql to deserialize my messages in Kafka.)
Best regards,
would the producer just send the proto definition into the schema registry
The Serializer does, not the Producer directly.
MyRecord is serialized to binary, the schema is sent over HTTP to the Registry, which returns an ID, then the sent message contains 0x0 + ID + binary-protobuf-value
Source code here
would it try to update in any new message into the queue?
Schema gets sent before any messages are sent. Existing messages are untouched
would this not increase a bit the latency when pushing the data to kafka?
Only for the first message since the schema gets cached
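To make the wire format concrete, here is a small illustrative helper (my own sketch, not part of the Confluent API) that pulls the magic byte and schema ID back out of a consumed record value:
import java.nio.ByteBuffer;

public class WireFormatPeek {
    // Reads the Confluent framing: byte 0 is the magic byte (0x0),
    // bytes 1-4 are the registry-assigned schema ID (big-endian int),
    // and everything after that is the serialized Protobuf payload.
    public static int schemaId(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value);
        byte magic = buf.get();
        if (magic != 0x0) {
            throw new IllegalArgumentException("Not in Confluent wire format");
        }
        return buf.getInt();
    }
}
The serializer only talks to the registry the first time it sees a schema; after that the ID comes from a local cache, which is why only the first send pays the HTTP round trip.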

Deserialization in Flink datastream

Here I write a string to a Kafka topic and Flink consumes that topic. The deserialization is done using SimpleStringSchema. When I need to consume an integer value, what deserialization schema should I use instead of SimpleStringSchema?
DataStream<String> messageStream = env.addSource(new FlinkKafkaConsumer09<String>("test2", new SimpleStringSchema(), properties));
You need to define your own DeserializationSchema (the read-side counterpart of the SerializationSchema linked below):
https://github.com/apache/flink/blob/master/flink-core/src/main/java/org/apache/flink/api/common/serialization/SerializationSchema.java#L32
Or leave it as a string then map the stream out into your required types
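For example, here is a minimal sketch of a schema that reads each record value as an int (it assumes the producer writes 4 raw big-endian bytes, and the import path may differ slightly between Flink versions):
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;

public class IntegerDeserializationSchema implements DeserializationSchema<Integer> {

    @Override
    public Integer deserialize(byte[] message) throws IOException {
        // If the producer writes the number as text instead, use:
        // return Integer.parseInt(new String(message, java.nio.charset.StandardCharsets.UTF_8));
        return ByteBuffer.wrap(message).getInt();
    }

    @Override
    public boolean isEndOfStream(Integer nextElement) {
        return false;
    }

    @Override
    public TypeInformation<Integer> getProducedType() {
        return BasicTypeInfo.INT_TYPE_INFO;
    }
}
It would then be passed to the consumer in place of SimpleStringSchema, e.g. new FlinkKafkaConsumer09<Integer>("test2", new IntegerDeserializationSchema(), properties).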

Multiple Kafka streams for spark streaming

How can I consume Kafka streams from two different topics (like view_event and click_event) in Java so that both streams are received in parallel? So far I have been trying this:
JavaPairInputDStream<String, GenericData.Record> stream_1 =
    KafkaUtils.createDirectStream(ssc, String.class, GenericData.Record.class,
        StringDecoder.class, GenericDataRecordDecoder.class, props, topicsSet_1);
JavaPairInputDStream<String, GenericData.Record> stream_2 =
    KafkaUtils.createDirectStream(ssc, String.class, GenericData.Record.class,
        StringDecoder.class, GenericDataRecordDecoder.class, props, topicsSet_2);
But I observed some weirdness in the output: sometimes only stream_1 is received, and sometimes only stream_2.
Also, I am using connector version 0.8.2.1, and it would be great if you could provide a sample code example.
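Not a verified fix for the intermittent behaviour, but two common ways to wire this up with the 0.8 direct-stream API are shown in this sketch (it reuses ssc, props, GenericDataRecordDecoder, stream_1 and stream_2 from the question above):
// Option 1: a single direct stream over both topics (the key/value types are the same).
HashSet<String> bothTopics = new HashSet<>(Arrays.asList("view_event", "click_event"));
JavaPairInputDStream<String, GenericData.Record> combined =
    KafkaUtils.createDirectStream(ssc, String.class, GenericData.Record.class,
        StringDecoder.class, GenericDataRecordDecoder.class, props, bothTopics);

// Option 2: keep the two streams and merge them before processing.
JavaPairDStream<String, GenericData.Record> merged = stream_1.union(stream_2);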

What are the differences between Kafka and MapR streams from coding perspective?

What are the differences between Kafka and MapR Streams from a coding perspective? I need to implement MapR Streams in the future, but currently I only have access to Kafka. Is exploring Kafka right now useful, so that I can easily pick up MapR Streams once I get access?
As such, there is no big difference between the Kafka and MapR Streams APIs in terms of coding.
But there are some differences in terms of configuration and API arguments:
Kafka supports both the Receiver and Direct approaches, but MapR Streams supports only the Direct approach.
The offset reset configuration value for reading data from the start is smallest in Kafka, but earliest in MapR Streams.
The Kafka API supports passing the key and value deserializer arguments in the method call, but in the MapR Streams API you have to configure them in the Kafka params map under the key.deserializer and value.deserializer keys.
Example of Direct approach for Kafka and MapR Stream API calls to receive the DStream:
Kafka API:
// setting the topic.
HashSet<String> topicsSet = new HashSet<String>(Arrays.asList("myTopic"));
// setting the broker list.
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
// To read the messages from start.
kafkaParams.put("auto.offset.reset", "smallest");
// creating the DStream
JavaPairInputDStream<byte[], byte[]> kafkaStream =
    KafkaUtils.createDirectStream(streamingContext, byte[].class, byte[].class,
        DefaultDecoder.class, DefaultDecoder.class, kafkaParams, topicsSet);
MapR Stream API:
// setting the topic.
HashSet<String> topicsSet = new HashSet<String>(Arrays.asList("myTopic"));
// setting the broker list.
Map<String, String> kafkaParams = new HashMap<String, String>();
kafkaParams.put("metadata.broker.list", "localhost:9092");
// To read the messages from start.
kafkaParams.put("auto.offset.reset", "earliest");
// setting up the key and value deserializer
kafkaParams.put("key.deserializer", StringDeserializer.class.getName());
kafkaParams.put("value.deserializer", ByteArrayDeserializer.class.getName());
// creating the DStream
JavaPairInputDStream<byte[], byte[]> kafkaStream =
    KafkaUtils.createDirectStream(streamingContext, byte[].class, byte[].class,
        kafkaParams, topicsSet);
I hope the above explanation helps you understand the differences between the Kafka and MapR Streams APIs.
Thanks,
Hokam
www.streamanalytix.com
I haven't used MapR Streams (since it is not open source), but my understanding is that they cloned the Kafka 0.9 Java API. So, if you are using Kafka 0.9 clients, it should be pretty similar (but you need to use their client, not Apache's).
In addition, note that clients in other languages will not be available. And other Apache projects that use different APIs (notably Spark Streaming) will require special MapR compatible versions.

Kafka High Level Consumer Fetch All Messages From Topic Using Java API (Equivalent to --from-beginning)

I am testing the Kafka High Level Consumer using the ConsumerGroupExample code from the Kafka site. I would like to retrieve all the existing messages on the topic called "test" that I have in the Kafka server config. Looking at other blogs, auto.offset.reset should be set to "smallest" to be able to get all messages:
private static ConsumerConfig createConsumerConfig(String a_zookeeper, String a_groupId) {
    Properties props = new Properties();
    props.put("zookeeper.connect", a_zookeeper);
    props.put("group.id", a_groupId);
    props.put("auto.offset.reset", "smallest");
    props.put("zookeeper.session.timeout.ms", "10000");
    return new ConsumerConfig(props);
}
The question I really have is this: what is the equivalent Java api call for the High Level Consumer that is the equivalent of:
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning
Basically, every time a consumer with a new group tries to consume a topic, it will read messages from the beginning (provided auto.offset.reset is set to "smallest", as in your config). If you are only consuming from the beginning each time for testing purposes, then initialising your consumer with a new group.id every time makes it read the messages from the beginning. Here's how I did it:
properties.put("group.id", UUID.randomUUID().toString());
and read messages from the beginning each time!
Looks like you need to use the "low level SimpleConsumer API"
For most applications, the high level consumer Api is good enough.
Some applications want features not exposed to the high level consumer
yet (e.g., set initial offset when restarting the consumer). They can
instead use our low level SimpleConsumer Api. The logic will be a bit
more complicated and you can follow the example in here.
This example worked for getting all messages from a topic with the following arguments (note that the port is the Kafka port, not the ZooKeeper port; the topic was set up as in that example):
10 my-replicated-topic 0 localhost 9092
Specifically, there is a method to get readOffset which takes kafka.api.OffsetRequest.EarliestTime():
long readOffset = getLastOffset(consumer,a_topic, a_partition, kafka.api.OffsetRequest.EarliestTime(), clientName);
Here is another post may provide some alternate ideas on how to sort this out: How to get data from old offset point in Kafka?
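For reference, the offset lookup in that example looks roughly like the sketch below (reconstructed from the old 0.8 SimpleConsumer example, so double-check it against the linked code before relying on it):
import java.util.HashMap;
import java.util.Map;
import kafka.api.PartitionOffsetRequestInfo;
import kafka.common.TopicAndPartition;
import kafka.javaapi.OffsetResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public static long getLastOffset(SimpleConsumer consumer, String topic, int partition,
                                 long whichTime, String clientName) {
    // whichTime = kafka.api.OffsetRequest.EarliestTime() reads from the start of the partition.
    TopicAndPartition topicAndPartition = new TopicAndPartition(topic, partition);
    Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfo =
            new HashMap<TopicAndPartition, PartitionOffsetRequestInfo>();
    requestInfo.put(topicAndPartition, new PartitionOffsetRequestInfo(whichTime, 1));
    kafka.javaapi.OffsetRequest request = new kafka.javaapi.OffsetRequest(
            requestInfo, kafka.api.OffsetRequest.CurrentVersion(), clientName);
    OffsetResponse response = consumer.getOffsetsBefore(request);
    if (response.hasError()) {
        System.out.println("Error fetching offset data. Reason: "
                + response.errorCode(topic, partition));
        return 0;
    }
    return response.offsets(topic, partition)[0];
}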
To fetch messages from the beginning, you can do this:
import kafka.utils.ZkUtils;
ZkUtils.maybeDeletePath("zkhost:zkport", "/consumers/group.id");
then just continue with the usual consumer setup...
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("auto.offset.reset", "earliest");
props.put("group.id", UUID.randomUUID().toString());
These properties will help you.
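With the modern consumer API (0.9+), the same from-the-beginning behaviour looks roughly like the sketch below: a fresh group.id plus auto.offset.reset=earliest is the programmatic equivalent of --from-beginning.
import java.util.Collections;
import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FromBeginningConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", UUID.randomUUID().toString()); // new group => no committed offsets
        props.put("auto.offset.reset", "earliest");          // so it starts at the earliest available offset
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Collections.singletonList("test"));
        while (true) {
            ConsumerRecords<String, String> records = consumer.poll(1000);
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}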