Kafka serialization difference between LongSerializer and Serdes.LongSerde - apache-kafka

We are writing a new application which publishes to Kafka, and we need to serialize the message keys as long values. When checking the Kafka docs for serialization, it seems there are two long serializer classes: LongSerializer and Serdes.LongSerde. We tried searching for a reference explaining the difference between the two but could not find one. We would appreciate it if someone could explain, or share a link explaining, the difference between these. Or are they the same?
Primary link to Serializer Doc: https://kafka.apache.org/11/javadoc/org/apache/kafka/common/serialization/package-frame.html
LongSerializer: https://kafka.apache.org/11/javadoc/org/apache/kafka/common/serialization/LongSerializer.html
Serdes.LongSerde: https://kafka.apache.org/11/javadoc/org/apache/kafka/common/serialization/Serdes.LongSerde.html
Thanks.

As you should know, to Kafka (the brokers) messages are just arrays of bytes (both keys and values).
KafkaProducer, KafkaConsumer and Kafka Streams need to know how to write and read messages - how to transform them from POJOs to arrays of bytes and vice versa.
For that purpose org.apache.kafka.common.serialization.Serializer and org.apache.kafka.common.serialization.Deserializer are used. KafkaProducer uses a Serializer to transform the key and value into arrays of bytes, and KafkaConsumer uses a Deserializer to transform arrays of bytes back into the key and value.
A Kafka Streams application does both - it writes to and reads from topics - and for that org.apache.kafka.common.serialization.Serdes is used. A Serde is essentially a wrapper around a Serializer and a Deserializer.
In your example:
LongSerializer is a class that should be used to translate a Long into an array of bytes.
LongSerde is a class that should be used in a Kafka Streams application to read and write Longs (under the hood it uses LongSerializer and LongDeserializer).
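For illustration, here is a minimal sketch of how the two are typically used (the topic name, bootstrap address and application id below are made-up values):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.LongSerializer;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;

public class LongKeyExample {
    public static void main(String[] args) {
        // Plain producer: LongSerializer turns the Long key into bytes
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, LongSerializer.class);
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
        try (KafkaProducer<Long, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("example-topic", 42L, "some value"));
        }

        // Kafka Streams: Serdes.Long() is the Serde that wraps LongSerializer and LongDeserializer
        Properties streamsProps = new Properties();
        streamsProps.put(StreamsConfig.APPLICATION_ID_CONFIG, "example-app");
        streamsProps.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("example-topic", Consumed.with(Serdes.Long(), Serdes.String()))
               .foreach((key, value) -> System.out.println(key + " -> " + value));
        // ...then build the topology and start a KafkaStreams instance with streamsProps as usual
    }
}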
Additional reading:
https://kafka.apache.org/23/documentation/streams/developer-guide/datatypes.html

Related

how can I pass KafkaAvroSerializer into a Kafka ProducerRecord?

I have messages which are being streamed to Kafka. I would like to convert the messages into Avro binary format (that is, encode them).
I'm using the Confluent Platform. I have a Kafka ProducerRecord[String,String] which sends the messages to the Kafka topic.
Can someone provide a (short) example? Or recommend a website with examples?
Does anyone know how I can pass an instance of a KafkaAvroSerializer into the KafkaProducer?
Can I use an Avro GenericRecord instance inside the ProducerRecord?
Kind regards
Nika
You need to use the KafkaAvroSerializer in your producer config for either serializer config (key and/or value), and also set the schema registry URL in the producer config (AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG).
That serializer will Avro-encode primitives and strings, but if you need complex objects, you could try adding Avro4s, for example. Otherwise, a GenericRecord will work as well.
Java example is here - https://docs.confluent.io/current/schema-registry/serializer-formatter.html
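For example, something along these lines (the topic name, schema and registry URL are made-up values; this assumes the Confluent kafka-avro-serializer and Avro libraries are on the classpath):

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        // A GenericRecord built against an inline schema (for illustration only)
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"ExampleMessage\",\"fields\":[{\"name\":\"text\",\"type\":\"string\"}]}");
        GenericRecord record = new GenericData.Record(schema);
        record.put("text", "hello avro");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("example-avro-topic", "some-key", record));
        }
    }
}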

Kafka JSON console producer

I am trying to use Kafka's console producer, kafka-console-producer. It seems that by default, it uses the string serializer.
Is it possible to set the Kafka console producer to use the JSON serializer?
If you have control of the input data, there should be no reason to explicitly set a JSON serializer on the console producer.
The JSON serializer is just an extension of the string serializer; it also takes the raw string object and converts it to bytes. Plus, plain strings are valid JSON values anyway.
Kafka only stores bytes - it doesn't care what format your data is in.
In addition to the fact that the StringSerializer is the default, I just wanted to add that you can choose whatever serializer you want (even custom implementations) by providing the class names in the following producer configurations:
key.serializer
value.serializer
Your consumers will then need to use the appropriate deserializer and you would set that in the following consumer configurations:
key.deserializer
value.deserializer
Apache Kafka Documentation
Edited: At first I thought the serializer class had a default value, but this doesn't seem to be the case according to the Kafka documentation. It might need to be set according to the data types used for the key and value of the record being produced. In any case, these are the parameters you need to configure to use a specific class with the serializer/deserializer implementation you need.
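For example, a matching Java producer and consumer configuration might look like this (the bootstrap address is a made-up value):

import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

public class SerializerConfigExample {
    public static void main(String[] args) {
        // Producer side: key.serializer / value.serializer
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Consumer side: the matching key.deserializer / value.deserializer
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        // These Properties would then be passed to new KafkaProducer<>(...) / new KafkaConsumer<>(...)
        System.out.println(producerProps);
        System.out.println(consumerProps);
    }
}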

Kafka data types of messages

I was wondering what types of data we can have in Kafka topics.
As far as I know, at the application level messages are key-value pairs, and the data can be of any type supported by the language.
For example, when we send some messages to a topic, could they be JSON, Parquet files, serialized data, or do we only operate with messages as plain text?
Thanks for your help.
There are various message formats depending on whether you are talking about the APIs, the wire protocol, or the on-disk storage.
Some of these Kafka message formats are described in the docs here:
https://kafka.apache.org/documentation/#messageformat
Kafka has the concept of a Serializer/Deserializer or SerDes (pronounced Sir-Deez).
https://en.m.wikipedia.org/wiki/SerDes
A Serializer is a function that takes any message and converts it into the byte array that is actually sent on the wire using the Kafka protocol.
A Deserializer does the opposite: it reads the raw message bytes portion of the Kafka wire protocol and re-creates a message as you want the receiving application to see it.
There are built-in SerDes libraries for Strings, Longs, ByteArrays and ByteBuffers, and a wealth of community SerDes libraries for JSON, Protobuf, Avro, as well as application-specific message formats.
You can build your own SerDes libraries as well; see the following:
How to create Custom serializer in kafka?
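For a rough idea of what that involves, here is a minimal sketch of a custom serializer/deserializer pair for a hypothetical Person class, encoded as a UTF-8 string purely for illustration (a real implementation would more likely use JSON, Avro, Protobuf, etc.):

import java.nio.charset.StandardCharsets;
import java.util.Map;
import org.apache.kafka.common.serialization.Deserializer;
import org.apache.kafka.common.serialization.Serializer;

// Hypothetical application type, used only for illustration
class Person {
    final String name;
    final int age;
    Person(String name, int age) { this.name = name; this.age = age; }
}

// Serializer: Person -> byte[]
class PersonSerializer implements Serializer<Person> {
    @Override public void configure(Map<String, ?> configs, boolean isKey) { }
    @Override public byte[] serialize(String topic, Person person) {
        if (person == null) return null;
        return (person.name + "," + person.age).getBytes(StandardCharsets.UTF_8);
    }
    @Override public void close() { }
}

// Deserializer: byte[] -> Person
class PersonDeserializer implements Deserializer<Person> {
    @Override public void configure(Map<String, ?> configs, boolean isKey) { }
    @Override public Person deserialize(String topic, byte[] bytes) {
        if (bytes == null) return null;
        String[] parts = new String(bytes, StandardCharsets.UTF_8).split(",", 2);
        return new Person(parts[0], Integer.parseInt(parts[1]));
    }
    @Override public void close() { }
}

If you also need a Serde (for example for Kafka Streams), Serdes.serdeFrom(new PersonSerializer(), new PersonDeserializer()) wraps the pair for you.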
On the topic it's always just serialised data. Serialisation happens in the producer before sending, and deserialisation in the consumer after fetching. Serializers and deserializers are pluggable, so, as you said, at the application level it's key-value pairs of any data type you want.

Kafka Connect: How can I send protobuf data from Kafka topics to HDFS using hdfs sink connector?

I have a producer that's producing protobuf messages to a topic. I have a consumer application which deserializes the protobuf messages. But the HDFS sink connector picks up messages from the Kafka topics directly. What would the key and value converters in etc/schema-registry/connect-avro-standalone.properties be set to? What's the best way to do this? Thanks in advance!
Kafka Connect is designed to separate the concern of serialization format in Kafka from individual connectors with the concept of converters. As you seem to have found, you'll need to adjust the key.converter and value.converter classes to implementations that support protobufs. These classes are commonly implemented as a normal Kafka Deserializer followed by a step which performs a conversion from serialization-specific runtime formats (e.g. Message in protobufs) to Kafka Connect's runtime API (which doesn't have any associated serialization format -- it's just a set of Java types and a class to define Schemas).
I'm not aware of an existing implementation. The main challenge in implementing this is that protobuf's wire format can be parsed without access to the original schema, but since its fields are identified only by integer IDs, you probably wouldn't get useful schema information without either a) requiring that the specific schema is available to the converter, e.g. via config (which makes migrating schemas more complicated), or b) a schema registry service + wrapper format for your data that allows you to look up the schema dynamically.
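As a sketch, the converter lines in etc/schema-registry/connect-avro-standalone.properties would then point at whatever protobuf-capable converter implementation you end up writing or finding (the class name below is a hypothetical placeholder, not a real artifact):

# Hypothetical protobuf-capable converter class - substitute the real implementation
key.converter=com.example.connect.protobuf.ProtobufConverter
value.converter=com.example.connect.protobuf.ProtobufConverter
# Converter-specific settings can be passed with the key.converter. / value.converter. prefixes,
# e.g. a (hypothetical) pointer to the compiled schema the converter should use:
# value.converter.protobuf.descriptor.file=/path/to/schemas.desc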

Is there a way to specify multiple Decoders (or one per Topic) in a Kafka Consumer? Anyone else felt need for this?

I am doing Spark Streaming over Kafka work in Scala (ref) using
public static <K,V,U extends kafka.serializer.Decoder<?>,T extends kafka.serializer.Decoder<?>> ReceiverInputDStream<scala.Tuple2<K,V>> createStream(StreamingContext ssc, scala.collection.immutable.Map<String,String> kafkaParams, scala.collection.immutable.Map<String,Object> topics, StorageLevel storageLevel, scala.reflect.ClassTag<K> evidence$1, scala.reflect.ClassTag<V> evidence$2, scala.reflect.ClassTag<U> evidence$3, scala.reflect.ClassTag<T> evidence$4)
I want to receive different types of messages (that need different Decoders) in the same DStream and the underlying RDD every batch interval. I will be listening to multiple topics and each topic will correspond to one message type, thereby needing its own Decoder. Currently it does not seem like there's a way to provide a kafka.serializer.Decoder<?> per topic (is there one?). It seems fairly likely that people would send different types of messages over each topic (protobuf serialized bytes?). Has anyone else run into this issue?
Thanks.
C.
It seems a mapping of topic to valueDecoder somewhere in here could help.
I think you need two DStreams, one per topic. Then you will be able to perform a join or union to get a single DStream with all elements.
Use the createDirectStream API, which gives you access to the topic on a per-partition basis via HasOffsetRanges. For the Kafka decoder, use the DefaultDecoder to get an array of bytes for each message.
Then do your actual decoding in a mapPartitions, where you match against the topic name to figure out how to interpret the array of bytes.
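Roughly, against the spark-streaming-kafka 0.8 direct API that looks something like the sketch below (shown in Java; the Scala version is analogous). The topic names and the per-topic decode steps are hypothetical placeholders, and foreachPartition is used in place of mapPartitions since the sketch only consumes the records:

import java.util.*;
import kafka.serializer.DefaultDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.TaskContext;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.HasOffsetRanges;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

public class MultiTopicDecodeExample {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("multi-topic-decode");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "localhost:9092");
        Set<String> topics = new HashSet<>(Arrays.asList("topicA", "topicB")); // hypothetical topics

        // DefaultDecoder hands back the raw byte[] for both key and value
        JavaPairInputDStream<byte[], byte[]> stream = KafkaUtils.createDirectStream(
                jssc, byte[].class, byte[].class, DefaultDecoder.class, DefaultDecoder.class,
                kafkaParams, topics);

        stream.foreachRDD(rdd -> {
            // Direct stream RDD partitions map 1:1 to Kafka topic-partitions
            OffsetRange[] offsetRanges = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            rdd.foreachPartition(records -> {
                String topic = offsetRanges[TaskContext.get().partitionId()].topic();
                while (records.hasNext()) {
                    byte[] valueBytes = records.next()._2();
                    if (topic.equals("topicA")) {
                        // decodeTypeA(valueBytes);  // hypothetical per-topic decoding
                    } else {
                        // decodeTypeB(valueBytes);
                    }
                }
            });
        });

        jssc.start();
        jssc.awaitTermination();
    }
}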
http://spark.apache.org/docs/latest/streaming-kafka-integration.html