Kafka data types of messages - apache-kafka

I was wondering what types of data we can have in Kafka topics.
As far as I know, at the application level a message is a key-value pair, and the key and value can be of any type supported by the language.
For example, when we send messages to a topic, can they be JSON, Parquet files, or serialized data, or can we only work with messages in plain text format?
Thanks for your help.

There are various message formats, depending on whether you are talking about the APIs, the wire protocol, or the on-disk storage.
Some of these Kafka message formats are described in the docs here:
https://kafka.apache.org/documentation/#messageformat
Kafka has the concept of a Serializer/Deserializer or SerDes (pronounced Sir-Deez).
https://en.m.wikipedia.org/wiki/SerDes
A Serializer is a function that takes a message and converts it into the byte array that is actually sent on the wire using the Kafka protocol.
A Deserializer does the opposite: it reads the raw message bytes from the Kafka wire protocol and re-creates the message as you want the receiving application to see it.
There are built-in SerDes for Strings, Longs, byte arrays, and ByteBuffers, and a wealth of community SerDes libraries for JSON, Protobuf, and Avro, as well as application-specific message formats.
You can build your own SerDes as well; see the following:
How to create Custom serializer in kafka?
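To make the serializer/deserializer idea concrete, here is a minimal sketch in plain Python (standard library only, no Kafka client). The function names serialize_json and deserialize_json are hypothetical and are not part of any Kafka API; they only illustrate the object-to-bytes and bytes-to-object steps that a real SerDes performs.

```python
import json

def serialize_json(value):
    """Serializer: turn an application object into the bytes stored in the topic."""
    return json.dumps(value, separators=(",", ":")).encode("utf-8")

def deserialize_json(data):
    """Deserializer: rebuild the application object from the raw message bytes."""
    return json.loads(data.decode("utf-8"))

order = {"id": 42, "item": "book"}
payload = serialize_json(order)        # what a producer would put on the wire
restored = deserialize_json(payload)   # what a consumer would hand to the application
```

The broker only ever sees payload; the dictionary exists solely on the producer and consumer sides.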

On the topic it's always just serialised data. Serialisation happens in the producer before sending, and deserialisation happens in the consumer after fetching. Serializers and deserializers are pluggable, so, as you said, at the application level it's key-value pairs of any data type you want.

Related

Does Kafka Consumer Deserializer have to match Producer Serializer?

Does the deserializer used by a consumer have to match the serializer used by the producer?
If a producer adds JSON values to messages, could the consumer choose to use the ByteArrayDeserializer, StringDeserializer, or JsonDeserializer regardless of which serializer the producer used, or does it have to match?
If they have to match, what is the consequence of using a different one? Would this result in an exception, no data, or something else?
They have to be compatible, but not necessarily the matching pair:
ByteArrayDeserializer can consume anything.
StringDeserializer assumes UTF-8 encoded strings by default; it will decode other payloads as text too, typically producing garbled strings rather than an exception.
JsonDeserializer will attempt to parse the message and will fail on invalid JSON.
If you use an Avro, Protobuf, Msgpack, or other binary-format deserializer, it will also attempt to parse the message and will fail if it doesn't recognize its respective container format.
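The compatibility rules above can be illustrated without Kafka at all. The sketch below uses plain Python, and as_bytes, as_string, and as_json are hypothetical stand-ins for the byte-array, string, and JSON deserializers. It reads the same bytes three ways and shows what happens when the bytes were written by a different kind of producer.

```python
import json

payload = json.dumps({"id": 1}).encode("utf-8")   # bytes from a JSON producer
not_json = (12345).to_bytes(8, "big")              # bytes from a producer of 8-byte longs

def as_bytes(data):
    """Accepts anything: the raw bytes are returned unchanged."""
    return data

def as_string(data):
    """Decodes as UTF-8; undecodable bytes are replaced instead of raising."""
    return data.decode("utf-8", errors="replace")

def as_json(data):
    """Parses JSON; raises ValueError if the bytes are not valid JSON."""
    return json.loads(data.decode("utf-8"))

text = as_string(payload)      # readable, because JSON is UTF-8 text
parsed = as_json(payload)      # works, because the formats match
garbled = as_string(not_json)  # no error, but not a meaningful string
try:
    as_json(not_json)
    json_failed = False
except ValueError:
    json_failed = True         # incompatible deserializer: parsing fails
```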

Kafka serialization difference between LongSerializer and Serdes.LongSerde

We are writing a new application that publishes to Kafka, and we need to serialize the key of the messages as a long value. When checking the Kafka docs for serialization, it seems there are two long serializer classes: LongSerializer and Serdes.LongSerde. We tried searching for a reference on the difference between the two but could not find any link explaining it. We would appreciate it if someone could explain the difference or share a link that does. Or are they the same?
Primary link to Serializer Doc: https://kafka.apache.org/11/javadoc/org/apache/kafka/common/serialization/package-frame.html
LongSerializer: https://kafka.apache.org/11/javadoc/org/apache/kafka/common/serialization/LongSerializer.html
Serdes.LongSerde: https://kafka.apache.org/11/javadoc/org/apache/kafka/common/serialization/Serdes.LongSerde.html
Thanks.
As you should know, for Kafka (brokers), messages are arrays of bytes (keys and values).
KafkaProducer, KafkaConsumer and KafkaStreams need to know how to write and read messages, that is, how to transform them from POJOs to arrays of bytes and vice versa.
For that purpose, org.apache.kafka.common.serialization.Serializer and org.apache.kafka.common.serialization.Deserializer are used. KafkaProducer uses a Serializer to transform the key and value to arrays of bytes, and KafkaConsumer uses a Deserializer to transform arrays of bytes back to the key and value.
Kafka Streams applications do both: they write to and read from topics. For that, org.apache.kafka.common.serialization.Serde is used (ready-made implementations are provided by Serdes). It is a wrapper around a Serializer and a Deserializer.
In your example:
LongSerializer is a class that should be used to translate a Long to an array of bytes.
LongSerde is a class that should be used in a Kafka Streams application to read and write Longs (under the hood it uses LongSerializer and LongDeserializer).
Additional reading:
https://kafka.apache.org/23/documentation/streams/developer-guide/datatypes.html
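As a rough illustration of the relationship described above, the following plain-Python sketch encodes a long as 8 big-endian bytes (the layout Kafka's LongSerializer produces) and then bundles both directions into one object, which is all a Serde does. The names long_serialize, long_deserialize, and LongSerdeSketch are hypothetical and only mimic the Java classes.

```python
import struct

def long_serialize(value):
    """Encode a signed 64-bit integer as 8 big-endian bytes."""
    return struct.pack(">q", value)

def long_deserialize(data):
    """Decode 8 big-endian bytes back into a signed 64-bit integer."""
    if len(data) != 8:
        raise ValueError("expected exactly 8 bytes, got %d" % len(data))
    return struct.unpack(">q", data)[0]

class LongSerdeSketch:
    """Hypothetical stand-in for a Serde: it only pairs the two functions."""
    serializer = staticmethod(long_serialize)
    deserializer = staticmethod(long_deserialize)

key_bytes = long_serialize(1234567890123)            # what a producer would send as the key
round_trip = LongSerdeSketch.deserializer(
    LongSerdeSketch.serializer(-5))                   # a Streams-style read/write pair
```

A plain producer only needs the serializing half; a Streams application needs both halves, which is why it takes the wrapper.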

how can I pass KafkaAvroSerializer into a Kafka ProducerRecord?

I have messages which are being streamed to Kafka. I would like to convert the messages to Avro binary format (that is, to encode them).
I'm using the Confluent Platform. I have a Kafka ProducerRecord[String,String] which sends the messages to the Kafka topic.
Can someone provide a (short) example? Or recommend a website with examples?
Does anyone know how I can pass an instance of a KafkaAvroSerializer into the KafkaProducer?
Can I use an Avro GenericRecord instance inside the ProducerRecord?
Kind regards
Nika
You need to set KafkaAvroSerializer in your producer config for either serializer setting (key or value), and also set the Schema Registry URL in the producer config (AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG).
That serializer will Avro-encode primitives and strings, but if you need complex objects, you could try adding Avro4s, for example. Otherwise, GenericRecord will work as well.
A Java example is here: https://docs.confluent.io/current/schema-registry/serializer-formatter.html

Kafka Connect: How can I send protobuf data from Kafka topics to HDFS using hdfs sink connector?

I have a producer that's producing protobuf messages to a topic. I have a consumer application which deserializes the protobuf messages. But the HDFS sink connector picks up messages from the Kafka topics directly. What should the key and value converters in etc/schema-registry/connect-avro-standalone.properties be set to? What's the best way to do this? Thanks in advance!
Kafka Connect is designed to separate the concern of serialization format in Kafka from individual connectors with the concept of converters. As you seem to have found, you'll need to adjust the key.converter and value.converter classes to implementations that support protobufs. These classes are commonly implemented as a normal Kafka Deserializer followed by a step which performs a conversion from serialization-specific runtime formats (e.g. Message in protobufs) to Kafka Connect's runtime API (which doesn't have any associated serialization format -- it's just a set of Java types and a class to define Schemas).
I'm not aware of an existing implementation. The main challenge in implementing this is that, although the protobuf wire format can be decoded without access to the original schema, its fields are identified only by integer IDs, so you probably wouldn't get useful schema information without either a) requiring that the specific schema is available to the converter, e.g. via config (which makes migrating schemas more complicated), or b) a schema registry service plus a wrapper format for your data that allows you to look up the schema dynamically.
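Option b) can be pictured as a small header in front of the serialized payload. The sketch below, in plain Python, is loosely modeled on the framing used by Confluent's Schema Registry serializers (one magic byte followed by a 4-byte schema ID); the exact layout here, and the names wrap and unwrap, are assumptions for illustration and not the format of any particular converter.

```python
import struct

MAGIC_BYTE = 0  # assumed format-version marker

def wrap(schema_id, payload):
    """Prefix the payload with a magic byte and a 4-byte big-endian schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload

def unwrap(message):
    """Split a wrapped message into (schema_id, payload) after checking the header."""
    if len(message) < 5:
        raise ValueError("message too short to contain a header")
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("unknown magic byte: %d" % magic)
    return schema_id, message[5:]

# b"\x08\x01" stands in for an already-serialized protobuf message.
message = wrap(7, b"\x08\x01")
schema_id, body = unwrap(message)
```

A converter built this way would use schema_id to fetch the schema from the registry and then decode body with it.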

protobuf within avro encoded message on kafka

Wanted to know if there is a better way to solve the problem that we are having. Here is the flow:
Our client code understands only protocol buffers (protobuf). On the server side, our gateway gets the protobuf and puts it onto Kafka.
Now, Avro is the recommended encoding scheme, so we put the specific protobuf inside an Avro message (as a byte array) and put it onto the message bus. The reason we do this is to avoid having to do a full protobuf-to-Avro conversion.
On the consumer side, it reads the avro message, gets the protobuf out of it and works on that.
How reliable is protobuf with Kafka? Are there a lot of people using it? What exactly are the advantages/disadvantages of using Kafka with protobuf?
Is there a better way to handle our use case/scenario?
thanks
Kafka doesn't differentiate between encoding schemes, since in the end every message flows in and out of Kafka as binary.
Both Protobuf and Avro are binary encoding schemes, so why would you want to wrap a Protobuf message inside an Avro schema when you can put the Protobuf message into Kafka directly?