Does the deserializer used by a consumer have to match the serializer used by the producer?
If a producer adds JSON values to messages, could the consumer choose to use the ByteArrayDeserializer, StringDeserializer or JsonDeserializer regardless of which serializer the producer used, or does it have to match?
If they have to match, what is the consequence of using a different one? Would this result in an exception, no data or something else?
It has to be compatible, not necessarily the matching pair
ByteArrayDeserializer can consume anything.
StringDeserializer assumes UTF-8 encoded strings and will happily decode the bytes of other types into (possibly meaningless) strings upon consumption.
JsonDeserializer will attempt to parse the message and will fail on invalid JSON.
If you used an Avro, Protobuf, Msgpack, etc. binary-format deserializer, those also attempt to parse the message and will fail if they don't recognize their respective container formats.
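For example, here is a minimal sketch (broker address and topic name are placeholders) of a consumer that reads the raw bytes regardless of what serializer the producer used:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class RawConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumption: local broker
        props.put("group.id", "raw-reader");
        // ByteArrayDeserializer never fails: it hands back the raw bytes as-is
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("some-topic")); // hypothetical topic name
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<byte[], byte[]> record : records) {
                // Interpreting the bytes is now up to you, e.g. new String(record.value())
                System.out.printf("offset=%d, %d value bytes%n",
                        record.offset(), record.value().length);
            }
        }
    }
}

Interpreting those bytes correctly is then your application's problem, which is why matching (or at least compatible) deserializers are the usual choice.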
Related
I am wondering if it is allowed to produce a message with a JSON key schema and an Avro value schema.
Is there a restriction on mixing and matching the schema types of a producer record?
Does the schema registry ban it?
Is it considered a bad practice?
There is no such limitation on mixing message formats. Kafka stores bytes; it doesn't care about the serialization. However, a random consumer of the topic might not know how to consume the data, so using Avro (or Protobuf) for both key and value would be a good idea.
You can use UTF-8 text for the key and Avro for the value, or have both as Avro. When I tried to use Kafka REST with a key that was not in Avro format while the message value was, I couldn't use the consumer. Looks like it's still the case based on this issue. But if you implement your own producer and consumer, you decide what to encode/decode which way.
You have to be careful with keys. Messages are sent to specific partitions based on the key, and in most scenarios messages should stay in time order, which can only be achieved if they go to the same partition. If you have a userId or any other identifier, you likely want to send all events for that user to the same partition, so use the userId as the key. I wouldn't use JSON as a key unless your key is based on a few fields, and even then you have to be careful not to end up with related messages on different partitions.
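As a sketch of that mixed setup (assuming Confluent's KafkaAvroSerializer and a Schema Registry; the broker address, registry URL, topic and schema are placeholders), a producer can use a plain string key with an Avro value:

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MixedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");              // assumption: local broker
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");  // Confluent Avro serializer
        props.put("schema.registry.url", "http://localhost:8081");      // assumption: local registry

        // Hypothetical value schema with a single string field
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"UserEvent\","
          + "\"fields\":[{\"name\":\"action\",\"type\":\"string\"}]}");
        GenericRecord value = new GenericData.Record(schema);
        value.put("action", "login");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // Using the userId as the key keeps all of this user's events on one partition
            producer.send(new ProducerRecord<>("user-events", "user-42", value));
        }
    }
}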
I have a Kafka message in one Kafka topic. One of the keys of this message is key=ID, and the value of that key is value=12345678910111213141.
The type of this value is integer. I want to convert the type to string.
Currently I am doing this in a somewhat hacky way:
consume the message
convert the type
produce the message to another topic
Is there an easier way to do this?
PS: I don't have access to the first producer, which sends the message as an integer.
If I understand your question correctly, this will not be possible. As far as Kafka is concerned, all data is stored as bytes, and Kafka does not know which serializer was used to generate those bytes.
Therefore, you can only deserialize the value in the same way it was serialized by the producer. As I understand it, this was done using an integer serializer. But as you do not have access to the producer, you have no choice but to read it as an integer and convert it to a string afterwards.
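A minimal sketch of that consume-convert-produce loop (assuming the value was written with a numeric serializer such as Kafka's LongSerializer and that the key is a string; broker address and topic names are placeholders):

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class NumberToStringCopier {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put("bootstrap.servers", "localhost:9092");
        cProps.put("group.id", "converter");
        cProps.put("key.deserializer", StringDeserializer.class.getName());
        cProps.put("value.deserializer", LongDeserializer.class.getName()); // must match the original serializer

        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", "localhost:9092");
        pProps.put("key.serializer", StringSerializer.class.getName());
        pProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaConsumer<String, Long> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("source-topic"));               // hypothetical topic names
            while (true) {
                for (ConsumerRecord<String, Long> record : consumer.poll(Duration.ofMillis(500))) {
                    String asString = String.valueOf(record.value()); // the actual conversion step
                    producer.send(new ProducerRecord<>("target-topic", record.key(), asString));
                }
            }
        }
    }
}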
I have written some Avro data to the topic “test-avro” using kafka-avro-console-producer.
Then I have written some plain text data to the same topic “test-avro” using kafka-console-producer.
After this, all the data in the topic got corrupted. Can anyone explain what caused this to happen?
You simply cannot use the avro-console-consumer (or a Consumer with an Avro deserializer) anymore to read those offsets because it'll assume all data in the topic is Avro and use Confluent's KafkaAvroDeserializer.
The plain console producer will push non-Avro, UTF-8 encoded strings using the StringSerializer, which will not match the wire format expected by the Avro deserializer.
The only way to get past those records is to know which offsets are bad and wait for them to expire from the topic, or to reset the consumer group to begin after those messages. Or, you can always use the ByteArrayDeserializer and add conditional logic for parsing your messages to ensure no data loss.
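A sketch of that conditional logic (assuming the Avro records follow Confluent's wire format, which starts with a zero magic byte followed by a 4-byte schema ID; the handling code is a placeholder):

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class MixedTopicGuard {

    /** Returns true if the bytes look like a Confluent-framed Avro record. */
    static boolean looksLikeConfluentAvro(byte[] value) {
        // Confluent wire format: magic byte 0x0, then a 4-byte schema ID, then the Avro payload
        return value != null && value.length > 5 && value[0] == 0x0;
    }

    static void handle(byte[] value) {
        if (looksLikeConfluentAvro(value)) {
            int schemaId = ByteBuffer.wrap(value, 1, 4).getInt();
            // hand off to an Avro deserializer / schema-registry lookup here
            System.out.println("Avro record, schema id " + schemaId);
        } else {
            // fall back to treating it as the plain UTF-8 string the console producer wrote
            System.out.println("Plain string record: " + new String(value, StandardCharsets.UTF_8));
        }
    }
}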
tl;dr The producer and consumer must agree on the data format of the topic.
I am trying to use Kafka's console producer, kafka-console-producer. It seems that by default, it uses the string serializer.
Is it possible to set the Kafka console producer to use the JSON serializer?
If you have control over the input data, there should be no reason to explicitly set a JSON serializer on the console producer.
The JSON serializer is just an extension of the String serializer; it also takes the raw string and converts it to bytes. Plus, plain strings are valid JSON values anyway.
Kafka only stores bytes - it doesn't care what format your data exists in
I just wanted to add that, in addition to the fact that the StringSerializer is the default, you can choose whatever serializer you want (even custom implementations) by providing the class names in the following producer configurations:
key.serializer
value.serializer
Your consumers will then need to use the appropriate deserializer, which you set in the following consumer configurations:
key.deserializer
value.deserializer
Apache Kafka Documentation
Edited: At first I thought the serializer class had a default value, but this doesn't seem to be the case according to the Kafka documentation. It should be set according to the data types used for the key and value of the record being produced. In any case, these are the parameters you need to configure to set a specific class with the serializer/deserializer implementation you need.
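For example, a minimal sketch of setting those four properties explicitly (the class names are the built-in Kafka ones; the broker address and group id are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;

public class SerializerConfigExample {
    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "example-group");
        // The deserializers must be able to read what the serializers above wrote
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            // producer.send(...) and consumer.subscribe(...)/poll(...) as needed
        }
    }
}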
I was wondering about what types of data we could have in Kafka topics.
As I understand it, at the application level these are key-value pairs, and they could be of any data type supported by the language.
For example, when we send messages to a topic, could they be JSON, Parquet files, or other serialized data, or do we only work with messages in plain text format?
Thanks for your help.
There are various message formats depending on if you are talking about the APIs, the wire protocol, or the on disk storage.
Some of these Kafka Message formats are described in the docs here
https://kafka.apache.org/documentation/#messageformat
Kafka has the concept of a Serializer/Deserializer or SerDes (pronounced Sir-Deez).
https://en.m.wikipedia.org/wiki/SerDes
A Serializer is a function that takes any message and converts it into the byte array that is actually sent on the wire using the Kafka protocol.
A Deserializer does the opposite: it reads the raw message bytes from the Kafka wire protocol and re-creates the message as you want the receiving application to see it.
There are built-in SerDes libraries for Strings, Longs, ByteArrays and ByteBuffers, and a wealth of community SerDes libraries for JSON, Protobuf and Avro, as well as application-specific message formats.
You can build your own SerDes libraries as well; see the following:
How to create Custom serializer in kafka?
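As a minimal sketch of such a custom serializer (a hypothetical class that turns a POJO into UTF-8 JSON bytes; the Jackson ObjectMapper dependency is an assumption):

import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

// Hypothetical custom serializer: converts any POJO into JSON bytes with Jackson
public class JsonPojoSerializer<T> implements Serializer<T> {

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // no configuration needed for this sketch
    }

    @Override
    public byte[] serialize(String topic, T data) {
        if (data == null) {
            return null;
        }
        try {
            return mapper.writeValueAsBytes(data);
        } catch (Exception e) {
            throw new SerializationException("Error serializing JSON for topic " + topic, e);
        }
    }

    @Override
    public void close() {
        // nothing to close
    }
}

It would then be registered by passing its class name as the value.serializer (or key.serializer) producer configuration, with a matching deserializer on the consumer side.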
On the topic it's always just serialized data. Serialization happens in the producer before sending and deserialization in the consumer after fetching. Serializers and deserializers are pluggable, so, as you said, at the application level it's key-value pairs of any data type you want.