Kafka message Value conversion from Integer to String - apache-kafka

I have a Kafka message in a Kafka topic. One of the keys of this message is key=ID, and the value of that key is value=12345678910111213141.
The type of this value is integer. I want to convert the type to string.
Currently I am doing this in a somewhat hacky way:
consume message
convert the type
produce the message to other topic
Is there an easier way to do this?
PS: I don't have access to the original producer, which sends the message as an integer.

If I understand your question correctly, this will not be possible. As far as Kafka is concerned, all data is stored as bytes, and Kafka does not know which serializer was used to produce those bytes.
Therefore, you can only deserialize the value in the same way it was serialized by the producer. As I understand it, this was done using an IntegerSerializer. But since you do not have access to the producer, you have no choice but to read the value as an Integer and convert it to a String afterwards.
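A minimal sketch of that consume-convert-produce loop, assuming the topic really was written with Kafka's IntegerSerializer and that the keys are Strings; the bootstrap server, group id, and topic names are placeholders:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.IntegerDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class IntToStringRelay {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");                     // placeholder
        consumerProps.put("group.id", "int-to-string-relay");                         // placeholder
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());    // assumption: String keys
        consumerProps.put("value.deserializer", IntegerDeserializer.class.getName()); // matches the original producer

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");                     // placeholder
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());      // re-serialize the value as a String

        try (KafkaConsumer<String, Integer> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(List.of("source-topic"));                              // placeholder topic names
            while (true) {
                consumer.poll(Duration.ofMillis(500)).forEach(r ->
                        producer.send(new ProducerRecord<>("string-topic", r.key(), String.valueOf(r.value()))));
            }
        }
    }
}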

Related

Correct key-serializer to use for kafka avro

If I use org.apache.kafka.common.serialization.StringSerializer as the key-serializer in my yml file, the key that gets published to Kafka is correct, but I get a SerializationException (Error deserializing Avro message for id -1) when that message is consumed.
But when I use io.confluent.kafka.serializers.KafkaAvroSerializer instead, I don't get the SerializationException, but leading characters get added to the key. The characters are \u00014H and I have no idea where they came from. I'm using a UUID as the key, and the application is in Spring Boot.
What is the proper serializer to use? The value-serializer I use is io.confluent.kafka.serializers.KafkaAvroSerializer.
The characters are \u00014H and I have no idea where they came from
They came from using a StringDeserializer to consume the Avro-encoded bytes.
If you only have UUID strings, then you don't need Avro. Kafka has its own UUIDSerializer (and matching UUIDDeserializer).
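For example, a plain producer configured with UUID keys could look like the sketch below; the bootstrap server and topic name are placeholders. On the consuming side you would use org.apache.kafka.common.serialization.UUIDDeserializer for the key.

import java.util.Properties;
import java.util.UUID;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.common.serialization.UUIDSerializer;

public class UuidKeyProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");               // placeholder
        props.put("key.serializer", UUIDSerializer.class.getName());    // UUID keys, no Avro needed
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<UUID, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("some-topic", UUID.randomUUID(), "payload"));
        }
    }
}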

Does Kafka Consumer Deserializer have to match Producer Serializer?

Does the deserializer used by a consumer have to match the serializer used by the producer?
If a producer writes JSON values to messages, could the consumer choose to use the ByteArrayDeserializer, StringDeserializer, or JsonDeserializer regardless of which serializer the producer used, or does it have to match?
If they have to match, what is the consequence of using a different one? Would this result in an exception, no data, or something else?
It has to be compatible, not necessarily the matching pair.
ByteArrayDeserializer can consume anything.
StringDeserializer assumes UTF-8 encoded strings and will turn other types into (possibly garbled) strings upon consumption.
JsonDeserializer will attempt to parse the message and will fail on invalid JSON.
If you use an Avro, Protobuf, MsgPack, or other binary-format deserializer, it will also attempt to parse the message and will fail if it doesn't recognize its respective container format.
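As an illustration of a compatible (but not identical) pairing, a consumer can read a JSON-producing topic with a plain StringDeserializer, since JSON is just UTF-8 text; the broker address, group id, and topic name here are placeholders:

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class JsonTopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder
        props.put("group.id", "compat-demo");                // placeholder
        props.put("key.deserializer", StringDeserializer.class.getName());
        // StringDeserializer is compatible with a JSON-producing topic because JSON is UTF-8 text;
        // ByteArrayDeserializer would also work and hand you the raw bytes instead.
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("json-topic"));       // placeholder topic
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(r -> System.out.println(r.value())); // prints the raw JSON string per record
        }
    }
}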

Different schema types for message's key and value

I am wondering if it is allowed to produce a message with a JSON key schema and an Avro value schema.
Is there a restriction on mix and matching schema types of the producer record?
Does the schema registry ban it?
Is it considered a bad practice?
There is no such limitation on mixing schema types. Kafka stores bytes; it doesn't care about the serialization. However, a random consumer of the topic might not know how to consume the data, so using Avro (or Protobuf) for both would be a good idea.
You can use UTF-8 text for the key and Avro for the value, or have both as Avro. When I tried to use Kafka REST with a key that was not in Avro format while the value was, I couldn't use the consumer; it looks like that's still the case based on this issue. But if you implement your own producer and consumer, you decide what to encode/decode which way.
You do have to be careful with keys. Messages are sent to specific partitions based on the key, and in most scenarios messages should stay in time order, which can only be achieved if they go to the same partition. If you have a userId or any other identifier, you likely want all events for that user on the same partition, so use the userId as the key. I wouldn't use JSON as a key unless your key is built from only a few fields, and even then you have to be careful not to end up with the same logical key's messages on different partitions (partitioning hashes the key bytes, so any variation in the serialized JSON changes the partition).
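A sketch of such a mixed producer (plain String key, Avro value) might look like the following. This assumes Confluent's kafka-avro-serializer dependency and a running Schema Registry; the broker address, registry URL, topic, and schema are placeholders:

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class MixedSchemaProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");        // placeholder
        props.put("schema.registry.url", "http://localhost:8081"); // placeholder
        props.put("key.serializer", StringSerializer.class.getName());                           // plain UTF-8 key
        props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");     // Avro value

        Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"userId\",\"type\":\"string\"}]}");
        GenericRecord value = new GenericData.Record(schema);
        value.put("userId", "user-42");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            // using the userId as the key keeps all of this user's events on one partition
            producer.send(new ProducerRecord<>("user-events", "user-42", value));
        }
    }
}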

Kafka topic with different format of data

I have written some Avro data to the topic “test-avro” using kafka-avro-console-producer.
Then I wrote some plain text data to the same topic “test-avro” using kafka-console-producer.
After this, all the data in the topic appears corrupted. Can anyone explain what caused this to happen?
You simply cannot use the avro-console-consumer (or a Consumer with an Avro deserializer) anymore to read those offsets because it'll assume all data in the topic is Avro and use Confluent's KafkaAvroDeserializer.
The plain console-producer will push non-Avro-encoded UTF-8 strings using the StringSerializer, which will not match the wire format expected by the Avro deserializer.
The only way to get past them is to know which offsets are bad and wait for them to expire from the topic, or reset a consumer group to begin after those messages. Or you can always use the ByteArrayDeserializer and add conditional logic for parsing your messages to ensure no data loss.
tl;dr The producer and consumer must agree on the data format of the topic.
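A minimal sketch of that ByteArrayDeserializer approach, assuming each value is either in Confluent's Avro wire format or plain UTF-8 text; the broker address and group id are placeholders:

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

public class MixedTopicReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");    // placeholder
        props.put("group.id", "mixed-topic-reader");         // placeholder
        props.put("key.deserializer", ByteArrayDeserializer.class.getName());
        props.put("value.deserializer", ByteArrayDeserializer.class.getName());

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("test-avro"));
            consumer.poll(Duration.ofSeconds(1)).forEach(r -> {
                byte[] v = r.value();
                if (v == null) {
                    System.out.println("offset " + r.offset() + ": tombstone");
                } else if (v.length > 5 && v[0] == 0x0) {
                    // Confluent wire format: magic byte 0, a 4-byte schema ID, then the Avro payload
                    System.out.println("offset " + r.offset() + ": looks like Avro");
                } else {
                    // otherwise treat it as the plain UTF-8 text pushed by the console producer
                    System.out.println("offset " + r.offset() + ": " + new String(v, StandardCharsets.UTF_8));
                }
            });
        }
    }
}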

How to check which partition is a key assign to in kafka?

I am trying to debug an issue, and as part of that I am trying to prove that each distinct key only goes to one partition if the cluster is not rebalancing.
So I was wondering, for a given topic, is there a way to determine which partition a key is sent to?
As explained here or also in the source code
You need the byte[] keyBytes (assuming it isn't null); then, using org.apache.kafka.common.utils.Utils, you can run the following:
Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
For strings or JSON, the key is UTF-8 encoded, and the Utils class has helper functions to get those bytes.
For Avro, such as Confluent-serialized values, it's a bit more complicated (a magic byte, then a schema ID, then the data). See the wire format documentation.
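Putting that together, a small standalone sketch of the calculation for a String key; the key and partition count are placeholders you would swap for your own key and your topic's actual partition count:

import java.nio.charset.StandardCharsets;
import org.apache.kafka.common.utils.Utils;

public class KeyToPartition {
    public static void main(String[] args) {
        String key = "user-42";      // placeholder key
        int numPartitions = 12;      // placeholder: use the topic's real partition count

        // Same computation the default partitioner applies to a non-null key
        byte[] keyBytes = key.getBytes(StandardCharsets.UTF_8);
        int partition = Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;

        System.out.println("key '" + key + "' maps to partition " + partition);
    }
}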
In the Kafka Streams API, you have a ProcessorContext available in your Processor#init, which you can store a reference to and then access in your Processor#process method, e.g. ctx.recordMetadata().get().partition() (recordMetadata() returns an Optional).
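For illustration, a sketch of such a processor using the newer org.apache.kafka.streams.processor.api classes; the String/String generics and the logging are arbitrary choices for this example:

import java.util.Optional;
import org.apache.kafka.streams.processor.api.Processor;
import org.apache.kafka.streams.processor.api.ProcessorContext;
import org.apache.kafka.streams.processor.api.Record;
import org.apache.kafka.streams.processor.api.RecordMetadata;

// Logs which partition each key arrived on
public class PartitionLoggingProcessor implements Processor<String, String, Void, Void> {
    private ProcessorContext<Void, Void> context;

    @Override
    public void init(ProcessorContext<Void, Void> context) {
        this.context = context; // keep the reference for use in process()
    }

    @Override
    public void process(Record<String, String> record) {
        Optional<RecordMetadata> meta = context.recordMetadata();
        meta.ifPresent(m ->
                System.out.println("key " + record.key() + " was read from partition " + m.partition()));
    }
}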
only goes to 1 partition
This isn't a guarantee. Hashes can collide.
It makes more sense to say that a given key isn't in more than one partition.
if the cluster is not rebalancing
A rebalance does not change which partition a key maps to; it only changes which consumer is assigned that partition.
When you send a message, the partition is determined by the following class:
https://github.com/apache/kafka/blob/trunk/clients/src/main/java/org/apache/kafka/clients/producer/internals/DefaultPartitioner.java
If you want to change the logic, implement the org.apache.kafka.clients.producer.Partitioner interface and set ProducerConfig's 'partitioner.class' property.
Reference document:
https://kafka.apache.org/documentation/#producerconfigs
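For instance, a minimal custom Partitioner might look like this sketch; the null-key handling here is an arbitrary choice for illustration, not what the default partitioner does:

import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.utils.Utils;

// Routes null keys to partition 0 and hashes everything else like the default partitioner
public class ExamplePartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionsForTopic(topic).size();
        if (keyBytes == null) {
            return 0; // arbitrary choice for this sketch
        }
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;
    }

    @Override
    public void close() { }
}

You would then register it on the producer with props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, ExamplePartitioner.class.getName()).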