Structuring Data Received from Kafka Topic - apache-kafka

I have fetched some data from a Kafka topic. I have used this configuration in the YAML file:
key-deserializer: org.apache.kafka.common.serialization.StringDeserializer
value-deserializer: org.apache.kafka.common.serialization.StringDeserializer
The data I got in the console is
?8Enterprise recommended by AC
Is this Avro data?
And if so, what should I do to convert it to JSON? Do I need to convert it, or deserialize it?

It depends on the way you generated the message before writing it to the topic.
If you deserialized with the StringDeserializer and your output looks correct, you probably also used the StringSerializer to write the value of the message to the topic.
This is not Avro or structured data. You can use JSON or Avro to serialize the data when creating the message, and then use the matching deserializer when consuming the messages to get the structured data back.
If you tell me your programming language, maybe I can give you an example.
I hope it helps.
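For instance, a minimal Java sketch (the broker address, topic name, and Person POJO are illustrative placeholders, not from the question) that writes an object as a JSON string with Jackson and the plain StringSerializer, so the same StringDeserializer on the consumer side returns parseable JSON:

import java.util.Properties;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

// Serialize the payload to a JSON string before producing, so the consumer
// can read it back with the StringDeserializer and parse it as JSON.
ObjectMapper mapper = new ObjectMapper();
String json = mapper.writeValueAsString(new Person("Alice", 30)); // hypothetical POJO

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    producer.send(new ProducerRecord<>("my-topic", "key1", json));
}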

Related

How to deserialize avro message using mirrormaker?

I want to replicate a kafka topic to an azure event hub.
The messages are in Avro format and use a schema that sits behind a Schema Registry with USER_INFO authentication.
Using a Java client to connect to Kafka, I can use a KafkaAvroDeserializer to deserialize the messages correctly.
But this configuration doesn't seem to work with MirrorMaker.
Is it possible to deserialize the Avro messages using MirrorMaker before sending them?
Cheers
For MirrorMaker 1, the consumer deserializer properties are hard-coded.
Unless you plan on re-serializing the data into a different format when the producer sends data to Event Hubs, you should stick to using the default ByteArrayDeserializer.
If you did want to manipulate the messages in any way, that would need to be done with a MirrorMakerMessageHandler subclass.
For MirrorMaker 2, you can use the AvroConverter followed by some transforms properties, but a ByteArrayConverter would still be preferred for a one-to-one byte copy.
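A minimal, hedged sketch of what that byte-for-byte MirrorMaker 2 setup might look like (cluster aliases, addresses, and the topic name are placeholders; check the MirrorMaker 2 docs for how your deployment reads its config):

clusters = source, target
source.bootstrap.servers = source-kafka:9092
target.bootstrap.servers = my-namespace.servicebus.windows.net:9093
source->target.enabled = true
source->target.topics = my-avro-topic

# Copy records byte-for-byte, leaving the Avro wire format untouched
key.converter = org.apache.kafka.connect.converters.ByteArrayConverter
value.converter = org.apache.kafka.connect.converters.ByteArrayConverter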

How can I pass a KafkaAvroSerializer into a Kafka ProducerRecord?

I have messages which are being streamed to Kafka. I would like to convert the messages to Avro binary format (that is, to encode them).
I'm using the Confluent Platform. I have a Kafka ProducerRecord[String,String] which sends the messages to the Kafka topic.
Can someone provide a (short) example? Or recommend a website with examples?
Does anyone know how I can pass an instance of a KafkaAvroSerializer into the KafkaProducer?
Can I use an Avro GenericRecord instance inside the ProducerRecord?
Kind regards
Nika
You need to use the KafkaAvroSerializer in your producer config for either serializer config (key and/or value), and also set the Schema Registry URL in the producer config (AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG).
That serializer will Avro-encode primitives and strings, but if you need complex objects, you could try adding Avro4s, for example. Otherwise, GenericRecord will work as well.
A Java example is here - https://docs.confluent.io/current/schema-registry/serializer-formatter.html
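Along the lines of that documentation, a minimal Java sketch (broker address, registry URL, topic, and the toy schema are placeholders):

import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import io.confluent.kafka.serializers.AbstractKafkaAvroSerDeConfig;
import io.confluent.kafka.serializers.KafkaAvroSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
props.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");

// A toy schema; in practice you would load yours from a file or the registry.
Schema schema = new Schema.Parser().parse(
    "{\"type\":\"record\",\"name\":\"User\",\"fields\":[{\"name\":\"name\",\"type\":\"string\"}]}");
GenericRecord user = new GenericData.Record(schema);
user.put("name", "Nika");

// The ProducerRecord value is the GenericRecord itself; the configured
// serializer Avro-encodes it and registers/looks up the schema in the registry.
try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
    producer.send(new ProducerRecord<>("users", "key1", user));
}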

Kafka topic with different format of data

I have written some Avro data to the topic “test-avro” using kafka-avro-console-producer.
Then I wrote some plain-text data to the same topic “test-avro” using kafka-console-producer.
After this, all the data in the topic got corrupted. Can anyone explain why this happened?
You simply cannot use the avro-console-consumer (or a consumer with an Avro deserializer) anymore to read those offsets, because it will assume all data in the topic is Avro and use Confluent's KafkaAvroDeserializer.
The plain console-producer pushes non-Avro-encoded UTF-8 strings using the StringSerializer, which does not match the wire format expected by the Avro deserializer.
The only ways to get past the bad records are to find out which offsets are bad and wait for them to expire from the topic, or to reset a consumer group to begin after those messages. Or you can always use the ByteArrayDeserializer and add conditional logic for parsing your messages to ensure no data loss; see the sketch after the tl;dr below.
tl;dr The producer and consumer must agree on the data format of the topic.
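A hedged Java sketch of that last approach (broker address, group id, and offset-reset policy are placeholders). It relies on the fact that Confluent-framed Avro records begin with magic byte 0x0 followed by a 4-byte schema id; anything else is treated here as a plain UTF-8 string:

import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "mixed-format-reader");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("test-avro"));
    ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
    for (ConsumerRecord<byte[], byte[]> rec : records) {
        byte[] value = rec.value();
        if (value != null && value.length > 5 && value[0] == 0x0) {
            // Looks like Confluent-framed Avro; hand off to an Avro decoder here.
        } else if (value != null) {
            String text = new String(value, StandardCharsets.UTF_8); // plain string message
        }
    }
}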

Kafka data types of messages

I was wondering what types of data we can have in Kafka topics.
As I know, at the application level these are key-value pairs, and the data can be of any type supported by the language.
For example, when we send some messages to the topic, could it be some JSON, Parquet files, or serialized data, or do we operate with the messages only as plain text?
Thanks for your help.
There are various message formats depending on whether you are talking about the APIs, the wire protocol, or the on-disk storage.
Some of these Kafka message formats are described in the docs here:
https://kafka.apache.org/documentation/#messageformat
Kafka has the concept of a Serializer/Deserializer, or SerDes (pronounced Sir-Deez).
https://en.m.wikipedia.org/wiki/SerDes
A Serializer is a function that takes any message and converts it into the byte array that is actually sent on the wire using the Kafka protocol.
A Deserializer does the opposite: it reads the raw message bytes portion of the Kafka wire protocol and re-creates a message as you want the receiving application to see it.
There are built-in SerDes libraries for Strings, Longs, ByteArrays, and ByteBuffers, and a wealth of community SerDes libraries for JSON, Protobuf, Avro, as well as application-specific message formats.
You can build your own SerDes libraries as well; see the following question and the sketch below it:
How to create Custom serializer in kafka?
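For instance, a minimal, hedged sketch of a custom value serializer (the Person class and its JSON encoding are illustrative, not from the original posts):

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

public class PersonSerializer implements Serializer<Person> {
    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public byte[] serialize(String topic, Person data) {
        // Convert the POJO to the byte array Kafka puts on the wire;
        // a matching Deserializer<Person> would reverse this on the consumer.
        try {
            return data == null ? null : mapper.writeValueAsBytes(data);
        } catch (JsonProcessingException e) {
            throw new SerializationException("Failed to serialize Person", e);
        }
    }
}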
On the topic it's always just serialized data. Serialization happens in the producer before sending, and deserialization happens in the consumer after fetching. Serializers and deserializers are pluggable, so, as you said, at the application level it's key-value pairs of any data type you want.

Deserialize Avro Data In Memory Using Python

We are working on connecting Storm with Kafka.
In our setup Kafka stores messages in Avro.
We are using a Storm wrapper called "Pyleus", and the Avro data comes into the bolt as a variable.
Question:
How do I deserialize Avro data held in a variable using any of the Python Avro modules out there? There are tons of examples for deserializing Avro from .avro files directly. However, our use case has a performance requirement, so we cannot first write to a file and then parse it.
Any help, documentation and/or example will be appreciated.
Assuming you have loaded your schema into 'schema' and you have the Avro data in 'raw_bytes', the below might help:
import io
import avro.io

bytes_reader = io.BytesIO(raw_bytes)           # wrap the raw bytes in a file-like buffer
decoder = avro.io.BinaryDecoder(bytes_reader)  # binary decoder over that buffer
reader = avro.io.DatumReader(schema)           # reader bound to the writer's schema
decoded_data = reader.read(decoder)            # returns the record as Python objects