Deserializing avro message in kdb - apache-kafka

I am working on a problem which requires me to consume Avro-serialized messages from Kafka topics in a kdb+ service. Is it possible to deserialize Avro messages in kdb+?

Related

How to deserialize avro message using mirrormaker?

I want to replicate a Kafka topic to an Azure Event Hub.
The messages are in Avro format and use a schema that sits behind a Schema Registry with USER_INFO authentication.
Using a Java client to connect to Kafka, I can use a KafkaAvroDeserializer to deserialize the messages correctly.
But this configuration doesn't seem to work with MirrorMaker.
Is it possible to deserialize the Avro messages with MirrorMaker before sending them?
Cheers
For MirrorMaker1, the consumer deserializer properties are hard-coded.
Unless you plan on re-serializing the data into a different format when the producer sends data to EventHub, you should stick to using the default ByteArrayDeserializer.
If you did want to manipulate the messages in any way, that would need to be done with a MirrorMakerMessageHandler subclass.
For MirrorMaker2, you can use the AvroConverter followed by some transforms properties, but ByteArrayConverter would still be preferred for a one-to-one byte copy.
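As a rough sketch (the cluster aliases, addresses and topic pattern below are placeholders, and the exact property layout depends on how MM2 is deployed), a byte-for-byte replication setup could look like this:

    # mm2.properties - minimal sketch, assuming clusters aliased "source" and "eventhub"
    clusters = source, eventhub
    source.bootstrap.servers = source-broker:9092
    eventhub.bootstrap.servers = mynamespace.servicebus.windows.net:9093

    # replicate everything from source to the Event Hub namespace
    source->eventhub.enabled = true
    source->eventhub.topics = .*

    # keep records as raw bytes (ByteArrayConverter is the MM2 default),
    # so Avro payloads are copied one-to-one
    key.converter = org.apache.kafka.connect.converters.ByteArrayConverter
    value.converter = org.apache.kafka.connect.converters.ByteArrayConverter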

Failed with exception java.io.IOException:java.lang.ClassCastException

We have a Producer which writes data onto the topic with an Avro schema. The Kafka HDFS Connect sink writes the data from the topic to HDFS in Parquet format and creates an external table in Hive. It works fine in the Hive/Impala engines. So far so good. We cannot control the column ordering of the Avro schema on the Producer side. When the column order of the Avro schema changes on the Producer side, the following error comes up in Hive.
Failed with exception java.io.IOException:java.lang.ClassCastException: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveArrayInspector cannot be cast to org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector
Note: Impala is still fine even when the AVRO schema column order changes.

Why would an Apache Kafka consumer use a different version of the schema to deserialize a record than the one sent along with the data?

Let us assume I am using Avro serialization while sending data to Kafka.
While consuming a record from Apache Kafka, I get both the schema and the record. I can use the schema to parse the record. I do not understand the scenario in which a consumer would use a different version of the schema to deserialize the record. Can someone help?
The message is serialized with the unique id of a specific version of the schema when it is produced onto Kafka. The consumer then uses that unique schema id to deserialize it.
Taken from https://docs.confluent.io/current/schema-registry/avro.html#schema-evolution
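As an illustration (broker address, registry URL, topic and group id below are placeholders), a consumer wired up this way in Scala might look like the following sketch; the KafkaAvroDeserializer looks up whichever schema id is embedded in each message rather than assuming a fixed version:

    import java.util.{Collections, Properties}
    import org.apache.avro.generic.GenericRecord
    import org.apache.kafka.clients.consumer.KafkaConsumer
    import scala.collection.JavaConverters._

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "avro-demo")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    // the value deserializer reads the schema id embedded in every message
    // and fetches exactly that schema version from the registry
    props.put("value.deserializer", "io.confluent.kafka.serializers.KafkaAvroDeserializer")
    props.put("schema.registry.url", "http://localhost:8081")

    val consumer = new KafkaConsumer[String, GenericRecord](props)
    consumer.subscribe(Collections.singletonList("my-topic"))
    val records = consumer.poll(java.time.Duration.ofSeconds(1))
    for (r <- records.asScala) println(r.value())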

Is it possible to use Flume Kafka Source by itself?

Let's say there are a bunch of producers writing Avro records (that all have the same schema) into a Kafka topic.
Can I use a Flume Kafka Source to read those records and write them to HDFS, even though the records were not published using a Flume Sink?
Yes, you can. In general, what a Kafka consumer can or cannot do is fully independent of who produced the data; it depends only on the format in which the data has been encoded.
That is the whole point of Kafka and of an Enterprise Service Bus.
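For example, an agent configuration roughly like the one below would do it (the agent, channel, topic and HDFS path names are made up); whatever bytes the producers wrote, Avro or otherwise, simply flow through as event bodies:

    agent.sources = kafkaSrc
    agent.channels = memCh
    agent.sinks = hdfsSink

    agent.sources.kafkaSrc.type = org.apache.flume.source.kafka.KafkaSource
    agent.sources.kafkaSrc.kafka.bootstrap.servers = broker1:9092
    agent.sources.kafkaSrc.kafka.topics = my-avro-topic
    agent.sources.kafkaSrc.channels = memCh

    agent.channels.memCh.type = memory

    agent.sinks.hdfsSink.type = hdfs
    agent.sinks.hdfsSink.channel = memCh
    agent.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/avro-events
    agent.sinks.hdfsSink.hdfs.fileType = DataStream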

Read XML message from Kafka topic in Spark Streaming

I need to consume XML messages from a Kafka topic and load them into a Spark DataFrame within my foreachRDD block in streaming. How can I do that? I am able to consume JSON messages in my streaming job by doing spark.sqlContext.read.json(rdd). What is the analogous code for reading XML-format messages from Kafka? I am using Spark 2.2, Scala 2.11.8 and Kafka 0.10.
My XML messages will have about 400 fields (heavily nested) so I want to dynamically store them in a DF in the stream.foreachRDD { rdd => ... } block and then operate on the DF.
Also, should I convert the XML into JSON or Avro before sending it to the topic at the producer end? Is it heavy to send XML as-is, and is it better to send JSON instead?
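One possible approach (a sketch only, assuming the Databricks spark-xml package is on the classpath; the row tag "record" and the stream variable are placeholders, the batch RDD is assumed to already be an RDD[String] of XML payloads, e.g. after mapping record.value() from the Kafka records, and the exact XmlReader signature varies between spark-xml versions) is to parse each batch with XmlReader, which is roughly the analogue of read.json(rdd):

    import com.databricks.spark.xml.XmlReader

    stream.foreachRDD { rdd =>            // rdd: RDD[String] of XML documents
      if (!rdd.isEmpty) {
        val df = new XmlReader()
          .withRowTag("record")           // hypothetical root element of each message
          .xmlRdd(spark.sqlContext, rdd)  // analogous to spark.sqlContext.read.json(rdd)
        df.printSchema()
        // operate on df here
      }
    }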