Read XML message from Kafka topic in Spark Streaming - scala

I need to consume XML messages from a Kafka topic and load them into a Spark DataFrame within my foreachRDD block in streaming. How can I do that? I am able to consume JSON messages in my streaming job with spark.sqlContext.read.json(rdd); what is the analogous code for reading XML-formatted messages from Kafka? I am using Spark 2.2, Scala 2.11.8 and Kafka 0.10.
My XML messages will have about 400 fields (heavily nested), so I want to store them dynamically in a DataFrame inside the stream.foreachRDD { rdd => ... } block and then operate on that DataFrame.
Also, should I convert the XML into JSON or Avro before sending it to the topic at the producer end? Is it too heavy to send XML as-is, and would it be better to send JSON instead?
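For reference, here is a minimal sketch of one way this could look with the Databricks spark-xml package, assuming it is on the classpath (e.g. com.databricks:spark-xml_2.11). The "record" row tag and the wrapper method are placeholders, and the XmlReader builder API differs slightly between spark-xml versions:

```scala
import com.databricks.spark.xml.XmlReader
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.InputDStream

// `spark` is the SparkSession and `stream` is the
// InputDStream[ConsumerRecord[String, String]] from KafkaUtils.createDirectStream
def processXml(spark: SparkSession,
               stream: InputDStream[ConsumerRecord[String, String]]): Unit = {
  stream.foreachRDD { rdd =>
    val xmlRdd = rdd.map(_.value())          // the XML payload of each message
    if (!xmlRdd.isEmpty()) {
      // XmlReader infers the (heavily nested) schema from the data itself;
      // "record" is a hypothetical row tag for your messages
      val df = new XmlReader()
        .withRowTag("record")
        .xmlRdd(spark.sqlContext, xmlRdd)
      df.printSchema()
      // ... operate on the DataFrame here
    }
  }
}
```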

Related

Deserializing avro message in kdb

I am working on a problem which requires me to consume Avro-serialized messages from Kafka topics in a kdb+ service. Is it possible to deserialize Avro messages in kdb?

kafka direct stream and seekToEnd

In my Spark job I initialize Kafka stream with KafkaUtils.createDirectStream.
I read about the seekToEnd method on the Consumer. How can I apply it to the stream?
spark-streaming-kafka transitively includes kafka-clients, so you're welcome to initialize a raw consumer instance on your own and seek it.
Alternatively, if no consumer group exists yet, you can set auto.offset.reset=latest in your Kafka parameters (the equivalent of startingOffsets=latest in Structured Streaming).
Note: the Kafka Direct Stream API is deprecated as of Spark 2.4, and you should be using Structured Streaming instead.
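A minimal sketch of that alternative with the spark-streaming-kafka-0-10 integration; the broker address, group id and topic name are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("seek-to-end-example")
val ssc  = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  // a group id with no committed offsets, so auto.offset.reset applies
  "group.id"           -> "my-new-group",
  // start from the end of the topic, the DStream equivalent of seekToEnd
  "auto.offset.reset"  -> "latest"
)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq("my-topic"), kafkaParams)
)
```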

Apache spark streaming kafka API vs kinesis API

I have a Scala Spark application in which I need to switch between streaming from Kafka and Kinesis based on the application configuration.
Both of the Spark APIs for Kafka streaming (spark-streaming-kafka-0-10_2.11) and Kinesis streaming (spark-streaming-kinesis-asl_2.11) return an InputDStream when creating a stream, but the value types are different.
Creating a Kafka stream returns InputDStream[ConsumerRecord[String, String]],
whereas creating a Kinesis stream returns InputDStream[Array[Byte]].
Is there any API that returns a generic InputDStream irrespective of Kafka or Kinesis, so that my stream processing can have a generic implementation instead of separate code for Kafka and Kinesis?
I tried assigning both streams to an InputDStream[Any], but that did not work.
I'd appreciate any ideas on how this can be done.
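One common workaround for the type mismatch described above is to normalize both streams to a shared element type right after creation and keep the rest of the pipeline generic. A minimal sketch, assuming both payloads are UTF-8 encoded strings:

```scala
import java.nio.charset.StandardCharsets

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.DStream

// Kafka gives InputDStream[ConsumerRecord[String, String]],
// Kinesis gives InputDStream[Array[Byte]]; map both to DStream[String].
def fromKafka(kafkaStream: DStream[ConsumerRecord[String, String]]): DStream[String] =
  kafkaStream.map(_.value())

def fromKinesis(kinesisStream: DStream[Array[Byte]]): DStream[String] =
  kinesisStream.map(bytes => new String(bytes, StandardCharsets.UTF_8))

// The generic processing then only ever sees DStream[String]
def process(records: DStream[String]): Unit =
  records.foreachRDD { rdd =>
    // ... common business logic here
  }
```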

Kafka Consumer Vs Apache Flink

I did a PoC in which I read data from Kafka using Spark Streaming. However, our organization's standard practice is to read data from Apache Kafka with either Apache Flink or a plain Kafka consumer, so I need to replace Spark Streaming with one of those. In my use case, I need to read data from Kafka, filter the JSON data and put the fields into Cassandra, so the recommendation is to use a Kafka consumer rather than Flink or another streaming framework, as I don't really need to do any processing on the Kafka JSON data. I need your help to understand the questions below:
Using a Kafka consumer, can I achieve the same continuous data reads as with Spark Streaming or Flink?
Is a Kafka consumer sufficient for me, considering I need to read data from Kafka, deserialize it using an Avro schema, filter fields and write to Cassandra?
A Kafka consumer application can be created using the Kafka consumer API, right?
Are there any downsides in my case if I just use a Kafka consumer instead of Apache Flink?
First, let's take a look at Flink's Kafka Connector and Spark Streaming with Kafka: both of them use the Kafka Consumer API (either the simple API or the high-level API) internally to consume messages from Apache Kafka for their jobs.
So, regarding your questions:
Yes
Yes. However, if you use Spark, you can consider using the Spark Cassandra Connector, which helps to save data into Cassandra efficiently.
Right
As mentioned above, Flink also uses the Kafka consumer for its job. Moreover, it is a distributed stream and batch data processing framework, so it helps to process data efficiently after consuming from Kafka. In your case, to save data into Cassandra, you can consider using the Flink Cassandra Connector rather than coding it yourself.
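For the plain-consumer route, a minimal sketch of the continuous poll loop (broker address, topic, group id and the helpers mentioned in the comments are placeholders; this uses the poll(Duration) overload from newer kafka-clients, while older clients have an equivalent poll(long)):

```scala
import java.time.Duration
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.ByteArrayDeserializer
import scala.collection.JavaConverters._

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "cassandra-writer")
props.put("key.deserializer", classOf[ByteArrayDeserializer].getName)
props.put("value.deserializer", classOf[ByteArrayDeserializer].getName)

val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
consumer.subscribe(Collections.singletonList("my-topic"))

// The poll loop gives you the same continuous reads as a streaming job:
// poll, deserialize with your Avro schema, filter, write to Cassandra, repeat.
while (true) {
  val records = consumer.poll(Duration.ofMillis(500))
  for (record <- records.asScala) {
    // deserializeAvro / keepRecord / writeToCassandra are hypothetical helpers:
    // val row = deserializeAvro(record.value())
    // if (keepRecord(row)) writeToCassandra(row)
  }
}
```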

XML Processing using Spark Streaming and Kafka - Split single XML file content issue?

Since Spark Streaming consumes data in micro-batches, will there be an issue when consuming XML messages?
If I publish an entire XML file as a single message to a Kafka topic, is there a possibility that the XML file content gets split because of message size when it is consumed using Spark Streaming?
Will Spark Streaming consume a single XML message in two different DStream batches? Is this a valid scenario?
Please advise.