I am working on a problem which requires me to consume Avro-serialized messages from Kafka topics in a kdb+ service. Is it possible to deserialize Avro messages in kdb+?
In my Spark job I initialize a Kafka stream with KafkaUtils.createDirectStream.
I read about the seekToEnd method on the Consumer. How can I apply it to the stream?
spark-kafka transitively includes kafka-clients, so you're welcome to initialize a raw consumer instance on your own and seek it (see the sketch below).
Alternatively, if no committed offsets exist for the consumer group, you can set auto.offset.reset=latest in the Kafka parameters (or startingOffsets=latest if you move to Structured Streaming) so reading starts from the end of the topic.
Note: the Kafka Direct Stream API is deprecated as of Spark 2.4, and you should be using Structured Streaming instead.
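A minimal sketch of the first approach, assuming the spark-streaming-kafka-0-10 integration and placeholder broker, topic and group names: the raw kafka-clients consumer is used only to look up the end offsets, which are then handed to createDirectStream via the Assign strategy.

```scala
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.KafkaUtils
import scala.collection.JavaConverters._

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",            // placeholder broker
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "my-group"                    // placeholder group
)
val topic = "events"                                    // placeholder topic

// 1) Use a raw kafka-clients consumer only to discover the end offsets.
val probe = new KafkaConsumer[String, String](kafkaParams.asJava)
val partitions = probe.partitionsFor(topic).asScala
  .map(p => new TopicPartition(p.topic, p.partition))
probe.assign(partitions.asJava)
probe.seekToEnd(partitions.asJava)                      // jump to the tail of each partition
val endOffsets = partitions.map(tp => tp -> probe.position(tp)).toMap
probe.close()

// 2) Hand those offsets to the direct stream so it starts reading from the end.
val ssc = new StreamingContext(
  new SparkConf().setAppName("seek-to-end-demo").setMaster("local[2]"), Seconds(5))
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Assign[String, String](partitions, kafkaParams, endOffsets)
)
```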
I have a Scala Spark application in which I need to switch between streaming from Kafka and Kinesis based on the application configuration.
Both Spark APIs, Kafka streaming (spark-streaming-kafka-0-10_2.11) and Kinesis streaming (spark-streaming-kinesis-asl_2.11), return an InputDStream when creating a stream, but the value types are different.
Creating a Kafka stream returns InputDStream[ConsumerRecord[String, String]],
whereas creating a Kinesis stream returns InputDStream[Array[Byte]].
Is there any API that returns a generic InputDStream irrespective of Kafka or Kinesis, so that my stream processing can have a single generic implementation instead of separate code paths for Kafka and Kinesis?
I tried assigning both streams to an InputDStream[Any], but that did not work.
I would appreciate any ideas on how this can be done.
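One possible approach (not an existing unified API, just a sketch): since both stream types support map, convert each source to a common element type right after creation and write the shared logic against a plain DStream. The names useKafka, kafkaStream and kinesisStream below are placeholders for values produced by your configuration and by KafkaUtils / KinesisInputDStream.

```scala
import java.nio.charset.StandardCharsets
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.streaming.dstream.{DStream, InputDStream}

// Shared processing, written once against a plain DStream[String].
def process(records: DStream[String]): Unit =
  records.foreachRDD { rdd =>
    rdd.take(10).foreach(println)              // stand-in for the real logic
  }

// Normalise each source to DStream[String] right after creation.
def fromKafka(stream: InputDStream[ConsumerRecord[String, String]]): DStream[String] =
  stream.map(_.value)

def fromKinesis(stream: InputDStream[Array[Byte]]): DStream[String] =
  stream.map(bytes => new String(bytes, StandardCharsets.UTF_8))

// Wiring, chosen from the application configuration. By-name parameters
// mean only the configured source is actually created.
def run(useKafka: Boolean,
        kafkaStream: => InputDStream[ConsumerRecord[String, String]],
        kinesisStream: => InputDStream[Array[Byte]]): Unit = {
  val unified: DStream[String] =
    if (useKafka) fromKafka(kafkaStream) else fromKinesis(kinesisStream)
  process(unified)
}
```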
I did a PoC in which I read data from Kafka using Spark Streaming. But our organization uses either Apache Flink or a Kafka consumer to read data from Apache Kafka as its standard process, so I need to replace the Spark Kafka streaming with a Kafka consumer or Apache Flink. In my use case I need to read data from Kafka, filter the JSON data and put the fields into Cassandra, so the recommendation is to use a Kafka consumer rather than Flink or other streaming frameworks, as I don't really need to do any processing with the Kafka JSON data. So I need your help to understand the questions below:
1. Using a Kafka consumer, can I achieve the same continuous data read as with Spark Streaming or Flink?
2. Is a Kafka consumer sufficient for me, considering I need to read data from Kafka, deserialize it using an Avro schema, filter fields and put them into Cassandra?
3. A Kafka consumer application can be created using the Kafka consumer API, right?
4. Are there any downsides in my case if I just use a Kafka consumer instead of Apache Flink?
First, let's take a look at Flink's Kafka connector and Spark Streaming with Kafka: both of them use the Kafka consumer API (either the simple API or the high-level API) internally to consume messages from Apache Kafka for their jobs.
So, regarding your questions:
1. Yes. A plain consumer's poll() loop keeps reading for as long as the application runs, so consumption is just as continuous (see the sketch after this list).
2. Yes. However, if you use Spark, you can consider using the Spark Cassandra Connector, which helps save data into Cassandra efficiently.
3. Right.
4. As mentioned above, Flink also uses the Kafka consumer for its jobs. Moreover, Flink is a distributed stream and batch data processing framework, which helps process data efficiently after it is consumed from Kafka. In your case, to save data into Cassandra, you can consider using the Flink Cassandra Connector rather than coding it yourself.
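For reference, a minimal sketch of such a consumer loop, assuming JSON string values (for Avro you would plug in an Avro deserializer instead), the DataStax Java driver 4.x for Cassandra, and placeholder broker, topic, table and filter logic:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import com.datastax.oss.driver.api.core.CqlSession
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.collection.JavaConverters._

// Consumer configuration. Broker, group id and topic are placeholders.
val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "cassandra-writer")
props.put("enable.auto.commit", "false")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events"))

// Cassandra session and a prepared insert into a hypothetical table.
val session = CqlSession.builder().build()
val insert = session.prepare(
  "INSERT INTO my_keyspace.events (id, payload) VALUES (?, ?)")

// The loop below runs until the process is stopped, so consumption is
// continuous; each poll() returns whatever records arrived since the last one.
while (true) {
  val records = consumer.poll(Duration.ofMillis(500)).asScala
  for (record <- records if record.value.contains("\"type\":\"order\"")) // stand-in filter
    session.execute(insert.bind(record.key, record.value))
  consumer.commitSync()          // commit offsets after the batch is written
}
```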
Since Spark Streaming consumes data in micro-batches, will there be an issue when consuming XML messages?
If I publish an entire XML file as a single message to a Kafka topic, is there a possibility that the XML content gets split because of the message size when it is consumed with Spark Streaming?
Will Spark Streaming consume a single XML message across two different DStreams? Is this a valid scenario?
Please advise.
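For context: a Kafka record is handed to the consumer as one whole ConsumerRecord, so a micro-batch contains whole records and the content of one record is not split across batches or DStreams. Below is a small sketch of the publishing side described above, sending an entire XML file as one record; file path, topic and broker are placeholders, and max.request.size on the producer (plus message.max.bytes on the broker) caps how large that record can be.

```scala
import java.nio.file.{Files, Paths}
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")    // placeholder broker
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
// Raise this (and the broker-side message.max.bytes) if the file exceeds the default ~1 MB.
props.put("max.request.size", "10485760")

// Read the whole XML file and send it as the value of a single record.
val xml = new String(Files.readAllBytes(Paths.get("/data/order.xml")), "UTF-8")
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord("xml-topic", "order-1", xml))
producer.close()
```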