I have a Scala Spark application in which I need to switch between streaming from Kafka and Kinesis based on the application configuration.
Both Spark APIs, for Kafka streaming (spark-streaming-kafka-0-10_2.11) and for Kinesis streaming (spark-streaming-kinesis-asl_2.11), return an InputDStream when creating a stream, but the value types are different.
Creating a Kafka stream returns InputDStream[ConsumerRecord[String, String]],
whereas creating a Kinesis stream returns InputDStream[Array[Byte]].
Is there any API that returns a generic InputDStream irrespective of Kafka or Kinesis, so that my stream processing can have a single generic implementation instead of separate code paths for Kafka and Kinesis?
I tried assigning both streams to an InputDStream[Any], but that did not work.
I'd appreciate any ideas on how this can be done.
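For illustration, here is roughly what the two creation paths look like side by side. This is a sketch, not my exact code; ssc, the Kafka parameters, and the Kinesis names below are placeholders.

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.kinesis.KinesisInputDStream

// ssc is the application's existing StreamingContext.
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "my-app")

// Kafka path: element type is ConsumerRecord[String, String].
val kafkaStream: InputDStream[ConsumerRecord[String, String]] =
  KafkaUtils.createDirectStream[String, String](
    ssc,
    LocationStrategies.PreferConsistent,
    ConsumerStrategies.Subscribe[String, String](Seq("my-topic"), kafkaParams))

// Kinesis path: element type is Array[Byte].
val kinesisStream: InputDStream[Array[Byte]] =
  KinesisInputDStream.builder
    .streamingContext(ssc)
    .streamName("my-stream")
    .checkpointAppName("my-app")
    .regionName("us-east-1")
    .build()

// What I would like is a single DStream with one element type (e.g. DStream[String],
// via kafkaStream.map(_.value) vs. kinesisStream.map(new String(_, "UTF-8"))),
// without duplicating all of the downstream processing per source.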
Related
In my Spark job I initialize the Kafka stream with KafkaUtils.createDirectStream.
I read about the seekToEnd method of the Consumer. How can I apply it to the stream?
spark-streaming-kafka transitively includes kafka-clients, so you're welcome to initialize a raw consumer instance on your own and seek it.
Alternatively, if no committed offsets exist for your consumer group, you can simply start from the latest offsets (auto.offset.reset=latest in the Kafka params for the direct stream, or startingOffsets=latest in Structured Streaming).
Note: the Kafka direct stream API is deprecated as of Spark 2.4, and you should be using Structured Streaming instead.
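As a sketch of the first suggestion: use a throwaway raw consumer to look up the current end offsets (which is what seekToEnd amounts to) and seed createDirectStream with them. The topic name, Kafka params, and ssc below are placeholders for your own job.

import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

val topic = "events"
val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "localhost:9092",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "my-group")

// Raw consumer (available via the transitive kafka-clients dependency), used only
// to discover the partitions and their current end offsets.
val probe = new KafkaConsumer[String, String](kafkaParams.asJava)
val partitions = probe.partitionsFor(topic).asScala
  .map(p => new TopicPartition(p.topic, p.partition))
val endOffsets = probe.endOffsets(partitions.asJava).asScala
  .map { case (tp, off) => tp -> off.longValue }
  .toMap
probe.close()

// ssc is the existing StreamingContext. Starting from endOffsets is effectively "seekToEnd".
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](Seq(topic), kafkaParams, endOffsets))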
I need to consume XML messages from a Kafka topic and load them into a Spark DataFrame within the foreachRDD block of my streaming job. How can I do that? I am able to consume JSON messages in my streaming job by doing spark.sqlContext.read.json(rdd); what is the analogous code for reading XML-format messages from Kafka? I am using Spark 2.2, Scala 2.11.8, and Kafka 0.10.
My XML messages will have about 400 fields (heavily nested), so I want to dynamically load them into a DataFrame inside the stream.foreachRDD { rdd => ... } block and then operate on that DataFrame.
Also, should I convert the XML to JSON or Avro at the producer end before sending it to the topic? Is it heavy to send the XML as-is, and is it better to send JSON instead?
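For reference, one direction I'm considering is the spark-xml package (com.databricks:spark-xml_2.11); the sketch below is roughly what that would look like, where the "record" row tag and the ConsumerRecord value access are assumptions about my message format.

import com.databricks.spark.xml.XmlReader

stream.foreachRDD { rdd =>
  // Each Kafka record's value is assumed to be one complete XML document.
  val xmlStrings = rdd.map(_.value)
  if (!xmlStrings.isEmpty()) {
    // XmlReader infers the (possibly deeply nested) schema from the data,
    // analogous to spark.sqlContext.read.json(rdd) for JSON.
    val df = new XmlReader()
      .withRowTag("record")          // hypothetical root element of each message
      .xmlRdd(spark.sqlContext, xmlStrings)
    df.printSchema()
    // ... operate on df ...
  }
}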
I am using Apache Spark Streaming with a TCP connector to receive data.
I have a Python application that connects to a sensor and creates a TCP server that waits for a connection from Apache Spark, then sends JSON data through this socket.
How can I manage to join many independent sensor sources so they send data to the same receiver on Apache Spark?
It seems like you need message-oriented middleware (MOM) or a Kafka cluster to handle your real-time data feeds. Your message producers can send to a Kafka topic, and Spark Streaming can receive from that topic. That way you decouple your producers from the receiver. Kafka scales linearly, and using it with the Spark Streaming Kafka direct-stream approach plus back-pressure gives you good failover resiliency.
If you choose another MOM, you can use Spark's receiver-based approach and union multiple streams to scale up, as sketched below.
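A minimal sketch of the receiver-union approach; the host/port pairs and the batch interval are placeholders (note that each socket receiver occupies one executor core).

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One entry per independent sensor TCP server (placeholders).
val sensorEndpoints = Seq(("sensor-host-1", 9001), ("sensor-host-2", 9002), ("sensor-host-3", 9003))

val conf = new SparkConf().setAppName("sensor-union")
val ssc = new StreamingContext(conf, Seconds(5))

// One socket receiver per sensor, then union them into a single DStream of JSON lines.
val perSensorStreams = sensorEndpoints.map { case (host, port) => ssc.socketTextStream(host, port) }
val allSensors = ssc.union(perSensorStreams)

allSensors.foreachRDD { rdd =>
  // parse the JSON lines here and process them as one logical stream
  println(s"received ${rdd.count()} records in this batch")
}

ssc.start()
ssc.awaitTermination()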
I did a PoC in which I read data from Kafka using Spark Streaming. However, our organization uses either Apache Flink or a plain Kafka consumer to read data from Apache Kafka as its standard process, so I need to replace Spark Streaming with a Kafka consumer or Apache Flink. In my use case I need to read data from Kafka, filter the JSON data, and put selected fields into Cassandra, so the recommendation is to use a Kafka consumer rather than Flink or another streaming framework, as I don't really need to do any processing on the Kafka JSON data. I need your help to understand the questions below:
1. Using a Kafka consumer, can I achieve the same continuous data read as with Spark Streaming or Flink?
2. Is a Kafka consumer sufficient for me, considering I need to read data from Kafka, deserialize it using an Avro schema, filter fields, and put them in Cassandra?
3. A Kafka consumer application can be created using the Kafka consumer API, right?
4. Are there any downsides in my case if I just use a Kafka consumer instead of Apache Flink?
First, let's take a look at Flink's Kafka connector and Spark Streaming with Kafka: both of them use the Kafka consumer API (either the simple API or the high-level API) internally to consume messages from Apache Kafka.
So, regarding your questions:
1. Yes.
2. Yes. However, if you use Spark, you can consider using the Spark Cassandra Connector, which helps save data into Cassandra efficiently.
3. Right.
4. As mentioned above, Flink also uses the Kafka consumer for its jobs. Moreover, it is a distributed stream and batch data processing engine, which helps you process data efficiently after consuming it from Kafka. In your case, to save data into Cassandra, you can consider using the Flink Cassandra Connector rather than coding it yourself.
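To make points 2 and 3 concrete, a plain consumer application is essentially a poll loop. A minimal sketch follows; the topic, keyspace/table, and filter condition are placeholders, and the Avro decoding is only indicated in a comment.

import java.time.Duration
import java.util.Properties
import scala.collection.JavaConverters._
import com.datastax.driver.core.Cluster
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "kafka-to-cassandra")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Seq("events").asJava)

val session = Cluster.builder().addContactPoint("127.0.0.1").build().connect("my_keyspace")
val insert = session.prepare("INSERT INTO events_by_id (id, payload) VALUES (?, ?)")

while (true) {
  // poll(Duration) assumes kafka-clients 2.x; older clients use poll(long).
  val records = consumer.poll(Duration.ofMillis(500)).asScala
  for (record <- records) {
    // In the real job, deserialize the Avro payload here (e.g. with Confluent's
    // KafkaAvroDeserializer or a GenericDatumReader) before filtering.
    if (record.value.contains("\"type\":\"sensor\"")) {   // placeholder filter on the JSON
      session.execute(insert.bind(record.key, record.value))
    }
  }
  consumer.commitSync()   // commit only after the writes succeed -> at-least-once
}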
I want to set up Flink so that it transforms and redirects data streams from Apache Kafka to MongoDB. For testing purposes I'm building on top of the flink-streaming-connectors.kafka example (https://github.com/apache/flink).
The Kafka streams are being read properly by Flink, and I can map them etc., but the problem occurs when I want to save each received and transformed message to MongoDB. The only example of MongoDB integration I've found is flink-mongodb-test from GitHub. Unfortunately, it uses a static data source (a database), not a DataStream.
I believe there should be some DataStream.addSink implementation for MongoDB, but apparently there isn't.
What would be the best way to achieve this? Do I need to write a custom sink function, or am I missing something? Maybe it should be done in a different way?
I'm not tied to any solution, so any suggestion would be appreciated.
Below is an example of exactly what I'm getting as input and what I need to store as output.
Apache Kafka Broker <-------------- "AAABBBCCCDDD" (String)
Apache Kafka Broker --------------> Flink: DataStream<String>
Flink: DataStream.map({
return ("AAABBBCCCDDD").convertTo("A: AAA; B: BBB; C: CCC; D: DDD")
})
.rebalance()
.addSink(MongoDBSinkFunction); // store the row in MongoDB collection
As you can see in this example, I'm using Flink mostly to buffer Kafka's message stream and do some basic parsing.
As an alternative to Robert Metzger's answer, you can write your results back to Kafka and then use one of the maintained Kafka connectors to drop the content of a topic into your MongoDB database.
Kafka -> Flink -> Kafka -> Mongo/Anything
With this approach you can maintain at-least-once semantics.
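A rough sketch of the Flink leg of that pipeline, assuming the Kafka 0.10 connector from that era; the topic names, brokers, and toy parsing are placeholders, and a Kafka Connect MongoDB sink connector would then drain the output topic into your collection.

import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer010, FlinkKafkaProducer010}
import org.apache.flink.streaming.util.serialization.SimpleStringSchema

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.enableCheckpointing(5000)   // checkpointing is what gives you at-least-once end to end

val props = new Properties()
props.setProperty("bootstrap.servers", "localhost:9092")
props.setProperty("group.id", "flink-transform")

val source = env.addSource(new FlinkKafkaConsumer010[String]("raw-topic", new SimpleStringSchema(), props))

// Toy version of the parsing from the question: "AAABBBCCCDDD" -> "A: AAA; B: BBB; C: CCC; D: DDD"
val transformed = source.map { s =>
  s.grouped(3).zip(Iterator("A", "B", "C", "D")).map { case (v, k) => s"$k: $v" }.mkString("; ")
}

// Write the results back to Kafka; the MongoDB sink connector then consumes "transformed-topic".
transformed.addSink(new FlinkKafkaProducer010[String]("localhost:9092", "transformed-topic", new SimpleStringSchema()))

env.execute("kafka-to-kafka-to-mongo")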
There is currently no Streaming MongoDB sink available in Flink.
However, there are two ways of writing data into MongoDB:
Use the DataStream.write() call of Flink. It allows you to use any OutputFormat (from the batch API) with streaming. Using Flink's HadoopOutputFormatWrapper, you can use the official MongoDB Hadoop connector.
Implement the sink yourself. Implementing sinks is quite easy with the Streaming API, and I'm sure MongoDB has a good Java client library.
Neither approach provides any sophisticated processing guarantees. However, when you're using Flink with Kafka (and checkpointing enabled) you'll have at-least-once semantics: in an error case, the data is streamed again to the MongoDB sink.
If you're doing idempotent updates, redoing these updates shouldn't cause any inconsistencies.
If you really need exactly-once semantics for MongoDB, you should probably file a JIRA in Flink and discuss with the community how to implement this.
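For the second approach, a minimal hand-written sink could look like the sketch below, assuming the plain mongo-java-driver is on the classpath; host, database, and collection names are placeholders.

import com.mongodb.MongoClient
import com.mongodb.client.MongoCollection
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
import org.bson.Document

class MongoDBSink(host: String, db: String, collection: String) extends RichSinkFunction[Document] {

  // The MongoDB client is not serializable, so it is created per task in open().
  @transient private var client: MongoClient = _
  @transient private var coll: MongoCollection[Document] = _

  override def open(parameters: Configuration): Unit = {
    client = new MongoClient(host)
    coll = client.getDatabase(db).getCollection(collection)
  }

  override def invoke(value: Document): Unit = {
    // One insert per record; use an upsert keyed on a business id if you want
    // replayed records (at-least-once) to stay idempotent.
    coll.insertOne(value)
  }

  override def close(): Unit = {
    if (client != null) client.close()
  }
}

// Usage on the stream from the question, e.g.:
// parsedStream.map(s => new Document("payload", s)).addSink(new MongoDBSink("localhost", "mydb", "events"))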