Spark Streaming Update on Kafka Direct Stream parameter - scala

I have the following code:
// Set basic Spark parameters
val conf = new SparkConf()
  .setAppName("Cartographer_jsonInsert")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(2))
val messagesDStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, Tuple4[String, Int, Long, String]](
  ssc, getKafkaBrokers, getKafkaTopics("processed"),
  (mmd: MessageAndMetadata[String, String]) => {
    (mmd.topic, mmd.partition, mmd.offset, mmd.message().toString)
  })
getKafkaBrokers and getKafkaTopics call an API that checks the database for new topics as we add them to our system. Does the StreamingContext re-evaluate these variables on each iteration while it is running? That is, will messagesDStream ever be re-created with the new values?
It does not look like it does. Is there any way to make that happen?

Tathagata Das, one of the creators of Spark Streaming, answered a similar question on the Spark user list regarding modifications of existing DStreams:
Currently Spark Streaming does not support addition/deletion/modification of DStream after the streaming context has been started.
Nor can you restart a stopped streaming context.
Also, multiple spark contexts (and therefore multiple streaming contexts) cannot be run concurrently in the same JVM.
I don't see a straightforward way of implementing this with Spark Streaming, as you have no way of updating the DStream graph once it is running. You need much more control than is currently available. One option may be a solution based on Reactive Kafka, the Akka Streams connector for Kafka, or any other streaming solution where you control the source.
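For what it's worth, one workaround that stays within the restrictions quoted above is to stop the running streaming context (keeping the SparkContext alive) and build a brand-new one whenever the topic list changes. Below is a minimal sketch of just the decision logic, written without Spark so it runs on its own. RunningStream, TopicWatcher and reconcile are hypothetical names; in a real job, start would build a new StreamingContext and direct stream on the shared SparkContext, and stop() would call ssc.stop(stopSparkContext = false, stopGracefully = true).

```scala
// Hypothetical stand-in for a started StreamingContext plus the topics it was built with.
// Assumption: in a Spark job, stop() would wrap
// ssc.stop(stopSparkContext = false, stopGracefully = true).
trait RunningStream {
  def topics: Set[String]
  def stop(): Unit
}

object TopicWatcher {
  /** Decide which stream should be running after one poll of the topic source.
    * If the topic set is unchanged, the current stream is returned untouched.
    * Otherwise the current stream (if any) is stopped and a new one is started.
    */
  def reconcile(
      current: Option[RunningStream],
      latestTopics: Set[String],
      start: Set[String] => RunningStream
  ): RunningStream = current match {
    case Some(running) if running.topics == latestTopics => running
    case Some(running) =>
      running.stop()        // a stopped context cannot be restarted ...
      start(latestTopics)   // ... so build a completely new one
    case None => start(latestTopics)
  }
}
```

The driver would call reconcile on a timer with the result of getKafkaTopics. Note that a restart like this drops any in-flight batch state, so offsets would have to be tracked separately if no messages may be lost or replayed.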

Is there any reason you are not using an Akka Streams graph with reactive-kafka (https://github.com/akka/reactive-kafka)? It makes it easy to build a reactive stream in which a source is given the topic, a flow processes the messages, and a sink receives the result.
I have built a sample application that uses this approach: https://github.com/asethia/akka-streaming-graph

Related

Kafka Spark Stream save directly to Redis

I am using Scala to read a Kafka stream and want to insert the data directly into Redis. What is the best strategy for doing so?
val kafkaStream = KafkaUtils.createStream(ssc, "192.168.0.40:2181", "group", topics, StorageLevel.MEMORY_AND_DISK)
Earlier I tried to use https://github.com/debasishg/scala-redis, but it does not work with Spark, so I had to collect the RDD and then save the records into Redis, which creates a lot of overhead in my project. I am looking for a solution where I can push these message strings directly into Redis while also maintaining the ZScore.
Thanks,

Saving a streaming DataFrame to MongoDB using MongoSpark

Some backstory: For a homework project for university we are tasked to implement an algorithm of choice in a scalable way. We chose to use Scala, Spark, MongoDB and Kafka as these were recommended during the course. To read data from our MongoDB, we opted to use MongoSpark as it allows for easy and scalable operations on data. We also use Kafka to simulate streaming from an outside source. We need to perform multiple operations on every entry that Kafka produces. The issue comes from saving the result of this data back to MongoDB.
We have the following code:
val streamDF = sparkSession
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "aTopic")
  .load
  .selectExpr("CAST(value AS STRING)")
From here on, we're at a loss. We cannot use a .map as MongoSpark only operates on DataFrames, Datasets and RDDs and is not serializable, and using MongoSpark.save does not work on streaming DataFrames like the one specified. We also cannot use the default MongoDB Scala driver as this conflicts with MongoSpark upon adding the dependency. Note that the rest of the algorithm heavily relies on joins and groupbys.
How can we get the data from here to our MongoDB?
Edit:
For an easy to reproduce example, one could try the following:
val streamDF = sparkSession
  .readStream
  .format("rate")
  .load
Adding a .write to that, which is required for MongoSpark.save, will cause an exception because write cannot be called on a streaming DataFrame.
The save() method of the MongoDB Connector for Spark accepts an RDD (as of the current version, 2.2). When using a DStream with MongoSpark, you need to fetch the batches of RDDs in the stream and write each one.
wordCounts.foreachRDD({ rdd =>
  import spark.implicits._
  val wordCounts = rdd.map({ case (word: String, count: Int)
    => WordCount(word, count) }).toDF()
  wordCounts.write.mode("append").mongo()
})
See also:
Design Patterns for using foreachRDD
MongoDB: Spark Streaming

Using Kafka to communicate between long running Spark jobs

I am new to Apache Spark and have a need to run several long-running processes (jobs) on my Spark cluster at the same time. Often, these individual processes (each of which is its own job) will need to communicate with each other. Tentatively, I'm looking at using Kafka to be the broker in between these processes. So the high-level job-to-job communication would look like:
Job #1 does some work and publishes message to a Kafka topic
Job #2 is set up as a streaming receiver (using a StreamingContext) to that same Kafka topic, and as soon as the message is published to the topic, Job #2 consumes it
Job #2 can now do some work, based on the message it consumed
From what I can tell, streaming contexts are blocking listeners that run on the Spark Driver node. This means that once I start the streaming consumer like so:
def createKafkaStream(ssc: StreamingContext,
    kafkaTopics: String, brokers: String): DStream[(String, String)] = {
  // some configs here
  KafkaUtils.createDirectStream[String, String, StringDecoder,
    StringDecoder](ssc, props, topicsSet)
}
def consumerHandler(): StreamingContext = {
  val ssc = new StreamingContext(sc, Seconds(10))
  createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092").foreachRDD(rdd => {
    rdd.collect().foreach { msg =>
      // Now do some work as soon as we receive a message from the topic
    }
  })
  ssc
}
StreamingContext.getActive.foreach {
  _.stop(stopSparkContext = false)
}
val ssc = StreamingContext.getActiveOrCreate(consumerHandler)
ssc.start()
ssc.awaitTermination()
...that there are now 2 implications:
The Driver is now blocking and listening for work to consume from Kafka; and
When work (messages) is received, it is sent to any available Worker Nodes to actually be executed
So first, if anything that I've said above is incorrect or misleading, please begin by correcting me! Assuming I'm more or less correct, I'm simply wondering if there is a more scalable or performant way to accomplish this, given my criteria. Again, I have two long-running jobs (Job #1 and Job #2) that are running on my Spark nodes, and one of them needs to be able to 'send work to' the other one. Any ideas?
From what I can tell, streaming contexts are blocking listeners that
run on the Spark Driver node.
A StreamingContext (singular) isn't a blocking listener. Its job is to create the graph of execution for your streaming job.
When you start reading from Kafka, you specify that you want to fetch new records every 10 seconds. What happens from then on depends on which Kafka abstraction you're using: either the receiver-based approach via KafkaUtils.createStream, or the receiver-less approach via KafkaUtils.createDirectStream.
In general, with both approaches, data is consumed from Kafka and then dispatched to the Spark workers to be processed in parallel.
then I'm simply wondering if there is a more scalable or performant
way to accomplish this
This approach is highly scalable. When using the receiver-less approach, each Kafka partition maps to a Spark partition in a given RDD. You can increase parallelism either by increasing the number of partitions in Kafka or by repartitioning the data inside Spark (using DStream.repartition). I suggest testing this setup to determine whether it meets your performance requirements.

Stream the most recent data in cassandra with spark streaming

I continuously have data being written to cassandra from an outside source.
Now, I am using spark streaming to continuously read this data from cassandra with the following code:
val ssc = new StreamingContext(sc, Seconds(5))
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
val dstream = new ConstantInputDStream(ssc, cassandraRDD)
dstream.foreachRDD { rdd =>
  println("\n"+rdd.count())
}
ssc.start()
ssc.awaitTermination()
sc.stop()
However, the following line:
val cassandraRDD = ssc.cassandraTable("keyspace2", "feeds")
takes the entire table data from Cassandra every time, not just the newest data saved into the table.
What I want to do is have spark streaming read only the latest data, ie, the data added after its previous read.
How can I achieve this? I tried to Google this but got very little documentation regarding this.
I am using spark 1.4.1, scala 2.10.4 and cassandra 2.1.12.
Thanks!
EDIT:
The suggested duplicate question (asked by me) is NOT a duplicate, because it talks about connecting spark streaming and cassandra and this question is about streaming only the latest data. BTW, streaming from cassandra IS possible by using the code I provided. However, it takes the entire table every time and not just the latest data.
There will be some low-level work on Cassandra that will allow external systems (an indexer, a Spark stream, etc.) to be notified of new mutations coming into Cassandra. See: https://issues.apache.org/jira/browse/CASSANDRA-8844
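Until something like that is available, a common stopgap (not part of the ticket above, just a different technique) is to keep a high-water mark and ask only for rows written after it. Here is a minimal sketch of the bookkeeping in plain Scala, assuming the table carries a write timestamp that only ever increases. Feed, writtenAtMillis and IncrementalRead are hypothetical names, and the in-memory Seq stands in for the table.

```scala
// Hypothetical row shape. Assumption: every row carries a write timestamp
// that increases over time (e.g. set by the writer on insert).
final case class Feed(id: String, writtenAtMillis: Long)

object IncrementalRead {
  /** Return the rows written after `lastSeen`, together with the new high-water mark.
    * If nothing new arrived, the mark is left unchanged.
    */
  def newerThan(rows: Seq[Feed], lastSeen: Long): (Seq[Feed], Long) = {
    val fresh = rows.filter(_.writtenAtMillis > lastSeen)
    val mark  = if (fresh.isEmpty) lastSeen else fresh.map(_.writtenAtMillis).max
    (fresh, mark)
  }
}
```

In a real job the filter should be pushed down to Cassandra rather than applied after a full scan, which as far as I know requires the timestamp to be a clustering column or otherwise queryable, and the RDD would have to be rebuilt for each batch with the current mark. Rows that arrive late with an older timestamp would be missed by this scheme.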

Spark Streaming Direct Kafka API, OffsetRanges : How to handle first run

My Spark Streaming application reads from Kafka using the direct stream approach without the help of ZooKeeper. I would like to handle failures such that exactly-once semantics are followed in my application. I am following this for reference. Everything looks perfect except for:
val stream: InputDStream[(String, Long)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, Long)](
  ssc, kafkaParams, fromOffsets,
  // we're just going to count messages per topic, don't care about the contents, so convert each message to (topic, 1)
  (mmd: MessageAndMetadata[String, String]) => (mmd.topic, 1L))
On the very first run of the application, no offsets have been stored yet, so what value should I pass in for the fromOffsets Map parameter? I am certainly missing something.
Thanks and appreciate any help!
The first offset isn't necessarily 0L, depending on how long the topics have existed.
I personally just pre-insert the appropriate offsets into the database separately. Then the spark job reads the offsets from the database at startup.
The file KafkaCluster.scala in the Spark Kafka integration has methods that make it easier to query Kafka for the earliest available offset. That class was private, but has been made public in the most recent Spark code.
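To make the first-run case concrete, here is a minimal sketch of how the two sources of offsets could be merged before calling createDirectStream. It assumes you have already loaded the stored offsets from your database and fetched the earliest available offsets from Kafka. TopicPartitionKey is a hypothetical stand-in for kafka.common.TopicAndPartition so that the snippet runs without Kafka on the classpath, and StartingOffsets.resolve is likewise an illustrative name.

```scala
// Hypothetical stand-in for kafka.common.TopicAndPartition.
final case class TopicPartitionKey(topic: String, partition: Int)

object StartingOffsets {
  /** For every partition Kafka reports, prefer the stored offset. Fall back to the
    * earliest available offset when nothing is stored yet (first run), or when the
    * stored offset is older than what Kafka still retains.
    */
  def resolve(
      stored: Map[TopicPartitionKey, Long],
      earliest: Map[TopicPartitionKey, Long]
  ): Map[TopicPartitionKey, Long] =
    earliest.map { case (tp, earliestOffset) =>
      tp -> stored.get(tp).filter(_ >= earliestOffset).getOrElse(earliestOffset)
    }
}
```

On the first run stored is empty, so the result is simply the earliest offsets, which need not be 0L. On later runs the stored values win, and the result (with the real TopicAndPartition keys) is what would be passed as fromOffsets.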