I have a use case where I need to call a REST API from Spark Streaming after messages are read from Kafka, perform some calculation, and save the result to HDFS and a third-party application.
I have a few questions:
How can we call a REST API directly from Spark Streaming?
How do we manage the REST API timeout relative to the streaming batch interval?
This code will not compile as it is, but it shows the approach for the given use case.
val conf = new SparkConf().setAppName("App name").setMaster("yarn")
val ssc = new StreamingContext(conf, Seconds(1))
val dstream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
dstream.foreachRDD { rdd =>
  // Write the RDD to HDFS directly
  // Loop through each partition in the RDD
  rdd.foreachPartition { partitionOfRecords =>
    // 1. Create the HttpClient object here
    // 2.a POST the data to the API
    // Use the loop below if you want record-level control in the RDD or partition
    partitionOfRecords.foreach { record =>
      // 2.b POST the data to the API
      // Use 2.a or 2.b to POST the data, as your requirements dictate
    }
  }
}
Most HTTP clients (for REST calls) support a request timeout.
Sample HTTP POST call with a timeout using Apache HttpClient:
val CONNECTION_TIMEOUT_MS = 20000 // Timeout in milliseconds (20 sec)
val requestConfig = RequestConfig.custom()
  .setConnectionRequestTimeout(CONNECTION_TIMEOUT_MS)
  .setConnectTimeout(CONNECTION_TIMEOUT_MS)
  .setSocketTimeout(CONNECTION_TIMEOUT_MS)
  .build()
val client: CloseableHttpClient = HttpClientBuilder.create().build()
val url = "https://selfsolve.apple.com/wcResults.do"
val post = new HttpPost(url)
// Set config to post
post.setConfig(requestConfig)
post.setEntity(EntityBuilder.create.setText("some text to post to API").build())
val response: HttpResponse = client.execute(post)
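One way to relate the request timeout to the batch interval, sketched with the JDK's built-in java.net.http client (Java 11+) rather than Apache HttpClient: derive the timeout from the batch interval so that a slow call cannot outlast the batch. The 10-second interval, the half-interval budget, and the localhost endpoint are assumptions for illustration only. Note that the 20-second timeout in the sample above is longer than a 1-second batch interval, so batches could queue up behind slow calls.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest}
import java.time.Duration

object TimeoutSketch {
  // Assumed values: a 10 s batch interval, with half of it budgeted per call.
  val batchInterval: Duration = Duration.ofSeconds(10)
  val requestTimeout: Duration = batchInterval.dividedBy(2)

  // Build one client per partition; connectTimeout bounds connection setup.
  val client: HttpClient = HttpClient.newBuilder()
    .connectTimeout(requestTimeout)
    .build()

  // Hypothetical endpoint; timeout() bounds the wait for the response.
  def buildPost(body: String): HttpRequest =
    HttpRequest.newBuilder(URI.create("http://localhost:8080/score"))
      .timeout(requestTimeout)
      .POST(HttpRequest.BodyPublishers.ofString(body))
      .build()
}
```

Inside foreachPartition you would then call TimeoutSketch.client.send(...) with the built request and handle java.net.http.HttpTimeoutException according to your retry policy.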
I'm creating a ByteArrayOutputStream using ZIO Streams i.e.:
lazy val byteArrayOutputStream = new ByteArrayOutputStream()
val sink = ZSink.fromOutputStream(byteArrayOutputStream).contramapChunks[String](_.flatMap(_.getBytes))
val data = ZStream.unwrap(callToFunction).run(sink)
This works fine - now I need to stream this data back to the client using akka http.
I can do this:
val arr = byteArrayOutputStream.toByteArray
complete(HttpEntity(ContentTypes.`application/octet-stream`, arr))
This works, but of course toByteArray brings the whole output stream into memory, i.e. I don't stream the data. I'm missing something obvious: is there an easy way to do this?
You can convert the output stream to an Akka Streams Source:
val byteArrayOutputStream = new ByteArrayOutputStream()
val source = StreamConverters.asOutputStream().mapMaterializedValue(_ => byteArrayOutputStream)
and then simply create a chunked HTTP entity:
HttpResponse(entity = HttpEntity.Chunked.fromData(ContentTypes.`application/octet-stream`, source))
More about chunked transfer: https://datatracker.ietf.org/doc/html/rfc7230#section-4.1
For ZIO, you could probably use something like this:
val zSource = ZStream.fromOutputStreamWriter(os => byteArrayOutputStream.writeTo(os))
However, you need to find a way to convert the ZStream to an Akka Streams Source.
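As a rough illustration of the hand-off idea only (plain JDK piped streams, not Akka or ZIO): a producer writes chunks into an OutputStream while a consumer pulls them from the other end through a small fixed-size buffer, so the pipe itself never holds the full payload. The object and method names are made up for this sketch.

```scala
import java.io.{PipedInputStream, PipedOutputStream}
import java.nio.charset.StandardCharsets.UTF_8

object PipeSketch {
  // Writes each chunk to the pipe on a separate thread and returns the reading end.
  def stream(chunks: Iterator[String], pipeSize: Int = 16): PipedInputStream = {
    val in = new PipedInputStream(pipeSize)
    val out = new PipedOutputStream(in)
    val producer = new Thread(new Runnable {
      def run(): Unit =
        try chunks.foreach(c => out.write(c.getBytes(UTF_8)))
        finally out.close() // closing signals end-of-stream to the reader
    })
    producer.setDaemon(true)
    producer.start()
    in
  }

  // Reads with a small fixed buffer. The StringBuilder exists only so the
  // sketch can show the result; a real consumer would forward each slice.
  def readAll(in: PipedInputStream): String = {
    val buf = new Array[Byte](8)
    val sb = new StringBuilder
    var n = in.read(buf)
    while (n != -1) {
      sb.append(new String(buf, 0, n, UTF_8))
      n = in.read(buf)
    }
    in.close()
    sb.toString
  }
}
```

The writer blocks whenever the 16-byte pipe is full, which is the backpressure you lose when everything is first collected in a ByteArrayOutputStream.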
In my Spark Kinesis streaming application I am using foreachBatch to get the streaming data, and I need to send it to the Drools rule engine for further processing.
My requirement is to accumulate all the JSON data in a list/ruleSession and send it to the rule engine for processing as a batch on the executor side.
// Scala code example:
val dataFrame = sparkSession.readStream
  .option("streamName", streamName)
  .option("region", region)
  .option("initialPosition", "TRIM_HORIZON")
val query = dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
val function = (batchDF: DataFrame, batchId: Long) => {
  val ruleSession = kBase.newKieSession() // Drools rule session; this is created on the driver side
  batchDF.foreach(row => { // This piece of code runs on the executors
    val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData) // NullPointerException here, as the ruleSession is not available on the executor
  })
  ruleHandler.processRule(ruleSession) // Again, this is in the driver scope
}
In the above code, the problem I am facing is that the function passed to foreachBatch is executed on the driver side while the code inside batchDF.foreach is executed on the worker/executor side, and so it fails to get the ruleSession.
Is there any way to run the whole function on each executor?
Is there a better way to accumulate all the data in a batch DataFrame after transformation and send it to the next process from within the executor/worker?
I think this might work ... Rather than running foreach, you could use foreachBatch or foreachPartition (or a map version like mapPartitions if you want to return info). In that section, open a connection to the Drools system. From that point, iterate over the dataset within each partition (or batch), sending each record to the Drools system (or you might send the whole chunk to Drools). At the end of the foreachPartition / foreachBatch section, close the connection (if applicable).
@codeaperature, this is how I achieved batching, inspired by your answer. I am posting it as an answer because it exceeds the word limit for a comment.
Use foreach on the DataFrame and pass in a ForeachWriter.
Initialize the rule session in the open method of the ForeachWriter.
Add each input JSON to the rule session in the process method.
Execute the rules in the close method, with the rule session loaded with the batch of data.
// Scala code:
val dataFrame = sparkSession.readStream
  .option("streamName", streamName)
  .option("region", region)
  .option("initialPosition", "TRIM_HORIZON")
val query = dataFrame
  .selectExpr("CAST(data as STRING) as krecord")
val dataConsumer = new ForeachWriter[Row] {
  var ruleSession: KieSession = null
  def open(partitionId: Long, version: Long): Boolean = { // open is called first, once per partition for every batch
    ruleSession = kBase.newKieSession()
    true // returning true tells Spark to go ahead and call process for this partition
  }
  def process(row: Row): Unit = { // process is called once for each record in the batch
    val jsonData: JSONData = jsonHandler.convertStringToJSONType(row.mkString)
    ruleSession.insert(jsonData) // Add all input JSON to the rule session
  }
  def close(errorOrNull: Throwable): Unit = { // close is called after process has run for all records in the batch
    val factCount = ruleSession.getFactCount
    if (factCount > 0) {
      ruleHandler.processRule(ruleSession) // batch processing of the rules
    }
  }
}
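To make the open / process / close flow easier to follow, here is a small plain-Scala sketch that needs neither Spark nor Drools. PartitionWriter is a hypothetical stand-in for the ForeachWriter lifecycle, BatchingWriter and LifecycleDemo are names invented for this example, and the fire callback stands in for ruleHandler.processRule.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for the ForeachWriter lifecycle: open -> process* -> close.
trait PartitionWriter[T] {
  def open(partitionId: Long): Boolean
  def process(value: T): Unit
  def close(error: Throwable): Unit
}

// Buffers every record of a partition and hands the whole batch to `fire` on close.
class BatchingWriter[T](fire: Seq[T] => Unit) extends PartitionWriter[T] {
  private val buffer = ArrayBuffer.empty[T]
  def open(partitionId: Long): Boolean = { buffer.clear(); true }
  def process(value: T): Unit = { buffer += value; () }
  def close(error: Throwable): Unit =
    if (error == null && buffer.nonEmpty) fire(buffer.toList)
}

object LifecycleDemo {
  // Drives a writer the way an executor would for a single partition.
  def run[T](writer: PartitionWriter[T], partitionId: Long, rows: Seq[T]): Unit =
    if (writer.open(partitionId)) {
      var error: Throwable = null
      try rows.foreach(writer.process)
      catch { case e: Throwable => error = e }
      writer.close(error)
    }
}
```

Because the buffer is created and filled on the executor, nothing has to be serialized from the driver, which is what avoided the null rule session in the original attempt.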
I'm developing an algorithm using Kafka and Spark Streaming. This is part of my receiver:
val Array(brokers, topics) = args
val sparkConf = new SparkConf().setAppName("Traccia2014")
val ssc = new StreamingContext(sparkConf, Seconds(10))
// Create direct kafka stream with brokers and topics
val topicsSet = topics.split(",").toSet
val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)
val slice = 30
val lines = messages.map(_._2)
val dStreamDst = lines.transform(rdd => {
  val y = rdd.map(x => x.split(",")(0)).reduce((a, b) => if (a < b) a else b)
  rdd.map(x => (((x.split(",")(0).toInt - y.toInt).toLong / slice).round * slice + " " + (x.split(",")(2)), 1)).reduceByKey(_ + _)
})
With this code I get the following error:
ERROR JobScheduler: Error generating jobs for time 1484927230000 ms
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1$$anonfun$apply$42.apply(RDD.scala:1034)
What does it mean? How can I solve it?
Any kind of help is truly appreciated. Thanks in advance.
Solved. Don't use the transform or print() methods; foreachRDD is the best solution.
You are encountering this because you are interacting with the DStream through the transform() API. With that method, you are given the RDD that represents a snapshot of the data in time, in your case the 10-second window. Your code fails because in a particular time window there was no data, so the RDD you are operating on is empty, which gives you the "empty collection" error when you invoke reduce().
Use rdd.isEmpty() to ensure that the RDD is not empty before invoking your operation.
lines.transform(rdd => {
  if (rdd.isEmpty) {
    // empty batch: skip the reduce and return an empty RDD of the result type
  } else {
    // rest of transformation
  }
})
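The same failure and the same guard can be reproduced with ordinary Scala collections, which may help when testing the logic outside Spark. This is only a sketch: bucketCounts is a made-up helper, it assumes lines of the form "timestamp,field,label", and it compares timestamps numerically (the original code compares them as strings, which is an assumption about intent on my part).

```scala
object EmptyGuard {
  // Counts records per "bucket label" key, where bucket is the offset from the
  // smallest timestamp rounded down to a multiple of `slice`.
  // An empty batch yields an empty map instead of reaching min/reduce.
  def bucketCounts(lines: Seq[String], slice: Int): Map[String, Int] =
    if (lines.isEmpty) Map.empty
    else {
      val start = lines.map(_.split(",")(0).toInt).min
      lines
        .map { line =>
          val fields = line.split(",")
          val bucket = (fields(0).toInt - start) / slice * slice
          s"$bucket ${fields(2)}"
        }
        .groupBy(identity)
        .map { case (key, hits) => key -> hits.size }
    }
}
```

Without the isEmpty check, calling min (or reduce) on the empty sequence throws UnsupportedOperationException, just as RDD.reduce does on an empty RDD.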
I need to establish a connection from Spark Streaming to the Neo4j graph database. The RDDs are of type ((is,I),(am,Hello)(sam,happy)....). I need to create an edge between each pair of words in Neo4j.
In the Spark Streaming documentation I found
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { partitionOfRecords =>
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection) // return to the pool for future reuse
  }
}
as the way to push the data to an external database.
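ConnectionPool is not a Spark class; the documentation leaves it to you. A minimal sketch of what such a static, lazily initialized pool could look like is below. The Connection class here is a hypothetical placeholder for whatever client you end up using (it only records what was sent), and the created counter exists just to make the reuse visible.

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable.ArrayBuffer

// Hypothetical connection type: a real one would talk to the database.
final class Connection(val id: Int) {
  val sent: ArrayBuffer[String] = ArrayBuffer.empty[String]
  def send(record: String): Unit = { sent += record; () }
}

// A Scala object is initialized lazily, once per JVM, i.e. once per executor.
object ConnectionPool {
  val created = new AtomicInteger(0)
  private lazy val idle = new ConcurrentLinkedQueue[Connection]()

  // Reuse an idle connection when one exists, otherwise create a new one.
  def getConnection(): Connection = {
    val pooled = idle.poll()
    if (pooled != null) pooled else new Connection(created.incrementAndGet())
  }

  // Hand the connection back so the next partition can reuse it.
  def returnConnection(connection: Connection): Unit = {
    idle.offer(connection)
    ()
  }
}
```

The point of the pattern is that connections are created on the executors and shared across partitions and batches, instead of being created on the driver or once per record.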
I am doing this in Scala. I am a little confused about how to go about it. I found the AnormCypher and Neo4jScala wrappers. Can I use these to get the work done? If so, how? If not, are there any better alternatives?
Thank you all.
I did an experiment with AnormCypher, like this:
implicit val connection = Neo4jREST.setServer("localhost", 7474, "/db/data/")
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val logData = sc.textFile(FILE, 4).cache()
val count = logData
  .flatMap(_.split(" "))
  .map(w =>
    Cypher("CREATE(:Word {text:{text}})")
      .on("text" -> w).execute() // true when the statement succeeded
  ).filter(identity).count() // count only the successful writes
Neo4j 2.2.x has great concurrent write performance that you can use from Spark. So the more concurrent threads you can have to write to Neo4j the better. If you can batch statements in batches of 100 to 1000 each per request then even better.
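For the batching part, a small sketch of splitting the word pairs into fixed-size groups before sending them; each group would become one request. The helper names and the "from"/"to" parameter keys are invented for illustration, and the actual call to Neo4j is deliberately left out.

```scala
object BatchSketch {
  // Splits the pairs into batches of at most `size` elements, preserving order.
  def batches(pairs: Seq[(String, String)], size: Int): List[Seq[(String, String)]] =
    pairs.grouped(size).toList

  // Hypothetical rendering of one batch as a list of parameter maps,
  // one map per edge to create.
  def toParams(batch: Seq[(String, String)]): Seq[Map[String, String]] =
    batch.map { case (from, to) => Map("from" -> from, "to" -> to) }
}
```

Inside foreachPartition you could apply the same idea directly to the iterator with partitionOfRecords.grouped(500), which avoids loading the whole partition into memory.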
Take a look at MazeRunner (http://www.kennybastani.com/2014/11/using-apache-spark-and-neo4j-for-big.html) as it will give you some ideas.
I am new to Spark, so please guide me.
There are lots of examples available for Spark Streaming using Scala.
You can check them out at https://github.com/apache/incubator-spark/tree/master/examples/src/main/scala/org/apache/spark/streaming/examples.
I want to run TwitterPopularTags.scala.
I am not able to set the Twitter login details for this example.
I successfully ran the network count example.
But when I execute
./run-example org.apache.spark.streaming.examples.TwitterPopularTags local[2]
it shows an authentication failure.
I set the Twitter login details before initializing the streaming context in TwitterPopularTags.scala, like this:
System.setProperty("twitter4j.oauth.consumerKey", "####");
System.setProperty("twitter4j.oauth.consumerSecret", "##");
System.setProperty("twitter4j.oauth.accessToken", "##");
System.setProperty("twitter4j.oauth.accessTokenSecret", "##");
Please guide.
Put the file "twitter4j.properties" into the Spark root directory (e.g. spark-0.8.0-incubating) before you run the Twitter examples.
Worked for me on Mac with the Scala examples.
I was not able to open the GitHub link https://github.com/apache/incubator-spark/tree/master/examples/src/main/scala/org/apache/spark/streaming/examples.
However, you could use the code below, which worked for me.
import org.apache.spark.{ SparkConf, SparkContext }
import org.apache.spark.SparkContext._
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.streaming.twitter._
/**
 * A Spark Streaming application that receives tweets on certain
 * keywords from the Twitter data source and finds the popular hashtags
 *
 * Arguments: <consumerKey> <consumerSecret> <accessToken> <accessTokenSecret> <keyword_1> ... <keyword_n>
 * <consumerKey> - Twitter consumer key
 * <consumerSecret> - Twitter consumer secret
 * <accessToken> - Twitter access token
 * <accessTokenSecret> - Twitter access token secret
 * <keyword_1> - The keyword to filter tweets
 * <keyword_n> - Any number of keywords to filter tweets
 *
 * More discussion at stdatalabs.blogspot.com
 *
 * @author Sachin Thirumala
 */
object SparkPopularHashTags {
  val conf = new SparkConf().setMaster("local[4]").setAppName("Spark Streaming - PopularHashTags")
  val sc = new SparkContext(conf)

  def main(args: Array[String]): Unit = {
    val Array(consumerKey, consumerSecret, accessToken, accessTokenSecret) = args.take(4)
    val filters = args.takeRight(args.length - 4)

    // Set the system properties so that the Twitter4j library used by the Twitter stream
    // can use them to generate OAuth credentials
    System.setProperty("twitter4j.oauth.consumerKey", consumerKey)
    System.setProperty("twitter4j.oauth.consumerSecret", consumerSecret)
    System.setProperty("twitter4j.oauth.accessToken", accessToken)
    System.setProperty("twitter4j.oauth.accessTokenSecret", accessTokenSecret)

    // Set the Spark StreamingContext to create a DStream for every 5 seconds
    val ssc = new StreamingContext(sc, Seconds(5))
    // Pass the filter keywords as arguments
    // val stream = FlumeUtils.createStream(ssc, args(0), args(1).toInt)
    val stream = TwitterUtils.createStream(ssc, None, filters)

    // Split the stream on space and extract hashtags
    val hashTags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))

    // Get the top hashtags over the previous 60 sec window
    val topCounts60 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(60))
      .map { case (topic, count) => (count, topic) }
      .transform(_.sortByKey(false)) // sort by count, descending

    // Get the top hashtags over the previous 10 sec window
    val topCounts10 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(10))
      .map { case (topic, count) => (count, topic) }
      .transform(_.sortByKey(false)) // sort by count, descending

    // Print tweets in the current DStream
    stream.print()

    // Print popular hashtags
    topCounts60.foreachRDD(rdd => {
      val topList = rdd.take(10)
      println("\nPopular topics in last 60 seconds (%s total):".format(rdd.count()))
      topList.foreach { case (count, tag) => println("%s (%s tweets)".format(tag, count)) }
    })
    topCounts10.foreachRDD(rdd => {
      val topList = rdd.take(10)
      println("\nPopular topics in last 10 seconds (%s total):".format(rdd.count()))
      topList.foreach { case (count, tag) => println("%s (%s tweets)".format(tag, count)) }
    })

    // Start the computation and wait for it to terminate
    ssc.start()
    ssc.awaitTermination()
  }
}
setMaster("local[4]") - Make sure to set the master to local mode with at least 2 threads: one thread is used for receiving the incoming stream and another for processing it.
We count the popular hashtags with the below code:
val topCounts60 = hashTags.map((_, 1)).reduceByKeyAndWindow(_ + _, Seconds(60))
  .map { case (topic, count) => (count, topic) }
  .transform(_.sortByKey(false))
The above snippet counts the hashtags over the previous 60 (or 10) seconds, as specified in reduceByKeyAndWindow, swaps each pair so that the count becomes the key, and sorts the pairs by count in descending order.
reduceByKeyAndWindow is used when we have to apply transformations to data accumulated over the previous batch intervals.
Execute the code by passing the four twitter OAuth tokens as arguments:
You should see the popular hashtags over every 10/60 second interval.
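If you want to check the counting logic without a Twitter connection, the same steps (split on spaces, keep words starting with "#", count, put the count first, sort descending) can be sketched on a plain Scala collection. HashTagSketch and topHashTags are names made up for this example, and breaking ties alphabetically is my own choice to keep the output deterministic.

```scala
object HashTagSketch {
  // Returns the n most frequent hashtags as (count, tag) pairs, highest count first.
  def topHashTags(tweets: Seq[String], n: Int): Seq[(Int, String)] =
    tweets
      .flatMap(_.split(" ").filter(_.startsWith("#"))) // extract hashtags
      .groupBy(identity)                               // tag -> all its occurrences
      .toSeq
      .map { case (tag, hits) => (hits.size, tag) }    // count becomes the first element
      .sortBy { case (count, tag) => (-count, tag) }   // descending by count, ties by tag
      .take(n)
}
```

For example, HashTagSketch.topHashTags(Seq("#spark is fast", "#spark and #kafka"), 10) gives Seq((2, "#spark"), (1, "#kafka")).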
