Submitting a streaming job to a Spark cluster via a SparkContext - Scala

I have an operation, RDD[T] => Unit, that I would like to submit as a Spark job using a Spark StreamingContext. The Spark job should stream values from myStream, passing each instance of RDD[T] to operation.
Originally, I accomplished this by creating a Spark job with a main() function that uses myStream.foreachRDD(), and supplying the class name to spark-submit on the command line. However, I would rather avoid shelling out and instead submit the job using the streamingContext directly. This would be more elegant and would also allow me to terminate the job at will, simply by calling streamingContext.stop().
I suspect that the solution is to use streamingContext.sparkContext.runJob(), but this would require supplying additional arguments that I did not have to provide when using spark-submit: namely a single RDD[T] instance and partition information. Is there a sensible way to provide "default" values for these parameters (to mirror the utility of spark-submit), or is there another approach that I might be missing?
Code Snippet
val streamingContext: StreamingContext = ...
val eventStream: DStream[T] = ...
eventStream.foreachRDD { rdd =>
  rdd.toLocalIterator.toSeq.take(200).foreach { message =>
    message.foreach { content =>
      // process message content
    }
  }
}
streamingContext.start()
streamingContext.awaitTermination()
Note:
It is also acceptable (and possibly required) for the submitted job to consume only a specific time window of the stream as input.
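For what it's worth, a minimal sketch of starting and stopping the streaming job programmatically from the driver, using only standard StreamingContext methods; the 10-minute window is an arbitrary placeholder, not part of the original code:

streamingContext.start()
// block for a bounded window instead of awaitTermination(); returns early if the job stops first
streamingContext.awaitTerminationOrTimeout(10 * 60 * 1000)
// stop the streaming job at will, keeping the underlying SparkContext alive for reuse
streamingContext.stop(stopSparkContext = false, stopGracefully = true)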

Related

KernelDensity Serialization error in spark

Recently I have been using the KernelDensity class in Spark. I try to serialize it to disk on Windows 10; here is my code:
import java.io.{FileInputStream, FileOutputStream, ObjectInputStream, ObjectOutputStream}
import org.apache.spark.mllib.stat.KernelDensity

// read sample from disk
val sample = spark.read.option("inferSchema", "true").csv("D:\\sample")
val trainX = sample.select("_c1").rdd.map(r => r.getDouble(0))
val kd = new KernelDensity().setSample(trainX).setBandwidth(1)
// Serialization
val oos = new ObjectOutputStream(new FileOutputStream("a.obj"))
oos.writeObject(kd)
oos.close()
// deserialization
val ios = new ObjectInputStream(new FileInputStream("a.obj"))
val kd1 = ios.readObject.asInstanceOf[KernelDensity]
ios.close()
// error comes when I use estimate
kd1.estimate(Array(1, 2, 3))
Exception in thread "main" org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:90)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.aggregate(RDD.scala:1117)
at org.apache.spark.mllib.stat.KernelDensity.estimate(KernelDensity.scala:92)
at KernelDensityConstruction$.main(KernelDensityConstruction.scala:35)
at KernelDensityConstruction.main(KernelDensityConstruction.scala)
20/05/10 22:05:42 INFO SparkContext: Invoking stop() from shutdown hook
Why does it not work? If I do not perform the serialization step, it works fine.
It happens because KernelDensity, despite being formally Serializable, is not designed to be fully compatible with standard serialization tools.
Internally it holds a reference to the sample RDD, which in turn depends on the corresponding SparkContext. In other words, it is a distributed tool, designed to be used within the scope of a single active session.
Given that:
It doesn't perform any computations until estimate is called.
It requires the sample RDD to evaluate estimate.
it doesn't really make sense to serialize it in the first place: you can simply recreate the object, based on the desired parameters, in the new context.
However, if you really want to serialize the whole thing, you should create a wrapper that serializes both the parameters and the corresponding RDD (similar to how ML models backed by distributed data structures, like ALS, handle saving and loading) and loads these back within a new session.
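A minimal sketch of the recreate-instead-of-serialize approach; the sample path and the bandwidth value are assumptions, not part of the original code:

// persist only the data the estimator needs
trainX.saveAsObjectFile("D:\\kd-sample")

// ... later, in a new session, rebuild the estimator from its parameters ...
val restoredSample = spark.sparkContext.objectFile[Double]("D:\\kd-sample")
val kdRestored = new KernelDensity()
  .setSample(restoredSample)
  .setBandwidth(1)
kdRestored.estimate(Array(1.0, 2.0, 3.0))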

Spark httpclient job does not parallel well [duplicate]

I build an RDD from a list of URLs, and then try to fetch the data with some asynchronous HTTP calls.
I need all the results before doing other computations.
Ideally, I need to make the HTTP calls on different nodes for scaling considerations.
I did something like this:
//init spark
val sparkContext = new SparkContext(conf)
val datas = Seq[String]("url1", "url2")
//create rdd
val rdd = sparkContext.parallelize[String](datas)
//httpCall return Future[String]
val requests = rdd.map((url: String) => httpCall(url))
//await all results (Future.sequence may be better)
val responses = requests.map(r => Await.result(r, 10.seconds))
//print responses
responses.collect().foreach((s: String) => println(s))
//stop spark
sparkContext.stop()
This works, but the Spark job never finishes!
So I wonder what the best practices are for dealing with Future using Spark (or Future[RDD]).
I think this use case looks pretty common, but I haven't found any answer yet.
Best regards
this use case looks pretty common
Not really, because it simply doesn't work as you (probably) expect. Since each task operates on standard Scala Iterators, these operations will be squashed together. It means that all operations will be blocking in practice. Assuming you have three URLs ["x", "y", "z"], your code will be executed in the following order:
Await.result(httpCall("x"), 10.seconds)
Await.result(httpCall("y"), 10.seconds)
Await.result(httpCall("z"), 10.seconds)
You can easily reproduce the same behavior locally. If you want to execute your code asynchronously you should handle this explicitly using mapPartitions:
rdd.mapPartitions(iter => {
  ??? // Submit requests
  ??? // Wait until all requests completed and return Iterator of results
})
but this is relatively tricky. There is no guarantee all data for a given partition fits into memory so you'll probably need some batching mechanism as well.
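A rough sketch of that batching pattern, assuming httpCall: String => Future[String] as in the question; the batch size and timeout are arbitrary placeholders, not a definitive implementation:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

val batchSize = 20 // assumption: tune to what fits comfortably in memory per partition

val responses = rdd.mapPartitions { urls =>
  urls.grouped(batchSize).flatMap { batch =>
    val futures = batch.map(httpCall)                  // fire the requests for this batch
    Await.result(Future.sequence(futures), 10.seconds) // block once per batch, not once per URL
  }
}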
All of that being said, I couldn't reproduce the problem you've described, so it could be a configuration issue or a problem with httpCall itself.
On a side note, allowing a single timeout to kill the whole task doesn't look like a good idea.
I couldn't find an easy way to achieve this, but after several iterations of retries this is what I did, and it's working for a huge list of queries. Basically we used this to batch a huge query into multiple sub-queries.
// Break down your huge workload into smaller chunks; in this case a huge query string is broken
// down into a small set of subqueries.
// If you need to optimize further, you can provide an explicit partition count when parallelizing.
val queries = sqlContext.sparkContext.parallelize[String](subQueryList.toSeq)

// Then map each of those to a Spark task; in this case it's a Future that returns a string
val tasks: RDD[Future[String]] = queries.map(query => {
  val task = makeHttpCall(query) // Method returns the http call response as a Future[String]
  task.recover {
    case ex => logger.error("recover: " + ex.printStackTrace())
  }
  task onFailure {
    case t => logger.error("execution failed: " + t.getMessage)
  }
  task
})

// Note: the http call is still not invoked; you are only including it as part of the lineage.

// Then, in each partition, you combine all Futures (there could be several tasks per partition), sequence them,
// and Await the result; this way the task blocks until every future in that sequence is resolved.
val contentRdd = tasks.mapPartitions[String] { f: Iterator[Future[String]] =>
  val searchFuture: Future[Iterator[String]] = Future sequence f
  Await.result(searchFuture, threadWaitTime.seconds)
}

// Note: at this point you can apply further transformations to this rdd and they will be appended to the lineage.
// When you perform any action on that rdd, the mapPartitions step above is evaluated,
// the subqueries fire as fully parallel http requests,
// and the data is collected into a single rdd.
If you don't want to perform any transformation on the content, like parsing the response payload, then you can use foreachPartition instead of mapPartitions to perform all those http calls immediately.
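A brief sketch of that foreachPartition variant, reusing the makeHttpCall and threadWaitTime names from the snippet above; everything else is an assumption:

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

queries.foreachPartition { qs =>
  // foreachPartition is an action, so the calls fire as soon as this runs
  val futures = qs.map(makeHttpCall).toSeq
  // block until every request in the partition completes; the responses are discarded
  Await.result(Future.sequence(futures), threadWaitTime.seconds)
}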
I finally made it using scalaj-http instead of Dispatch.
Calls are synchronous, but this matches my use case.
I think the Spark job never finished with Dispatch because the HTTP connection was not closed properly.
Best Regards
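For reference, a minimal sketch of what the synchronous scalaj-http version might look like; the timeout values are assumptions:

import scalaj.http.Http

// each task performs a blocking call; the response body is returned as a String
val responses = rdd.map { url =>
  Http(url).timeout(connTimeoutMs = 1000, readTimeoutMs = 10000).asString.body
}
responses.collect().foreach(println)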
This won't work.
You cannot expect the request objects to be distributed and the responses collected over a cluster by other nodes. If you do, then the Spark calls for the futures will never end. Futures will never work in this case.
If your map() makes synchronous (http) requests, then please collect the responses within the same action/transformation call and then subject the results (responses) to further map/reduce/other calls.
In your case, please rewrite the logic to collect the responses for each call synchronously and remove the notion of futures; then all should be fine.

spark streaming: mapping points into queue

I am new to Spark Streaming and I can't understand how map works. I want to enqueue some points from a stream after passing each one through a constructor, so what I wrote is:
val data = inp.flatMap(_.split(","))
val points = data.map(_.toDouble)
val queue: Queue[Point] = new Queue[Point]
points.foreachRDD(rdd => {
  rdd.map(x => queue.enqueue(new Point(x, 1)))
})
When I print the size of the queue, it is always zero.
All transformations in Spark are lazy and they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset. The transformations are only computed when an action requires a result to be returned to the driver program.
Since you are applying a map function here, it is lazily evaluated and will not be computed. Instead, a DAG is built. It will be evaluated only when an action is called. You might want to try collect or any other action to materialize this, as sketched below.
You can read more about this here. It's kinda old but informative.
https://training.databricks.com/visualapi.pdf
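A minimal sketch of one way to materialize the pipeline so the driver-side queue is actually filled; it assumes each batch is small enough to collect to the driver:

points.foreachRDD { rdd =>
  // collect() is an action, so the transformations actually execute;
  // the enqueue then happens on the driver, where `queue` lives
  rdd.collect().foreach(x => queue.enqueue(new Point(x, 1)))
}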

How to parallelize several apache spark rdds?

I have the next code:
sc.parquetFile("some large parquet file with bc").registerTempTable("bcs")
sc.parquetFile("some large parquet file with imps").registerTempTable("imps")
val bcs = sc.sql("select * from bcs")
val imps = sc.sql("select * from imps")
I want to do:
bcs.map(x => wrapBC(x)).collect
imps.map(x => wrapIMP(x)).collect
but when I do this, it does not run asynchronously. I can do it with Futures, like this:
val bcsFuture = Future { bcs.map(x => wrapBC(x)).collect }
val impsFuture = Future { imps.map(x => wrapIMP(x)).collect }
val result = for {
bcs <- bcsFuture
imps <- impsFuture
} yield (bcs, imps)
Await.result(result, Duration.Inf) //this return (Array[Bc], Array[Imp])
I want to do this without Futures; how can I do it?
Update: This was originally composed before the question was updated. Given those updates, I agree with #stholzm's answer to use cartesian in this case.
There do exist a limited number of actions which will produce a FutureAction[A] for an RDD[A] and be executed in the background. These are available on the AsyncRDDActions class, and so long as you import SparkContext._, any RDD can be implicitly converted to AsyncRDDActions as needed. For your specific code example that would be:
bcs.map(x => wrapBC(x)).collectAsync
imps.map(x => wrapIMP(x)).collectAsync
In addition to evaluating the DAG up to the action in the background, the FutureAction produced has a cancel method to attempt to end processing early.
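A small sketch of how those FutureActions could be consumed, reusing the names from the question; the blocking get() calls are just one option:

import org.apache.spark.SparkContext._

val bcsAction = bcs.map(x => wrapBC(x)).collectAsync()    // job starts in the background
val impsAction = imps.map(x => wrapIMP(x)).collectAsync() // second job runs concurrently with the first
// bcsAction.cancel() would attempt to abort the first job early
val bcsResult = bcsAction.get()   // FutureAction also offers a blocking get()
val impsResult = impsAction.get()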
Caveat
This may not do what you think it does. If the intent is to get data from both sources and then combine them you're more likely to want to join or group the RDDs instead. For that you can look at the functions available in PairRDDFunctions, again available on RDDs through implicit conversion.
If the intention isn't to have the data graphs interact, then in my experience so far, running batches concurrently might only serve to slow down both, though that may be a consequence of how the cluster is configured. If the resource manager is set up to give each execution stage a monopoly on the cluster in FIFO order (the default in standalone and YARN modes, I believe; I'm not sure about Mesos), then each of the asynchronous collects will contend with the other for that monopoly, run its tasks, then contend again for the next execution stage.
Compare this to using a Future to wrap blocking calls to downstream services or database, for example, where either the resources in question are completely separate or generally have enough resource capacity to handle multiple requests in parallel without contention.
Update: I misunderstood the question. The desired result is not the cartesian product Array[(Bc, Imp)].
But I'd argue that it does not matter how long the single map calls take because as soon as you add other transformations, Spark tries to combine them in an efficient way. As long as you only chain transformations on RDDs, nothing happens on the data. When you finally apply an action then the execution engine will figure out a way to produce the requested data.
So my advice would be to not think so much about the intermediate steps and avoid collect as much as possible because it will fetch all the data to the driver program.
It seems you are building a cartesian product yourself. Try cartesian instead:
val bc = bcs.map(x => wrapBC(x))
val imp = imps.map(x => wrapIMP(x))
val result = bc.cartesian(imp).collect
Note that collect is called on the final RDD and no longer on intermediate results.
You can use union to solve this problem. For example:
bcs.map(x => wrapBC(x).asInstanceOf[Any])
imps.map(x => wrapIMP(x).asInstanceOf[Any])
val result = (bcs union imps).collect()
val bcsResult = result collect { case bc: Bc => bc }
val impsResult = result collect { case imp: Imp => imp }
If you want to use sortBy or other operations, you can introduce a common trait or parent class, as in the sketch below.
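A self-contained, hypothetical sketch of that idea; the Event trait and the id fields are invented for illustration, since the real shapes of Bc and Imp are not shown in the question:

sealed trait Event                     // hypothetical common parent
case class Bc(id: Long) extends Event  // assumed shape; real fields are unknown
case class Imp(id: Long) extends Event

// with a shared parent there is no need for the Any cast, and ordering operations stay available
val events = sc.parallelize(Seq[Event](Bc(1L), Imp(2L), Bc(3L)))
val sorted = events.sortBy {
  case Bc(id)  => id
  case Imp(id) => id
}
val result = sorted.collect()
val bcsResult = result collect { case bc: Bc => bc }
val impsResult = result collect { case imp: Imp => imp }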

Spark streams: enrich stream with reference data

I have spark streaming set up so that it reads from a socket, does some enrichment of the data before publishing it on a rabbit queue.
The enrichment looks up information from a Map that was instantiated by reading a regular text file (Source.fromFile...) before setting up the streaming context.
I have a feeling that this is not really the way it should be done. On the other hand, when using a StreamingContext, I can only read from streams, not from static files as I would be able to do with a SparkContext.
I could try to allow multiple contexts but I'm not sure if this is the right way either.
Any advice would be greatly appreciated.
Assuming that the map being used for enrichment is small enough to be held in memory, a recommended way to use that data in a Spark job is through broadcast variables. The content of such a variable is sent once to each executor, avoiding in that way the overhead of repeatedly serializing datasets captured in a closure.
Broadcast variables are wrappers instantiated in the driver and the data is 'unwrapped' using the broadcastVar.value method in a closure.
This would be an example of how to use broadcast variables with a DStream:
// could replace with Source.fromFile as well; this is just more practical
val data = sc.textFile("lookup.txt").map(toKeyValue).collectAsMap()
// declare the broadcast variable
val bcastData = sc.broadcast(data)

... initialize streams ...

socketDStream.map { elem =>
  // doing every step here explicitly for illustrative purposes; usually one would just chain these calls
  // get the map within the broadcast wrapper
  val lookupMap = bcastData.value
  // use the map to look up some data
  val lookupValue = lookupMap.getOrElse(elem, "not found")
  // create the desired result
  (elem, lookupValue)
}
socketDStream.saveTo...
If your file is small and not on a distributed file system, Source.fromFile is fine (whatever gets the job done).
If you want to read files via the SparkContext, you can still access it via streamingContext.sparkContext and combine it with the DStream in transform or foreachRDD.
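A minimal sketch of that second option; toKeyValue and the choice of join key are assumptions carried over from the example above:

// read the reference data through the underlying SparkContext
val lookupRdd = streamingContext.sparkContext
  .textFile("lookup.txt")
  .map(toKeyValue)   // hypothetical parser producing (key, value) pairs

// combine it with the stream batch by batch inside transform
val enriched = socketDStream
  .map(elem => (elem, elem))                      // key each stream element (here by the element itself)
  .transform(rdd => rdd.leftOuterJoin(lookupRdd))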