Using Futures within Spark - scala

A Spark job makes a remote web service call for every element in an RDD. A simple implementation might look something like this:
def webServiceCall(url: String) = scala.io.Source.fromURL(url).mkString
val rdd2 = rdd1.map(x => webServiceCall(x.field1))
(The above example has been kept simple and does not handle timeouts).
There is no interdependency between any of the results for different elements of the RDD.
Would the above be improved by using Futures to optimise performance by making parallel calls to the web service for each element of the RDD? Or does Spark itself have that level of optimization built in, so that it will run the operations on each element in the RDD in parallel?
If the above can be optimized by using Futures, does anyone have some code examples showing the correct way to use Futures within a function passed to a Spark RDD?
Thanks

Or does Spark itself have that level of optimization built in, so that it will run the operations on each element in the RDD in parallel?
It doesn't. Spark parallelizes tasks at the partition level but by default every partition is processed sequentially in a single thread.
Would the above be improved by using Futures
It could be an improvement, but it is quite hard to do right. In particular:
Every Future has to be completed in the same stage, before any reshuffle takes place.
Given the lazy nature of the Iterators used to expose partition data, you cannot do it with high-level primitives like map (see for example Spark job with Async HTTP call).
You can build your custom logic using mapPartitions, but then you have to deal with all the consequences of non-lazy partition evaluation.
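One way to limit the cost of non-lazy evaluation is to consume the partition iterator in small batches and wait for each batch before starting the next. The following is a minimal sketch using only the Scala standard library; the object name BatchedFutures, the helper mapInBatches and its parameters batchSize and timeout are hypothetical names introduced for illustration, not part of Spark.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BatchedFutures {
  // Hypothetical helper: runs `call` on up to `batchSize` elements at a time and
  // waits for each batch before starting the next one. Because Iterator.grouped and
  // Iterator.flatMap are lazy, a batch is only started when the consumer asks for it.
  def mapInBatches[A, B](items: Iterator[A], batchSize: Int, timeout: FiniteDuration)(
      call: A => Future[B])(implicit ec: ExecutionContext): Iterator[B] =
    items.grouped(batchSize).flatMap { batch =>
      // Start every call in this batch, then block until all of them have finished.
      val inFlight: Future[Seq[B]] = Future.sequence(batch.map(call))
      Await.result(inFlight, timeout)
    }
}
```

Inside Spark this could be called from mapPartitions, for example rdd1.mapPartitions(it => BatchedFutures.mapInBatches(it, 8, 30.seconds)(x => Future(webServiceCall(x.field1)))), with an implicit ExecutionContext such as scala.concurrent.ExecutionContext.Implicits.global in scope. At most one batch per task is in flight, and a failed or timed-out call makes Await.result throw, which fails the task.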

I couldn't find an easy way to achieve this, but after several iterations of retries this is what I did, and it's working for a huge list of queries. Basically, we used this to run a batch operation that splits a huge query into multiple subqueries.
// Break your huge workload into smaller chunks; in this case a huge query string is broken
// down into a small set of subqueries.
// To optimize further, you can pass an explicit number of partitions when parallelizing.
val queries = sqlContext.sparkContext.parallelize[String](subQueryList.toSeq)
// Then map each of those to a Spark task; in this case it is a Future that returns a String
val tasks: RDD[Future[String]] = queries.map { query =>
  val task = makeHttpCall(query) // Method returns the HTTP call response as a Future[String]
  // Log failures without altering the Future that is returned
  task.failed.foreach(t => logger.error("execution failed: " + t.getMessage))
  task
}
// Note: the HTTP call is still not invoked; you are only adding this step to the lineage.
// Then, in each partition, combine all Futures (there could be several in each partition) into one sequence
// and Await the result; this blocks until all the Futures in that sequence are resolved.
val contentRdd = tasks.mapPartitions[String] { f: Iterator[Future[String]] =>
  val searchFuture: Future[Iterator[String]] = Future sequence f
  Await.result(searchFuture, threadWaitTime.seconds)
}
// Note: at this point you can apply any transformations to this RDD and they will be appended to the lineage.
// When you perform an action on that RDD, the mapPartitions step is evaluated: the subqueries
// are sent as parallel HTTP requests and the responses are collected in a single RDD.
I'm reposting it from my original answer here

Related

Spark - parallel computation for different dataframes

A premise: this question might sound idiotic, but I guess I fell into confusion and/or ignorance.
The question is: does Spark already optimize its physical plan so that computations on unrelated dataframes are executed in parallel? If not, would it be advisable to try and parallelize such processes? Example below.
Let's assume I have the following scenario:
val df1 = read table into dataframe
val df2 = read another table into dataframe
val aTransformationOnDf1 = df1.filter(condition).doSomething
val aSubSetOfTransformationOnDf1 = aTransformationOnDf1.doSomeOperations
// Push to Kafka
aSubSetOfTransformationOnDf1.toJSON.pushToKafkaTopic
val anotherTransformationOnDf1WithDf2 = df1.filter(anotherCondition).join(df2).doSomethingElse
val yetAnotherTransformationOnDf1WithDf2 = df1.filter(aThirdCondition).join(df2).doAnotherThing
val unionAllTransformation = aTransformationOnDf1
.union(anotherTransformationOnDf1WithDf2)
.union(yetAnotherTransformationOnDf1WithDf2)
unionAllTransformation.write.mode(whatever).partitionBy(partitionColumn).save(wherever)
Basically I have two initial dataframes. One is an event log with past events and new events to process. As an example:
a subset of these new events must be processed and pushed to Kafka.
a subset of the past events could have updates, so they must be processed alone
another subset of the past events could have another kind of updates, so they must be processed alone
In the end, all processed events are unified in one dataframe to be written back to the events' log table.
Question: does Spark process the different subsets in parallel or sequentially (so that only the computation within each individual dataframe is distributed)?
If not, could we enforce parallel computation of each individual subset before the union? I know Scala has Futures, though I have never used them.
Something like:
def unionAllDataframes(df1: DataFrame, df2: DataFrame, df3: DataFrame): Future[DataFrame] = {
  Future { df1.union(df2).union(df3) }
}
// At the end
val finalDf = unionAllDataframes(
aTransformationOnDf1,
anotherTransformationOnDf1WithDf2,
yetAnotherTransformationOnDf1WithDf2)
finalDf.onComplete({
case Success(df) => df.write(etc...)
case Failure(exception) => handleException(exception)
})
Sorry for the horrendous design and probably the wrong usage of Future. Once again, I am a bit confused on this scenario and I am trying to micro-optimize this passage (if possible).
Thanks a lot in advance!
Cheers

How to use SparkContext.submitJob to call REST API

Can someone please provide an example of a submitJob method call?
Found reference here: How to execute async operations (i.e. returning a Future) from map/filter/etc.?
I believe I can implement it for my use case.
In my current implementation I am using partitions to invoke parallel calls, but each one waits for the response before invoking the next call:
Dataframe.rdd.repartition(TPS allowed on API)
.map(row => {
val response = callApi(row)
parse(response)
})
But as there is latency at the API end, I am waiting 10 seconds for the response before parsing it and making the next call. I am allowed 100 TPS, but with the current logic I see only 4-7 TPS.
If someone has used SparkContext.submitJob to make asynchronous calls, please provide an example, as I am new to Spark and Scala.
I want to invoke the calls without waiting for the response, ensuring 100 TPS, and then once I receive the responses I want to parse them and create a DataFrame on top of them.
I previously tried collecting the rows and invoking the API calls from the master node, but that seems to be limited by the hardware when creating a large thread pool.
submitJob[T, U, R](rdd: RDD[T], processPartition: (Iterator[T]) ⇒ U, partitions: Seq[Int], resultHandler: (Int, U) ⇒ Unit, resultFunc: ⇒ R): SimpleFutureAction[R]
rdd - the RDD obtained from my DataFrame
partitions - my RDD is already partitioned; do I provide the range 0 to the number of partitions in my RDD?
processPartition - is this my callApi()?
resultHandler - not sure what is to be done here
resultFunc - I believe this would be parsing my response
How do I create a DataFrame from the SimpleFutureAction?
Can someone please assist?
submitJob won't make your API calls automatically faster. It is part of the low-level implementation of Spark's parallel processing - Spark splits actions into jobs and then submits them to whatever cluster scheduler is in place. Calling submitJob is like starting a Java thread - the job will run asynchronously, but not faster than if you simply call the action on the dataframe/RDD.
IMHO your best option is to use mapPartitions which allows you to run a function within the context of each partition. You already have your data partitioned so to ensure maximum concurrency, just make sure you have enough Spark executors to actually have those partitions running simultaneously:
df.rdd.repartition(#concurrent API calls)
  .mapPartitions(partition => {
    partition.map(row => {
      val response = callApi(row)
      parse(response)
    })
  })
  .toDF("col1", "col2", ...)
mapPartitions expects a function that maps Iterator[T] (all data in a single partition) to Iterator[U] (transformed partition) and returns RDD[U]. Converting back to a dataframe is a matter of chaining a call to toDF() with the appropriate column names.
You may wish to implement some sort of per-thread rate limiting in callApi to make sure no single executor fires a large number of requests per second. Keep in mind that tasks may run in separate threads of the same executor JVM as well as in separate JVMs.
Of course, just calling mapPartitions does nothing. You need to trigger an action on the resulting dataframe for the API calls to actually fire.

Are two transformations on the same RDD executed in parallel in Apache Spark?

Let's say we have the following Scala program:
val inputRDD = sc.textFile("log.txt")
inputRDD.persist()
val errorsRDD = inputRDD.filter(x => x.contains("error"))
val warningsRDD = inputRDD.filter(x => x.contains("warning"))
println("Errors: " + errorsRDD.count() + ", Warnings: " + warningsRDD.count())
We create a simple RDD, persist it, perform two transformations on the RDD and finally have an action which uses the RDDs.
When the print is called, the transformations are executed, each transformation is of course parallel depending on the cluster management.
My main question is: are the two actions and transformations executed in parallel or in sequence? That is, does errorsRDD.count() execute first and then warningsRDD.count()?
I'm also wondering if there is any point in using persist in this example.
All standard RDD methods are blocking (with the exception of AsyncRDDActions), so the actions will be evaluated sequentially. It is possible to execute multiple actions concurrently using non-blocking submission (threads, Futures), given a correctly configured in-application scheduler or explicitly limited resources for each action.
Regarding cache it is impossible to answer without knowing the context. Depending on the cluster configuration, storage, and data locality it might be cheaper to load data from disk again, especially when resources are limited, and subsequent actions might trigger cache cleaner.
This will execute errorsRDD.count() first, then warningsRDD.count().
The point of using persist here is that once the first count has been executed, inputRDD will be in memory.
For the second count, Spark won't need to re-read the whole content of the file from storage, so this count should execute much faster than the first.
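The non-blocking submission mentioned in the first answer can be sketched with countAsync from AsyncRDDActions, which returns a FutureAction instead of blocking. This is only an illustration under assumptions: the object name ConcurrentCounts, the helper countBoth, the one-minute timeout, the local[2] master and the sample lines are all made up for the example, and Spark must be on the classpath.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import scala.concurrent.Await
import scala.concurrent.duration._

object ConcurrentCounts {
  // Hypothetical helper: submits both count jobs before waiting for either result.
  def countBoth(lines: RDD[String]): (Long, Long) = {
    // countAsync returns immediately with a FutureAction[Long].
    val errorsF = lines.filter(_.contains("error")).countAsync()
    val warningsF = lines.filter(_.contains("warning")).countAsync()
    (Await.result(errorsF, 1.minute), Await.result(warningsF, 1.minute))
  }

  def main(args: Array[String]): Unit = {
    // Local session and in-memory sample data, for illustration only.
    val spark = SparkSession.builder().master("local[2]").appName("concurrent-counts").getOrCreate()
    try {
      val inputRDD = spark.sparkContext
        .parallelize(Seq("error: disk", "warning: cpu", "info: ok", "error: net"))
        .persist()
      val (errors, warnings) = countBoth(inputRDD)
      println("Errors: " + errors + ", Warnings: " + warnings)
    } finally spark.stop()
  }
}
```

Since both jobs are submitted before the cache is populated, each of them may end up reading the input itself, so persist may help less here than in the sequential version.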

Why does RDD.foreach fail with "SparkException: This RDD lacks a SparkContext"?

I have a dataset (as an RDD) that I divide into 4 RDDs by using different filter operators.
val RSet = datasetRdd.
flatMap(x => RSetForAttr(x, alLevel, hieDict)).
map(x => (x, 1)).
reduceByKey((x, y) => x + y)
val Rp:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("Rp"))
val Rc:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("Rc"))
val RpSv:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("RpSv"))
val RcSv:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("RcSv"))
I pass Rp and RpSv to the following function, calculateEntropy:
def calculateEntropy(Rx: RDD[(String, Int)], RxSv: RDD[(String, Int)]): Map[Int, Map[String, Double]] = {
RxSv.foreach{item => {
val string = item._1.split(",")
val t = Rx.filter(x => x._1.split(",")(2).equals(string(2)))
.
.
}
}
I have two questions:
1- When I loop over RxSv as:
RxSv.foreach{item=> { ... }}
it collects the items of all partitions, but I only want the items of the partition I am currently in. You might suggest using the map function instead, but I am not changing anything in the RDD.
When I run the code on a cluster with 4 workers and a driver, the dataset is divided into 4 partitions and each worker runs the code. But when I use a foreach loop as shown in the code, the driver collects all data from the workers.
2- I have encountered a problem with this code:
val t = Rx.filter(x => x._1.split(",")(2).equals(abc(2)))
The error :
org.apache.spark.SparkException: This RDD lacks a SparkContext.
It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations;
for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
First of all, I'd highly recommend caching the first RDD using the cache operator.
RSet.cache
That will avoid scanning and transforming your dataset every time you filter for the other RDDs: Rp, Rc, RpSv and RcSv.
Quoting the scaladoc of cache:
cache() Persist this RDD with the default storage level (MEMORY_ONLY).
Performance should increase.
Secondly, I'd be very careful using the term "partition" to refer to a filtered RDD since the term has a special meaning in Spark.
Partitions say how many tasks Spark executes for an action. They are hints for Spark so you, a Spark developer, could fine-tune your distributed pipeline.
The pipeline is distributed across cluster nodes with one or many Spark executors per the partitioning scheme. If you decide to have one partition in an RDD, then once you execute an action on that RDD, you'll have one task on one executor.
The filter transformation does not change the number of partitions (in other words, it preserves partitioning). The number of partitions, i.e. the number of tasks, is exactly the number of partitions of RSet.
1- When I loop operation on RxSv it collects all items of the partitions, but I want to only a partition where i am in
You are. Don't worry about it as Spark will execute the task on executors where the data lives. foreach is an action that does not collect items but describes a computation that runs on executors with the data distributed across the cluster (as partitions).
If you want to process all items at once per partition use foreachPartition:
foreachPartition Applies a function f to each partition of this RDD.
2- I have encountered with a problem on this code
In the following lines of the code:
RxSv.foreach{item => {
val string = item._1.split(",")
val t = Rx.filter(x => x._1.split(",")(2).equals(string(2)))
you are executing the foreach action, which in turn uses Rx, which is an RDD[(String, Int)]. This is not allowed (and ideally it should not even compile).
The reason for this behaviour is that an RDD is a data structure that just describes what happens to the dataset when an action is executed, and it lives on the driver (the orchestrator). The driver uses the data structure to track the data sources, the transformations and the number of partitions.
An RDD as an entity is gone (it disappears) when the driver spawns tasks on executors.
So when the tasks run, nothing is available to tell them how to evaluate RDDs that are referenced inside their work, hence the error. Spark is very cautious about this and checks for such anomalies before they can cause issues once the tasks are executing.

parallelize list.map for concurrent rdd jobs

According to a posting:
https://groups.google.com/forum/#!topic/spark-users/3QIn42VbQe0
This code will submit all jobs directly to Spark's scheduler, and you get a list of "Future"s back.
val rdds: List[RDD[T]] = ...
val futures = rdds.map { rdd =>
  rdd.map(...).reduceByKey(...).collect()
}
I am wondering whether adding .par would speed this up, as in
rdds.par.map
or, maybe the author meant that each map entry would be just a spark job submission, and running them in sequence would be just as fast.
In the code provided, the RDDs will be evaluated sequentially. When you call .collect on an RDD, the RDD transformation is evaluated and the results are collected in the driver. The driver is blocked while the results are collected.
If you were to change it to rdds.par.map { ... }, then the .collects would be called in parallel and all RDDs would be evaluated at the same time. This then leaves it to the Spark scheduling mechanism to decide how to share the cluster between the RDDs.
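An alternative to .par is to wrap each blocking call in a Future and wait for all of them, which also works where parallel collections are not on the classpath (as far as I know, from Scala 2.13 onwards they live in the separate scala-parallel-collections module). The sketch below uses only the standard library; the object name ConcurrentJobs, the helper runAll and its timeout parameter are hypothetical, and job stands in for something like rdd => rdd.map(...).reduceByKey(...).collect().

```scala
import scala.concurrent.{Await, ExecutionContext, Future, blocking}
import scala.concurrent.duration._

object ConcurrentJobs {
  // Hypothetical helper: starts `job` for every input on the given ExecutionContext
  // and waits for all results. Results come back in the same order as `inputs`.
  def runAll[A, B](inputs: List[A], timeout: FiniteDuration)(job: A => B)(
      implicit ec: ExecutionContext): List[B] = {
    // `blocking` hints to the execution context that the job will block its thread.
    val futures: List[Future[B]] = inputs.map(in => Future(blocking(job(in))))
    Await.result(Future.sequence(futures), timeout)
  }
}
```

With the RDDs from the question this would be called as ConcurrentJobs.runAll(rdds, 10.minutes)(rdd => ...), with an implicit ExecutionContext in scope. As with .par, all jobs are submitted at roughly the same time and it is still up to the Spark scheduler to decide how the cluster is shared between them.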