Spark - parallel computation for different dataframes - scala

A premise: this question might sound idiotic, but I guess I fell into confusion and/or ignorance.
The question is: does Spark already optimize its physical plan so that computations on unrelated dataframes are executed in parallel? If not, would it be advisable to try to parallelize such processes myself? Example below.
Let's assume I have the following scenario:
val df1 = read table into dataframe
val df2 = read another table into dataframe
val aTransformationOnDf1 = df1.filter(condition).doSomething
val aSubSetOfTransformationOnDf1 = aTransformationOnDf1.doSomeOperations
// Push to Kafka
aSubSetOfTransformationOnDf1.toJSON.pushToKafkaTopic
val anotherTransformationOnDf1WithDf2 = df1.filter(anotherCondition).join(df2).doSomethingElse
val yetAnotherTransformationOnDf1WithDf2 = df1.filter(aThirdCondition).join(df2).doAnotherThing
val unionAllTransformation = aTransformationOnDf1
  .union(anotherTransformationOnDf1WithDf2)
  .union(yetAnotherTransformationOnDf1WithDf2)
unionAllTransformation.write.mode(whatever).partitionBy(partitionColumn).save(wherever)
Basically I have two initial dataframes. One is an event log with past events and new events to process. For example:
a subset of these new events must be processed and pushed to Kafka.
a subset of the past events could have updates, so they must be processed alone
another subset of the past events could have another kind of update, so they must be processed alone
In the end, all processed events are unified in one dataframe to be written back to the events' log table.
Question: does Spark process the different subsets in parallel or sequentially (with only the computation within each individual dataframe being distributed)?
If not, could we enforce parallel computation of each individual subset before the union? I know Scala has Futures, though I have never used them.
Something like:
def unionAllDataframes(df1: DataFrame, df2: DataFrame, df3: DataFrame): Future[DataFrame] = {
  Future { df1.union(df2).union(df3) }
}
// At the end
val finalDf = unionAllDataframes(
  aTransformationOnDf1,
  anotherTransformationOnDf1WithDf2,
  yetAnotherTransformationOnDf1WithDf2)
finalDf.onComplete({
  case Success(df) => df.write(etc...)
  case Failure(exception) => handleException(exception)
})
Sorry for the horrendous design and the probably wrong usage of Future. Once again, I am a bit confused about this scenario and I am trying to micro-optimize this step (if possible).
Thanks a lot in advance!
Cheers

Related

Transform and persist Spark DStream into several separate locations, in parallel?

I have a use case of a DStream that contains data with several levels of nesting, and I have a requirement to persist different elements from that data into separate HDFS locations. I managed to work this out by using Spark SQL, as below:
val context = new StreamingContext(sparkConf, Seconds(duration))
val stream = context.receiverStream(receiver)
stream.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate
  import spark.implicits._
  rdd.toDF.drop("childRecords").write.parquet("ParentTable")
}
stream.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate
  import spark.implicits._
  rdd.toDF.select(explode(col("childRecords")).as("children"))
    .select("children.*").write.parquet("ChildTable")
}
// repeat as necessary if the parent table has more different kinds of child records,
// or if a child table itself also has child records
The code works, but the only issue I have with it, is that the persistence runs sequentially - the first stream.foreachRDD has to complete before the second one starts, etc. What I'd like to see ideally is for the persistence job for ChildTable to start without waiting for ParentTable to finish, as they're writing to different locations and would not conflict. In reality, I have about 10 different jobs all waiting to complete sequentially, and would probably see a big improvement in execution time if I'd be able to run them all in parallel.
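Not part of the original post, but a minimal sketch of one way such independent writes are often parallelized: issue both from a single foreachRDD and wrap each write in a Future, so the driver submits the two Spark jobs without waiting for one another. The shared ExecutionContext, the cache/unpersist calls and the 10-minute timeout are assumptions; the stream and table names are reused from the code above.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

stream.foreachRDD { rdd =>
  val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
  import spark.implicits._
  val df = rdd.toDF.cache() // reuse the same micro-batch for both writes

  // Each write is a separate Spark job; wrapping them in Futures lets the driver
  // submit both at once and the scheduler decide how to share the cluster.
  val parentWrite = Future {
    df.drop("childRecords").write.parquet("ParentTable")
  }
  val childWrite = Future {
    df.select(explode(col("childRecords")).as("children"))
      .select("children.*").write.parquet("ChildTable")
  }

  // Block the batch until both jobs finish, then release the cached data.
  Await.result(Future.sequence(Seq(parentWrite, childWrite)), 10.minutes)
  df.unpersist()
}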

Partial unpersist of spark dataframe in scala

I am performing a join by bucketizing one of my dataframes and, given the frequent use of the dataframes, I have persisted them to cache. However, I see that the total time spent in GC has increased to >10% of the execution time.
Is it possible, in Scala, to perform a partial unpersist of the dataframe to actively prune cached data as and when it becomes obsolete for my use case?
EX:
srcDF.persist()
srcDF.count()
val df1 = srcDF.filter(col("bucket_id") === lit(1))
val result = df1.join(otherDF, Seq("field1"))
Given this execution, is it possible for me to do a partial unpersist() to exclude everything with "bucket_id" = 1 from the cache?

Why does RDD.foreach fail with "SparkException: This RDD lacks a SparkContext"?

I have a dataset (as an RDD) that I divide into 4 RDDs by using different filter operators.
val RSet = datasetRdd.
  flatMap(x => RSetForAttr(x, alLevel, hieDict)).
  map(x => (x, 1)).
  reduceByKey((x, y) => x + y)
val Rp:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("Rp"))
val Rc:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("Rc"))
val RpSv:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("RpSv"))
val RcSv:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("RcSv"))
I sent Rp and RpSv to the following function, calculateEntropy:
def calculateEntropy(Rx: RDD[(String, Int)], RxSv: RDD[(String, Int)]): Map[Int, Map[String, Double]] = {
  RxSv.foreach { item =>
    val string = item._1.split(",")
    val t = Rx.filter(x => x._1.split(",")(2).equals(string(2)))
    ...
  }
}
I have two questions:
1- When I loop over RxSv as:
RxSv.foreach { item => ... }
it collects all items of the partitions, but I want only the partition I am in. You might say to use the map function instead, but I am not changing anything in the RDD.
So when I run the code on a cluster with 4 workers and a driver, the dataset is divided into 4 partitions and each worker runs the code. But when I use the foreach loop as specified in the code, the driver collects all data from the workers.
2- I have encountered a problem with this code:
val t = Rx.filter(x => x._1.split(",")(2).equals(string(2)))
The error:
org.apache.spark.SparkException: This RDD lacks a SparkContext.
It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations;
for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
First of all, I'd highly recommend caching the first RDD using the cache operator.
RSet.cache
That will avoid scanning and transforming your dataset every time you filter for the other RDDs: Rp, Rc, RpSv and RcSv.
Quoting the scaladoc of cache:
cache() Persist this RDD with the default storage level (MEMORY_ONLY).
Performance should increase.
Secondly, I'd be very careful using the term "partition" to refer to a filtered RDD, since the term has a special meaning in Spark.
Partitions determine how many tasks Spark executes for an action. They are hints to Spark so that you, as a Spark developer, can fine-tune your distributed pipeline.
The pipeline is distributed across the cluster nodes, with one or more Spark executors, according to the partitioning scheme. If you decide to have a single partition in an RDD, then once you execute an action on that RDD, you'll have one task on one executor.
The filter transformation does not change the number of partitions (in other words, it preserves partitioning). The number of partitions, i.e. the number of tasks, is exactly the number of partitions of RSet.
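A quick way to see this (not in the original answer; getNumPartitions is the standard RDD method, and the names are from the question):
// filter returns a new RDD with the same partitioning as its parent,
// so every filtered RDD reports the same partition count as RSet.
println(RSet.getNumPartitions) // e.g. 4
println(Rp.getNumPartitions)   // same number: filter preserves partitioning
println(RcSv.getNumPartitions) // same number again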
1- When I loop over RxSv it collects all items of the partitions, but I want only the partition I am in
You are. Don't worry about it as Spark will execute the task on executors where the data lives. foreach is an action that does not collect items but describes a computation that runs on executors with the data distributed across the cluster (as partitions).
If you want to process all items at once per partition use foreachPartition:
foreachPartition Applies a function f to each partition of this RDD.
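A minimal sketch of the difference, using the RxSv RDD from the question (the bodies are placeholders):
// foreach: the function runs once per element, on the executor that holds it;
// nothing is collected back to the driver.
RxSv.foreach { item =>
  // per-element processing happens here, on the executor
}

// foreachPartition: the function runs once per partition and receives an
// iterator over every element stored in that partition.
RxSv.foreachPartition { partitionItems: Iterator[(String, Int)] =>
  partitionItems.foreach { item =>
    // process the whole partition locally, still on the executor
  }
}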
2- I have encountered a problem with this code
In the following lines of the code:
RxSv.foreach { item =>
  val string = item._1.split(",")
  val t = Rx.filter(x => x._1.split(",")(2).equals(string(2)))
you are executing the foreach action, which in turn uses Rx, which is an RDD[(String, Int)]. This is not allowed (and, even if it were possible, it should not have compiled).
The reason for the behaviour is that an RDD is a data structure that just describes what happens with the dataset when an action is executed and lives on the driver (the orchestrator). The driver uses the data structure to track the data sources, transformations and the number of partitions.
An RDD as an entity is gone (= disappears) when the driver spawns tasks on executors.
And when the tasks run, nothing is available to help them know how to run the RDDs that are part of their work; hence the error. Spark is very cautious about this and checks for such anomalies before they can cause issues once tasks are executed.
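Not from the original answer, but a minimal sketch of one common way around the restriction: key both RDDs by the field being compared and join them, so that no RDD is referenced inside another RDD's closure (names reused from the question):
// Key both RDDs by the third comma-separated field of their String key.
val rxByKey   = Rx.keyBy { case (s, _) => s.split(",")(2) }
val rxSvByKey = RxSv.keyBy { case (s, _) => s.split(",")(2) }

// join pairs every RxSv record with the Rx records sharing that field,
// replacing the Rx.filter call that previously ran inside RxSv.foreach.
val joined = rxSvByKey.join(rxByKey) // RDD[(String, ((String, Int), (String, Int)))]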

Using Futures within Spark

A Spark job makes a remote web service call for every element in an RDD. A simple implementation might look something like this:
def webServiceCall(url: String) = scala.io.Source.fromURL(url).mkString
val rdd2 = rdd1.map(x => webServiceCall(x.field1))
(The above example has been kept simple and does not handle timeouts).
There is no interdependency between any of the results for different elements of the RDD.
Would the above be improved by using Futures to optimise performance by making parallel calls to the web service for each element of the RDD? Or does Spark itself have that level of optimization built in, so that it will run the operations on each element in the RDD in parallel?
If the above can be optimized by using Futures, does anyone have some code examples showing the correct way to use Futures within a function passed to a Spark RDD?
Thanks
Or does Spark itself have that level of optimization built in, so that it will run the operations on each element in the RDD in parallel?
It doesn't. Spark parallelizes tasks at the partition level but by default every partition is processed sequentially in a single thread.
Would the above be improved by using Futures
It could be an improvement, but it is quite hard to do right. In particular:
every Future has to be completed in the same stage, before any shuffle takes place.
given the lazy nature of the Iterators used to expose partition data, you cannot do it with high-level primitives like map (see for example Spark job with Async HTTP call).
you can build your custom logic using mapPartitions, but then you have to deal with all the consequences of non-lazy partition evaluation.
I couldn't find an easy way to achieve this. But after several iterations of retries, this is what I did, and it is working for a huge list of queries. Basically, we used this to batch a huge query into multiple sub-queries.
// Break down your huge workload into smaller chunks; in this case a huge query string is broken
// down into a small set of sub-queries.
// If you need to optimize further, you can provide an explicit number of partitions when parallelizing.
val queries = sqlContext.sparkContext.parallelize[String](subQueryList.toSeq)
// Then map each one of those to a Spark task; in this case each task is a Future that returns a string.
val tasks: RDD[Future[String]] = queries.map { query =>
  val task = makeHttpCall(query) // method returns the HTTP call response as a Future[String]
  task.recover {
    case ex => logger.error("recover: " + ex.getMessage)
  }
  task.onFailure {
    case t => logger.error("execution failed: " + t.getMessage)
  }
  task
}
// Note: the HTTP call is still not invoked here; you are only including it as part of the lineage.
// Then, in each partition, you combine all the Futures (there could be several tasks per partition),
// sequence them, and Await the result, so that each partition blocks until every Future in its sequence is resolved.
val contentRdd = tasks.mapPartitions[String] { f: Iterator[Future[String]] =>
  val searchFuture: Future[Iterator[String]] = Future sequence f
  Await.result(searchFuture, threadWaitTime.seconds)
}
// Note: at this point you can apply any transformations to this RDD and they will be appended to the lineage.
// When you perform an action on that RDD, the mapPartitions closures are evaluated, the sub-queries are
// executed as fully parallel HTTP requests, and the data is collected into a single RDD.
I'm reposting it from my original answer here

parallelize list.map for concurrent rdd jobs

According to a posting:
https://groups.google.com/forum/#!topic/spark-users/3QIn42VbQe0
This code will submit all jobs directly to Spark's scheduler, and you get a list of "Future"s back.
val rdds: List[RDD[T]] = ...
val futures = rdds.map { rdd =>
  rdd.map(...).reduceByKey(...).collect()
}
I am wondering whether adding .par would speed this up, as in
rdds.par.map
or maybe the author meant that each map entry would just be a Spark job submission, and running them in sequence would be just as fast.
In the code provided, the RDDs will be evaluated sequentially. When you call .collect on an RDD, the RDD transformation is evaluated and the results are collected in the driver. The driver is blocked while the results are collected.
If you were to change it to rdds.par.map { ... }, then the .collects would be called in parallel and all RDDs would be evaluated at the same time. This then leaves it to the Spark scheduling mechanism to decide how to share the cluster between the RDDs.
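A minimal sketch of the .par variant described above, assuming rdds holds RDD[String] (the per-RDD pipeline here is a placeholder, not the original author's code):
// On Scala 2.13+ .par needs the scala-parallel-collections module; on 2.12 and earlier it is built in.
// Each closure blocks on its own collect(), so running them from a parallel collection
// submits every job to Spark's scheduler at roughly the same time and lets the
// scheduler decide how to share executors between them.
val results = rdds.par.map { rdd =>
  rdd.map(line => (line, 1)).reduceByKey(_ + _).collect() // placeholder pipeline
}.seq // back to a sequential collection once all jobs have finished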