Spark: Which is the proper way to use a Broadcast variable? - scala

I don't know if I'm using well a broadcast variable.
I have two RDDs, rdd1 and rdd2. I want to apply rdd2.mapPartitionsWithIndex(...), and for each partition I need to perfom some calculation using the whole rdd1. So, I think this is a case to use a Broadcast variable. First question: Am I thinking it right?
To do so, I did this:
val rdd1Broadcast = sc.broadcast(rdd1.collect())
Second question: Why do I need to put .collect(). I saw examples with and without .collect(), but I didn't realized when do I need to use it.
Also, I did this:
val rdd3 = rdd2.mapPartitionsWithIndex( myfunction(_, _, rdd1Broadcast), preservesPartitioning = preserves).cache()
Third question: Which is better: passing rdd1Broadcast or rdd1Broadcast.value?

Am I thinking it right?
There is really not enough information to answer this part. Broadcasting is useful only if broadcasted object is relatively small, or local access significantly reduces computational complexity.
Why do I need to put .collect().
Because RDDs can be accessed only on the driver. Broadcasting RDD is not meaningful, as you cannot access the data from a task.
Which is better: passing rdd1Broadcast or rdd1Broadcast.value?
The argument should be of type Broadcast[_] so don't use rdd1Broadcast.value. If parameter is passed by value, it will be evaluated and substituted locally, and broadcast will not be used.

Related

What is Right way to get spark executing udf

As fas as I know , the spark use lazy computation meaning if the action is not called, nothing would ever never happen .
And one way I know is using collect method get spark working , however when I read the article it says :
Usually, collect() is used to retrieve the action output when you have
very small result set and calling collect() on an RDD/DataFrame with a
bigger result set causes out of memory as it returns the entire
dataset (from all workers) to the driver hence we should avoid calling
collect() on a larger dataset.
And I actually have udf that returns NullType()
#udf
def write_something():
#write something to dir
so I do not want to use collect() ,cause it might cause OOM as mentioned above.
So in my case , what is the best way to do this in my case ? Thanks !
You can use Dataframe.foreach:
df.foreach(lambda x: None)
The foreach action will trigger the excecution of the whole DAG of df while keeping all data on their respective executors.
The pattern foreach(lambda x: None) is mainly used for debugging purposes. An option might be to remove the udf and put its logic into the function that is called by foreach.

Scala Spark isin broadcast list

I'm trying to perform a isin filter as optimized as possible. Is there a way to broadcast collList using Scala API?
Edit: I'm not looking for an alternative, I know them, but I need isin so my RelationProviders will pushdown the values.
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
//collList.size == 200.000
val retTable = df.filter(col("col1").isin(collList: _*))
The list i'm passing to the "isin" method has upto ~200.000 unique elements.
I know this doesn't look like the best option and a join sounds better, but I need those elements pushed down into the filters, makes a huge difference when reading (my storage is Kudu, but it also applies to HDFS+Parquet, base data is too big and queries work on around 1% of that data), I already measured everything, and it saved me around 30minutes execution time :). Plus my method already takes care if the isin is larger than 200.000.
My problem is, I'm getting some Spark "task are too big" (~8mb per task) warnings, everything works fine so not a big deal, but I'm looking to remove them and also optimize.
I've tried with, which does nothing as I still get the warning (since the broadcasted var gets resolved in Scala and passed to vargargs I guess):
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList).value: _*))
And this one which doesn't compile:
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList: _*).value))
And this one which doesn't work (task too big still appears)
val broadcastedList=df.sparkSession.sparkContext.broadcast(collList.map(lit(_).expr))
val filterBroadcasted=In(col("col1").expr, collList.value)
val retTable = df.filter(new Column(filterBroadcasted))
Any ideas on how to broadcast this variable? (hacks allowed). Any alternative to the isin which allows filter pushdown is also valid I've seen some people doing it on PySpark, but the API is not the same.
PS: Changes on the storage are not possible, I know partitioning (already partitioned, but not by that field) and such could help, but user inputs are totally random and the data is accessed and changed my many clients.
I'd opt for dataframe broad cast hash join in this case instead of broadcast variable.
Prepare a dataframe with your collectedDf("col1") collection list you want to filter with isin and then
use join between 2 dataframes to filter the rows matching.
I think it would be more efficient than isin since you have 200k entries to be filtered. spark.sql.autobroadcastjointhreshhold is the property you need to set with appropriate size(by default 10mb). AFAIK you can use till 200mb or 3oomb based on your requirements.
see this BHJ Explanation of how it works
Further reading Spark efficiently filtering entries from big dataframe that exist in a small dataframe
I'll just leave with big tasks since I only use it twice (but saves a lot of time) in my program and I can afford it, but if someone else needs it badly... well this seems to be the path.
Best alternatives I found to have big-arrays pushdown:
Change your relation provider so it broadcasts big-lists when pushing down In filters, this will probably leave some broadcasted trash, but well..., as long as your app is not streaming, it shouldn't be a problem, or you can save in a global list and clean those after a while
Add a filter in Spark (I wrote something at https://issues.apache.org/jira/browse/SPARK-31417 ) which allows broadcasted pushdown all the way to your relation provider. You would have to add your custom predicate, then implement your custom "Pushdown" (you can do this by adding a new rule) and then rewrite your RDD/Relation provider so it can exploit the fact the variable is broadcasted.
Use coalesce(X) after reading to decrease number of tasks, can work sometimes, depends on how the RelationProvider/RDD is implemented.

Split Spark DataFrame based on condition

I need something similar to the randomSplit function:
val Array(df1, df2) = myDataFrame.randomSplit(Array(0.6, 0.4))
However, I need to split myDataFrame based on a boolean condition. Does anything like the following exist?
val Array(df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)
I'd like not to do two separate .filter calls.
Unfortunately the DataFrame API doesn't have such a method, to split by a condition you'll have to perform two separate filter transformations:
myDataFrame.cache() // recommended to prevent repeating the calculation
val condition = col("myColumn") > 100
val df1 = myDataFrame.filter(condition)
val df2 = myDataFrame.filter(not(condition))
I understand that caching and filtering twice looks a bit ugly, but please bear in mind that DataFrames are translated to RDDs, which are evaluated lazily, i.e. only when they are directly or indirectly used in an action.
If a method booleanSplit as suggested in the question existed, the result would be translated to two RDDs, each of which would be evaluated lazily. One of the two RDDs would be evaluated first and the other would be evaluated second, strictly after the first. At the point the first RDD is evaluated, the second RDD would not yet have "come into existence" (EDIT: Just noticed that there is a similar question for the RDD API with an answer that gives a similar reasoning)
To actually gain any performance benefit, the second RDD would have to be (partially) persisted during the iteration of the first RDD (or, actually, during the iteration of the parent RDD of both, which is triggered by the iteration of the first RDD). IMO this wouldn't align overly well with the design of the rest of the RDD API. Not sure if the performance gains would justify this.
I think the best you can achieve is to avoid writing two filter calls directly in your business code, by writing an implicit class with a method booleanSplit as a utility method does that part in a similar way as Tzach Zohar's answer, maybe using something along the lines of myDataFrame.withColumn("__condition_value", condition).cache() so the the value of the condition is not calculated twice.

How to use Scala to partition data into buckets for further processing

I want to read in a csv log which has as it's first column a timestamp of form hh:mm:ss. I would like to partition the entries into buckets, say hourly. I'm curious what the best approach is that adheres to Scala's semantics, i.e., reading the file as a stream, parsing it (maybe by a match predicate?) and emitting the csv entries as tuples.
It's been a couple of years since I looked at Scala but this problem seems particularly well suited to the language.
log format example:
[time],[string],[int],[int],[int],[int],[string]
The last field in the input could be mapped to an emum in the output tuple but I'm not sure there's value in that.
I'd be happy with a general recipe that I could use, with suggestions for certain built-in functions that are well suited to the problem.
The overall goal is a map-reduce, where I want to count elements in a time window but those elements first need to be preprocessed by a regex replace, before sorting and counting.
I've tried to keep the problem abstract, so the problem can be approached as a pattern to follow.
Thanks.
Perhaps as a first pass, a simple groupBy would do the trick ?
logLines.groupBy(line => line.timestamp.hours)
Using the groupBy idiom, and some filtering, my solution looks like
val lines: Traversable[String] = source.getLines.map(_.trim).toTraversable
val events: List[String] = lines.filter(line => line.matches("[\\d]+:.*")).toList
val buckets: Map[String, List[String]] = events.groupBy { line => line.substring(0, line.indexOf(":")) }
This gives me 24 buckets, one for each hour. Now I can process each bucket, perform the regex replace that I need to de-parameterize the URIs and finally map-reduce those to find the frequency each route has occurred.
Important note. I learned that groupBy doesn't work as desired, without first creating a List from the Traversable stream. Without that step, the end result is a single valued map for each hour. Possibly not the most performant solution, since it requires all events to be loaded in memory before partitioning. Is there a better solution that can partition a stream? Perhaps something that adds events to a mutable Set as the stream is processed?

How do you perform blocking IO in apache spark job?

What if, when I traverse RDD, I need to calculate values in dataset by calling external (blocking) service? How do you think that could be achieved?
val values: Future[RDD[Double]] = Future sequence tasks
I've tried to create a list of Futures, but as RDD id not Traversable, Future.sequence is not suitable.
I just wonder, if anyone had such a problem, and how did you solve it?
What I'm trying to achieve is to get a parallelism on a single worker node, so I can call that external service 3000 times per second.
Probably, there is another solution, more suitable for spark, like having multiple working nodes on single host.
It's interesting to know, how do you cope with such a challenge? Thanks.
Here is answer to my own question:
val buckets = sc.textFile(logFile, 100)
val tasks: RDD[Future[Object]] = buckets map { item =>
future {
// call native code
}
}
val values = tasks.mapPartitions[Object] { f: Iterator[Future[Object]] =>
val searchFuture: Future[Iterator[Object]] = Future sequence f
Await result (searchFuture, JOB_TIMEOUT)
}
The idea here is, that we get the collection of partitions, where each partition is sent to the specific worker and is the smallest piece of work. Each that piece of work contains data, that could be processed by calling native code and sending that data.
'values' collection contains the data, that is returned from the native code and that work is done across the cluster.
Based on your answer, that the blocking call is to compare provided input with each individual item in the RDD, I would strongly consider rewriting the comparison in java/scala so that it can be run as part of your spark process. If the comparison is a "pure" function (no side effects, depends only on its inputs), it should be straightforward to re-implement, and the decrease in complexity and increase in stability in your spark process due to not having to make remote calls will probably make it worth it.
It seems unlikely that your remote service will be able to handle 3000 calls per second, so a local in-process version would be preferable.
If that is absolutely impossible for some reason, then you might be able to create a RDD transformation which turns your data into a RDD of futures, in pseudo-code:
val callRemote(data:Data):Future[Double] = ...
val inputData:RDD[Data] = ...
val transformed:RDD[Future[Double]] = inputData.map(callRemote)
And then carry on from there, computing on your Future[Double] objects.
If you know how much parallelism your remote process can handle, it might be best to abandon the Future mode and accept that it is a bottleneck resource.
val remoteParallelism:Int = 100 // some constant
val callRemoteBlocking(data:Data):Double = ...
val inputData:RDD[Data] = ...
val transformed:RDD[Double] = inputData.
coalesce(remoteParallelism).
map(callRemoteBlocking)
Your job will probably take quite some time, but it shouldn't flood your remote service and die horribly.
A final option is that if the inputs are reasonably predictable and the range of outcomes is consistent and limited to some reasonable number of outputs (millions or so), you could precompute them all as a data set using your remote service and find them at spark job time using a join.