What is the right way to get Spark to execute a UDF - PySpark

As far as I know, Spark uses lazy evaluation, meaning that if no action is called, nothing ever happens.
One way I know to get Spark working is the collect method; however, when I read an article about it, it says:
Usually, collect() is used to retrieve the action output when you have a very small result set, and calling collect() on an RDD/DataFrame with a bigger result set causes out-of-memory errors, as it returns the entire dataset (from all workers) to the driver; hence we should avoid calling collect() on a larger dataset.
And I actually have a UDF that returns NullType():
from pyspark.sql.functions import udf
from pyspark.sql.types import NullType

@udf(returnType=NullType())
def write_something():
    # write something to a directory
    pass
so I do not want to use collect(), since it might cause OOM as mentioned above.
So what is the best way to do this in my case? Thanks!

You can use DataFrame.foreach:
df.foreach(lambda x: None)
The foreach action will trigger the execution of the whole DAG of df while keeping all data on their respective executors.
The pattern foreach(lambda x: None) is mainly used for debugging purposes. A better option might be to remove the UDF and put its logic directly into the function that is called by foreach.

Related

Scala Spark isin broadcast list

I'm trying to perform an isin filter as optimized as possible. Is there a way to broadcast collList using the Scala API?
Edit: I'm not looking for an alternative, I know them, but I need isin so my RelationProviders will push down the values.
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
// collList.size == 200,000
val retTable = df.filter(col("col1").isin(collList: _*))
The list I'm passing to the "isin" method has up to ~200,000 unique elements.
I know this doesn't look like the best option, and a join sounds better, but I need those elements pushed down into the filters. It makes a huge difference when reading (my storage is Kudu, but it also applies to HDFS+Parquet): the base data is too big and queries work on around 1% of it. I already measured everything, and it saved me around 30 minutes of execution time :). Plus, my method already takes care of the case where the isin list grows beyond 200,000 elements.
My problem is that I'm getting some Spark "tasks are too big" (~8MB per task) warnings. Everything works fine, so it's not a big deal, but I'm looking to remove them and also to optimize.
I've tried the following, which does nothing, as I still get the warning (since the broadcast variable gets resolved in Scala and passed to varargs, I guess):
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList).value: _*))
And this one which doesn't compile:
val collList = collectedDf.map(_.getAs[String]("col1")).sortWith(_ < _)
val retTable = df.filter(col("col1").isin(sc.broadcast(collList: _*).value))
And this one, which doesn't work (the "task too big" warning still appears):
val broadcastedList = df.sparkSession.sparkContext.broadcast(collList.map(lit(_).expr))
val filterBroadcasted = In(col("col1").expr, broadcastedList.value)
val retTable = df.filter(new Column(filterBroadcasted))
Any ideas on how to broadcast this variable? (Hacks allowed.) Any alternative to isin that allows filter pushdown is also valid; I've seen some people doing it in PySpark, but the API is not the same.
PS: Changes to the storage are not possible. I know partitioning (it's already partitioned, but not by that field) and such could help, but user inputs are totally random and the data is accessed and changed by many clients.
I'd opt for a DataFrame broadcast hash join in this case instead of a broadcast variable.
Prepare a DataFrame with the collectedDf("col1") collection list you want to filter with isin, and then use a join between the two DataFrames to filter the matching rows.
I think it would be more efficient than isin since you have 200k entries to be filtered. spark.sql.autoBroadcastJoinThreshold is the property you need to set to an appropriate size (10MB by default). AFAIK you can go up to 200MB or 300MB based on your requirements.
See this BHJ explanation of how it works.
Further reading Spark efficiently filtering entries from big dataframe that exist in a small dataframe
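For illustration, a minimal sketch of that approach, reusing collList and df from the question (the column name and join type are choices, not requirements): broadcast() marks the small side for broadcasting regardless of the threshold, and a left_semi join behaves like the isin filter, keeping only the rows of df whose col1 appears in the list.
import org.apache.spark.sql.functions.broadcast
import df.sparkSession.implicits._

// Turn the filter values into a one-column DataFrame and broadcast-join it.
val filterDf = collList.toSeq.toDF("col1")
val retTable = df.join(broadcast(filterDf), Seq("col1"), "left_semi")
Note that, as the question points out, this join variant will not be pushed down to the RelationProvider the way an isin filter is; it only removes the large serialized tasks.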
I'll just live with the big tasks, since I only use it twice (but it saves a lot of time) in my program and I can afford it, but if someone else needs it badly... well, this seems to be the path.
The best alternatives I found for getting big-array pushdown:
Change your relation provider so it broadcasts big lists when pushing down In filters. This will probably leave some broadcast garbage behind, but well..., as long as your app is not streaming, it shouldn't be a problem; or you can save them in a global list and clean them up after a while.
Add a filter in Spark (I wrote something at https://issues.apache.org/jira/browse/SPARK-31417) which allows broadcast pushdown all the way to your relation provider. You would have to add your custom predicate, then implement your custom "pushdown" (you can do this by adding a new rule), and then rewrite your RDD/relation provider so it can exploit the fact that the variable is broadcast.
Use coalesce(X) after reading to decrease the number of tasks; it can work sometimes, depending on how the RelationProvider/RDD is implemented.

Spark: Which is the proper way to use a Broadcast variable?

I don't know if I'm using a broadcast variable correctly.
I have two RDDs, rdd1 and rdd2. I want to apply rdd2.mapPartitionsWithIndex(...), and for each partition I need to perform some calculation using the whole rdd1. So, I think this is a case for using a broadcast variable. First question: Am I thinking about it right?
To do so, I did this:
val rdd1Broadcast = sc.broadcast(rdd1.collect())
Second question: Why do I need to put .collect()? I saw examples with and without .collect(), but I haven't figured out when I need to use it.
Also, I did this:
val rdd3 = rdd2.mapPartitionsWithIndex( myfunction(_, _, rdd1Broadcast), preservesPartitioning = preserves).cache()
Third question: Which is better: passing rdd1Broadcast or rdd1Broadcast.value?
Am I thinking it right?
There is really not enough information to answer this part. Broadcasting is useful only if the broadcasted object is relatively small, or if local access significantly reduces computational complexity.
Why do I need to put .collect().
Because RDDs can be accessed only on the driver. Broadcasting an RDD is not meaningful, as you cannot access its data from a task.
Which is better: passing rdd1Broadcast or rdd1Broadcast.value?
The argument should be of type Broadcast[_], so don't pass rdd1Broadcast.value. If the parameter is passed by value, it will be evaluated and substituted locally, and the broadcast will not be used.
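A minimal sketch of that pattern (the element types and the body of myfunction are made up for illustration); the Broadcast handle is the parameter, and .value is only dereferenced inside the task:
import org.apache.spark.broadcast.Broadcast

// Hypothetical function: takes the broadcast handle, reads .value on the executor.
def myfunction(index: Int, it: Iterator[String],
               bc: Broadcast[Array[String]]): Iterator[String] = {
  val lookup = bc.value.toSet  // local access inside the task
  it.filter(lookup.contains)
}

val rdd1Broadcast = sc.broadcast(rdd1.collect())
val rdd3 = rdd2.mapPartitionsWithIndex(
  myfunction(_, _, rdd1Broadcast),
  preservesPartitioning = true
).cache()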

Get a registered Spark Accumulator by name

Is there a way of getting a registered Spark Accumulator by name, without passing an actual reference? Desired behavior:
val cnt1 = sc.longAccumulator("cnt1")
val cnt2 = something.getAccumulatorByName("cnt1").asInstanceOf[LongAccumulator]
cnt1.add(1)
cnt2.value // returns 1
Thanks
Accumulators in Spark are kept in the AccumulatorContext, and there is no way to retrieve them from it by name. Spark doesn't let you do this because accumulators are not necessarily kept alive until you stop the SparkContext: the context implements canonicalizing mappings, so an accumulator is kept only while you hold a strong reference to it, and as soon as it passes out of scope, GC cleans it up (with a special finalization process).
The only way to get an accumulator by name is to keep it in a Map yourself.
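For example, a minimal sketch of such a registry (the object name AccumulatorRegistry is made up for illustration):
import org.apache.spark.util.LongAccumulator
import scala.collection.concurrent.TrieMap

// Driver-side registry: a thread-safe map from name to accumulator.
object AccumulatorRegistry {
  private val accs = TrieMap.empty[String, LongAccumulator]
  def register(name: String, acc: LongAccumulator): Unit = accs.update(name, acc)
  def byName(name: String): Option[LongAccumulator] = accs.get(name)
}

// usage, mirroring the desired behavior from the question
val cnt1 = sc.longAccumulator("cnt1")
AccumulatorRegistry.register("cnt1", cnt1)
val cnt2 = AccumulatorRegistry.byName("cnt1").get
cnt1.add(1)
cnt2.value // returns 1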
If you need, for example, to write to an accumulator in your FileFormat or RelationProvider and then read it on the driver, just keep a static reference to it.
If you read and write accumulators in the same class and you want to get them by name, you most likely need to create a custom accumulator with a Map[String, Long] inside. It is much better in terms of performance.
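And a minimal sketch of that custom accumulator, assuming Long counters keyed by name:
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// One accumulator holding many named Long counters.
class MapAccumulator extends AccumulatorV2[(String, Long), Map[String, Long]] {
  private val counts = mutable.Map.empty[String, Long].withDefaultValue(0L)
  override def isZero: Boolean = counts.isEmpty
  override def copy(): MapAccumulator = {
    val acc = new MapAccumulator
    counts.foreach { case (k, v) => acc.counts(k) = v }
    acc
  }
  override def reset(): Unit = counts.clear()
  override def add(kv: (String, Long)): Unit = counts(kv._1) += kv._2
  override def merge(other: AccumulatorV2[(String, Long), Map[String, Long]]): Unit =
    other.value.foreach { case (k, v) => counts(k) += v }
  override def value: Map[String, Long] = counts.toMap
}
Register it once with sc.register(acc, "named-counters"), call acc.add(("cnt1", 1)) in tasks, and read acc.value("cnt1") on the driver.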

Split Spark DataFrame based on condition

I need something similar to the randomSplit function:
val Array(df1, df2) = myDataFrame.randomSplit(Array(0.6, 0.4))
However, I need to split myDataFrame based on a boolean condition. Does anything like the following exist?
val Array(df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)
I'd like to avoid doing two separate .filter calls.
Unfortunately, the DataFrame API doesn't have such a method; to split by a condition, you'll have to perform two separate filter transformations:
myDataFrame.cache() // recommended to prevent repeating the calculation
val condition = col("myColumn") > 100
val df1 = myDataFrame.filter(condition)
val df2 = myDataFrame.filter(not(condition))
I understand that caching and filtering twice looks a bit ugly, but please bear in mind that DataFrames are translated to RDDs, which are evaluated lazily, i.e. only when they are directly or indirectly used in an action.
If a method booleanSplit as suggested in the question existed, the result would be translated to two RDDs, each of which would be evaluated lazily. One of the two RDDs would be evaluated first and the other would be evaluated second, strictly after the first. At the point the first RDD is evaluated, the second RDD would not yet have "come into existence". (EDIT: Just noticed that there is a similar question for the RDD API with an answer that gives similar reasoning.)
To actually gain any performance benefit, the second RDD would have to be (partially) persisted during the iteration of the first RDD (or, actually, during the iteration of the parent RDD of both, which is triggered by the iteration of the first RDD). IMO this wouldn't align overly well with the design of the rest of the RDD API. Not sure if the performance gains would justify this.
I think the best you can achieve is to avoid writing the two filter calls directly in your business code, by wrapping them in an implicit class with a booleanSplit utility method that does that part, in a similar way to Tzach Zohar's answer, maybe using something along the lines of myDataFrame.withColumn("__condition_value", condition).cache() so that the value of the condition is not calculated twice.
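A minimal sketch of such a utility, reusing the names from the question (booleanSplit itself is hypothetical):
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, not}

// Hypothetical helper: cache the parent once, then filter both ways.
// (Paste into spark-shell, or wrap in an object in compiled code.)
implicit class DataFrameSplitOps(df: DataFrame) {
  def booleanSplit(condition: Column): (DataFrame, DataFrame) = {
    val cached = df.cache() // avoid recomputing the parent for each half
    (cached.filter(condition), cached.filter(not(condition)))
  }
}

// usage
val (df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)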

Why does RDD.groupBy return an empty RDD if the initial RDD wasn't empty?

I have an RDD that I've used to load binary files. Each file is broken into multiple parts and processed. After the processing step, each entry is:
(filename, List[Results])
Since the files are broken into several parts, the filename is the same for several entries in the RDD. I'm trying to put the results for each part back together using reduceByKey. However, when I attempt to run a count on this RDD it returns 0:
val reducedResults = my_rdd.reduceByKey((resultsA, resultsB) => resultsA ++ resultsB)
reducedResults.count() // 0
I've tried changing the key it uses with no success. Even with extremely simple attempts to group the results I don't get any output.
val singleGroup = my_rdd.groupBy { case (k, v) => 1 }
singleGroup.count() // 0
On the other hand, if I simply collect the results, then I can group them outside of Spark and everything works fine. However, I still have additional processing that I need to do on the collected results, so that isn't a good option.
What could cause the groupBy/reduceBy commands to return empty RDDs if the initial RDD isn't empty?
Turns out there was a bug in how I was generating the Spark configuration for that particular job. Instead of setting the spark.default.parallelism field to something reasonable, it was being set to 0.
From the Spark documentation on spark.default.parallelism:
Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.
So while an operation like collect() worked perfectly fine, any attempt to reshuffle the data without specifying the number of partitions gave me an empty RDD. That'll teach me to trust old configuration code.
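For the record, a minimal sketch reproducing the symptom described above (names and values are illustrative): parallelize is given an explicit partition count, so count() on the raw RDD works, but the shuffle introduced by reduceByKey falls back on spark.default.parallelism and produces an RDD with zero partitions.
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("parallelism-repro")
  .setMaster("local[*]")
  .set("spark.default.parallelism", "0") // the bug: zero default partitions

val sc = new SparkContext(conf)
val rdd = sc.parallelize(Seq(("a", 1), ("a", 2)), numSlices = 2)
rdd.count()                    // 2: the input partitions are explicit
rdd.reduceByKey(_ + _).count() // 0: the shuffle output has zero partitions
Setting spark.default.parallelism to a positive value (or passing an explicit numPartitions to reduceByKey) restores the expected result.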