Spark refusing explicit broadcast join - pyspark

In my pyspark code (v2.4) I join two dataframes: one very large, the other tiny (just a few entries). The small DF, however, is obtained as a summary of another large DF.
My broadcast join maxsize thresholds are generous enough for a broadcast join to take place, but it just doesn't. Even when I explicitly call for it:
df_large.join(broadcast(df_small), ...)
the physical plan still shows SortMergeJoin. I get a broadcast join only if I save the small DF to a file and re-read it before the join.
What other means do I have to enforce broadcast join?
Some background:
A broadcast join is needed because otherwise there is enormous skew.
Replacing the join with a UDF seems like a bad alternative because it involves an early collect (though maybe a broadcast join does that as well?).
Thanks

Related

Is it feasible to force Spark data set stored on driver node (not as pandas df)?

In my script, certain small data sets are joined frequently. I notice that a join normally takes longer than other operations, probably (as I understand it) because the join is done only on the driver node, so data from the partitions has to be collected to the driver every time before joining.
Is it possible at all to make a Spark data set stay on the driver node, so that it is already there for the next join? I guess that could be better than collecting the same data set to the driver each time, joining, dispatching it back to the partitions, and then collecting it back to the driver for the next join.
You may suggest calling .collect() and working with it purely as a pandas data frame on the driver node.
I have mixed and matched pandas and Spark data frames on several occasions. However, in my scenario, besides being used in small joins, the data set is also used in a large join with a big data set, which I currently do as a broadcast join.
Imagine the operation is
small_df = small_df.join1(...)
small_df = f1(...)
small_df = f2(...)
.....
small_df = small_df.join2(..)
...
small_df = small_df.join3(..)
result = broadcast(small_df).join(big_df)
Again, in both cases, I understand that the small_df will always be pulled back to the driver node for join. Hence ideally it would be staying in the driver node while remaining a Spark df and not pandas df.
It's just a hypothetical thought; I am not sure it's feasible or efficient at all. Perhaps it goes against the philosophy of Spark, but if it's possible I would like to try and compare it with my current approach.
A join is performed on the executors, not on the driver. As long as you use the small dataset on the right-hand side of the join, Spark should automatically perform a broadcast join where appropriate.

Spark dataframe Join issue

The code snippet below works fine (read a CSV file, read a Parquet file, and join them).
import org.apache.spark.sql.functions.{broadcast, col}
// Read the CSV file: three columns, 1 record
val df1 = spark.read.format("csv").load(filePath)
val df2 = spark.read.parquet(inputFilePath)
// Join with the other table: 30 million records, 15 columns
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")
It's weird that the snippet below doesn't work (read from HBase, read a Parquet file, and join them). The only difference is that df1 is read from HBase.
// Read from HBase: three columns, 1 record
val df1 = ... // read from HBase (code omitted)
// The read from HBase works, and show displays the one record
df1.show
val df2 = spark.read.parquet(inputFilePath)
// Join with the other table: 50 million records, 15 columns
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")
Error: Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 56 tasks (1024.4 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)
Then I added spark.driver.maxResultSize=5g, and another error started occurring: a Java heap space error (run at ThreadPoolExecutor.java). If I watch memory usage in the Manager, I see that usage keeps going up until it reaches ~50 GB, at which point the OOM error occurs. So for whatever reason the amount of RAM used to perform this operation is ~10x greater than the size of the RDD I'm trying to use.
If I persist df1 in memory and on disk and call count(), the program works fine. The code snippet is below.
// Read from HBase: three columns, 1 record
val df1 = ... // read from HBase (code omitted)
// Persist df1 and force it to be materialized before the join
df1.persist(StorageLevel.MEMORY_AND_DISK)
val cnt = df1.count()
val df2 = spark.read.parquet(inputFilePath)
// Join with the other table: 50 million records, 15 columns
df2.join(broadcast(df1), col("df2col1") === col("df1col1"), "right")
It works with the file but not with HBase, even though both hold the same data. I am running this on a cluster of 100 worker nodes with 125 GB of memory each, so memory is not the problem.
My question is: the file and HBase hold the same data, and in both cases I can read it and show() it. So why does only the HBase version fail? I am struggling to understand what might be going wrong with this code. Any suggestions will be appreciated.
When the data is extracted, Spark does not know how many rows will be retrieved from HBase, so the strategy it opts for is a sort-merge join.
It therefore tries to sort and shuffle the data across the executors.
To avoid the problem we can use a broadcast join; at the same time we don't want to sort and shuffle the data from df2 on the key column, which is what the last statement in your code snippet ends up doing.
However, to bypass this (since it is only one row), we can use a case expression for the columns to be added.
Example:
df.withColumn(
  "newCol",
  // === builds a Column comparison; in Scala, .eq would be reference equality
  when(col("df2col1") === lit(hbaseKey), lit(hbaseValueCol1))
    .otherwise(lit(null))
)
I sometimes struggle with this error too. It often occurs when Spark tries to broadcast a large table during a join (which happens when Spark's optimizer underestimates the size of the table, or when the statistics are incorrect). As there is no hint to force a sort-merge join (see "How to hint for sort merge join or shuffled hash join (and skip broadcast hash join)?"), the only option is to disable broadcast joins by setting spark.sql.autoBroadcastJoinThreshold=-1.
When I have memory problems during a join, it is usually for one of two reasons:
1. You have too few partitions in the dataframes (the partitions are too big).
2. There are many duplicates in the two dataframes on the key on which you join, and the join explodes your memory.
Ad 1. I think you should look at the number of partitions you have in each table before the join. When Spark reads a file it does not necessarily keep the same number of partitions as the original table (Parquet, CSV or other). Reading from CSV and reading from HBase might create different numbers of partitions, and that is why you see differences in performance. Partitions that are too large become even larger after the join, and this creates the memory problem. Have a look at the Peak Execution Memory per task in the Spark UI. This will give you some idea of your memory usage per task. I found it best to keep it below 1 GB.
Solution: repartition your tables before the join.
Ad 2. Maybe not the case here, but worth checking.
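To make the second reason concrete, here is a minimal plain-Python sketch (not Spark code) that counts how many rows an inner equi-join produces from the key columns alone. The function name join_output_rows and the sample sizes are made up for illustration.

```python
from collections import Counter

def join_output_rows(left_keys, right_keys):
    """Number of rows an inner equi-join produces for the given key columns."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Each key contributes (matches on the left) x (matches on the right) rows.
    return sum(n * right_counts[k] for k, n in left_counts.items())

# Unique keys on both sides: the output is no larger than the inputs.
unique_rows = join_output_rows(range(1000), range(1000))

# 1000 rows per side, but only 10 distinct keys (100 duplicates each).
dup_keys = [i % 10 for i in range(1000)]
exploded_rows = join_output_rows(dup_keys, dup_keys)

print(unique_rows)    # 1000
print(exploded_rows)  # 100000
```

With the same input size, the duplicated keys produce 100 times more output rows, and all rows for a given key end up in the same task.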

Spark performance: local faster than cluster (very uneven executor load)

Let me start off by saying that I'm relatively new to Spark, so if I say something that doesn't make sense, please correct me.
To summarise the problem: no matter what I do, at certain stages one executor does all the computation, which makes cluster execution slower than local, single-processor execution.
Full story:
I've written a Spark 1.6 application which consists of a series of maps, filters, joins and a short GraphX part. The app uses only one data source, a CSV file. For development I created a mock dataset of 100,000 rows (7 MB), with all fields holding uniformly distributed random data (the row order in the file is random as well). The joins are inner self-joins on a PairRDD on various fields (the dataset has duplicate keys, with ~200 duplicates per key, imitating real data), leading to a Cartesian product within each key. Then I perform a number of map and filter operations on the result of the joins, store it as an RDD of custom-class objects and save everything as a graph at the end.
I developed the code on my laptop and ran it, which took about 5 minutes (Windows machine, local file). To my surprise, when I deployed the jar onto the cluster (master yarn, cluster mode, CSV file in HDFS) and submitted it, the code took 8 minutes to execute.
I ran the same experiment with smaller data, and the results were 40 seconds locally and 1.1 minutes on the cluster.
When I looked at the history server I saw that 2 stages are particularly long (almost 4 minutes each), and in these stages there is one task that takes >90% of the time. I ran the code multiple times and it was always the same task that took so long, even though it was deployed on a different data node each time.
To my surprise, when I opened the executors page I saw that one executor does almost all of the work (in terms of time spent) and executes most of the tasks. In the screenshot provided the second most 'active' executor had 50 tasks, but that's not always the case: in a different submission the second busiest executor had 15 tasks and the leading one 95.
Moreover, I saw that the 3.9 minutes are spent on computation (second screenshot), which is heaviest on the joined data shortly after the map. I thought that the data might not be partitioned equally, so that one executor has to perform all the computation. Therefore, I tried to partition the PairRDD manually (using .partitionBy(new HashPartitioner(40))) right before the join (similar execution time) or right after the join (execution even slower).
What could be the issue? Any help will be appreciated.
It's hard to tell without seeing your queries and understanding your Dataset, I'm guessing you didn't include it either because it's very complex or sensitive? So this is a little bit of a shot in the dark, however this looks a lot like a problem we dealt with on my team at work. My rough guess at what is happening is that during one of your joins, you have a key space that has a high cardinality, but very uneven distribution. In our case, we were joining on sources of web traffic, which while we have thousands of possible sources of traffic, the overwhelming majority of the traffic comes from just a few. This caused a problem when we joined. The keys would be distributed evenly among the executors, however since maybe 95% of the data shared maybe 3 or 4 keys, a very small number of executors were doing most of the work. When you find a join that suffers from this, the thing to do is to pick the smaller of the two datasets and explicitly perform a broadcast join. (Spark normally will try to do this, but it's not always perfect at being able to tell when it should.)
To do this, let's say you have two DataFrames. One of them has two columns, number and stringRep, with one row for each integer from 0 to 10000, where stringRep is just the string representation of number: "one", "two", "three", etc. We'll call this numToString.
The other DataFrame has a key column called kind to join against number in numToString, some other irrelevant data, and 100,000,000 rows. We'll call this DataFrame ourData. Then let's say that the distribution of the 100,000,000 rows in ourData is: 90% have kind == 1, 5% have kind == 2, and the remaining 5% are distributed pretty evenly amongst the remaining numbers. When you run the following code:
val numToString: DataFrame = loadNumToString()
val ourData: DataFrame = loadOurCode()
val joined = ourData.join(numToString).where(ourData("kind") === numToString("number"))
...it is very likely that Spark will send 90% of the data (the rows with kind == 1) to one executor, 5% of the data (the rows with kind == 2) to another executor, and the remaining 5% smeared across the rest, leaving two executors with huge partitions and the rest with very tiny ones.
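The effect can be reproduced without a cluster. The following is a rough plain-Python simulation, not Spark code: it assigns each row to a partition by hashing its key, which is how a shuffle join routes rows. The partition count, the scaled-down row count and the kind_for_row helper are assumptions made for the demo.

```python
from collections import Counter

NUM_PARTITIONS = 20   # assumed number of shuffle partitions
NUM_ROWS = 100_000    # scaled down from the 100,000,000 rows in the text

def kind_for_row(i):
    """Deterministic stand-in for the skewed `kind` column."""
    bucket = i % 100
    if bucket < 90:
        return 1            # 90% of rows
    if bucket < 95:
        return 2            # 5% of rows
    return 3 + i // 100     # remaining 5%, spread over many other keys

# A shuffle join sends every row to the partition chosen by its key's hash.
rows_per_partition = Counter(
    hash(kind_for_row(i)) % NUM_PARTITIONS for i in range(NUM_ROWS)
)

largest = max(rows_per_partition.values())
print(f"largest partition holds {largest / NUM_ROWS:.0%} of the rows")
```

Raising NUM_PARTITIONS does not help here: all rows with kind == 1 share one hash value, so they always land in the same partition.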
The way around this as I mentioned before is to explicitly perform a broadcast join. What this does is take one DataFrame and distribute it entirely to each node. So you would do this instead:
val joined = ourData.join(broadcast(numToString)).where(ourData("kind") === numToString("number"))
...which would send numToString to each executor. Assuming that ourData was evenly partitioned beforehand, the data should remain evenly partitioned across executors. This might not be your problem, but it does sound a lot like a problem we were having. Hope it helps!
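As a rough illustration of why the broadcast join keeps the load even, here is a plain-Python sketch (not Spark code). Every partition of the large side joins against its own full copy of the small table, so no row of the large side has to move. The four round-robin partitions and the three-entry lookup are assumptions for the demo.

```python
# Small lookup table, copied in full to every partition (the "broadcast").
num_to_string = {1: "one", 2: "two", 3: "three"}

# Large side: heavily skewed keys, but already spread evenly over 4 partitions
# (round-robin), which is the assumption made in the text.
kinds = [1] * 90 + [2] * 5 + [3] * 5
partitions = [kinds[p::4] for p in range(4)]

def broadcast_join(partition, lookup):
    """Join one partition against the local copy of the lookup; no row leaves it."""
    return [(kind, lookup[kind]) for kind in partition if kind in lookup]

joined = [broadcast_join(part, num_to_string) for part in partitions]

print([len(part) for part in joined])  # [25, 25, 25, 25]
```

The key skew is still present in the data, but it no longer decides which partition a row is processed in.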
More information on broadcast joins can be found here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins-broadcast.html

Understanding Spark partitioning

I'm trying to understand how Spark partitions data. Suppose I have an execution DAG like that in the picture (orange boxes are the stages). The two groupBy and the join operations are supposed to be very heavy if the RDD's are not partitioned.
Is it wise then to apply .partitionBy(new HashPartitioner(properValue)) to P1, P2, P3 and P4 to avoid a shuffle? What's the cost of partitioning an existing RDD? When is it not appropriate to partition an existing RDD? Doesn't Spark partition my data automatically if I don't specify a partitioner?
Thank you
tl;dr The answers to your questions respectively: Better to partition at the outset if you can; Probably less than not partitioning; Your RDD is partitioned one way or another anyway; Yes.
This is a pretty broad question. It takes up a good portion of our course! But let's try to address as much about partitioning as possible without writing a novel.
As you know, the primary reason to use a tool like Spark is because you have too much data to analyze on one machine without having the fan sound like a jet engine. The data get distributed among all the cores on all the machines in your cluster, so yes, there is a default partitioning--according to the data. Remember that the data are distributed already at rest (in HDFS, HBase, etc.), so Spark just partitions according to the same strategy by default to keep the data on the machines where they already are--with the default number of partitions equal to the number of cores on the cluster. You can override this default number by configuring spark.default.parallelism, and you want this number to be 2-3 per core per machine.
However, typically you want data that belong together (for example, data with the same key, where HashPartitioner would apply) to be in the same partition, regardless of where they are to start, for the sake of your analytics and to minimize shuffle later. Spark also offers a RangePartitioner, or you can roll your own for your needs fairly easily. But you are right that there is an upfront shuffle cost to go from default partitioning to custom partitioning; it's almost always worth it.
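As a minimal sketch of what hash partitioning does, the plain-Python function below (not Spark code) places each record by the hash of its key modulo the partition count. Spark's HashPartitioner works along the same lines, using a non-negative modulus of the key's hashCode. The function name and the sample records are made up for illustration.

```python
from collections import defaultdict

def hash_partition(records, num_partitions):
    """Group (key, value) records by hash(key) modulo the partition count."""
    partitions = defaultdict(list)
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return dict(partitions)

records = [(7, "a"), (2, "b"), (7, "c"), (5, "d"), (2, "e")]
parts = hash_partition(records, num_partitions=3)

print(parts)  # {1: [(7, 'a'), (7, 'c')], 2: [(2, 'b'), (5, 'd'), (2, 'e')]}
```

Records with the same key always end up together, while different keys may share a partition (2 and 5 here), and some partitions may stay empty.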
It is generally wise to partition at the outset (rather than delay the inevitable with partitionBy) and then repartition if needed later. Later on you may even choose to coalesce to reduce the number of partitions (by default coalesce merges existing partitions without a full shuffle), potentially leaving some machines and cores idle, because the gain in network IO (after that upfront cost) is greater than the loss of CPU power.
(The only situation I can think of where you don't partition at the outset--because you can't--is when your data source is a compressed file.)
Note also that a map-like transformation keeps the existing partitioner only if Spark knows the keys are unchanged: use mapValues, or mapPartitions / mapPartitionsWithIndex with preservesPartitioning = true.
Finally, keep in mind that as you experiment with your analytics while you work your way up to scale, there are diagnostic capabilities you can use:
toDebugString to see the lineage of RDDs
getNumPartitions to, shockingly, get the number of partitions
glom to see clearly how your data are partitioned
And if you pardon the shameless plug, these are the kinds of things we discuss in Analytics with Apache Spark. We hope to have an online version soon.
By applying partitionBy preemptively you don't avoid the shuffle. You just move it somewhere else. This can be a good idea if the partitioned RDD is reused multiple times, but you gain nothing for a one-off join.
Doesn't Spark partition my data automatically if I don't specify a partitioner?
It will partition (a.k.a. shuffle) your data as part of the join and of the subsequent groupBy (unless you keep the same key and use a transformation which preserves partitioning).

How to transform RDD, Dataframe or Dataset straight to a Broadcast variable without collect?

Is there any way (or any plans) to be able to turn Spark distributed collections (RDDs, Dataframe or Datasets) directly into Broadcast variables without the need for a collect? The public API doesn't seem to have anything "out of box", but can something be done at a lower level?
I can imagine there is some 2x speedup potential (or more?) for these kind of operations. To explain what I mean in detail let's work through an example:
val myUberMap: Broadcast[Map[String, String]] =
sc.broadcast(myStringPairRdd.collect().toMap)
someOtherRdd.map(someCodeUsingTheUberMap)
This causes all the data to be collected to the driver, then the data is broadcasted. This means the data is sent over the network essentially twice.
What would be nice is something like this:
val myUberMap: Broadcast[Map[String, String]] =
myStringPairRdd.toBroadcast((a: Array[(String, String)]) => a.toMap)
someOtherRdd.map(someCodeUsingTheUberMap)
Here Spark could bypass collecting the data altogether and just move the data between the nodes.
BONUS
Furthermore, there could be a Monoid-like API (a bit like combineByKey) for situations where the .toMap or whatever operation on Array[T] is expensive, but can possibly be done in parallel. E.g. constructing certain Trie structures can be expensive, this kind of functionality could result in awesome scope for algorithm design. This CPU activity can also be run while the IO is running too - while the current broadcast mechanism is blocking (i.e. all IO, then all CPU, then all IO again).
CLARIFICATION
Joining is not (main) use case here, it can be assumed that I sparsely use the broadcasted data structure. For example the keys in someOtherRdd by no means covers the keys in myUberMap but I don't know which keys I need until I traverse someOtherRdd AND suppose I use myUberMap multiple times.
I know that all sounds a bit vague, but the point is for more general machine learning algorithm design.
While this is an interesting idea in theory, I will argue that, although it is possible, it has very limited practical applications. Obviously I cannot speak for the PMC, so I cannot say whether there are any plans to implement this type of broadcasting mechanism at all.
Possible implementation:
Spark already provides a torrent broadcasting mechanism, whose behavior is described as follows:
The driver divides the serialized object into small chunks and stores those chunks in the BlockManager of the driver.
On each executor, the executor first attempts to fetch the object from its BlockManager. If it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or other executors if available.
Once it gets the chunks, it puts the chunks in its own BlockManager, ready for other executors to fetch from.
So it should be possible to reuse the same mechanism for direct node-to-node broadcasting.
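The chunked fetch described above can be sketched as a toy simulation in plain Python. It is not Spark's implementation: executors fetch one after another here, whereas real executors fetch concurrently, and the chunk size, node names and the stores dictionary are all assumptions made for the demo.

```python
CHUNK_SIZE = 4  # bytes per chunk; tiny on purpose (assumed value for this demo)

def split_into_chunks(payload, chunk_size):
    """Cut the serialized object into fixed-size chunks."""
    return [payload[i:i + chunk_size] for i in range(0, len(payload), chunk_size)]

payload = b"broadcast me to every executor"
# Block stores: node name -> {chunk index: chunk bytes}. Only the driver starts with data.
stores = {"driver": dict(enumerate(split_into_chunks(payload, CHUNK_SIZE)))}
served_by = {"driver": 0, "peer": 0}

def fetch_broadcast(executor, num_chunks):
    """Fetch all chunks for one executor, preferring peers over the driver."""
    local = stores.setdefault(executor, {})
    for idx in range(num_chunks):
        if idx in local:
            continue  # already in this executor's own store
        peers = [name for name, store in stores.items()
                 if name not in ("driver", executor) and idx in store]
        source = peers[0] if peers else "driver"
        local[idx] = stores[source][idx]  # now available for later executors to fetch
        served_by["peer" if peers else "driver"] += 1
    return b"".join(local[i] for i in range(num_chunks))

num_chunks = len(stores["driver"])
results = [fetch_broadcast(f"executor-{i}", num_chunks) for i in range(4)]

print(all(r == payload for r in results))  # True
print(served_by)  # the driver serves each chunk once; peers serve the rest
```

In this toy run the driver hands out each chunk only once, which is the property that direct node-to-node broadcasting would build on.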
It is worth noting that this approach cannot completely eliminate driver communication. Even though blocks could be created locally you still need a single source of truth to advertise a set of blocks to fetch.
Limited applications
One problem with broadcast variables is that they are quite expensive. Even if you can eliminate the driver bottleneck, two problems remain:
Memory required to store deserialized object on each executor.
Cost of transferring broadcasted data to every executor.
The first problem should be relatively obvious. It is not only about direct memory usage but also about GC cost and its effect on overall latency. The second one is rather subtle. I partially covered this in my answer to "Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark", but let's discuss it further.
From a network traffic perspective, broadcasting a whole dataset is pretty much equivalent to creating a Cartesian product. So if the dataset is large enough for the driver to become a bottleneck, it is unlikely to be a good candidate for broadcasting, and a targeted approach like a hash join can be preferred in practice.
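A back-of-the-envelope model makes the trade-off visible. It is a deliberately simplified assumption, not a measurement: a broadcast ships one full copy of the small side to every executor, while a shuffle join moves each row of both sides across the network at most once. The sizes and the executor count below are made-up inputs.

```python
def broadcast_traffic(small_bytes, num_executors):
    """Every executor receives a full copy of the small side."""
    return small_bytes * num_executors

def shuffle_traffic(small_bytes, large_bytes):
    """Upper bound: every row of both sides crosses the network once."""
    return small_bytes + large_bytes

MB, GB = 1024 ** 2, 1024 ** 3
executors = 100
large = 50 * GB

for small in (10 * MB, 2 * GB):
    b = broadcast_traffic(small, executors)
    s = shuffle_traffic(small, large)
    print(f"small side {small / MB:>6.0f} MB: broadcast {b / GB:6.1f} GB, shuffle {s / GB:6.1f} GB")
```

Under these assumptions the 10 MB table is far cheaper to broadcast, while the 2 GB table generates several times more traffic when broadcast than when shuffled.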
Alternatives:
There are some methods which can be used to achieve results similar to a direct broadcast and address the issues enumerated above, including:
Passing data via distributed file system.
Using replicated database collocated with worker nodes.
I don't know if we can do it for an RDD, but you can do it for a DataFrame:
import org.apache.spark.sql.functions
val df:DataFrame = your_data_frame
val broadcasted_df = functions.broadcast(df)
Now you can use the variable broadcasted_df, and it will be broadcast to the executors.
Make sure the broadcasted_df DataFrame is not too big and can be sent to the executors.
broadcasted_df will be broadcast in operations such as
other_df.join(broadcasted_df)
and in this case the join() operation executes faster because every executor has one partition of other_df and the whole of broadcasted_df.
As for your question, I am not sure you can do what you want. You cannot use one RDD inside the map() method of another RDD, because Spark doesn't allow transformations inside transformations. In your case you need to call collect() to create a map from your RDD, because you can only use an ordinary map object inside map(); you cannot use an RDD there.