I noticed an inconsistent behaviour in Jupyter Notebook's Spark Scala kernel, although I believe it is really native Spark behaviour. I was hoping that someone has seen this behaviour before and knows how to read and filter the data correctly.
In the example below I am reading data from an S3 JSON source. The filter condition is that the cId field contains a value of length at least 2, so that both null values and the value "0" are filtered out. The values in cId are both numeric and alphanumeric, as in the example.
When I apply the filter to the json source (after converting it to StringType) it returns an empty data set. When I create a small dataframe and apply the filter, all rows are displayed correctly.
Once treated, the dataframe is exported to parquet format.
I am trying to correctly display the values from the JSON source. Here are some of the things I have tried:
Use filter() vs the where() function. Even though they're almost aliases, I wanted to exclude any method errors
turn off the default conversion behaviour from Spark (convertMetastoreParquet)
analyze the output plan (see below)
== Physical Plan ==
Format: JSON, Location: InMemoryFileIndex[s3a://s3pathredacted..., PartitionFilters: [], PushedFilters: [IsNotNull(evaluationType), IsNotNull(requestPayload), EqualTo(evaluationType,inbound_acquiring)], ReadSchema: struct<actionRecommended:string,dateInserted:string,evaluationType:string,requestPayload:struct<b...
: +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, true]))
: +- *(1) Project [customerid#116 AS cId#118]
: +- *(1) Filter ((length(customerid#116) > 1) && isnotnull(customerid#116))
: +- *(1) FileScan csv [customerid#116] Batched: false, Format: CSV, Location: InMemoryFileIndex[s3://folder1/folder2/folder3], PartitionFilters: [], PushedFilters: [IsNotNull(customerid)], ReadSchema: struct<customerid:string>
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StringType
import spark.implicits._
val s3MLDataSource = "s3Path"
var dfJson = spark.read.json(s3MLDataSource)
val dfSeq = Seq("12345", "67890", "1a2b3c4d").toDF("customerId")
dfJson = dfJson.select(col("customerId").cast(StringType).as("cId"))
dfJson = dfJson.where("length(cId) > 1")
dfJson.show(false)
+--------------+
|cId |
+--------------+
| |
+--------------+
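For reference, a minimal diagnostic sketch for the blank result (the requestPayload.customerId path is only an assumption based on the ReadSchema in the plan above; col, length and trim come from org.apache.spark.sql.functions):
import org.apache.spark.sql.functions.{col, length, trim}

val raw = spark.read.json(s3MLDataSource)
raw.printSchema()   // confirm where customerId actually lives (top level vs. inside requestPayload)

// If the field is nested, reference it by its full path before casting and filtering
// (assumed path, inferred from the ReadSchema shown in the physical plan):
// raw.select(col("requestPayload.customerId").cast(StringType).as("cId"))

// Rule out padding or invisible characters that make length() misleading:
raw.select(col("customerId").as("cId"))
   .select(length(col("cId")).as("len"), length(trim(col("cId"))).as("trimmedLen"))
   .show(5, false)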
//Session settings
%%configure -f
{"executorMemory": "4G","driverMemory": "2G","executorCores": 8,
"conf": { "spark.dynamicAllocation.enabled": "true",
"spark.dynamicAllocation.shuffleTracking.enabled": "true",
"spark.shuffle.service.enabled": "true",
"spark.dynamicAllocation.minExecutors": "2",
"spark.sql.hive.convertMetastoreParquet": "false",
"spark.sql.hive.metastorePartitionPruning": "false",
"spark.dynamicAllocation.maxExecutors": "20",
"spark.sql.autoBroadcastJoinThreshold": "26214400",
"spark.sql.cbo.enabled": "true",
"spark.ui.showConsoleProgress": "false",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer"}}
Related
I have a DataFrame with the columns:
field1, field1_name, field3, field5, field4, field2, field6
I am selecting it so that I only keep field1, field2, field3, field4. Note that there is no field5 after the select.
After that, I have a filter that uses field5. I would expect it to throw an analysis error since the column is no longer there, but instead it filters the original DataFrame (before the select), because the filter is pushed down, as shown here:
== Parsed Logical Plan ==
'Filter ('field5 = 22)
+- Project [field1#43, field2#48, field3#45, field4#47]
+- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv
== Analyzed Logical Plan ==
field1: string, field2: string, field3: string, field4: string
Project [field1#43, field2#48, field3#45, field4#47]
+- Filter (field5#46 = 22)
+- Project [field1#43, field2#48, field3#45, field4#47, field5#46]
+- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv
== Optimized Logical Plan ==
Project [field1#43, field2#48, field3#45, field4#47]
+- Filter (isnotnull(field5#46) && (field5#46 = 22))
+- Relation[field1#43,field1_name#44,field3#45,field5#46,field4#47,field2#48,field6#49] csv
== Physical Plan ==
*Project [field1#43, field2#48, field3#45, field4#47]
+- *Filter (isnotnull(field5#46) && (field5#46 = 22))
+- *FileScan csv [field1#43,field3#45,field5#46,field4#47,field2#48] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/Users/..., PartitionFilters: [], PushedFilters: [IsNotNull(field5), EqualTo(field5,22)], ReadSchema: struct<field1:string,field3:string,field5:string,field4:stri...
As you can see the physical plan has the filter before the project... Is this the expected behaviour? I would expect an analysis exception instead...
A reproducible example of the issue:
val df = Seq(
("", "", "")
).toDF("field1", "field2", "field3")
val selected = df.select("field1", "field2")
val shouldFail = selected.filter("field3 == 'dummy'") // I was expecting this filter to fail
shouldFail.show()
Output:
+------+------+
|field1|field2|
+------+------+
+------+------+
The documentation on the Dataset/Dataframe describes the reason for what you are observing quite well:
"Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. "
The key point is that a Dataset only represents a logical plan until an action is invoked. When you apply select and filter statements, they are merely added to that logical plan, which Spark only analyzes and optimizes once an action triggers execution. At that point the Catalyst optimizer looks at the whole plan, and one of its optimization rules is to push down filters, which is what you see in your example.
I think this is a great feature. Even though you are not interested in seeing that particular field in your final DataFrame, Spark understands that you want to discard some of the original rows, and it applies that filter as early as possible.
That is the main benefit of Spark SQL engine as opposed to RDDs. It understands what you are trying to do without being told how to do it.
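To see the same mechanism from the API side, here is a small sketch (assuming a Spark shell with spark.implicits._ in scope): the analyzer quietly re-adds the missing column for a Filter, exactly as in your analyzed plan, while a select on the dropped column gets no such treatment.
import spark.implicits._

val df       = Seq(("a", "b", "c")).toDF("field1", "field2", "field3")
val selected = df.select("field1", "field2")

// The analyzed plan re-introduces field3 under an extra Project, as in the example above:
selected.filter("field3 == 'dummy'").explain(true)

// A select on the dropped column, by contrast, is expected to fail analysis:
// selected.select("field3")   // org.apache.spark.sql.AnalysisException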
I'm trying to join two large Spark dataframes using Scala and I can't get it to perform well. I really hope someone can help me.
I have the following two text files:
dfPerson.txt (PersonId: String, GroupId: String) 2 million rows (100MB)
dfWorld.txt (PersonId: String, GroupId: String, PersonCharacteristic: String) 30 billion rows (1TB)
First I parse the text files to parquet and partition on GroupId, which has 50 distinct values and a rest group.
val dfPerson = spark.read.csv("input/dfPerson.txt")
dfPerson.write.partitionBy("GroupId").parquet("output/dfPerson")
val dfWorld = spark.read.csv("input/dfWorld.txt")
dfWorld.write.partitionBy("GroupId").parquet("output/dfWorld")
Note: a GroupId can contain 1 PersonId up to 6 billion PersonIds, so since it is skewed it might not be the best partition column but it is all I could think of.
Next I read the parquet files and join them. I took the following approaches:
Approach 1: Basic spark join operation
val dfPerson = spark.read.parquet("output/dfPerson")
val dfWorld = spark.read.parquet("output/dfWorld")
dfWorld.as("w").join(
dfPerson.as("p"),
$"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
"right"
)
.drop($"w.GroupId")
.drop($"w.PersonId")
This however didn't perform well and shuffled over 1 TB of data.
Approach 2: Broadcast hash join
Since dfPerson might be small enough to hold in memory I thought this approach might solve my problem
val dfPerson = spark.read.parquet("output/dfPerson")
val dfWorld = spark.read.parquet("output/dfWorld")
dfWorld.as("w").join(
broadcast(dfPerson).as("p"),
$"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
"right"
)
.drop($"w.GroupId")
.drop($"w.PersonId")
This also didn't perform well and also shuffled over 1 TB of data, which makes me believe the broadcast didn't work?
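One way to confirm whether the broadcast hint actually took effect is to inspect the plan before running the job; a minimal sketch on the same dataframes:
import org.apache.spark.sql.functions.broadcast

// A BroadcastExchange / BroadcastHashJoin in the output means the hint was applied;
// a SortMergeJoin with two large Exchanges means Spark fell back to a shuffle join.
dfWorld.as("w")
  .join(broadcast(dfPerson).as("p"),
        $"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
        "right")
  .explain()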
Approach 3: Bucket and sort the dataframe
I first try to bucket and sort the dataframes before writing to parquet and then join:
val dfPersonInput = spark.read.csv("input/dfPerson.txt")
dfPersonInput
.write
.format("parquet")
.partitionBy("GroupId")
.bucketBy(4,"PersonId")
.sortBy("PersonId")
.mode("overwrite")
.option("path", "output/dfPerson")
.saveAsTable("dfPerson")
val dfPerson = spark.table("dfPerson")
val dfWorldInput = spark.read.csv("input/dfWorld.txt")
dfWorldInput
.write
.format("parquet")
.partitionBy("GroupId")
.bucketBy(4,"PersonId")
.sortBy("PersonId")
.mode("overwrite")
.option("path", "output/dfWorld")
.saveAsTable("dfWorld")
val dfWorld = spark.table("dfWorld")
dfWorld.as("w").join(
dfPerson.as("p"),
$"w.GroupId" === $"p.GroupId" && $"w.PersonId" === $"p.PersonId",
"right"
)
.drop($"w.GroupId")
.drop($"w.PersonId")
With the following execution plan:
== Physical Plan ==
*(5) Project [PersonId#743]
+- SortMergeJoin [GroupId#73, PersonId#71], [GroupId#745, PersonId#743], RightOuter
:- *(2) Sort [GroupId#73 ASC NULLS FIRST, PersonId#71 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(GroupId#73, PersonId#71, 200)
: +- *(1) Project [PersonId#71, PersonCharacteristic#72, GroupId#73]
: +- *(1) Filter isnotnull(PersonId#71)
: +- *(1) FileScan parquet default.dfWorld[PersonId#71,PersonCharacteristic#72,GroupId#73] Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[file:/F:/Output/dfWorld..., PartitionCount: 52, PartitionFilters: [isnotnull(GroupId#73)], PushedFilters: [IsNotNull(PersonId)], ReadSchema: struct<PersonId:string,PersonCharacteristic:string>, SelectedBucketsCount: 4 out of 4
+- *(4) Sort [GroupId#745 ASC NULLS FIRST, PersonId#743 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(GroupId#745, PersonId#743, 200)
+- *(3) FileScan parquet default.dfPerson[PersonId#743,GroupId#745] Batched: true, Format: Parquet, Location: CatalogFileIndex[file:/F:/Output/dfPerson], PartitionCount: 45, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<PersonId:string,GroupId:string>, SelectedBucketsCount: 4 out of 4
Also this didn't perform well.
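One detail that stands out in the plan above: the shuffle is on (GroupId, PersonId), while the tables are only bucketed on PersonId. A hedged sketch of bucketing on the full join key, in case that lets the sort-merge join reuse the bucketing (not verified against this dataset):
// Sketch (assumption): bucket both tables on both join keys with the same bucket count,
// so the sort-merge join may be able to skip the Exchange on the table sides.
dfWorldInput
  .write
  .format("parquet")
  .bucketBy(4, "GroupId", "PersonId")
  .sortBy("GroupId", "PersonId")
  .mode("overwrite")
  .option("path", "output/dfWorldBucketed")
  .saveAsTable("dfWorldBucketed")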
To conclude
All approaches take approximately 150-200 hours (extrapolated from the progress of stages and tasks in the Spark jobs after 24 hours) and follow this strategy:
DAG visualization
I guess there is something I'm missing with either the partitioning, bucketing, sorting parquet, or all of them.
Any help would be greatly appreciated.
What is the goal you're trying to achieve?
Why do you need to have it joined?
Joining for the sake of joining will take you nowhere, unless you have enough memory/disk space to hold the result of joining 1 TB against 100 MB of data
Edited based on response
If you only need records related to persons that are presented in dfPerson then you don't need right/left join, inner join would be what you want.
Broadcast will only work if your DataFrame is smaller than the broadcast threshold configured in your Spark session (10 MB by default); it is ignored otherwise.
dfPerson.as("p").join(
dfWorld.select(
$"GroupId", $"PersonId",
$"<feature1YouNeed>", $"<feature2YouNeed>"
).as("w"),
Seq("GroupId", "PersonId")
)
This should give you the features you're after.
NB: Replace <feature1YouNeed> and <feature2YouNeed> with actual column names.
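If dfPerson (roughly 100 MB) should still be broadcast despite that default, the threshold can be raised before the join; a minimal sketch, assuming the driver and executors have enough headroom for the broadcast table:
// Sketch: raise the auto-broadcast threshold above dfPerson's size (~100 MB here).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 209715200L)  // 200 MB

dfPerson.as("p")
  .join(dfWorld.select($"GroupId", $"PersonId" /*, feature columns */).as("w"),
        Seq("GroupId", "PersonId"))
  .explain()   // look for BroadcastHashJoin / BroadcastExchange in the output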
I start spark-shell with spark 2.3.1 with these params:
--master='local[*]'
--executor-memory=6400M
--driver-memory=60G
--conf spark.sql.autoBroadcastJoinThreshold=209715200
--conf spark.sql.shuffle.partitions=1000
--conf spark.local.dir=/data/spark-temp
--conf spark.driver.extraJavaOptions='-Dderby.system.home=/data/spark-catalog/'
Then I create two Hive tables, bucketed and sorted:
First table name - table1
Second table name - table2
val storagePath = "path_to_orc"
val storage = spark.read.orc(storagePath)
val tableName = "table1"
sql(s"DROP TABLE IF EXISTS $tableName")
storage.select($"group", $"id").write.bucketBy(bucketsCount, "id").sortBy("id").saveAsTable(tableName)
(the same code for table2)
I expected that when I join either of these tables with another DataFrame, there would be no unnecessary Exchange step in the query plan.
Then I turn off broadcast to force a SortMergeJoin:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1)
I take some DataFrame:
val sample = spark.read.option("header", "true").option("delimiter", "\t").csv("path_to_tsv")
val m = spark.table("table1")
sample.select($"col" as "id").join(m, Seq("id")).explain()
== Physical Plan ==
*(4) Project [id#24, group#0]
+- *(4) SortMergeJoin [id#24], [id#1], Inner
:- *(2) Sort [id#24 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#24, 1000)
: +- *(1) Project [col#21 AS id#24]
: +- *(1) Filter isnotnull(col#21)
: +- *(1) FileScan csv [col#21] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/samples/sample-20K], PartitionFilters: [], PushedFilters: [IsNotNull(col)], ReadSchema: struct<col:string>
+- *(3) Project [group#0, id#1]
+- *(3) Filter isnotnull(id#1)
+- *(3) FileScan parquet default.table1[group#0,id#1] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/data/table1], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<group:string,id:string>
But when I use a union of the two tables before the join:
val m2 = spark.table("table2")
val mUnion = m union m2
sample.select($"col" as "id").join(mUnion, Seq("id")).explain()
== Physical Plan ==
*(6) Project [id#33, group#0]
+- *(6) SortMergeJoin [id#33], [id#1], Inner
:- *(2) Sort [id#33 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#33, 1000)
: +- *(1) Project [col#21 AS id#33]
: +- *(1) Filter isnotnull(col#21)
: +- *(1) FileScan csv [col#21] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/samples/sample-20K], PartitionFilters: [], PushedFilters: [IsNotNull(col)], ReadSchema: struct<col:string>
+- *(5) Sort [id#1 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#1, 1000)
+- Union
:- *(3) Project [group#0, id#1]
: +- *(3) Filter isnotnull(id#1)
: +- *(3) FileScan parquet default.membership_g043_append[group#0,id#1] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/data/table1], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<group:string,id:string>
+- *(4) Project [group#4, id#5]
+- *(4) Filter isnotnull(id#5)
+- *(4) FileScan parquet default.membership_g042[group#4,id#5] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/data/table2], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<group:string,id:string>
In this case a Sort and an Exchange appear (stage 5).
How can I union two Hive tables without sorting and exchanging?
As far as I know, Spark does not consider sorting when joining, only partitioning. So in order to get efficient joins, you must partition by the same column. This is because sorting does not guarantee that records with the same key end up in the same partition. Spark has to make sure that all keys with the same values, coming from multiple dataframes, are shuffled to the same partition on the same executor.
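As a possible workaround (a sketch only, not verified against these tables): join each bucketed table separately and union the join results, so that each join can still take advantage of the table's bucketing, as in your first plan; the union-before-join variant discards that layout.
// Sketch: join first, union afterwards, so each bucketed table keeps its layout for its join.
val keyed   = sample.select($"col" as "id")
val joined1 = keyed.join(spark.table("table1"), Seq("id"))
val joined2 = keyed.join(spark.table("table2"), Seq("id"))
val result  = joined1 union joined2

result.explain()   // check whether the table sides now avoid the extra Exchange/Sort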
Method 1:
Querying a parquet file directly as:
val sqlDF = spark.sql("SELECT columns FROM parquet.`sample.parquet`")
and
Method 2:
Querying the DataFrame after reading a parquet file as:
val df = spark.read.parquet(path_to_parquet_file)
df.select(columns)
and
Method 3:
Querying a Temporary View as:
df.createOrReplaceTempView("sample")
val sqlDF = spark.sql("SELECT columns FROM sample")
Behind the scenes, are all 3 essentially executed the same way?
In Method 1, does the parquet get converted into a DataFrame / Dataset before query execution?
Which of the 3 methods is efficient and why? (if they are different)
Is there a specific use case for these methods? (if they are different)
Thank you!
Short Answer
Yes. The 3 ways you have illustrated of querying a Parquet file using Spark are executed in the same way.
Long Answer
The reason why this is so is a combination of two features of Spark: lazy evaluation & query optimization.
As a developer, you could split the Spark operations into multiple steps (as you have done in method 2). Internally, Spark (lazily) evaluates the operations in conjunction and applies optimizations on it. In this case, Spark could optimize the operations by column pruning (basically, it will not read the entire parquet data into memory; only the specific columns you have requested.)
The 3rd method of creating a temporary view is just about naming the data you have read, so that you can reference in further operations. It does not change how it is computed in the first place.
For more information on the optimizations performed by Spark when reading Parquet, refer to this in-depth article.
NOTE:
As I have mentioned in a comment on the question, you have selected specific columns in Method 2, while the other two read the entire data. Since these are essentially different operations, there will be a difference in execution. The above answer assumes similar operations are performed in each of the three methods (either reading the complete data or some specific columns from the file).
If you are trying to evaluate which of the 3 is best for the same objective, there is no difference between them: the physical plans below answer your 'behind the scenes' question.
Method 1:
sqlDF = spark.sql("SELECT CallNumber,CallFinalDisposition FROM parquet.`/tmp/ParquetA`").show()
== Physical Plan ==
CollectLimit 21
+- *(1) Project [cast(CallNumber#2988 as string) AS CallNumber#3026, CallFinalDisposition#2992]
+- *(1) FileScan parquet [CallNumber#2988,CallFinalDisposition#2992] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/ParquetA], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<CallNumber:int,CallFinalDisposition:string>
Method 2:
df = spark.read.parquet('/tmp/ParquetA')
df.select("CallNumber","CallFinalDisposition").show()
== Physical Plan ==
CollectLimit 21
+- *(1) Project [cast(CallNumber#3100 as string) AS CallNumber#3172, CallFinalDisposition#3104]
+- *(1) FileScan parquet [CallNumber#3100,CallFinalDisposition#3104] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/ParquetA], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<CallNumber:int,CallFinalDisposition:string>
Method 3:
tempDF = spark.read.parquet('/tmp/ParquetA/')
tempDF.createOrReplaceTempView("temptable");
tiny = spark.sql("SELECT CallNumber,CallFinalDisposition FROM temptable").show()
== Physical Plan ==
CollectLimit 21
+- *(1) Project [cast(CallNumber#2910 as string) AS CallNumber#2982, CallFinalDisposition#2914]
+- *(1) FileScan parquet [CallNumber#2910,CallFinalDisposition#2914] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/ParquetA], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<CallNumber:int,CallFinalDisposition:string>
I'm trying to join two Dataframes, one is around 10 million records and the other is about 1/3 of that. Since the small DataFrame fits comfortably in the executor's memory, I perform a broadcast join and then write out the result:
val df = spark.read.parquet("/plablo/data/tweets10M")
.select("id", "content", "lat", "lon", "date")
val fullResult = FilterAndClean.performFilter(df, spark)
.select("id", "final_tokens")
.filter(size($"final_tokens") > 1)
val fullDFWithClean = {
df.join(broadcast(fullResult), "id")
}
fullDFWithClean
.write
.partitionBy("date")
.mode(saveMode = SaveMode.Overwrite)
.parquet("/plablo/data/cleanTokensSpanish")
After a while, I get this error:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResultInForkJoinSafely(ThreadUtils.scala:215)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:125)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:231)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:124)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:123)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:36)
at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:68)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:88)
at org.apache.spark.sql.execution.FilterExec.doConsume(basicPhysicalOperators.scala:209)
at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:153)
at org.apache.spark.sql.execution.FileSourceScanExec.consume(DataSourceScanExec.scala:141)
at org.apache.spark.sql.execution.FileSourceScanExec.doProduceVectorized(DataSourceScanExec.scala:392)
at org.apache.spark.sql.execution.FileSourceScanExec.doProduce(DataSourceScanExec.scala:315)
.....
There's this question that addresses the same issue. In the comments, it's mentioned that increasing spark.sql.broadcastTimeout could fix the problem, but after setting a large value (5000 seconds) I still get the same error (although much later, of course).
The original data is partitioned by the date column. The function that returns fullResult performs a series of narrow transformations and filters the data, so I'm assuming the partitioning is preserved.
The physical plan confirms that Spark will perform a BroadcastHashJoin:
*Project [id#11, content#8, lat#5, lon#6, date#150, final_tokens#339]
+- *BroadcastHashJoin [id#11], [id#363], Inner, BuildRight
   :- *Project [id#11, content#8, lat#5, lon#6, date#150]
   :  +- *Filter isnotnull(id#11)
   :     +- *FileScan parquet [lat#5,lon#6,content#8,id#11,date#150] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://geoint1.lan:8020/plablo/data/tweets10M], PartitionCount: 182, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<lat:double,lon:double,content:string,id:int>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)))
      +- *Project [id#363, UDF(UDF(UDF(content#360))) AS final_tokens#339]
         +- *Filter (((UDF(UDF(content#360)) = es) && (size(UDF(UDF(UDF(content#360)))) > 1)) && isnotnull(id#363))
            +- *FileScan parquet [content#360,id#363,date#502] Batched: true, Format: Parquet, Location: InMemoryFileIndex[hdfs://geoint1.lan:8020/plablo/data/tweets10M], PartitionCount: 182, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<content:string,id:int>
I believe that, given the size of my data, this operation should be relatively fast (on 4 executors with 5 cores each and 4g RAM running on YARN in cluster mode).
Any help is appreciated
In situations like this, the first question is how big is the dataframe you are trying to broadcast? It's worth estimating its size (see this SO answer and this also).
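For a quick, mostly version-agnostic way to gauge the size of the broadcast candidate (a sketch; the stats call assumes Spark 2.3+, where LogicalPlan.stats takes no arguments):
// Cache and materialize the candidate, then read its in-memory size off the
// "Storage" tab of the Spark UI:
val candidate = fullResult.cache()
candidate.count()

// Or read the optimizer's own size estimate (Spark 2.3+; older releases require
// passing the SQL conf to stats()):
println(candidate.queryExecution.optimizedPlan.stats.sizeInBytes)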
Note that Spark's default spark.sql.autoBroadcastJoinThreshold is only 10 MB, so you are really not supposed to broadcast very large datasets.
Your use of broadcast takes precedence and may be forcing Spark to do something it otherwise would choose not to do. A good rule is to only force aggressive optimization if the default behavior is unacceptable because aggressive optimization often creates various edge conditions, like the one you are experiencing.
This can also fail if spark.task.maxDirectResultSize is not increased. Its default is 1 megabyte (1m). Try spark.task.maxDirectResultSize=10g.
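spark.task.maxDirectResultSize is a core Spark setting rather than a SQL conf, so (as far as I know) it has to be supplied when the application is launched; a sketch:
// Sketch (assumption: both settings supplied at launch time), e.g. via spark-submit:
//   spark-submit --conf spark.sql.broadcastTimeout=5000 \
//                --conf spark.task.maxDirectResultSize=10g ...
// or programmatically before the session/context is created:
val spark = org.apache.spark.sql.SparkSession.builder()
  .config("spark.sql.broadcastTimeout", "5000")
  .config("spark.task.maxDirectResultSize", "10g")
  .getOrCreate()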