Many to many join on large datasets in Spark - scala

I have two large datasets, A and B, which I wish to join on key K.
Each dataset contains many rows with the same value of K, so this is a many-to-many join.
This join fails with memory-related errors if I just try it naively.
Let's also say that grouping both datasets by K, doing the join, and then exploding back out with some trickery to get the correct result isn't a viable option either, again due to memory issues.
Are there any clever tricks people have found that improve the chance of this working?
Update:
Adding a very, very contrived concrete example:
spark-shell --master local[4] --driver-memory 5G --conf spark.sql.autoBroadcastJoinThreshold=-1 --conf spark.sql.shuffle.partitions=10000 --conf spark.default.parallelism=10000
val numbersA = (1 to 100000).toList.toDS
val numbersWithDataA = numbersA.repartition(10000).map(n => (n, 1, Array.fill[Byte](1000*1000)(0)))
numbersWithDataA.write.mode("overwrite").parquet("numbersWithDataA.parquet")
val numbersB = (1 to 100).toList.toDS
val numbersWithDataB = numbersB.repartition(100).map(n => (n, 1, Array.fill[Byte](1000*1000)(0)))
numbersWithDataB.write.mode("overwrite").parquet("numbersWithDataB.parquet")
val numbersWithDataInA = spark.read.parquet("numbersWithDataA.parquet").toDF("numberA", "one", "dataA")
val numbersWithDataInB = spark.read.parquet("numbersWithDataB.parquet").toDF("numberB", "one", "dataB")
numbersWithDataInA.join(numbersWithDataInB, Seq("one")).write.mode("overwrite").parquet("joined.parquet")
Fails with Caused by: java.lang.OutOfMemoryError: Java heap space

--conf spark.sql.autoBroadcastJoinThreshold=-1
means you are disabling the broadcast join feature.
You can change it to any suitable value below 2 GB (there is a 2 GB limit); spark.sql.autoBroadcastJoinThreshold defaults to 10 MB per the Spark documentation. I don't know why you have disabled it. If you disable it, Spark's strategy selection (SparkStrategies) will switch to a sort-merge join or a shuffle hash join. See my article for details.
Otherwise, I don't think anything needs to change, as this is the common pattern for joining 2 datasets.
Further reading: DataFrame join optimization - Broadcast Hash Join
UPDATE: Alternatively, in your real example (not contrived :-)) you can follow these steps:
1) In each dataset, find the join key (for example, pick a unique/distinct category, country, or state field) and collect the distinct values as an array; since it is small data, you can collect it.
2) For each category element in the array, join the 2 datasets (now playing with small dataset joins) with the category as a where condition, adding each result to a sequence of dataframes.
3) Reduce and union these dataframes.
Scala example:
val dfCategories = Seq(df1Category1, df2Category2, df3Category3)
dfCategories.reduce(_ union _)
Note: for each join I would still prefer BHJ (broadcast hash join), since it involves less or no shuffle.
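The three steps above can be sketched end to end as follows. The column names (category, k) and the toy data are illustrative assumptions, not from the original post; in reality dsA and dsB would be the two large datasets:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the per-category join-and-union approach, assuming a small set of
// distinct "category" values shared by both sides (hypothetical column names).
val spark = SparkSession.builder.master("local[2]").appName("perCategoryJoin").getOrCreate()
import spark.implicits._

val dsA = Seq((1, "x", "a1"), (1, "y", "a2"), (2, "x", "a3")).toDF("category", "k", "dataA")
val dsB = Seq((1, "x", "b1"), (2, "x", "b2")).toDF("category", "k", "dataB")

// 1) Collect the distinct category values -- only safe because there are few of them.
val categories = dsA.select("category").distinct.collect().map(_.getInt(0))

// 2) Join per category, so each individual join stays small.
val pieces = categories.map { c =>
  dsA.filter($"category" === c).join(dsB.filter($"category" === c), Seq("category", "k"))
}

// 3) Reduce and union the per-category results back together.
val joined = pieces.reduce(_ union _)
val n = joined.count()
```

Each per-category join only touches the rows for that category, so none of the individual joins has to fit the full many-to-many blow-up in memory at once.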

Related

How to do Spark left outer join efficiently with skewed data (spark 2.3)

I have two datasets, DS_A (a 300 GB CSV) and DS_B (a 50 GB CSV).
Currently we are doing a left outer join on the dataframes:
val joinedDS = DS_A.joinWith(DS_B, DS_A("value") === DS_B("value"), "left_outer")
The data is skewed in DS_B and the join takes forever to run.
I tried the salting technique but I am facing too many OOM exceptions.
Is there any better solution? I can't move to Spark 3.
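For reference, the salting technique mentioned above usually looks like the following sketch. saltBuckets, the column names, and the toy data are assumptions for illustration, not from the original post:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hedged sketch of key salting for a skewed left outer join: spread each hot
// key over saltBuckets sub-keys so no single task receives the whole hot key.
val spark = SparkSession.builder.master("local[2]").appName("saltedJoin").getOrCreate()
import spark.implicits._

val saltBuckets = 4

// Skewed left side: many rows share the value "hot".
val dsA = Seq("hot", "hot", "hot", "cold").toDF("value")
val dsB = Seq(("hot", 1), ("cold", 2)).toDF("value", "payload")

// Add a random salt 0..saltBuckets-1 to the large/skewed side...
val saltedA = dsA.withColumn("salt", (rand() * saltBuckets).cast("int"))

// ...and replicate the other side once per salt value so every salted key matches.
val saltedB = dsB.crossJoin(
  spark.range(saltBuckets).select($"id".cast("int").as("salt")))

val joined = saltedA.join(saltedB, Seq("value", "salt"), "left_outer").drop("salt")
val n = joined.count()
```

Each left row carries exactly one salt value and the replicated right side covers every salt, so the salted join returns the same rows as the unsalted one, just spread over more, smaller tasks.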

Spark Dataframe join heap space issue and too many partitions

I am trying to join dataframes; in fact, I am filtering in advance for simple testing. Each dataframe after the filter has only 4 rows. There are 190 columns per dataframe.
I run this code locally and it runs super fast (albeit with only 22 columns). Also, the partition count when I check locally is just 1. The join is pretty simple, on 2 key columns, and I made sure that there is no Cartesian product.
When I run this in my Dev/UAT cluster, it takes forever and fails in between. Also, I see that around 40,000 partitions are created per join. I am printing it using resultDf.rdd.partitions.size.
I've divided the join like this, and it didn't help:
val joinCols = Seq("subjectid", "componenttype")
val df1 = mainDf1.filter("metricname = 'NPV'").withColumnRenamed("my_attr", "df1attr").withColumnRenamed("value", "df1val")
val df2 = mainDf2.filter("metricname = 'NPV'").withColumnRenamed("my_attr", "df2attr").withColumnRenamed("value", "df2val")
val df3 = mainDf3.filter("metricname = 'NPV'").withColumnRenamed("my_attr", "df3attr").withColumnRenamed("value", "df3val")
val df4 = mainDf4.filter("metricname = 'NPV'").withColumnRenamed("my_attr", "df4attr").withColumnRenamed("value", "df4val")
var resultDf = df1.as("dft").join(df2, joinCols, "inner").select("dft.*", "df2attr", "df2val")
// Check the partition count here and show the dataframe to confirm we get only the 4 expected rows.
// I do get 4 rows, but 40,000 partitions, and it already takes a lot of time at this point.
resultDf = resultDf.as("dfi").join(df3, joinCols, "inner").select("dfi.*", "df3attr", "df3val")
// Mostly by here my program exits, either with a heap space error or with exitCode=56.
resultDf = resultDf.as("dfa").join(df4, joinCols, "inner").select("dfa.*", "df4attr", "df4val")
The naming convention used here is all dummy, just to post the code, so please don't mind it.
Any inputs/help to put me in the right direction?

Spark DataFrame row count is inconsistent between runs

When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows in a dataframe. I first read data from S3 into 4 different dataframes; these counts are always consistent. Then, after joining the dataframes, the result of the join has a different count. Afterwards I also filter the result, and that also has a different count on each run. The variations are small, a 1-5 row difference, but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
And this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using the filter method instead of where, but with the same results.
Any thoughts?
Thanks
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will re-evaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join.
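A minimal sketch of that suggestion, with hypothetical toy dataframes standing in for the real S3 sources:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: cache() pins the joined result, so every later action (count,
// filter, write) reuses the same materialized data instead of recomputing
// the join from sources that may have changed between actions.
val spark = SparkSession.builder.master("local[2]").appName("cacheJoin").getOrCreate()
import spark.implicits._

val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
val right = Seq((1, "x")).toDF("id", "r")

val joined = left.join(right, Seq("id"), "outer").cache()
joined.count()                 // first action materializes the cache
val filtered = joined.where($"r".isNotNull)
val n = filtered.count()       // served from the cached join, not the sources
```

With the cache in place, the join result is frozen at the moment of the first action, so the counts of the join and of any downstream filter stay consistent within a run.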

Spark: How to combine 2 sorted RDDs so that order is preserved after union?

I have 2 sorted RDDs:
val rdd_a = some_pair_rdd.sortByKey()
  .zipWithIndex
  .filter(f => f._2 < n)
  .map(f => f._1)
val rdd_b = another_pair_rdd.sortByKey()
  .zipWithIndex
  .filter(f => f._2 < n)
  .map(f => f._1)
val all_rdd = rdd_a.union(rdd_b)
In all_rdd, I see that the order is not necessarily maintained as I'd imagined (that all elements of rdd_a come first, followed by all elements of rdd_b). Is my assumption incorrect (about the contract of union), and if so, what should I use to append multiple sorted RDDs into a single rdd?
I'm fairly new to Spark so I could be wrong, but from what I understand, union is a narrow transformation. That is, each executor combines only its local blocks of RDD a with its local blocks of RDD b and then returns that to the driver.
As an example, let's say that you have 2 executors and 2 RDDS.
RDD_A = ["a","b","c","d","e","f"]
and
RDD_B = ["1","2","3","4","5","6"]
Let Executor 1 contain the first half of both RDD's and Executor 2 contain the second half of both RDD's. When they perform the union on their local blocks, it would look something like:
Union_executor1 = ["a","b","c","1","2","3"]
and
Union_executor2 = ["d","e","f","4","5","6"]
So when the executors pass their parts back to the driver you would have ["a","b","c","1","2","3","d","e","f","4","5","6"]
Again, I'm new to Spark and I could be wrong. I'm just sharing based on my understanding of how it works with RDD's. Hopefully we can both learn something from this.
You can't. Spark does not do a merge sort on union, because you can't make assumptions about the way the RDDs are actually stored on the nodes. If you want things in sorted order after you take the union, you need to sort again.

What is the performance impact of select statements on Spark DataFrames?

When using many select statements or expressions on Spark DataFrames, I wonder what their performance impact is on subsequent transformations once triggered by an action.
Given a dataframe df with 10 columns a to j:
What is the impact if I use as to rename each column?
df.select( df("a").as("1"), ..., df("j").as("10"))
What if I select a subset (e.g. 5 columns)?
val df2 = df.select( df("a"), ..., df("e") )
How does Spark handle this projection? Is df still kept (since df2 is a projection), so that df could serve as a kind of reference? Or is df2 instead created freshly and df discarded? (Neglecting any persist here.)
What is the impact of general Column expressions used in a select?
Are performance tests for the above cases available? And are performance measurements in general available somewhere? If not, what is the best way to measure performance?
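One practical starting point (my suggestion, not from the question) is to inspect the query plans: Catalyst typically collapses chains of selects and renames into a single projection before execution, and explain() lets you verify that for a concrete query:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: use explain() / queryExecution to see what a chain of selects
// compiles down to, rather than guessing at its cost.
val spark = SparkSession.builder.master("local[2]").appName("selectPlans").getOrCreate()
import spark.implicits._

val df = Seq((1, 2, 3)).toDF("a", "b", "c")

// Two chained selects: a rename followed by a narrowing projection.
val df2 = df.select($"a".as("x"), $"b".as("y")).select($"x", $"y")

df2.explain(true)             // prints parsed, analyzed, optimized, and physical plans
val cols = df2.columns.toSeq  // the resulting schema after both projections
```

For timing, the usual approach is to benchmark an action (e.g. count or a write) over the variants you care about, since transformations alone are lazy and cost nothing until triggered.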