Spark SQL join really lazy? - scala

I'm performing an inner join between, say, 8 dataframes, all coming from the same parent. Sample code:
// read parquet
val readDF = session.read.parquet(...)
// multiple expensive transformations are performed over readDF, making its DAG grow
// repartition + cache
val df = readDF.repartition($"type").cache
val df1 = df.filter($"type" === 1)
val df2 = df.filter($"type" === 2)
val df3 = df.filter($"type" === 3)
val df4 = df.filter($"type" === 4)
val df5 = df.filter($"type" === 5)
val df6 = df.filter($"type" === 6)
val df7 = df.filter($"type" === 7)
val df8 = df.filter($"type" === 8)
val joinColumns = Seq("col1", "col2", "col3", "col4")
val joinDF = df1
.join(df2, joinColumns)
.join(df3, joinColumns)
.join(df4, joinColumns)
.join(df5, joinColumns)
.join(df6, joinColumns)
.join(df7, joinColumns)
.join(df8, joinColumns)
Unexpectedly, the joinDF sentence is taking a long time. Join is supposed to be a transformation, not an action.
Do you know what's happening? Is this a use case for checkpointing?
Notes:
- joinDF.explain shows a long DAG lineage.
- using Spark 2.3.0 with Scala

RDD JOIN, SPARK SQL JOIN are known as a Transformation. I ran this with no issue on DataBricks Notebook, but I am not privy to " ...// multiple expensive transformations are performed over readDF, making its DAG grow .... May be there is an Action there.

Indeed, checkpointing seems to fix the long running join. It now behaves as a transformation, returning faster. So, I conclude that the delay was related to the large DAG lineage.
Also, the subsequent actions are now faster.

Related

how to increase performance on Spark distinct() on multiple columns

Could you please suggest alternative way of implementing distinct in spark data frame.
I tried both SQL and spark distinct but since the dataset size (>2 Billion) it fails on the shuffle .
If I increase the node and memory to >250GB, process run for a longe time (more than 7 hours).
val df = spark.read.parquet(out)
val df1 = df.
select($"ID", $"col2", $"suffix",
$"date", $"year", $"codes").distinct()
val df2 = df1.withColumn("codes", expr("transform(codes, (c,s) -> (d,s) )"))
df2.createOrReplaceTempView("df2")
val df3 = spark.sql(
"""SELECT
ID, col2, suffix
d.s as seq,
d.c as code,
year,date
FROM
df2
LATERAL VIEW explode(codes) exploded_table as d
""")
df3.
repartition(
600,
List(col("year"), col("date")): _*).
write.
mode("overwrite").
partitionBy("year", "date").
save(OutDir)

Better way to join in spark scala [duplicate]

I would like to join two spark-scala dataframes on multiple columns dynamically. I would to avoid hard coding column name comparison as shown in the following statments;
val joinRes = df1.join(df2, df1("col1") == df2("col1") and df1("col2") == df2("col2"))
The solution for this query already exists in pyspark version --provided in the following link
PySpark DataFrame - Join on multiple columns dynamically
I would like to code the same code using spark-scala
In scala you do it in similar way like in python but you need to use map and reduce functions:
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val df1 = List("a,b", "b,c", "c,d").toDF("col1","col2")
val df2 = List("1,2", "2,c", "3,4").toDF("col1","col2")
val columnsdf1 = df1.columns
val columnsdf2 = df2.columns
val joinExprs = columnsdf1
.zip(columnsdf2)
.map{case (c1, c2) => df1(c1) === df2(c2)}
.reduce(_ && _)
val dfJoinRes = df1.join(df2,joinExprs)

Databrick Azure broadcast variables not serializable

So I am trying to create a extremely simple spark notebook using Azure Databricks and would like to make use of a simple RDD map call.
This is just for messing around, so the example is a bit contrived, but I can not get a value to work in the RDD map call unless it is a static constant value
I have tried using a broadcast variable
Here is a simple example using an int which I broadcast and then try and use in the RDD map
val sparkContext = spark.sparkContext
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val multiplier = 3
val multiplierBroadcast = sparkContext.broadcast(multiplier)
val data = Array(1, 2, 3, 4, 5)
val dataRdd = sparkContext.parallelize(data)
val mappedRdd = dataRdd.map(x => multiplierBroadcast.value)
val df = mappedRdd.toDF
df.show()
Here is another example where I use simple serializable singleton object with an int field which I broadcast and then try and use in the RDD map
val sparkContext = spark.sparkContext
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val multiplier = 3
object Foo extends Serializable { val theMultiplier: Int = multiplier}
val fooBroadcast = sparkContext.broadcast(Foo)
val data = Array(1, 2, 3, 4, 5)
val dataRdd = sparkContext.parallelize(data)
val mappedRdd = dataRdd.map(x => fooBroadcast.value.theMultiplier)
val df = mappedRdd.toDF
df.show()
And finally a List[int] with a single element which I broadcast and then try and use in the RDD map
val sparkContext = spark.sparkContext
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val multiplier = 3
val listBroadcast = sparkContext.broadcast(List(multiplier))
val data = Array(1, 2, 3, 4, 5)
val dataRdd = sparkContext.parallelize(data)
val mappedRdd = dataRdd.map(x => listBroadcast.value.head)
val df = mappedRdd.toDF
df.show()
However ALL the examples above fail with this error. Which as you can see is pointing towards an issue with the RDD map value not being serializable. I can not see the issue, and int value should be serializable using all the above examples I think
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:345)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:335)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2375)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:379)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:378)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:371)
at org.apache.spark.rdd.RDD.map(RDD.scala:378)
If I however make the value in the RDD map a regular int value like this
val sparkContext = spark.sparkContext
val sqlContext = spark.sqlContext
import sqlContext.implicits._
val data = Array(1, 2, 3, 4, 5)
val dataRdd = sparkContext.parallelize(data)
val mappedRdd = dataRdd.map(x => 6)
val df = mappedRdd.toDF
df.show()
Everything is working fine and I see my simple DataFrame shown as expected
Any ideas anyone?
From your code, I would assume that you are on Spark 2+. Perhaps, there is no need to drop down to the RDD level and, instead, work with DataFrames.
The code below shows how to join two DataFrames and explicitly broadcast the first one.
import sparkSession.implicits._
import org.apache.spark.sql.functions._
val data = Seq(1, 2, 3, 4, 5)
val dataDF = data.toDF("id")
val largeDataDF = Seq((0, "Apple"), (1, "Pear"), (2, "Banana")).toDF("id", "value")
val df = largeDataDF.join(broadcast(dataDF), Seq("id"))
df.show()
Typically, small DataFrames are perfect candidates for broadcasting as an optimization whereby they are sent to all executors. spark.sql.autoBroadcastJoinThreshold is a configuration which limits the size of DataFrames eligible for broadcast. Additional details can be found on the Spark official documentation
Note also that with DataFrames, you have access to a handy explain method. With it, you can see the physical plan and it can be useful for debugging.
Running explain() on our example would confirm that Spark is doing a BroadcastHashJoin optimization.
df.explain()
== Physical Plan ==
*Project [id#11, value#12]
+- *BroadcastHashJoin [id#11], [id#3], Inner, BuildRight
:- LocalTableScan [id#11, value#12]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
+- LocalTableScan [id#3]
If you need additional help with DataFrames, I provide an extensive list of examples at http://allaboutscala.com/big-data/spark/
So the answer was that you should not capture the Spark content in a val and then use that for the broadcast. So this is working code
import sqlContext.implicits._
val multiplier = 3
val multiplierBroadcast = spark.sparkContext.broadcast(multiplier)
val data = Array(1, 2, 3, 4, 5)
val dataRdd = sparkContext.parallelize(data)
val mappedRdd = dataRdd.map(x => multiplierBroadcast.value)
val df = mappedRdd.toDF
df.show()
Thanks to #nadim Bahadoor for this answer

DataFrame persist does not improve performance in Spark

I am writing a Scala script that reads from a table, transforms data and shows result using Spark. I am using Spark 2.1.1.2 and Scala 2.11.8. There is a dataframe instance I use twice in the script (df2 in the code below.). Since dataframes are calculated when an action is called on them, not when they are declared, I predict that this dataframe to be calculated twice. I thought that persisting this dataframe would improve performance thinking that, it would be calculated once (when persisted), instead of twice, if persisted.
However, script run lasts ~10 seconds longer when I persist compared to when I don't persist. I cannot figure out what the reason for this is. If someone has an idea, it would be much appreciated.
My submission command line is below:
spark-submit --class TestQuery --master yarn --driver-memory 10G --executor-memory 10G --executor-cores 2 --num-executors 4 /home/bcp_data/test/target/TestQuery-1.0-SNAPSHOT.jar
Scala script is below:
val spark = SparkSession
.builder()
.appName("TestQuery")
.config("spark.sql.warehouse.dir", "file:/tmp/hsperfdata_hdfs/spark-warehouse/")
.enableHiveSupport()
.getOrCreate()
val m = spark.sql("select id, startdate, enddate, status from members")
val l = spark.sql("select mid, no, status, potential from log")
val r = spark.sql("select mid, code from records")
val df1 = m.filter(($"status".isin(1,2).and($"startdate" <= one_year_ago)).and((($"enddate" >= one_year_ago)))
val df2 = df1.select($"id", $"code").join(l, "mid").filter(($"status".equalTo(1)).and($"potential".notEqual(9))).select($"no", $"id", $"code")
df2.persist
val df3 = df2.join(r, df2("id").equalTo(r("mid"))).filter($"code".isin("0001","0010","0015","0003","0012","0014","0032","0033")).groupBy($"code").agg(countDistinct($"no"))
val fa = spark.sql("select mid, acode from actions")
val fc = spark.sql("select dcode, fcode from params.codes")
val df5 = fa.join(fc, fa("acode").startsWith(fc("dcode")), "left_outer").select($"mid", $"fcode")
val df6 = df2.join(df5, df2("id").equalTo(df5("mid"))).groupBy($"code", $"fcode")
println("count1: " + df3.count + " count2: " + df6.count)
using caching is the right choice here, but your statement
df2.persist
has no effect because you do not utilize the returned dataframe. Just do
val df2 = df1.select($"id", $"code")
.join(l, "mid")
.filter(($"status".equalTo(1)).and($"potential".notEqual(9)))
.select($"no", $"id", $"code")
.persist

dynamically join two spark-scala dataframes on multiple columns without hardcoding join conditions

I would like to join two spark-scala dataframes on multiple columns dynamically. I would to avoid hard coding column name comparison as shown in the following statments;
val joinRes = df1.join(df2, df1("col1") == df2("col1") and df1("col2") == df2("col2"))
The solution for this query already exists in pyspark version --provided in the following link
PySpark DataFrame - Join on multiple columns dynamically
I would like to code the same code using spark-scala
In scala you do it in similar way like in python but you need to use map and reduce functions:
val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._
val df1 = List("a,b", "b,c", "c,d").toDF("col1","col2")
val df2 = List("1,2", "2,c", "3,4").toDF("col1","col2")
val columnsdf1 = df1.columns
val columnsdf2 = df2.columns
val joinExprs = columnsdf1
.zip(columnsdf2)
.map{case (c1, c2) => df1(c1) === df2(c2)}
.reduce(_ && _)
val dfJoinRes = df1.join(df2,joinExprs)