Broadcast not happening while joining DataFrames in Spark 1.6 - Scala

Below is the sample code that I am running. When this Spark job runs, the DataFrame join is performed with a SortMergeJoin instead of a BroadcastHashJoin.
def joinedDf(sqlContext: SQLContext,
             txnTable: DataFrame,
             countriesDfBroadcast: Broadcast[DataFrame]): DataFrame = {
  txnTable.as("df1").join(
    countriesDfBroadcast.value.withColumnRenamed("CNTRY_ID", "DW_CNTRY_ID").as("countries"),
    $"df1.USER_CNTRY_ID" === $"countries.DW_CNTRY_ID", "inner")
}

joinedDf(sqlContext, txnTable, countriesDfBroadcast).write.parquet("temp")
The broadcast join does not happen even when I specify a broadcast() hint in the join statement.
The optimizer hash-partitions the DataFrame, and this causes data skew.
Has anyone seen this behavior?
I am running this on YARN using Spark 1.6 with a HiveContext as the SQLContext. The Spark job runs on 200 executors; txnTable is about 240 GB and countriesDf is about 5 MB.

Both the way you broadcast the DataFrame and the way you access it are incorrect.
Standard broadcast variables cannot be used to handle distributed data structures. If you want to perform a broadcast join on a DataFrame, you should use the broadcast function, which marks the given DataFrame for broadcasting:
import org.apache.spark.sql.functions.broadcast

val countriesDf: DataFrame = ???
val tmp: DataFrame = broadcast(
  countriesDf.withColumnRenamed("CNTRY_ID", "DW_CNTRY_ID").as("countries")
)

txnTable.as("df1").join(
  broadcast(tmp), $"df1.USER_CNTRY_ID" === $"countries.DW_CNTRY_ID", "inner")
Internally it will collect tmp without converting it from the internal representation and broadcast it afterwards.
join arguments are eagerly evaluated. Even if it were possible to use SparkContext.broadcast with a distributed data structure, the broadcast value would be evaluated locally before join is called. That's why your function works at all but doesn't perform a broadcast join.
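As a quick sanity check, here is a minimal sketch (reusing the question's txnTable and countriesDf, and assuming import sqlContext.implicits._ for the $ syntax) of how to confirm from the physical plan that the hint was picked up:

import org.apache.spark.sql.functions.broadcast

// Mark the small DataFrame for broadcasting and join as usual.
val joined = txnTable.as("df1").join(
  broadcast(countriesDf.withColumnRenamed("CNTRY_ID", "DW_CNTRY_ID").as("countries")),
  $"df1.USER_CNTRY_ID" === $"countries.DW_CNTRY_ID", "inner")

// The physical plan should now show BroadcastHashJoin instead of SortMergeJoin.
joined.explain()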

Related

Spark DataFrame row count is inconsistent between runs

When I run my Spark job (version 2.1.1) on EMR, each run counts a different number of rows on a DataFrame. I first read data from S3 into 4 different DataFrames; these counts are always consistent. But after joining the DataFrames, the result of the join has a different count on each run. Afterwards I also filter the result, and that too has a different count on each run. The variations are small, 1-5 rows of difference, but it's still something I would like to understand.
This is the code for the join:
val impJoinKey = Seq("iid", "globalVisitorKey", "date")
val impressionsJoined: DataFrame = impressionDsNoDuplicates
.join(realUrlDSwithDatenoDuplicates, impJoinKey, "outer")
.join(impressionParamterDSwithDateNoDuplicates, impJoinKey, "left")
.join(chartSiteInstance, impJoinKey, "left")
.withColumn("timestamp", coalesce($"timestampImp", $"timestampReal", $"timestampParam"))
.withColumn("url", coalesce($"realUrl", $"url"))
and this is for the filter:
val impressionsJoined: Dataset[ImpressionJoined] = impressionsJoinedFullDay.where($"timestamp".geq(new Timestamp(start.getMillis))).cache()
I have also tried using the filter method instead of where, but with the same results.
Any thoughts?
Thanks,
Nir
Is it possible that one of the data sources changes over time?
Since impressionsJoined is not cached, Spark will re-evaluate it from scratch on every action, and that includes reading the data again from the source.
Try caching impressionsJoined after the join.
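A minimal sketch of that suggestion, using the question's variable names (impressionsJoined here is the join result before the filter, and start is the Joda-Time value used in the question):

import java.sql.Timestamp

// Materialize the join result once so subsequent actions in this job reuse
// the cached data instead of re-reading the sources from S3.
val impressionsJoinedCached = impressionsJoined.cache()
impressionsJoinedCached.count()   // forces evaluation and populates the cache

val filtered = impressionsJoinedCached
  .where($"timestamp".geq(new Timestamp(start.getMillis)))
println(filtered.count())         // later counts/filters now read from the cache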

Spark: Faster way to join two DataFrames?

I have two DataFrames, df1 and ip2Country.
df1 contains IP addresses, and I am trying to map the IP addresses to geolocation information such as longitude and latitude, which are columns in ip2Country.
I am running it as a spark-submit job, but the operations take a very long time even though df1 has fewer than 2,500 rows.
My code:
val agg = df1.join(ip2Country, ip2Country("network_start_int") === df1("sint"), "inner")
  .select($"src_ip",
    $"country_name".alias("scountry"),
    $"iso_3".alias("scode"),
    $"longitude".alias("slong"),
    $"latitude".alias("slat"),
    $"dst_ip", $"dint", $"count")
  .filter($"slong".isNotNull)

val agg1 = agg.join(ip2Country, ip2Country("network_start_int") === agg("dint"), "inner")
  .select($"src_ip", $"scountry",
    $"scode", $"slong",
    $"slat", $"dst_ip",
    $"country_name".alias("dcountry"),
    $"iso_3".alias("dcode"),
    $"longitude".alias("dlong"),
    $"latitude".alias("dlat"), $"count")
  .filter($"dlong".isNotNull)
Is there any other way to join the two tables? Or am I doing it the wrong way?
If you have a big DataFrame that needs to be joined with a small one, broadcast joins are very effective. Read here: Broadcast Joins (aka Map-Side Joins)
bigdf.join(broadcast(smalldf))
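Applied to the question's first join, a sketch might look like the following (same column names as above; ip2Country is assumed to be small enough to broadcast):

import org.apache.spark.sql.functions.broadcast

// ip2Country is the small lookup table: broadcasting it means every executor
// gets a full copy, so df1 does not need to be shuffled for the join.
val agg = df1
  .join(broadcast(ip2Country), ip2Country("network_start_int") === df1("sint"), "inner")
  .select($"src_ip",
    $"country_name".alias("scountry"),
    $"iso_3".alias("scode"),
    $"longitude".alias("slong"),
    $"latitude".alias("slat"),
    $"dst_ip", $"dint", $"count")
  .filter($"slong".isNotNull)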

Spark: is DataFrame caching/persistence transferred from one to another?

Assume I have this code (Spark 1.6.2):
val finalDF: DataFrame = if (test) {
  val df = sqlContext.read.parquet(url).cache
  df.write.parquet(url2)
  df
} else {
  sqlContext.read.parquet(other_url)
}
If I run finalDF.unpersist, will it indeed clean the data of finalDF/df from memory? If not, how can I do it?
Yes (if test is true).
Basically, cache changes the DataFrame's persistence state in place (i.e. in that respect the DataFrame is not immutable), which means that if finalDF is df then you will be unpersisting df. If test is false, then df would not have been created to begin with and the result of sqlContext.read.parquet is not cached anyway, but calling unpersist would not do any harm.
You can check this yourself by looking at the UI (by default on port 4040) and opening the Storage tab: it shows the cached df before unpersist, and after unpersist the entry disappears.
Spark drops old data partitions using a least-recently-used (LRU) algorithm. However, if you need to clean up manually, dataFrame.unpersist() works as you would expect.
Refer to http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence for details.
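A minimal sketch of the manual cleanup, continuing the question's code (the output path is a hypothetical placeholder):

// Finish whatever work needs the cached data, then free the cached blocks.
finalDF.write.parquet("some_output_path")   // hypothetical final action
finalDF.unpersist()                         // no-op if nothing was cached
// The entry disappears from the Storage tab of the UI (port 4040 by default).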

Calling other methods/variables inside a UDF method in Spark SQL DataFrame

I have a Spark SQL DataFrame on which I am trying to call a UDF that I created using Spark SQL's udf function.
val udfName = udf(somemethodName)
val newDF = df.withColumn("columnnew", udfName(col("anotherDFColumn")))
I'm trying to use another DataFrame, stored as a val, inside somemethodName, but that DataFrame comes through as null.
This happens only when I use a where clause on newDF.
Am I missing something? Is it not possible to use another variable or method inside a UDF?
Or do I have to do something with broadcast? Currently I am running this locally, not on a cluster.
Is it not possible to use another variable / method inside UDF method
It is possible if and only if that variable / method can be serialized - a UDF is a closure that must be serialized and distributed to executors.
A DataFrame cannot be serialized (it's a pointer to other distributed data, so there's no logical way to serialize it without collecting it into driver memory), and it therefore appears as null when you try to use it inside the UDF.
You're probably going to need to join the two dataframes on some key, and then use a UDF (or a standard transformation) that takes columns from the joined Dataframe.
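A hedged sketch of that approach (lookupDf, its columns "anotherDFColumn" and "lookupValue", and the concatenation UDF are all hypothetical stand-ins):

import org.apache.spark.sql.functions.{col, udf}

// Instead of capturing another DataFrame inside the UDF, join it in first
// so the UDF only ever sees plain column values.
val lookupDf = sqlContext.table("lookup")   // assumed lookup table

val someUdf = udf((value: String, looked: String) => s"$value-$looked")

val newDF = df
  .join(lookupDf, Seq("anotherDFColumn"), "left")
  .withColumn("columnnew", someUdf(col("anotherDFColumn"), col("lookupValue")))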

Spark for loop with RDD transformation

I am trying to accomplish the following:
For iterator i from 0 to n:
Create a DataFrame using i as one of the filter criteria in the Spark SQL select statement
Create an RDD from the DataFrame
Perform multiple operations on the RDD
How do I make sure the for loop works? I am trying to run the Scala code on a cluster.
First I would suggest running it locally in some test suite (e.g. with ScalaTest). If you are not into unit/integration testing, you could simply call df.show() on your DataFrames as you iterate through them. This will print a sample from each DataFrame.
(0 until 5).foreach(i => {
  val df = [some data frame you use i in filtering]
  df.show()
  val df_rdd = df.rdd
})
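For a concrete (hypothetical) version of the same loop, filtering with i inside a Spark SQL statement; the table and column names here are placeholders:

val n = 5
(0 until n).foreach { i =>
  // i is interpolated into the filter of the Spark SQL query.
  val df = sqlContext.sql(s"SELECT * FROM events WHERE bucket = $i")
  df.show(5)                       // sanity-check a sample for this iteration
  val rdd = df.rdd                 // drop to the RDD API for further operations
  println(s"bucket $i has ${rdd.count()} rows")
}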