scala spark union of empty dataframe

I'm trying to perform a union of two DataFrames. One or both of them can be empty, which leads to this error:
Union can only be performed on tables with the same number of columns, but the first table has 0 columns and the second table has 16 columns
How can I gracefully perform the empty check on the DataFrames? A plain if chain doesn't look "clean":
if (df1.isEmpty) {
  res = df2
} else if (df2.isEmpty) {
  res = df1
} else if (df1.isEmpty || df2.isEmpty) {
  // end
} else {
  res = df1.union(df2)
}
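A cleaner pattern (a sketch, not from the original thread, assuming Spark 2.4+ for Dataset.isEmpty and that the non-empty DataFrames share the same schema) is to keep only the non-empty inputs and reduce over them with union:

import org.apache.spark.sql.DataFrame

// Hypothetical helper: drop the empty inputs and union whatever is left.
// Returns None when every input is empty.
def unionNonEmpty(dfs: DataFrame*): Option[DataFrame] =
  dfs.filter(!_.isEmpty).reduceOption(_ union _)

// Usage: fall back to df2 (or an explicitly empty DataFrame) when both inputs are empty.
val res = unionNonEmpty(df1, df2).getOrElse(df2)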

Related

Storing Spark DataFrame Value In Scala Variable

I need to check for a duplicate filename in my table, and if the file count is 0 I need to load a file into my table using Spark SQL. I wrote the code below.
val s1=spark.sql("select count(filename) from mytable where filename='myfile.csv'") //giving '2'
s1: org.apache.spark.sql.DataFrame = [count(filename): bigint]
s1.show //giving 2 as output
//s1 gives me the file count from my table; I then need to compare this count value in an if statement.
I'm using the code below.
val s2=s1.count //not working always giving 1
val s2=s1.head.count() // error: value count is not a member of org.apache.spark.sql.Row
val s2=s1.size //value size is not a member of Unit
if(s1>0){ //code } //value > is not a member of org.apache.spark.sql.DataFrame
Can someone please give me a hint on how I should do this? How can I get the DataFrame value and use it as a variable to check the condition?
i.e.
if(value of s1(i.e.2)>0){
//my code
}
You need to extract the value itself. count will return the number of rows in the DataFrame, which here is just one row.
So you can keep your original query and extract the value afterwards with the first and getLong methods (the count is a bigint, so getLong rather than getInt):
val s1 = spark.sql("select count(filename) from mytable where filename='myfile.csv'")
val valueToCompare = s1.first().getLong(0)
And then:
if(valueToCompare>0){
//my code
}
Another option is performing the count outside the query, then the count will give you the desired value:
val s1 = spark.sql("select filename from mytable where filename='myfile.csv'")
if(s1.count>0){
//my code
}
I like the second option most, but for no reason other than that I think it is clearer.
spark.sql("select count(filename) from mytable where filename='myfile.csv'") returns a dataframe and you need to extract both the first row and the first column of that row. It is much simpler to directly filter the dataset and count the number of rows in Scala:
val s1 = df.filter($"filename" === "myfile.csv").count
if (s1 > 0) {
...
}
where df is the dataset that corresponds to the mytable table.
If you got the table from some other source and not by registering a view, use SparkSession.table() to get a DataFrame using the instance of SparkSession that you already have. For example, in the Spark shell the pre-set variable spark holds the session and you would do:
val df = spark.table("mytable")
val s1 = df.filter($"filename" === "myfile.csv").count

How to Compare columns of two tables using Spark?

I am trying to compare two tables by reading them as DataFrames. For each common column in those tables, I use the concatenation of a primary key, say order_id, with the other columns like order_date, order_name, order_event.
The Scala code I am using:
val primaryKey = "order_id"
for (i <- commonColumnsList){
val column_name = i
val tempDataFrameForNew = newDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
val tempDataFrameOld = oldDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
//Get those records which are common in both old/new tables
val commonRecords = tempDataFrameForNew.intersect(tempDataFrameOld)
matchCountCalculated = commonRecords.count()
//Get those records which aren't common in both old/new tables
nonMatchCountCalculated = tempDataFrameOld.unionAll(tempDataFrameForNew).except(commonRecords).count()
//Total Null/Non-Null Counts in both old and new tables.
nullsCountInNewDataFrame = newDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
nullsCountInOldDataFrame = oldDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
nonNullsCountInNewDataFrame = newDFCount - nullsCountInNewDataFrame
nonNullsCountInOldDataFrame = oldDFCount - nullsCountInOldDataFrame
//Put the result for a given column in a Seq variable, later convert it to Dataframe.
tempSeq = tempSeq :+ Row(column_name, matchCountCalculated.toString, nonMatchCountCalculated.toString, (nullsCountInNewDataFrame - nullsCountInOldDataFrame).toString,
(nonNullsCountInNewDataFrame - nonNullsCountInOldDataFrame).toString)
}
// Final Step: Create DataFrame using Seq and some Schema.
spark.createDataFrame(spark.sparkContext.parallelize(tempSeq), schema)
The above code works fine for a medium-sized data set, but as the number of columns and records in my new and old tables increases, the execution time increases as well. Any advice is appreciated.
Thank you in advance.
You can do the following:
1. Outer join the old and new DataFrames on the primary key:
val joined_df = df_old.join(df_new, Seq(primary_key), "outer")
2. Cache it if you possibly can. This will save you a lot of time
3. Now you can iterate over the columns and compare them using Spark functions (isNull for not matched, === for matched, etc.):
for (col <- df_new.columns) {
  val matchCount = joined_df.filter(df_new(col).isNotNull && df_old(col).isNotNull).count()
  val nonMatchCount = ...
}
This should be considerably faster, especially when you can cache your DataFrame. If you can't, it might be a good idea to save the joined df to disk in order to avoid a shuffle each time.
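To make the column loop concrete, here is a sketch that computes all per-column match counts in a single aggregation pass over the joined DataFrame. The names dfOld, dfNew and order_id are assumptions, not from the original answer; also note that === treats NULLs as non-matching (use <=> for null-safe equality):

import org.apache.spark.sql.functions.{col, sum, when}

// Assumed inputs: dfOld and dfNew share the primary key "order_id".
val joinedDf = dfOld.as("o").join(dfNew.as("n"), Seq("order_id"), "outer").cache()

// One aggregation computes a match count for every common column in a single pass.
val matchExprs = dfNew.columns.filter(_ != "order_id").map { c =>
  sum(when(col(s"o.$c") === col(s"n.$c"), 1).otherwise(0)).alias(s"${c}_matches")
}
joinedDf.agg(matchExprs.head, matchExprs.tail: _*).show()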

spark Join performance on EMR cluster

We have a 3-node Spark EMR cluster (m3.xlarge). We are trying to join some big tables of 4 GB (250+ columns) with several small reference tables (15 of them) that have 2-3 columns each. We are using Spark dynamic allocation, which is enabled by default on EMR.
While writing to HDFS it takes 1+ hour to save the results (this is because we are using coalesce(1) on the final DataFrame).
We even tried to use broadcast joins, but no luck yet. How can we improve the performance of the above?
What would be an optimal final execution time for the above process?
What are possible ways to improve performance?
Any help will be appreciated!
Here is my join function:
def multiJoins(MasterTablesDF: DataFrame, tmpReferenceTablesDF_List: MutableList[DataFrame], tmpReferenceTableJoinDetailsList: MutableList[Array[String]], DrivingTable: String): DataFrame = {
// Define final output of Driving Table
var final_df: DataFrame = null
if (MasterTablesDF != null) {
if (!MasterTablesDF.head(1).isEmpty && tmpReferenceTablesDF_List.length >= 1) {
for (i <- 0 until tmpReferenceTablesDF_List.length) {
val eachReferenceTableDF = tmpReferenceTablesDF_List(i)
var eachJoinDetails = tmpReferenceTableJoinDetailsList(i)
//for first ref table Join
if (i == 0) {
println("!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
if (eachJoinDetails(0).equals(eachJoinDetails(1))) {
println("############## Driving table and Ref table Joining columns are same joining first Drive table ==>" + DrivingTable + "With Ref table ==>" + eachJoinDetails(3))
//if reftable and Driving table have same join columns using seq() to remove duplicate columns after Joins
final_df = MasterTablesDF.join(broadcast(eachReferenceTableDF), Seq(eachJoinDetails(0)), eachJoinDetails(2)) //.select(ReqCols.head, ReqCols.tail: _*)
} else {
//if the joining column names of the driving and ref tables are not same then
//using driving table join col and reftable join cols
println("############### Driving table and Ref table joining columns are not same joining first Drive table ==>" + DrivingTable + "With Ref table ==>" + eachJoinDetails(3) + "\n")
final_df = MasterTablesDF.join(broadcast(eachReferenceTableDF), MasterTablesDF(eachJoinDetails(0)) === eachReferenceTableDF(eachJoinDetails(1)), eachJoinDetails(2))
}
} //Joining Next reference table dataframes with final DF
else {
if (eachJoinDetails(0).equals(eachJoinDetails(1))) {
println("###### drive table and another ref table join cols are same joining driving table ==>" + DrivingTable + "With RefTable" + eachJoinDetails(3))
final_df = final_df.join(broadcast(eachReferenceTableDF), Seq(eachJoinDetails(0)), eachJoinDetails(2)) //.select(ReqCols.head, ReqCols.tail: _*)
// final_df.unpersist()
} else {
println("###### drive table and another ref table join cols are not same joining driving table ==>" + DrivingTable + "With RefTable" + eachJoinDetails(3) + "\n")
final_df = final_df.join(broadcast(eachReferenceTableDF), MasterTablesDF(eachJoinDetails(0)) === eachReferenceTableDF(eachJoinDetails(1)), eachJoinDetails(2))
}
}
}
}
}
return final_df
//Writing is too slow
//final_df.coalesce(1).write.format("com.databricks.spark.csv").option("delimiter", "|").option("header", "true")
.csv(hdfsPath)
}
Is this fine? Is the slowness due to the looping?
Probably Spark can't optimize your very long execution plan as well as it could. I had the same situation and we did a series of optimizations:
1) Remove all unnecessary columns and filter as early as possible.
2) "Materialize" some tables before joining; it helps Spark break the lineage and optimize your flow (in our case two sort-merge joins were replaced by broadcast joins, because Spark realized the DataFrames were very small); see the sketch after this list.
3) We partitioned all datasets by the same key and the same number of partitions (right after reading).
...and some other optimizations. It reduced the job time from 45 minutes to 4. Look closely at the Spark UI; there we found a lot of helpful insights for optimization (for example, only one of our 10 executors was working because all the data had ended up in one partition). Good luck!
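A minimal sketch of points 2 and 3 (the table names, the join key, and the partition count are assumptions, not from the original answer):

import org.apache.spark.sql.functions.broadcast

// 3) Repartition the big table by the join key right after reading it,
//    so later joins on that key reuse the same partitioning.
val masterByKey = masterDF.repartition(200, masterDF("join_key"))

// 2) "Materialize" a small reference table to cut its lineage; after this
//    Spark sees its real (small) size and can choose a broadcast join on its own.
val refMaterialized = refDF.persist()
refMaterialized.count() // forces evaluation

val joined = masterByKey.join(broadcast(refMaterialized), Seq("join_key"), "left")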

Join on several conditions without duplicated columns

Joining on equality of a common key column with Spark leads to that column being duplicated in the final Dataset:
val result = ds1.join(ds2, ds1("key") === ds2("key"))
// result now has two "key" columns
This can be avoided by using a Seq instead of the comparison, similar to the USING keyword in SQL:
val result = ds1.join(ds2, Seq("key"))
// result now has only one "key" column
However, this doesn't work when joining with a common key + another condition, like:
val result = ds1.join(ds2, ds1("key") === ds2("key") && ds1("foo") < ds2("foo"))
// result has two "key" columns
val result = ds1.join(ds2, Seq("key") && ds1("foo") < ds2("foo"))
// compile error: value && is not a member of Seq[String]
Currently one way of getting out of this is to drop the duplicated column afterwards, but this is quite cumbersome:
val result = ds1.join(ds2, ds1("key") === ds2("key") && ds1("foo") < ds2("foo"))
.drop(ds1("key"))
Is there a more natural, cleaner way to achieve the same goal?
You can separate the equi-join component from the filter:
ds1.join(ds2, Seq("key")).where(ds1("foo") < ds2("foo"))

How to update rows based on condition in spark-sql

I am working on spark-sql for data preparation.
The problem I am facing comes after getting the result of the SQL query: how should I update rows based on an if-then-else condition?
What I am doing:
val table_join = sqlContext.sql(""" SELECT a.*,b.col as someCol
from table1 a LEFT JOIN table2 b
on a.ID=b.ID """)
table_join.registerTempTable("Table_join")
Now that I have the final joined table as a DataFrame, how should I update rows?
//Final filtering operation
val final_filtered_table = table_join.map{ case record=>
if(record.getAs[String]("col1") == "Y" && record.getAs[String]("col2") == "") record.getAs[String]("col2")="UNKNOWN"
else if (record.getAs[String]("col1") == "N") record("col1")=""
else record
}
In the above map the if syntax works properly, but the moment I apply the update condition to modify the value it gives me an error.
But why does the expression below work?
if(record.getAs[String]("col1") == "Y" && record.getAs[String]("col2") == "") "UNKNOWN"
But the moment I change "UNKNOWN" to record.getAs[String]("col2")="UNKNOWN" it gives me an error at .getAs.
Another approach I tried is this:
val final_filtered_sql = table_join.map{row =>
if(row.getString(6) == "Y" && row.getString(33) == "") row.getString(6) == "UNKNOWN"
else if(row.getString(6) == "N") row.getString(6) == ""
else row
}
This works, but is it the right approach? I should not refer to the columns by their numbers but by their names. What approach should I follow to get the column names and then update them?
Please help me with this. What syntax should I use to update rows based on a condition in a DataFrame in Spark SQL?
record.getAs[String]("col2")="UNKNOWN" won't work because record.getAs[String](NAME) returns a String, which doesn't have a = method, and assigning a new value to a String doesn't make sense anyway.
DataFrame records don't have any setter methods, because DataFrames are backed by RDDs, which are immutable collections; you cannot change their state, and that is what you're trying to do here.
One way would be to create a new DataFrame using selectExpr on table_join and put that if/else logic there using SQL.
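A minimal sketch of that approach (the column names col1, col2 and ID are taken from the question; the exact select list is an assumption):

val final_filtered_table = table_join.selectExpr(
  "ID",
  // col1: blank it out when it is 'N'
  "CASE WHEN col1 = 'N' THEN '' ELSE col1 END AS col1",
  // col2: mark it UNKNOWN when col1 = 'Y' and col2 is empty
  "CASE WHEN col1 = 'Y' AND col2 = '' THEN 'UNKNOWN' ELSE col2 END AS col2"
)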