Spark DataFrame's `except()` removes different items every time - scala

var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.limit(3)
df2.show()
var df3 = df.except(df2)
df3.show()
Surprisingly, I found that except is not working the way it should. Here is my output:
df2 is created correctly and contains 1, 2 and 3. But my df3 still has 1, 2 and/or 3 in it. It's kind of random: if I run it multiple times, I get different results. Can anyone please help me? Thanks in advance.

You need to use a Spark "action" to collect the data required for "df2" before performing the "except" operation. That ensures the dataframe df2 is computed beforehand and has fixed contents, which are then subtracted from df.
The randomness comes from Spark's lazy evaluation: Spark puts all of your code into one stage, and the contents of "df2" are not fixed at the moment you perform the "except" operation on it. The Spark API documentation for limit says:
Returns a new Dataset by taking the first n rows. The difference between this function
and head is that head is an action and returns an array (by triggering query execution)
while limit returns a new Dataset.
Since limit returns a Dataset, it is evaluated lazily.
The code below will give you a consistent output.
var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.head(3).map(_.getInt(0)).toList.toDF("num") // head is an action, so the three values are fixed here and the column stays an integer
df2.show()
var df3 = df.except(df2)
df3.show()
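An alternative sketch of the same idea is to cache df2 and force it with an action such as count(); this assumes the cached rows are not evicted and recomputed before the except runs.
var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
var df2 = df.limit(3).cache()
df2.count() // action: materializes and caches the three rows
var df3 = df.except(df2) // now subtracts a fixed set of rows
df3.show()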

The best way to test this is to just create a new DF that holds the values you want to diff.
val df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
val df2 = List(1,2,3).toDF("num")
df2.show()
val df3 = df.except(df2)
df3.show()
Alternatively, just write a deterministic filter to select the rows you want:
val df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
val df2 = df.filter("num <= 3")
df2.show()
val df3 = df.except(df2)
df3.show()

You could use a leftanti join for this if the column you are comparing on has unique values.
Example:
var df = List(1,2,3,4,5,6,7,8,9,10,11).toDF("num")
df.show()
var df2 = df.limit(3)
df2.show()
var df3 = df.join(df2,Seq("num"),"leftanti")
df3.show()

Related

How to concat two dataframes in pyspark when one has records and the other one is empty?

I need help to concat two dataframes, where one is empty and the other one has data. Could you please tell me how to do this in pyspark?
In pandas I am using the following (suppose df2 is empty and df1 has some records):
df2 = pd.concat([df2, df1])
But how do I perform this operation in pyspark?
df1:
+--------------------+----------+---------+
| Programname|Projectnum| Drug|
+--------------------+----------+---------+
|Non-Oncology Phar...|SR0480-000|Invokamet|
+--------------------+----------+---------+
df2:
++
||
++
++
I tried many options, and one worked for me.
To concat df2 to df1, first create df2 with the same structure as df1, then use union for the concatenation.
df2 = sqlContext.createDataFrame(sc.emptyRDD(), df1.schema)
df2 = df2.union(df1)
result:
df2:
+--------------------+----------+---------+
| Programname|Projectnum| Drug|
+--------------------+----------+---------+
|Non-Oncology Phar...|SR0480-000|Invokamet|
+--------------------+----------+---------+
You can use the union method:
df = df1.union(df2)

Unexpected caching behaviour for groupBy/join operations in spark

I have been trying to do multiple aggregations on a base data frame, let's say df1.
When I run the following code:
df1.cache()
val df2 = df1.groupBy(col("col1"),col("col2") as "col6").agg(sum("col3"))
val df3 = df1.groupBy(col("col1"),col("col4") as "col6").agg(sum("col5"))
val df4 = df2.join(df3,Seq("col1","col6"),"outer")
df4.count()
In the generated query plan, and on the SQL tab of the Spark UI, I see that df2 is an in-memory table scan of df1, while the complete DAG of df1 is executed to generate df3.
When I rename col1 while doing the join:
df1.cache()
val df2 = df1.groupBy(col("col1") as "col1",col("col2") as "col6").agg(sum("col3"))
val df3 = df1.groupBy(col("col1") as "col1",col("col4") as "col6").agg(sum("col5"))
val df4 = df2.join(df3,Seq("col1","col6"),"outer")
df4.count()
Both DataFrames are in-memory table scans.
I didn't think this would make a difference; can someone please explain to me why this could be happening?
PS: One more thing I noticed is that, without the join, the query plans of both DataFrames are in-memory table scans.
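For anyone reproducing this, a small sketch (column names col1 to col6 assumed from the question, and df1 assumed to already exist) that prints the physical plans so you can check whether each aggregation shows an InMemoryTableScan over the cached df1:
import org.apache.spark.sql.functions.{col, sum}
df1.cache()
val df2 = df1.groupBy(col("col1") as "col1", col("col2") as "col6").agg(sum("col3"))
val df3 = df1.groupBy(col("col1") as "col1", col("col4") as "col6").agg(sum("col5"))
df2.explain() // look for InMemoryTableScan in the printed plan
df3.explain()
df2.join(df3, Seq("col1", "col6"), "outer").explain()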

Dynamically select multiple columns while joining different Dataframe in Scala Spark

I have two Spark data frames, df1 and df2. Is there a way to select the output columns dynamically while joining these two dataframes? The definition below outputs all columns from df1 and df2 in the case of an inner join.
def joinDF (df1: DataFrame, df2: DataFrame , joinExprs: Column, joinType: String): DataFrame = {
val dfJoinResult = df1.join(df2, joinExprs, joinType)
dfJoinResult
//.select()
}
Input data:
val df1 = List(("1","new","current"), ("2","closed","saving"), ("3","blocked","credit")).toDF("id","type","account")
val df2 = List(("1","7"), ("2","5"), ("5","8")).toDF("id","value")
Expected result:
val dfJoinResult = df1
.join(df2, df1("id") === df2("id"), "inner")
.select(df1("type"), df1("account"), df2("value"))
dfJoinResult.schema():
StructType(StructField(type,StringType,true),
StructField(account,StringType,true),
StructField(value,StringType,true))
I have looked at options like df.select(cols.head, cols.tail: _*), but it does not allow selecting columns from both DFs.
Is there a way to pass the select columns dynamically, together with which dataframe each one should be selected from, into my def? I'm using Spark 2.2.0.
It is possible to pass the select expression as a Seq[Column] to the method:
def joinDF(df1: DataFrame, df2: DataFrame , joinExpr: Column, joinType: String, selectExpr: Seq[Column]): DataFrame = {
val dfJoinResult = df1.join(df2, joinExpr, joinType)
dfJoinResult.select(selectExpr:_*)
}
To call the method use:
val joinExpr = df1.col("id") === df2.col("id")
val selectExpr = Seq(df1.col("type"), df1.col("account"), df2.col("value"))
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
This will give the desired result:
+------+-------+-----+
| type|account|value|
+------+-------+-----+
| new|current| 7|
|closed| saving| 5|
+------+-------+-----+
In the selectExpr above, it is necessary to specify which dataframe the columns are coming from. However, this can be further simplified if the following assumptions are true:
The columns to join on have the same name in both dataframes
The columns to be selected have unique names (the other dataframe does not have a column with the same name)
In this case, the joinExpr: Column can be changed to joinExpr: Seq[String] and selectExpr: Seq[Column] to selectExpr: Seq[String]:
def joinDF(df1: DataFrame, df2: DataFrame , joinExpr: Seq[String], joinType: String, selectExpr: Seq[String]): DataFrame = {
val dfJoinResult = df1.join(df2, joinExpr, joinType)
dfJoinResult.select(selectExpr.head, selectExpr.tail:_*)
}
Calling the method now looks cleaner:
val joinExpr = Seq("id")
val selectExpr = Seq("type", "account", "value")
val testDf = joinDF(df1, df2, joinExpr, "inner", selectExpr)
Note: when the join is performed using a Seq[String], the column names of the resulting dataframe are different compared to using a join expression. If columns with the same name are present, there is no way to select them separately afterwards.
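A small sketch of that difference, using the df1 and df2 from the question:
// With a Seq[String] join the "id" column appears only once:
df1.join(df2, Seq("id"), "inner").columns // Array(id, type, account, value)
// With a join expression both "id" columns are kept and can be selected per dataframe:
df1.join(df2, df1("id") === df2("id"), "inner").columns // Array(id, type, account, id, value)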
A slight modification of the solution above is to select the required columns from the DataFrames before performing the join; it carries a little less overhead, since fewer columns take part in the JOIN operation.
val dfJoinResult = df1.select("column1","column2").join(df2.select("col1"), joinExpr, joinType)
But remember to include the columns you will be joining on in the select, because the select is applied first and the join then only sees the selected columns.
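Applied to the df1 and df2 from the question, a sketch of this approach could look as follows (note that the join key id has to be part of each select):
// Pre-project each side to only the columns needed downstream, keeping the join key "id".
val dfJoinResult = df1.select("id", "type", "account")
  .join(df2.select("id", "value"), Seq("id"), "inner")
  .select("type", "account", "value")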

How to join multiple dataFrames in spark with different column names and types without converting into RDD

My df1 has a column of type Double, df2 has a column of type Timestamp and df3 has a column of type Integer.
I'm trying to achieve something like this:
df1 = ...
df2 = ...
df3 = ...
val df4 = df1.zip(df2).zip(df3)
However, there is no such function as "zip". How can I achieve such a result?
There's no explicit zip for DataFrames, but you can work around it:
val df1Ordered = df1.withColumn("rowNr", row_number().over(Window.orderBy('someColumn)))
// the same for the other DataFrames
// now join those DataFrames
val newDF = df1Ordered.join(df2Ordered, "rowNr").join(df3Ordered, "rowNr")
However, it will be quite slow, because there is no partitionBy in the Window operation, so all rows end up in a single partition.
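A fuller sketch of that workaround, with the needed imports, assuming each dataframe has a single column whose (hypothetical) name is used for the ordering:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Assign a row number to each dataframe by some ordering, then join on it.
// Window.orderBy without partitionBy pulls all rows into one partition, hence the slowness.
val df1Ordered = df1.withColumn("rowNr", row_number().over(Window.orderBy(df1("someDoubleCol"))))
val df2Ordered = df2.withColumn("rowNr", row_number().over(Window.orderBy(df2("someTimestampCol"))))
val df3Ordered = df3.withColumn("rowNr", row_number().over(Window.orderBy(df3("someIntCol"))))

val newDF = df1Ordered
  .join(df2Ordered, "rowNr")
  .join(df3Ordered, "rowNr")
  .drop("rowNr")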

check condition for two column in two different dataframes in spark

Suppose there is a column in one dataframe and a column with a similar schema in another dataframe. How can I check whether the values in the two columns are the same or not, without joining them, as there is no common attribute?
DF1
serial_nm
abc
mnc
pqr
DF2
ser_nm
hgf
mnc
uio
pqr
lok
And I want a third dataframe, DF3, as output:
DF3
mnc
pqr
I tried this:
val DF3 = DF1.filter(DF1("serial_nm") === DF2("ser_nm"))
But it's not working.
Please help.
Thanks..!!
I believe you can use a join. Consider using it like this:
val DF3 = DF1.join(DF2, DF1("serial_nm") === DF2("ser_nm"))
or
val DF3 = DF1.join(DF2).where(DF1("serial_nm") === DF2("ser_nm"))
Both approaches are equivalent.
Note: To avoid problems with ambiguous column names, one option is to rename the column on one side before the join, for example:
val df2_renamed = DF2.withColumnRenamed("ser_nm", "df2_ser_nm")