How to join two DataFrames and change column for missing values? - scala

val df1 = sc.parallelize(Seq(
("a1",10,"ACTIVE","ds1"),
("a1",20,"ACTIVE","ds1"),
("a2",50,"ACTIVE","ds1"),
("a3",60,"ACTIVE","ds1"))
).toDF("c1","c2","c3","c4")
val df2 = sc.parallelize(Seq(
("a1",10,"ACTIVE","ds2"),
("a1",20,"ACTIVE","ds2"),
("a1",30,"ACTIVE","ds2"),
("a1",40,"ACTIVE","ds2"),
("a4",20,"ACTIVE","ds2"))
).toDF("c1","c2","c3","c5")
df1.show()
// +---+---+------+---+
// | c1| c2| c3| c4|
// +---+---+------+---+
// | a1| 10|ACTIVE|ds1|
// | a1| 20|ACTIVE|ds1|
// | a2| 50|ACTIVE|ds1|
// | a3| 60|ACTIVE|ds1|
// +---+---+------+---+
df2.show()
// +---+---+------+---+
// | c1| c2| c3| c5|
// +---+---+------+---+
// | a1| 10|ACTIVE|ds2|
// | a1| 20|ACTIVE|ds2|
// | a1| 30|ACTIVE|ds2|
// | a1| 40|ACTIVE|ds2|
// | a4| 20|ACTIVE|ds2|
// +---+---+------+---+
My requirement is: I need to join both DataFrames.
The output DataFrame should contain all the records from df1, plus the records from df2 that are not in df1 but have a matching "c1". The records pulled from df2 should have column "c3" updated to INACTIVE.
In this example the only matching value of "c1" is a1, so I need to pull the c2=30 and c2=40 records from df2 and make them INACTIVE.
Here is the expected output:
df_output.show()
// +---+---+--------+---+
// | c1| c2| c3 | c4|
// +---+---+--------+---+
// | a1| 10|ACTIVE |ds1|
// | a1| 20|ACTIVE |ds1|
// | a2| 50|ACTIVE |ds1|
// | a3| 60|ACTIVE |ds1|
// | a1| 30|INACTIVE|ds1|
// | a1| 40|INACTIVE|ds1|
// +---+---+--------+---+
Can anyone help me do this?

First, a small thing. I use different names for the columns in df2:
val df2 = sc.parallelize(...).toDF("d1","d2","d3","d4")
No big deal, but this made things easier for me to reason about.
Now for the fun stuff. I am going to be a bit verbose for the sake of clarity:
val join = df1
.join(df2, df1("c1") === df2("d1"), "inner")
.select($"d1", $"d2", $"d3", lit("ds1").as("d4"))
.dropDuplicates
Here I do the following:
Inner join between df1 and df2 on the c1 and d1 columns
Select the df2 columns and simply "hardcode" ds1 in the last column to replace ds2
Drop duplicates
This basically just filters out everything in df2 that does not have a corresponding key in c1 in df1.
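As an aside, if your Spark version has it (2.0+), a left-semi join expresses the same "keep only rows with a matching key" idea directly; a rough equivalent of the join above would be:
val join2 = df2
  .join(df1.select($"c1".as("d1")).distinct, Seq("d1"), "left_semi")
  .select($"d1", $"d2", $"d3", lit("ds1").as("d4"))
  .dropDuplicates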
Next I diff:
val diff = join
.except(df1)
.select($"d1", $"d2", lit("INACTIVE").as("d3"), $"d4")
This is a basic set operation that finds everything in join that is not in df1. These are the items to deactivate, so I select all the columns but replace the third with a hardcoded INACTIVE value.
All that's left is to put them all together:
df1.union(diff)
This simply combines df1 with the table of deactivated values we calculated earlier to produce the final result:
+---+---+--------+---+
| c1| c2| c3| c4|
+---+---+--------+---+
| a1| 10| ACTIVE|ds1|
| a1| 20| ACTIVE|ds1|
| a2| 50| ACTIVE|ds1|
| a3| 60| ACTIVE|ds1|
| a1| 30|INACTIVE|ds1|
| a1| 40|INACTIVE|ds1|
+---+---+--------+---+
And again, you don't need all these intermediate values; I was just being verbose to help trace through the process.
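If you prefer, the same steps chain into a single expression:
val result = df1.union(
  df1.join(df2, df1("c1") === df2("d1"), "inner")
    .select($"d1", $"d2", $"d3", lit("ds1").as("d4"))
    .dropDuplicates
    .except(df1)
    .select($"d1", $"d2", lit("INACTIVE").as("d3"), $"d4")
)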

Here is a quick-and-dirty solution in PySpark:
from pyspark.sql import functions as F
# find the rows from df2 that have a matching key c1 in df1
df3 = df1.join(df2,df1.c1==df2.c1)\
.select(df2.c1,df2.c2,df2.c3,df2.c5.alias('c4'))\
.dropDuplicates()
df3.show()
+---+---+------+---+
| c1| c2| c3| c4|
+---+---+------+---+
| a1| 10|ACTIVE|ds2|
| a1| 20|ACTIVE|ds2|
| a1| 30|ACTIVE|ds2|
| a1| 40|ACTIVE|ds2|
+---+---+------+---+
# Union df3 with df1 and change columns c3 and c4 if c4 value is 'ds2'
df1.union(df3).dropDuplicates(['c1','c2'])\
.select('c1','c2',\
F.when(F.col('c4')=='ds2','INACTIVE').otherwise('ACTIVE').alias('c3'),
F.lit('ds1').alias('c4')  # c4 is always rewritten to 'ds1'
)\
.orderBy('c1','c2')\
.show()
+---+---+--------+---+
| c1| c2| c3| c4|
+---+---+--------+---+
| a1| 10| ACTIVE|ds1|
| a1| 20| ACTIVE|ds1|
| a1| 30|INACTIVE|ds1|
| a1| 40|INACTIVE|ds1|
| a2| 50| ACTIVE|ds1|
| a3| 60| ACTIVE|ds1|
+---+---+--------+---+

Enjoyed the challenge and here is my solution.
val c1keys = df1.select("c1").distinct
val df2_in_df1 = df2.join(c1keys, Seq("c1"), "inner")
val df2inactive = df2_in_df1.join(df1, Seq("c1", "c2"), "leftanti").withColumn("c3", lit("INACTIVE"))
scala> df1.union(df2inactive).show
+---+---+--------+---+
| c1| c2| c3| c4|
+---+---+--------+---+
| a1| 10| ACTIVE|ds1|
| a1| 20| ACTIVE|ds1|
| a2| 50| ACTIVE|ds1|
| a3| 60| ACTIVE|ds1|
| a1| 30|INACTIVE|ds2|
| a1| 40|INACTIVE|ds2|
+---+---+--------+---+

Related

How to get the set of rows which contains null values from dataframe in scala using filter

I'm new to Spark and have a question about filtering a DataFrame based on a null condition.
I have gone through many answers that have solutions like
df.filter(($"col2".isNotNull) || ($"col2" !== "NULL") || ($"col2" !== "null") || (trim($"col2") !== "NULL"))
But in my case I cannot hard-code the column names, since my schema is not fixed. I am reading a CSV file, and depending on its columns I have to filter my DataFrame for null values and put those rows into another DataFrame. In short, any row that has a null value in any column should end up in a separate DataFrame.
for example :
Input DataFrame :
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|n1#c1.com|[c1,1,d1]|
| n2| 2|null |[c1,1,d1]|
| n3| 3|n3#c1.com| null |
| n4| 4|n4#c2.com|[c2,2,d2]|
| n6| 6|n6#c2.com|[c2,2,d2]|
Output :
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|n1#c1.com|[c1,1,d1]|
| n2| 2|null |[c1,1,d1]|
| n3| 3|n3#c1.com| null |
Thank you in advance.
Try this-
val df1 = spark.sql("select col1, col2 from values (null, 1), (2, null), (null, null), (1,2) T(col1, col2)")
/**
* +----+----+
* |col1|col2|
* +----+----+
* |null|1 |
* |2 |null|
* |null|null|
* |1 |2 |
* +----+----+
*/
df1.show(false)
df1.filter(df1.columns.map(col(_).isNull).reduce(_ || _)).show(false)
/**
* +----+----+
* |col1|col2|
* +----+----+
* |null|1 |
* |2 |null|
* |null|null|
* +----+----+
*/
Thank you so much for your answers. I tried the logic below and it worked for me.
val arrayColumn = df.columns
// build " <col> is null or <col> == '' " for the first column, then append the rest
val filterString = String.format(" %1$s is null or %1$s == '' ", arrayColumn(0))
val x = new StringBuilder(filterString)
for (i <- 1 until arrayColumn.length) {
  x ++= String.format("or %1$s is null or %1$s == '' ", arrayColumn(i))
}
val dfWithNullRows = df.filter(x.toString())
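For what it's worth, the same predicate can also be built in one line with map/mkString (equivalent to the loop above):
val filterExpr = df.columns.map(c => s"$c is null or $c == ''").mkString(" or ")
val dfWithNullRows2 = df.filter(filterExpr)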
Spark has some useful functions for dealing with null values in DataFrames.
I will show some example DataFrames with different numbers of columns.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(List(StructField("id", IntegerType, true), StructField("obj", DoubleType, true)))
val schema1 = StructType(List(StructField("id", IntegerType, true), StructField("obj", StringType, true), StructField("obj", IntegerType, true)))
val t1 = sc.parallelize(Seq((1,null),(1,1.0),(8,3.0),(2,null),(3,1.4),(3,2.5),(null,3.7))).map(t => Row(t._1, t._2))
val t2 = sc.parallelize(Seq((1,"A",null),(2,"B",null),(3,"C",36),(null,"D",15),(5,"E",25),(6,null,7),(7,"G",null))).map(t => Row(t._1, t._2, t._3))
val tt1 = spark.createDataFrame(t1, schema)
val tt2 = spark.createDataFrame(t2, schema1)
tt1.show()
tt2.show()
// To clean all rows with null values
val dfWithoutNull = tt1.na.drop()
dfWithoutNull.show()
val df2WithoutNull = tt2.na.drop()
df2WithoutNull.show()
// To fill null values with another value
val df1 = tt1.na.fill(-1)
df1.show()
// to get new dataframes with the null values rows
val nullValues = tt1.filter(row => row.anyNull == true)
nullValues.show()
val nullValues2 = tt2.filter(row => row.anyNull == true)
nullValues2.show()
output
// input dataframes
+----+----+
| id| obj|
+----+----+
| 1|null|
| 1| 1.0|
| 8| 3.0|
| 2|null|
| 3| 1.4|
| 3| 2.5|
|null| 3.7|
+----+----+
+----+----+----+
| id| obj| obj|
+----+----+----+
| 1| A|null|
| 2| B|null|
| 3| C| 36|
|null| D| 15|
| 5| E| 25|
| 6|null| 7|
| 7| G|null|
+----+----+----+
// Dataframes without null values
+---+---+
| id|obj|
+---+---+
| 1|1.0|
| 8|3.0|
| 3|1.4|
| 3|2.5|
+---+---+
+---+---+---+
| id|obj|obj|
+---+---+---+
| 3| C| 36|
| 5| E| 25|
+---+---+---+
// Dataframe with null values replaced
+---+----+
| id| obj|
+---+----+
| 1|-1.0|
| 1| 1.0|
| 8| 3.0|
| 2|-1.0|
| 3| 1.4|
| 3| 2.5|
| -1| 3.7|
+---+----+
// Dataframes which the rows have at least one null value
+----+----+
| id| obj|
+----+----+
| 1|null|
| 2|null|
|null| 3.7|
+----+----+
+----+----+----+
| id| obj| obj|
+----+----+----+
| 1| A|null|
| 2| B|null|
|null| D| 15|
| 6|null| 7|
| 7| G|null|
+----+----+----+

Filter one data frame using another data frame in Spark Scala

I am going to demonstrate my question using the following two data frames.
val datF1= Seq((1,"everlasting",1.39),(1,"game", 2.7),(1,"life",0.69),(1,"learning",0.69),
(2,"living",1.38),(2,"worth",1.38),(2,"life",0.69),(3,"learning",0.69),(3,"never",1.38)).toDF("ID","token","value")
datF1.show()
+---+-----------+-----+
| ID| token|value|
+---+-----------+-----+
| 1|everlasting| 1.39|
| 1| game| 2.7|
| 1| life| 0.69|
| 1| learning| 0.69|
| 2| living| 1.38|
| 2| worth| 1.38|
| 2| life| 0.69|
| 3| learning| 0.69|
| 3| never| 1.38|
+---+-----------+-----+
val dataF2= Seq(("life ",0.71),("learning",0.75)).toDF("token1","val2")
dataF2.show()
+--------+----+
| token1|val2|
+--------+----+
| life |0.71|
|learning|0.75|
+--------+----+
I want to filter the ID and value of datF1 based on the token1 column of dataF2. For each word in token1 of dataF2, if that word appears in token for a given ID, the value should be taken from datF1; otherwise the value should be zero.
In other words, my desired output should be like this:
+---+----+----+
| ID| val|val2|
+---+----+----+
| 1|0.69|0.69|
| 2| 0.0|0.69|
| 3|0.69| 0.0|
+---+----+----+
Since learning is not present for ID 2, val equals zero. Similarly, since life is not there for ID 3, val2 equals zero.
I did it manually as follows:
val newQ61=datF1.filter($"token"==="learning")
val newQ7 =Seq(1,2,3).toDF("ID")
val newQ81 =newQ7.join(newQ61, Seq("ID"), "left")
val tf2=newQ81.select($"ID" ,when(col("value").isNull ,0).otherwise(col("value")) as "val" )
val newQ62=datF1.filter($"token"==="life")
val newQ71 =Seq(1,2,3).toDF("ID")
val newQ82 =newQ71.join(newQ62, Seq("ID"), "left")
val tf3=newQ82.select($"ID" ,when(col("value").isNull ,0).otherwise(col("value")) as "val2" )
val tf4 =tf2.join(tf3 ,Seq("ID"), "left")
tf4.show()
+---+----+----+
| ID| val|val2|
+---+----+----+
| 1|0.69|0.69|
| 2| 0.0|0.69|
| 3|0.69| 0.0|
+---+----+----+
Instead of doing this manually, is there a way to do it more efficiently by accessing the values of one data frame from within the other? In real-life situations there can be more than 2 words, so accessing each word manually can be very hard.
Thank you
UPDATE
When I use a leftsemi join, my output is like this:
datF1.join(dataF2, $"token"===$"token1", "leftsemi").show()
+---+--------+-----+
| ID| token|value|
+---+--------+-----+
| 1|learning| 0.69|
| 3|learning| 0.69|
+---+--------+-----+
I believe a left outer join and then pivoting on token can work here:
val ans = datF1.join(dataF2, $"token" === $"token1", "LEFT_OUTER")
.filter($"token1".isNotNull)
.select("ID","token","value")
.groupBy("ID")
.pivot("token")
.agg(first("value"))
.na.fill(0)
The result:
ans.show
+---+--------+----+
| ID|learning|life|
+---+--------+----+
| 1| 0.69|0.69|
| 3| 0.69|0.0 |
| 2| 0.0 |0.69|
+---+--------+----+
UPDATE: as the answer by Lamanus suggests, an inner join is possibly a better approach than an outer join + filter.
I think an inner join is enough. By the way, I found a typo in your test case ("life " has a trailing space), which makes the result wrong.
val dataF1= Seq((1,"everlasting",1.39),
(1,"game", 2.7),
(1,"life",0.69),
(1,"learning",0.69),
(2,"living",1.38),
(2,"worth",1.38),
(2,"life",0.69),
(3,"learning",0.69),
(3,"never",1.38)).toDF("ID","token","value")
dataF1.show
// +---+-----------+-----+
// | ID| token|value|
// +---+-----------+-----+
// | 1|everlasting| 1.39|
// | 1| game| 2.7|
// | 1| life| 0.69|
// | 1| learning| 0.69|
// | 2| living| 1.38|
// | 2| worth| 1.38|
// | 2| life| 0.69|
// | 3| learning| 0.69|
// | 3| never| 1.38|
// +---+-----------+-----+
val dataF2= Seq(("life",0.71), // "life " -> "life"
("learning",0.75)).toDF("token1","val2")
dataF2.show
// +--------+----+
// | token1|val2|
// +--------+----+
// | life|0.71|
// |learning|0.75|
// +--------+----+
val resultDF = dataF1.join(dataF2, $"token" === $"token1", "inner")
resultDF.show
// +---+--------+-----+--------+----+
// | ID| token|value| token1|val2|
// +---+--------+-----+--------+----+
// | 1| life| 0.69| life|0.71|
// | 1|learning| 0.69|learning|0.75|
// | 2| life| 0.69| life|0.71|
// | 3|learning| 0.69|learning|0.75|
// +---+--------+-----+--------+----+
resultDF.groupBy("ID").pivot("token").agg(first("value"))
.na.fill(0).orderBy("ID").show
This will give you a result such as:
+---+--------+----+
| ID|learning|life|
+---+--------+----+
| 1| 0.69|0.69|
| 2| 0.0|0.69|
| 3| 0.69| 0.0|
+---+--------+----+
Seems like you need a "left semi join". It will filter one DataFrame based on another one.
Try using it like
datF1.join(dataF2, $"token" === $"token1", "leftsemi")
You can find a bit more info here - https://medium.com/datamindedbe/little-known-spark-dataframe-join-types-cc524ea39fd5

How to replace empty values in a column of DataFrame?

How can I replace empty values in a column Field1 of DataFrame df?
Field1     Field2
           AA
12         BB
This command does not produce the expected result:
df.na.fill("Field1",Seq("Anonymous"))
The expected result:
Field1     Field2
Anonymous  AA
12         BB
You can also try this. It might handle blank/empty strings as well as nulls:
df.show()
+------+------+
|Field1|Field2|
+------+------+
| | AA|
| 12| BB|
| 12| null|
+------+------+
df.na.replace(Seq("Field1","Field2"),Map(""-> null)).na.fill("Anonymous", Seq("Field2","Field1")).show(false)
+---------+---------+
|Field1 |Field2 |
+---------+---------+
|Anonymous|AA |
|12 |BB |
|12 |Anonymous|
+---------+---------+
Fill: Returns a new DataFrame that replaces null or NaN values in
numeric columns with value.
Two things:
An empty string is not null or NaN, so you'll have to use a case statement for that.
fill does not seem to work when you give a text value for a numeric column.
Failing Null Replace with Fill / Text:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill("Anonymous", Seq("f1")).show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
Working Example - Using Null With All Numbers:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| 1| AA|
| 12| BB|
+---+---+
Failing Example (Empty String instead of Null):
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
Case Statement Fix Example:
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.select(when(col("f1") === "", "Anonymous").otherwise(col("f1")).as("f1"), col("f2")).show
+---------+---+
| f1| f2|
+---------+---+
|Anonymous| AA|
| 12| BB|
+---------+---+
You can try the code below when you have any number of columns in the DataFrame.
Note: when writing data to formats like Parquet, columns of null type are not supported; we have to cast them to a concrete type.
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val df = Seq(
(1, ""),
(2, "Ram"),
(3, "Sam"),
(4, "")
).toDF("ID", "Name")
// null type column
val inputDf = df.withColumn("NulType", lit(null).cast(StringType))
//Output
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
| 1| | null|
| 2| Ram| null|
| 3| Sam| null|
| 4| | null|
+---+----+-------+
// Replace all empty strings in the DataFrame with the string "null"
// (note: this inserts the literal string "null", not a true null value)
val colName = inputDf.columns // this gives you an Array[String] of column names
val data = inputDf.na.replace(colName, Map("" -> "null"))
data.show()
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
| 1|null| null|
| 2| Ram| null|
| 3| Sam| null|
| 4|null| null|
+---+----+-------+

How to write UDF with values as references to other columns?

I'd like to create a UDF that does the following:
A DataFrame has 5 columns, and I want to create a 6th column containing the sum of the two columns whose names are stored in the first and second columns.
Let me print the DataFrame and explain with that:
case class salary(c1: String, c2: String, c3: Int, c4: Int, c5: Int)
val df = Seq(
salary("c3", "c4", 7, 5, 6),
salary("c5", "c4", 8, 10, 20),
salary("c5", "c3", 1, 4, 9))
.toDF()
DataFrame result
+---+---+---+---+---+
| c1| c2| c3| c4| c5|
+---+---+---+---+---+
| c3| c4| 7| 5| 6|
| c5| c4| 8| 10| 20|
| c5| c3| 1| 4| 9|
+---+---+---+---+---+
df.withColumn("c6",UDFName(c1,c2))
And the result for this column should be:
Row 1: (c3, c4), so 7 + 5 = 12
Row 2: (c5, c4), so 20 + 10 = 30
Row 3: (c5, c3), so 9 + 1 = 10
There is really no need for a UDF here. Just use a virtual MapType column:
import org.apache.spark.sql.functions.{col, lit, map}
// We use an interleaved list of column name and column value
val values = map(Seq("c3", "c4", "c5").flatMap(c => Seq(lit(c), col(c))): _*)
// Check the first row
df.select(values).limit(1).show(false)
+------------------------------+
|map(c3, c3, c4, c4, c5, c5) |
+------------------------------+
|Map(c3 -> 7, c4 -> 5, c5 -> 6)|
+------------------------------+
and use it in an expression:
df.withColumn("c6", values($"c1") + values($"c2"))
+---+---+---+---+---+---+
| c1| c2| c3| c4| c5| c6|
+---+---+---+---+---+---+
| c3| c4| 7| 5| 6| 12|
| c5| c4| 8| 10| 20| 30|
| c5| c3| 1| 4| 9| 10|
+---+---+---+---+---+---+
It is much cleaner, faster, and safer than dealing with UDFs and Rows:
import org.apache.spark.sql.functions.{struct, udf}
import org.apache.spark.sql.Row
val f = udf((row: Row) => for {
// Use Options to avoid problems with null columns
// Explicit null checks should be faster, but much more verbose
c1 <- Option(row.getAs[String]("c1"))
c2 <- Option(row.getAs[String]("c2"))
// In this case we could (probably) skip Options below
// but Ints in Spark SQL can get null
x <- Option(row.getAs[Int](c1))
y <- Option(row.getAs[Int](c2))
} yield x + y)
df.withColumn("c6", f(struct(df.columns map col: _*)))
+---+---+---+---+---+---+
| c1| c2| c3| c4| c5| c6|
+---+---+---+---+---+---+
| c3| c4| 7| 5| 6| 12|
| c5| c4| 8| 10| 20| 30|
| c5| c3| 1| 4| 9| 10|
+---+---+---+---+---+---+
A user-defined function (UDF) has access only to the values that are passed to it directly as input parameters.
If you want to access the other columns, the UDF has to receive them as input parameters too. With that, you should easily achieve what you're after.
I highly recommend using the struct function to combine all the other columns:
struct(cols: Column*): Column    Creates a new struct column.
You could also use the Dataset.columns method to get the columns to pass to struct:
columns: Array[String]    Returns all column names as an array.
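A minimal sketch of that struct-based approach, reusing the question's df (the lookup logic inside the UDF is my reading of the intended sum):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

// pass every column to the UDF as one struct, then look up the two columns named in c1 and c2
val sumNamedColumns = udf { (row: Row) =>
  row.getAs[Int](row.getAs[String]("c1")) + row.getAs[Int](row.getAs[String]("c2"))
}
df.withColumn("c6", sumNamedColumns(struct(df.columns.map(col): _*)))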

Spark dataframe filter

val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
| 6| MSL12|
| 7| MSL|
| 8| HCP|
| 9| HCP12|
+---+-------+
I want to filter out records where the first 3 characters of column 'c2' are either 'MSL' or 'HCP'.
So the output should be like below.
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
+---+-------+
Can anyone please help with this?
I know that df.filter($"c2".rlike("MSL")) selects the matching records, but how do I exclude them?
Version: Spark 1.6.2
Scala : 2.10
This works too. Concise and very similar to SQL.
df.filter("c2 not like 'MSL%' and c2 not like 'HCP%'").show
+---+-------+
| c1| c2|
+---+-------+
| 1|Emailab|
| 2|Phoneab|
| 3| Faxab|
| 4| Mail|
| 5| Other|
+---+-------+
df.filter(not(
substring(col("c2"), 0, 3).isin("MSL", "HCP"))
)
I used the code below to filter rows from a dataframe and it worked for me. Spark 2.2.
val spark = new org.apache.spark.sql.SQLContext(sc)
val data = spark.read.format("csv").
option("header", "true").
option("delimiter", "|").
option("inferSchema", "true").
load("D:\\test.csv")
import spark.implicits._
val filter=data.filter($"dept" === "IT" )
OR
val filter=data.filter($"dept" =!= "IT" )
val df1 = df.filter(not(df("c2").rlike("MSL")) && not(df("c2").rlike("HCP")))
This worked.