Joining two DataFrames and appending where not exists - scala

I have two DataFrames: a MasterList and an InsertList.
MasterList:
+--------+--------+
| ttm_id|audit_id|
+--------+--------+
| 1| 10|
| 15| 10|
+--------+--------+
InsertList:
+--------+--------+
| ttm_id|audit_id|
+--------+--------+
| 1| 10|
| 15| 9|
+--------+--------+
In Scala, how do I join the two DataFrames but append to the new DataFrame only the records
WHERE MasterList.ttm_id = InsertList.ttm_id AND
MasterList.audit_id != InsertList.audit_id
ExpectedOutput:
+--------+--------+
| ttm_id|audit_id|
+--------+--------+
| 1| 10|
| 15| 10|
| 15| 9|
+--------+--------+

I'd anti-join (NOT IN) on both columns and then union:
import spark.implicits._ // needed for toDF outside spark-shell
val masterList = Seq((1, 10), (15, 10)).toDF("ttm_id", "audit_id")
val insertList = Seq((1, 10), (15, 9)).toDF("ttm_id", "audit_id")
insertList
  .join(masterList, Seq("ttm_id", "audit_id"), "leftanti") // rows of insertList that are absent from masterList
  .union(masterList)
  .show
// +------+--------+
// |ttm_id|audit_id|
// +------+--------+
// | 15| 9|
// | 1| 10|
// | 15| 10|
// +------+--------+

It seems that you want to add the rows from the insertList DataFrame that are not in the masterList DataFrame. This can be achieved using the except function:
insertList.except(masterList)
Then use the union function to merge both DataFrames:
masterList.union(insertList.except(masterList))
You should get the desired result:
+------+--------+
|ttm_id|audit_id|
+------+--------+
|1 |10 |
|15 |10 |
|15 |9 |
+------+--------+
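To make the difference between the two answers above concrete, here is a minimal plain-Python sketch (no Spark involved) that models rows as tuples. The helper names `left_anti` and `except_distinct` are made up for this illustration. It assumes that `except` follows SQL `EXCEPT DISTINCT` semantics and therefore de-duplicates its result, while a left anti join keeps duplicate rows from the left side.

```python
# Plain-Python model of the two approaches; rows are (ttm_id, audit_id) tuples.
master_list = [(1, 10), (15, 10)]
insert_list = [(1, 10), (15, 9)]


def left_anti(left, right):
    """Keep every left row that has no equal row on the right (duplicates survive)."""
    right_rows = set(right)
    return [row for row in left if row not in right_rows]


def except_distinct(left, right):
    """Like left_anti, but the result is de-duplicated (first occurrence kept)."""
    return list(dict.fromkeys(left_anti(left, right)))


# Master rows plus the insert rows that master does not already contain.
merged = master_list + left_anti(insert_list, master_list)
print(merged)  # [(1, 10), (15, 10), (15, 9)]

# The two helpers only differ when insert_list repeats a new row.
with_dupes = insert_list + [(15, 9)]
print(left_anti(with_dupes, master_list))        # [(15, 9), (15, 9)]
print(except_distinct(with_dupes, master_list))  # [(15, 9)]
```

For the sample data in the question both approaches give the same three rows; which one fits depends on whether repeated rows in the insert list should be kept.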

Related

Pyspark filter where value is in another dataframe

I have two data frames. I need to filter one to only show values that are contained in the other.
table_a:
+---+----+
|AID| foo|
+---+----+
| 1 | bar|
| 2 | bar|
| 3 | bar|
| 4 | bar|
+---+----+
table_b:
+---+
|BID|
+---+
| 1 |
| 2 |
+---+
In the end I want to filter table_a down to only the IDs that are in table_b, like this:
+--+----+
|ID| foo|
+--+----+
| 1| bar|
| 2| bar|
+--+----+
Here is what I'm trying to do
result_table = table_a.filter(table_b.BID.contains(table_a.AID))
But this doesn't seem to be working. It looks like I'm getting ALL values.
NOTE: I can't add any other imports other than pyspark.sql.functions import col
You can join the two tables and specify how="left_semi".
A left semi join returns the rows from the left side of the relation that have a match on the right side. Only the left table's columns are returned, so there is no BID column to drop.
result_table = table_a.join(table_b, table_a.AID == table_b.BID, how="left_semi")
result_table.show()
+---+---+
|AID|foo|
+---+---+
| 1|bar|
| 2|bar|
+---+---+
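As a rough illustration of why a semi join suits this filter better than an inner join, here is a plain-Python model (no Spark involved). The helpers `left_semi` and `inner_join` are hypothetical names for this sketch, and the duplicate BID is added on purpose to show the difference.

```python
# Plain-Python model of a left semi join; rows are dicts.
table_a = [{"AID": i, "foo": "bar"} for i in (1, 2, 3, 4)]
table_b = [{"BID": 1}, {"BID": 2}, {"BID": 2}]  # note the duplicate BID


def left_semi(left, right, left_key, right_key):
    """Keep left rows whose key appears on the right; only left columns are returned."""
    right_keys = {row[right_key] for row in right}
    return [row for row in left if row[left_key] in right_keys]


def inner_join(left, right, left_key, right_key):
    """For contrast: one output row per matching (left, right) pair."""
    return [
        {**lrow, **rrow}
        for lrow in left
        for rrow in right
        if lrow[left_key] == rrow[right_key]
    ]


semi = left_semi(table_a, table_b, "AID", "BID")
inner = inner_join(table_a, table_b, "AID", "BID")
print([row["AID"] for row in semi])   # [1, 2]
print([row["AID"] for row in inner])  # [1, 2, 2]
```

The semi join acts purely as a filter on table_a, so duplicates in table_b never multiply the output rows.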
If the second DataFrame has duplicate or multiple values and you only want the distinct ones, the approach below can be useful.
Create the DataFrames:
from pyspark.sql import functions as F
df = spark.createDataFrame([(1,"bar"),(2,"bar"),(3,"bar"),(4,"bar")],[ "col1","col2"])
df_lookup = spark.createDataFrame([(1,1),(1,2)],[ "id","val"])
df.show(truncate=True)
df_lookup.show()
+----+----+
|col1|col2|
+----+----+
| 1| bar|
| 2| bar|
| 3| bar|
| 4| bar|
+----+----+
+---+---+
| id|val|
+---+---+
| 1| 1|
| 1| 2|
+---+---+
Collect all the unique values of the val column of the second DataFrame into a list variable:
df_lookup_var = df_lookup.groupBy("id").agg(F.collect_set("val").alias("val")).collect()[0][1]
print(df_lookup_var)
Then flag the rows whose col1 is in that list and keep only those:
df = df.withColumn("case_col", F.when(F.col("col1").isin(df_lookup_var), F.lit("1")).otherwise(F.lit("0")))
df = df.filter(F.col("case_col") == F.lit("1"))
df.show()
+----+----+--------+
|col1|col2|case_col|
+----+----+--------+
| 1| bar| 1|
| 2| bar| 1|
+----+----+--------+
This should work too, as long as table_b is small enough to collect to the driver:
table_a.where(col("AID").isin([row.BID for row in table_b.select("BID").collect()]))

How to get the set of rows which contains null values from dataframe in scala using filter

I'm new to Spark and have a question about filtering a DataFrame based on a null condition.
I have gone through many answers that suggest solutions like
df.filter(($"col2".isNotNULL) || ($"col2" !== "NULL") || ($"col2" !== "null") || ($"col2".trim !== "NULL"))
But in my case I cannot hard-code column names because my schema is not fixed. I am reading a CSV file and, depending on the columns in it, I have to filter my DataFrame for null values and put those rows in another DataFrame. In short, if any column has a null value, the complete row should go into a different DataFrame.
For example:
Input DataFrame :
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|n1#c1.com|[c1,1,d1]|
| n2| 2|null |[c1,1,d1]|
| n3| 3|n3#c1.com| null |
| n4| 4|n4#c2.com|[c2,2,d2]|
| n6| 6|n6#c2.com|[c2,2,d2]|
+----+----+---------+---------+
Output :
+----+----+---------+---------+
|name| id| email| company|
+----+----+---------+---------+
| n1|null|n1#c1.com|[c1,1,d1]|
| n2| 2|null |[c1,1,d1]|
| n3| 3|n3#c1.com| null |
+----+----+---------+---------+
Thank you in advance.
Try this:
import org.apache.spark.sql.functions.col
val df1 = spark.sql("select col1, col2 from values (null, 1), (2, null), (null, null), (1,2) T(col1, col2)")
/**
* +----+----+
* |col1|col2|
* +----+----+
* |null|1 |
* |2 |null|
* |null|null|
* |1 |2 |
* +----+----+
*/
df1.show(false)
// Build one isNull predicate per column and OR them together, so no column name is hard-coded
df1.filter(df1.columns.map(col(_).isNull).reduce(_ || _)).show(false)
/**
* +----+----+
* |col1|col2|
* +----+----+
* |null|1 |
* |2 |null|
* |null|null|
* +----+----+
*/
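The `map(...).reduce(_ || _)` filter above amounts to "keep a row if any of its columns is null". A minimal plain-Python sketch of that predicate (no Spark involved; `has_null` is a made-up helper name) also shows how the same test splits the data into the two sets the question asks for:

```python
# Plain-Python model: rows are dicts and None stands in for SQL null.
rows = [
    {"col1": None, "col2": 1},
    {"col1": 2, "col2": None},
    {"col1": None, "col2": None},
    {"col1": 1, "col2": 2},
]


def has_null(row):
    """True if at least one column of the row is null, whatever the schema is."""
    return any(value is None for value in row.values())


with_nulls = [row for row in rows if has_null(row)]
without_nulls = [row for row in rows if not has_null(row)]

print(len(with_nulls))  # 3
print(without_nulls)    # [{'col1': 1, 'col2': 2}]
```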
Thank you so much for your answers. I tried the logic below and it worked for me.
// Build "c1 is null or c1 == '' or c2 is null or c2 == '' ..." from the column names
val filterString = df.columns
  .map(c => s"$c is null or $c == ''")
  .mkString(" or ")
val dfWithNullRows = df.filter(filterString)
Spark has some useful functions for dealing with null values in DataFrames.
I will show some examples using DataFrames with different numbers of columns.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val schema = StructType(List(StructField("id", IntegerType, true), StructField("obj",DoubleType, true)))
val schema1 = StructType(List(StructField("id", IntegerType, true), StructField("obj",StringType, true), StructField("obj",IntegerType, true)))
val t1 = sc.parallelize(Seq((1,null),(1,1.0),(8,3.0),(2,null),(3,1.4),(3,2.5),(null,3.7))).map(t => Row(t._1,t._2))
val t2 = sc.parallelize(Seq((1,"A",null),(2,"B",null),(3,"C",36),(null,"D",15),(5,"E",25),(6,null,7),(7,"G",null))).map(t => Row(t._1,t._2,t._3))
val tt1 = spark.createDataFrame(t1, schema)
val tt2 = spark.createDataFrame(t2, schema1)
tt1.show()
tt2.show()
// To clean all rows with null values
val dfWithoutNull = tt1.na.drop()
dfWithoutNull.show()
val df2WithoutNull = tt2.na.drop()
df2WithoutNull.show()
// To fill null values with another value
val df1 = tt1.na.fill(-1)
df1.show()
// To get new DataFrames containing only the rows with null values
val nullValues = tt1.filter(row => row.anyNull)
nullValues.show()
val nullValues2 = tt2.filter(row => row.anyNull)
nullValues2.show()
Output:
// input dataframes
+----+----+
| id| obj|
+----+----+
| 1|null|
| 1| 1.0|
| 8| 3.0|
| 2|null|
| 3| 1.4|
| 3| 2.5|
|null| 3.7|
+----+----+
+----+----+----+
| id| obj| obj|
+----+----+----+
| 1| A|null|
| 2| B|null|
| 3| C| 36|
|null| D| 15|
| 5| E| 25|
| 6|null| 7|
| 7| G|null|
+----+----+----+
// Dataframes without null values
+---+---+
| id|obj|
+---+---+
| 1|1.0|
| 8|3.0|
| 3|1.4|
| 3|2.5|
+---+---+
+---+---+---+
| id|obj|obj|
+---+---+---+
| 3| C| 36|
| 5| E| 25|
+---+---+---+
// Dataframe with null values replaced
+---+----+
| id| obj|
+---+----+
| 1|-1.0|
| 1| 1.0|
| 8| 3.0|
| 2|-1.0|
| 3| 1.4|
| 3| 2.5|
| -1| 3.7|
+---+----+
// DataFrames containing only the rows with at least one null value
+----+----+
| id| obj|
+----+----+
| 1|null|
| 2|null|
|null| 3.7|
+----+----+
+----+----+----+
| id| obj| obj|
+----+----+----+
| 1| A|null|
| 2| B|null|
|null| D| 15|
| 6|null| 7|
| 7| G|null|
+----+----+----+

Fill null values in dataframe column with next value

I have to fill the leading null values of a column with the first non-null value that follows them in the same column. This applies only to the first run of consecutive nulls in the column.
I have a DataFrame similar to the one below:
// I replaced null with 0 in the value column
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|0 |exA |30 |
|0 |exB |22 |
|0 |exC |19 |
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 |
|13 |exH |53 |
+-----+----+----+
From this DataFrame I expect the following:
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|16 |exA |30 | // Change the value 0 to 16 at value column
|16 |exB |22 | // Change the value 0 to 16 at value column
|16 |exC |19 | // Change the value 0 to 16 at value column
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 | // value should not change here
|13 |exH |53 |
+-----+----+----+
Please help me solve this.
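You can use a window function for this purpose:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{first, last, when}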
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
val w = Window.orderBy($"col2".desc)
df.withColumn("Result", last(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
.orderBy($"col2")
.show(10)
Will result in
+-----+----+----+------+
|value|col2|col3|Result|
+-----+----+----+------+
| 0| exA| 30| 16|
| 0| exB| 22| 16|
| 0| exC| 19| 16|
| 16| exD| 13| 16|
| 5| exE| 28| 5|
| 6| exF| 26| 6|
| 0| exG| 12| 13|
| 13| exH| 53| 13|
+-----+----+----+------+
The final orderBy($"col2") is only needed to show the results in the right order. You can skip it if you don't care about the final order.
UPDATE
To get exactly what you need, you should use slightly more complicated code:
val w = Window.orderBy($"col2")
val w2 = Window.orderBy($"col2".desc)
df.withColumn("IntermediateResult", first(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
.withColumn("Result", when($"IntermediateResult".isNull, last($"IntermediateResult", ignoreNulls = true).over(w2)).otherwise($"value"))
.orderBy($"col2")
.show(10)
+-----+----+----+------------------+------+
|value|col2|col3|IntermediateResult|Result|
+-----+----+----+------------------+------+
| 0| exA| 30| null| 16|
| 0| exB| 22| null| 16|
| 0| exC| 19| null| 16|
| 16| exD| 13| 16| 16|
| 5| exE| 28| 16| 5|
| 6| exF| 26| 16| 6|
| 0| exG| 12| 16| 0|
| 13| exH| 53| 16| 13|
+-----+----+----+------------------+------+
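The rule that the Result column implements can be stated in a few lines of plain Python (no Spark involved). This is only a sketch of the intended logic, assuming the rows are already in col2 order and that 0 stands in for null as in the question; `fill_leading_zeros` is a made-up name.

```python
def fill_leading_zeros(values):
    """Replace only the leading run of zeros with the first non-zero value."""
    first_idx = next((i for i, v in enumerate(values) if v != 0), None)
    if first_idx is None:
        # Every value is zero, so there is nothing to fill with.
        return list(values)
    return [values[first_idx]] * first_idx + list(values[first_idx:])


values = [0, 0, 0, 16, 5, 6, 0, 13]
print(fill_leading_zeros(values))  # [16, 16, 16, 16, 5, 6, 0, 13]
```

The zero at exG stays untouched because it comes after the first non-zero value, which matches the expected output in the question.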
I think you need to take the first non-null (non-zero) value based on the order of col2. Please find the script below. I registered the DataFrame as a temporary view so that I can query it with SQL.
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
// registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView replaces it
df.createOrReplaceTempView("table_df")
spark.sql("""with cte as (select *, row_number() over (order by col2) rno from table_df)
  select case when value = 0 and rno < (select min(rno) from cte where value != 0)
              then (select value from cte where rno = (select min(rno) from cte where value != 0))
              else value end value, col2, col3
  from cte""").show(df.count.toInt, false)
Please let me know if you have any questions.
I added a new column with an incremental id to your DF:
import org.apache.spark.sql.functions._
val df_1 = Seq((0,"exA",30),
(0,"exB",22),
(0,"exC",19),
(16,"exD",13),
(5,"exE",28),
(6,"exF",26),
(0,"exG",12),
(13,"exH",53))
.toDF("value", "col2", "col3")
.withColumn("UniqueID", monotonically_increasing_id)
Filter the DF to keep only the non-zero values:
val df_2 = df_1.filter("value != 0")
Create a variable limit for the first N rows that we need and a variable nVal for the first non-zero value:
val limit = df_2.agg(min("UniqueID")).collect().map(_(0)).mkString("").toInt + 1
val nVal = df_1.limit(limit).agg(max("value")).collect().map(_(0)).mkString("").toInt
Create a DF that overwrites the column of the same name ("value") based on a condition:
val df_4 = df_1.withColumn("value", when(($"UniqueID" < limit), nVal).otherwise($"value"))

How to replace empty values in a column of DataFrame?

How can I replace empty values in a column Field1 of DataFrame df?
Field1 Field2
       AA
12     BB
This command does not give the expected result:
df.na.fill("Field1",Seq("Anonymous"))
The expected result:
Field1    Field2
Anonymous AA
12        BB
You can also try this.
It should handle blank/empty values as well as nulls:
df.show()
+------+------+
|Field1|Field2|
+------+------+
| | AA|
| 12| BB|
| 12| null|
+------+------+
df.na.replace(Seq("Field1","Field2"),Map(""-> null)).na.fill("Anonymous", Seq("Field2","Field1")).show(false)
+---------+---------+
|Field1 |Field2 |
+---------+---------+
|Anonymous|AA |
|12 |BB |
|12 |Anonymous|
+---------+---------+
Fill: Returns a new DataFrame that replaces null or NaN values in
numeric columns with value.
Two things:
1. An empty string is not null or NaN, so you'll have to use a case statement for that.
2. fill does not seem to work when you give it a text value for a numeric column.
Failing Null Replace with Fill / Text:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill("Anonymous", Seq("f1")).show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
Working Example - Using Null With All Numbers:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| 1| AA|
| 12| BB|
+---+---+
Failing Example (Empty String instead of Null):
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
Case Statement Fix Example:
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.select(when(col("f1") === "", "Anonymous").otherwise(col("f1")).as("f1"), col("f2")).show
+---------+---+
| f1| f2|
+---------+---+
|Anonymous| AA|
| 12| BB|
+---------+---+
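Note that the case-statement fix above only matches empty strings: in Spark SQL a comparison with null evaluates to null, so a null f1 falls through to the otherwise branch and stays null. A minimal plain-Python sketch (no Spark involved) of a replacement rule that covers both cases might look like the following; `fill_blank` and the third sample row are assumptions added for illustration.

```python
def fill_blank(rows, column, default):
    """Return copies of rows where a None, empty or whitespace-only value in `column` becomes `default`."""

    def is_blank(value):
        return value is None or (isinstance(value, str) and value.strip() == "")

    return [
        {**row, column: default} if is_blank(row[column]) else dict(row)
        for row in rows
    ]


rows = [
    {"Field1": "", "Field2": "AA"},
    {"Field1": "12", "Field2": "BB"},
    {"Field1": None, "Field2": "CC"},  # hypothetical extra row with a real null
]
filled = fill_blank(rows, "Field1", "Anonymous")
print([row["Field1"] for row in filled])  # ['Anonymous', '12', 'Anonymous']
```

In Spark the equivalent would presumably combine the empty-string test with an isNull check in the same when condition.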
You can try the code below when your DataFrame has any number of columns.
Note: When writing data to formats like Parquet, null-typed columns are not supported, so we have to cast them.
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType
val df = Seq(
(1, ""),
(2, "Ram"),
(3, "Sam"),
(4,"")
).toDF("ID", "Name")
// null type column
val inputDf = df.withColumn("NulType", lit(null).cast(StringType))
//Output
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
| 1| | null|
| 2| Ram| null|
| 3| Sam| null|
| 4| | null|
+---+----+-------+
// Replace every empty string in the DataFrame. Note that Map("" -> "null") writes the literal string "null", not a SQL null.
val colName = inputDf.columns // array of column-name strings
val data = inputDf.na.replace(colName, Map("" -> "null"))
data.show()
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
| 1|null| null|
| 2| Ram| null|
| 3| Sam| null|
| 4|null| null|
+---+----+-------+

Spark Scala: moving average for multiple columns

Input:
val customers = sc.parallelize(List(("Alice", "2016-05-01", 50.00,4),
("Alice", "2016-05-03", 45.00,2),
("Alice", "2016-05-04", 55.00,4),
("Bob", "2016-05-01", 25.00,6),
("Bob", "2016-05-04", 29.00,7),
("Bob", "2016-05-06", 27.00,10))).
toDF("name", "date", "amountSpent","NumItems")
Procedure:
// Import the window functions.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Create a window spec.
val wSpec1 = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)
In this window spec, the data is partitioned by customer and each customer’s data is ordered by date. The window frame starts at -1 (one row before the current row) and ends at 1 (one row after the current row), for a total of 3 rows in the sliding window. The problem is to compute a window-based sum for a list of columns. In this case they are "amountSpent" and "NumItems", but there can be up to hundreds of columns.
Below is a solution that computes the window-based sum for each column separately. How can the summation be performed more efficiently? We shouldn't need to find the rows of the sliding window again for every column.
// Calculate the sum of spent
customers.withColumn("sumSpent",sum(customers("amountSpent")).over(wSpec1)).show()
+-----+----------+-----------+--------+--------+
| name| date|amountSpent|NumItems|sumSpent|
+-----+----------+-----------+--------+--------+
|Alice|2016-05-01| 50.0| 4| 95.0|
|Alice|2016-05-03| 45.0| 2| 150.0|
|Alice|2016-05-04| 55.0| 4| 100.0|
| Bob|2016-05-01| 25.0| 6| 54.0|
| Bob|2016-05-04| 29.0| 7| 81.0|
| Bob|2016-05-06| 27.0| 10| 56.0|
+-----+----------+-----------+--------+--------+
// Calculate the sum of items
customers.withColumn( "sumItems",
sum(customers("NumItems")).over(wSpec1) ).show()
+-----+----------+-----------+--------+--------+
| name| date|amountSpent|NumItems|sumItems|
+-----+----------+-----------+--------+--------+
|Alice|2016-05-01| 50.0| 4| 6|
|Alice|2016-05-03| 45.0| 2| 10|
|Alice|2016-05-04| 55.0| 4| 6|
| Bob|2016-05-01| 25.0| 6| 13|
| Bob|2016-05-04| 29.0| 7| 23|
| Bob|2016-05-06| 27.0| 10| 17|
+-----+----------+-----------+--------+--------+
Currently, I guess, it's not possible to update multiple columns with a single window function call. You can make it look as if it happens at once, as below:
val customers = sc.parallelize(List(("Alice", "2016-05-01", 50.00,4),
("Alice", "2016-05-03", 45.00,2),
("Alice", "2016-05-04", 55.00,4),
("Bob", "2016-05-01", 25.00,6),
("Bob", "2016-05-04", 29.00,7),
("Bob", "2016-05-06", 27.00,10))).
toDF("name", "date", "amountSpent","NumItems")
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Create a window spec.
val wSpec1 = Window.partitionBy("name").orderBy("date").rowsBetween(-1, 1)
val colNames = List("amountSpent", "NumItems")
// Add one "<column>Sum" column per entry, reusing the same window spec
val tempdf = colNames.foldLeft(customers) { (acc, column) =>
  acc.withColumn(column + "Sum", sum(acc(column)).over(wSpec1))
}
tempdf.show(false)
You should get the following output:
+-----+----------+-----------+--------+--------------+-----------+
|name |date |amountSpent|NumItems|amountSpentSum|NumItemsSum|
+-----+----------+-----------+--------+--------------+-----------+
|Bob |2016-05-01|25.0 |6 |54.0 |13 |
|Bob |2016-05-04|29.0 |7 |81.0 |23 |
|Bob |2016-05-06|27.0 |10 |56.0 |17 |
|Alice|2016-05-01|50.0 |4 |95.0 |6 |
|Alice|2016-05-03|45.0 |2 |150.0 |10 |
|Alice|2016-05-04|55.0 |4 |100.0 |6 |
+-----+----------+-----------+--------+--------------+-----------+
Yes, it's possible to compute the window only once (if you have Spark 2, which allows you to use collect_list with struct types). Assuming you have the DataFrame and the window spec as in your code:
val colNames = List("amountSpent","NumItems")
val cols= colNames.map(col(_))
// put window-content of all columns in one struct
val df_wc_arr = customers
.withColumn("window_content_arr",collect_list(struct(cols:_*)).over(wSpec1))
// calculate the sum of the window content for each column
// aggregation expression used later
val aggExpr = colNames.map(n => sum(col("window_content."+n)).as(n+"Sum"))
df_wc_arr
.withColumn("window_content",explode($"window_content_arr"))
.drop($"window_content_arr")
.groupBy(($"name" :: $"date" :: cols):_*)
.agg(aggExpr.head,aggExpr.tail:_*)
.orderBy($"name",$"date")
.show
gives
+-----+----------+-----------+--------+--------------+-----------+
| name| date|amountSpent|NumItems|amountSpentSum|NumItemsSum|
+-----+----------+-----------+--------+--------------+-----------+
|Alice|2016-05-01| 50.0| 4| 95.0| 6|
|Alice|2016-05-03| 45.0| 2| 150.0| 10|
|Alice|2016-05-04| 55.0| 4| 100.0| 6|
| Bob|2016-05-01| 25.0| 6| 54.0| 13|
| Bob|2016-05-04| 29.0| 7| 81.0| 23|
| Bob|2016-05-06| 27.0| 10| 56.0| 17|
+-----+----------+-----------+--------+--------------+-----------+
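The idea behind the struct-based answer is to locate each row's window once and then sum every column over it. Here is a minimal plain-Python sketch of that idea (no Spark involved), assuming the same partitioning by name, ordering by date and a frame of one row before to one row after; `window_sums` is a made-up helper name.

```python
from itertools import groupby
from operator import itemgetter

customers = [
    ("Alice", "2016-05-01", 50.0, 4),
    ("Alice", "2016-05-03", 45.0, 2),
    ("Alice", "2016-05-04", 55.0, 4),
    ("Bob", "2016-05-01", 25.0, 6),
    ("Bob", "2016-05-04", 29.0, 7),
    ("Bob", "2016-05-06", 27.0, 10),
]


def window_sums(rows, before=1, after=1):
    """Append, for every value column, its sum over rows [i - before, i + after] within each name."""
    result = []
    ordered = sorted(rows, key=itemgetter(0, 1))  # partition by name, order by date
    for _, group in groupby(ordered, key=itemgetter(0)):
        part = list(group)
        for i, row in enumerate(part):
            frame = part[max(0, i - before): i + after + 1]  # the frame is located once per row
            # Transpose the value columns of the frame and sum each of them.
            sums = tuple(sum(column) for column in zip(*(r[2:] for r in frame)))
            result.append(row + sums)
    return result


for row in window_sums(customers):
    print(row)
# First row printed: ('Alice', '2016-05-01', 50.0, 4, 95.0, 6)
```

Every column after name and date is summed, so the sketch works unchanged for any number of value columns.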