Comparing two dataframes in Spark - scala

I'm comparing two dataframes in spark using except().
For example: df.except(df2)
This gives me all the records from df that are not present in df2. However, I would also like to list the field-level details of what is not matching.
For example:
df:
------------------
id,name,age,city
101,kp,28,CHN
------------------
df2:
-----------------
id,name,age,city
101,kp,28,HYD
----------------
Expected output:
df3
--------------------------
id,name,age,city,diff
101,kp,28,CHN,City is not matching
--------------------------------
How can I achieve this?

Use intersect to get the values common to both DataFrames, then build your "not matching" logic on top of that.
intersect returns a new Dataset containing only the rows that are present in both this Dataset and the other Dataset:
df.intersect(df2)
It is the DataFrame counterpart of RDD.intersection(anotherRdd), which returns the elements present in both RDDs and removes all duplicates, including duplicates within a single RDD.
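To flesh out the "build your not matching logic" step, here is a minimal sketch of my own (not part of the answer above): join the two DataFrames on the key and concatenate a message for every column whose values differ. It assumes id is the key, both DataFrames use the question's schema, and a SparkSession named spark is in scope:
import org.apache.spark.sql.functions._
import spark.implicits._

val df  = Seq((101, "kp", 28, "CHN")).toDF("id", "name", "age", "city")
val df2 = Seq((101, "kp", 28, "HYD")).toDF("id", "name", "age", "city")

val compareCols = Seq("name", "age", "city")
// One message per column that differs between the two sides (null when the column matches)
val diffMessages = compareCols.map { c =>
  when(df(c) =!= df2(c) || df(c).isNull =!= df2(c).isNull, lit(s"${c.capitalize} is not matching"))
}

val df3 = df.join(df2, Seq("id"))
  .select(df("id"), df("name"), df("age"), df("city"),
          concat_ws(", ", diffMessages: _*).as("diff"))
  .where(col("diff") =!= "")
df3.show(false)
// expected: 101, kp, 28, CHN, City is not matching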

Here is a newer attempt at the above. It cannot be done elegantly, but it can be done with a JOIN as opposed to except; this is the best I can do.
I believe it does what you need and takes into account the fact that some rows exist in only one of the data sets.
Run under Databricks.
case class Person(personid: Int, personname: String, cityid: Int)
import org.apache.spark.sql.functions._
// Assumes spark.implicits._ is in scope (automatic in Databricks notebooks and spark-shell)
val df1 = Seq(
  Person(0, "AgataZ", 0),
  Person(1, "Iweta", 0),
  Person(2, "Patryk", 2),
  Person(9999, "Maria", 2),
  Person(5, "John", 2),
  Person(6, "Patsy", 2),
  Person(7, "Gloria", 222),
  Person(3333, "Maksym", 0)).toDF
val df2 = Seq(
  Person(0, "Agata", 0),
  Person(1, "Iweta", 0),
  Person(2, "Patryk", 2),
  Person(5, "John", 2),
  Person(6, "Patsy", 333),
  Person(7, "Gloria", 2),
  Person(4444, "Hans", 3)).toDF
// Full outer join keeps rows that exist in only one of the two DataFrames
val joined = df1.join(df2, df1("personid") === df2("personid"), "outer")
val newNames = Seq("personId1", "personName1", "personCity1", "personId2", "personName2", "personCity2")
val df_Renamed = joined.toDF(newNames: _*)
// Some deliberate variation shown in approach for learning
// Keep only the rows where the name or city differs, or where one side is missing entirely
val df_temp = df_Renamed
  .filter($"personCity1" =!= $"personCity2" || $"personName1" =!= $"personName2" ||
          $"personName1".isNull || $"personName2".isNull ||
          $"personCity1".isNull || $"personCity2".isNull)
  .select($"personId1", $"personName1".alias("Name"), $"personCity1",
          $"personId2", $"personName2".alias("Name2"), $"personCity2")
  .withColumn("PersonID", when($"personId1".isNotNull, $"personId1").otherwise($"personId2"))
val df_final = df_temp
  .withColumn("nameChange ?",
    when($"Name".isNull or $"Name2".isNull or $"Name" =!= $"Name2", "Yes").otherwise("No"))
  .withColumn("cityChange ?",
    when($"personCity1".isNull or $"personCity2".isNull or $"personCity1" =!= $"personCity2", "Yes").otherwise("No"))
  .drop("PersonId1")
  .drop("PersonId2")
df_final.show()
gives:
+------+-----------+------+-----------+--------+------------+------------+
| Name|personCity1| Name2|personCity2|PersonID|nameChange ?|cityChange ?|
+------+-----------+------+-----------+--------+------------+------------+
| Patsy| 2| Patsy| 333| 6| No| Yes|
|Maksym| 0| null| null| 3333| Yes| Yes|
| null| null| Hans| 3| 4444| Yes| Yes|
|Gloria| 222|Gloria| 2| 7| No| Yes|
| Maria| 2| null| null| 9999| Yes| Yes|
|AgataZ| 0| Agata| 0| 0| Yes| No|
+------+-----------+------+-----------+--------+------------+------------+

Related

Sum columns of a Spark dataframe and create another dataframe

I have a dataframe like below -
I am trying to create another dataframe from this which has 2 columns - the column name and the sum of values in each column like this -
So far, I've tried this (in Spark 2.2.0), but it throws a stack trace:
val get_count: (String => Long) = (c: String) => {
  df.groupBy("id")
    .agg(sum(c) as "s")
    .select("s")
    .collect()(0)
    .getLong(0)
}
val sqlfunc = udf(get_count)
summary = summary.withColumn("sum_of_column", sqlfunc(col("c")))
Are there any other alternatives of accomplishing this task?
I think that the most efficient way is to do an aggregation and then build a new dataframe. That way you avoid a costly explode.
First, let's create the dataframe. BTW, it's always nice to provide the code to do it when you ask a question. This way we can reproduce your problem in seconds.
val df = Seq((1, 1, 0, 0, 1), (1, 1, 5, 0, 0),
(0, 1, 0, 6, 0), (0, 1, 0, 4, 3))
.toDF("output_label", "ID", "C1", "C2", "C3")
Then we build the list of columns that we are interested in, the aggregations, and compute the result.
val cols = (1 to 3).map(i => s"C$i")
val aggs = cols.map(name => sum(col(name)).as(name))
val agg_df = df.agg(aggs.head, aggs.tail :_*) // See the note below
agg_df.show
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 5| 10| 4|
+---+---+---+
We almost have what we need, we just need to collect the data and build a new dataframe:
val agg_row = agg_df.first
cols.map(name => name -> agg_row.getAs[Long](name))
.toDF("column", "sum")
.show
+------+---+
|column|sum|
+------+---+
| C1| 5|
| C2| 10|
| C3| 4|
+------+---+
EDIT:
NB: df.agg(aggs.head, aggs.tail: _*) may seem strange. The idea is simply to compute all the aggregations defined in aggs. One would expect something simpler, like df.agg(aggs: _*). Yet the signature of the agg method is as follows:
def agg(expr: org.apache.spark.sql.Column, exprs: org.apache.spark.sql.Column*)
probably to ensure that at least one column is used, and this is why you need to split aggs into aggs.head and aggs.tail.
What I do is define a method that creates a struct from the desired values:
def kv (columnsToTranspose: Array[String]) = explode(array(columnsToTranspose.map {
c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
This function receives a list of columns to transpose (your last 3 columns, in your case) and transforms them into structs, with the column name as key and the column value as value.
Then use that method to create the struct and process it as you want:
df.withColumn("kv", kv(df.columns.tail.tail))
.select( $"kv.k".as("column"), $"kv.v".alias("values"))
.groupBy("column")
.agg(sum("values").as("sum"))
First apply the previously defined function so that the desired columns become the struct described above, then deconstruct the struct to get a key column and a value column in each row.
Then you can aggregate by the column name and sum the values:
INPUT
+------------+---+---+---+---+
|output_label| id| c1| c2| c3|
+------------+---+---+---+---+
| 1| 1| 0| 0| 1|
| 1| 1| 5| 0| 0|
| 0| 1| 0| 6| 0|
| 0| 1| 0| 4| 3|
+------------+---+---+---+---+
OUTPUT
+------+---+
|column|sum|
+------+---+
| c1| 5|
| c3| 4|
| c2| 10|
+------+---+
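For reference, here is one way the pieces above could be assembled into a self-contained snippet; the dataframe definition is borrowed from the first answer (with the lower-case column names shown in INPUT), and spark.implicits._ is assumed to be in scope:
import org.apache.spark.sql.functions._

val df = Seq((1, 1, 0, 0, 1), (1, 1, 5, 0, 0),
             (0, 1, 0, 6, 0), (0, 1, 0, 4, 3))
  .toDF("output_label", "id", "c1", "c2", "c3")

// Build one (k, v) struct per column and explode them into rows
def kv(columnsToTranspose: Array[String]) = explode(array(columnsToTranspose.map {
  c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))

df.withColumn("kv", kv(df.columns.drop(2)))   // drop(2) skips output_label and id, same as tail.tail
  .select($"kv.k".as("column"), $"kv.v".as("values"))
  .groupBy("column")
  .agg(sum("values").as("sum"))
  .show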

Scala dataframe: replace spaces with null value using regexp_replace

I am trying to replace whitespace with a null value using regexp_replace in Scala. However, none of the variations I have tried arrive at the expected output:
+---+-----+
| Id|col_1|
+---+-----+
| 0| null|
| 1| null|
+---+-----+
I had a go at it which looks like this:
import org.apache.spark.sql.functions._
val df = spark.createDataFrame(Seq(
(0, " "),
(1, null),
(2, "hello"))).toDF("Id", "col_1")
val test = df.withColumn("col_1", regexp_replace(df("col_1"), "^\\s*", lit(Null)))
test.filter("col_1 is null").show()
The way you use regexp_replace won't work, as the result will simply be a string with the matched substring replaced by the provided replacement string. You can use regexp_extract instead, for a regex equality check in a when/otherwise clause, as shown below:
import org.apache.spark.sql.functions._
val df = Seq(
(0, " "),
(1, null),
(2, "hello"),
(3, "")
).toDF("Id", "col_1")
df.withColumn("col_1",
when($"col_1" === regexp_extract($"col_1", "(^\\s*$)", 1), null).
otherwise($"col_1")
).show
// +---+-----+
// | Id|col_1|
// +---+-----+
// | 0| null|
// | 1| null|
// | 2|hello|
// | 3| null|
// +---+-----+
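As an aside (a sketch of my own, not part of the answer above), the same when/otherwise pattern can test the regex directly with rlike instead of comparing against regexp_extract, using the same df as above:
import org.apache.spark.sql.functions._

// rlike("^\\s*$") is true only for empty or whitespace-only strings; null values stay null
df.withColumn("col_1",
  when($"col_1".rlike("^\\s*$"), lit(null).cast("string")).
  otherwise($"col_1")
).show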

Iterating on columns in dataframe

I have the following data frames
df1
+----------+----+----+----+-----+
| WEEK|DIM1|DIM2| T1| T2|
+----------+----+----+----+-----+
|2016-04-02| 14|NULL|9874| 880|
|2016-04-30| 14| FR|9875| 13|
|2017-06-10| 15| PQR|9867|57721|
+----------+----+----+----+-----+
df2
+----------+----+----+----+-----+
| WEEK|DIM1|DIM2| T1| T2|
+----------+----+----+----+-----+
|2016-04-02| 14|NULL|9879| 820|
|2016-04-30| 14| FR|9785| 9|
|2017-06-10| 15| XYZ|9967|57771|
+----------+----+----+----+-----+
I need to produce my output as follows:
+----------+----+----+----+-----+----+-----+-------+-------+----------+------------+
| WEEK|DIM1|DIM2| T1| T2| T1| T2|t1_diff|t2_diff|pr_primary|pr_reference|
+----------+----+----+----+-----+----+-----+-------+-------+----------+------------+
|2016-04-02| 14|NULL|9874| 880|9879| 820| -5| 60| Y| Y|
|2017-06-10| 15| PQR|9867|57721|null| null| null| null| Y| N|
|2017-06-10| 15| XYZ|null| null|9967|57771| null| null| N| Y|
|2016-04-30| 14| FR|9875| 13|9785| 9| 90| 4| Y| Y|
+----------+----+----+----+-----+----+-----+-------+-------+----------+------------+
Here, t1_diff is the difference between the left T1 and the right T1, t2_diff is the difference between the left T2 and the right T2, pr_primary is Y if the row is present in df1 (and N otherwise), and similarly pr_reference is Y if the row is present in df2.
I have generated the above with the following piece of code:
val df1 = Seq(
("2016-04-02", "14", "NULL", 9874, 880), ("2016-04-30", "14", "FR", 9875, 13), ("2017-06-10", "15", "PQR", 9867, 57721)
).toDF("WEEK", "DIM1", "DIM2","T1","T2")
val df2 = Seq(
("2016-04-02", "14", "NULL", 9879, 820), ("2016-04-30", "14", "FR", 9785, 9), ("2017-06-10", "15", "XYZ", 9967, 57771)
).toDF("WEEK", "DIM1", "DIM2","T1","T2")
import org.apache.spark.sql.functions._
val joined = df1.as("l").join(df2.as("r"), Seq("WEEK", "DIM1", "DIM2"), "fullouter")
val j1 = joined
  .withColumn("t1_diff", col(s"l.T1") - col(s"r.T1"))
  .withColumn("t2_diff", col(s"l.T2") - col(s"r.T2"))
val isPresentSubstitution = udf( (x: String, y: String) => if (x == null && y == null) "N" else "Y")
j1.withColumn("pr_primary", isPresentSubstitution(col(s"l.T1"), col(s"l.T2")))
  .withColumn("pr_reference", isPresentSubstitution(col(s"r.T1"), col(s"r.T2")))
  .show
I want to generalize this for any number of columns, not just T1 and T2. Can someone suggest a better way to do this? I am running this in Spark.
To be able to add any number of columns like t1_diff, with arbitrary expressions calculating their values, we need to do some refactoring that allows withColumn to be used in a more generic manner.
First, we need to collect the target values: the names of the target columns and the expressions that calculate their contents. This can be done with a sequence of tuples:
val diffColumns = Seq(
("t1_diff", col("l.T1") - col("r.T1")),
("t2_diff", col("l.T2") - col("r.T2"))
)
// or, to make it more readable, create a dedicated "case class DiffColumn(colName: String, expression: Column)"
Now we can use folding to produce the joined DataFrame from joined and the sequence above:
val joinedWithDiffCols =
  diffColumns.foldLeft(joined) { case (df, diffTuple) =>
    df.withColumn(diffTuple._1, diffTuple._2)
  }
joinedWithDiffCols contains the same data as j1 from the question.
To append new columns, you now only have to modify the diffColumns sequence. You can even put the calculation of pr_primary and pr_reference in this sequence (but rename it to appendedColumns in that case, to be more precise), as sketched below.
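A hedged sketch of my own of what that extended sequence could look like; it assumes joined and the functions._ import from the question, and replaces the isPresentSubstitution udf with plain when/otherwise:
val appendedColumns = Seq(
  ("t1_diff",      col("l.T1") - col("r.T1")),
  ("t2_diff",      col("l.T2") - col("r.T2")),
  // Presence flags, mirroring the isPresentSubstitution udf from the question
  ("pr_primary",   when(col("l.T1").isNull && col("l.T2").isNull, "N").otherwise("Y")),
  ("pr_reference", when(col("r.T1").isNull && col("r.T2").isNull, "N").otherwise("Y"))
)
val result = appendedColumns.foldLeft(joined) { case (df, (name, expr)) =>
  df.withColumn(name, expr)
}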
Update
To facilitate the creation of the tuples for diffColumns, it can also be generalized, for example:
// when both column names are same:
def generateDiff(column: String): (String, Column) = generateDiff(column, column)
// when left and right column names are different:
def generateDiff(leftCol: String, rightCol: String): (String, Column) =
(s"${leftCol}_diff", col("l." + leftCol) - col("r." + rightCol))
val diffColumns = Seq("T1", "T2").map(generateDiff)
End-of-update
Assuming the columns are named the same in both df1 and df2, you can do something like:
val diffCols = df1.columns
.filter(_.matches("T\\d+"))
.map(c => col(s"l.$c") - col(s"r.$c") as (s"${c.toLowerCase}_diff") )
And then use it with joined like:
joined.select((col("*") +: diffCols): _*).show(false)
//+----------+----+----+----+-----+----+-----+-------+-------+
//|WEEK |DIM1|DIM2|T1 |T2 |T1 |T2 |t1_diff|t2_diff|
//+----------+----+----+----+-----+----+-----+-------+-------+
//|2016-04-02|14 |NULL|9874|880 |9879|820 |-5 |60 |
//|2017-06-10|15 |PQR |9867|57721|null|null |null |null |
//|2017-06-10|15 |XYZ |null|null |9967|57771|null |null |
//|2016-04-30|14 |FR |9875|13 |9785|9 |90 |4 |
//+----------+----+----+----+-----+----+-----+-------+-------+
You can do it by adding a sequence number to each dataframe and later joining the two dataframes on that sequence number.
val df3 = df1.withColumn("SeqNum", monotonicallyIncreasingId)
val df4 = df2.withColumn("SeqNum", monotonicallyIncreasingId)
df3.as("l").join(df4.as("r"), "SeqNum")
  .withColumn("t1_diff", col("l.T1") - col("r.T1"))
  .withColumn("t2_diff", col("l.T2") - col("r.T2"))
  .drop("SeqNum")
  .show()

Spark Dataframe - Method to take row as input & dataframe as output

I need to write a method that iterates over all the rows of DF2 and generates a DataFrame based on some conditions.
Here are the inputs DF1 & DF2:
val df1Columns = Seq("Eftv_Date","S_Amt","A_Amt","Layer","SubLayer")
val df2Columns = Seq("Eftv_Date","S_Amt","A_Amt")
var df1 = List(
List("2016-10-31","1000000","1000","0","1"),
List("2016-12-01","100000","950","1","1"),
List("2017-01-01","50000","50","2","1"),
List("2017-03-01","50000","100","3","1"),
List("2017-03-30","80000","300","4","1")
)
.map(row =>(row(0), row(1),row(2),row(3),row(4))).toDF(df1Columns:_*)
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
|2017-03-01| 50000| 100| 3| 1|
|2017-03-30| 80000| 300| 4| 1|
+----------+-------+-----+-----+--------+
val df2 = List(
List("2017-02-01","0","400")
).map(row =>(row(0), row(1),row(2))).toDF(df2Columns:_*)
+----------+-----+-----+
| Eftv_Date|S_Amt|A_Amt|
+----------+-----+-----+
|2017-02-01| 0| 400|
+----------+-----+-----+
Now I need to write a method that filters DF1 based on the Eftv_Date values from each row of DF2.
For example, the first row of df2 has Eftv_Date = Feb 01 2017, so I need to filter df1 to the records with Eftv_Date less than or equal to Feb 01 2017. This will generate 3 records, as below:
Expected Result :
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
+----------+-------+-----+-----+--------+
I have written the method as below and called it using the map function.
def transformRows(row: Row) = {
  val dateEffective = row.getAs[String]("Eftv_Date")
  val df1LayerMet = df1.where(col("Eftv_Date").leq(dateEffective))
  df1 = df1LayerMet
  df1
}
val x = df2.map(transformRows)
But while calling this I am facing this error:
Error:(154, 24) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val x = df2.map(transformRows)
Note: We can implement this using a join, but I need to implement a custom Scala method to do this, since there are a lot of transformations involved. For simplicity I have mentioned only one condition.
Seems you need a non-equi join:
df1.alias("a").join(
df2.select("Eftv_Date").alias("b"),
df1("Eftv_Date") <= df2("Eftv_Date") // non-equi join condition
).select("a.*").show
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
+----------+-------+-----+-----+--------+
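If you specifically need a method that takes a Row from df2 and returns a DataFrame (as the note asks), a minimal sketch of my own is to collect the rows of df2 to the driver and call the method there; the encoder error comes from trying to produce DataFrames inside df2.map, which Spark cannot encode:
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions._

// Hypothetical helper: same logic as transformRows, but without mutating df1
def filterByEffectiveDate(base: DataFrame, row: Row): DataFrame =
  base.where(col("Eftv_Date") <= lit(row.getAs[String]("Eftv_Date")))

// df2 is small here, so its rows can safely be brought to the driver
val results: Array[DataFrame] = df2.collect().map(row => filterByEffectiveDate(df1, row))
results.foreach(_.show())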

Simple Roll-Down With Spark Dataframe (Scala)

If you have a simple dataframe that looks like this:
val n = sc.parallelize(List[String](
"Alice", null, null,
"Bob", null, null,
"Chuck"
)).toDF("name")
Which looks like this:
//+-----+
//| name|
//+-----+
//|Alice|
//| null|
//| null|
//| Bob|
//| null|
//| null|
//|Chuck|
//+-----+
How can I use dataframe roll-down functions to get:
//+-----+
//| name|
//+-----+
//|Alice|
//|Alice|
//|Alice|
//| Bob|
//| Bob|
//| Bob|
//|Chuck|
//+-----+
Note: Please state any needed imports, I suspect these include:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{WindowSpec, Window}
Note: Some sites I tried to mimic are:
http://xinhstechblog.blogspot.com/2016/04/spark-window-functions-for-dataframes.html
and
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
I've come across something like this in the past so I realize that Spark versions will differ. I am using 1.5.2 in the cluster (where this solution is more useful) and 2.0 in local emulation. I prefer a 1.5.2 compatible solution.
Also, I'd like to get away from writing SQL directly - avoid using sqlContext.sql(...)
If you have another column that allows grouping of the values, here's a suggestion:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import sqlContext.implicits._
val df = Seq(
(Some("Alice"), 1),
(None, 1),
(None, 1),
(Some("Bob"), 2),
(None, 2),
(None, 2),
(Some("Chuck"), 3)
).toDF("name", "group")
val result = df.withColumn("new_col", min(col("name")).over(Window.partitionBy("group")))
result.show()
+-----+-----+-------+
| name|group|new_col|
+-----+-----+-------+
|Alice| 1| Alice|
| null| 1| Alice|
| null| 1| Alice|
| Bob| 2| Bob|
| null| 2| Bob|
| null| 2| Bob|
|Chuck| 3| Chuck|
+-----+-----+-------+
On the other hand, if you only have a column that allows ordering, but not grouping, the solution is a little harder. My first idea is to create a subset and then do a join:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import sqlContext.implicits._
val df = Seq(
(Some("Alice"), 1),
(None, 2),
(None, 3),
(Some("Bob"), 4),
(None, 5),
(None, 6),
(Some("Chuck"), 7)
).toDF("name", "order")
val subset = df
.select("name", "order")
.where(col("name").isNotNull)
.withColumn("next", lead("order", 1).over(Window.orderBy("order")))
val partial = df.as("a")
.join(subset.as("b"), col("a.order") >= col("b.order") && (col("a.order") < subset("next")), "left")
val result = partial.select(coalesce(col("a.name"), col("b.name")).as("name"), col("a.order"))
result.show()
+-----+-----+
| name|order|
+-----+-----+
|Alice| 1|
|Alice| 2|
|Alice| 3|
| Bob| 4|
| Bob| 5|
| Bob| 6|
|Chuck| 7|
+-----+-----+
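For completeness: on Spark 2.0+ (so not on the 1.5.2 cluster mentioned in the question), the ordering-only case can also be handled with last(..., ignoreNulls = true) over a window, which is effectively a built-in roll-down. A sketch assuming the same (name, order) dataframe as above:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Carry the last non-null name forward along the ordering column.
// Note: a window with no partitionBy pulls all rows into a single partition.
val w = Window.orderBy("order").rowsBetween(Long.MinValue, 0)
val rolled = df.withColumn("name", last(col("name"), ignoreNulls = true).over(w))
rolled.show()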