Column-wise comparison between Spark DataFrames using Spark core - Scala

The example below has 3 columns, but I am looking for a column-wise comparison between two DataFrames that works for any number (N) of columns.
The sample has 5 rows and 3 columns, with EMPID as the primary key.
How can I do this comparison in Spark core?
InputDF1:
+-----+--------+------+
|EMPID|Dept    |Salary|
+-----+--------+------+
|1    |HR      |100   |
|2    |IT      |200   |
|3    |Finance |250   |
|4    |Accounts|200   |
|5    |IT      |150   |
+-----+--------+------+
InputDF2:
+-----+--------+------+
|EMPID|Dept    |Salary|
+-----+--------+------+
|1    |HR      |100   |
|2    |IT      |200   |
|3    |FIN     |250   |
|4    |Accounts|150   |
|5    |IT      |150   |
+-----+--------+------+
Expected Result DF:
+-----+--------+--------+------+------+------+------+
|EMPID|Dept    |Dept    |status|Salary|Salary|status|
+-----+--------+--------+------+------+------+------+
|1    |HR      |HR      |TRUE  |100   |100   |TRUE  |
|2    |IT      |IT      |TRUE  |200   |200   |TRUE  |
|3    |Finance |FIN     |FALSE |250   |250   |TRUE  |
|4    |Accounts|Accounts|TRUE  |200   |150   |FALSE |
|5    |IT      |IT      |TRUE  |150   |150   |TRUE  |
+-----+--------+--------+------+------+------+------+

You can do a join on EMPID and compare the resulting columns:
val result = df1.alias("df1").join(
    df2.alias("df2"), "EMPID"
  ).select(
    $"EMPID",
    $"df1.Dept", $"df2.Dept",
    ($"df1.Dept" === $"df2.Dept").as("status"),
    $"df1.Salary", $"df2.Salary",
    ($"df1.Salary" === $"df2.Salary").as("status")
  )
result.show
+-----+--------+--------+------+------+------+------+
|EMPID| Dept| Dept|status|Salary|Salary|status|
+-----+--------+--------+------+------+------+------+
| 1| HR| HR| true| 100| 100| true|
| 2| IT| IT| true| 200| 200| true|
| 3| Finance| FIN| false| 250| 250| true|
| 4|Accounts|Accounts| true| 200| 150| false|
| 5| IT| IT| true| 150| 150| true|
+-----+--------+--------+------+------+------+------+
Note that you may wish to rename the columns, because duplicate column names cannot be referenced unambiguously later, as in the sketch below.
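A minimal sketch of such a renaming, assuming the same df1/df2 as above (the alias names df1_Dept, Dept_status, etc. are just illustrative):
val renamed = df1.alias("df1").join(df2.alias("df2"), "EMPID")
  .select(
    $"EMPID",
    $"df1.Dept".as("df1_Dept"), $"df2.Dept".as("df2_Dept"),
    ($"df1.Dept" === $"df2.Dept").as("Dept_status"),
    $"df1.Salary".as("df1_Salary"), $"df2.Salary".as("df2_Salary"),
    ($"df1.Salary" === $"df2.Salary").as("Salary_status")
  )
The generic answer below does the same thing for any number of columns.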

You can use join and then iterate over df.columns to select the desired output columns:
val df_final = df1.alias("df1")
  .join(df2.alias("df2"), "EMPID")
  .select(
    Seq(col("EMPID")) ++
      df1.columns.filter(_ != "EMPID")
        .flatMap(c =>
          Seq(
            col(s"df1.$c").as(s"df1_$c"),
            col(s"df2.$c").as(s"df2_$c"),
            (col(s"df1.$c") === col(s"df2.$c")).as(s"status_$c")
          )
        ): _*
  )
df_final.show
//+-----+--------+--------+-----------+----------+----------+-------------+
//|EMPID|df1_Dept|df2_Dept|status_Dept|df1_Salary|df2_Salary|status_Salary|
//+-----+--------+--------+-----------+----------+----------+-------------+
//| 1| HR| HR| true| 100| 100| true|
//| 2| IT| IT| true| 200| 200| true|
//| 3| Finance| FIN| false| 250| 250| true|
//| 4|Accounts|Accounts| true| 200| 150| false|
//| 5| IT| IT| true| 150| 150| true|
//+-----+--------+--------+-----------+----------+----------+-------------+

You could also do it the way shown below:
//Source data
val df = Seq((1,"HR",100),(2,"IT",200),(3,"Finance",250),(4,"Accounts",200),(5,"IT",150)).toDF("EMPID","Dept","Salary")
val df1 = Seq((1,"HR",100),(2,"IT",200),(3,"Fin",250),(4,"Accounts",150),(5,"IT",150)).toDF("EMPID","Dept","Salary")
//Join and other operations
val finalDF = df.as("d").join(df1.as("d1"), Seq("EMPID"), "inner")
  .withColumn("DeptStatus", $"d.Dept" === $"d1.Dept")
  .withColumn("SalaryStatus", $"d.Salary" === $"d1.Salary")
  .selectExpr("EMPID", "d.Dept", "d1.Dept", "DeptStatus as Status",
              "d.Salary", "d1.Salary", "SalaryStatus as Status")
display(finalDF) // display is Databricks-specific; use finalDF.show elsewhere
The resulting output matches the expected comparison shown in the question.

Related

How to perform one to many mapping on spark scala dataframe column using flatmaps

I am looking specifically for a flatMap solution to mock data columns in a Spark Scala DataFrame, using a data-duplication technique such as a 1-to-many mapping inside flatMap.
My given data is something like this:
|id |name|marks|
+---+----+-----+
|1 |ABCD|12 |
|2 |CDEF|12 |
|3 |FGHI|14 |
+---+----+-----+
and my expectation after doing a 1-to-3 mapping of the id column is something like this:
|id |name|marks|
+---+----+-----+
|1 |ABCD|12 |
|2 |CDEF|12 |
|3 |FGHI|14 |
|2 |null|null |
|3 |null|null |
|1 |null|null |
|2 |null|null |
|1 |null|null |
|3 |null|null |
+---+----+-----+
Please feel free to let me know if there is any clarification required on the requirement part
Thanks in advance!!!
I see that you are attempting to generate data with a requirement of re-using values in the ID column.
You can just select the ID column and generate random values and do a union back to your original dataset.
For example:
val data = Seq((1,"asd",15), (2,"asd",20), (3,"test",99)).toDF("id","testName","marks")
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| asd| 15|
| 2| asd| 20|
| 3| test| 99|
+---+--------+-----+
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
val newRecords = data.select("id")
  .withColumn("testName", concat(lit("name_"), (rand() * 10).cast(IntegerType).cast(StringType)))
  .withColumn("marks", (rand() * 100).cast(IntegerType))
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| name_2| 35|
| 2| name_9| 20|
| 3| name_3| 7|
+---+--------+-----+
val result = data.union(newRecords) // unionAll is deprecated since Spark 2.0; union does the same thing
+---+--------+-----+
| id|testName|marks|
+---+--------+-----+
| 1| asd| 15|
| 2| asd| 20|
| 3| test| 99|
| 1| name_2| 35|
| 2| name_9| 20|
| 3| name_3| 7|
+---+--------+-----+
You can run the randomisation portion of the code in a loop and union all of the generated DataFrames, as in the sketch below.
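A minimal sketch of that loop-and-union idea (the names copies and batch are illustrative assumptions, not from the original answer):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val copies = 2 // how many extra synthetic batches to generate
val result = (1 to copies).foldLeft(data) { (acc, _) =>
  // regenerate the random testName/marks columns for every batch
  val batch = data.select("id")
    .withColumn("testName", concat(lit("name_"), (rand() * 10).cast(IntegerType).cast(StringType)))
    .withColumn("marks", (rand() * 100).cast(IntegerType))
  acc.union(batch) // columns line up as id, testName, marks
}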

Spark Scala: group by multiple columns with aggregation logic

Given the following DataFrame:
+----+--------+--------+-----+------+------+------+
|name|platform|group_id|width|height| x| y|
+----+--------+--------+-----+------+------+------+
| a| plat_a| 0|500.0|1000.0|250.41|500.01|
| a| plat_a| 0|250.0| 500.0|125.75| 250.7|
| a| plat_a| 0|300.0| 800.0| 120.0| 111.7|
| b| plat_b| 0|500.0|1000.0| 250.5|500.67|
| b| plat_b| 1|400.0| 800.0|100.67|200.67|
| b| plat_b| 1|800.0|1600.0|201.07|401.07|
+----+--------+--------+-----+------+------+------+
I would like to group by name, platform, group_id and count by the following columns logic:
//normalizing value to percent with 2 digit precision
new_x = Math.round(x / width * 100.0) / 100.0
new_y = Math.round(y / height * 100.0) / 100.0
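(For example, for the first row: Math.round(250.41 / 500.0 * 100.0) / 100.0 = Math.round(50.082) / 100.0 = 50 / 100.0 = 0.5, and similarly for new_y.)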
So the output DataFrame would be:
+----+--------+--------+------+------+-----+
|name|platform|group_id| new_x| new_y|count|
+----+--------+--------+------+------+-----+
| a| plat_a| 0| 0.5| 0.5| 2|
| a| plat_a| 0| 0.4| 0.13| 1|
| b| plat_b| 0| 0.5| 0.5| 1|
| b| plat_b| 1| 0.25| 0.25| 2|
+----+--------+--------+------+------+-----+
How should I approach this problem?
It should be quite straightforward with groupBy and count:
import org.apache.spark.sql.functions._
df.withColumn("new_x", round($"x" / $"width" * 100.0 ) / 100.0)
.withColumn("new_y", round($"y" / $"height" * 100.0 ) / 100.0)
.groupBy("name", "platform", "group_id", "new_x", "new_y")
.count()
.show(false)
Output:
+----+--------+--------+-----+-----+-----+
|name|platform|group_id|new_x|new_y|count|
+----+--------+--------+-----+-----+-----+
|a |plat_a |0 |0.5 |0.5 |2 |
|b |plat_b |0 |0.5 |0.5 |1 |
|b |plat_b |1 |0.25 |0.25 |2 |
|a |plat_a |0 |0.4 |0.14 |1 |
+----+--------+--------+-----+-----+-----+

I have a DataFrame in two rows and multiple columns, how to transpose into two columns and multiple rows?

I have a spark DataFrame like this:
+---+---+---+---+---+---+---+
| f1| f2| f3| f4| f5| f6| f7|
+---+---+---+---+---+---+---+
| 5| 4| 5| 2| 5| 5| 5|
+---+---+---+---+---+---+---+
how can I pivot it to
+---+---+
| f1| 5|
+---+---+
| f2| 4|
+---+---+
| f3| 5|
+---+---+
| f4| 2|
+---+---+
| f5| 5|
+---+---+
| f6| 5|
+---+---+
| f7| 5|
+---+---+
Is there a simple code in spark scala that can be used for transposition?
scala> df.show()
+---+---+---+---+---+---+---+
| f1| f2| f3| f4| f5| f6| f7|
+---+---+---+---+---+---+---+
| 5| 4| 5| 2| 5| 5| 5|
+---+---+---+---+---+---+---+
scala> import org.apache.spark.sql.DataFrame
scala> import org.apache.spark.sql.functions._
scala> def transposeUDF(transDF: DataFrame, transBy: Seq[String]): DataFrame = {
| val (cols, types) = transDF.dtypes.filter{ case (c, _) => !transBy.contains(c)}.unzip
| require(types.distinct.size == 1)
|
| val kvs = explode(array(
| cols.map(c => struct(lit(c).alias("columns"), col(c).alias("value"))): _*
| ))
|
| val byExprs = transBy.map(col(_))
|
| transDF
| .select(byExprs :+ kvs.alias("_kvs"): _*)
| .select(byExprs ++ Seq($"_kvs.columns", $"_kvs.value"): _*)
| }
scala> val df1 = df.withColumn("tempColumn", lit("1"))
scala> transposeUDF(df1, Seq("tempColumn")).drop("tempColumn").show(false)
+-------+-----+
|columns|value|
+-------+-----+
|f1 |5 |
|f2 |4 |
|f3 |5 |
|f4 |2 |
|f5 |5 |
|f6 |5 |
|f7 |5 |
+-------+-----+
Spark 2.4+: use map_from_arrays
scala> var df = Seq((5, 4, 5, 2, 5, 5, 5)).toDF("f1", "f2", "f3", "f4", "f5", "f6", "f7")
scala> df.select(array('*).as("v"), lit(df.columns).as("k"))
     |   .select('v.getItem(0).as("cust_id"), map_from_arrays('k, 'v).as("map"))
     |   .select(explode('map))
     |   .show(false)
+---+-----+
|key|value|
+---+-----+
|f1 |5 |
|f2 |4 |
|f3 |5 |
|f4 |2 |
|f5 |5 |
|f6 |5 |
|f7 |5 |
+---+-----+
Hope it helps you.
I wrote a function:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{DataType, DataTypes}

object DT {
  val KEY_COL_NAME = "dt_key"
  val VALUE_COL_NAME = "dt_value"

  def pivot(df: DataFrame, valueDataType: DataType, cols: Array[String], keyColName: String, valueColName: String): DataFrame = {
    val tempData: RDD[Row] = df.rdd.flatMap(row => row.getValuesMap(cols).map(Row.fromTuple))
    val keyStructField = DataTypes.createStructField(keyColName, DataTypes.StringType, false)
    val valueStructField = DataTypes.createStructField(valueColName, DataTypes.StringType, true)
    val structType = DataTypes.createStructType(Array(keyStructField, valueStructField))
    df.sparkSession.createDataFrame(tempData, structType).select(col(keyColName), col(valueColName).cast(valueDataType))
  }

  def pivot(df: DataFrame, valueDataType: DataType): DataFrame = {
    pivot(df, valueDataType, df.columns, KEY_COL_NAME, VALUE_COL_NAME)
  }
}
It worked:
import org.apache.spark.sql.types.DoubleType
df.show()
DT.pivot(df, DoubleType).show()
like this:
+---+---+-----------+---+---+
| f1| f2|         f3| f4| f5|
+---+---+-----------+---+---+
|100|  1|0.355072464|  0| 31|
+---+---+-----------+---+---+
becomes
+------+-----------+
|dt_key|   dt_value|
+------+-----------+
|    f1|      100.0|
|    f5|       31.0|
|    f3|0.355072464|
|    f4|        0.0|
|    f2|        1.0|
+------+-----------+
and
+---+---+-----------+-----------+---+
| f1| f2|         f3|         f4| f5|
+---+---+-----------+-----------+---+
|100|  1|0.355072464|          0| 31|
| 63|  2|0.622775801|0.685809375| 16|
+---+---+-----------+-----------+---+
becomes
+------+-----------+
|dt_key|   dt_value|
+------+-----------+
|    f1|      100.0|
|    f5|       31.0|
|    f3|0.355072464|
|    f4|        0.0|
|    f2|        1.0|
|    f1|       63.0|
|    f5|       16.0|
|    f3|0.622775801|
|    f4|0.685809375|
|    f2|        2.0|
+------+-----------+

Fill null values in dataframe column with next value

I have to fill the leading null values of a column with the first non-null value that follows them in the same column of the DataFrame. This logic applies only to the first run of consecutive null values in the column.
I have a dataframe similar to the one below:
//I replaced null to 0 in value column
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|0 |exA |30 |
|0 |exB |22 |
|0 |exC |19 |
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 |
|13 |exH |53 |
+-----+----+----+
From this dataframe I am expecting as below
scala> df.show(false)
+-----+----+----+
|value|col2|col3|
+-----+----+----+
|16 |exA |30 | // Change the value 0 to 16 at value column
|16 |exB |22 | // Change the value 0 to 16 at value column
|16 |exC |19 | // Change the value 0 to 16 at value column
|16 |exD |13 |
|5 |exE |28 |
|6 |exF |26 |
|0 |exG |12 | // value should not be change here
|13 |exH |53 |
+-----+----+----+
Please help me solve this.
You can use a Window function for this purpose:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
  (5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
  .toDF("value", "col2", "col3")

val w = Window.orderBy($"col2".desc)
df.withColumn("Result", last(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
  .orderBy($"col2")
  .show(10)
Will result in
+-----+----+----+------+
|value|col2|col3|Result|
+-----+----+----+------+
| 0| exA| 30| 16|
| 0| exB| 22| 16|
| 0| exC| 19| 16|
| 16| exD| 13| 16|
| 5| exE| 28| 5|
| 6| exF| 26| 6|
| 0| exG| 12| 13|
| 13| exH| 53| 13|
+-----+----+----+------+
Expression df.orderBy($"col2") is needed only to show final results in right order. You can skip it if you don't care about final order.
UPDATE
To get exactly what you need, you should use slightly more complicated code:
val w = Window.orderBy($"col2")
val w2 = Window.orderBy($"col2".desc)
df.withColumn("IntermediateResult", first(when($"value" === 0, null).otherwise($"value"), ignoreNulls = true).over(w))
  .withColumn("Result", when($"IntermediateResult".isNull, last($"IntermediateResult", ignoreNulls = true).over(w2)).otherwise($"value"))
  .orderBy($"col2")
  .show(10)
+-----+----+----+------------------+------+
|value|col2|col3|IntermediateResult|Result|
+-----+----+----+------------------+------+
| 0| exA| 30| null| 16|
| 0| exB| 22| null| 16|
| 0| exC| 19| null| 16|
| 16| exD| 13| 16| 16|
| 5| exE| 28| 16| 5|
| 6| exF| 26| 16| 6|
| 0| exG| 12| 16| 0|
| 13| exH| 53| 16| 13|
+-----+----+----+------------------+------+
I think you need to take the first non-null or non-zero value based on col2's order. Please find the script below; I have registered a temporary view in Spark's memory so I can write SQL.
val df = Seq( (0,"exA",30), (0,"exB",22), (0,"exC",19), (16,"exD",13),
(5,"exE",28), (6,"exF",26), (0,"exG",12), (13,"exH",53))
.toDF("value", "col2", "col3")
df.createOrReplaceTempView("table_df") // registerTempTable is deprecated
spark.sql("""
  with cte as (select *, row_number() over(order by col2) as rno from table_df)
  select case when value = 0 and rno < (select min(rno) from cte where value != 0)
              then (select value from cte where rno = (select min(rno) from cte where value != 0))
              else value end as value,
         col2, col3
  from cte
""").show(df.count.toInt, false)
Please let me know if you have any questions.
I added a new column with an incremental id to your DF:
import org.apache.spark.sql.functions._
val df_1 = Seq((0,"exA",30),
(0,"exB",22),
(0,"exC",19),
(16,"exD",13),
(5,"exE",28),
(6,"exF",26),
(0,"exG",12),
(13,"exH",53))
.toDF("value", "col2", "col3")
.withColumn("UniqueID", monotonically_increasing_id)
Filter the DF to keep only non-zero values:
val df_2 = df_1.filter("value != 0")
Create a variable limit for the first N rows that we need, and a variable nVal for the first non-zero value:
val limit = df_2.agg(min("UniqueID")).collect().map(_(0)).mkString("").toInt + 1
val nVal = df_1.limit(limit).agg(max("value")).collect().map(_(0)).mkString("").toInt
Create a DF with a column of the same name ("value"), replaced under a condition:
val df_4 = df_1.withColumn("value", when(($"UniqueID" < limit), nVal).otherwise($"value"))
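A quick way to check the result (not part of the original answer) is to drop the helper column and display the DataFrame:
df_4.drop("UniqueID").show(false)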

How to replace empty values in a column of DataFrame?

How can I replace empty values in a column Field1 of DataFrame df?
+------+------+
|Field1|Field2|
+------+------+
|      |AA    |
|12    |BB    |
+------+------+
This command does not provide an expected result:
df.na.fill("Field1",Seq("Anonymous"))
The expected result:
+---------+------+
|Field1   |Field2|
+---------+------+
|Anonymous|AA    |
|12       |BB    |
+---------+------+
You can also try this; it should handle blank, empty, and null values:
df.show()
+------+------+
|Field1|Field2|
+------+------+
| | AA|
| 12| BB|
| 12| null|
+------+------+
df.na.replace(Seq("Field1","Field2"), Map("" -> null))
  .na.fill("Anonymous", Seq("Field2","Field1"))
  .show(false)
+---------+---------+
|Field1 |Field2 |
+---------+---------+
|Anonymous|AA |
|12 |BB |
|12 |Anonymous|
+---------+---------+
Fill: "Returns a new DataFrame that replaces null or NaN values in numeric columns with value."
Two things:
An empty string is not null or NaN, so you'll have to use a case statement for that.
Fill seems to be a no-op when you give a text value for a numeric column (it only applies to string columns).
Failing Null Replace with Fill / Text:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill("Anonymous", Seq("f1")).show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
Working Example - Using Null With All Numbers:
scala> a.show
+----+---+
| f1| f2|
+----+---+
|null| AA|
| 12| BB|
+----+---+
scala> a.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| 1| AA|
| 12| BB|
+---+---+
Failing Example (Empty String instead of Null):
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.na.fill(1, Seq("f1")).show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
Case Statement Fix Example:
scala> b.show
+---+---+
| f1| f2|
+---+---+
| | AA|
| 12| BB|
+---+---+
scala> b.select(when(col("f1") === "", "Anonymous").otherwise(col("f1")).as("f1"), col("f2")).show
+---------+---+
| f1| f2|
+---------+---+
|Anonymous| AA|
| 12| BB|
+---------+---+
You can try the code below when you have N columns in the dataframe.
Note: when writing data to formats like Parquet, a column of null type is not supported, so we have to cast it.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType

val df = Seq(
  (1, ""),
  (2, "Ram"),
  (3, "Sam"),
  (4, "")
).toDF("ID", "Name")
// null type column
val inputDf = df.withColumn("NulType", lit(null).cast(StringType))
//Output
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
| 1| | null|
| 2| Ram| null|
| 3| Sam| null|
| 4| | null|
+---+----+-------+
//Replace all blank values in the dataframe with null
val colName = inputDf.columns // this gives you an Array[String] of column names
// use an unquoted null here; Map("" -> "null") would insert the literal string "null" instead
val data = inputDf.na.replace(colName, Map("" -> null))
data.show()
+---+----+-------+
| ID|Name|NulType|
+---+----+-------+
| 1|null| null|
| 2| Ram| null|
| 3| Sam| null|
| 4|null| null|
+---+----+-------+
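If you then want the question's Anonymous placeholder instead of null, a small follow-up sketch (assuming the replace above produced real nulls, as in the corrected code):
data.na.fill("Anonymous", Seq("Name")).show()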