I am trying to compare two DataFrames row by row and, if any mismatch is found, print it in the format shown below.
Example:
data = [("James", "M", 60000), ("Michael", "M", 70000),
("Robert", None, 400000), ("Maria", "F", 500000),
("Jen", "", None),(None,None,None)]
columns = ["name", "gender", "salary"]
source_df = spark.createDataFrame(data=data, schema=columns)
source_df.show()
+-------+------+------+
| name|gender|salary|
+-------+------+------+
| James| M| 60000|
|Michael| M| 70000|
| Robert| null|400000|
| Maria| F|500000|
| Jen| | null|
| null| null| null|
+-------+------+------+
data1 = [("Anurag", "M", 70000), ("Michael", "M", 70000),
("Sunil", None, 900000), ("Maria", "F", 500000),
("Jen", "", None),(None,None,None)]
columns = ["name_1", "gender_1", "salary_1"]
target_df = spark.createDataFrame(data=data1, schema=columns)
target_df.show()
+-------+--------+--------+
| name_1|gender_1|salary_1|
+-------+--------+--------+
| Anurag| M| 70000|
|Michael| M| 70000|
| Sunil| null| 900000|
| Maria| F| 500000|
| Jen| | null|
| null| null| null|
+-------+--------+--------+
So here I have to iterate over the rows of the 1st dataframe (James|M|60000), compare each with the corresponding row of the 2nd dataframe (Anurag|M|70000), and so on, and print the output in a formatted way if any mismatch is found.
eg:
1st df: 'James|M|60000'
2nd df: 'Anurag|M|70000'
output: Mismatch: name--> James, name_1--> Anurag
salary--> 60000, salary_1--> 70000 ...and so on.
Please let me know if any other info is required.
I am very new to PySpark, so I need your help. Thanks in advance.
The code below is working fine for me:
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import col


def print_mismatch(row):
    output = ""
    for i in range(len(source_cols)):
        output += f"Row Index--> {row['id']}, "
        if row[source_cols[i]] != row[target_cols[i]]:
            output += f"{source_cols[i]}--> {row[source_cols[i]]}, {target_cols[i]}--> {row[target_cols[i]]} "
    return output
spark = SparkSession \
    .builder \
    .appName("SparkExample") \
    .getOrCreate()
data = [("James", "M", 60000), ("Michael", "M", 70000),
("Robert", None, 400000), ("Maria", "F", 500000),
("Jen", "", None), (None, None, None)]
columns = ["name", "gender", "salary"]
source_df = spark.createDataFrame(data=data, schema=columns)
rdd_df = source_df.rdd.zipWithIndex()
source_df = rdd_df.toDF().select(col("_1.*"), col("_2").alias("id"))
data1 = [("Anurag", "M", 70000), ("Michael", "M", 70000),
("Sunil", None, 900000), ("Maria", "F", 500000),
("Jen", "", None), (None, None, None)]
columns = ["name_1", "gender_1", "salary_1"]
target_df = spark.createDataFrame(data=data1, schema=columns)
rdd_df = target_df.rdd.zipWithIndex()
target_df = rdd_df.toDF().select(col("_1.*"), col("_2").alias("id1"))
final_df = source_df.join(target_df, source_df.id == target_df.id1)
source_cols = source_df.columns
target_cols = target_df.columns
final_column = source_cols + target_cols

df = spark.createDataFrame([], schema=final_df.schema)
for i in range(len(source_cols)):
    final = final_df.filter(
        final_df[f'{source_cols[i]}'] != final_df[f'{target_cols[i]}'])
    df = df.union(final)
    # final.collect()

final_df.show()
# df.show()

df_rdd = df.rdd
df_rdd.map(print_mismatch).collect()
You can achieve the same as follows:
from pyspark.sql import SparkSession
from pyspark.sql.functions import monotonically_increasing_id, col


def print_mismatch(row):
    output = ""
    for i in range(len(source_cols)):
        if row[source_cols[i]] != row[target_cols[i]]:
            output += f"{source_cols[i]}--> {row[source_cols[i]]}, {target_cols[i]}--> {row[target_cols[i]]} "
    print(output)
spark = SparkSession \
    .builder \
    .appName("SparkExample") \
    .getOrCreate()
data = [("James", "M", 60000), ("Michael", "M", 70000),
("Robert", None, 400000), ("Maria", "F", 500000),
("Jen", "", None), (None, None, None)]
columns = ["name", "gender", "salary"]
source_df = spark.createDataFrame(data=data, schema=columns)
rdd_df = source_df.rdd.zipWithIndex()
source_df = rdd_df.toDF().select(col("_1.*"), col("_2").alias("id"))
# source_df = source_df.withColumn("id", monotonically_increasing_id())
source_df.show()
data1 = [("Anurag", "M", 70000), ("Michael", "M", 70000),
("Sunil", None, 900000), ("Maria", "F", 500000),
("Jen", "", None), (None, None, None)]
columns = ["name_1", "gender_1", "salary_1"]
target_df = spark.createDataFrame(data=data1, schema=columns)
rdd_df = target_df.rdd.zipWithIndex()
target_df = rdd_df.toDF().select(col("_1.*"), col("_2").alias("id"))
# target_df = target_df.withColumn("id", monotonically_increasing_id())
target_df.show()
final_df = source_df.join(target_df, source_df.id == target_df.id)
final_df.show()
source_cols = source_df.columns
target_cols = target_df.columns
final_df.foreach(lambda row: print_mismatch(row))
Output:
name--> James, name_1--> Anurag salary--> 60000, salary_1--> 70000
name--> Robert, name_1--> Sunil salary--> 400000, salary_1--> 900000
Run it with:
spark-submit --master local spark_combining_df.py
I am trying to capitalize some words in a column in my spark dataframe. The words are all in a list.
val wrds = List("usa", "gb")
val dF = List(
(1, "z",3, "Bob lives in the usa"),
(4, "t", 2, "gb is where Beth lives"),
(5, "t", 2, "ogb")
).toDF("id", "name", "thing", "country")
I would like to have an output of
val dF = List(
(1, "z",3, "Bob lives in the USA"),
(4, "t", 2, "GB is where Beth lives")
(5, "t", 2, "ogb")
).toDF("id", "name", "thing", "country")
It seems I have to do a string split on the column and then capitalize a part only if it matches a word in the list. I am mostly struggling with row 3, where I do not want to capitalize ogb even though it contains gb. Could anyone point me in the right direction?
import org.apache.spark.sql.functions._
val words = Array("usa","gb")
val df = List(
(1, "z",3, "Bob lives in the usa"),
(4, "t", 2, "gb is where Beth lives"),
(5, "t", 2, "ogb")
).toDF("id", "name", "thing", "country")
// Fold over the word list, replacing each whole-word match with its uppercase form.
// The \b word boundaries ensure only standalone words are replaced, so "ogb" in row 3 is left untouched.
val replaced = words.foldLeft(df){
  case (adf, word) =>
    adf.withColumn("country", regexp_replace($"country", "(\\b" + word + "\\b)", word.toUpperCase))
}
replaced.show
Output:
+---+----+-----+--------------------+
| id|name|thing| country|
+---+----+-----+--------------------+
| 1| z| 3|Bob lives in the USA|
| 4| t| 2|GB is where Beth ...|
| 5| t| 2| ogb|
+---+----+-----+--------------------+
I have the following two datasets:
val dfA = Seq(
("001", "10", "Cat"),
("001", "20", "Dog"),
("001", "30", "Bear"),
("002", "10", "Mouse"),
("002", "20", "Squirrel"),
("002", "30", "Turtle"),
).toDF("Package", "LineItem", "Animal")
val dfB = Seq(
("001", "", "X", "A"),
("001", "", "Y", "B"),
("002", "", "X", "C"),
("002", "", "Y", "D"),
("002", "20", "X" ,"E")
).toDF("Package", "LineItem", "Flag", "Category")
I need to join them with specific conditions:
a) There is always a row in dfB with the X flag and empty LineItem which should be the default Category for the Package from dfA
b) When there is a LineItem specified in dfB, the default Category should be overwritten with the Category associated with this LineItem
Expected output:
+---------+----------+----------+----------+
| Package | LineItem | Animal | Category |
+---------+----------+----------+----------+
| 001 | 10 | Cat | A |
+---------+----------+----------+----------+
| 001 | 20 | Dog | A |
+---------+----------+----------+----------+
| 001 | 30 | Bear | A |
+---------+----------+----------+----------+
| 002 | 10 | Mouse | C |
+---------+----------+----------+----------+
| 002 | 20 | Squirrel | E |
+---------+----------+----------+----------+
| 002 | 30 | Turtle | C |
+---------+----------+----------+----------+
I spent some time on it today, but I have no idea how it could be accomplished. I appreciate your assistance.
Thanks!
You can use two joins plus a when clause: dfC below collects the Category for every (Package, LineItem) that has an explicit entry with the X flag in dfB, and dfD then left-joins both the package-level default and that override to dfA, preferring the override when it exists:
val dfC = dfA
  .join(dfB, dfB.col("Flag") === "X" && dfA.col("LineItem") === dfB.col("LineItem") && dfA.col("Package") === dfB.col("Package"))
  .select(dfA.col("Package").as("priorPackage"), dfA.col("LineItem").as("priorLineItem"), dfB.col("Category").as("priorCategory"))
  .as("dfC")

val dfD = dfA
  .join(dfB, dfB.col("LineItem") === "" && dfB.col("Flag") === "X" && dfA.col("Package") === dfB.col("Package"), "left_outer")
  .join(dfC, dfA.col("LineItem") === dfC.col("priorLineItem") && dfA.col("Package") === dfC.col("priorPackage"), "left_outer")
  .select(
    dfA.col("Package"),
    dfA.col("LineItem"),
    dfA.col("Animal"),
    when(dfC.col("priorCategory").isNotNull, dfC.col("priorCategory")).otherwise(dfB.col("Category")).as("Category")
  )
dfD.show()
This should work for you:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
val dfA = Seq(
("001", "10", "Cat"),
("001", "20", "Dog"),
("001", "30", "Bear"),
("002", "10", "Mouse"),
("002", "20", "Squirrel"),
("002", "30", "Turtle")
).toDF("Package", "LineItem", "Animal")
val dfB = Seq(
("001", "", "X", "A"),
("001", "", "Y", "B"),
("002", "", "X", "C"),
("002", "", "Y", "D"),
("002", "20", "X" ,"E")
).toDF("Package", "LineItem", "Flag", "Category")
val result = {
  dfA.as("a")
    .join(dfB.where('Flag === "X").as("b"), $"a.Package" === $"b.Package" and ($"a.LineItem" === $"b.LineItem" or $"b.LineItem" === ""), "left")
    .withColumn("anyRowsInGroupWithBLineItemDefined", first(when($"b.LineItem" =!= "", lit(true)), ignoreNulls = true).over(Window.partitionBy($"a.Package", $"a.LineItem")).isNotNull)
    .where(!$"anyRowsInGroupWithBLineItemDefined" or ($"anyRowsInGroupWithBLineItemDefined" and $"b.LineItem" =!= ""))
    .select($"a.Package", $"a.LineItem", $"a.Animal", $"b.Category")
}
result.orderBy($"a.Package", $"a.LineItem").show(false)
// +-------+--------+--------+--------+
// |Package|LineItem|Animal |Category|
// +-------+--------+--------+--------+
// |001 |10 |Cat |A |
// |001 |20 |Dog |A |
// |001 |30 |Bear |A |
// |002 |10 |Mouse |C |
// |002 |20 |Squirrel|E |
// |002 |30 |Turtle |C |
// +-------+--------+--------+--------+
The "tricky" part is calculating whether or not there are any rows with LineItem defined in dfB for a given Package, LineItem in dfA. You can see how I perform this calculation in anyRowsInGroupWithBLineItemDefined which involves the use of a window function. Other than that, it's just a normal SQL programming exercise.
Also want to note that this code should be more efficient than the other solution as here we only shuffle the data twice (during join and during window function) and only read in each dataset once.
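If the window-function idiom is new to you, here is a minimal, self-contained sketch of the same trick on a hypothetical toy DataFrame (the names grp and tag are made up for illustration): for every row it flags whether any row in its group satisfies a predicate.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Toy data: for each group, flag whether *any* row has a non-empty tag.
// (Run in spark-shell; otherwise import spark.implicits._ for toDF and $.)
val toy = Seq(("g1", ""), ("g1", "x"), ("g2", "")).toDF("grp", "tag")

val flagged = toy.withColumn(
  "groupHasTag",
  // first(..., ignoreNulls = true) over the group picks up a literal true
  // from any qualifying row; isNotNull then turns "no such row" into false.
  first(when($"tag" =!= "", lit(true)), ignoreNulls = true)
    .over(Window.partitionBy($"grp"))
    .isNotNull
)

flagged.show()
// Both g1 rows get groupHasTag = true; the g2 row gets false.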
I have millions of rows in a DataFrame like this:
val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE")).toDF("id", "status")
scala> df.show(false)
+---+--------+
|id |status |
+---+--------+
|id1|ACTIVE |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE |
|id3|INACTIVE|
|id3|INACTIVE|
+---+--------+
Now I want to divide this data into three separate DataFrames like this:
Only ACTIVE ids (like id2), say activeDF
Only INACTIVE ids (like id3), say inactiveDF
Ids having both ACTIVE and INACTIVE as status, say bothDF
How can I calculate activeDF and inactiveDF?
I know that bothDF can be calculated like
df.select("id").distinct.except(activeDF).except(inactiveDF)
, but this will involve shuffling (as the 'distinct' operation requires one). Is there a better way to calculate bothDF?
Versions:
Spark : 2.2.1
Scala : 2.11
The most elegant solution is to pivot on status:
val counts = df
  .groupBy("id")
  .pivot("status", Seq("ACTIVE", "INACTIVE"))
  .count
or an equivalent direct agg:
val counts = df
  .groupBy("id")
  .agg(
    count(when($"status" === "ACTIVE", true)) as "ACTIVE",
    count(when($"status" === "INACTIVE", true)) as "INACTIVE"
  )
followed by a simple CASE ... WHEN:
val result = counts.withColumn(
  "status",
  when($"ACTIVE" === 0, "INACTIVE")
    .when($"INACTIVE" === 0, "ACTIVE")
    .otherwise("BOTH")
)
result.show
+---+------+--------+--------+
| id|ACTIVE|INACTIVE| status|
+---+------+--------+--------+
|id3| 0| 2|INACTIVE|
|id1| 1| 2| BOTH|
|id2| 1| 0| ACTIVE|
+---+------+--------+--------+
Later you can separate the result with filters, or dump it to disk with a source that supports partitionBy (see How to split a dataframe into dataframes with same column values?).
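For example, a minimal sketch of both options, working on the result DataFrame computed above (the output path is just a placeholder):
// Split into the three DataFrames of ids by filtering on the derived status column.
val activeDF   = result.filter($"status" === "ACTIVE").select("id")
val inactiveDF = result.filter($"status" === "INACTIVE").select("id")
val bothDF     = result.filter($"status" === "BOTH").select("id")

// Or write everything once, partitioned by status, so each bucket lands in its own directory.
result.write
  .partitionBy("status")
  .parquet("/tmp/ids_by_status")  // placeholder path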
Just another way: groupBy, collect the statuses as a set, and then if the size of the set is 1 it is ACTIVE-only or INACTIVE-only, else BOTH.
scala> val df = Seq(("id1", "ACTIVE"), ("id1", "INACTIVE"), ("id1", "INACTIVE"), ("id2", "ACTIVE"), ("id3", "INACTIVE"), ("id3", "INACTIVE"), ("id4", "ACTIVE"), ("id5", "ACTIVE"), ("id6", "INACTIVE"), ("id7", "ACTIVE"), ("id7", "INACTIVE")).toDF("id", "status")
df: org.apache.spark.sql.DataFrame = [id: string, status: string]
scala> df.show(false)
+---+--------+
|id |status |
+---+--------+
|id1|ACTIVE |
|id1|INACTIVE|
|id1|INACTIVE|
|id2|ACTIVE |
|id3|INACTIVE|
|id3|INACTIVE|
|id4|ACTIVE |
|id5|ACTIVE |
|id6|INACTIVE|
|id7|ACTIVE |
|id7|INACTIVE|
+---+--------+
scala> val allstatusDF = df.groupBy("id").agg(collect_set("status") as "allstatus")
allstatusDF: org.apache.spark.sql.DataFrame = [id: string, allstatus: array<string>]
scala> allstatusDF.show(false)
+---+------------------+
|id |allstatus |
+---+------------------+
|id7|[ACTIVE, INACTIVE]|
|id3|[INACTIVE] |
|id5|[ACTIVE] |
|id6|[INACTIVE] |
|id1|[ACTIVE, INACTIVE]|
|id2|[ACTIVE] |
|id4|[ACTIVE] |
+---+------------------+
scala> allstatusDF.withColumn("status", when(size($"allstatus") === 1, $"allstatus".getItem(0)).otherwise("BOTH")).show(false)
+---+------------------+--------+
|id |allstatus |status |
+---+------------------+--------+
|id7|[ACTIVE, INACTIVE]|BOTH |
|id3|[INACTIVE] |INACTIVE|
|id5|[ACTIVE] |ACTIVE |
|id6|[INACTIVE] |INACTIVE|
|id1|[ACTIVE, INACTIVE]|BOTH |
|id2|[ACTIVE] |ACTIVE |
|id4|[ACTIVE] |ACTIVE |
+---+------------------+--------+
I am looking for a way to get a new column in a Scala DataFrame that calculates the min/max of the values in col1, col2, ..., col10 for each row.
I know I can do it with a UDF, but maybe there is an easier way.
Thanks!
Porting this Python answer by user6910411
import org.apache.spark.sql.functions._
val df = Seq(
(1, 3, 0, 9, "a", "b", "c")
).toDF("col1", "col2", "col3", "col4", "col5", "col6", "Col7")
val cols = Seq("col1", "col2", "col3", "col4")
val rowMax = greatest(cols map col: _*).alias("max")
val rowMin = least(cols map col: _*).alias("min")
df.select($"*", rowMin, rowMax).show
// +----+----+----+----+----+----+----+---+---+
// |col1|col2|col3|col4|col5|col6|Col7|min|max|
// +----+----+----+----+----+----+----+---+---+
// | 1| 3| 0| 9| a| b| c|0.0|9.0|
// +----+----+----+----+----+----+----+---+---+
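For the full col1 through col10 case in the question, the column list can be generated rather than typed out. A small sketch, assuming a DataFrame (here called wideDf, a hypothetical name) that actually has numeric columns col1 ... col10:
import org.apache.spark.sql.functions.{col, greatest, least}

// Build the ten column names programmatically instead of listing them by hand.
val targetCols = (1 to 10).map(i => s"col$i")

val withMinMax = wideDf.select(
  $"*",
  least(targetCols.map(col): _*).alias("min"),
  greatest(targetCols.map(col): _*).alias("max")
)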
How do I transpose rows to columns using an RDD or DataFrame, without pivot?
SessionId,date,orig,dest,legind,nbr
1,9/20/16,abc0,xyz0,o,1
1,9/20/16,abc1,xyz1,o,2
1,9/20/16,abc2,xyz2,i,3
1,9/20/16,abc3,xyz3,i,4
So I want to generate a new schema like:
SessionId,date,orig1, orig2, orig3, orig4, dest1, dest2, dest3,dest4
1,9/20/16,abc0,abc1,null, null, xyz0,xyz1, null, null
The logic is:
if nbr is 1 and legind = o, then take the orig1 value (fetched from row 1) ...
if nbr is 3 and legind = i, then take the dest1 value (fetched from row 3)
So how can I transpose the rows to columns?
Any ideas will be greatly appreciated.
I tried the option below, but it just flattens everything into a single row:
val keys = List("SessionId");
val selectFirstValueOfNoneGroupedColumns =
df.columns
.filterNot(keys.toSet)
.map(_ -> "first").toMap
val grouped =
df.groupBy(keys.head, keys.tail: _*)
.agg(selectFirstValueOfNoneGroupedColumns).show()
It is relatively simple if you use the pivot function. First let's create a data set like the one in your question:
import org.apache.spark.sql.functions.{concat, first, lit, when}
val df = Seq(
("1", "9/20/16", "abc0", "xyz0", "o", "1"),
("1", "9/20/16", "abc1", "xyz1", "o", "2"),
("1", "9/20/16", "abc2", "xyz2", "i", "3"),
("1", "9/20/16", "abc3", "xyz3", "i", "4")
).toDF("SessionId", "date", "orig", "dest", "legind", "nbr")
Then define and attach the helper columns:
// This will be the column name
val key = when($"legind" === "o", concat(lit("orig"), $"nbr"))
  .when($"legind" === "i", concat(lit("dest"), $"nbr"))

// This will be the value
val value = when($"legind" === "o", $"orig") // If o, take orig
  .when($"legind" === "i", $"dest")          // If i, take dest

val withKV = df.withColumn("key", key).withColumn("value", value)
This will result in a DataFrame like this:
+---------+-------+----+----+------+---+-----+-----+
|SessionId| date|orig|dest|legind|nbr| key|value|
+---------+-------+----+----+------+---+-----+-----+
| 1|9/20/16|abc0|xyz0| o| 1|orig1| abc0|
| 1|9/20/16|abc1|xyz1| o| 2|orig2| abc1|
| 1|9/20/16|abc2|xyz2| i| 3|dest3| xyz2|
| 1|9/20/16|abc3|xyz3| i| 4|dest4| xyz3|
+---------+-------+----+----+------+---+-----+-----+
Next let's define a list of possible levels:
val levels = Seq("orig", "dest").flatMap(x => (1 to 4).map(y => s"$x$y"))
and finally pivot:
val result = withKV
  .groupBy($"sessionId", $"date")
  .pivot("key", levels)
  .agg(first($"value", true)).show
And the result is:
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
|sessionId| date|orig1|orig2|orig3|orig4|dest1|dest2|dest3|dest4|
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+
| 1|9/20/16| abc0| abc1| null| null| null| null| xyz2| xyz3|
+---------+-------+-----+-----+-----+-----+-----+-----+-----+-----+