I have the two data frames below, and I am trying to identify the rows in the second data frame whose values do not match the first. This is part of a migration where I want to see the differences after the source data has been migrated/moved to a different destination.
source_df
+---+-----+-----+
|key|val11|val12|
+---+-----+-----+
|abc| 1.1| 1.2|
|def| 3.0| 3.4|
+---+-----+-----+
dest_df
+---+-----+-----+
|key|val11|val12|
+---+-----+-----+
|abc| 2.1| 2.2|
|def| 3.0| 3.4|
+---+-----+-----+
I want to see output something like the below:
key: abc,
col: val11 val12
difference: [src:1.1,dst:2.1] [src:1.2,dst:2.2]
Any solution for this?
source_df = spark.createDataFrame(
[
('abc','1.1','1.2'),
('def','3.0','3.4'),
], ['key','val11','val12']
)
dest_df = spark.createDataFrame(
[
('abc','2.1','2.2'),
('def','3.0','3.4'),
], ['key','val11','val12']
)
import pyspark.sql.functions as F

report = source_df\
.join(dest_df, 'key', 'full')\
.filter((source_df.val11 != dest_df.val11) | (source_df.val12 != dest_df.val12))\
.withColumn('difference_val11', F.concat(F.lit('[src:'), source_df.val11, F.lit(',dst:'),dest_df.val11,F.lit(']')))\
.withColumn('difference_val12', F.concat(F.lit('[src:'), source_df.val12, F.lit(',dst:'),dest_df.val12,F.lit(']')))\
.select('key', 'difference_val11', 'difference_val12')
report.show()
+---+-----------------+-----------------+
|key| difference_val11| difference_val12|
+---+-----------------+-----------------+
|abc|[src:1.1,dst:2.1]|[src:1.2,dst:2.2]|
+---+-----------------+-----------------+
Or, if you want it exactly in that format:
for x in report.select('key', 'difference_val11', 'difference_val12').collect():
    print("key: " + str(x[0]) + ",\n\n" +
          "col: val11 val12\n\n" +
          "difference: " + str(x[1]) + " " + str(x[2]))
Output:
key: abc,
col: val11 val12
difference: [src:1.1,dst:2.1] [src:1.2,dst:2.2]
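If the real tables have many value columns, the same report can be built dynamically instead of hard-coding each column. A minimal PySpark sketch, where value_cols is an assumption you would replace with your own list:

import pyspark.sql.functions as F

value_cols = ['val11', 'val12']  # assumption: the columns to compare

joined = source_df.alias('src').join(dest_df.alias('dst'), 'key', 'full')

mismatch = None   # OR of "source differs from destination" over all compared columns
diff_cols = []    # one formatted difference column per compared column
for c in value_cols:
    neq = F.col(f'src.{c}') != F.col(f'dst.{c}')
    mismatch = neq if mismatch is None else (mismatch | neq)
    diff_cols.append(
        F.concat(F.lit('[src:'), F.col(f'src.{c}'),
                 F.lit(',dst:'), F.col(f'dst.{c}'), F.lit(']')).alias(f'difference_{c}')
    )

report = joined.filter(mismatch).select('key', *diff_cols)
report.show()

Note that, as in the answer above, keys present on only one side of the full join produce nulls and the != filter will not flag them; add explicit null checks if you need those rows too.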
Related
I have a Spark Dataframe which I am trying to summarise in order to find overly long columns:
// Set up test data
// Look for long columns (>=3), ie row 1 is ok, row 2 is bad on column 3, row 3 is bad on column 2
val df = Seq(
( 1, "a", "bb", "cc", "file1" ),
( 2, "d", "ee", "fff", "file2" ),
( 3, "g", "hhhh", "ii", "file3" )
).
toDF("rowId", "col1", "col2", "col3", "filename")
I can summarise the lengths of the columns and find overly long ones like this:
val df2 = df.columns
.map(c => (c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("columnName", "maxLength")
.filter($"maxLength" > 2)
If I try and add the existing filename column to the map I get an error:
val df2 = df.columns
.map(c => ($"filename", c, df.agg(max(length(df(s"$c")))).as[String].first()))
.toSeq.toDF("fn", "columnName", "maxLength")
.filter($"maxLength" > 2)
I have tried a few variations of the $"filename" syntax. How can I incorporate the filename column into the summary? The output I am looking for is:
+----------+---------+--------+
|columnName|maxLength|filename|
+----------+---------+--------+
|      col2|        4|   file3|
|      col3|        3|   file2|
+----------+---------+--------+
The real dataframes have 300+ columns and millions of rows, so I cannot hard-code column names.
@wBob, does the following achieve your goal?
Group by filename and get the maximum length per column:
val cols = df.columns.dropRight(1) // to remove the filename col
val maxLength = cols.map(c => s"max(length(${c})) as ${c}").mkString(",")
print(maxLength)
df.createOrReplaceTempView("temp")
val df1 = spark
.sql(s"select filename, ${maxLength} from temp group by filename")
df1.show()
With the output:
+--------+-----+----+----+----+
|filename|rowId|col1|col2|col3|
+--------+-----+----+----+----+
| file1| 1| 1| 2| 2|
| file2| 1| 1| 2| 3|
| file3| 1| 1| 4| 2|
+--------+-----+----+----+----+
Use subqueries to get the maximum per column and concatenate the results using union:
df1.createOrReplaceTempView("temp2")
val res = cols.map(col => {
spark.sql(s"select '${col}' as columnName, $col as maxLength, filename from temp2 " +
s"where $col = (select max(${col}) from temp2)")
}).reduce(_ union _)
res.show()
With the result:
+----------+---------+--------+
|columnName|maxLength|filename|
+----------+---------+--------+
| rowId| 1| file1|
| rowId| 1| file2|
| rowId| 1| file3|
| col1| 1| file1|
| col1| 1| file2|
| col1| 1| file3|
| col2| 4| file3|
| col3| 3| file2|
+----------+---------+--------+
Note that there are multiple entries for rowId and col1 since the maximum is not unique.
There is probably a more elegant way to write it, but I am struggling to find one at the moment.
Pushed a little further for a better result:
df.select(
col("*"),
array( // make array of columns name/value/length
(for{ col_name <- df.columns } yield
struct(
length(col(col_name)).as("length"),
lit(col_name).as("col"),
col(col_name).cast("String").as("col_value")
)
).toSeq:_* ).alias("rowInfo")
)
.select(
col("rowId"),
explode( // explode array into rows
expr("filter(rowInfo, x -> x.length >= 3)") //filter the array for the length your interested in
).as("rowInfo")
)
.select(
col("rowId"),
col("rowInfo.*") // turn struct fields into columns
)
.sort("length").show
+-----+------+--------+---------+
|rowId|length| col|col_value|
+-----+------+--------+---------+
| 2| 3| col3| fff|
| 3| 4| col2| hhhh|
| 3| 5|filename| file3|
| 1| 5|filename| file1|
| 2| 5|filename| file2|
+-----+------+--------+---------+
It might be enough to sort your table by total text length. This can be achieved quickly and concisely.
df.select(
col("*"),
length( // take the length
concat( //slap all the columns together
(for( col_name <- df.columns ) yield col(col_name)).toSeq:_*
)
)
.as("length")
)
.sort( //order by total length
col("length").desc
).show()
+-----+----+----+----+--------+------+
|rowId|col1|col2|col3|filename|length|
+-----+----+----+----+--------+------+
| 3| g|hhhh| ii| file3| 13|
| 2| d| ee| fff| file2| 12|
| 1| a| bb| cc| file1| 11|
+-----+----+----+----+--------+------+
Sorting an array[struct] sorts on the first field first and the second field next. This works because we put the length of the string up front. If you re-order the fields, you'll get different results. You could easily keep more than one result if you wanted to, but I think discovering that a row is problematic is likely enough.
df.select(
col("*"),
reverse( //sort ascending
sort_array( //sort descending
array( // add all columns lengths to an array
(for( col_name <- df.columns ) yield struct(length(col(col_name)),lit(col_name),col(col_name).cast("String")) ).toSeq:_* )
)
)(0) // grab the row max
.alias("rowMax") )
.sort("rowMax").show
+-----+----+----+----+--------+--------------------+
|rowId|col1|col2|col3|filename| rowMax|
+-----+----+----+----+--------+--------------------+
| 1| a| bb| cc| file1|[5, filename, file1]|
| 2| d| ee| fff| file2|[5, filename, file2]|
| 3| g|hhhh| ii| file3|[5, filename, file3]|
+-----+----+----+----+--------+--------------------+
I have a dataframe with 2 columns (df1). Now I want to merge the column values into one column (df2). How?
Let's say you have a DataFrame like this:
d = [
("Value 1", 1),
("Value 2", 2),
("Value 3", 3),
("Value 4", 4),
("Value 5", 5),
]
df = spark.createDataFrame(d,['col1','col2'])
df.show()
# output
+-------+----+
| col1|col2|
+-------+----+
|Value 1| 1|
|Value 2| 2|
|Value 3| 3|
|Value 4| 4|
|Value 5| 5|
+-------+----+
You can join the columns and format them as you want using the following syntax:
import pyspark.sql.functions as F

(
df.withColumn("newCol",
F.format_string("Col 1: %s Col 2: %s", df.col1, df.col2))
.show(truncate=False)
)
# output
+-------+----+-----------------------+
|col1 |col2|newCol |
+-------+----+-----------------------+
|Value 1|1 |Col 1: Value 1 Col 2: 1|
|Value 2|2 |Col 1: Value 2 Col 2: 2|
|Value 3|3 |Col 1: Value 3 Col 2: 3|
|Value 4|4 |Col 1: Value 4 Col 2: 4|
|Value 5|5 |Col 1: Value 5 Col 2: 5|
+-------+----+-----------------------+
Alternatively, you can simply concatenate the columns with concat:
from pyspark.sql.functions import concat
df1.withColumn("Merge", concat(df1.Column_1, df1.Column_2)).show()
You can use a struct or a map.
struct:
import pyspark.sql.functions as F

df.withColumn(
"price_struct",
F.struct(
(F.col("total_price")*100).alias("amount"),
"total_price_currency",
F.lit("CENTI").alias("unit")
)
)
results in
+-----------+--------------------+--------------------+
|total_price|total_price_currency| price_struct|
+-----------+--------------------+--------------------+
| 79.0| USD|[7900.0, USD, CENTI]|
+-----------+--------------------+--------------------+
or as a map
df
.withColumn("price_map",
F.create_map(
F.lit("currency"), F.col("total_price_currency"),
F.lit("amount"), F.col("total_price")*100,
F.lit("unit"), F.lit("CENTI")
).alias("price_struct")
)
results in
+-----------+--------------------+--------------------+
|total_price|total_price_currency| price_map|
+-----------+--------------------+--------------------+
| 79.0| USD|[currency -> USD,...|
+-----------+--------------------+--------------------+
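Either way, the values can be read back out later: struct fields with dot notation, map entries with getItem. A small sketch, assuming both columns were added as shown above:

import pyspark.sql.functions as F

# struct field access via dot notation
df.select(F.col("price_struct.amount"), F.col("price_struct.unit")).show()

# map value access via getItem (bracket syntax also works)
df.select(
    F.col("price_map").getItem("amount").alias("amount"),
    F.col("price_map")["currency"].alias("currency")
).show()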
I have two dataframes, df1 and df2. I need to add new columns to df1 from df2:
df1
X Y Z
1 2 3
4 5 6
7 8 9
3 6 9
df2
col1 col2
XX aa
YY bb
XX cc
ZZ vv
The values of col1 in df2 should be added as new columns in df1 (if they don't already exist), with col2 providing the values of the new columns. For example:
df1
X Y Z XX YY ZZ
1 2 3 aa bb vv
4 5 6 cc
7 8 9
3 6 9
df2
col1 col2
XX aa
YY bb
XX cc
ZZ vv
First, Spark datasets are made to be distributed, but column names are part of the schema, so they are held in the memory of the master. Thus, to add a column for each distinct value of df2.col1, you first need to get those values onto the master (i.e. collect):
// inputs
val df1 = List((1,2,3), (4,5,6), (7,8,9), (3,6,9)).toDF("X", "Y", "Z")
val df2 = List(("XX", "aa"), ("YY", "bb"), ("XX", "cc"), ("ZZ", "vv")).toDF("col1", "col2")
val newColumns = df2.select("col1").as[String].distinct.collect
val newDF = newColumns.foldLeft(df1)( (df, col) => df.withColumn(col, lit("?")))
newDF.show
+---+---+---+---+---+---+
| X| Y| Z| ZZ| YY| XX|
+---+---+---+---+---+---+
| 1| 2| 3| ?| ?| ?|
| 4| 5| 6| ?| ?| ?|
| 7| 8| 9| ?| ?| ?|
| 3| 6| 9| ?| ?| ?|
+---+---+---+---+---+---+
But:
I don't know what values you want to put in those columns (above, I put "?" everywhere).
If there are a lot of rows in df2, like tens of thousands, collecting them and adding them all as columns to df1 can kill the master.
Now, to go a little further, here is how you can add columns named after df2.col1 and use the concatenated values of df2.col2 as their values:
val toAdd = df2.groupBy("col1").agg(concat_ws(",", collect_set("col2")).as("col2All"))
toAdd.show
+----+-------+
|col1|col2All|
+----+-------+
| ZZ| vv|
| YY| bb|
| XX| cc,aa|
+----+-------+
val newColumns = toAdd.rdd.map(r => (r.getAs[String]("col1"), r.getAs[String]("col2All"))).collectAsMap()
val newDF = newColumns.foldLeft(df1){ case (df, (name, value)) => df.withColumn(name, lit(value))}
newDF.show
+---+---+---+-----+---+---+
| X| Y| Z| XX| YY| ZZ|
+---+---+---+-----+---+---+
| 1| 2| 3|cc,aa| bb| vv|
| 4| 5| 6|cc,aa| bb| vv|
| 7| 8| 9|cc,aa| bb| vv|
| 3| 6| 9|cc,aa| bb| vv|
+---+---+---+-----+---+---+
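The same wide shape can also be produced with a pivot plus a cross join, which avoids collecting and folding by hand (Spark still gathers the distinct col1 values internally to name the columns). A PySpark sketch of the idea; the Scala API offers the same pivot and crossJoin methods:

import pyspark.sql.functions as F

# one column per distinct value of col1, holding the comma-separated col2 values
wide = (df2.groupBy()
           .pivot("col1")
           .agg(F.concat_ws(",", F.collect_set("col2"))))

# wide is a single-row dataframe, so a broadcast cross join is cheap
result = df1.crossJoin(F.broadcast(wide))
result.show()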
I need to filter one table (fixTablehb004_p) based on the values of the same columns in another table (filtredTable109_p).
I first wanted to use this code:
val filtredTablehb004_p = fixTablehb004_p
.where($"servizio_rap" === filtredTable109_p.col("servizio_rap"))
.where($"filiale_rap" === filtredTable109_p.col("filiale_rap"))
.where($"codice_rap" === filtredTable109_p.col("codice_rap"))
But it gave an error.
Then I tried code based on this Stack Overflow question and ended up with the query below. The problem is that the result contains extra columns. I know I can drop(columnName), but I want to ask whether I'm doing this right and whether there is a better option:
val filtredTablehb004_p = sparkSession.sql("SELECT * FROM fixTablehb004_p " +
"JOIN filtredTable109_p " +
"ON fixTablehb004_p.servizio_rap = filtredTable109_p.servizio_rap AND " +
"fixTablehb004_p.filiale_rap = filtredTable109_p.filiale_rap AND " +
"fixTablehb004_p.codice_rap = filtredTable109_p.codice_rap ")
Let's take two sample dataframes and see how we can select the required columns and avoid duplicate key column names in the joined output dataframe.
USING DATAFRAME API:
val df1 = Seq(("A1", "A2", 1), ("A3", "A4", 2), ("A1", "A3", 3))
.toDF("c1", "c2", "c3")
val df2 = Seq(("A1", "A2", 10), ("A3", "A4", 11))
.toDF("c1", "c2", "c4")
df1.createOrReplaceTempView("tab1")
df2.createOrReplaceTempView("tab2")
If the column names used in the join condition are the same in both dataframes, the output dataframe will contain duplicate columns. To avoid this, you can pass those columns as a Seq to join().
df1.join(df2, Seq("c1", "c2")).show()
+---+---+---+---+
| c1| c2| c3| c4|
+---+---+---+---+
| A1| A2| 1| 10|
| A3| A4| 2| 11|
+---+---+---+---+
To select the required columns from a specific dataframe, you can use the syntax below:
df1.join(df2, Seq("c1", "c2")).select('c1, 'c2, df1("c3")).show()
// OR
df1.join(df2, df1("c1") === df2("c1") && df1("c2") === df2("c2"))
.select(df1("c1"), df1("c2"), df1("c3")).show()
+---+---+---+
| c1| c2| c3|
+---+---+---+
| A1| A2| 1|
| A3| A4| 2|
+---+---+---+
USING SQL API:
spark.sql(
"""
|SELECT t2.c1, t2.c2, t2.c4 FROM tab1 t1
|JOIN tab2 t2 ON t1.c1 = t2.c1 AND t1.c2 = t2.c2
|""".stripMargin).show()
//OR
spark.sql(
"""
|SELECT c1, c2, t2.c4 FROM tab1 t1
|JOIN tab2 t2 USING(c1, c2)
|""".stripMargin).show()
+---+---+---+
| c1| c2| c4|
+---+---+---+
| A1| A2| 10|
| A3| A4| 11|
+---+---+---+
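If the goal is only to keep the rows of fixTablehb004_p that have a match, without bringing over any columns from the other table, a left semi join does exactly that and avoids the drop() cleanup. A sketch in PySpark using the column names from the question (the original code is Scala, where the same "left_semi" join type exists):

filtredTablehb004_p = fixTablehb004_p.join(
    filtredTable109_p,
    on=["servizio_rap", "filiale_rap", "codice_rap"],
    how="left_semi",  # keeps only matching rows and only the left table's columns
)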
I have a dataframe where I need to first group the data and then get a weighted average, as shown in the output calculation below. What is an efficient way to do that in PySpark?
data = sc.parallelize([
[111,3,0.4],
[111,4,0.3],
[222,2,0.2],
[222,3,0.2],
[222,4,0.5]]
).toDF(['id', 'val','weight'])
data.show()
+---+---+------+
| id|val|weight|
+---+---+------+
|111| 3| 0.4|
|111| 4| 0.3|
|222| 2| 0.2|
|222| 3| 0.2|
|222| 4| 0.5|
+---+---+------+
Output:
id   weighted_val
111  (3*0.4 + 4*0.3)/(0.4 + 0.3)
222  (2*0.2 + 3*0.2 + 4*0.5)/(0.2 + 0.2 + 0.5)
You can multiply columns weight and val, then aggregate:
import pyspark.sql.functions as F
data.groupBy("id").agg((F.sum(data.val * data.weight)/F.sum(data.weight)).alias("weighted_val")).show()
+---+------------------+
| id| weighted_val|
+---+------------------+
|222|3.3333333333333335|
|111|3.4285714285714293|
+---+------------------+
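If you need to keep the original rows and attach the weighted average as an extra column instead of collapsing to one row per id, the same calculation works over a window. A minimal sketch:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("id")
data.withColumn(
    "weighted_val",
    F.sum(data.val * data.weight).over(w) / F.sum(data.weight).over(w)
).show()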