pyspark processing & compare 2 dataframes - pyspark

I am working in PySpark (Spark 2.2.0) with 2 dataframes that have common columns. The requirement I am dealing with is as follows: join the 2 frames according to the rules below.
frame1 = [Column 1, Column 2, Column 3....... column_n] ### dataframe
frame2 = [Column 1, Column 2, Column 3....... column_n] ### dataframe
key = [Column 1, Column 2] ### is an array
If frame1.[Column1, column2] == frame2.[Column1, column2]
    if frame1.column_n == frame2.column_n
        write to a new data frame DF_A using values from frame 2 as is
    if frame1.column_n != frame2.column_n
        write to a new data frame DF_A using values from frame 1 as is
        write to a new data frame DF_B using values from frame 2 but with column3, & column 5 hard coded values
To do this, I am first creating 2 temp views and constructing 3 SQLs dynamically.
sql_1 = select frame1.* from frame1 join frame2 on [frame1.keys] = [frame2.keys]
where frame1.column_n=frame2.column_n
DF_A = sqlContext.sql(sql_1)
sql_2 = select [all columns from frame1] from frame1 join frame2 on [frame1.keys] = [frame2.keys]
where frame1.column_n != frame2.column_n
DF_A = DF_A.union(sqlContext.sql(sql_2))
sql_3 = select [all columns from frame2 except for column3 & column5 to be hard coded] from frame1 join frame2 on [frame1.keys] = [frame2.keys]
where frame1.column_n != frame2.column_n
DF_B = sqlContext.sql(sql_3)
Question 1: Is there a better way to dynamically pass key columns for joining? I am currently maintaining the key columns in arrays (which works) and constructing the SQL from them.
Question 2: Is there a better way to dynamically pass the selection columns without changing the column order? I am currently maintaining the column names in an array and concatenating them.
I did consider a single full outer join, but since the column names are the same in both frames, I thought renaming would add more overhead.

For questions 1 and 2, I went with getting the column names from the dataframe schema (df.schema.names and df.columns) and doing the string processing inside a loop.
For the logic, I went with a minimum of 2 SQLs, one of them with a full outer join.
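A minimal sketch of that string processing, assuming the key columns and the ordered column list are available as Python lists (for example from df.columns). The helper names build_join_condition and build_select_list, the sample column list, and the literal 'FIXED' are hypothetical and only illustrate the idea; backticks are used so that column names containing spaces stay valid in Spark SQL.

```python
def build_join_condition(keys, left="frame1", right="frame2"):
    """Return an ON clause that equates every key column of both views."""
    if not keys:
        raise ValueError("at least one key column is required")
    return " AND ".join(f"{left}.`{k}` = {right}.`{k}`" for k in keys)


def build_select_list(columns, alias, overrides=None):
    """Return a SELECT list that keeps `columns` in their given order.

    `overrides` maps a column name to a SQL literal or expression that
    replaces the column value (used for the hard-coded columns of DF_B).
    """
    overrides = overrides or {}
    parts = []
    for c in columns:
        if c in overrides:
            parts.append(f"{overrides[c]} AS `{c}`")
        else:
            parts.append(f"{alias}.`{c}`")
    return ", ".join(parts)


# Hypothetical inputs; in practice these could come from frame2.columns.
keys = ["Column 1", "Column 2"]
columns = ["Column 1", "Column 2", "Column 3", "column_n"]

select_b = build_select_list(columns, "frame2", {"Column 3": "'FIXED'"})
sql_3 = (
    f"SELECT {select_b} FROM frame1 JOIN frame2 "
    f"ON {build_join_condition(keys)} "
    "WHERE frame1.`column_n` != frame2.`column_n`"
)
print(sql_3)
```

If the SQL text is not required, the DataFrame API also accepts a list of column names directly, as in frame1.join(frame2, on=keys), which avoids building the ON clause by hand.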

Related

Is there any pyspark code to join two data frames and update null column values in 1st dataframe by referring 2nd dataframe column values

I am trying to convert the following UPDATE SQL query into PySpark.
UPDATE tblTCincompletes INNER JOIN tblMAPCIMdata ON tblTCincompletes.CRM_Student_ID = tblMAPCIMdata.ContactID SET
tblMAPCIMdata.result_type = [tblTCincompletes].[result_type], tblMAPCIMdata.result_name = [tblTCincompletes].[result_name]
WHERE (((tblMAPCIMdata.result_type) Is Null) AND ((tblMAPCIMdata.result_name) Is Null));

Does Spark support the cascaded query below?

I have one requirement to run some queries against some tables in the postgresql database to populate a dataframe. Tables are as following.
table 1 has the below data.
QueryID, WhereClauseID, Enabled
1, 1, true
2, 2, true
3, 3, true
...
table 2 has the below data.
WhereClauseID, WhereClauseString
1, a>b
2, a>c
3, a>b && a<c
...
table 3 has the below data.
a, b, c, value
30, 20, 30, 100
20, 10, 40, 200
...
I want to query in the following way. For table 1, I want to pick up the rows when Enabled is true. Based on the WhereClauseID in each row, I want to pick up the rows in table 2. Based on the WhereClause condition picked up from table 2, I want to run the query using Where Clause to query table 3 to get the Value. Finally, I want to get all records in table 3 meeting the WhereClauses enabled in table 1.
I know I can go through table 1 row by row and use a parameterized string to build the SQL query against table 3. But I think querying row by row is very inefficient, especially if table 1 is big. Is there a better way to organize the query to improve the efficiency? Thanks a lot!
Depending on your use case, you might be able to solve this in PySpark using the .when function.
Here is a suggestion.
import pyspark.sql.functions as F
tbl1 = spark.table("table1")
tbl3 = spark.table("table3")
tbl3 = (
    tbl3
    .withColumn("WhereClauseID",
        ## You can do some fancy parsing of your tbl2
        ## here if you want this to be evaluated programmatically from your table2.
        ## Note: a when chain keeps only the first matching branch for each row.
        (
            F.when(F.col("a") > F.col("b"), 1)
            .when(F.col("a") > F.col("c"), 2)
            .otherwise(-1)
        )
    )
)
tbl1_with_tbl_3 = tbl1.join(tbl3, "WhereClauseID", "left")
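As a rough alternative sketch, the enabled clauses from table 2 could be combined into one predicate string so that table 3 is scanned once instead of once per row of table 1. This assumes that table 1 and table 2 are small enough to collect to the driver, that `&&` in WhereClauseString means logical AND, and that the clause strings are trusted. The helper names to_sql_predicate and combine_where_clauses are hypothetical.

```python
def to_sql_predicate(clause):
    """Rewrite the '&&' notation from table 2 as SQL AND and tidy the spacing."""
    return " ".join(clause.replace("&&", " AND ").split())


def combine_where_clauses(enabled_ids, clauses_by_id):
    """OR together the predicates of every enabled WhereClauseID."""
    predicates = [
        "(" + to_sql_predicate(clauses_by_id[i]) + ")"
        for i in enabled_ids
        if i in clauses_by_id
    ]
    # With nothing enabled, return a predicate that matches no rows.
    return " OR ".join(predicates) if predicates else "1 = 0"


# Sample rows from the question: (QueryID, WhereClauseID, Enabled).
table1_rows = [(1, 1, True), (2, 2, True), (3, 3, True)]
table2_rows = {1: "a>b", 2: "a>c", 3: "a>b && a<c"}

enabled_ids = [where_id for (_, where_id, enabled) in table1_rows if enabled]
predicate = combine_where_clauses(enabled_ids, table2_rows)
print(predicate)
```

The resulting string could then be passed to tbl3.filter(predicate), since DataFrame.filter accepts a SQL expression string.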

How to Compare columns of two tables using Spark?

I am trying to compare two tables by reading them as DataFrames. For each common column in those tables, I compare the concatenation of a primary key, say order_id, with the other columns such as order_date, order_name, and order_event.
The Scala code I am using:
val primaryKey = "order_id"
for (i <- commonColumnsList){
  val column_name = i
  val tempDataFrameForNew = newDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
  val tempDataFrameOld = oldDataFrame.selectExpr(s"concat($primaryKey,$i) as concatenated")
  //Get those records which are common in both old/new tables
  matchCountCalculated = tempDataFrameForNew.intersect(tempDataFrameOld)
  //Get those records which aren't common in both old/new tables
  nonMatchCountCalculated = tempDataFrameOld.unionAll(tempDataFrameForNew).except(matchCountCalculated)
  //Total Null/Non-Null Counts in both old and new tables.
  nullsCountInNewDataFrame = newDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
  nullsCountInOldDataFrame = oldDataFrame.select(s"$i").filter(x => x.isNullAt(0)).count().toInt
  nonNullsCountInNewDataFrame = newDFCount - nullsCountInNewDataFrame
  nonNullsCountInOldDataFrame = oldDFCount - nullsCountInOldDataFrame
  //Put the result for a given column in a Seq variable, later convert it to Dataframe.
  tempSeq = tempSeq :+ Row(column_name, matchCountCalculated.toString, nonMatchCountCalculated.toString, (nullsCountInNewDataFrame - nullsCountInOldDataFrame).toString,
    (nonNullsCountInNewDataFrame - nonNullsCountInOldDataFrame).toString)
}
// Final Step: Create DataFrame using Seq and some Schema.
spark.createDataFrame(spark.sparkContext.parallelize(tempSeq), schema)
The above code works fine for a medium-sized data set, but as the number of columns and records in my new and old tables increases, the execution time grows. Any sort of advice is appreciated.
Thank you in Advance.
You can do the following:
1. Outer join the old and new dataframe on the primary key
val joined_df = df_old.join(df_new, Seq(primary_key), "outer")
2. Cache it if you possibly can. This will save you a lot of time
3. Now you can iterate over columns and compare columns using spark functions (.isNull for not matched, == for matched etc)
for (col <- df_new.columns){
  val matchCount = joined_df.filter(df_new(col).isNotNull && df_old(col).isNotNull).count()
  val nonMatchCount = ...
}
This should be considerably faster, especially when you can cache your dataframe. If you can't, it might be a good idea to save the joined dataframe to disk in order to avoid a shuffle each time.
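As a rough sketch of taking this one step further, the per-column counts could also be computed in a single aggregation over the joined data rather than with one count() per column. The snippet below only builds the Spark SQL text in Python. The view names old_t and new_t, the aliases o and n, the helper names, and the chosen definitions of "match" and "both null" are all assumptions made for illustration.

```python
def comparison_exprs(columns, old="o", new="n"):
    """Build two conditional-count expressions per compared column."""
    exprs = []
    for c in columns:
        # Rows where both sides hold the same non-null value.
        exprs.append(
            f"SUM(CASE WHEN {old}.`{c}` = {new}.`{c}` THEN 1 ELSE 0 END) AS `{c}_match`"
        )
        # Rows where the value is missing on both sides.
        exprs.append(
            f"SUM(CASE WHEN {old}.`{c}` IS NULL AND {new}.`{c}` IS NULL "
            f"THEN 1 ELSE 0 END) AS `{c}_both_null`"
        )
    return exprs


def build_comparison_sql(primary_key, columns, old_view="old_t", new_view="new_t"):
    """Return one query that aggregates all column comparisons in a single pass."""
    select_list = ", ".join(comparison_exprs(columns))
    return (
        f"SELECT {select_list} "
        f"FROM {old_view} o FULL OUTER JOIN {new_view} n "
        f"ON o.`{primary_key}` = n.`{primary_key}`"
    )


query = build_comparison_sql("order_id", ["order_date", "order_name"])
print(query)
```

The query returns a single row with one pair of counts per column, so the join is evaluated once however many columns are compared.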

Extract and Replace values from duplicates rows in PySpark Data Frame

I have duplicate rows in a PySpark data frame that may contain the same data or have missing values.
The code that I wrote is very slow and does not run in a distributed way.
Does anyone know how to retain a single unique value from duplicate rows in a PySpark DataFrame in a way that runs distributed and with a fast processing time?
I have written complete PySpark code, and this code works correctly.
But the processing time is really slow, and it is not possible to use it on a Spark cluster.
'''
# Columns of duplicate Rows of DF
dup_columns = df.columns
for row_value in df_duplicates.rdd.toLocalIterator():
    print(row_value)
    # Match duplicates using std name and create RDD
    fill_duplicated_rdd = ((df.where((sf.col("stdname") == row_value['stdname'] ))
                            .where(sf.col("stdaddress")== row_value['stdaddress']))
                            .rdd.map(fill_duplicates))
    # Creating feature names for the same RDD
    fill_duplicated_rdd_col_names = (((df.where((sf.col("stdname") == row_value['stdname']) &
                                    (sf.col("stdaddress")== row_value['stdaddress'])))
                                    .rdd.map(fill_duplicated_columns_extract)).first())
    # Creating DF using the previous RDD
    # This DF stores value of a single set of matching duplicate rows
    df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
    for column in df_streamline.columns:
        try:
            col_value = ([str(value[column]) for value in
                          df_streamline.select(col(column)).distinct().rdd.toLocalIterator() if value[column] != ""])
            if len(col_value) >= 1:
                # non null or empty value of a column store here
                # This value is a no duplicate distinct value
                col_value = col_value[0]
                #print(col_value)
                # The non-duplicate distinct value of the column is stored back to
                # replace any rows in the PySpark DF that were empty.
                df_dedup = (df_dedup
                            .withColumn(column,sf.when((sf.col("stdname") == row_value['stdname'])
                                                       & (sf.col("stdaddress")== row_value['stdaddress'])
                                                       ,col_value)
                                        .otherwise(df_dedup[column])))
                #print(col_value)
        except:
            print("None")
'''
There are no error messages, but the code runs very slowly. I want a solution that fills empty rows in the PySpark DF with the unique values. It could even fill the rows with the mode of the value.
"""
df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
for column in df_streamline.columns:
    try:
        # distinct() was replaced by isNOTNULL().limit(1).take(1) to improve the speed of the code and extract values of the row.
        col_value = df_streamline.select(column).where(sf.col(column).isNotNull()).limit(1).take(1)[0][column]
        df_dedup = (df_dedup
                    .withColumn(column,sf.when((sf.col("stdname") == row_value['stdname'])
                                               & (sf.col("stdaddress")== row_value['stdaddress'])
                                               ,col_value)
                                .otherwise(df_dedup[column])))
"""
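One possible way to avoid the driver-side loop altogether, sketched here under assumptions, is to group by the duplicate keys and take the first non-empty value of every other column in a single aggregation. This is a different technique from the row-by-row code above. It assumes that stdname and stdaddress identify a duplicate group, that the other columns are strings where "" means missing, and that the data is registered as a temp view. The view name students, the columns phone and email, and the helper build_dedup_sql are hypothetical. Spark SQL's first(expr, true) skips nulls, and which non-null value it returns depends on row order.

```python
def build_dedup_sql(view, key_cols, columns):
    """Return a GROUP BY query that keeps one non-empty value per column.

    Key columns are selected as is; every other column becomes
    first(nullif(col, ''), true), i.e. the first value that is neither
    NULL nor the empty string. Column order follows `columns`.
    """
    select_parts = []
    for c in columns:
        if c in key_cols:
            select_parts.append(f"`{c}`")
        else:
            select_parts.append(f"first(nullif(`{c}`, ''), true) AS `{c}`")
    group_by = ", ".join(f"`{k}`" for k in key_cols)
    return f"SELECT {', '.join(select_parts)} FROM {view} GROUP BY {group_by}"


# Hypothetical view and column names; in practice use df.columns.
dedup_sql = build_dedup_sql(
    "students",
    ["stdname", "stdaddress"],
    ["stdname", "stdaddress", "phone", "email"],
)
print(dedup_sql)
```

Because the whole operation is one grouped aggregation, Spark can run it across the cluster, and the result has one row per (stdname, stdaddress) pair.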

Appending multiple samples of a column into dataframe in spark

I have n values in a Spark column. I want to create a Spark dataframe of k columns (where k is the number of samples) and m rows (where m is the sample size). I tried using withColumn, but it does not work. Joining on a generated unique id would be very inefficient for me.
For example, the Spark column has the following values:
102
320
11
101
2455
124
I want to create 2 samples of fraction 0.5 as columns in data frame.
So sampled data frame will be something like
sample1,sample2
320,101
124,2455
2455,11
Suppose df has a column UNIQUE_ID_D and I need k samples from this column. Here is the sample code for k = 2:
var df1 = df.select("UNIQUE_ID_D").sample(false, 0.1).withColumnRenamed("UNIQUE_ID_D", "ID_1")
var df2 = df.select("UNIQUE_ID_D").sample(false, 0.1).withColumnRenamed("UNIQUE_ID_D", "ID_2")
df1.withColumn("NEW_UNIQUE_ID", df2.col("ID_2")).show
This won't work, since withColumn cannot access a column of df2.
The only way is to join df1 and df2 by adding a sequence column (join column) to both dataframes.
That is very inefficient for my use case: if I want to take 100 samples, I need to join 100 times in a loop for a single column, and I need to perform this operation for all columns in the original df.
How could I achieve this?