I received help with the following PySpark code to prevent errors when doing a Merge in Databricks; see here:
Databricks Error: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table conflicting way
I was wondering if I could get help to modify the code to drop NULLs.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
df2 = partdf.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("Id")))
df3 = df2.filter("rn = 1").drop("rn")
Thanks
The code you are using does not remove every row where P_key is null. row_number() is also applied to the null keys, which form a partition of their own, so the row with rn = 1 in that partition survives the filter.
You can use df.na.drop instead to get the required result.
df.na.drop(subset=["P_key"]).show(truncate=False)
To make your own approach work, add a dummy row with a null P_key and the smallest possible unique id, store that id in a variable, run the same de-duplication code, and add a condition to the filter that removes the dummy row, as shown below.
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number,when,col
df = spark.read.option("header",True).csv("dbfs:/FileStore/sample1.csv")
#adding row with least possible id value.
dup_id = '0'
new_row = spark.createDataFrame([[dup_id,'','x','x']], schema = ['id','P_key','c1','c2'])
#replacing empty string with null for P_Key
new_row = new_row.withColumn('P_key',when(col('P_key')=='',None).otherwise(col('P_key')))
df = df.union(new_row) #row added
#code to remove duplicates
df2 = df.withColumn("rn", row_number().over(Window.partitionBy("P_key").orderBy("id")))
df2.show(truncate=False)
#additional condition to remove the added row (matched by its id, which sorts first among the null keys).
df3 = df2.filter((df2.rn == 1) & (df2.id != dup_id)).drop("rn")
df3.show()
Related
I have duplicate rows in a PySpark data frame; the duplicates may contain the same data or have missing values.
The code that I wrote is very slow and does not run as a distributed job.
Does anyone know how to retain a single row of unique values from duplicate rows in a PySpark data frame, in a way that runs distributed and with fast processing time?
I have written complete PySpark code and this code works correctly.
But the processing time is really slow and it's not possible to use it on a Spark cluster.
'''
# Columns of duplicate Rows of DF
dup_columns = df.columns
for row_value in df_duplicates.rdd.toLocalIterator():
    print(row_value)
    # Match duplicates using std name and create RDD
    fill_duplicated_rdd = ((df.where((sf.col("stdname") == row_value['stdname'] ))
                            .where(sf.col("stdaddress")== row_value['stdaddress']))
                           .rdd.map(fill_duplicates))
    # Creating feature names for the same RDD
    fill_duplicated_rdd_col_names = (((df.where((sf.col("stdname") == row_value['stdname']) &
                                                (sf.col("stdaddress")== row_value['stdaddress'])))
                                      .rdd.map(fill_duplicated_columns_extract)).first())
    # Creating DF using the previous RDD
    # This DF stores value of a single set of matching duplicate rows
    df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
    for column in df_streamline.columns:
        try:
            col_value = ([str(value[column]) for value in
                          df_streamline.select(col(column)).distinct().rdd.toLocalIterator() if value[column] != ""])
            if len(col_value) >= 1:
                # non null or empty value of a column store here
                # This value is a no duplicate distinct value
                col_value = col_value[0]
                #print(col_value)
                # The non-duplicate distinct value of the column is stored back to
                # replace any rows in the PySpark DF that were empty.
                df_dedup = (df_dedup
                            .withColumn(column,sf.when((sf.col("stdname") == row_value['stdname'])
                                                       & (sf.col("stdaddress")== row_value['stdaddress'])
                                                       ,col_value)
                                        .otherwise(df_dedup[column])))
                #print(col_value)
        except:
            print("None")
'''
There are no error messages, but the code runs very slowly. I want a solution that fills the empty values in duplicate rows of a PySpark data frame with the unique non-empty value from the matching rows. It could even fill them with the mode of the value.
"""
df_streamline = fill_duplicated_rdd.toDF(fill_duplicated_rdd_col_names)
for column in df_streamline.columns:
    try:
        # distinct() was replaced by isNOTNULL().limit(1).take(1) to improve the speed of the code and extract values of the row.
        col_value = df_streamline.select(column).where(sf.col(column).isNotNull()).limit(1).take(1)[0][column]
        df_dedup = (df_dedup
                    .withColumn(column,sf.when((sf.col("stdname") == row_value['stdname'])
                                               & (sf.col("stdaddress")== row_value['stdaddress'])
                                               ,col_value)
                                .otherwise(df_dedup[column])))
"""
I have another solution, but I prefer to use PySpark 2.3 to do it.
I have a two dimensional PySpark data frame like this:
Date | ID
---------- | ----
08/31/2018 | 10
09/31/2018 | 10
09/01/2018 | null
09/01/2018 | null
09/01/2018 | 12
I want to replace null ID values with the closest value in the past or, if that is null, with the closest value in the future (and if that is also null, with a default value).
I imagined adding a new column with .withColumn and using a UDF that queries the data frame itself.
Something like this, in pseudocode (not perfect, but it shows the main idea):
from pyspark.sql.types import StringType
from pyspark.sql.functions import udf
def return_value(value, date):
    if value is not None:
        return value
    # Querying df from inside a UDF is the part that does not work:
    # a DataFrame cannot be used in code that runs on the executors.
    value1 = df.filter(df['date'] <= date).select(df['value']).collect()
    if value1[0][0] is not None:
        return value1[0][0]
    value2 = df.filter(df['date'] >= date).select(df['value']).collect()
    return value2[0][0]
value_udf = udf(return_value, StringType())
new_df = df.withColumn("new_value", value_udf(df.value, df.date))
But it does not work. Am I going about this completely the wrong way? Is it even possible to query a Spark data frame inside a UDF? Did I miss an easier solution?
Create a new data frame with one column, the unique list of all dates:
datesDF = yourDF.select('Date').distinct()
Create another one that consists of dates and IDs, but only for rows without nulls. Also keep only the first (whichever comes first) occurrence of ID for each date, since judging from your example you can have multiple rows per date:
noNullsDF = yourDF.dropna().dropDuplicates(subset=['Date'])
Now join those two so that we have a list of all dates with whatever value we have for each (or null):
joinedDF = datesDF.join(noNullsDF, 'Date', 'left')
Now, for every date, get the ID value from the previous date and the next date using window functions, and rename our ID column so that the later join causes fewer problems:
from pyspark.sql.window import Window
from pyspark.sql import functions as f
w = Window.orderBy('Date')
joinedDF = (joinedDF.withColumn('previousID', f.lag('ID').over(w))
            .withColumn('nextID', f.lead('ID').over(w))
            .withColumnRenamed('ID', 'newID'))
Now join it back to the original data frame by date:
yourDF = yourDF.join(joinedDF, 'Date', 'left')
Now our data frame has 4 ID columns:
original ID
newID - ID of any non-null value for the given date, if any, or null
previousID - ID from the previous date (non-null, if any, or null)
nextID - ID from the next date (non-null, if any, or null)
Now we need to combine them into finalID, in this order:
original value if not null
value for the current date if any non-null value is present (this differs from your question, but your pandas code suggests you use <= when checking dates)
value for the previous date if it's not null
value for the next date if it's not null
some default value
We do this simply by coalescing:
default = 0
finalDF = yourDF.select('Date',
'ID',
f.coalesce('ID',
'newID',
'previousID',
'nextID',
f.lit(default)).alias('finalID')
)
I want to maintain the date sort order when using collect_list for multiple columns, all in the same date order. I need them in the same data frame so I can use them to build the input for a time series model. Below is a sample of the "train_data":
I'm using a Window with partitionBy to ensure the sort order by tuning_evnt_start_dt for each Syscode_Stn. I can create one column with this code:
from pyspark.sql import functions as F
from pyspark.sql import Window
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = train_data\
.withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w)
)\
.groupBy('Syscode_Stn')\
.agg(F.max('spp_imp_daily').alias('spp_imp_daily'))
but how do I create two columns in the same new dataframe?
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = (train_data
    .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))
    .withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))
    .groupBy('Syscode_Stn')
    .agg(F.max('spp_imp_daily').alias('spp_imp_daily')))
Note that MarchMadInd is not shown in the screenshot, but is included in train_data. Explanation of how I got to where I am: https://stackoverflow.com/a/49255498/8691976
Yes, the correct way is to add successive .withColumn statements, followed by a .agg statement that removes the duplicates for each array.
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')
sorted_list_df = train_data.withColumn('spp_imp_daily',
F.collect_list('spp_imp_daily').over(w)
)\
.withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))\
.groupBy('Syscode_Stn')\
.agg(F.max('spp_imp_daily').alias('spp_imp_daily'),
F.max('MarchMadInd').alias('MarchMadInd')
)
I have a dataframe of
df = df.select("employee_id", "employee_name", "employee_address")
I need to rename the first two fields, but also still select the third field. So I thought this would work, but this appears to only select employee_address.
df = (df.withColumnRenamed("employee_id", "empId")
.withColumnRenamed("employee_name", "empName")
.select("employee_address")
)
How do I properly rename the first two fields while also selecting the third field?
I tried a mix of withColumn usages, but that doesn't work. Do I have to use a select on all three fields?
You can use the alias method on each column inside select:
import pyspark.sql.functions as func
df = df.select(
func.col("employee_id").alias("empId"),
func.col("employee_name").alias("empName"),
func.col("employee_address")
)
In PySpark, what is the best way to do an operation for each id when a groupBy does not apply? Here is some sample code:
from pyspark.sql import Window
from pyspark.sql.functions import lag
for id in [int(i.id) for i in df.select('id').distinct().collect()]:
    temp = df.where("id == {}".format(id))
    temp = temp.sort("date")
    my_window = Window.partitionBy().orderBy("id")
    temp = temp.withColumn("prev_transaction", lag(temp['date']).over(my_window))
    temp = temp.withColumn("diff", temp['date'] - temp["prev_transaction"])
    temp = temp.where('diff > 0')
    #select a row and so on
What's the best way to optimize this?
I suppose you have a transactions data frame and you want to add a previous-transactions column to it.
from pyspark.sql.window import Window
from pyspark.sql import functions as F
windowSpec = Window.partitionBy('id').orderBy('date')
df.withColumn('pre_transactions', F.collect_list('id') \
.over(windowSpec.rangeBetween(Window.unboundedPreceding, 0)))
The code above adds an array column with the previous transactions up to and including the current one, so the current transaction is also in the array. You can remove it with a simple custom UDF like:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
remove_element = lambda l: l[:-1]
remove_udf = udf(remove_element, ArrayType(StringType()))