Spark dropduplicates but choose column with null - pyspark

I have a table that looks like that:
+---------+-------------+--------------+-----------+--------+--------------+--------------+
| cust_num|valid_from_dt|valid_until_dt|cust_row_id| cust_id|insert_load_dt|update_load_dt|
+---------+-------------+--------------+-----------+--------+--------------+--------------+
|950379405|   2018-08-24|    2018-08-24|   06885247|06885247|    2018-08-24|    2018-08-25|
|950379405|   2018-08-25|    2018-08-28|   06885247|06885247|    2018-08-25|    2018-08-29|
|950379405|   2018-08-29|    2019-12-16|   27344328|06885247|    2018-08-29|    2019-12-17|<- pair 1
|950379405|   2018-08-29|    2019-12-16|   27344328|06885247|    2018-08-29|              |<- pair 1
|950379405|   2019-12-17|    2019-12-24|   91778710|06885247|    2019-12-17|              |<- pair 2
|950379405|   2019-12-17|    2019-12-24|   91778710|06885247|    2019-12-17|    2019-12-25|<- pair 2
|950379405|   2019-12-25|    2019-12-25|   08396180|06885247|    2019-12-25|    2019-12-26|<- pair 3
|950379405|   2019-12-25|    2019-12-25|   08396180|06885247|    2019-12-25|              |<- pair 3
+---------+-------------+--------------+-----------+--------+--------------+--------------+
As you can see, I have some duplicated rows in my table; they differ only in whether update_load_dt is empty or has a date.
I would like to drop duplicates in my dataframe like this:
cable_dv_customer_fixed.dropDuplicates(['cust_num',
'valid_from_dt',
'valid_until_dt',
'cust_row_id',
'cust_id'])
but I would like to keep the row with more information, i.e. the row where update_load_dt <> ''.
Is it possible to modify the dropDuplicates() function so that I can choose which of the duplicate rows to keep? Or is there some other (better) way to do this?

This is how I would go about it: F.max() will do what you want and keep the row with the highest value (on a date column, max() keeps the latest entry if there are several).
import pyspark.sql.functions as F
from pyspark.sql.window import Window

key_cols = ['cust_num', 'valid_from_dt', 'valid_until_dt', 'cust_row_id', 'cust_id']
w = Window.partitionBy(key_cols)
df.withColumn('update_load_dt', F.max('update_load_dt').over(w)).dropDuplicates(key_cols)
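For example, on a tiny frame shaped like pair 3 from the sample above (same values, rebuilt by hand here, with the empty update_load_dt represented as null), the surviving row carries the non-null date:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

cols = ['cust_num', 'valid_from_dt', 'valid_until_dt', 'cust_row_id', 'cust_id',
        'insert_load_dt', 'update_load_dt']
df = spark.createDataFrame([
    ('950379405', '2019-12-25', '2019-12-25', '08396180', '06885247', '2019-12-25', '2019-12-26'),
    ('950379405', '2019-12-25', '2019-12-25', '08396180', '06885247', '2019-12-25', None),
], cols)

key_cols = ['cust_num', 'valid_from_dt', 'valid_until_dt', 'cust_row_id', 'cust_id']
w = Window.partitionBy(key_cols)

# max() ignores nulls, so both rows get '2019-12-26' before the duplicates are dropped
df.withColumn('update_load_dt', F.max('update_load_dt').over(w)) \
  .dropDuplicates(key_cols) \
  .show()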
I work with 1 billion+ rows and this is not slow.
Let me know if this helped!

You can use window functions for that, although with big data it can be slow.
import pyspark.sql.functions as F
from pyspark.sql.window import Window

(df.withColumn("row_number", F.row_number().over(Window.partitionBy(<cols>).orderBy(F.asc_nulls_last("update_load_dt"))))
   .filter("row_number = 1")
   .drop("row_number"))  # optional
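Filling in the <cols> placeholder with the key columns from the question (df being the questioner's dataframe), this might look like:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

key_cols = ['cust_num', 'valid_from_dt', 'valid_until_dt', 'cust_row_id', 'cust_id']

# Nulls sort last, so within each duplicate group a non-null update_load_dt wins
w = Window.partitionBy(key_cols).orderBy(F.asc_nulls_last('update_load_dt'))

deduped = (df.withColumn('row_number', F.row_number().over(w))
             .filter('row_number = 1')
             .drop('row_number'))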

Using @Topde's answer: if you create a boolean column that checks whether the value present in your column is the highest one, you only need to add a filter that eliminates the duplicate entries where the "update_load_dt" column is null.
from pyspark.sql.window import Window
import pyspark.sql.functions as F

key_cols = ['cust_num', 'valid_from_dt', 'valid_until_dt', 'cust_row_id', 'cust_id']
w = Window.partitionBy(key_cols)

# Flag the row whose update_load_dt equals the group maximum (a null never matches),
# keep only those rows, then drop the helper flag
(df.withColumn('is_latest', F.when(F.col('update_load_dt') == F.max('update_load_dt').over(w), True).otherwise(False))
   .filter(F.col('is_latest'))
   .drop('is_latest')
   .distinct()
)

Related

Pivot by year and also get the sum of all amounts Pyspark

I have data like this
I want output like this
How do I achieve this?
One way of doing it: pivot, create an array, and sum the values within the array.
from pyspark.sql.functions import *

s = df.groupby('id').pivot('year').agg(sum('amount'))  # pivot
(s.withColumn('x', array(*[x for x in s.columns if x != 'id']))  # create array
 .withColumn('x', expr("reduce(x, cast(0 as bigint), (c, i) -> c + i)"))  # sum
).show()
Or use PySpark's built-in aggregate function:
s = df.groupby('id').pivot('year').agg(sum('amount'))  # pivot
(s.withColumn('x', array(*[x for x in s.columns if x != 'id']))  # create array
 .withColumn('x', expr("aggregate(x, cast(0 as bigint), (c, i) -> c + i)"))  # sum
).show()
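Since the sample data and expected output were posted as images, here is a minimal, self-contained sketch with made-up values showing the pattern end to end (column names id, year, amount assumed from the answer):
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input: one or more amounts per (id, year)
df = spark.createDataFrame(
    [(1, 2019, 100), (1, 2020, 200), (2, 2019, 300), (2, 2020, 400)],
    ['id', 'year', 'amount'],
)

s = df.groupby('id').pivot('year').agg(F.sum('amount'))                          # pivot by year
(s.withColumn('x', F.array(*[c for c in s.columns if c != 'id']))                # gather the year columns
  .withColumn('x', F.expr("aggregate(x, cast(0 as bigint), (c, i) -> c + i)"))   # row-wise total
  .show())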

How to maintain sort order in PySpark collect_list and collect multiple lists

I want to maintain the date sort order, using collect_list for multiple columns, all with the same date order. I'll need them in the same dataframe so I can use it to create a time-series model input. Below is a sample of the "train_data":
I'm using a Window with PartitionBy to ensure sort order by tuning_evnt_start_dt for each Syscode_Stn. I can create one column with this code:
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')

sorted_list_df = (train_data
                  .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))
                  .groupBy('Syscode_Stn')
                  .agg(F.max('spp_imp_daily').alias('spp_imp_daily')))
but how do I create two columns in the same new dataframe?
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')

sorted_list_df = (train_data
                  .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))
                  .withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))
                  .groupBy('Syscode_Stn')
                  .agg(F.max('spp_imp_daily').alias('spp_imp_daily')))
Note that MarchMadInd is not shown in the screenshot, but is included in train_data. Explanation of how I got to where I am: https://stackoverflow.com/a/49255498/8691976
Yes, the correct way is to add successive .withColumn calls, followed by an .agg statement that removes the duplicates for each array.
w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')

sorted_list_df = (train_data
                  .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))
                  .withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))
                  .groupBy('Syscode_Stn')
                  .agg(F.max('spp_imp_daily').alias('spp_imp_daily'),
                       F.max('MarchMadInd').alias('MarchMadInd')))
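As a sanity check, here is a minimal sketch with invented station/date values (not the original screenshot data) showing that both collected lists come out in the same date order:
import pyspark.sql.functions as F
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows: one station, dates deliberately out of order in the input
train_data = spark.createDataFrame(
    [('A', '2019-03-01', 0.5, 0),
     ('A', '2019-03-03', 0.7, 1),
     ('A', '2019-03-02', 0.6, 0)],
    ['Syscode_Stn', 'tuning_evnt_start_dt', 'spp_imp_daily', 'MarchMadInd'],
)

w = Window.partitionBy('Syscode_Stn').orderBy('tuning_evnt_start_dt')

sorted_list_df = (train_data
                  .withColumn('spp_imp_daily', F.collect_list('spp_imp_daily').over(w))
                  .withColumn('MarchMadInd', F.collect_list('MarchMadInd').over(w))
                  .groupBy('Syscode_Stn')
                  .agg(F.max('spp_imp_daily').alias('spp_imp_daily'),
                       F.max('MarchMadInd').alias('MarchMadInd')))
sorted_list_df.show(truncate=False)
# Both lists follow the 03-01, 03-02, 03-03 order: [0.5, 0.6, 0.7] and [0, 0, 1]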

pyspark group by sum

I have a pyspark dataframe with 4 columns.
id/ number / value / x
I want to group by columns id, number, and then add a new column with the sum of value per id and number. I want to keep column x without doing anything to it.
df = (df.select("id", "number", "value", "x")
        .groupBy('id', 'number').withColumn("sum_of_value", df.value.sum()))
At the end I want a dataframe with 5 columns: id / number / value / x / sum_of_value.
Can anyone help?
The result you are trying to achieve doesn't make sense. Your output dataframe will only have columns that were grouped by or aggregated (summed in this case). x and value would have multiple values when you group by id and number.
You can have a 3-column output (id, number and sum(value)) like this:
df_summed = df.groupBy('id', 'number').sum('value')
Let's say your DataFrame df has 3 columns initially.
df1 = df.groupBy("id","number").count()
Now df1 will contain three columns: id, number, and count.
Now you can join df1 and df based on columns "id" and "number" and select whatever columns you would like to select.
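A minimal, self-contained sketch of that join, swapping count() for the sum the question asks about (sample values are made up):
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical frame with the question's four columns
df = spark.createDataFrame(
    [(1, 10, 5.0, 'a'), (1, 10, 7.0, 'b'), (2, 20, 3.0, 'c')],
    ['id', 'number', 'value', 'x'],
)

# Aggregate per (id, number), then join the sum back onto the original rows
sums = df.groupBy('id', 'number').agg(F.sum('value').alias('sum_of_value'))
result = df.join(sums, on=['id', 'number'], how='left')
result.show()  # id, number, value, x, sum_of_value -> five columns, x untouched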
Hope it helps.
Regards,
Neeraj

Creating table with multiple row and column names

I'm trying to create a table in MATLAB with rows and columns that have several 'levels' of names, for example a column named 'Neutral' that is divided into sublevels 'M' and 'SD' (see below for an illustration). I have the same problem with rows. Does anyone know if this is possible in MATLAB, and if so, how?
| Neutral |<- Column name
|----|----|
| M | SD |<- Sublevel of column name
|----|----|
|5.70|2.39|<- Data
|7.37|2.27|<-
| .. | .. |<-
| .. | .. |<-
You can nest table objects, like so:
t = table(table((1:10)', rand(10,1), 'VariableNames', {'M', 'SD'}), ...
'VariableNames', {'Neutral'});
The display looks a little odd, but you can index the nested variables in the way that you might expect, i.e. t.Neutral.M etc.

pyspark: get unique items in each column of a dataframe

I have a spark dataframe containing 1 million rows and 560 columns. I need to find the count of unique items in each column of the dataframe.
I have written the following code to achieve this but it is getting stuck and taking too much time to execute:
count_unique_items = []
for j in range(len(cat_col)):
    var = cat_col[j]
    count_unique_items.append(data.select(var).distinct().rdd.map(lambda r: r[0]).count())
cat_col contains the column names of all the categorical variables
Is there any way to optimize this?
Try using approxCountDistinct or countDistinct:
from pyspark.sql.functions import approxCountDistinct, countDistinct
counts = df.agg(approxCountDistinct("col1"), approxCountDistinct("col2")).first()
but counting distinct elements is expensive.
You can do something like this, but as stated above, distinct-element counting is expensive. The single * unpacks the generator so each column's expression is passed as a separate argument, and the return value is 1 row × N columns. I frequently follow it with a .toPandas() call to make the result easier to manipulate later down the road.
from pyspark.sql.functions import col, approxCountDistinct

distvals = df.agg(*(approxCountDistinct(col(c), rsd=0.01).alias(c) for c in df.columns))
You can get the frequent items of each column with
df.stat.freqItems([list of column names], support)  # support defaults to 1%
This returns a dataframe with those values, but if you want a dataframe with just the distinct count of each column, use this:
from pyspark.sql.functions import countDistinct
df.select( [ countDistinct(cn).alias("c_{0}".format(cn)) for cn in df.columns ] ).show()
The counting part is taken from here: check number of unique values in each column of a matrix in spark