Handle missing data and assign value as 0 in PySpark

I want the answer in PySpark.
I have a DataFrame with columns id, date and value.
I want to fill in the missing dates with value 0, so that every id has the same set of dates, e.g. 2022/02/09 to 2022/02/15 for every id (the min date is 2022/02/09 and the max date is 2022/02/15).
Before:

id   date        value
201  2022/02/11  10
201  2022/02/13  2
202  2022/02/09  50
202  2022/02/11  1
202  2022/02/12  3
401  2022/02/11  12
401  2022/02/12  9
401  2022/02/15  15
After:

id   date        value
201  2022/02/09  0
201  2022/02/10  0
201  2022/02/11  10
201  2022/02/12  0
201  2022/02/13  2
201  2022/02/14  0
201  2022/02/15  0
202  2022/02/09  50
202  2022/02/10  0
202  2022/02/11  1
202  2022/02/12  3
202  2022/02/13  0
202  2022/02/14  0
202  2022/02/15  0
401  2022/02/09  0
401  2022/02/10  0
401  2022/02/11  12
401  2022/02/12  9
401  2022/02/13  0
401  2022/02/14  0
401  2022/02/15  15

Here's an approach with sequence(). First find the min and max dates and use them to generate all the distinct dates. That dates dataframe can then be cross-joined with the distinct ID values so that every ID has every date. The original values can then be left-joined onto the cross-joined dataframe, and the remaining nulls replaced with 0.
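The snippets below assume an active SparkSession named spark, the usual imports, and the sample rows held in a list of tuples named data_ls; a minimal setup along those lines (the variable and alias names are assumptions carried through the rest of the answer) could be:

import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# sample rows from the question as (id, date string, value) tuples -- assumed input format
data_ls = [
    (201, '2022/02/11', 10),
    (201, '2022/02/13', 2),
    (202, '2022/02/09', 50),
    (202, '2022/02/11', 1),
    (202, '2022/02/12', 3),
    (401, '2022/02/11', 12),
    (401, '2022/02/12', 9),
    (401, '2022/02/15', 15)
]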
# convert date column to compatible format in the input dataframe
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['id', 'dt', 'val']). \
    withColumn('dt', func.to_date('dt', 'yyyy/MM/dd'))
# +---+----------+---+
# | id| dt|val|
# +---+----------+---+
# |201|2022-02-11| 10|
# |201|2022-02-13| 2|
# |202|2022-02-09| 50|
# |202|2022-02-11| 1|
# |202|2022-02-12| 3|
# |401|2022-02-11| 12|
# |401|2022-02-12| 9|
# |401|2022-02-15| 15|
# +---+----------+---+
all_dt_sdf = data_sdf. \
    select(func.min('dt').alias('min_dt'), func.max('dt').alias('max_dt')). \
    withColumn('all_dts', func.expr('sequence(min_dt, max_dt, interval 1 day)')). \
    select(func.explode('all_dts').alias('dt'))
# +----------+
# | dt|
# +----------+
# |2022-02-09|
# |2022-02-10|
# |2022-02-11|
# |2022-02-12|
# |2022-02-13|
# |2022-02-14|
# |2022-02-15|
# +----------+
data_sdf. \
    select('id'). \
    dropDuplicates(). \
    crossJoin(all_dt_sdf). \
    join(data_sdf, ['id', 'dt'], 'left'). \
    fillna(0, subset=['val']). \
    show()
# +---+----------+---+
# | id| dt|val|
# +---+----------+---+
# |201|2022-02-09| 0|
# |201|2022-02-10| 0|
# |201|2022-02-11| 10|
# |201|2022-02-12| 0|
# |201|2022-02-13| 2|
# |201|2022-02-14| 0|
# |201|2022-02-15| 0|
# |202|2022-02-09| 50|
# |202|2022-02-10| 0|
# |202|2022-02-11| 1|
# |202|2022-02-12| 3|
# |202|2022-02-13| 0|
# |202|2022-02-14| 0|
# |202|2022-02-15| 0|
# |401|2022-02-09| 0|
# |401|2022-02-10| 0|
# |401|2022-02-11| 12|
# |401|2022-02-12| 9|
# |401|2022-02-13| 0|
# |401|2022-02-14| 0|
# +---+----------+---+
# only showing top 20 rows
A shorter approach, using the min() and max() window functions:
data_sdf. \
    withColumn('data_min_dt', func.min('dt').over(wd.partitionBy(func.lit(1)))). \
    withColumn('data_max_dt', func.max('dt').over(wd.partitionBy(func.lit(1)))). \
    select('id', 'data_min_dt', 'data_max_dt'). \
    dropDuplicates(). \
    withColumn('all_dts', func.expr('sequence(data_min_dt, data_max_dt, interval 1 day)')). \
    select('id', func.explode('all_dts').alias('dt')). \
    join(data_sdf, ['id', 'dt'], 'left'). \
    fillna(0, subset=['val']). \
    orderBy(['id', 'dt']). \
    show()
# +---+----------+---+
# | id| dt|val|
# +---+----------+---+
# |201|2022-02-09| 0|
# |201|2022-02-10| 0|
# |201|2022-02-11| 10|
# |201|2022-02-12| 0|
# |201|2022-02-13| 2|
# |201|2022-02-14| 0|
# |201|2022-02-15| 0|
# |202|2022-02-09| 50|
# |202|2022-02-10| 0|
# |202|2022-02-11| 1|
# |202|2022-02-12| 3|
# |202|2022-02-13| 0|
# |202|2022-02-14| 0|
# |202|2022-02-15| 0|
# |401|2022-02-09| 0|
# |401|2022-02-10| 0|
# |401|2022-02-11| 12|
# |401|2022-02-12| 9|
# |401|2022-02-13| 0|
# |401|2022-02-14| 0|
# +---+----------+---+
# only showing top 20 rows

Related

Spark rolling window with multiple occurrence in same date

Input:
Request:
I would like to calculate a 3-month rolling sum and average across time. However, there are two rows for "2022-07-01", and I would like to get a result for both rows.
Expected Output:
If you can use rowsBetween instead of rangeBetween, you can assign a row number to the dates in order and then use that row number in the sum and avg windows.
Below is an example.
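The example assumes the input is available as data_sdf with columns ym and val; a minimal setup, reconstructed from the output shown below (an assumption, since the original input is not reproduced here), could be:

import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# (ym, val) rows reconstructed from the output below -- assumed sample data
data_sdf = spark.createDataFrame(
    [('2022-03-01', 8), ('2022-04-01', 7), ('2022-05-01', 7),
     ('2022-06-01', 10), ('2022-07-01', 4), ('2022-07-01', 1)],
    ['ym', 'val']
).withColumn('ym', func.to_date('ym'))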
data_sdf. \
    withColumn('rn', func.row_number().over(wd.partitionBy().orderBy('ym'))). \
    withColumn('sum_val_roll3m',
               func.sum('val').over(wd.partitionBy().orderBy('rn').rowsBetween(-2, 0))
               ). \
    withColumn('mean_val_roll3m',
               func.avg('val').over(wd.partitionBy().orderBy('rn').rowsBetween(-2, 0))
               ). \
    show()
# +----------+---+---+--------------+-----------------+
# | ym|val| rn|sum_val_roll3m| mean_val_roll3m|
# +----------+---+---+--------------+-----------------+
# |2022-03-01| 8| 1| 8| 8.0|
# |2022-04-01| 7| 2| 15| 7.5|
# |2022-05-01| 7| 3| 22|7.333333333333333|
# |2022-06-01| 10| 4| 24| 8.0|
# |2022-07-01| 4| 5| 21| 7.0|
# |2022-07-01| 1| 6| 15| 5.0|
# +----------+---+---+--------------+-----------------+

Filling column null fields with specific values from another column fields

I'm new to PySpark and I need your help with DataFrame column creation. I have a DataFrame like this:
FROM_CURRENCY  TO_CURRENCY  RATIO_FROM  RATIO_TO
AED            EUR          0           0
AED            EUR          1           1
GNF            EUR          0           0
DZD            EUR          1           1
GNF            EUR          1000        1000
I would like to create two additional columns, RATIO_FROM_BIS and RATIO_TO_BIS, based on the values of FROM_CURRENCY and TO_CURRENCY. As you can see, the 0 values are replaced by the non-zero values from other rows that have the same FROM_CURRENCY value.
FROM_CURRENCY  TO_CURRENCY  RATIO_FROM_BIS  RATIO_TO_BIS
AED            EUR          1               1
AED            EUR          1               1
GNF            EUR          1000            1000
DZD            EUR          1               1
GNF            EUR          1000            1000
I have tried using .withColumn(field1, F.lit(command)) but it's not working.
Based on your comment, there can be multiple records for a certain currency (from_currency) and all of the non-zero records will have the same ratio values. I've added the last row to denote this scenario.
An approach with the max() window function:
import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd
data_ls = [
    ('AED', 'EUR', 0, 0),
    ('AED', 'EUR', 1, 1),
    ('GNF', 'EUR', 0, 0),
    ('DZD', 'EUR', 1, 1),
    ('GNF', 'EUR', 1000, 1000),
    ('GNF', 'EUR', 1000, 1000)
]

data_sdf = spark.sparkContext.parallelize(data_ls). \
    toDF(['from_curr', 'to_curr', 'ratio_to', 'ratio_from'])
# +---------+-------+--------+----------+
# |from_curr|to_curr|ratio_to|ratio_from|
# +---------+-------+--------+----------+
# | AED| EUR| 0| 0|
# | AED| EUR| 1| 1|
# | GNF| EUR| 0| 0|
# | DZD| EUR| 1| 1|
# | GNF| EUR| 1000| 1000|
# | GNF| EUR| 1000| 1000|
# +---------+-------+--------+----------+
data_sdf. \
    withColumn('ratio_to_bis',
               func.when(func.col('ratio_to') > 0, func.col('ratio_to')).
               otherwise(func.max('ratio_to').over(wd.partitionBy('from_curr', 'to_curr')))
               ). \
    withColumn('ratio_from_bis',
               func.when(func.col('ratio_from') > 0, func.col('ratio_from')).
               otherwise(func.max('ratio_from').over(wd.partitionBy('from_curr', 'to_curr')))
               ). \
    show()
# +---------+-------+--------+----------+------------+--------------+
# |from_curr|to_curr|ratio_to|ratio_from|ratio_to_bis|ratio_from_bis|
# +---------+-------+--------+----------+------------+--------------+
# | DZD| EUR| 1| 1| 1| 1|
# | GNF| EUR| 0| 0| 1000| 1000|
# | GNF| EUR| 1000| 1000| 1000| 1000|
# | GNF| EUR| 1000| 1000| 1000| 1000|
# | AED| EUR| 0| 0| 1| 1|
# | AED| EUR| 1| 1| 1| 1|
# +---------+-------+--------+----------+------------+--------------+

pyspark convert for loop to map

I have a dataset that has null values
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| 0|null|
| 1|null| 0|
|null| 1| 0|
| 1| 0| 0|
| 1| 0| 0|
|null| 0| 1|
| 1| 1| 0|
| 1| 1|null|
|null| 1| 0|
+----+----+----+
I wrote a function to compute the percentage of null values in each column of the dataset and to remove those columns from it. Below is the function:
import pyspark.sql.functions as F

def calc_null_percent(df, strength=None):
    if strength is None:
        strength = 80
    total_count = df.count()
    null_cols = []
    df2 = df.select([F.count(F.when(F.col(c).contains('None') | \
                                    F.col(c).contains('NULL') | \
                                    (F.col(c) == '') | \
                                    F.col(c).isNull() | \
                                    F.isnan(c), c
                                    )).alias(c)
                     for c in df.columns])
    for i in df2.columns:
        get_null_val = df2.first()[i]
        if (get_null_val / total_count) * 100 > strength:
            null_cols.append(i)
    df = df.drop(*null_cols)
    return df
I am using a for loop to get the columns based on the condition. Can we use map, or is there any other way to optimise the for loop in PySpark?
Here's a way to do it with list comprehension.
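This assumes the usual functions alias, as in the other examples here (an assumption, since the original snippet doesn't show its imports):

import pyspark.sql.functions as func

First, set up some sample data containing null-like values: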
data_ls = [
    (1, 0, 'blah'),
    (0, None, 'None'),
    (None, 1, 'NULL'),
    (1, None, None)
]

data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['id1', 'id2', 'id3'])
# +----+----+----+
# | id1| id2| id3|
# +----+----+----+
# | 1| 0|blah|
# | 0|null|None|
# |null| 1|NULL|
# | 1|null|null|
# +----+----+----+
Now, calculate the percentage of nulls for each column and collect() the result for further use.
# total row count
tot_count = data_sdf.count()
# percentage of null records per column
data_null_perc_sdf = data_sdf. \
    select(*[(func.sum((func.col(k).isNull() | (func.upper(k).isin(['NONE', 'NULL']))).cast('int')) / tot_count).alias(k+'_nulls_perc') for k in data_sdf.columns])
# +--------------+--------------+--------------+
# |id1_nulls_perc|id2_nulls_perc|id3_nulls_perc|
# +--------------+--------------+--------------+
# | 0.25| 0.5| 0.75|
# +--------------+--------------+--------------+
# collection of the dataframe for list comprehension
data_null_perc = data_null_perc_sdf.collect()
# [Row(id1_nulls_perc=0.25, id2_nulls_perc=0.5, id3_nulls_perc=0.75)]
threshold = 0.5
# retain columns of `data_sdf` that have more null records than aforementioned threshold
cols2drop = [k for k in data_sdf.columns if data_null_perc[0][k+'_nulls_perc'] >= threshold]
# ['id2', 'id3']
Use the cols2drop variable to drop those columns from data_sdf in the next step:
new_data_sdf = data_sdf.drop(*cols2drop)
# +----+
# | id1|
# +----+
# | 1|
# | 0|
# |null|
# | 1|
# +----+
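If you want to keep the question's function signature, the same steps can be wrapped back into a function. This is only a sketch under the assumptions above (the name drop_mostly_null_cols is made up here, and the threshold is a percentage to match the original strength parameter):

import pyspark.sql.functions as func

def drop_mostly_null_cols(df, strength=80):
    # percentage of null-like records per column, using the same expression as above
    tot_count = df.count()
    null_perc = df.select(
        *[(func.sum((func.col(k).isNull() | func.upper(k).isin(['NONE', 'NULL'])).cast('int')) * 100 / tot_count).alias(k)
          for k in df.columns]
    ).collect()[0]
    # drop the columns whose null-like percentage exceeds the threshold
    return df.drop(*[k for k in df.columns if null_perc[k] > strength])

# e.g. with the sample data above, strength=40 drops id2 (50%) and id3 (75%) and keeps id1 (25%)
new_data_sdf = drop_mostly_null_cols(data_sdf, strength=40)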

Change value on duplicated rows using Pyspark, keeping the first record as is

How can I change the status column value on rows that contain duplicate records in specific columns, keeping the first one (with the lowest id) as A? For example:
Logic:
if the account_id and user_id combination already exists, the status is E, and the first record (lowest id) is A
if the user_id exists but the account_id is different, the status is I, and the first record (lowest id) is A
Input sample:

id  account_id  user_id
1   a           1
2   a           1
3   b           1
4   c           2
5   c           2
6   c           2
7   d           3
8   d           3
9   e           3
Output sample:

id  account_id  user_id  status
1   a           1        A
2   a           1        E
3   b           1        I
4   c           2        A
5   c           2        E
6   c           2        E
7   d           3        A
8   d           3        E
9   e           3        I
I think I need to group into multiple datasets and join them back, then compare and change the values, but I think I'm overthinking it. Help?
Thanks!!
Two window functions will help you determine the duplicates and rank them.
from pyspark.sql import functions as F
from pyspark.sql import Window as W
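# The chain below assumes the input sample is already loaded as a DataFrame named df;
# a minimal, assumed setup from the question's sample data:
df = spark.createDataFrame(
    [(1, 'a', 1), (2, 'a', 1), (3, 'b', 1),
     (4, 'c', 2), (5, 'c', 2), (6, 'c', 2),
     (7, 'd', 3), (8, 'd', 3), (9, 'e', 3)],
    ['id', 'account_id', 'user_id']
)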
(df
    # Distinguishes between "first occurrence" vs "2nd occurrence" and so on
    .withColumn('rank', F.rank().over(W.partitionBy('account_id', 'user_id').orderBy('id')))
    # Detecting if there is no duplication per pair of 'account_id' and 'user_id'
    .withColumn('count', F.count('*').over(W.partitionBy('account_id', 'user_id')))
    # building status based on conditions
    .withColumn('status', F
        .when(F.col('count') == 1, 'I')  # if there is only one record, status is 'I'
        .when(F.col('rank') == 1, 'A')   # if there is more than one record, the first occurrence is 'A'
        .otherwise('E')                  # finally, the other occurrences are 'E'
    )
    .orderBy('id')
    .show()
)
# Output
# +---+----------+-------+----+-----+------+
# | id|account_id|user_id|rank|count|status|
# +---+----------+-------+----+-----+------+
# | 1| a| 1| 1| 2| A|
# | 2| a| 1| 2| 2| E|
# | 3| b| 1| 1| 1| I|
# | 4| c| 2| 1| 3| A|
# | 5| c| 2| 2| 3| E|
# | 6| c| 2| 3| 3| E|
# | 7| d| 3| 1| 2| A|
# | 8| d| 3| 2| 2| E|
# | 9| e| 3| 1| 1| I|
# +---+----------+-------+----+-----+------+

create another columns for checking different value in pyspark

I wish to have the expected output shown below.
My code:
import numpy as np
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

pd_dataframe = pd.DataFrame({'id': [i for i in range(10)],
                             'values': [10, 5, 3, -1, 0, -10, -4, 10, 0, 10]})
sp_dataframe = spark.createDataFrame(pd_dataframe)
sign_acc_row = F.udf(lambda x: int(np.sign(x)), IntegerType())
sp_dataframe = sp_dataframe.withColumn('sign', sign_acc_row('values'))
sp_dataframe.show()
I want to create another column that increments by 1 whenever the value differs from the previous row.
Expected output:
   id  values  sign  numbering
0   0      10     1          1
1   1       5     1          1
2   2       3     1          1
3   3      -1    -1          2
4   4       0     0          3
5   5     -10    -1          4
6   6      -4    -1          4
7   7      10     1          5
8   8       0     0          6
9   9      10     1          7
Here's a way you can do it using a custom function:
import pyspark.sql.functions as F

# compare each value with the previous one; increment the counter on a change
def f(x):
    c = 1
    l = [c]
    last_value = [x[0]]
    for i in x[1:]:
        if i == last_value[-1]:
            l.append(c)
        else:
            c += 1
            l.append(c)
        last_value.append(i)
    return l

# take the sign column as a list
sign_list = sp_dataframe.select('sign').rdd.map(lambda x: x.sign).collect()

# create a new dataframe using the output
sp = spark.createDataFrame(pd.DataFrame(f(sign_list), columns=['numbering']))
Appending a list as a column to a DataFrame is a bit tricky in PySpark. For this, we'll need to create a dummy row_idx to join the dataframes.
# create dummy indexes
sp_dataframe = sp_dataframe.withColumn("row_idx", F.monotonically_increasing_id())
sp = sp.withColumn("row_idx", F.monotonically_increasing_id())
# join the dataframes
final_df = (sp_dataframe
            .join(sp, sp_dataframe.row_idx == sp.row_idx)
            .orderBy('id')
            .drop("row_idx"))
final_df.show()
+---+------+----+---------+
| id|values|sign|numbering|
+---+------+----+---------+
| 0| 10| 1| 1|
| 1| 5| 1| 1|
| 2| 3| 1| 1|
| 3| -1| -1| 2|
| 4| 0| 0| 3|
| 5| -10| -1| 4|
| 6| -4| -1| 4|
| 7| 10| 1| 5|
| 8| 0| 0| 6|
| 9| 10| 1| 7|
+---+------+----+---------+
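As a side note, the same numbering can also be computed without collecting the sign column to the driver, using lag() plus a running sum over a window ordered by id. This is a sketch of an alternative, not part of the original answer; note that the unpartitioned window pulls all rows into a single partition.

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# order the whole frame by id (no partitionBy, so everything goes to one partition)
w = Window.orderBy('id')

alt_df = (sp_dataframe
          # flag rows where the sign differs from the previous row's sign
          .withColumn('chg', F.when(F.col('sign') != F.lag('sign').over(w), 1).otherwise(0))
          # a running count of changes gives the group numbering
          .withColumn('numbering', F.sum('chg').over(w) + 1)
          .drop('chg'))

alt_df.orderBy('id').show()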