Add new row to pyspark dataframe based on values - pyspark

I have got a dataframe like this:
client_username|workstation|session_duration|access_point_name|start_date|
XX1#AD |Apple |1.55 |idf_1 |2019-06-01|
XX2#AD |Apple |30.12 |idf_2 |2019-06-04|
XX3#AD |Apple |78.25 |idf_3 |2019-06-02|
XX4#AD |Apple |0.45 |idf_1 |2019-06-02|
XX1#AD |Apple |23.11 |idf_1 |2019-06-02|
client_username - id of user in domain
workstation - user workstation
session_duration - duration (in hours) of the active session (user logged on his host)
access_point_name - the name of access point that supplies the network to users host
start_date - start date of the session
I would like to achieve a dataframe like this:
client_username|workstation|session_duration|access_point_name|start_date|
XX1#AD |Apple |1.55 |idf_1 |2019-06-01|
XX2#AD |Apple |8 |idf_2 |2019-06-04|
XX2#AD |Apple |8 |idf_2 |2019-06-05|
XX3#AD |Apple |8 |idf_3 |2019-06-02|
XX3#AD |Apple |8 |idf_3 |2019-06-03|
XX3#AD |Apple |8 |idf_3 |2019-06-04|
XX3#AD |Apple |8 |idf_3 |2019-06-05|
XX4#AD |Apple |0.45 |idf_1 |2019-06-02|
XX1#AD |Apple |23.11 |idf_1 |2019-06-02|
The idea is as follows:
* if the length of a session is over 24 hours but less than 48 hours, I would like to change this:
XX2#AD |Apple |30.12 |idf_2 |2019-06-04|
to this:
XX2#AD |Apple |8 |idf_2 |2019-06-04|
XX2#AD |Apple |8 |idf_2 |2019-06-05|
The session duration changes to 8 hours per row, and the number of days increases to two (2019-06-04 and 2019-06-05).
Analogous situations apply for durations above 48 hours (3 days), above 72 hours (4 days), and so on.
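In other words, the number of daily rows for a session is ceil(session_duration / 24). A quick sanity check of that arithmetic in plain Python (illustration only, using the durations above):
import math

for hours in (1.55, 30.12, 78.25, 0.45, 23.11):
    print(hours, '->', math.ceil(hours / 24), 'row(s)')
# 30.12 -> 2 rows (2019-06-04 and 2019-06-05), 78.25 -> 4 rows, the rest stay single rows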
I'm starting to learn pyspark. I tried using union or crossJoin on the dataframe, but this is still too complicated for me at the moment. I would like to do this task using pyspark.

Here are some methods you can try:
Method-1: string functions: repeat, substring
* calculate the number of repeats: n = ceil(session_duration/24)
* create a string a that repeats the substring "8," n times, then use substring() or regexp_replace() to remove the trailing comma ,
* split a by comma and then posexplode it into rows of pos and session_duration
* adjust the start_date by pos from the above step
* cast the string session_duration into double
See the code example below:
from pyspark.sql import functions as F
# assume the columns in your dataframe are read with proper data types
# for example using inferSchema=True
df = spark.read.csv('/path/to/file', header=True, inferSchema=True)
df1 = df.withColumn('n', F.ceil(F.col('session_duration')/24).astype('int')) \
.withColumn('a', F.when(F.col('n')>1, F.expr('substring(repeat("8,",n),0,2*n-1)')).otherwise(F.col('session_duration')))
>>> df1.show()
+---------------+-----------+----------------+-----------------+-------------------+---+-------+
|client_username|workstation|session_duration|access_point_name| start_date| n| a|
+---------------+-----------+----------------+-----------------+-------------------+---+-------+
| XX1#AD| Apple| 1.55| idf_1|2019-06-01 00:00:00| 1| 1.55|
| XX2#AD| Apple| 30.12| idf_2|2019-06-04 00:00:00| 2| 8,8|
| XX3#AD| Apple| 78.25| idf_3|2019-06-02 00:00:00| 4|8,8,8,8|
| XX4#AD| Apple| 0.45| idf_1|2019-06-02 00:00:00| 1| 0.45|
| XX1#AD| Apple| 23.11| idf_1|2019-06-02 00:00:00| 1| 23.11|
+---------------+-----------+----------------+-----------------+-------------------+---+-------+
df_new = df1.select(
    'client_username'
    , 'workstation'
    , F.posexplode(F.split('a', ',')).alias('pos', 'session_duration')
    , 'access_point_name'
    , F.expr('date_add(start_date, pos)').alias('start_date')
).drop('pos')
>>> df_new.show()
+---------------+-----------+----------------+-----------------+----------+
|client_username|workstation|session_duration|access_point_name|start_date|
+---------------+-----------+----------------+-----------------+----------+
| XX1#AD| Apple| 1.55| idf_1|2019-06-01|
| XX2#AD| Apple| 8| idf_2|2019-06-04|
| XX2#AD| Apple| 8| idf_2|2019-06-05|
| XX3#AD| Apple| 8| idf_3|2019-06-02|
| XX3#AD| Apple| 8| idf_3|2019-06-03|
| XX3#AD| Apple| 8| idf_3|2019-06-04|
| XX3#AD| Apple| 8| idf_3|2019-06-05|
| XX4#AD| Apple| 0.45| idf_1|2019-06-02|
| XX1#AD| Apple| 23.11| idf_1|2019-06-02|
+---------------+-----------+----------------+-----------------+----------+
The above code can also be written as one chain:
df_new = df.withColumn('n'
    , F.ceil(F.col('session_duration')/24).astype('int')
).withColumn('a'
    , F.when(F.col('n')>1, F.expr('substring(repeat("8,",n),0,2*n-1)')).otherwise(F.col('session_duration'))
).select('client_username'
    , 'workstation'
    , F.posexplode(F.split('a', ',')).alias('pos', 'session_duration')
    , 'access_point_name'
    , F.expr('date_add(start_date, pos)').alias('start_date')
).withColumn('session_duration'
    , F.col('session_duration').astype('double')
).drop('pos')
Method-2: array function array_repeat (pyspark 2.4+)
Similar to Method-1, but a is already an array, so there is no need to split a string into an array:
df1 = df.withColumn('n', F.ceil(F.col('session_duration')/24).astype('int')) \
.withColumn('a', F.when(F.col('n')>1, F.expr('array_repeat(8,n)')).otherwise(F.array('session_duration')))
>>> df1.show()
+---------------+-----------+----------------+-----------------+-------------------+---+--------------------+
|client_username|workstation|session_duration|access_point_name| start_date| n| a|
+---------------+-----------+----------------+-----------------+-------------------+---+--------------------+
| XX1#AD| Apple| 1.55| idf_1|2019-06-01 00:00:00| 1| [1.55]|
| XX2#AD| Apple| 30.12| idf_2|2019-06-04 00:00:00| 2| [8.0, 8.0]|
| XX3#AD| Apple| 78.25| idf_3|2019-06-02 00:00:00| 4|[8.0, 8.0, 8.0, 8.0]|
| XX4#AD| Apple| 0.45| idf_1|2019-06-02 00:00:00| 1| [0.45]|
| XX1#AD| Apple| 23.11| idf_1|2019-06-02 00:00:00| 1| [23.11]|
+---------------+-----------+----------------+-----------------+-------------------+---+--------------------+
df_new = df1.select('client_username'
    , 'workstation'
    , F.posexplode('a').alias('pos', 'session_duration')
    , 'access_point_name'
    , F.expr('date_add(start_date, pos)').alias('start_date')
).drop('pos')
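Unlike Method-1, no cast back to double is needed here: as the [8.0, 8.0] values in the show() output above indicate, the elements of a are already doubles, so the exploded session_duration comes out numeric. A quick check (assuming the df_new above):
>>> df_new.printSchema()   # session_duration should show up as double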

Related

Need to add order on ids on the basis of timestamp

Desired outcome: an "order" column that numbers each user's rows by timestamp.
I tried everything with group by and conditions, but it is not working.
You can achieve that with a window function, like this:
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number
window = Window.partitionBy("user").orderBy("timestamp")
df.withColumn("order", row_number().over(window)).show()
+----+---------+-----+
|user|timestamp|order|
+----+---------+-----+
| 111| 12:00| 1|
| 111| 12:30| 2|
| 111| 12:45| 3|
| 112| 12:00| 1|
| 112| 12:30| 2|
| 112| 12:45| 3|
| 113| 12:00| 1|
| 113| 12:30| 2|
| 113| 12:45| 3|
+----+---------+-----+
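A side note, not part of the original answer: if two rows of the same user can share a timestamp and should get the same order value, dense_rank can be swapped in for row_number over the same window:
from pyspark.sql.functions import dense_rank
df.withColumn("order", dense_rank().over(window)).show()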

PySpark - Getting the latest date less than another given date

I need some help. I have two dataframes, one has a few dates and the other has my significant data, catalogued by date.
It goes something like this:
First df, with the relevant data
+------+----------+---------------+
| id| test_date| score|
+------+----------+---------------+
| 1|2021-03-31| 94|
| 1|2021-01-31| 93|
| 1|2020-12-31| 100|
| 1|2020-06-30| 95|
| 1|2019-10-31| 58|
| 1|2017-10-31| 78|
| 2|2020-01-31| 79|
| 2|2018-03-31| 66|
| 2|2016-05-31| 77|
| 3|2021-05-31| 97|
| 3|2020-07-31| 100|
| 3|2019-07-31| 99|
| 3|2019-06-30| 98|
| 3|2018-07-31| 91|
| 3|2018-02-28| 86|
| 3|2017-11-30| 82|
+------+----------+---------------+
Second df, with the dates
+--------------+--------------+--------------+
| eval_date_1| eval_date_2| eval_date_3|
+--------------+--------------+--------------+
| 2021-01-31| 2020-10-31| 2019-06-30|
+--------------+--------------+--------------+
Needed DF
+------+--------------+---------+--------------+---------+--------------+---------+
| id| eval_date_1| score_1 | eval_date_2| score_2 | eval_date_3| score_3 |
+------+--------------+---------+--------------+---------+--------------+---------+
| 1| 2021-01-31| 93| 2020-10-31| 95| 2019-06-30| 78|
| 2| 2021-01-31| 79| 2020-10-31| 79| 2019-06-30| 66|
| 3| 2021-01-31| 100| 2020-10-31| 100| 2019-06-30| 98|
+------+--------------+---------+--------------+---------+--------------+---------+
So, for instance, for the first id, the needed df takes the scores from the second, fourth and sixth rows of the first df. Those are the most recent test_dates that are equal to or earlier than each eval_date in the second df.
Assuming df is your main dataframe and df_date is the one which contains only dates.
from functools import reduce
from pyspark.sql import functions as F, Window as W
df_final = reduce(
    lambda a, b: a.join(b, on="id"),
    (
        df.join(
            F.broadcast(df_date.select(f"eval_date_{i}")),
            on=F.col(f"eval_date_{i}") >= F.col("test_date"),
        )
        .withColumn(
            "rnk",
            F.row_number().over(W.partitionBy("id").orderBy(F.col("test_date").desc())),
        )
        .where("rnk=1")
        .select("id", f"eval_date_{i}", "score")
        for i in range(1, 4)
    ),
)
df_final.show()
+---+-----------+-----+-----------+-----+-----------+-----+
| id|eval_date_1|score|eval_date_2|score|eval_date_3|score|
+---+-----------+-----+-----------+-----+-----------+-----+
| 1| 2021-01-31| 93| 2020-10-31| 95| 2019-06-30| 78|
| 3| 2021-01-31| 100| 2020-10-31| 100| 2019-06-30| 98|
| 2| 2021-01-31| 79| 2020-10-31| 79| 2019-06-30| 66|
+---+-----------+-----+-----------+-----+-----------+-----+
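Note that the three score columns above all keep the same name. To get the score_1/score_2/score_3 layout asked for, the score column can be aliased inside each select before the joins; a minimal tweak of the snippet above (the score_{i} alias is the only addition):
# inside the generator, replace the final select with:
.select("id", f"eval_date_{i}", F.col("score").alias(f"score_{i}"))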

PySpark generating consecutive increasing index for each window

I would like to generate consecutively increasing index ids for each dataframe window, and the index start point should be customizable, say 212 for the following example.
INPUT:
+---+-------------+
| id| component|
+---+-------------+
| a|1047972020224|
| b|1047972020224|
| c|1047972020224|
| d| 670014898176|
| e| 670014898176|
| f| 146028888064|
| g| 146028888064|
+---+-------------+
EXPECTED OUTPUT:
+---+-------------+-----------------------------+
| id| component| partition_index|
+---+-------------+-----------------------------+
| a|1047972020224| 212|
| b|1047972020224| 212|
| c|1047972020224| 212|
| d| 670014898176| 213|
| e| 670014898176| 213|
| f| 146028888064| 214|
| g| 146028888064| 214|
+---+-------------+-----------------------------+
Not sure if Window.partitionBy('component').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) can be helpful in this problem. Any ideas?
You don't have any obvious partitioning here, so you can use dense_rank with an unpartitioned window and add 211 to the result, e.g.:
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'index',
    F.dense_rank().over(Window.orderBy(F.desc('component'))) + 211
)
df2.show()
+---+-------------+-----+
| id| component|index|
+---+-------------+-----+
| a|1047972020224| 212|
| b|1047972020224| 212|
| c|1047972020224| 212|
| d| 670014898176| 213|
| e| 670014898176| 213|
| f| 146028888064| 214|
| g| 146028888064| 214|
+---+-------------+-----+
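Because the window has no partitionBy, Spark moves all rows into a single partition to compute dense_rank, which is fine for small data but worth keeping in mind for large frames. If the start point needs to stay configurable, it can be pulled out into a variable; a minimal sketch reusing the imports above (start_index is an assumed name):
start_index = 212  # customizable start point
df2 = df.withColumn(
    'index',
    F.dense_rank().over(Window.orderBy(F.desc('component'))) + (start_index - 1)
)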

how to calculate row mean before and after a given index for each row - pyspark?

I have a data frame with multiple columns and an index column, and I have to calculate the mean of those columns before and after the index.
This is my pandas code:
for i in range(len(res.index)):
    i = int(i)
    m = int(res['index'].ix[i])
    n = len(res.columns[1:m])
    if n == 0:
        res['mean'].ix[i] = 0
    else:
        res['mean'].ix[i] = int(res.ix[i, 1:m].sum()) / n
I want to do the same in pyspark.
Any help please!
You can calculate this using a UDF in pyspark. Here is an example:
from pyspark.sql import functions as F
from pyspark.sql import types as T
import numpy as np
sample_data = sqlContext.createDataFrame([
    list(range(10)) + [4],
    list(range(50, 60)) + [2],
    list(range(9, 19)) + [4],
    list(range(19, 29)) + [3],
], ["col_" + str(i) for i in range(10)] + ["index"])
sample_data.show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|index|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9| 4|
| 50| 51| 52| 53| 54| 55| 56| 57| 58| 59| 2|
| 9| 10| 11| 12| 13| 14| 15| 16| 17| 18| 4|
| 19| 20| 21| 22| 23| 24| 25| 26| 27| 28| 3|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
def def_mn(data, index, mean="pre"):
    if mean == "pre":
        return sum(data[:index]) / float(len(data[:index]))
    elif mean == "post":
        return sum(data[index:]) / float(len(data[index:]))
mn_udf = F.udf(def_mn)
sample_data.withColumn(
    "index_pre_mean",
    mn_udf(F.array([cl for cl in sample_data.columns[:-1]]), "index")
).withColumn(
    "index_post_mean",
    mn_udf(F.array([cl for cl in sample_data.columns[:-1]]), "index", F.lit("post"))
).show()
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
|col_0|col_1|col_2|col_3|col_4|col_5|col_6|col_7|col_8|col_9|index|index_pre_mean|index_post_mean|
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
|0 |1 |2 |3 |4 |5 |6 |7 |8 |9 |4 |1.5 |6.5 |
|50 |51 |52 |53 |54 |55 |56 |57 |58 |59 |2 |50.5 |55.5 |
|9 |10 |11 |12 |13 |14 |15 |16 |17 |18 |4 |10.5 |15.5 |
|19 |20 |21 |22 |23 |24 |25 |26 |27 |28 |3 |20.0 |25.0 |
+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+--------------+---------------+
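One caveat: F.udf without an explicit return type returns strings, so the two mean columns above come back as strings. Since types is already imported as T, the UDF can be declared with a double return type to keep the results numeric; a small adjustment, not part of the original snippet:
# return doubles instead of strings
mn_udf = F.udf(def_mn, T.DoubleType())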

Forward-fill missing data in PySpark not working

I have a simple dataset as shown below.
| id| name| country| languages|
|1 | Bob| USA| Spanish|
|2 | Angelina| France| null|
|3 | Carl| Brazil| null|
|4 | John| Australia| English|
|5 | Anne| Nepal| null|
I am trying to impute the null values in languages with the last non-null value, using pyspark.sql.window to create a window over certain rows, but nothing is happening. The column that is supposed to have the null values filled, temp_filled_spark, remains unchanged, i.e. it is a copy of the original languages column.
from pyspark.sql import Window
from pyspark.sql.functions import last
window = Window.partitionBy('name').orderBy('country').rowsBetween(-sys.maxsize, 0)
filled_column = last(df['languages'], ignorenulls=True).over(window)
df = df.withColumn('temp_filled_spark', filled_column)
df.orderBy('name', 'country').show(100)
I expect the output column to be:
|temp_filled_spark|
| Spanish|
| Spanish|
| Spanish|
| English|
| English|
Could anybody help point out the mistake?
We can create a window that treats the entire dataframe as one partition, like this:
import sys
from pyspark.sql import Window
from pyspark.sql import functions as F
>>> df1.show()
+---+--------+---------+---------+
| id| name| country|languages|
+---+--------+---------+---------+
| 1| Bob| USA| Spanish|
| 2|Angelina| France| null|
| 3| Carl| Brazil| null|
| 4| John|Australia| English|
| 5| Anne| Nepal| null|
+---+--------+---------+---------+
>>> w = Window.partitionBy(F.lit(1)).orderBy(F.lit(1)).rowsBetween(-sys.maxsize, 0)
>>> df1.select("*",F.last('languages',True).over(w).alias('newcol')).show()
+---+--------+---------+---------+-------+
| id| name| country|languages| newcol|
+---+--------+---------+---------+-------+
| 1| Bob| USA| Spanish|Spanish|
| 2|Angelina| France| null|Spanish|
| 3| Carl| Brazil| null|Spanish|
| 4| John|Australia| English|English|
| 5| Anne| Nepal| null|English|
+---+--------+---------+---------+-------+
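One caveat: ordering the window by F.lit(1) does not define a deterministic row order, so the fill is not guaranteed to follow the id order shown. If id reflects the intended order, using it in orderBy makes the forward fill reproducible; a minimal sketch under that assumption:
>>> w = Window.orderBy('id').rowsBetween(Window.unboundedPreceding, 0)
>>> df1.select("*", F.last('languages', True).over(w).alias('newcol')).show()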
Hope this helps!