getting the previous month value - pyspark

I'm trying to get the previous month's data. For that I am using the lag function, but I'm not getting the desired results.
ut  cntr  src  Item  section  Year  Period  css  fct          ytd_1      ytd_1*fct    approach1  approach2
49  52    179  f     84       2019  1       63   0.616580311  5578.092   3439.341699  0          0
e4  52    179  f     84       2019  1       31   0.248704663  5578.092   1387.297492  0          0
49  52    179  f     84       2019  1       31   0.248704663  5578.092   1387.297492  0          0
a5  52    179  f     84       2019  1       31   0.248704663  5578.092   1387.297492  0          0
49  52    179  f     84       2019  2       63   0.080405405  18506.982  1488.061391  3439.341   5578.092
49  52    179  f     84       2019  2       31   0.072297297  18506.982  1338.00478   1387.29    5578.092
e4  52    187  f     84       2019  2       31   0.072297297  18506.982  1338.00478   1387.29    5578.092
e4  52    179  f     84       2019  2       31   0.072297297  18506.982  1338.00478   1387.29    5578.092
Code:
w_lag = Window.partitionBy(['Item', 'section', 'css', 'Year']).orderBy(spark_func.asc('Period'))
df_lag = df_unit.withColumn('approach', spark_func.lead(df_unit['ytd_1']).over(w_lag))
Can I get help to obtain the previous month's values, as shown in the approach2 column (the expected results)?

Check if the below works for you.
First, create your DataFrame (Period 3 is added to validate the result, without worrying about the other columns).
l1 = [('49', 52, 179, 'f', 84, 2019, 1, 63, 0.616580311, 5578.092, 3439.341699),
      ('e4', 52, 179, 'f', 84, 2019, 1, 31, 0.248704663, 5578.092, 1387.297492),
      ('49', 52, 179, 'f', 84, 2019, 1, 31, 0.248704663, 5578.092, 1387.297492),
      ('a5', 52, 179, 'f', 84, 2019, 1, 31, 0.248704663, 5578.092, 1387.297492),
      ('49', 52, 179, 'f', 84, 2019, 2, 63, 0.080405405, 18506.982, 1488.061391),
      ('49', 52, 179, 'f', 84, 2019, 2, 31, 0.072297297, 18506.982, 1338.00478),
      ('e4', 52, 187, 'f', 84, 2019, 2, 31, 0.072297297, 18506.982, 1338.00478),
      ('e4', 52, 179, 'f', 84, 2019, 2, 31, 0.072297297, 18506.982, 1338.00478),
      ('e4', 52, 179, 'f', 84, 2019, 3, 31, 0.072297297, 10006.982, 1338.00478),
      ('e4', 52, 179, 'f', 84, 2019, 3, 31, 0.072297297, 10006.982, 1338.00478)]
Create the DataFrame:
dfl1 = spark.createDataFrame(l1).toDF('ut','cntr','src','Item','section','Year','Period','css','fct','ytd_1','ytd_1*fct')
dfl1.show()
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+
| ut|cntr|src|Item|section|Year|Period|css| fct| ytd_1| ytd_1*fct|
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+
| 49| 52|179| f| 84|2019| 1| 63|0.616580311| 5578.092|3439.341699|
| e4| 52|179| f| 84|2019| 1| 31|0.248704663| 5578.092|1387.297492|
| 49| 52|179| f| 84|2019| 1| 31|0.248704663| 5578.092|1387.297492|
| a5| 52|179| f| 84|2019| 1| 31|0.248704663| 5578.092|1387.297492|
| 49| 52|179| f| 84|2019| 2| 63|0.080405405|18506.982|1488.061391|
| 49| 52|179| f| 84|2019| 2| 31|0.072297297|18506.982| 1338.00478|
| e4| 52|187| f| 84|2019| 2| 31|0.072297297|18506.982| 1338.00478|
| e4| 52|179| f| 84|2019| 2| 31|0.072297297|18506.982| 1338.00478|
| e4| 52|179| f| 84|2019| 3| 31|0.072297297|10006.982| 1338.00478|
| e4| 52|179| f| 84|2019| 3| 31|0.072297297|10006.982| 1338.00478|
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+
Define the window. Here is the trick: with rangeBetween(-1, 0) the frame always covers the previous Period value as well as the current one, so first() picks up the first row of the previous Period when there is one.
Range description from the official documentation:
A range-based boundary is based on the actual value of the ORDER BY expression(s)
wl1 = Window.partitionBy(['Item','section','Year','css']).orderBy('Period').rangeBetween( -1, 0)
For Period 1 the first value in the frame is the row's own ytd_1, so a when() condition marks those rows as 0:
dfl2 = dfl1.withColumn(
    'Result',
    func.when(func.first(dfl1['ytd_1']).over(wl1) == dfl1['ytd_1'], func.lit(0))
        .otherwise(func.first(dfl1['ytd_1']).over(wl1))
)
dfl2.orderBy('Period').show()
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+---------+
| ut|cntr|src|Item|section|Year|Period|css| fct| ytd_1| ytd_1*fct| Result|
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+---------+
| e4| 52|179| f| 84|2019| 1| 31|0.248704663| 5578.092|1387.297492| 0.0|
| a5| 52|179| f| 84|2019| 1| 31|0.248704663| 5578.092|1387.297492| 0.0|
| 49| 52|179| f| 84|2019| 1| 63|0.616580311| 5578.092|3439.341699| 0.0|
| 49| 52|179| f| 84|2019| 1| 31|0.248704663| 5578.092|1387.297492| 0.0|
| 49| 52|179| f| 84|2019| 2| 63|0.080405405|18506.982|1488.061391| 5578.092|
| e4| 52|179| f| 84|2019| 2| 31|0.072297297|18506.982| 1338.00478| 5578.092|
| 49| 52|179| f| 84|2019| 2| 31|0.072297297|18506.982| 1338.00478| 5578.092|
| e4| 52|187| f| 84|2019| 2| 31|0.072297297|18506.982| 1338.00478| 5578.092|
| e4| 52|179| f| 84|2019| 3| 31|0.072297297|10006.982| 1338.00478|18506.982|
| e4| 52|179| f| 84|2019| 3| 31|0.072297297|10006.982| 1338.00478|18506.982|
+---+----+---+----+-------+----+------+---+-----------+---------+-----------+---------+
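As a side note, since ytd_1 has a single value per Period within each Item/section/Year/css group in this sample, a minimal alternative sketch (untested beyond the data above; keys, w_prev and dfl3 are names introduced here) is to lag over the de-duplicated Period-level rows and join the result back:
from pyspark.sql import functions as func
from pyspark.sql.window import Window

keys = ['Item', 'section', 'Year', 'css']
w_prev = Window.partitionBy(keys).orderBy('Period')

# One row per group and Period; lag ytd_1 to fetch the previous Period's value
prev = (dfl1.select(keys + ['Period', 'ytd_1']).distinct()
            .withColumn('Result', func.coalesce(func.lag('ytd_1').over(w_prev), func.lit(0)))
            .drop('ytd_1'))

# Join the previous-Period value back onto the detail rows
dfl3 = dfl1.join(prev, on=keys + ['Period'], how='left')
dfl3.orderBy('Period').show()
On the sample data the two approaches should agree; the rangeBetween version avoids the extra join.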

Related

PySpark generating consecutive increasing index for each window

I would like to generate a consecutively increasing index for each DataFrame window, where the index start point can be customized, say 212 in the following example.
INPUT:
+---+-------------+
| id| component|
+---+-------------+
| a|1047972020224|
| b|1047972020224|
| c|1047972020224|
| d| 670014898176|
| e| 670014898176|
| f| 146028888064|
| g| 146028888064|
+---+-------------+
EXPECTED OUTPUT:
+---+-------------+-----------------------------+
| id| component| partition_index|
+---+-------------+-----------------------------+
| a|1047972020224| 212|
| b|1047972020224| 212|
| c|1047972020224| 212|
| d| 670014898176| 213|
| e| 670014898176| 213|
| f| 146028888064| 214|
| g| 146028888064| 214|
+---+-------------+-----------------------------+
I'm not sure whether Window.partitionBy('component').rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing) can help with this problem. Any ideas?
You don't have any obvious partitioning here, so you can use dense_rank with an unpartitioned window and add 211 to the result. e.g.
from pyspark.sql import functions as F, Window
df2 = df.withColumn(
    'index',
    F.dense_rank().over(Window.orderBy(F.desc('component'))) + 211
)
df2.show()
+---+-------------+-----+
| id| component|index|
+---+-------------+-----+
| a|1047972020224| 212|
| b|1047972020224| 212|
| c|1047972020224| 212|
| d| 670014898176| 213|
| e| 670014898176| 213|
| f| 146028888064| 214|
| g| 146028888064| 214|
+---+-------------+-----+
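Since the question mentions that the start point can be customized, a small variant of the same idea keeps the offset explicit (start is a hypothetical variable introduced here). Note also that a window without partitionBy moves every row into a single partition, so Spark will log a performance warning on large data:
from pyspark.sql import functions as F, Window

start = 212  # desired first index value

df2 = df.withColumn(
    'partition_index',
    F.dense_rank().over(Window.orderBy(F.desc('component'))) + (start - 1)
)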

Weighted running total with conditional in Pyspark/Hive

I have product, brand, percentage and price columns. For each row, I want the sum of the percentage column over the rows above it, computed separately for rows with a different brand than the current row and for rows with the same brand, weighted by price: if a product above the current row is more expensive than the current row, its percentage should be down-weighted by multiplying it by 0.8. How can I do this in PySpark or with spark.sql? The answer without the weight multiplication is here.
import pandas as pd

df = pd.DataFrame({'a': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6'],
                   'brand': ['b1', 'b2', 'b1', 'b3', 'b2', 'b1'],
                   'pct': [40, 30, 10, 8, 7, 5],
                   'price': [0.6, 1, 0.5, 0.8, 1, 0.5]})
df = spark.createDataFrame(df)
What I am looking for:
product  brand  pct  pct_same_brand  pct_different_brand
a1       b1     40   null            null
a2       b2     30   null            40
a3       b1     10   32              30
a4       b3     8    null            80
a5       b2     7    24              58
a6       b1     5    40              45
Update:
I have added the data points below to help clarify the problem. As can be seen, the same row's pct can be multiplied by 0.8 in one row's calculation and by 1.0 in another's.
product  brand  pct  price  pct_same_brand  pct_different_brand
a1       b1     30   0.6    null            null
a2       b2     20   1.3    null            30
a3       b1     10   0.5    30*0.8          20
a4       b3     8    0.8    null            60
a5       b2     7    0.5    20*0.8          48
a6       b1     6    0.8    30*1 + 10*1     35
a7       b2     5    1.5    20*1 + 7*1      54
Update 2: In the data I provided above, the weight is the same number (0.8 or 1) for every row contributing to a given total, but it can also be mixed (0.8 for some of the contributing rows and 1 for others).
For example, in the data frame below, when computing the last row, the multiplier should be 0.8 for a6 and 1.0 for the rest of brand b1:
df = pd.DataFrame({'a': ['a1', 'a2', 'a3', 'a4', 'a5', 'a6', 'a7', 'a8', 'a9', 'a10'],
                   'brand': ['b1', 'b2', 'b1', 'b3', 'b2', 'b1', 'b2', 'b1', 'b1', 'b1'],
                   'pct': [30, 20, 10, 8, 7, 6, 5, 4, 3, 2],
                   'price': [0.6, 1.3, 0.5, 0.8, 0.5, 0.8, 1.5, 0.5, 0.65, 0.7]})
df = spark.createDataFrame(df)
You can add a weight column to facilitate calculation:
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    'weight',
    # 0.8 if this product is no more expensive than the previous same-brand product (ordered by pct desc)
    F.when(
        F.col('price') <= F.lag('price').over(
            Window.partitionBy('brand').orderBy(F.desc('pct'))
        ),
        0.8
    ).otherwise(1.0)
).withColumn(
    'pct_same_brand',
    # weighted running total of pct over the preceding rows of the same brand
    F.col('weight') * F.sum('pct').over(
        Window.partitionBy('brand')
              .orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    )
).withColumn(
    'pct_different_brand',
    # total pct of all preceding rows minus the unweighted same-brand total
    # (pct_same_brand / weight undoes the scaling applied above)
    F.sum('pct').over(
        Window.orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    ) - F.coalesce(F.col('pct_same_brand'), F.lit(0)) / F.col('weight')
)
df2.show()
+---+-----+---+-----+------+--------------+-------------------+
| a|brand|pct|price|weight|pct_same_brand|pct_different_brand|
+---+-----+---+-----+------+--------------+-------------------+
| a1| b1| 40| 0.6| 1.0| null| null|
| a2| b2| 30| 1.0| 1.0| null| 40.0|
| a3| b1| 10| 0.5| 0.8| 32.0| 30.0|
| a4| b3| 8| 0.8| 1.0| null| 80.0|
| a5| b2| 7| 1.0| 0.8| 24.0| 58.0|
| a6| b1| 5| 0.5| 0.8| 40.0| 45.0|
+---+-----+---+-----+------+--------------+-------------------+
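To sanity-check the a3 row in this output: pct_same_brand = 0.8 * 40 = 32, and pct_different_brand = (40 + 30) - 32 / 0.8 = 70 - 40 = 30, which matches the expected result; dividing by weight undoes the scaling so that only the unweighted same-brand total is subtracted.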
Output for the edited question:
+---+-----+---+-----+------+--------------+-------------------+
| a|brand|pct|price|weight|pct_same_brand|pct_different_brand|
+---+-----+---+-----+------+--------------+-------------------+
| a1| b1| 30| 0.6| 1.0| null| null|
| a2| b2| 20| 1.3| 1.0| null| 30.0|
| a3| b1| 10| 0.5| 0.8| 24.0| 20.0|
| a4| b3| 8| 0.8| 1.0| null| 60.0|
| a5| b2| 7| 0.5| 0.8| 16.0| 48.0|
| a6| b1| 6| 0.8| 1.0| 40.0| 35.0|
| a7| b2| 5| 1.5| 1.0| 27.0| 54.0|
+---+-----+---+-----+------+--------------+-------------------+
If anyone has a similar question, this worked for me.
Basically, I used an outer join of the dataframe with itself and assigned the weights; finally, I used window functions.
df_copy = (df.withColumnRenamed('a', 'asin')
             .withColumnRenamed('brand', 'brandd')
             .withColumnRenamed('pct', 'pct2')
             .withColumnRenamed('price', 'price2'))

df2 = df.join(df_copy, on=[df.brand == df_copy.brandd], how='outer').orderBy('brand')

df3 = df2.filter(~((df2.a == df2.asin) & (df2.brand == df2.brandd))
                 & (df2.pct <= df2.pct2))
df3 = df3.withColumn('weight', F.when(df3.price2 > df3.price, 0.8).otherwise(1))

df4 = (df3.groupBy(['a', 'brand', 'pct', 'price'])
          .agg(F.sum(df3.pct2 * df3.weight).alias('same_brand_pct')))

df5 = df.join(df4, on=['a', 'brand', 'pct', 'price'], how='left')

df6 = df5.withColumn(
    'pct_same_brand_unscaled',
    F.sum('pct').over(
        Window.partitionBy('brand')
              .orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    )
).withColumn(
    'pct_different_brand',
    F.sum('pct').over(
        Window.orderBy(F.desc('pct'))
              .rowsBetween(Window.unboundedPreceding, -1)
    ) - F.coalesce(F.col('pct_same_brand_unscaled'), F.lit(0))
).drop('pct_same_brand_unscaled')
gives:
+---+-----+---+-----+--------------+-------------------+
| a|brand|pct|price|same_brand_pct|pct_different_brand|
+---+-----+---+-----+--------------+-------------------+
| a1| b1| 30| 0.6| null| null|
| a2| b2| 20| 1.3| null| 30|
| a3| b1| 10| 0.5| 24.0| 20|
| a4| b3| 8| 0.8| null| 60|
| a5| b2| 7| 0.5| 16.0| 48|
| a6| b1| 6| 0.8| 40.0| 35|
| a7| b2| 5| 1.5| 27.0| 54|
| a8| b1| 4| 0.5| 38.8| 40|
| a9| b1| 3| 0.65| 48.8| 40|
|a10| b1| 2| 0.7| 51.8| 40|
+---+-----+---+-----+--------------+-------------------+

Set union over window in PySpark

I am trying to get the list of all unique values that appear for a certain group of IDs, over a window. In the data below, for each address I want to find all ids that share a common value and gather the set of all values that appear for that group of ids.
For example, with window.partitionBy('address'), for address 1 I see that ids A, B, C share the common value x. I understand they are connected and want to create a value_set with every value that corresponds to ids A, B, C, which is x, y, z.
id D does not share a common value with any other id, so its value_set includes only the values of id D.
My data
+-------+---+-----+
|address| id|value|
+-------+---+-----+
| 1| A| x|
| 1| A| y|
| 1| B| x|
| 1| C| x|
| 1| C| z|
| 1| D| v|
| 2| E| m|
| 2| E| n|
| 2| F| m|
| 2| F| p|
+-------+---+-----+
What I want
+-------+---+-----+---------+
|address| id|value|value_set|
+-------+---+-----+---------+
| 1| A| x| x,y,z|
| 1| A| y| x,y,z|
| 1| B| x| x,y,z|
| 1| C| x| x,y,z|
| 1| C| z| x,y,z|
| 1| D| v| v|
| 2| E| m| m,n,p|
| 2| E| n| m,n,p|
| 2| F| m| m,n,p|
| 2| F| p| m,n,p|
+-------+---+-----+---------+
something like this?
from pyspark.sql import Window
from pyspark.sql.functions import collect_set

dfn = df.withColumn('set_collect', collect_set(df.value).over(Window.partitionBy('address')))
Output is:
+-------+---+-----+------------+
|address| id|value| set_collect|
+-------+---+-----+------------+
| 1| A| x|[y, v, z, x]|
| 1| A| y|[y, v, z, x]|
| 1| B| x|[y, v, z, x]|
| 1| C| x|[y, v, z, x]|
| 1| C| z|[y, v, z, x]|
| 1| D| v|[y, v, z, x]|
| 2| E| m| [n, m, p]|
| 2| E| n| [n, m, p]|
| 2| F| m| [n, m, p]|
| 2| F| p| [n, m, p]|
+-------+---+-----+------------+
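The collect_set above gathers every value within an address, so id D also ends up with x, y and z. If the stricter grouping from the question is required (D keeping only v), it is essentially a connected-components problem. Below is a rough sketch only, assuming the optional GraphFrames package is installed, that ids and values are unique across addresses as in the sample, that df holds the data shown, and that the checkpoint path is a placeholder; value_set comes back as an array rather than a comma-separated string:
from graphframes import GraphFrame
from pyspark.sql import functions as F, Window

# connectedComponents() requires a checkpoint directory (path is a placeholder)
spark.sparkContext.setCheckpointDir('/tmp/graphframes-checkpoints')

# Vertices: ids and values, prefixed so an id can never collide with a value
vertices = (df.select(F.concat(F.lit('id_'), 'id').alias('id'))
              .union(df.select(F.concat(F.lit('val_'), 'value').alias('id')))
              .distinct())

# Edges: one edge per (id, value) row
edges = df.select(F.concat(F.lit('id_'), 'id').alias('src'),
                  F.concat(F.lit('val_'), 'value').alias('dst'))

cc = GraphFrame(vertices, edges).connectedComponents()

# Attach each id's component back to the rows and collect the values per component
result = (df.join(cc.select(F.col('id').alias('id_key'), 'component'),
                  F.concat(F.lit('id_'), df['id']) == F.col('id_key'))
            .withColumn('value_set', F.collect_set('value').over(Window.partitionBy('component')))
            .drop('id_key', 'component'))
result.show()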

Find continuous data in pyspark dataframe

I have a dataframe that looks like
key | value | time | status
x | 10 | 0 | running
x | 15 | 1 | running
x | 30 | 2 | running
x | 15 | 3 | running
x | 0 | 4 | stop
x | 40 | 5 | running
x | 10 | 6 | running
y | 10 | 0 | running
y | 15 | 1 | running
y | 30 | 2 | running
y | 15 | 3 | running
y | 0 | 4 | stop
y | 40 | 5 | running
y | 10 | 6 | running
...
I want to end up with a table that looks like
key | start | end | status | max value
x | 0 | 3 | running| 30
x | 4 | 4 | stop | 0
x | 5 | 6 | running| 40
y | 0 | 3 | running| 30
y | 4 | 4 | stop | 0
y | 5 | 6 | running| 40
...
In other words, I want to partition by key, sort by time, split the rows into windows that share the same status, keep the first and last time of each window, and compute an aggregate over that window (e.g. the max of value).
Using PySpark ideally.
Here is one approach you can take.
First create a column to determine if the status has changed for a given key:
from pyspark.sql.functions import col, lag
from pyspark.sql import Window
w = Window.partitionBy("key").orderBy("time")
df = df.withColumn(
    "status_change",
    (col("status") != lag("status").over(w)).cast("int")
)
df.show()
#+---+-----+----+-------+-------------+
#|key|value|time| status|status_change|
#+---+-----+----+-------+-------------+
#| x| 10| 0|running| null|
#| x| 15| 1|running| 0|
#| x| 30| 2|running| 0|
#| x| 15| 3|running| 0|
#| x| 0| 4| stop| 1|
#| x| 40| 5|running| 1|
#| x| 10| 6|running| 0|
#| y| 10| 0|running| null|
#| y| 15| 1|running| 0|
#| y| 30| 2|running| 0|
#| y| 15| 3|running| 0|
#| y| 0| 4| stop| 1|
#| y| 40| 5|running| 1|
#| y| 10| 6|running| 0|
#+---+-----+----+-------+-------------+
Next fill the nulls with 0 and take the cumulative sum of the status_change column, per key:
from pyspark.sql.functions import sum as sum_ # avoid shadowing builtin
df = df.fillna(0).withColumn(
    "status_group",
    sum_("status_change").over(w)
)
df.show()
#+---+-----+----+-------+-------------+------------+
#|key|value|time| status|status_change|status_group|
#+---+-----+----+-------+-------------+------------+
#| x| 10| 0|running| 0| 0|
#| x| 15| 1|running| 0| 0|
#| x| 30| 2|running| 0| 0|
#| x| 15| 3|running| 0| 0|
#| x| 0| 4| stop| 1| 1|
#| x| 40| 5|running| 1| 2|
#| x| 10| 6|running| 0| 2|
#| y| 10| 0|running| 0| 0|
#| y| 15| 1|running| 0| 0|
#| y| 30| 2|running| 0| 0|
#| y| 15| 3|running| 0| 0|
#| y| 0| 4| stop| 1| 1|
#| y| 40| 5|running| 1| 2|
#| y| 10| 6|running| 0| 2|
#+---+-----+----+-------+-------------+------------+
Now you can aggregate over the key and status_group. You can also include status in the groupBy since it will be the same for each status_group. Finally select only the columns you want in your output.
from pyspark.sql.functions import min as min_, max as max_
df_agg = df.groupBy("key", "status", "status_group")\
    .agg(
        min_("time").alias("start"),
        max_("time").alias("end"),
        max_("value").alias("max_value")
    )\
    .select("key", "start", "end", "status", "max_value")\
    .sort("key", "start")
df_agg.show()
#+---+-----+---+-------+---------+
#|key|start|end| status|max_value|
#+---+-----+---+-------+---------+
#| x| 0| 3|running| 30|
#| x| 4| 4| stop| 0|
#| x| 5| 6|running| 40|
#| y| 0| 3|running| 30|
#| y| 4| 4| stop| 0|
#| y| 5| 6|running| 40|
#+---+-----+---+-------+---------+
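For reference, the three steps above compose into a single chain; this is just a sketch of the same logic, assuming the column names from the question:
from pyspark.sql import functions as F, Window

w = Window.partitionBy("key").orderBy("time")

df_agg = (
    df.withColumn("status_change",
                  (F.col("status") != F.lag("status").over(w)).cast("int"))
      .fillna(0, subset=["status_change"])
      .withColumn("status_group", F.sum("status_change").over(w))
      .groupBy("key", "status", "status_group")
      .agg(F.min("time").alias("start"),
           F.max("time").alias("end"),
           F.max("value").alias("max_value"))
      .select("key", "start", "end", "status", "max_value")
      .sort("key", "start")
)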

Add a column in PySpark which assigns a group number to the corresponding rows

I have a dataframe:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('').getOrCreate()
df = spark.createDataFrame([("a", "65"), ("b", "23"), ("c", "65"), ("d", "23"),
                            ("a", "66"), ("b", "46"), ("c", "23"), ("d", "66"),
                            ("b", "5"), ("b", "3"), ("c", "3")], ["column2", "value"])
df.show()
+-------+-----+
|column2|value|
+-------+-----+
| a| 65 |
| b| 23 |
| c| 65 |
| d| 23 |
| a| 66 |
| b| 46 |
| c| 23 |
| d| 66 |
| b| 5 |
| b| 3 |
| c| 3 |
+-------+-----+
I want to make every 4 rows one group, and then create a new column that assigns the group number to the corresponding rows. The desired output is as follows:
+-------+-----+------+
|column2|value|gr_val|
+-------+-----+------+
| a| 65 | 1 |
| b| 23 | 1 |
| c| 65 | 1 |
| d| 23 | 1 |
| a| 66 | 2 |
| b| 46 | 2 |
| c| 23 | 2 |
| d| 66 | 2 |
| b| 5 | 3 |
| b| 3 | 3 |
| c| 3 | 3 |
+-------+-----+------+
I would appreciate any help!
Try this approach:
(1) Create a new column (row_num) holding a sequentially increasing number for each row. lit('a') is used as a static value so that row_number() is computed over a single partition.
(2) Divide row_num by the number of records you want in each group (e.g. 4) and take the ceiling. ceil returns the smallest integer not less than the value.
Here is a detailed example:
from pyspark.sql.functions import lit, row_number
from pyspark.sql.window import Window

w = Window.partitionBy(lit('a')).orderBy(lit('a'))

df.withColumn("row_num", row_number().over(w))\
  .selectExpr('column2 AS column2', 'value AS value', 'ceil(row_num/4) AS gr_val')\
  .show()
#+-------+-----+------+
#|column2|value|gr_val|
#+-------+-----+------+
#| a| 65| 1|
#| b| 23| 1|
#| c| 65| 1|
#| d| 23| 1|
#| a| 66| 2|
#| b| 46| 2|
#| c| 23| 2|
#| d| 66| 2|
#| b| 5| 3|
#| b| 3| 3|
#| c| 3| 3|
#+-------+-----+------+
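One caveat: partitionBy(lit('a')) puts every row into a single window partition, so the numbering runs on one task. If that becomes a bottleneck, a possible alternative sketch (assuming the DataFrame's current row order is the order you want to number by; row_pos is a name introduced here) uses zipWithIndex on the underlying RDD:
from pyspark.sql import functions as F

# zipWithIndex assigns 0-based positions without a global window;
# the group number is then floor(position / 4) + 1
indexed = df.rdd.zipWithIndex().map(lambda pair: pair[0] + (pair[1],))
df_idx = indexed.toDF(df.columns + ["row_pos"])
df_grouped = (df_idx
              .withColumn("gr_val", F.floor(F.col("row_pos") / 4) + 1)
              .drop("row_pos"))
df_grouped.show()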