pct_change function in PySpark - pyspark

I need to calculate the percentage change of the values in the value column, for each ID separately, in a new pct_change column. Example df below.
Some sources on the internet say there is a pct_change() function available in PySpark 2.4+ which would make this easy, but I am on 3.0.1 and I am not able to import it from pyspark.sql.functions.
ID value pct_change
1 1 nan
1 2 1
1 4 1
2 1 nan
2 1 0
2 0.5 -0.5
3 5 nan
3 5 0
3 7 0.4

Use a Window function in PySpark.
Code and logic below:
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, monotonically_increasing_id

w = Window.partitionBy('ID').orderBy('index')  # .rowsBetween(-1, 0)

(df.withColumn('index', monotonically_increasing_id())               # create an index to order by
   .withColumn('pct_change', (col('value') - lag('value').over(w))   # change between consecutive rows
                             / lag('value').over(w))                 # divided by the previous value
   .drop('index')                                                    # drop the ordering column
).show()
+---+-----+----------+
| ID|value|pct_change|
+---+-----+----------+
| 1| 1.0| null|
| 1| 2.0| 1.0|
| 1| 4.0| 1.0|
| 2| 1.0| null|
| 2| 1.0| 0.0|
| 2| 0.5| -0.5|
| 3| 5.0| null|
| 3| 5.0| 0.0|
| 3| 7.0| 0.4|
+---+-----+----------+
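Note that (value - previous) / previous is the same as value / previous - 1, so the column can also be written a bit more compactly. A minimal sketch of that equivalent form, using the same window trick as above:

from pyspark.sql import Window
from pyspark.sql.functions import col, lag, monotonically_increasing_id

# pct_change written as value / previous value - 1
w = Window.partitionBy('ID').orderBy('index')
(df.withColumn('index', monotonically_increasing_id())
   .withColumn('pct_change', col('value') / lag('value').over(w) - 1)
   .drop('index')
).show()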

Related

Spark dataframe - transform rows with same ID to columns

I want to transform the source dataframe below (using PySpark):
Key  ID  segment
1    A   m1
2    A   m1
3    B   m1
4    C   m2
1    D   m1
2    E   m1
3    F   m1
4    G   m2
1    J   m1
2    J   m1
3    J   m1
4    K   m2
Into the result dataframe below:
ID  key1  key2
A   1     2
B   3     -
C   4     -
D   1     -
F   3     -
G   4     -
J   1     2
J   1     3
J   2     3
K   4     -
In other words: I want to highlight the "pairs" in the dataframe. If there is more than one key for the same ID, I would like to show each relation on a different line.
Thank you for your help
Use window functions. I assume - means a single-member group. If not, you can use a when/otherwise condition to blank the 1s out (see the sketch after the output below).
import sys
from pyspark.sql import Window, functions as F
from pyspark.sql.functions import col, desc, lag

w = Window.partitionBy('ID').orderBy(desc('Key'))
df = (df.withColumn('key2', lag('segment').over(w))   # value of the preceding segment for each row
        .withColumn('key2', col('key2').isNotNull())  # boolean: does a preceding row exist?
        .withColumn('key2', F.sum(F.col('key2').cast('integer')).over(w.rowsBetween(Window.currentRow, sys.maxsize)) + 1)  # cumulative groups
        .orderBy('ID', 'key')                          # reorder the frame
     )
df.show()
+---+---+-------+----+
|Key| ID|segment|key2|
+---+---+-------+----+
| 1| A| m1| 2|
| 2| A| m1| 2|
| 3| B| m1| 1|
| 4| C| m2| 1|
| 1| D| m1| 1|
| 2| E| m1| 1|
| 3| F| m1| 1|
| 4| G| m2| 1|
| 1| J| m1| 2|
| 2| J| m1| 3|
| 3| J| m1| 3|
| 4| K| m2| 1|
+---+---+-------+----+
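If the one-key IDs should show a blank instead of a group number, a minimal when/otherwise sketch on the frame produced above (where those rows end up with key2 == 1) could be:

from pyspark.sql import functions as F

# blank out key2 for the one-key IDs; with no otherwise() the 1s become null
df_blanked = df.withColumn('key2', F.when(F.col('key2') > 1, F.col('key2')))
df_blanked.show()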

Conditional values based on the amount of rows for each id

Table:
id
1
1
1
2
2
3
3
3
3
3
Here is the logic that needs to be applied:
For any ID with >= 3 rows, the weight should be 1.15. Otherwise, the weight should be 1.0.
Updated Table:
id weight
1 1.15
1 1.15
1 1.15
2 1.0
2 1.0
3 1.15
3 1.15
3 1.15
3 1.15
3 1.15
Now, my first and only thought was:
Leverage groupBy to get counts for each ID
Merge this new dataframe with my original table.
Add a new column using a when clause.
I'm rather new to Scala so I'm not sure if there is a more efficient way to do this.
Performing a groupBy would require a table join (or re-flattening via explode) afterwards to assign the weights you need. It is easier to just assign the weights based on the count by "id" over a Window partition:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(1, 1, 1, 2, 2, 3, 3, 3, 3, 3).toDF("id")

df.withColumn("count", count("*").over(Window.partitionBy("id")))
  .withColumn("weight", when($"count" >= 3, 1.15).otherwise(1.0))
  .show
+---+-----+------+
| id|count|weight|
+---+-----+------+
| 1| 3| 1.15|
| 1| 3| 1.15|
| 1| 3| 1.15|
| 3| 5| 1.15|
| 3| 5| 1.15|
| 3| 5| 1.15|
| 3| 5| 1.15|
| 3| 5| 1.15|
| 2| 2| 1.0|
| 2| 2| 1.0|
+---+-----+------+
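For comparison, the groupBy-and-join route described in the question would look roughly like this. This is only a sketch, written in PySpark (like the first question on this page) rather than Scala; it produces the same weights but needs the extra join that the window version avoids:

from pyspark.sql import functions as F

# count rows per id, join the counts back onto the original rows, then apply the threshold
counts = df.groupBy('id').agg(F.count('*').alias('cnt'))
weighted = (df.join(counts, on='id')
              .withColumn('weight', F.when(F.col('cnt') >= 3, F.lit(1.15)).otherwise(F.lit(1.0)))
              .drop('cnt'))
weighted.show()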

Scala Spark use Window function to find max value

I have a data set that looks like this:
+------------------------+-----+
| timestamp| zone|
+------------------------+-----+
| 2019-01-01 00:05:00 | A|
| 2019-01-01 00:05:00 | A|
| 2019-01-01 00:05:00 | B|
| 2019-01-01 01:05:00 | C|
| 2019-01-01 02:05:00 | B|
| 2019-01-01 02:05:00 | B|
+------------------------+-----+
For each hour I need to count which zone had the most rows and end up with a table that looks like this:
+-----+-----+-----+
| hour| zone| max |
+-----+-----+-----+
| 0| A| 2|
| 1| C| 1|
| 2| B| 2|
+-----+-----+-----+
My instructions say that I need to use the Window function along with "group by" to find my max count.
I've tried a few things but I'm not sure if I'm close. Any help would be appreciated.
You can use two subsequent window functions to get your result:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df
  .withColumn("hour", hour($"timestamp"))
  .withColumn("cnt", count("*").over(Window.partitionBy($"hour", $"zone")))
  .withColumn("rnb", row_number().over(Window.partitionBy($"hour").orderBy($"cnt".desc)))
  .where($"rnb" === 1)
  .select($"hour", $"zone", $"cnt".as("max"))
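If you are following along in PySpark instead (as in the first question on this page), a rough equivalent of this two-window approach could look like the sketch below, assuming the same timestamp and zone columns:

from pyspark.sql import Window
from pyspark.sql import functions as F

cnt_w = Window.partitionBy('hour', 'zone')
rnb_w = Window.partitionBy('hour').orderBy(F.col('cnt').desc())

(df.withColumn('hour', F.hour('timestamp'))      # extract the hour
   .withColumn('cnt', F.count('*').over(cnt_w))  # rows per (hour, zone)
   .withColumn('rnb', F.row_number().over(rnb_w))
   .where(F.col('rnb') == 1)
   .select('hour', 'zone', F.col('cnt').alias('max'))
).show()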
You can use window functions and group by with DataFrames.
In your case you could use the rank() over (partition by) window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// first, group by hour and zone
val df_group = data_tms
  .select(hour(col("timestamp")).as("hour"), col("zone"))
  .groupBy(col("hour"), col("zone"))
  .agg(count("zone").as("max"))

// second, rank within each hour, ordering by max in descending order
val df_rank = df_group
  .select(col("hour"),
          col("zone"),
          col("max"),
          rank().over(Window.partitionBy(col("hour")).orderBy(col("max").desc)).as("rank"))

// keep only the rows with rank = 1
df_rank
  .select(col("hour"),
          col("zone"),
          col("max"))
  .where(col("rank") === 1)
  .orderBy(col("hour"))
  .show()
/*
+----+----+---+
|hour|zone|max|
+----+----+---+
| 0| A| 2|
| 1| C| 1|
| 2| B| 2|
+----+----+---+
*/

Pyspark Crosstab Pivot Challenge / Problem

I unfortunately could not find a solution for my exact problem. It is related to pivot and crosstab but I could not solve it with these functions.
I have the feeling I am missing an intermediate table, but I somehow cannot come up with a solution.
Problem description:
A table with customers indicating from which category they have bought a product. If a customer bought a product from a category, the category ID is shown next to their name.
There are 4 categories (1-4) and 3 customers (A, B, C).
+--------+----------+
|customer| category |
+--------+----------+
| A| 1|
| A| 2|
| A| 3|
| B| 1|
| B| 4|
| C| 1|
| C| 3|
| C| 4|
+--------+----------+
The table is DISTINCT, meaning each combination of customer and category appears only once.
What I want is a crosstab by category where I can easily read, e.g., how many of those who bought from category 1 also bought from category 4.
Desired results table:
+--------+---+---+---+---+
| | 1 | 2 | 3 | 4 |
+--------+---+---+---+---+
| 1| 3| 1| 2| 2|
| 2| 1| 1| 1| 0|
| 3| 2| 1| 2| 1|
| 4| 2| 0| 1| 1|
+--------+---+---+---+---+
Reading examples:
row1 column1 : total number of customers who bought product 1 (A, B, C)
row1 column2 : number of customers who bought product 1 and 2 (A)
row1 column3 : number of customers who bought product 1 and 3 (A, C)
etc.
As you can see the table is mirrored by its diagonal.
Any suggestions on how to create the desired table?
Additional challenge:
How to get the results as %?
For the first row the results would then be: | 100% | 33% | 66% | 66% |
Many thanks in advance!
You can join the input data with itself, using customer as the join criterion. This returns all combinations of categories that exist for a given customer. After that you can use crosstab to get the result.
df2 = df.withColumnRenamed("category", "cat1") \
    .join(df.withColumnRenamed("category", "cat2"), "customer") \
    .crosstab("cat1", "cat2") \
    .orderBy("cat1_cat2")
df2.show()
Output:
+---------+---+---+---+---+
|cat1_cat2| 1| 2| 3| 4|
+---------+---+---+---+---+
| 1| 3| 1| 2| 2|
| 2| 1| 1| 1| 0|
| 3| 2| 1| 2| 1|
| 4| 2| 0| 1| 2|
+---------+---+---+---+---+
To get the relative frequency you can sum over each row and then divide each element by this sum.
from pyspark.sql import functions as F

df2.withColumn("sum", sum(df2[col] for col in df2.columns if col != "cat1_cat2")) \
    .select("cat1_cat2", *(F.round(df2[col] / F.col("sum"), 2).alias(col) for col in df2.columns if col != "cat1_cat2")) \
    .show()
Output:
+---------+----+----+----+----+
|cat1_cat2| 1| 2| 3| 4|
+---------+----+----+----+----+
| 1|0.38|0.13|0.25|0.25|
| 2|0.33|0.33|0.33| 0.0|
| 3|0.33|0.17|0.33|0.17|
| 4| 0.4| 0.0| 0.2| 0.4|
+---------+----+----+----+----+
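Note that the question's example row (| 100% | 33% | 66% | 66% |) divides by the diagonal entry (the number of customers who bought that row's category) rather than by the row sum. A hedged sketch of that variant, reusing df2 from above (rounding will differ slightly, e.g. 67 instead of 66):

from pyspark.sql import functions as F

cat_cols = [c for c in df2.columns if c != "cat1_cat2"]

# the diagonal entry of a row is the count in the column whose name equals cat1_cat2
diag = F.coalesce(*[F.when(F.col("cat1_cat2") == c, F.col(c)) for c in cat_cols])

df2.select(
    "cat1_cat2",
    *[F.round(100 * F.col(c) / diag, 0).alias(c) for c in cat_cols]
).show()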

Spark withColumn: create a column duplicating values from an existing column

I am having a problem figuring this out. Here is the problem statement.
Let's say I have a dataframe. I want to select the value of column C where column B is foo, create a new column D, and repeat that value "3" for all rows:
+---+----+---+
| A| B| C|
+---+----+---+
| 4|blah| 2|
| 2| | 3|
| 56| foo| 3|
|100|null| 5|
+---+----+---+
I want it to become:
+---+----+---+-----+
| A| B| C| D |
+---+----+---+-----+
| 4|blah| 2| 3 |
| 2| | 3| 3 |
| 56| foo| 3| 3 |
|100|null| 5| 3 |
+---+----+---+-----+
You will have to extract the column C value, i.e. 3, from the row where column B is foo:
import org.apache.spark.sql.functions._
val value = df.filter(col("B") === "foo").select("C").first()(0)
Then use that value with withColumn to create a new column D, using the lit function:
df.withColumn("D", lit(value)).show(false)
You should get your desired output.
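Since most of this page is PySpark, a rough PySpark version of the same idea might look like this (a sketch, assuming the same column names):

from pyspark.sql import functions as F

# take the C value from the first row where B == 'foo', then repeat it as a literal column D
value = df.filter(F.col('B') == 'foo').select('C').first()[0]
df.withColumn('D', F.lit(value)).show()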