pct_change function in PySpark - pyspark

I need to calculate the percentage change of the values in the value column, for each ID separately, in a new pct_change column. Example df below.
Some sources on the internet say there is a pct_change() function available in PySpark 2.4+ which would make this easy, but I am on 3.0.1 and I am not able to import it from pyspark.sql.functions.
ID value pct_change
1 1 nan
1 2 1
1 4 1
2 1 nan
2 1 0
2 0.5 -0.5
3 5 nan
3 5 0
3 7 0.4

Use a Window function in PySpark.
Code and logic below:
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, monotonically_increasing_id

w = Window.partitionBy('ID').orderBy('index')  # .rowsBetween(-1, 0)

(df.withColumn('index', monotonically_increasing_id())               # create an index to order by
   .withColumn('pct_change', (col('value') - lag('value').over(w))   # change between consecutive rows
                             / lag('value').over(w))                 # divided by the previous value
   .drop('index')                                                    # drop the ordering column
).show()
+---+-----+----------+
| ID|value|pct_change|
+---+-----+----------+
| 1| 1.0| null|
| 1| 2.0| 1.0|
| 1| 4.0| 1.0|
| 2| 1.0| null|
| 2| 1.0| 0.0|
| 2| 0.5| -0.5|
| 3| 5.0| null|
| 3| 5.0| 0.0|
| 3| 7.0| 0.4|
+---+-----+----------+
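Note that (value - previous) / previous is the same as value / previous - 1, so the column can also be written a bit more compactly. A minimal sketch of that equivalent form, using the same window trick as above:

from pyspark.sql import Window
from pyspark.sql.functions import col, lag, monotonically_increasing_id

# pct_change written as value / previous value - 1
w = Window.partitionBy('ID').orderBy('index')
(df.withColumn('index', monotonically_increasing_id())
   .withColumn('pct_change', col('value') / lag('value').over(w) - 1)
   .drop('index')
).show()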

Related

Spark dataframe - transform rows with same ID to columns

I want to transform the source dataframe below (using PySpark):
Key  ID  segment
1    A   m1
2    A   m1
3    B   m1
4    C   m2
1    D   m1
2    E   m1
3    F   m1
4    G   m2
1    J   m1
2    J   m1
3    J   m1
4    K   m2
Into the result dataframe below:
ID  key1  key2
A   1     2
B   3     -
C   4     -
D   1     -
F   3     -
G   4     -
J   1     2
J   1     3
J   2     3
K   4     -
In other words: I want to highlight the "pairs" in the dataframe. If there is more than one key for the same ID, I would like to show each relation on a different line.
Thank you for your help
Use window functions. I assume - means a single-member group. If not, you can use a when/otherwise condition to blank the 1s out (see the sketch after the output below).
import sys
from pyspark.sql import Window, functions as F
from pyspark.sql.functions import col, desc, lag

w = Window.partitionBy('ID').orderBy(desc('Key'))
df = (df.withColumn('key2', lag('segment').over(w))   # value of the preceding segment for each row
        .withColumn('key2', col('key2').isNotNull())  # boolean: does a preceding row exist?
        .withColumn('key2', F.sum(F.col('key2').cast('integer')).over(w.rowsBetween(Window.currentRow, sys.maxsize)) + 1)  # cumulative groups
        .orderBy('ID', 'key')                          # reorder the frame
     )
df.show()
+---+---+-------+----+
|Key| ID|segment|key2|
+---+---+-------+----+
| 1| A| m1| 2|
| 2| A| m1| 2|
| 3| B| m1| 1|
| 4| C| m2| 1|
| 1| D| m1| 1|
| 2| E| m1| 1|
| 3| F| m1| 1|
| 4| G| m2| 1|
| 1| J| m1| 2|
| 2| J| m1| 3|
| 3| J| m1| 3|
| 4| K| m2| 1|
+---+---+-------+----+
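If the one-key IDs should show a blank instead of a group number, a minimal when/otherwise sketch on the frame produced above (where those rows end up with key2 == 1) could be:

from pyspark.sql import functions as F

# blank out key2 for the one-key IDs; with no otherwise() the 1s become null
df_blanked = df.withColumn('key2', F.when(F.col('key2') > 1, F.col('key2')))
df_blanked.show()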

Conditional values based on the amount of rows for each id

Table:
id
1
1
1
2
2
3
3
3
3
3
Here is the logic that needs to be applied:
For any ID with >= 3 rows, the weight should be 1.15. Otherwise, the weight should be 1.0.
Updated Table:
id weight
1 1.15
1 1.15
1 1.15
2 1.0
2 1.0
3 1.15
3 1.15
3 1.15
3 1.15
3 1.15
Now, my first and only thought was:
Leverage groupBy to get counts for each ID
Merge this new dataframe with my original table.
Add a new column using a when clause.
I'm rather new to Scala so I'm not sure if there is a more efficient way to do this.
Performing a groupBy would require a table join (or re-flattening via explode) afterwards to assign the weights you need. It is easier to just assign the weights based on the count by "id" over a Window partition:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(1, 1, 1, 2, 2, 3, 3, 3, 3, 3).toDF("id")

df.withColumn("count", count("*").over(Window.partitionBy("id")))
  .withColumn("weight", when($"count" >= 3, 1.15).otherwise(1.0))
  .show
+---+-----+------+
| id|count|weight|
+---+-----+------+
| 1| 3| 1.15|
| 1| 3| 1.15|
| 1| 3| 1.15|
| 3| 5| 1.15|
| 3| 5| 1.15|
| 3| 5| 1.15|
| 3| 5| 1.15|
| 3| 5| 1.15|
| 2| 2| 1.0|
| 2| 2| 1.0|
+---+-----+------+
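For comparison, the groupBy-and-join route described in the question would look roughly like this. This is only a sketch, written in PySpark (like the first question on this page) rather than Scala; it produces the same weights but needs the extra join that the window version avoids:

from pyspark.sql import functions as F

# count rows per id, join the counts back onto the original rows, then apply the threshold
counts = df.groupBy('id').agg(F.count('*').alias('cnt'))
weighted = (df.join(counts, on='id')
              .withColumn('weight', F.when(F.col('cnt') >= 3, F.lit(1.15)).otherwise(F.lit(1.0)))
              .drop('cnt'))
weighted.show()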

Scala Spark use Window function to find max value

I have a data set that looks like this:
+------------------------+-----+
| timestamp| zone|
+------------------------+-----+
| 2019-01-01 00:05:00 | A|
| 2019-01-01 00:05:00 | A|
| 2019-01-01 00:05:00 | B|
| 2019-01-01 01:05:00 | C|
| 2019-01-01 02:05:00 | B|
| 2019-01-01 02:05:00 | B|
+------------------------+-----+
For each hour I need to count which zone had the most rows and end up with a table that looks like this:
+-----+-----+-----+
| hour| zone| max |
+-----+-----+-----+
| 0| A| 2|
| 1| C| 1|
| 2| B| 2|
+-----+-----+-----+
My instructions say that I need to use the Window function along with "group by" to find my max count.
I've tried a few things but I'm not sure if I'm close. Any help would be appreciated.
You can use two subsequent window functions to get your result:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

df
  .withColumn("hour", hour($"timestamp"))
  .withColumn("cnt", count("*").over(Window.partitionBy($"hour", $"zone")))
  .withColumn("rnb", row_number().over(Window.partitionBy($"hour").orderBy($"cnt".desc)))
  .where($"rnb" === 1)
  .select($"hour", $"zone", $"cnt".as("max"))
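If you are following along in PySpark instead (as in the first question on this page), a rough equivalent of this two-window approach could look like the sketch below, assuming the same timestamp and zone columns:

from pyspark.sql import Window
from pyspark.sql import functions as F

cnt_w = Window.partitionBy('hour', 'zone')
rnb_w = Window.partitionBy('hour').orderBy(F.col('cnt').desc())

(df.withColumn('hour', F.hour('timestamp'))      # extract the hour
   .withColumn('cnt', F.count('*').over(cnt_w))  # rows per (hour, zone)
   .withColumn('rnb', F.row_number().over(rnb_w))
   .where(F.col('rnb') == 1)
   .select('hour', 'zone', F.col('cnt').alias('max'))
).show()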
You can use window functions and group by with DataFrames.
In your case you could use the rank() over (partition by) window function.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// first, group by hour and zone
val df_group = data_tms
  .select(hour(col("timestamp")).as("hour"), col("zone"))
  .groupBy(col("hour"), col("zone"))
  .agg(count("zone").as("max"))

// second, rank within each hour, ordering by max in descending order
val df_rank = df_group
  .select(col("hour"),
          col("zone"),
          col("max"),
          rank().over(Window.partitionBy(col("hour")).orderBy(col("max").desc)).as("rank"))

// keep only the rows with rank = 1
df_rank
  .select(col("hour"),
          col("zone"),
          col("max"))
  .where(col("rank") === 1)
  .orderBy(col("hour"))
  .show()
/*
+----+----+---+
|hour|zone|max|
+----+----+---+
| 0| A| 2|
| 1| C| 1|
| 2| B| 2|
+----+----+---+
*/

Pyspark Crosstab Pivot Challenge / Problem

I unfortunately could not find a solution for my exact problem. It is related to pivot and crosstab but I could not solve it with these functions.
I have the feeling I am missing an intermediate table, but I somehow cannot come up with a solution.
Problem description:
A table with customers indicating from which category they have bought a product. If a customer bought a product from a category, the category ID is shown next to their name.
There are 4 categories (1-4) and 3 customers (A, B, C).
+--------+----------+
|customer| category |
+--------+----------+
| A| 1|
| A| 2|
| A| 3|
| B| 1|
| B| 4|
| C| 1|
| C| 3|
| C| 4|
+--------+----------+
The table is DISTINCT, meaning each combination of customer and category appears only once.
What I want is a crosstab by category where I can easily read, e.g., how many of those who bought from category 1 also bought from category 4.
Desired results table:
+--------+---+---+---+---+
| | 1 | 2 | 3 | 4 |
+--------+---+---+---+---+
| 1| 3| 1| 2| 2|
| 2| 1| 1| 1| 0|
| 3| 2| 1| 2| 1|
| 4| 2| 0| 1| 1|
+--------+---+---+---+---+
Reading examples:
row1 column1 : total number of customers who bought product 1 (A, B, C)
row1 column2 : number of customers who bought product 1 and 2 (A)
row1 column3 : number of customers who bought product 1 and 3 (A, C)
etc.
As you can see the table is mirrored by its diagonal.
Any suggestions on how to create the desired table?
Additional challenge:
How to get the results as %?
For the first row the results would then be: | 100% | 33% | 66% | 66% |
Many thanks in advance!
You can join the input data with itself, using customer as the join criterion. This returns all combinations of categories that exist for a given customer. After that you can use crosstab to get the result.
df2 = df.withColumnRenamed("category", "cat1") \
    .join(df.withColumnRenamed("category", "cat2"), "customer") \
    .crosstab("cat1", "cat2") \
    .orderBy("cat1_cat2")
df2.show()
Output:
+---------+---+---+---+---+
|cat1_cat2| 1| 2| 3| 4|
+---------+---+---+---+---+
| 1| 3| 1| 2| 2|
| 2| 1| 1| 1| 0|
| 3| 2| 1| 2| 1|
| 4| 2| 0| 1| 2|
+---------+---+---+---+---+
To get the relative frequency you can sum over each row and then divide each element by this sum.
from pyspark.sql import functions as F

df2.withColumn("sum", sum(df2[col] for col in df2.columns if col != "cat1_cat2")) \
    .select("cat1_cat2", *(F.round(df2[col] / F.col("sum"), 2).alias(col) for col in df2.columns if col != "cat1_cat2")) \
    .show()
Output:
+---------+----+----+----+----+
|cat1_cat2| 1| 2| 3| 4|
+---------+----+----+----+----+
| 1|0.38|0.13|0.25|0.25|
| 2|0.33|0.33|0.33| 0.0|
| 3|0.33|0.17|0.33|0.17|
| 4| 0.4| 0.0| 0.2| 0.4|
+---------+----+----+----+----+
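Note that the question's example row (| 100% | 33% | 66% | 66% |) divides by the diagonal entry (the number of customers who bought that row's category) rather than by the row sum. A hedged sketch of that variant, reusing df2 from above (rounding will differ slightly, e.g. 67 instead of 66):

from pyspark.sql import functions as F

cat_cols = [c for c in df2.columns if c != "cat1_cat2"]

# the diagonal entry of a row is the count in the column whose name equals cat1_cat2
diag = F.coalesce(*[F.when(F.col("cat1_cat2") == c, F.col(c)) for c in cat_cols])

df2.select(
    "cat1_cat2",
    *[F.round(100 * F.col(c) / diag, 0).alias(c) for c in cat_cols]
).show()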

Spark withColumn: create a column duplicating values from an existing column

I am having a problem figuring this out. Here is the problem statement.
Let's say I have a dataframe. I want to select the value of column C where column B is foo, create a new column D, and repeat that value "3" for all rows:
+---+----+---+
| A| B| C|
+---+----+---+
| 4|blah| 2|
| 2| | 3|
| 56| foo| 3|
|100|null| 5|
+---+----+---+
I want it to become:
+---+----+---+-----+
| A| B| C| D |
+---+----+---+-----+
| 4|blah| 2| 3 |
| 2| | 3| 3 |
| 56| foo| 3| 3 |
|100|null| 5| 3 |
+---+----+---+-----+
You will have to extract the column C value, i.e. 3, from the row where column B is foo:
import org.apache.spark.sql.functions._
val value = df.filter(col("B") === "foo").select("C").first()(0)
Then use that value with withColumn to create a new column D, using the lit function:
df.withColumn("D", lit(value)).show(false)
You should get your desired output.
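Since most of this page is PySpark, a rough PySpark version of the same idea might look like this (a sketch, assuming the same column names):

from pyspark.sql import functions as F

# take the C value from the first row where B == 'foo', then repeat it as a literal column D
value = df.filter(F.col('B') == 'foo').select('C').first()[0]
df.withColumn('D', F.lit(value)).show()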