How to count change in row values in pyspark - pyspark

Logic to count the change in the row values of a given column
Input
df22 = spark.createDataFrame(
[(1, 1.0), (1,22.0), (1,22.0), (1,21.0), (1,20.0), (2, 3.0), (2,3.0),
(2, 5.0), (2, 10.0), (2,3.0), (3,11.0), (4, 11.0), (4,15.0), (1,22.0)],
("id", "v"))
+---+----+
| id| v|
+---+----+
| 1| 1.0|
| 1|22.0|
| 1|22.0|
| 1|21.0|
| 1|20.0|
| 2| 3.0|
| 2| 3.0|
| 2| 5.0|
| 2|10.0|
| 2| 3.0|
| 3|11.0|
| 4|11.0|
| 4|15.0|
+---+----+
Expect output
+---+----+---+
| id| v| c|
+---+----+---+
| 1| 1.0| 0|
| 1|22.0| 1|
| 1|22.0| 1|
| 1|21.0| 2|
| 1|20.0| 3|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 5.0| 1|
| 2|10.0| 2|
| 2| 3.0| 3|
| 3|11.0| 0|
| 4|11.0| 0|
| 4|15.0| 1|
+---+----+---+
Any help on this will be greatly appreciated
Thanks in advance
Ramabadran

Before adding answer, I would like to ask you ,"what you have tried ??". Please try something from your end and then seek for support in this platform. Also your question is not clear. You have not provided if you are looking for a delta capture count per 'id' or as a whole. Just giving an expected output is not going to make the question clear.
And now comes to your question , if I understood it correctly from the sample input and output,you need delta capture count per 'id'. So one way to achieve it as below
#Capture the incremented count using lag() and sum() over below mentioned window
import pyspark.sql.functions as F
from pyspark.sql.window import Window
winSpec=Window.partitionBy('id').orderBy('v') # Your Window for capturing the incremented count
df22.\
withColumn('prev',F.coalesce(F.lag('v').over(winSpec),F.col('v'))).\
withColumn('c',F.sum(F.expr("case when v-prev<>0 then 1 else 0 end")).over(winSpec)).\
drop('prev').\
orderBy('id','v').\
show()
+---+----+---+
| id| v| c|
+---+----+---+
| 1| 1.0| 0|
| 1|20.0| 1|
| 1|21.0| 2|
| 1|22.0| 3|
| 1|22.0| 3|
| 1|22.0| 3|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 5.0| 1|
| 2|10.0| 2|
| 3|11.0| 0|
| 4|11.0| 0|
| 4|15.0| 1|
+---+----+---+

Related

Get % of rows that have a unique value by id

I have a pyspark dataframe that looks like this
import pandas as pd
spark.createDataFrame(
pd.DataFrame({'ch_id': [1,1,1,1,1,
2,2,2,2],
'e_id': [0,0,1,2,2,
0,0,1,1],
'seg': ['h','s','s','a','s',
'h','s','s','h']})
).show()
+-----+----+---+
|ch_id|e_id|seg|
+-----+----+---+
| 1| 0| h|
| 1| 0| s|
| 1| 1| s|
| 1| 2| a|
| 1| 2| s|
| 2| 0| h|
| 2| 0| s|
| 2| 1| s|
| 2| 1| h|
+-----+----+---+
I would like for every c_id to get:
the % of e_id for which there is one unique value of s
The output would like like this:
+----+-------+
|c_id|%_major|
+----+-------+
| 1| 66.6|
| 2| 0.0|
+----+-------+
How could I achieve that in pyspark ?

Add another column after groupBy and agg

I have a df looks like this:
+-----+-------+-----+
|docId|vocabId|count|
+-----+-------+-----+
| 3| 3| 600|
| 2| 3| 702|
| 1| 2| 120|
| 2| 5| 200|
| 2| 2| 500|
| 3| 1| 100|
| 3| 5| 2000|
| 3| 4| 122|
| 1| 3| 1200|
| 1| 1| 1000|
+-----+-------+-----+
I want to output the max count of vocabId and the docId it belongs to. I did this:
val wordCounts = docwords.groupBy("vocabId").agg(max($"count") as ("count"))
and got this:
+-------+----------+
|vocabId| count |
+-------+----------+
| 1| 1000|
| 3| 1200|
| 5| 2000|
| 4| 122|
| 2| 500|
+-------+----------+
How do I add the docId at the front???
It should looks something like this(the order is not important):
+-----+-------+-----+
|docId|vocabId|count|
+-----+-------+-----+
| 2| 2| 500|
| 3| 5| 2000|
| 3| 4| 122|
| 1| 3| 1200|
| 1| 1| 1000|
+-----+-------+-----+
You can do self join with docwords over count and vocabId something like below
val wordCounts = docwords.groupBy("vocabId").agg(max($"count") as ("count")).join(docwords,Seq("vocabId","count"))

Unable to get the result from the window function

+---------------+--------+
|YearsExperience| Salary|
+---------------+--------+
| 1.1| 39343.0|
| 1.3| 46205.0|
| 1.5| 37731.0|
| 2.0| 43525.0|
| 2.2| 39891.0|
| 2.9| 56642.0|
| 3.0| 60150.0|
| 3.2| 54445.0|
| 3.2| 64445.0|
| 3.7| 57189.0|
| 3.9| 63218.0|
| 4.0| 55794.0|
| 4.0| 56957.0|
| 4.1| 57081.0|
| 4.5| 61111.0|
| 4.9| 67938.0|
| 5.1| 66029.0|
| 5.3| 83088.0|
| 5.9| 81363.0|
| 6.0| 93940.0|
| 6.8| 91738.0|
| 7.1| 98273.0|
| 7.9|101302.0|
| 8.2|113812.0|
| 8.7|109431.0|
| 9.0|105582.0|
| 9.5|116969.0|
| 9.6|112635.0|
| 10.3|122391.0|
| 10.5|121872.0|
+---------------+--------+
I want to find the top highest salary from the above data which is 122391.0
My Code
val top= Window.partitionBy("id").orderBy(col("Salary").desc)
val res= df1.withColumn("top", rank().over(top))
Result
+---------------+--------+---+---+
|YearsExperience| Salary| id|top|
+---------------+--------+---+---+
| 1.1| 39343.0| 0| 1|
| 1.3| 46205.0| 1| 1|
| 1.5| 37731.0| 2| 1|
| 2.0| 43525.0| 3| 1|
| 2.2| 39891.0| 4| 1|
| 2.9| 56642.0| 5| 1|
| 3.0| 60150.0| 6| 1|
| 3.2| 54445.0| 7| 1|
| 3.2| 64445.0| 8| 1|
| 3.7| 57189.0| 9| 1|
| 3.9| 63218.0| 10| 1|
| 4.0| 55794.0| 11| 1|
| 4.0| 56957.0| 12| 1|
| 4.1| 57081.0| 13| 1|
| 4.5| 61111.0| 14| 1|
| 4.9| 67938.0| 15| 1|
| 5.1| 66029.0| 16| 1|
| 5.3| 83088.0| 17| 1|
| 5.9| 81363.0| 18| 1|
| 6.0| 93940.0| 19| 1|
| 6.8| 91738.0| 20| 1|
| 7.1| 98273.0| 21| 1|
| 7.9|101302.0| 22| 1|
| 8.2|113812.0| 23| 1|
| 8.7|109431.0| 24| 1|
| 9.0|105582.0| 25| 1|
| 9.5|116969.0| 26| 1|
| 9.6|112635.0| 27| 1|
| 10.3|122391.0| 28| 1|
| 10.5|121872.0| 29| 1|
+---------------+--------+---+---+
Also I have choosed partioned by salary and orderby id.
<br>
But the result was same.
As you can see 122391 is coming just below the above but it should come in first position as i have done ascending.
Please help anybody find any things
Are you sure you need a window function here? The window you defined partitions the data by id, which I assume is unique, so each group produced by the window will only have one row. It looks like you want a window over the entire dataframe, which means you don't actually need one. If you just want to add a column with the max, you can get the max using an aggregation on your original dataframe and cross join with it:
val maxDF = df1.agg(max("salary").as("top"))
val res = df1.crossJoin(maxDF)

Pyspark combine different rows base on a column

I have a dataframe
+----------------+------------+-----+
| Sport|Total_medals|count|
+----------------+------------+-----+
| Alpine Skiing| 3| 4|
| Alpine Skiing| 2| 18|
| Alpine Skiing| 4| 1|
| Alpine Skiing| 1| 38|
| Archery| 2| 12|
| Archery| 1| 72|
| Athletics| 2| 50|
| Athletics| 1| 629|
| Athletics| 3| 8|
| Badminton| 2| 5|
| Badminton| 1| 86|
| Baseball| 1| 216|
| Basketball| 1| 287|
|Beach Volleyball| 1| 48|
| Biathlon| 4| 1|
| Biathlon| 3| 9|
| Biathlon| 1| 61|
| Biathlon| 2| 23|
| Bobsleigh| 2| 6|
| Bobsleigh| 1| 60|
+----------------+------------+-----+
Is there a way for me to combine the value of counts from multiple rows if they are from the same sport?
For example, if Sport = Alpine Skiing I would have something like this:
+----------------+-----+
| Sport|count|
+----------------+-----+
| Alpine Skiing| 61|
+----------------+-----+
where count is equal to 4+18+1+38 = 61. I would like to do this for all sports
any help would be appreciated
You need to groupby on the Sport column and then aggregate the count column with the sum() function.
Example:
import pyspark.sql.functions as F
grouped_df = df.groupby('Sport').agg(F.sum('count'))

Row count broken up by a focal value

I have the following DataFrame in Spark using Scala:
val df = List(
("random", 0),
("words", 1),
("in", 1),
("a", 1),
("column", 1),
("are", 0),
("what", 0),
("have", 1),
("been", 1),
("placed", 0),
("here", 1),
("now", 1)
).toDF(Seq("words", "numbers"): _*)
df.show()
+------+-------+
| words|numbers|
+------+-------+
|random| 0|
| words| 1|
| in| 1|
| a| 1|
|column| 1|
| are| 0|
| what| 0|
| have| 1|
| been| 1|
|placed| 0|
| here| 1|
| now| 1|
+------+-------+
I'd like to add a column that contains the count of rows which is started over at every 0 in the numbers column. It would look like this:
+------+-------+-----+
| words|numbers|count|
+------+-------+-----+
|random| 0| 5|
| words| 1| 5|
| in| 1| 5|
| a| 1| 5|
|column| 1| 5|
| are| 0| 1|
| what| 0| 3|
| have| 1| 3|
| been| 1| 3|
|placed| 0| 3|
| here| 1| 3|
| now| 1| 3|
+------+-------+-----+
Here is a method using selectExpr with SQL window functions sum and count; sum of 1-numbers generates the group id which increases by 1 when a zero is encountered, then count the number of rows by this group id:
This might be inefficient since you don't have any partition column.
df.selectExpr(
"words", "numbers",
"count(*) over(partition by sum(1-numbers) over (order by monotonically_increasing_id())) as count"
).show
+------+-------+-----+
| words|numbers|count|
+------+-------+-----+
|random| 0| 5|
| words| 1| 5|
| in| 1| 5|
| a| 1| 5|
|column| 1| 5|
| are| 0| 1|
| what| 0| 3|
| have| 1| 3|
| been| 1| 3|
|placed| 0| 3|
| here| 1| 3|
| now| 1| 3|
+------+-------+-----+