Find next different value from lag in pyspark

I have a pyspark dataframe like this,
+-----+----------+
|value|val_joined|
+-----+----------+
| 3| 3|
| 4| 3+4|
| 5| 3+4+5|
| 5| 3+4+5|
| 5| 3+4+5|
| 2| 3+4+5+2|
+-----+----------+
From this, I have to create another column that goes like this,
+-----+----------+------+
|value|val_joined|result|
+-----+----------+------+
| 3| 3| 4.0|
| 4| 3+4| 5.0|
| 5| 3+4+5| 2.0|
| 5| 3+4+5| 2.0|
| 5| 3+4+5| 2.0|
| 2| 3+4+5+2| NaN|
+-----+----------+------+
The result column is built as follows: for each item in the value column, find the next item in row order. So for value 3 the result is 4, and for value 4 it is 5.
When there are duplicates, such as the value 5 that repeats 3 times, a simple lag won't work, because the lag for the first 5 is again 5. I basically want to keep taking the lag until value != lag(value) or lag(value) is null.
How can I do this in pyspark without a udf or joins?

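We can use two windows: the first, ordered by an index assigned with monotonically_increasing_id, finds the next row's value with lead; the second, partitioned by value, takes the last of those values within each partition:
from pyspark.sql import Window
import pyspark.sql.functions as F
# w walks the rows in their original order; w1 groups equal values together
w = Window.orderBy('idx')
w1 = Window.partitionBy('value')
# note: last() over an unordered partition depends on the row order inside that partition
(df.withColumn('idx',F.monotonically_increasing_id())
.withColumn("result",F.last(F.lead("value").over(w)).over(w1)).orderBy('idx')
.drop('idx')).show()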
+-----+----------+------+
|value|val_joined|result|
+-----+----------+------+
| 3| 3| 4|
| 4| 3+4| 5|
| 5| 3+4+5| 2|
| 5| 3+4+5| 2|
| 5| 3+4+5| 2|
| 2| 3+4+5+2| null|
+-----+----------+------+
If a number in value can reappear later, as in the example below:
+-----+----------+
|value|val_joined|
+-----+----------+
|3 |3 |
|4 |3+4 |
|5 |3+4+5 |
|5 |3+4+5 |
|5 |3+4+5 |
|2 |3+4+5+2 |
|5 |3+4+5+2+5 | <- this value is repeated later
+-----+----------+
Then we have to create a separate group for each run of consecutive equal values and use that group as the window partition:
w = Window.orderBy('idx')
w1 = Window.partitionBy('group')
(df.withColumn('idx',F.monotonically_increasing_id())
.withColumn("lag", F.when(F.lag("value").over(w)!=F.col("value"), F.lit(1))
.otherwise(F.lit(0)))
.withColumn("group", F.sum("lag").over(w) + 1).drop("lag")
.withColumn("result",F.last(F.lead("value").over(w)).over(w1)).orderBy('idx')
.drop('idx',"group")).show()
+-----+----------+------+
|value|val_joined|result|
+-----+----------+------+
| 3| 3| 4|
| 4| 3+4| 5|
| 5| 3+4+5| 2|
| 5| 3+4+5| 2|
| 5| 3+4+5| 2|
| 2| 3+4+5+2| 5|
| 5| 3+4+5+2+5| null|
+-----+----------+------+
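To sanity-check the window logic on a small sample, a plain-Python reference can help. The sketch below is not Spark code and is not part of the answer above; the helper name next_different is hypothetical. It collapses consecutive duplicates into runs and gives every row of a run the value that starts the following run.

```python
from itertools import groupby


def next_different(values):
    """For each item, return the value that starts the next run of different values."""
    # Collapse consecutive duplicates into runs: [(value, run_length), ...]
    runs = [(key, len(list(group))) for key, group in groupby(values)]
    result = []
    for i, (_, length) in enumerate(runs):
        # The key of the following run is the "next different value"; None at the end
        following = runs[i + 1][0] if i + 1 < len(runs) else None
        result.extend([following] * length)
    return result


print(next_different([3, 4, 5, 5, 5, 2, 5]))
# prints [4, 5, 2, 2, 2, 5, None]
```

Comparing this list with the result column from show() on the same sample is a quick way to confirm that the grouping step behaves as intended.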

Related

Pyspark combine different rows base on a column

I have a dataframe
+----------------+------------+-----+
| Sport|Total_medals|count|
+----------------+------------+-----+
| Alpine Skiing| 3| 4|
| Alpine Skiing| 2| 18|
| Alpine Skiing| 4| 1|
| Alpine Skiing| 1| 38|
| Archery| 2| 12|
| Archery| 1| 72|
| Athletics| 2| 50|
| Athletics| 1| 629|
| Athletics| 3| 8|
| Badminton| 2| 5|
| Badminton| 1| 86|
| Baseball| 1| 216|
| Basketball| 1| 287|
|Beach Volleyball| 1| 48|
| Biathlon| 4| 1|
| Biathlon| 3| 9|
| Biathlon| 1| 61|
| Biathlon| 2| 23|
| Bobsleigh| 2| 6|
| Bobsleigh| 1| 60|
+----------------+------------+-----+
Is there a way to combine the counts from multiple rows that belong to the same sport?
For example, for Sport = Alpine Skiing I would get something like this:
+----------------+-----+
| Sport|count|
+----------------+-----+
| Alpine Skiing| 61|
+----------------+-----+
where count is equal to 4+18+1+38 = 61. I would like to do this for all sports.
Any help would be appreciated.
You need to group by the Sport column and then aggregate the count column with the sum() function.
Example:
import pyspark.sql.functions as F
# alias keeps the column name 'count' instead of the default 'sum(count)'
grouped_df = df.groupby('Sport').agg(F.sum('count').alias('count'))
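As a cross-check of what the aggregation should return, here is a minimal plain-Python sketch (not Spark) that sums count per sport for the first six rows of the table in the question. The variable names rows and totals are only illustrative.

```python
from collections import defaultdict

# (Sport, Total_medals, count) rows copied from the question's table
rows = [
    ("Alpine Skiing", 3, 4),
    ("Alpine Skiing", 2, 18),
    ("Alpine Skiing", 4, 1),
    ("Alpine Skiing", 1, 38),
    ("Archery", 2, 12),
    ("Archery", 1, 72),
]

# Accumulate the count column per sport, ignoring Total_medals
totals = defaultdict(int)
for sport, _total_medals, count in rows:
    totals[sport] += count

print(dict(totals))
# prints {'Alpine Skiing': 61, 'Archery': 84}
```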

How to count change in row values in pyspark

Logic to count the change in the row values of a given column
Input
df22 = spark.createDataFrame(
[(1, 1.0), (1,22.0), (1,22.0), (1,21.0), (1,20.0), (2, 3.0), (2,3.0),
(2, 5.0), (2, 10.0), (2,3.0), (3,11.0), (4, 11.0), (4,15.0), (1,22.0)],
("id", "v"))
+---+----+
| id| v|
+---+----+
| 1| 1.0|
| 1|22.0|
| 1|22.0|
| 1|21.0|
| 1|20.0|
| 2| 3.0|
| 2| 3.0|
| 2| 5.0|
| 2|10.0|
| 2| 3.0|
| 3|11.0|
| 4|11.0|
| 4|15.0|
+---+----+
Expected output
+---+----+---+
| id| v| c|
+---+----+---+
| 1| 1.0| 0|
| 1|22.0| 1|
| 1|22.0| 1|
| 1|21.0| 2|
| 1|20.0| 3|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 5.0| 1|
| 2|10.0| 2|
| 2| 3.0| 3|
| 3|11.0| 0|
| 4|11.0| 0|
| 4|15.0| 1|
+---+----+---+
Any help on this will be greatly appreciated
Thanks in advance
Ramabadran
Before adding an answer, I would like to ask: what have you tried? Please attempt something yourself before asking for support on this platform. Your question is also unclear: you have not said whether you want the change count per 'id' or over the whole DataFrame. An expected output alone does not make the question clear.
Now to your question: if I understood the sample input and output correctly, you need the change count per 'id'. One way to achieve it is shown below. Note that the window orders by 'v', so the changes are counted over the sorted values of each 'id' rather than in the original row order.
# Capture the incremented count using lag() and sum() over the window below
import pyspark.sql.functions as F
from pyspark.sql.window import Window
winSpec = Window.partitionBy('id').orderBy('v')  # window for capturing the incremented count
(df22
 .withColumn('prev', F.coalesce(F.lag('v').over(winSpec), F.col('v')))
 .withColumn('c', F.sum(F.expr("case when v-prev<>0 then 1 else 0 end")).over(winSpec))
 .drop('prev')
 .orderBy('id', 'v')
 .show())
+---+----+---+
| id| v| c|
+---+----+---+
| 1| 1.0| 0|
| 1|20.0| 1|
| 1|21.0| 2|
| 1|22.0| 3|
| 1|22.0| 3|
| 1|22.0| 3|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 5.0| 1|
| 2|10.0| 2|
| 3|11.0| 0|
| 4|11.0| 0|
| 4|15.0| 1|
+---+----+---+
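The expected output in the question appears to follow the original row order rather than the sorted order produced above. As a reference for that reading, here is a plain-Python sketch (not Spark; the function name count_changes is hypothetical) that keeps a running change counter per id while walking the 13 rows of the question's input table in order.

```python
def count_changes(rows):
    """Return (id, v, c) where c counts how often v changed so far within that id."""
    last_seen = {}  # id -> most recent v seen for that id
    counters = {}   # id -> number of changes seen so far for that id
    out = []
    for key, v in rows:
        if key not in last_seen:
            counters[key] = 0          # first row of an id starts at 0
        elif v != last_seen[key]:
            counters[key] += 1         # value differs from the previous row of this id
        last_seen[key] = v
        out.append((key, v, counters[key]))
    return out


rows = [(1, 1.0), (1, 22.0), (1, 22.0), (1, 21.0), (1, 20.0),
        (2, 3.0), (2, 3.0), (2, 5.0), (2, 10.0), (2, 3.0),
        (3, 11.0), (4, 11.0), (4, 15.0)]

print([c for _, _, c in count_changes(rows)])
# prints [0, 1, 1, 2, 3, 0, 0, 1, 2, 3, 0, 0, 1]
```

Reproducing this in Spark would need an explicit ordering column, since a DataFrame carries no guaranteed row order by itself.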

Scala Spark Incrementing a column based on another column in dataframe without for loops

I have a dataframe like the one below. I want a new column called cutofftype which, instead of the current monotonically increasing number, resets to 1 every time the ID column changes.
df = df.orderBy("ID","date").withColumn("cutofftype",monotonically_increasing_id()+1)
+------+---------------+----------+
| ID | date |cutofftype|
+------+---------------+----------+
| 54441| 2016-06-20| 1|
| 54441| 2016-06-27| 2|
| 54441| 2016-07-04| 3|
| 54441| 2016-07-11| 4|
| 54500| 2016-05-02| 5|
| 54500| 2016-05-09| 6|
| 54500| 2016-05-16| 7|
| 54500| 2016-05-23| 8|
| 54500| 2016-06-06| 9|
| 54500| 2016-06-13| 10|
+------+---------------+----------+
The target is as below:
+------+---------------+----------+
| ID | date |cutofftype|
+------+---------------+----------+
| 54441| 2016-06-20| 1|
| 54441| 2016-06-27| 2|
| 54441| 2016-07-04| 3|
| 54441| 2016-07-11| 4|
| 54500| 2016-05-02| 1|
| 54500| 2016-05-09| 2|
| 54500| 2016-05-16| 3|
| 54500| 2016-05-23| 4|
| 54500| 2016-06-06| 5|
| 54500| 2016-06-13| 6|
+------+---------------+----------+
I know this can be done with for loops, but I want to do it without them. Is there a way?
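This is a simple partition-by problem. You should use a window.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
// Number the rows within each ID, ordered by date
val w = Window.partitionBy("ID").orderBy("date")
df.withColumn("cutofftype", row_number().over(w)).show()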
+-----+----------+----------+
| ID| date|cutofftype|
+-----+----------+----------+
|54500|2016-05-02| 1|
|54500|2016-05-09| 2|
|54500|2016-05-16| 3|
|54500|2016-05-23| 4|
|54500|2016-06-06| 5|
|54500|2016-06-13| 6|
|54441|2016-06-20| 1|
|54441|2016-06-27| 2|
|54441|2016-07-04| 3|
|54441|2016-07-11| 4|
+-----+----------+----------+
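For readers who want to verify the numbering outside Spark, the per-ID restart can be sketched in plain Python. This is an illustration only, using the rows from the question; the names rows and numbered are hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# (ID, date) rows from the question; ISO dates sort correctly as strings
rows = [
    (54441, "2016-06-20"), (54441, "2016-06-27"), (54441, "2016-07-04"),
    (54441, "2016-07-11"), (54500, "2016-05-02"), (54500, "2016-05-09"),
    (54500, "2016-05-16"), (54500, "2016-05-23"), (54500, "2016-06-06"),
    (54500, "2016-06-13"),
]

# Sort by (ID, date), then restart the counter at 1 for every ID group
ordered = sorted(rows)
numbered = [
    (id_, date, n)
    for id_, group in groupby(ordered, key=itemgetter(0))
    for n, (_, date) in enumerate(group, start=1)
]

print([n for _, _, n in numbered])
# prints [1, 2, 3, 4, 1, 2, 3, 4, 5, 6]
```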

PySpark: counting rows based on current row value

I have a DataFrame with a column "Speed". Can I efficiently add a column that holds, for each row, the number of rows in the DataFrame whose "Speed" is within +/-2 of that row's "Speed"?
results = spark.createDataFrame([[1],[2],[3],[4],[5],
[4],[5],[4],[5],[6],
[5],[6],[1],[3],[8],
[2],[5],[6],[10],[12]],
['Speed'])
results.show()
+-----+
|Speed|
+-----+
| 1|
| 2|
| 3|
| 4|
| 5|
| 4|
| 5|
| 4|
| 5|
| 6|
| 5|
| 6|
| 1|
| 3|
| 8|
| 2|
| 5|
| 6|
| 10|
| 12|
+-----+
You could use a window function:
from pyspark.sql import Window
import pyspark.sql.functions as F
# Order the window by Speed and look at the value range [Speed, Speed + 2]
w = Window.orderBy('Speed').rangeBetween(0,2)
# Count the rows whose Speed falls in that range (the current row included)
results = results.withColumn('count+2',F.count('Speed').over(w)).orderBy('Speed')
results.show()
+-----+-------+
|Speed|count+2|
+-----+-------+
|    1|      6|
|    1|      6|
|    2|      7|
|    2|      7|
|    3|     10|
|    3|     10|
|    4|     11|
|    4|     11|
|    4|     11|
|    5|      8|
|    5|      8|
|    5|      8|
|    5|      8|
|    5|      8|
|    6|      4|
|    6|      4|
|    6|      4|
|    8|      2|
|   10|      2|
|   12|      1|
+-----+-------+
Note: the window function also counts the current row itself. You can correct this by subtracting 1 from the count:
results = results.withColumn('count+2',F.count('Speed').over(w)-1).orderBy('Speed')
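The window above covers only the values from Speed to Speed + 2, while the question asks for +/-2. In Spark that would presumably correspond to rangeBetween(-2, 2). The plain-Python sketch below (not Spark; count_within is a hypothetical helper) computes both variants with a sorted list and bisect, so the Spark output can be compared against it.

```python
from bisect import bisect_left, bisect_right


def count_within(speeds, lower, upper, include_self=True):
    """For each speed s, count the values v with s + lower <= v <= s + upper."""
    ordered = sorted(speeds)
    counts = []
    for s in speeds:
        # Positions of the first value >= s + lower and the first value > s + upper
        start = bisect_left(ordered, s + lower)
        stop = bisect_right(ordered, s + upper)
        n = stop - start
        counts.append(n if include_self else n - 1)
    return counts


speeds = [1, 2, 3, 4, 5, 4, 5, 4, 5, 6, 5, 6, 1, 3, 8, 2, 5, 6, 10, 12]

one_sided = count_within(speeds, 0, 2)                        # same range as the answer
symmetric = count_within(speeds, -2, 2, include_self=False)   # +/-2, current row excluded

print(one_sided[0], one_sided[4])   # prints 6 8
print(symmetric[0], symmetric[4])   # prints 5 12
```

The one-sided counts match the table in the answer (6 for Speed 1, 8 for Speed 5); the symmetric counts are what the question as worded seems to ask for.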

Join two data frames as input for Machine Learning with Spark

I have 2 data frames in Apache Spark, each with a column named "JoinValue". JoinValue is numeric and has the same semantics and meaning in both data frames.
I need the combination of both data frames as input (training and test set) for a Machine Learning algorithm. Is it correct that I first need to combine both DataFrames into a single DataFrame before using it in an ML Pipeline?
Example:
> df1.show()
+---------+---------+
| a|JoinValue|
+---------+---------+
|A value 0| 0|
|A value 1| 5|
|A value 2| 10|
|A value 3| 15|
|A value 4| 20|
|A value 5| 25|
|A value 6| 30|
+---------+---------+
and
> df2.show()
+---------+---------+
| b|JoinValue|
+---------+---------+
|B value 0| 0|
|B value 1| 7|
|B value 2| 14|
|B value 3| 21|
|B value 4| 28|
+---------+---------+
An outer join followed by an orderBy yields the following results:
> df1.join(df2, 'JoinValue', 'outer').orderBy('JoinValue').show()
+---------+---------+---------+
|JoinValue| a| b|
+---------+---------+---------+
| 0|A value 0|B value 0|
| 5|A value 1| null|
| 7| null|B value 1|
| 10|A value 2| null|
| 14| null|B value 2|
| 15|A value 3| null|
| 20|A value 4| null|
| 21| null|B value 3|
| 25|A value 5| null|
| 28| null|B value 4|
| 30|A value 6| null|
+---------+---------+---------+
What I actually want is this, without nulls:
+---------+---------+---------+
|JoinValue| a| b|
+---------+---------+---------+
| 0|A value 0|B value 0|
| 5|A value 1|B value 0|
| 7|A value 1|B value 1|
| 10|A value 2|B value 1|
| 14|A value 2|B value 2|
| 15|A value 3|B value 2|
| 20|A value 4|B value 2|
| 21|A value 4|B value 3|
| 25|A value 5|B value 3|
| 28|A value 5|B value 4|
| 30|A value 6|B value 4|
+---------+---------+---------+
What is the best way to use the JoinValue, a and b, coming from multiple data frames as features and labels in a machine learning algorithm?
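The desired table appears to be consistent with a forward fill: after the outer join and sort, each null takes the most recent non-null value in its column. The sketch below shows that idea in plain Python (not Spark) on the sample data; merge_and_forward_fill is a hypothetical helper, and it assumes each JoinValue occurs at most once per data frame. In Spark, a similar effect is commonly obtained with F.last(col, ignorenulls=True) over a window ordered by JoinValue, though that is an assumption about the approach, not something stated in the question.

```python
def merge_and_forward_fill(left, right):
    """Outer-join two (label, key) lists on key, sort by key, and forward-fill gaps."""
    a_by_key = {key: label for label, key in left}
    b_by_key = {key: label for label, key in right}
    last_a = last_b = None
    out = []
    for key in sorted(set(a_by_key) | set(b_by_key)):
        # Keep the previous value when this key is missing on one side
        last_a = a_by_key.get(key, last_a)
        last_b = b_by_key.get(key, last_b)
        out.append((key, last_a, last_b))
    return out


df1 = [("A value %d" % i, 5 * i) for i in range(7)]   # JoinValue 0, 5, ..., 30
df2 = [("B value %d" % i, 7 * i) for i in range(5)]   # JoinValue 0, 7, ..., 28

for row in merge_and_forward_fill(df1, df2):
    print(row)
```

The first rows printed are (0, 'A value 0', 'B value 0'), (5, 'A value 1', 'B value 0') and (7, 'A value 1', 'B value 1'), matching the table the question asks for.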