Get last value of previous partition/group in pyspark - pyspark

I have a dataframe looking like this (just some example values):
| id | timestamp | mode | trip | journey | value |
1 2021-09-12 23:59:19.717000 walking 1 1 1.21
1 2021-09-12 23:59:38.617000 walking 1 1 1.36
1 2021-09-12 23:59:38.617000 driving 2 1 1.65
2 2021-09-11 23:52:09.315000 walking 4 6 1.04
I want to create new columns which I fill with the previous and next mode. Something like this:
| id | timestamp | mode | trip | journey | value | prev | next
1 2021-09-12 23:59:19.717000 walking 1 1 1.21 bus driving
1 2021-09-12 23:59:38.617000 walking 1 1 1.36 bus driving
1 2021-09-12 23:59:38.617000 driving 2 1 1.65 walking walking
2 2021-09-11 23:52:09.315000 walking 4 6 1.0 walking driving
I have tried to partition by id, trip, journey and mode and ordered by timestamp. Then I tried to use lag() and lead() but I am not sure these work on other partitions. I came across the Window.unboundedPreceding and Window.unboundedFollowing, however I am not sure I completely understand how these work. In my mind I think that if I partition the data as explained above I will always just need the last value of mode from the previous partition and to fill the next I could reorder the partition from ascending to descending on the timestamp and then do the same to fill the next column. However, I am unsure how I get the last value of the previous partition.
I have tried this:
w = Window.partitionBy("id", "journey", "trip").orderBy(col("timestamp").asc())
w_prev = w.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df = df.withColumn("prev", first("mode").over(w_prev))
Code examples and explainations using pyspark will be very appreciated!

So, based on what I could understand you could do something like this,
Create a partition based on ID and their journey, within each journey there are multiple trips, so order by trip and lastly the timestamp, and then simply use the lead and lag to get the output!
w = Window().partitionBy('id', 'journey').orderBy('trip', 'timestamp')
df.withColumn('prev', F.lag('mode', 1).over(w)) \
.withColumn('next', F.lead('mode', 1).over(w)) \
.show(truncate=False)
Output:
+---+--------------------------+-------+----+-------+-----+-------+-------+
|id |timestamp |mode |trip|journey|value|prev |next |
+---+--------------------------+-------+----+-------+-----+-------+-------+
|1 |2021-09-12 23:59:19.717000|walking|1 |1 |1.21 |null |walking|
|1 |2021-09-12 23:59:38.617000|walking|1 |1 |1.36 |walking|driving|
|1 |2021-09-12 23:59:38.617000|driving|2 |1 |1.65 |walking|null |
|2 |2021-09-11 23:52:09.315000|walking|4 |6 |1.04 |null |null |
+---+--------------------------+-------+----+-------+-----+-------+-------+
EDIT:
Okay as OP asked, you can do this to achieve it,
# Used for taking the latest record from same id, trip, journey
w = Window().partitionBy('id', 'trip', 'journey').orderBy(F.col('timestamp').desc())
# Used to calculate prev and next mode
w1 = Window().partitionBy('id', 'journey').orderBy('trip')
# First take only the latest rows for a particular combination of id, trip, journey
# Second, use the filtered rows to get prev and next modes
df2 = df.withColumn('rn', F.row_number().over(w)) \
.filter(F.col('rn') == 1) \
.withColumn('prev', F.lag('mode', 1).over(w1)) \
.withColumn('next', F.lead('mode', 1).over(w1)) \
.drop('rn')
df2.show(truncate=False)
Output:
+---+--------------------------+-------+----+-------+-----+-------+-------+
|id |timestamp |mode |trip|journey|value|prev |next |
+---+--------------------------+-------+----+-------+-----+-------+-------+
|1 |2021-09-12 23:59:38.617000|walking|1 |1 |1.36 |null |driving|
|1 |2021-09-12 23:59:38.617000|driving|2 |1 |1.65 |walking|null |
|2 |2021-09-11 23:52:09.315000|walking|4 |6 |1.04 |null |null |
+---+--------------------------+-------+----+-------+-----+-------+-------+
# Finally, join the calculated DF with the original DF to get prev and next mode
final_df = df.alias('a').join(df2.alias('b'), ['id', 'trip', 'journey'], how='left') \
.select('a.*', 'b.prev', 'b.next')
final_df.show(truncate=False)
Output:
+---+----+-------+--------------------------+-------+-----+-------+-------+
|id |trip|journey|timestamp |mode |value|prev |next |
+---+----+-------+--------------------------+-------+-----+-------+-------+
|1 |1 |1 |2021-09-12 23:59:19.717000|walking|1.21 |null |driving|
|1 |1 |1 |2021-09-12 23:59:38.617000|walking|1.36 |null |driving|
|1 |2 |1 |2021-09-12 23:59:38.617000|driving|1.65 |walking|null |
|2 |4 |6 |2021-09-11 23:52:09.315000|walking|1.04 |null |null |
+---+----+-------+--------------------------+-------+-----+-------+-------+

Related

How to get the two nearest values in spark scala DataFrame

Hi EveryOne I'm new in Spark scala. I want to find the nearest values by partition using spark scala. My input is something like this:
first row for example: value 1 is between 2 and 7 in the value2 columns
+--------+----------+----------+
|id |value1 |value2 |
+--------+----------+----------+
|1 |3 |1 |
|1 |3 |2 |
|1 |3 |7 |
|2 |4 |2 |
|2 |4 |3 |
|2 |4 |8 |
|3 |5 |3 |
|3 |5 |6 |
|3 |5 |7 |
|3 |5 |8 |
My output should like this:
+--------+----------+----------+
|id |value1 |value2 |
+--------+----------+----------+
|1 |3 |2 |
|1 |3 |7 |
|2 |4 |3 |
|2 |4 |8 |
|3 |5 |3 |
|3 |5 |6 |
Can someone guide me how to resolve this please.
Instead of providing a code answer as you appear to want to learn I've provided you pseudo code and references to allow you to find the answers for yourself.
Group the elements (select id, value1) (aggregate on value2
with collect_list) so you can collect all the value2 into an
array.
select (id, and (add(concat) value1 to the collect_list array)) Sorting the array .
find( array_position ) value1 in the array.
splice the array. retrieving value before and value after
the result of (array_position)
If the array is less than 3 elements do error handling
now the last value in the array and the first value in the array are your 'closest numbers'.
You will need window functions for this.
val window = Window
.partitionBy("id", "value1")
.orderBy(asc("value2"))
val result = df
.withColumn("prev", lag("value2").over(window))
.withColumn("next", lead("value2").over(window))
.withColumn("dist_prev", col("value2").minus(col("prev")))
.withColumn("dist_next", col("next").minus(col("value2")))
.withColumn("min", min(col("dist_prev")).over(window))
.filter(col("dist_prev") === col("min") || col("dist_next") === col("min"))
.drop("prev", "next", "dist_prev", "dist_next", "min")
I haven't tested it, so think about it more as an illustration of the concept than a working ready-to-use example.
Here is what's going on here:
First, create a window that describes your grouping rule: we want the rows grouped by the first two columns, and sorted by the third one within each group.
Next, add prev and next columns to the dataframe that contain the value of value2 column from previous and next row within the group respectively. (prev will be null for the first row in the group, and next will be null for the last row – that is ok).
Add dist_prev and dist_next to contain the distance between value2 and prev and next value respectively. (Note that dist_prev for each row will have the same value as dist_next for the previous row).
Find the minimum value for dist_prev within each group, and add it as min column (note, that the minimum value for dist_next is the same by construction, so we only need one column here).
Filter the rows, selecting those that have the minimum value in either dist_next or dist_prev. This finds the tightest pair unless there are multiple rows with the same distance from each other – this case was not accounted for in your question, so we don't know what kind of behavior you want in this case. This implementation will simply return all of these rows.
Finally, drop all extra columns that were added to the dataframe to return it to its original shape.

Join with uneven columns

I have two dataframes structured the following way:
|Source|#Users|#Clicks|Hour|Type
and
Type|Total # Users|Hour
I'd like to join these columns based on hour however the first dataframe is at a deeper granularity in the second and therefore has more rows. Basically I want a dataframe where I have
|Source|#Users|#Clicks|Hour|Type|Total # Users
where the total # users is from the second dataframe. Any suggestions? I think I maybe want to use map?
Edit:
Here's an example
DF1
|Source|#Users|#Clicks|Hour|Type
|Prod1 |50 |3 |01 |Internet
|Prod2 |10 |2 |07 |iOS
|Prod3 |1 |50 |07 |Internet
|Prod2 |3 |2 |07 |Internet
|Prod3 |8 |2 |05 |Internet
DF2
|Type |Total #Users|Hour
|Internet|100 |01
|iOS |500 |01
|Internet|300 |07
|Internet|15 |05
|iOS |20 |07
Result
|Source|#Users|#Clicks|Hour|Type |Total #Users
|Prod1 |50 |3 |01 |Internet|100
|Prod2 |10 |2 |07 |iOS |20
|Prod3 |1 |50 |07 |Internet|300
|Prod2 |3 |2 |07 |Internet|300
|Prod3 |8 |2 |05 |Internet|15
That's a left join you're trying to do :
df1.join(df2, (df1.Hour === df2.Hour) & (df1.Type === df2.Type), "left_outer")
Short version : a left join keep all the rows from df1 and join on condition with matching rows of df2 if there is a match (null if not, duplicate if multiple matches).
More info on Pyspark join
More info on SQL Joins types

Issue while applying aggregate functions like count,sum on columns having null values in PySpark

We have the following input dataframe.
df1
Dep |Gender|Salary|DOB |Place
Finance |Male |5000 |2009-02-02 00:00:00|UK
HR |Female|6000 |2006-02-02 00:00:00|null
HR |Male |14200 |null |US
IT |Male |null |2008-02-02 00:00:00|null
IT |Male |55555 |2008-02-02 00:00:00|UK
Marketing|Female|12200 |2005-02-02 00:00:00|UK
Used the following code to find the count:
df = df1.groupBy(df1['Dep'])
df2 = df.agg({'Salary':'count'})
df2.show()
The result is:
Dep |count(Salary)
Finance |1
HR |2
Marketing|1
IT |1
The expected result is shown below.
Dep |count(Salary)
Finance |1
HR |2
Marketing|1
IT |2
Here issue comes with 4-th row data, where Salary data is null. And count operation on null is not working.
Appreciate your help in solving this issue.
You can replace null values:
df \
.na.fill({'salary':0}) \
.groupBy('Dep') \
.agg({'Salary':'count'})

Display %ROWCOUNT value in a select statement

How is the result of %ROWCOUNT displayed in the SQL statement.
Example
Select top 10 * from myTable.
I would like the results to have a rowCount for each row returned in the result set
Ex
+----------+--------+---------+
|rowNumber |Column1 |Column2 |
+----------+--------+---------+
|1 |A |B |
|2 |C |D |
+----------+--------+---------+
There are no any simple way to do it. You can add Sql Procedure with this functionality and use it in your SQL statements.
For example, class:
Class Sample.Utils Extends %RegisteredObject
{
ClassMethod RowNumber(Args...) As %Integer [ SqlProc, SqlName = "ROW_NUMBER" ]
{
quit $increment(%rownumber)
}
}
and then, you can use it in this way:
SELECT TOP 10 Sample.ROW_NUMBER(id) rowNumber, id,name,dob
FROM sample.person
ORDER BY ID desc
You will get something like below
+-----------+-------+-------------------+-----------+
|rowNumber |ID |Name |DOB |
+-----------+-------+-------------------+-----------+
|1 |200 |Quigley,Neil I. |12/25/1999 |
|2 |199 |Zevon,Imelda U. |04/22/1955 |
|3 |198 |O'Brien,Frances I. |12/03/1944 |
|4 |197 |Avery,Bart K. |08/20/1933 |
|5 |196 |Ingleman,Angelo F. |04/14/1958 |
|6 |195 |Quilty,Frances O. |09/12/2012 |
|7 |194 |Avery,Susan N. |05/09/1935 |
|8 |193 |Hanson,Violet L. |05/01/1973 |
|9 |192 |Zemaitis,Andrew H. |03/07/1924 |
|10 |191 |Presley,Liza N. |12/27/1978 |
+-----------+-------+-------------------+-----------+
If you are willing to rewrite your query then you can use a view counter to do what you are looking for. Here is a link to the docs.
The short version is you move your query into a FROM clause sub query and use the special field %vid.
SELECT v.%vid AS Row_Counter, Name
FROM (SELECT TOP 10 Name FROM Sample.Person ORDER BY Name) v
Row_Counter Name
1 Adam,Thelma P.
2 Adam,Usha J.
3 Adams,Milhouse A.
4 Allen,Xavier O.
5 Avery,James R.
6 Avery,Kyra G.
7 Bach,Ted J.
8 Bachman,Brian R.
9 Basile,Angelo T.
10 Basile,Chad L.

SQL Joining two table

I am struggling, maybe the simplest problem ever. My SQL knowledge pretty much limits me from achieving this. I am trying to build an sql query that should show JobTitle, Note and NoteType. Here is the thing, First job doesn't have any note but we should see it in the results. System notes never and ever should be displayed. An expected result should look like this
Result:
--------------------------------------------
|ID |Title |Note |NoteType |
--------------------------------------------
|1 |FirstJob |NULL |NULL |
|2 |SecondJob |CustomNot1|1 |
|2 |SecondJob |CustomNot2|1 |
|3 |ThirdJob |NULL |NULL |
--------------------------------------------
.
My query (doesn't work, doesn't display third job)
SELECT J.ID, J.Title, N.Note, N.NoteType
FROM JOB J
LEFT OUTER JOIN NOTE N ON N.JobId = J.ID
WHERE N.NoteType IS NULL OR N.NoteType = 1
My Tables:
My JOB Table
----------------------
|ID |Title |
----------------------
|1 |FirstJob |
|2 |SecondJob |
|3 |ThirdJob |
----------------------
My NOTE Table
--------------------------------------------
|ID |JobId |Note |NoteType |
--------------------------------------------
|1 |2 |CustomNot1|1 |
|2 |2 |CustomNot2|1 |
|3 |2 |SystemNot1|2 |
|4 |2 |SystemNot3|2 |
|5 |3 |SystemNot1|2 |
--------------------------------------------
This can't be true together (NoteType can't be NULL as well as 1 at the same time):
WHERE N.NoteType IS NULL AND N.NoteType = 1
You may want to use OR instead to check if NoteType is either NULL or 1.
WHERE N.NoteType IS NULL OR N.NoteType = 1
EDIT: With corrected query, your third job will not be retrieved as JOB_ID is matching but its the row getting filtered out because of the where condition.
Try below as work around to get the third job with null values.
SELECT J.ID, J.Title, N.Note, N.NoteType
FROM JOB J
LEFT OUTER JOIN
( SELECT JOBID NOTE, NOTETYPE FROM NOTE
WHERE N.NoteType IS NULL OR N.NoteType = 1) N
ON N.JobId = J.ID
just exclude the systemNotes and use a sub-select:
select * from job j
left outer join (
select * from note where notetype!=2
) n
on j.id=n.jobid;
if you include the joined table into where then left outer join might work as an inner join.