Replace date value in pyspark by the maximum of two columns

I'm using pyspark 3.0.1. I have a dataframe df with the following details:
ID  Class  dateEnrolled  dateStarted
32  1      2016-01-09    2016-01-26
25  1      2016-01-09    2016-01-10
33  1      2016-01-16    2016-01-05
I need to replace dateEnrolled with the later of the two date fields, so my data should look like:
ID  Class  dateEnrolled  dateStarted
32  1      2016-01-26    2016-01-26
25  1      2016-01-10    2016-01-10
33  1      2016-01-16    2016-01-05
Can you suggest how to do that?

You can use greatest:
import pyspark.sql.functions as F
df2 = df.withColumn('dateEnrolled', F.greatest('dateEnrolled', 'dateStarted'))
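For reference, here is a minimal, self-contained sketch that reproduces the example. Casting the strings to DateType is an assumption; greatest also orders yyyy-MM-dd strings correctly, and it skips nulls rather than returning null when only one of the two values is missing.

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# rebuild the sample data and cast the string dates to DateType
df = spark.createDataFrame(
    [(32, 1, "2016-01-09", "2016-01-26"),
     (25, 1, "2016-01-09", "2016-01-10"),
     (33, 1, "2016-01-16", "2016-01-05")],
    ["ID", "Class", "dateEnrolled", "dateStarted"],
)
df = df.withColumn("dateEnrolled", F.col("dateEnrolled").cast("date")) \
       .withColumn("dateStarted", F.col("dateStarted").cast("date"))

# keep the later of the two dates per row
df2 = df.withColumn("dateEnrolled", F.greatest("dateEnrolled", "dateStarted"))
df2.show()  # 32 -> 2016-01-26, 25 -> 2016-01-10, 33 -> 2016-01-16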

Related

Converting (casting) columns into rows in Pyspark

I have a spark dataframe in the below format, where each unique id can have a maximum of 3 rows, given by the rank column.
id   pred  prob         rank
485  9716  0.19205872   1
729  9767  0.19610429   1
729  9716  0.186840048  2
729  9748  0.173447074  3
818  9731  0.255104463  1
818  9748  0.215499913  2
818  9716  0.207307154  3
I want to pivot this into row-wise data such that each id has just one row, and the pred & prob columns become multiple columns differentiated by the rank variable (as a column postfix).
id   pred_1  prob_1       pred_2  prob_2       pred_3  prob_3
485  9716    0.19205872
729  9767    0.19610429   9716    0.186840048  9748    0.173447074
818  9731    0.255104463  9748    0.215499913  9716    0.207307154
I am not able to figure out how to do it in Pyspark.
Sample code for input data creation:
# Loading the requisite packages
from pyspark.sql.functions import col, explode, array, struct, expr, sum, lit
# Creating the DataFrame
df = sqlContext.createDataFrame(
    [(485, 9716, 19, 1), (729, 9767, 19, 1), (729, 9716, 18, 2),
     (729, 9748, 17, 3), (818, 9731, 25, 1), (818, 9748, 21, 2),
     (818, 9716, 20, 3)],
    ('id', 'pred', 'prob', 'rank'))
df.show()
This is the pivot-on-multiple-columns problem. Try:
import pyspark.sql.functions as F
df_pivot = df.groupBy('id').pivot('rank').agg(F.first('pred').alias('pred'), F.first('prob').alias('prob')).orderBy('id')
df_pivot.show(truncate=False)
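Note that pivoting with two aggregations names the output columns like 1_pred, 1_prob, 2_pred, and so on. If the pred_1/prob_1 naming from the question is wanted, a small rename pass can swap the two parts; this is only a sketch, assuming ranks 1-3 as in the sample:

# rename "<rank>_<metric>" columns to "<metric>_<rank>", e.g. "1_pred" -> "pred_1"
renamed = df_pivot
for c in df_pivot.columns:
    if "_" in c:
        rank, metric = c.split("_", 1)
        renamed = renamed.withColumnRenamed(c, f"{metric}_{rank}")
renamed.show(truncate=False)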

Group by hour in pyspark?

I have a dataframe which contains a time column in string format:
time      value
00:00:00  10
00:23:00  5
00:59:00  23
01:23:34  34
01:56:00  34
Every time I try to group by hour on the time column, it gives output like this:
hour count
0 38
1 68
But I want output like this:
hour count
00 38
01 68
For this I wrote the query below:
dataframe.groupBy(hour('time')).agg({'value':'count'})
Quoting Substring multiple characters from the last index of a pyspark string column using negative indexing:
Since your time column is of StringType, we can use substring to get the hour as you want, and group on it as a string.
from pyspark.sql.functions import substring, col
df = df.withColumn("hour", substring(col("time"), 1, 2))  # first two characters, e.g. "00", "01"
group_df = df.groupby("hour").sum("value")  # or whichever aggregation you want
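For illustration, here is a self-contained sketch of the same approach on the question's sample data; labelling the summed value as count (to match the desired output) is an assumption:

from pyspark.sql import SparkSession
from pyspark.sql.functions import substring, col, sum as sum_

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("00:00:00", 10), ("00:23:00", 5), ("00:59:00", 23),
     ("01:23:34", 34), ("01:56:00", 34)],
    ["time", "value"],
)

# grouping on the two-character hour string keeps the leading zero ("00", "01", ...)
group_df = (df.withColumn("hour", substring(col("time"), 1, 2))
              .groupBy("hour")
              .agg(sum_("value").alias("count")))
group_df.orderBy("hour").show()  # 00 -> 38, 01 -> 68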

Spark - Grouping 2 Dataframe Rows in only 1 row [duplicate]

This question already has answers here: How to pivot Spark DataFrame?
I have the following dataframe
id  col1  col2  col3  col4
1   1     10    100   A
1   1     20    101   B
1   1     30    102   C
2   1     10    80    D
2   1     20    90    E
2   1     30    100   F
2   1     40    104   G
So, I want to return a new dataframe in which I have, in only one row, the values for the same (col1, col2), and also create a new column with some operation over both col3 columns, for example:
id(1)  col1(1)  col2(1)  col3(1)  col4(1)  id(2)  col1(2)  col2(2)  col3(2)  col4(2)  new_column
1      1        10       100      A        2      1        10       80       D        (100-80)*100
1      1        20       101      B        2      1        20       90       E        (101-90)*100
1      1        30       102      C        2      1        30       100      F        (102-100)*100
-      -        -        -        -        2      1        40       104      G        -
I tried ordering and grouping by (col1, col2), but the grouping returns a RelationalGroupedDataset on which I cannot do anything apart from aggregation functions. So I will appreciate any help. I'm using Scala 2.11. Thanks!
What about joining the df with itself? Something like:
df.as("left")
.join(df.as("right"), Seq("col1", "col2"), "outer")
.where($"left.id" =!= $"right.id")

Combining rows in a spark dataframe

If I have an input as below:
sno name time
1 hello 1
1 hello 2
1 hai 3
1 hai 4
1 hai 5
1 how 6
1 how 7
1 are 8
1 are 9
1 how 10
1 how 11
1 are 12
1 are 13
1 are 14
I want to combine consecutive rows that have the same value in name, in the output format below:
sno name timestart timeend
1 hello 1 2
1 hai 3 5
1 how 6 7
1 are 8 9
1 how 10 11
1 are 12 14
The input is sorted by time, and only records that have the same name over consecutive time intervals should be merged.
I am trying to do this using Spark, but I cannot figure out a way to do it with Spark functions since I am new to Spark. Any suggestions on the approach will be appreciated.
I tried writing a user-defined function and applying maps to the dataframe, but I could not come up with the right logic for the function.
PS: I am trying to do this using Scala Spark.
One way to do so would be to use a plain SQL query.
Let's say df is your input dataframe.
val viewName = "dataframe"
df.createOrReplaceTempView(viewName)

def query(viewName: String): String =
  s"SELECT sno, name, MIN(time) AS timestart, MAX(time) AS timeend FROM $viewName GROUP BY sno, name"

spark.sql(query(viewName)).show()
You can of course use the DataFrame API instead. This would be something like:
df.groupBy($"name")
.agg($"sno", $"name", max($"time").as("timeend"), min($"time").as("timestart"))

Pyspark dataframe create new column from other columns and from it

I have a pyspark dataframe DF with the following data, and I would like to create a new column with the condition shown in the code below.
city  customer  sales  orders  checkpoint
a     eee       20     20      1
b     sfd       28     30      0
C     sss       30     30      1
d     zzz       35     40      0
DF = DF.withColumn("NewCol",
    func.when(DF.month == 1, DF.sales + DF.orders).otherwise(greatest(DF.sales, DF.orders))
    + func.when(DF.checkpoint == 1, lit(0)).otherwise(func.lag("NewCol").over(Window.partitionBy(DF.city, DF.customer).orderBy(DF.city, DF.customer))))
I got an error saying NewCol is not defined, which is expected. Can you suggest how to handle this?
I created the column first:
df = df.withColumn("NewCol", lit(None))
w = Window.partitionBy(df.city, df.customer).orderBy(df.city, df.customer)
for i in range(2):
    if i <= 2:
        df = df.withColumn("NewCol",
            func.when(df.month == 1, df.sales + df.orders).otherwise(greatest(df.sales, df.orders))
            + func.when(df.checkpoint == 1, lit(0)).otherwise(func.lag("NewCol").over(w)))
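Note that func.lag("NewCol") can only see the value from the previous pass of the loop, so the running value is only propagated a fixed number of steps. One way to express the intended recurrence (NewCol = base + previous NewCol, restarting wherever checkpoint == 1) without self-reference is a cumulative sum of the base term within segments that start at each checkpoint row. This is only a sketch; it assumes month is also the column that orders rows within a (city, customer) group:

from pyspark.sql import Window
import pyspark.sql.functions as func

# base term: sales + orders in month 1, otherwise the larger of the two
base = func.when(df.month == 1, df.sales + df.orders) \
           .otherwise(func.greatest(df.sales, df.orders))

order_col = "month"  # assumption: the column that orders rows per (city, customer)
w = Window.partitionBy("city", "customer").orderBy(order_col)

df = (df.withColumn("base", base)
        # a new segment starts at every checkpoint row
        .withColumn("segment",
                    func.sum(func.when(func.col("checkpoint") == 1, 1)
                                 .otherwise(0)).over(w))
        # cumulative sum of base within a segment reproduces the recurrence
        .withColumn("NewCol",
                    func.sum("base").over(
                        Window.partitionBy("city", "customer", "segment")
                              .orderBy(order_col)))
        .drop("base", "segment"))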