Group by hour in pyspark?

I have a dataframe which contains a time column in string format.
dataframe =
time value
00:00:00 10
00:23:00 5
00:59:00 23
01:23:34 34
01:56:00 34
Every time I try to group by hour on the time column, it gives output like this:
hour count
0 38
1 68
But I want the output like this:
hour count
00 38
01 68
For this I wrote the query below:
dataframe.groupBy(hour('time')).agg({'value':'count'})

Quoting "substring multiple characters from the last index of a pyspark string column using negative indexing":
Since your time column is in StringType, we can use substring to get the hour as you want, and group on it as a StringType:
from pyspark.sql.functions import substring, col
df = df.withColumn("hour", substring(col("time"), 1, 2))
group_df = df.groupby("hour").sum("value")  # or whichever aggregation you want
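If you prefer to keep the hour() function from the original query, here is a minimal alternative sketch that zero-pads the hour back to two characters (it assumes the time strings parse as HH:mm:ss; dataframe, time and value are the names from the question):
import pyspark.sql.functions as F

# Parse the string, take the hour, and left-pad it so 0 becomes "00".
hour_col = F.lpad(F.hour(F.to_timestamp("time", "HH:mm:ss")).cast("string"), 2, "0")
result = (
    dataframe
    .withColumn("hour", hour_col)
    .groupBy("hour")
    .agg(F.count("value").alias("count"))
)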

Related

Converting (casting) columns into rows in Pyspark

I have a spark dataframe in the below format, where each unique id can have a maximum of 3 rows, given by the rank column.
id pred prob rank
485 9716 0.19205872 1
729 9767 0.19610429 1
729 9716 0.186840048 2
729 9748 0.173447074 3
818 9731 0.255104463 1
818 9748 0.215499913 2
818 9716 0.207307154 3
I want to convert (cast) this into row-wise data such that each id has just one row, and the pred & prob columns become multiple columns differentiated by the rank variable (as a column postfix).
id pred_1 prob_1 pred_2 prob_2 pred_3 prob_3
485 9716 0.19205872
729 9767 0.19610429 9716 0.186840048 9748 0.173447074
818 9731 0.255104463 9748 0.215499913 9716 0.207307154
I am not able to figure out how to do it in Pyspark.
Sample code for input data creation:
# Loading the requisite packages
from pyspark.sql.functions import col, explode, array, struct, expr, sum, lit
# Creating the DataFrame
df = sqlContext.createDataFrame([(485,9716,19,1),(729,9767,19,1),(729,9716,18,2), (729,9748,17,3), (818,9731,25,1), (818,9748,21,2), (818,9716,20,3)],('id','pred','prob','rank'))
df.show()
This is the pivot-on-multiple-columns problem. Try:
import pyspark.sql.functions as F
df_pivot = df.groupBy('id').pivot('rank').agg(F.first('pred').alias('pred'), F.first('prob').alias('prob')).orderBy('id')
df_pivot.show(truncate=False)
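Note that pivot names the new columns with the rank value as a prefix (e.g. 1_pred, 1_prob). A small sketch, assuming that default naming, to rename them to the pred_1/prob_1 style shown in the expected output:
# Rename "<rank>_<field>" columns (e.g. "1_pred") to "<field>_<rank>" (e.g. "pred_1")
for c in df_pivot.columns:
    if "_" in c:
        rank, field = c.split("_", 1)
        df_pivot = df_pivot.withColumnRenamed(c, f"{field}_{rank}")
df_pivot.show(truncate=False)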

Split date into day of the week, month, year using Pyspark

I have very little experience in Pyspark and I am trying, with no success, to create 3 new columns from a column that contains the timestamp of each row.
The column containing the date has the following format: EEE MMM dd HH:mm:ss Z yyyy.
So it looks like this:
+--------------------+
| timestamp|
+--------------------+
|Fri Oct 18 17:07:...|
|Mon Oct 21 21:49:...|
|Thu Oct 31 18:03:...|
|Sun Oct 20 15:00:...|
|Mon Sep 30 23:35:...|
+--------------------+
The 3 columns have to contain: the day of the week as an integer (so 0 for monday, 1 for tuesday...), the number of the month and the year.
What is the most effective way to create these additional 3 columns and append them to the pyspark dataframe? Thanks in advance!!
Spark 1.5 and higher has many date processing functions. Here are some that may be useful for you:
from pyspark.sql.functions import col, year, month, dayofweek
df = df.withColumn('dayOfWeek', dayofweek(col('your_date_column')))
df = df.withColumn('month', month(col('your_date_column')))
df = df.withColumn('year', year(col('your_date_column')))
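Since the column in the question is a string like "Fri Oct 18 17:07:...", it has to be parsed first, and dayofweek() returns 1 for Sunday through 7 for Saturday rather than 0 for Monday. A sketch under those assumptions (column name 'timestamp' as in the question; on Spark 3+ you may need spark.sql.legacy.timeParserPolicy=LEGACY for the EEE pattern):
import pyspark.sql.functions as F

# Parse the string column, then derive the three columns; the (x + 5) % 7 shift
# maps Spark's 1=Sunday..7=Saturday onto 0=Monday..6=Sunday as requested.
df = df.withColumn("ts", F.to_timestamp("timestamp", "EEE MMM dd HH:mm:ss Z yyyy"))
df = df.withColumn("dayOfWeek", (F.dayofweek("ts") + 5) % 7)
df = df.withColumn("month", F.month("ts"))
df = df.withColumn("year", F.year("ts"))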

Replace date value in pyspark by maximum of two columns

I'm using pyspark 3.0.1. I have a dataframe df with following details
ID Class dateEnrolled dateStarted
32 1 2016-01-09 2016-01-26
25 1 2016-01-09 2016-01-10
33 1 2016-01-16 2016-01-05
I need to replace dateEnrolled with the latest of the two date fields, and my data should look like:
ID Class dateEnrolled dateStarted
32 1 2016-01-26 2016-01-26
25 1 2016-01-10 2016-01-10
33 1 2016-01-16 2016-01-05
Can you suggest how to do that?
You can use greatest:
import pyspark.sql.functions as F
df2 = df.withColumn('dateEnrolled', F.greatest('dateEnrolled', 'dateStarted'))
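One caveat: if the two columns are stored as strings rather than dates, greatest compares them lexicographically. A sketch that casts first (with the yyyy-MM-dd values shown, both comparisons happen to agree):
import pyspark.sql.functions as F

# Cast to DateType so greatest() compares calendar dates rather than strings.
df2 = df.withColumn(
    'dateEnrolled',
    F.greatest(F.col('dateEnrolled').cast('date'), F.col('dateStarted').cast('date'))
)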

How to get the latest date from listed dates along with the total count?

I have the below DataFrame; it has keys with different dates, out of which I would like to display the latest date together with the count for each of the key-id pairs.
Input data as below:
id key date
11 222 1/22/2017
11 222 1/22/2015
11 222 1/22/2016
11 223 9/22/2017
11 223 1/22/2010
11 223 1/22/2008
Code I have tried:
val counts = df.groupBy($"id",$"key").count()
I am getting the below output:
id key count
11 222 3
11 223 3
However, I want the output to be as below:
id key count maxDate
11 222 3 1/22/2017
11 223 3 9/22/2017
One way would be to transform the date into unixtime, do the aggregation, and then convert it back again. These conversions to and from unixtime can be performed with unix_timestamp and from_unixtime respectively. When the date is in unixtime, the latest date can be selected by finding the maximum value. The only possible downside of this approach is that the date format must be given explicitly.
val dateFormat = "MM/dd/yyyy"
val df2 = df.withColumn("date", unix_timestamp($"date", dateFormat))
  .groupBy($"id", $"key").agg(count("date").as("count"), max("date").as("maxDate"))
  .withColumn("maxDate", from_unixtime($"maxDate", dateFormat))
Which will give you:
+---+---+-----+----------+
| id|key|count| maxDate|
+---+---+-----+----------+
| 11|222| 3|01/22/2017|
| 11|223| 3|09/22/2017|
+---+---+-----+----------+
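The answer above is Scala; for reference, a rough PySpark equivalent of the same approach (a sketch, using the same column names and date format):
import pyspark.sql.functions as F

date_format = 'MM/dd/yyyy'
df2 = (df.withColumn('date', F.unix_timestamp('date', date_format))
         .groupBy('id', 'key')
         .agg(F.count('date').alias('count'), F.max('date').alias('maxDate'))
         .withColumn('maxDate', F.from_unixtime('maxDate', date_format)))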
Perform an agg on both fields
df.groupBy($"id", $"key").agg(count($"date"), max($"date"))
Output:
+---+---+-----------+-----------+
| _1| _2|count(date)| max(date)|
+---+---+-----------+-----------+
| 11|222| 3| 1/22/2017|
| 11|223| 3| 9/22/2017|
+---+---+-----------+-----------+
Edit: The as option proposed in the other answer is pretty good too.
Edit: The comment below is right; you need to convert to a proper date format. You can check the other answer, which converts to a timestamp, or use a UDF:
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions

val simpleDateFormatOriginal: SimpleDateFormat = new SimpleDateFormat("MM/dd/yyyy")
val simpleDateFormatDestination: SimpleDateFormat = new SimpleDateFormat("yyyy/MM/dd")

// Rewrite MM/dd/yyyy as yyyy/MM/dd so that max() over the strings picks the latest date
val toyyyymmdd = (s: String) => simpleDateFormatDestination.format(simpleDateFormatOriginal.parse(s))

// Convert back to the original MM/dd/yyyy format for display
val toddmmyyyy = (s: String) => simpleDateFormatOriginal.format(simpleDateFormatDestination.parse(s))

val toyyyymmddudf = functions.udf(toyyyymmdd)
val toddmmyyyyudf = functions.udf(toddmmyyyy)

df.withColumn("date", toyyyymmddudf($"date"))
  .groupBy($"id", $"key")
  .agg(count($"date"), max($"date").as("maxDate"))
  .withColumn("maxDate", toddmmyyyyudf($"maxDate"))

Pyspark dataframe create new column from other columns and from it

I have a pyspark dataframe DF.
Now I would like to create a new column with the below condition.
city customer sales orders checkpoint
a eee 20 20 1
b sfd 28 30 0
C sss 30 30 1
d zzz 35 40 0
DF = Df.withColumn("NewCol", func.when(DF.month == 1, DF.sales + DF.orders).otherwise(greatest(DF.sales, DF.orders)) + func.when(DF.checkpoint == 1, lit(0)).otherwise(func.lag("NewCol").over(Window.partitionBy(DF.city, DF.customer).orderBy(DF.city, DF.customer))))
I got an error like NewCol is not defined, which is expected.
Can you please suggest how to handle this?
Create the column first, then populate it:
df = df.withColumn("NewCol", lit(None))
for i in range(2):
    if i <= 2:
        DF = Df.withColumn("NewCol", func.when(DF.month == 1, DF.sales + DF.orders).otherwise(greatest(DF.sales, DF.orders)) + func.when(DF.checkpoint == 1, lit(0)).otherwise(func.lag("NewCol").over(Window.partitionBy(DF.city, DF.customer).orderBy(DF.city, DF.customer))))
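For completeness, a minimal self-contained sketch of the pattern this answer points at: seed NewCol with a placeholder, then overwrite it so that lag("NewCol") can read the values from the previous pass. The month column and the loop bound come from the question's code and are assumed to exist; the default value of 0 in lag() is an added assumption to avoid nulls on the first row of each partition.
import pyspark.sql.functions as func
from pyspark.sql.functions import greatest, lit
from pyspark.sql.window import Window

w = Window.partitionBy("city", "customer").orderBy("city", "customer")

DF = DF.withColumn("NewCol", lit(0))  # seed so lag("NewCol") has values to read
for _ in range(2):  # illustrative number of passes, as in the answer's loop
    DF = DF.withColumn(
        "NewCol",
        func.when(DF.month == 1, DF.sales + DF.orders).otherwise(greatest(DF.sales, DF.orders))
        + func.when(DF.checkpoint == 1, lit(0)).otherwise(func.lag("NewCol", 1, 0).over(w)),
    )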