Spark Scala - Null when trying to extract time from timestamp

I'm trying to extract the time from a timestamp using the below code, but it is returning a null value instead of the time. I have already filtered my dataset to have the records I need, so I can ignore AM/PM that comes from the input column.
I did some reading and it seems using date_format should work in this circumstance.
Any thoughts?
Current Output:
+----------------------+----------------------+---------+------------+------------+
|tpep_pickup_datetime  |tpep_dropoff_datetime |timestamp|total_amount|pickupWindow|
+----------------------+----------------------+---------+------------+------------+
|05/18/2018 09:56:20 PM|05/18/2018 10:50:38 PM|35780    |52.87       |null        |
|05/18/2018 10:52:49 PM|05/18/2018 11:08:47 PM|39169    |14.76       |null        |
|05/18/2018 09:01:22 PM|05/18/2018 09:05:36 PM|32482    |6.3         |null        |
|05/18/2018 09:00:29 PM|05/18/2018 09:05:31 PM|32429    |7.56        |null        |
+----------------------+----------------------+---------+------------+------------+
Current Code:
val taxiSub = spark.read.format("csv").option("header", true).option("inferSchema", true).load("/user/zeppelin/taxi/TaxiSubset.csv") //read Data
taxiSub.createOrReplaceTempView("taxiSub") //Create View
val stamp = taxiSub.withColumn("timestamp", unix_timestamp($"tpep_pickup_datetime", "MM/dd/yyyy hh:mm:ss")) //create timestamp
val h = hour(unix_timestamp($"tpep_pickup_datetime","MM/dd/yyyy hh:mm:ss").cast("timestamp"))
val subset= stamp.withColumn("hour",h).filter("hour BETWEEN 9 AND 10").where($"tpep_pickup_datetime".contains("PM")).filter($"total_amount" < 200.00) //filter records between 9pm and 11pm and < 200 total amount
val myData = subset.withColumn("tmp",to_timestamp(col("tpep_pickup_datetime"),"MM/dd/yyyy HH:mm:ss")).//add new timestamp type field
withColumn("timestamp", unix_timestamp(concat_ws(":",hour(col("tmp")),minute(col("tmp")),second(col("tmp"))),"hh:mm:ss")). //extract hour,minute and convert to epoch timestamp value
drop("tmp").select("tpep_pickup_datetime","tpep_dropoff_datetime","timestamp","total_amount")
val testing = myData.withColumn("pickupWindow",date_format($"tpep_pickup_datetime","hh:mm:ss"))
testing.show(false)

date_format() expects the column value in yyyy-MM-dd [hh|HH]:mm:ss format, but the input data is in MM/dd/yyyy hh:mm:ss a format.
We first need to convert tpep_pickup_datetime to a timestamp using the to_timestamp function, then apply date_format to extract hh:mm:ss.
Example:
df.show(false)
//+----------------------+
//|tpep_pickup_datetime |
//+----------------------+
//|05/18/2018 09:56:20 PM|
//|05/18/2018 10:52:49 PM|
//+----------------------+
//to get 24hr format HH value
df.withColumn("pickupWindow",date_format(to_timestamp(col("tpep_pickup_datetime"),"MM/dd/yyyy hh:mm:ss a"),"HH:mm:ss")).
show()
//using from_unixtime,unix_timestamp
df.withColumn("pickupWindow",from_unixtime(unix_timestamp(col("tpep_pickup_datetime"),"MM/dd/yyyy hh:mm:ss a"),"HH:mm:ss")).show()
//+--------------------+------------+
//|tpep_pickup_datetime|pickupWindow|
//+--------------------+------------+
//|05/18/2018 09:56:...|    21:56:20|
//|05/18/2018 10:52:...|    22:52:49|
//+--------------------+------------+
//to get 12hr format hh value
df.withColumn("pickupWindow",date_format(to_timestamp(col("tpep_pickup_datetime"),"MM/dd/yyyy hh:mm:ss a"),"hh:mm:ss")).
show()
//Or using unix_timestamp,from_unixtime
df.withColumn("pickupWindow",from_unixtime(unix_timestamp(col("tpep_pickup_datetime"),"MM/dd/yyyy hh:mm:ss a"),"hh:mm:ss")).show()
//+--------------------+------------+
//|tpep_pickup_datetime|pickupWindow|
//+--------------------+------------+
//|05/18/2018 09:56:...|    09:56:20|
//|05/18/2018 10:52:...|    10:52:49|
//+--------------------+------------+
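Applied back to the pipeline in the question, the last step could look like this sketch (it reuses the question's myData and column names, and assumes the pickup strings always carry an AM/PM marker):
import org.apache.spark.sql.functions._
import spark.implicits._

// parse the 12-hour AM/PM string once, then render it as a 24-hour time-of-day string
val testing = myData.withColumn("pickupWindow",
  date_format(to_timestamp($"tpep_pickup_datetime", "MM/dd/yyyy hh:mm:ss a"), "HH:mm:ss"))
testing.show(false)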

Related

How to get 1st day of the year in pyspark

I have a date variable that I need to pass to various functions.
For example, if I have the date in a variable as 12/09/2021, it should return 01/01/2021.
How do I get the 1st day of the year in PySpark?
You can use the trunc function, which truncates parts of a date.
from pyspark.sql import functions as f

df = spark.createDataFrame([()], [])
(
    df
    .withColumn('current_date', f.current_date())
    .withColumn("year_start", f.trunc("current_date", "year"))
    .show()
)
# Output
+------------+----------+
|current_date|year_start|
+------------+----------+
|  2022-02-23|2022-01-01|
+------------+----------+
x = '12/09/2021'
'01/01/' + x[-4:]
# output: '01/01/2021'
You can achieve this with date_trunc combined with to_date, since date_trunc returns a Timestamp rather than a Date.
Data Preparation
import pandas as pd
import pyspark.sql.functions as F

df = pd.DataFrame({
    'Date': ['2021-01-23', '2002-02-09', '2009-09-19'],
})
sparkDF = sql.createDataFrame(df)  # sql is the SparkSession used in this answer
sparkDF.show()
+----------+
| Date|
+----------+
|2021-01-23|
|2002-02-09|
|2009-09-19|
+----------+
Date Trunc & To Date
sparkDF = sparkDF.withColumn('first_day_year_dt',F.to_date(F.date_trunc('year',F.col('Date')),'yyyy-MM-dd'))\
.withColumn('first_day_year_timestamp',F.date_trunc('year',F.col('Date')))
sparkDF.show()
+----------+-----------------+------------------------+
| Date|first_day_year_dt|first_day_year_timestamp|
+----------+-----------------+------------------------+
|2021-01-23| 2021-01-01| 2021-01-01 00:00:00|
|2002-02-09| 2002-01-01| 2002-01-01 00:00:00|
|2009-09-19| 2009-01-01| 2009-01-01 00:00:00|
+----------+-----------------+------------------------+

Converting string time to day timestamp

I have just started working with PySpark, and need some help converting a column's datatype.
My dataframe has a string column, which stores the time of day in AM/PM, and I need to convert this into datetime for further processing/analysis.
fd = spark.createDataFrame([(['0143A'])], ['dt'])
fd.show()
+-----+
|   dt|
+-----+
|0143A|
+-----+
from pyspark.sql.functions import date_format, to_timestamp
#fd.select(date_format('dt','hhmma')).show()
fd.select(to_timestamp('dt','hhmmaa')).show()
+----------------------------+
|to_timestamp(`dt`, 'hhmmaa')|
+----------------------------+
|                        null|
+----------------------------+
Expected output: 01:43
How can I get the proper datetime format in the above scenario?
Thanks for your help!
If we look at the doc for to_timestamp (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.to_timestamp) we see that the format must be specified as a SimpleDateFormat (https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).
In order to retrieve the time of the day in AM/PM, we must use hhmma. But in SimpleDateFormat, a catches AM or PM, and not A or P. So we need to change our string:
import pyspark.sql.functions as F
df = spark.createDataFrame([(['0143A'])], ['dt'])
df2 = df.withColumn('dt', F.concat(F.col('dt'), F.lit('M')))
df3 = df2.withColumn('ts', F.to_timestamp('dt','hhmma'))
df3.show()
+------+-------------------+
|    dt|                 ts|
+------+-------------------+
|0143AM|1970-01-01 01:43:00|
+------+-------------------+
If you want to retrieve it as a string in the format you mentioned, you can use date_format:
df4 = df3.withColumn('time', F.date_format(F.col('ts'), format='HH:mm'))
df4.show()
+------+-------------------+-----+
|    dt|                 ts| time|
+------+-------------------+-----+
|0143AM|1970-01-01 01:43:00|01:43|
+------+-------------------+-----+

How to find the time difference between 2 date-times in Scala?

I have a dataframe
+-----+----+----------+------------+----------+------------+
|empId| lId|     date1|       time1|     date2|       time2|
+-----+----+----------+------------+----------+------------+
| 1234|1212|2018-04-20|21:40:29.077|2018-04-20|22:40:29.077|
| 1235|1212|2018-04-20|22:40:29.077|2018-04-21|00:40:29.077|
+-----+----+----------+------------+----------+------------+
Need to find the time difference between the 2 date-times (in minutes) for each empId and save it as a new column.
Required output :
+-----+----+----------+------------+----------+------------+---------+
|empId| lId|     date1|       time1|     date2|       time2|TimeDiff |
+-----+----+----------+------------+----------+------------+---------+
| 1234|1212|2018-04-20|21:40:29.077|2018-04-20|22:40:29.077|60       |
| 1235|1212|2018-04-20|22:40:29.077|2018-04-21|00:40:29.077|120      |
+-----+----+----------+------------+----------+------------+---------+
You can concat the date and time, convert the result to a timestamp, and find the difference in minutes as below:
import org.apache.spark.sql.functions._
val format = "yyyy-MM-dd HH:mm:ss.SSS" //datetime format after concat
val newDF = df1.withColumn("TimeDiffInMinute",
  abs(unix_timestamp(concat_ws(" ", $"date1", $"time1"), format).cast("long")
    - unix_timestamp(concat_ws(" ", $"date2", $"time2"), format).cast("long")) / 60D
)
unix_timestamp converts the datetime string to an epoch timestamp; subtracting the two timestamps gives seconds, and dividing by 60 gives minutes.
Output:
+-----+----+----------+------------+----------+------------+----------------+
|empId| lId|     date1|       time1|     date2|       time2|TimeDiffInMinute|
+-----+----+----------+------------+----------+------------+----------------+
| 1234|1212|2018-04-20|21:40:29.077|2018-04-20|22:40:29.077|            60.0|
| 1235|1212|2018-04-20|22:40:29.077|2018-04-21|00:40:29.077|           120.0|
+-----+----+----------+------------+----------+------------+----------------+
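For reference, a self-contained sketch that rebuilds the sample data from the question and applies the same unix_timestamp approach (variable and column names here follow the question; the assembled DataFrame is just illustrative):
import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq(
  (1234, 1212, "2018-04-20", "21:40:29.077", "2018-04-20", "22:40:29.077"),
  (1235, 1212, "2018-04-20", "22:40:29.077", "2018-04-21", "00:40:29.077")
).toDF("empId", "lId", "date1", "time1", "date2", "time2")

val format = "yyyy-MM-dd HH:mm:ss.SSS"
df1.withColumn("TimeDiff",
    abs(unix_timestamp(concat_ws(" ", $"date1", $"time1"), format)
      - unix_timestamp(concat_ws(" ", $"date2", $"time2"), format)) / 60D)
  .show(false)
// TimeDiff comes out as 60.0 and 120.0 minutes for the two sample rows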
Hope this helped!

How to change date format in Spark?

I have the following DataFrame:
+----------+-------------------+
| timestamp|            created|
+----------+-------------------+
|1519858893|2018-03-01 00:01:33|
|1519858950|2018-03-01 00:02:30|
|1519859900|2018-03-01 00:18:20|
|1519859900|2018-03-01 00:18:20|
+----------+-------------------+
How to create a timestamp correctly?
I was able to create a timestamp column which is an epoch timestamp, but the dates do not coincide:
df.withColumn("timestamp",unix_timestamp($"created"))
For example, 1519858893 points to 2018-02-28.
Just use the date_format and to_utc_timestamp inbuilt functions:
import org.apache.spark.sql.functions._
df.withColumn("timestamp", to_utc_timestamp(date_format(col("created"), "yyyy-MM-dd"), "Asia/Kathmandu"))
Try the below code:
import org.apache.spark.sql.types.DateType
df.withColumn("dateColumn", df("timestamp").cast(DateType))
You can check one solution here https://stackoverflow.com/a/46595413
To elaborate more on that, with the dataframe having different formats of timestamp/date as strings, you can do this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType
import spark.implicits._

val df = spark.sparkContext.parallelize(Seq("2020-04-21 10:43:12.000Z", "20-04-2019 10:34:12", "11-30-2019 10:34:12", "2020-05-21 21:32:43", "20-04-2019", "2020-04-21")).toDF("ts")
def strToDate(col: Column): Column = {
  val formats: Seq[String] = Seq("dd-MM-yyyy HH:mm:ss", "yyyy-MM-dd HH:mm:ss", "dd-MM-yyyy", "yyyy-MM-dd")
  coalesce(formats.map(f => to_timestamp(col, f).cast(DateType)): _*)
}
val formattedDF = df.withColumn("dt", strToDate(df.col("ts")))
formattedDF.show()
+--------------------+----------+
| ts| dt|
+--------------------+----------+
|2020-04-21 10:43:...|2020-04-21|
| 20-04-2019 10:34:12|2019-04-20|
| 2020-05-21 21:32:43|2020-05-21|
| 20-04-2019|2019-04-20|
| 2020-04-21|2020-04-21|
+--------------------+----------+
Note: this code assumes that the data does not contain any values in the formats MM-dd-yyyy or MM-dd-yyyy HH:mm:ss.

Subtract current date from another date in a dataframe in Scala

First of all, thank you for taking the time to read my question :)
My question is the following: in Spark with Scala, I have a dataframe that contains a string with a date in the format dd/MM/yyyy HH:mm, for example df:
+----------------+
|date |
+----------------+
|8/11/2017 15:00 |
|9/11/2017 10:00 |
+----------------+
I want to get the difference between the current date and the date in the dataframe, in seconds, for example:
df.withColumn("difference", currentDate - unix_timestamp(col(date)))
+----------------+------------+
|date            | difference |
+----------------+------------+
|8/11/2017 15:00 | xxxxxxxxxx |
|9/11/2017 10:00 | xxxxxxxxxx |
+----------------+------------+
I tried
val current = current_timestamp()
df.withColumn("difference", current - unix_timestamp(col(date)))
but get this error
org.apache.spark.sql.AnalysisException: cannot resolve '(current_timestamp() - unix_timestamp(date, 'yyyy-MM-dd HH:mm:ss'))' due to data type mismatch: differing types in '(current_timestamp() - unix_timestamp(date, 'yyyy-MM-dd HH:mm:ss'))' (timestamp and bigint).;;
I also tried
val current = BigInt(System.currentTimeMillis / 1000)
df.withColumn("difference", current - unix_timestamp(col(date)))
and
val current = unix_timestamp(current_timestamp())
but the column "difference" is null.
Thanks
You have to use the correct format for unix_timestamp:
df.withColumn("difference", current_timestamp().cast("long") - unix_timestamp(col("date"), "dd/MM/yyyy HH:mm"))
or with a recent version:
to_timestamp(col("date"), "dd/MM/yyyy HH:mm") - current_timestamp()
to get an Interval column.
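A minimal end-to-end sketch of the first variant, assuming a DataFrame with a date column holding strings like those in the question:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("8/11/2017 15:00", "9/11/2017 10:00").toDF("date")
df.withColumn("difference",
    current_timestamp().cast("long") - unix_timestamp($"date", "dd/MM/yyyy HH:mm"))
  .show(false)
// 'difference' is the elapsed time in seconds; on Spark 3's non-legacy parser,
// single-digit days may need the pattern "d/MM/yyyy HH:mm" instead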