Subtract current date from another date in dataframe scala - scala

First of all, thank you for taking the time to read my question :)
My question is the following: in Spark with Scala, I have a dataframe that contains a string column with a date in the format dd/MM/yyyy HH:mm, for example df
+----------------+
|date |
+----------------+
|8/11/2017 15:00 |
|9/11/2017 10:00 |
+----------------+
I want to get the difference between the current date and the date in the dataframe, in seconds, for example
df.withColumn("difference", currentDate - unix_timestamp(col("date")))
+----------------+------------+
|date | difference |
+----------------+------------+
|8/11/2017 15:00 | xxxxxxxxxx |
|9/11/2017 10:00 | xxxxxxxxxx |
+----------------+------------+
I tried
val current = current_timestamp()
df.withColumn("difference", current - unix_timestamp(col("date")))
but get this error
org.apache.spark.sql.AnalysisException: cannot resolve '(current_timestamp() - unix_timestamp(date, 'yyyy-MM-dd HH:mm:ss'))' due to data type mismatch: differing types in '(current_timestamp() - unix_timestamp(date, 'yyyy-MM-dd HH:mm:ss'))' (timestamp and bigint).;;
I also tried
val current = BigInt(System.currentTimeMillis / 1000)
df.withColumn("difference", current - unix_timestamp(col("date")))
and
val current = unix_timestamp(current_timestamp())
but the column "difference" is null.
Thanks

You have to use the correct format for unix_timestamp:
df.withColumn("difference", current_timestamp().cast("long") - unix_timestamp(col("date"), "dd/MM/yyyy HH:mm"))
or, with a recent version:
to_timestamp(col("date"), "dd/MM/yyyy HH:mm") - current_timestamp()
to get an interval column.
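Putting it together, a minimal self-contained sketch (assuming a SparkSession in scope as spark and the two sample rows from the question):
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("8/11/2017 15:00", "9/11/2017 10:00").toDF("date")

// Difference in seconds between now and the parsed date.
// Note: on Spark 3's default parser, single-digit days may need the pattern "d/M/yyyy HH:mm".
val result = df.withColumn(
  "difference",
  current_timestamp().cast("long") - unix_timestamp(col("date"), "dd/MM/yyyy HH:mm"))

result.show(false)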

Related

to_date in selectExpr of PySpark is truncating datetime to year by default. How to avoid this?

I have a requirement where I derive the year to be loaded and then have to load the first day and last day of that year in date format to a table.
Here is what I'm doing:
boy = str(nxt_yr)+'01-01'
eoy = str(nxt_yr)+'12-31'
df_final = df_demo.selectExpr("to_date('{}','yyyy-MM-dd') as strt_dt".format(boy),"to_date('{}','yyyy-MM-dd') as end_dt".format(eoy))
spark.sql("set spark.sql.legacy.timeParserPolicy = LEGACY")
df_final.show(1)
This is giving me 2023-01-01 in both fields, with date datatype.
Is this expected behavior, and if yes, is there any workaround?
Note: I tried hardcoding the date as 2022-11-30 in the code, but I still received the beginning of the year in the output.
It's working as expected; additionally, you are missing a - within the dates you create for conversion:
nxt_yr = 2022
boy = str(nxt_yr)+'-01-01'   # note the added '-'
eoy = str(nxt_yr)+'-12-31'   # note the added '-'
sql.sql("set spark.sql.legacy.timeParserPolicy = LEGACY")
sql.sql(f"""
SELECT
to_date('{boy}','yyyy-MM-dd') as strt_dt
,to_date('{eoy}','yyyy-MM-dd') as end_dt
"""
).show()
+----------+----------+
| strt_dt| end_dt|
+----------+----------+
|2022-01-01|2022-12-31|
+----------+----------+
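For comparison, a minimal Scala sketch of the same idea (assuming a SparkSession in scope as spark); the only real fix is the '-' separators in the date strings:
val nxtYr = 2022
val boy = s"$nxtYr-01-01"   // note the "-" separators
val eoy = s"$nxtYr-12-31"

// A one-row dataframe just to evaluate the expressions, as in the question.
spark.range(1).selectExpr(
  s"to_date('$boy', 'yyyy-MM-dd') as strt_dt",
  s"to_date('$eoy', 'yyyy-MM-dd') as end_dt"
).show(1)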

How to convert one time zone to another in Spark Dataframe

I am reading from PostgreSQL into a Spark Dataframe and have a date column in PostgreSQL like below:
last_upd_date
---------------------
"2021-04-21 22:33:06.308639-05"
But in the Spark dataframe it's adding the hour interval,
e.g. 2021-04-22 03:33:06.308639.
Here it is adding 5 hours to the last_upd_date column,
but I want the output as 2021-04-21 22:33:06.308639.
Can anyone help me fix this in the Spark dataframe?
You can create a UDF that formats the timestamp with the required timezone:
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions._
import spark.implicits._

// Format the instant as a wall-clock string in the requested time zone.
val formatTimestampWithTz = udf((i: Instant, zone: String) =>
  i.atZone(ZoneId.of(zone))
    .format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")))

val df = Seq("2021-04-21 22:33:06.308639-05").toDF("dateString")
  .withColumn("date", to_timestamp('dateString, "yyyy-MM-dd HH:mm:ss.SSSSSSx"))
  .withColumn("date in Berlin", formatTimestampWithTz('date, lit("Europe/Berlin")))
  .withColumn("date in Anchorage", formatTimestampWithTz('date, lit("America/Anchorage")))
  .withColumn("date in GMT-5", formatTimestampWithTz('date, lit("-5")))

df.show(numRows = 10, truncate = 50, vertical = true)
Result:
-RECORD 0------------------------------------------
dateString | 2021-04-21 22:33:06.308639-05
date | 2021-04-22 05:33:06.308639
date in Berlin | 2021-04-22 05:33:06.308639
date in Anchorage | 2021-04-21 19:33:06.308639
date in GMT-5 | 2021-04-21 22:33:06.308639
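As an alternative to the UDF, a minimal sketch using the built-in from_utc_timestamp; this assumes the session time zone is set to UTC so the parsed instant is rendered in UTC before the shift (adjust if your session already uses another zone):
import org.apache.spark.sql.functions._
import spark.implicits._

// Assumption: render instants in UTC for this example.
spark.conf.set("spark.sql.session.timeZone", "UTC")

val dfAlt = Seq("2021-04-21 22:33:06.308639-05").toDF("dateString")
  .withColumn("date", to_timestamp($"dateString", "yyyy-MM-dd HH:mm:ss.SSSSSSx"))
  // Shift the UTC-rendered instant into the -05:00 offset of the source value.
  .withColumn("date in GMT-5", from_utc_timestamp($"date", "-05:00"))

dfAlt.show(truncate = false)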

Spark Scala - Null when trying to extract time from timestamp

I'm trying to extract the time from a timestamp using the below code, but it is returning a null value instead of the time. I have already filtered my dataset to have the records I need, so I can ignore AM/PM that comes from the input column.
I did some reading and it seems using date_format should work in this circumstance.
Any thoughts?
Current Output:
+----------------------+----------------------+---------+------------+------------+
|tpep_pickup_datetime |tpep_dropoff_datetime |timestamp|total_amount|pickupWindow|
+----------------------+----------------------+---------+------------+------------+
|05/18/2018 09:56:20 PM|05/18/2018 10:50:38 PM|35780 |52.87 |null |
|05/18/2018 10:52:49 PM|05/18/2018 11:08:47 PM|39169 |14.76 |null |
|05/18/2018 09:01:22 PM|05/18/2018 09:05:36 PM|32482 |6.3 |null |
|05/18/2018 09:00:29 PM|05/18/2018 09:05:31 PM|32429 |7.56 |null |
+----------------------+----------------------+---------+------------+------------+
Current Code:
val taxiSub = spark.read.format("csv").option("header", true).option("inferSchema", true).load("/user/zeppelin/taxi/TaxiSubset.csv") //read Data
taxiSub.createOrReplaceTempView("taxiSub") //Create View
val stamp = taxiSub.withColumn("timestamp", unix_timestamp($"tpep_pickup_datetime", "MM/dd/yyyy hh:mm:ss")) //create timestamp
val h = hour(unix_timestamp($"tpep_pickup_datetime","MM/dd/yyyy hh:mm:ss").cast("timestamp"))
val subset= stamp.withColumn("hour",h).filter("hour BETWEEN 9 AND 10").where($"tpep_pickup_datetime".contains("PM")).filter($"total_amount" < 200.00) //filter records between 9pm and 11pm and < 200 total amount
val myData = subset.withColumn("tmp",to_timestamp(col("tpep_pickup_datetime"),"MM/dd/yyyy HH:mm:ss")).//add new timestamp type field
withColumn("timestamp", unix_timestamp(concat_ws(":",hour(col("tmp")),minute(col("tmp")),second(col("tmp"))),"hh:mm:ss")). //extract hour,minute and convert to epoch timestamp value
drop("tmp").select("tpep_pickup_datetime","tpep_dropoff_datetime","timestamp","total_amount")
val testing = myData.withColumn("pickupWindow",date_format($"tpep_pickup_datetime","hh:mm:ss"))
testing.show(false)
date_format() expects the column value in yyyy-MM-dd [hh|HH]:mm:ss format, but the input data has MM/dd/yyyy, etc.
First we need to convert tpep_pickup_datetime to a timestamp using the to_timestamp function, then apply date_format to extract hh:mm:ss.
Example:
df.show(false)
//+----------------------+
//|tpep_pickup_datetime |
//+----------------------+
//|05/18/2018 09:56:20 PM|
//|05/18/2018 10:52:49 PM|
//+----------------------+
//to get 24hr format HH value
df.withColumn("pickupWindow",date_format(to_timestamp(col("tpep_pickup_datetime"),"MM/dd/yyyy hh:mm:ss a"),"HH:mm:ss")).
show()
//using from_unixtime,unix_timestamp
df.withColumn("pickupWindow",from_unixtime(unix_timestamp(col("tpep_pickup_datetime"),"MM/dd/yyyy hh:mm:ss a"),"HH:mm:ss")).show()
//+--------------------+------------+
//|tpep_pickup_datetime|pickupWindow|
//+--------------------+------------+
//|05/18/2018 09:56:...| 21:56:20|
//|05/18/2018 10:52:...| 22:52:49|
//+--------------------+------------+
//to get 12hr format hh value
df.withColumn("pickupWindow",date_format(to_timestamp(col("tpep_pickup_datetime"),"MM/dd/yyyy hh:mm:ss a"),"hh:mm:ss")).
show()
//Or using unix_timestamp,from_unixtime
df.withColumn("pickupWindow",from_unixtime(unix_timestamp(col("tpep_pickup_datetime"),"MM/dd/yyyy hh:mm:ss a"),"hh:mm:ss")).show()
//+--------------------+------------+
//|tpep_pickup_datetime|pickupWindow|
//+--------------------+------------+
//|05/18/2018 09:56:...| 09:56:20|
//|05/18/2018 10:52:...| 10:52:49|
//+--------------------+------------+

How to find the time difference between 2 date-times in Scala?

I have a dataframe
+-----+----+----------+------------+----------+------------+
|empId| lId| date1| time1 | date2 | time2 |
+-----+----+----------+------------+----------+------------+
| 1234|1212|2018-04-20|21:40:29.077|2018-04-20|22:40:29.077|
| 1235|1212|2018-04-20|22:40:29.077|2018-04-21|00:40:29.077|
+-----+----+----------+------------+----------+------------+
I need to find the time difference between the two date-times (in minutes) for each empId and save it as a new column.
Required output:
+-----+----+----------+------------+----------+------------+---------+
|empId| lId| date1| time1 | date2 | time2 |TimeDiff |
+-----+----+----------+------------+----------+------------+---------+
| 1234|1212|2018-04-20|21:40:29.077|2018-04-20|22:40:29.077|60 |
| 1235|1212|2018-04-20|22:40:29.077|2018-04-21|00:40:29.077|120 |
+-----+----+----------+------------+----------+------------+---------+
You can concat the date and time, convert it to a timestamp, and find the difference in minutes as below:
import org.apache.spark.sql.functions._
import spark.implicits._

val format = "yyyy-MM-dd HH:mm:ss.SSS" // datetime format after concat

val newDF = df1.withColumn("TimeDiffInMinute",
  abs(unix_timestamp(concat_ws(" ", $"date1", $"time1"), format).cast("long")
    - unix_timestamp(concat_ws(" ", $"date2", $"time2"), format).cast("long")) / 60D
)
unix_timestamp converts the datetime to a timestamp; subtracting the timestamps gives seconds, and dividing by 60 gives minutes.
Output:
+-----+----+----------+------------+----------+------------+----------------+
|empId| lId|     date1|       time1|     date2|       time2|TimeDiffInMinute|
+-----+----+----------+------------+----------+------------+----------------+
| 1234|1212|2018-04-20|21:40:29.077|2018-04-20|22:40:29.077|            60.0|
| 1235|1212|2018-04-20|22:40:29.077|2018-04-21|00:40:29.077|           120.0|
+-----+----+----------+------------+----------+------------+----------------+
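If you need whole minutes (60 rather than 60.0) as in the required output, one option is to cast the result; a small follow-up sketch reusing df1 and format from above:
// Same difference as above, but cast to an integer number of minutes.
val intMinutes = df1.withColumn("TimeDiff",
  (abs(unix_timestamp(concat_ws(" ", $"date1", $"time1"), format)
     - unix_timestamp(concat_ws(" ", $"date2", $"time2"), format)) / 60).cast("int"))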
Hope this helped!

How to change date format in Spark?

I have the following DataFrame:
+----------+-------------------+
| timestamp| created|
+----------+-------------------+
|1519858893|2018-03-01 00:01:33|
|1519858950|2018-03-01 00:02:30|
|1519859900|2018-03-01 00:18:20|
|1519859900|2018-03-01 00:18:20|
How do I create the timestamp correctly?
I was able to create a timestamp column, which is an epoch timestamp, but the dates do not coincide:
df.withColumn("timestamp",unix_timestamp($"created"))
For example, 1519858893 points to 2018-02-28.
Just use the date_format and to_utc_timestamp inbuilt functions:
import org.apache.spark.sql.functions._
df.withColumn("timestamp", to_utc_timestamp(date_format(col("created"), "yyyy-MM-dd"), "Asia/Kathmandu"))
Try below code
df.withColumn("dateColumn", df("timestamp").cast(DateType))
You can check one solution here: https://stackoverflow.com/a/46595413
To elaborate more on that, with a dataframe having different formats of timestamps/dates as strings, you can do this:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType
import spark.implicits._

val df = spark.sparkContext.parallelize(Seq("2020-04-21 10:43:12.000Z", "20-04-2019 10:34:12", "11-30-2019 10:34:12", "2020-05-21 21:32:43", "20-04-2019", "2020-04-21")).toDF("ts")

// Try each format in order; coalesce keeps the first successful parse.
def strToDate(col: Column): Column = {
  val formats: Seq[String] = Seq("dd-MM-yyyy HH:mm:SS", "yyyy-MM-dd HH:mm:SS", "dd-MM-yyyy", "yyyy-MM-dd")
  coalesce(formats.map(f => to_timestamp(col, f).cast(DateType)): _*)
}
val formattedDF = df.withColumn("dt", strToDate(df.col("ts")))
formattedDF.show()
+--------------------+----------+
| ts| dt|
+--------------------+----------+
|2020-04-21 10:43:...|2020-04-21|
| 20-04-2019 10:34:12|2019-04-20|
| 2020-05-21 21:32:43|2020-05-21|
| 20-04-2019|2019-04-20|
| 2020-04-21|2020-04-21|
+--------------------+----------+
Note: this code assumes that the data does not contain any column in the formats MM-dd-yyyy or MM-dd-yyyy HH:mm:SS.