Spark DataFrame convert milliseconds timestamp column in string format to human readable time with milliseconds - scala

I have a Spark DataFrame with a timestamp column in milliseconds since the epoch. The column is a string. I now want to transform the column into a human-readable time but keep the milliseconds.
For example:
1614088453671 -> 23-2-2021 13:54:13.671
Every example I found transforms the timestamp to a normal human-readable time without milliseconds.
What I have:
+------------------+
|epoch_time_seconds|
+------------------+
|1614088453671     |
+------------------+
What I want to achieve:
+------------------+------------------------+
|epoch_time_seconds|human_date              |
+------------------+------------------------+
|1614088453671     |23-02-2021 13:54:13.671 |
+------------------+------------------------+

The part before the milliseconds can be obtained with date_format over from_unixtime, while the milliseconds can be obtained with a modulo. Combine them using format_string.
import org.apache.spark.sql.functions._

val df2 = df.withColumn(
  "human_date",
  format_string(
    "%s.%03d", // pad the millisecond part to three digits, e.g. 7 ms -> ".007"
    date_format(
      from_unixtime(col("epoch_time_seconds") / 1000),
      "dd-MM-yyyy HH:mm:ss"
    ),
    (col("epoch_time_seconds") % 1000).cast("long")
  )
)
df2.show(false)
+------------------+-----------------------+
|epoch_time_seconds|human_date             |
+------------------+-----------------------+
|1614088453671     |23-02-2021 13:54:13.671|
+------------------+-----------------------+
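If you prefer a single expression, an alternative sketch (not from the original answer) is to divide by 1000 so the fractional seconds survive, cast the result to timestamp, and let date_format print the milliseconds with the SSS pattern:
import org.apache.spark.sql.functions._

// Dividing the string column by 1000 yields a double with fractional seconds;
// casting that double to timestamp preserves them, so "SSS" prints the millis.
val df3 = df.withColumn(
  "human_date",
  date_format(
    (col("epoch_time_seconds") / 1000).cast("timestamp"),
    "dd-MM-yyyy HH:mm:ss.SSS"
  )
)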

Related

How to convert one time zone to another in Spark Dataframe

I am reading from PostgreSQL into Spark Dataframe and have date column in PostgreSQL like below:
last_upd_date
---------------------
"2021-04-21 22:33:06.308639-05"
But in the Spark DataFrame it's adding the hour offset.
e.g.: 2021-04-22 03:33:06.308639
Here it is adding 5 hours to the last_upd_date column.
But I want the output as 2021-04-21 22:33:06.308639
Can anyone help me fix this in the Spark DataFrame?
You can create a UDF that formats the timestamp with the required time zone:
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions._
import spark.implicits._

val formatTimestampWithTz = udf((i: Instant, zone: String) =>
  i.atZone(ZoneId.of(zone))
    .format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")))

val df = Seq("2021-04-21 22:33:06.308639-05").toDF("dateString")
  .withColumn("date", to_timestamp('dateString, "yyyy-MM-dd HH:mm:ss.SSSSSSx"))
  .withColumn("date in Berlin", formatTimestampWithTz('date, lit("Europe/Berlin")))
  .withColumn("date in Anchorage", formatTimestampWithTz('date, lit("America/Anchorage")))
  .withColumn("date in GMT-5", formatTimestampWithTz('date, lit("-5")))

df.show(numRows = 10, truncate = 50, vertical = true)
Result:
-RECORD 0------------------------------------------
dateString | 2021-04-21 22:33:06.308639-05
date | 2021-04-22 05:33:06.308639
date in Berlin | 2021-04-22 05:33:06.308639
date in Anchorage | 2021-04-21 19:33:06.308639
date in GMT-5 | 2021-04-21 22:33:06.308639
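If you would rather avoid a UDF, a rough built-in alternative is from_utc_timestamp plus date_format. This is only a sketch: it assumes the session time zone is UTC (spark.conf.set("spark.sql.session.timeZone", "UTC")), otherwise the shift applied by from_utc_timestamp will not line up with how date_format renders the result, and the output column names here are just illustrative.
import org.apache.spark.sql.functions._

// Shift the stored instant into the target zone's wall-clock time and render
// it with microseconds; valid only while the session time zone is UTC.
val withZones = df
  .withColumn("berlin_time",
    date_format(from_utc_timestamp(col("date"), "Europe/Berlin"), "yyyy-MM-dd HH:mm:ss.SSSSSS"))
  .withColumn("anchorage_time",
    date_format(from_utc_timestamp(col("date"), "America/Anchorage"), "yyyy-MM-dd HH:mm:ss.SSSSSS"))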

spark sql datediff in days

I am trying to calculate the number of days between current_timestamp() and max(timestamp_field) from a table.
maxModifiedDate = spark.sql("select date_format(max(lastmodifieddate), 'MM/dd/yyyy hh:mm:ss') as maxModifiedDate,date_format(current_timestamp(),'MM/dd/yyyy hh:mm:ss') as CurrentTimeStamp, datediff(current_timestamp(), date_format(max(lastmodifieddate), 'MM/dd/yyyy hh:mm:ss')) as daysDiff from db.tbl")
but I get null for daysDiff. Why is that and how can I fix it?
+-------------------+-------------------+--------+
|    maxModifiedDate|   CurrentTimeStamp|daysDiff|
+-------------------+-------------------+--------+
|01/29/2020 05:07:51|06/29/2020 08:36:28|    null|
+-------------------+-------------------+--------+
Check this out: I used to_timestamp to convert the strings into timestamps and the datediff function to calculate the difference in days.
from pyspark.sql import functions as F
# InputDF
# +-------------------+-------------------+
# | maxModifiedDate| CurrentTimeStamp|
# +-------------------+-------------------+
# |01/29/2020 05:07:51|06/29/2020 08:36:28|
# +-------------------+-------------------+
df.select("maxModifiedDate","CurrentTimeStamp",F.datediff( F.to_timestamp("CurrentTimeStamp", format= 'MM/dd/yyyy'), F.to_timestamp("maxModifiedDate", format= 'MM/dd/yyyy')).alias("datediff")).show()
# +-------------------+-------------------+--------+
# | maxModifiedDate| CurrentTimeStamp|datediff|
# +-------------------+-------------------+--------+
# |01/29/2020 05:07:51|06/29/2020 08:36:28| 152|
# +-------------------+-------------------+--------+
Using Spark SQL:
spark.sql("select maxModifiedDate,CurrentTimeStamp, datediff(to_timestamp(CurrentTimeStamp, 'MM/dd/yyyy'), to_timestamp(maxModifiedDate, 'MM/dd/yyyy')) as datediff from table ").show()
date_format is for changing how a timestamp is displayed and it returns a string; use to_date(col, 'fmt'), unix_timestamp + from_unixtime, or to_timestamp together with datediff instead.
df.show()
#+-------------------+-------------------+
#| maxModifiedDate| CurrentTimeStamp|
#+-------------------+-------------------+
#|01/29/2020 05:07:51|06/29/2020 08:36:28|
#+-------------------+-------------------+
spark.sql("select maxModifiedDate,CurrentTimeStamp,datediff(to_date(maxModifiedDate, 'MM/dd/yyyy'),to_date(CurrentTimeStamp,'MM/dd/yyyy')) as daysDiff from tmp").show()
#+-------------------+-------------------+--------+
#| maxModifiedDate| CurrentTimeStamp|daysDiff|
#+-------------------+-------------------+--------+
#|01/29/2020 05:07:51|06/29/2020 08:36:28| -152|
#+-------------------+-------------------+--------+
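For completeness, a Scala sketch of the same to_date plus datediff approach, assuming the same column names as above:
import org.apache.spark.sql.functions._

// Parse the MM/dd/yyyy-formatted strings into dates, then take the difference in days.
val withDaysDiff = df.withColumn(
  "daysDiff",
  datediff(
    to_date(col("CurrentTimeStamp"), "MM/dd/yyyy"),
    to_date(col("maxModifiedDate"), "MM/dd/yyyy")
  )
)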
I think you could define your own expression to solve your problem, since datediff() only computes the difference between dates, not datetimes.
I suggest something like this, casting your datetimes to long:
diff_datetime = col("end_time").cast("long") - col("start_time").cast("long")
df = df.withColumn("diff", diff_datetime / 60)
Or casting your columns to timestamp using SQL:
SELECT datediff(to_timestamp(end_date), to_timestamp(start_date))
Casting to long gives the difference in seconds; adjust the divisor to change the unit (60 for minutes, 60 * 60 for hours, and so on).
Alternatively, if you want to use that function, you have to cast your datetime column to a date column (without hours, minutes and seconds) using to_date() and then apply datediff().
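A minimal Scala sketch of the cast-to-long approach, assuming end_time and start_time are timestamp columns (hypothetical names):
import org.apache.spark.sql.functions._

// Casting a timestamp to long yields whole seconds since the epoch, so the
// subtraction is a difference in seconds; divide to change the unit.
val withDiff = df
  .withColumn("diff_seconds", col("end_time").cast("long") - col("start_time").cast("long"))
  .withColumn("diff_minutes", col("diff_seconds") / 60)
  .withColumn("diff_days", col("diff_seconds") / (60 * 60 * 24))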

How to convert timestamp column to epoch seconds?

How do you convert a timestamp column to epoch seconds?
var df = sc.parallelize(Seq("2018-07-01T00:00:00Z")).toDF("date_string")
df = df.withColumn("timestamp", $"date_string".cast("timestamp"))
df.show(false)
DataFrame:
+--------------------+---------------------+
|date_string         |timestamp            |
+--------------------+---------------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|
+--------------------+---------------------+
If you have a timestamp, you can cast it to a long to get the epoch seconds:
df = df.withColumn("epoch_seconds", $"timestamp".cast("long"))
df.show(false)
DataFrame
+--------------------+---------------------+-------------+
|date_string         |timestamp            |epoch_seconds|
+--------------------+---------------------+-------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|1530403200   |
+--------------------+---------------------+-------------+
Use unix_timestamp from org.apache.spark.sql.functions. It can take a timestamp column or a string column, in which case it is possible to specify the format. From the documentation:
public static Column unix_timestamp(Column s)
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return null if fail.
public static Column unix_timestamp(Column s, String p)
Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
Use as follows:
import org.apache.spark.sql.functions._
df.withColumn("epoch_seconds", unix_timestamp($"timestamp"))
or, if the column is a string in another format:
df.withColumn("epoch_seconds", unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
It can easily be done with the unix_timestamp function in Spark SQL like this:
spark.sql("SELECT unix_timestamp(inv_time) AS time_as_long FROM agg_counts LIMIT 10").show()
Hope this helps.
You can use the unix_timestamp function and cast the result to any data type.
Example:
import org.apache.spark.sql.types.LongType
val df1 = df.select(unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'").cast(LongType).as("epoch_seconds"))

convert date to integer scala spark

I have a dataframe that contains two date columns, start_date and finish_date, and I created a new column to hold the difference between the two dates.
+--------------------+--------------------+----------+
|          start_date|         finish_date|moyen_date|
+--------------------+--------------------+----------+
|2010-11-03 15:56:...|2010-11-03 17:43:...|         0|
|2010-11-03 17:43:...|2010-11-05 13:21:...|         2|
|2010-11-05 13:21:...|2010-11-05 14:08:...|         0|
|2010-11-05 14:08:...|2010-11-05 14:08:...|         0|
+--------------------+--------------------+----------+
I calculated the difference between the 2 dates:
var result = sqlDF.withColumn("moyen_date",datediff(col("finish_date"), col("start_date")))
But I want to convert start_date and finish_date to integers, knowing that each column contains date + time.
Can someone help me, please?
Thank you
Considering this as part of your dataframe:
df.show(false)
+---------------------+
|ts                   |
+---------------------+
|2010-11-03 15:56:34.0|
+---------------------+
unix_timestamp returns the number of seconds since the epoch. The input column should be of type timestamp. The output column is of type long.
df.withColumn("unix_ts" , unix_timestamp($"ts").show(false)
+---------------------+----------+
|ts                   |unix_ts   |
+---------------------+----------+
|2010-11-03 15:56:34.0|1288817794|
+---------------------+----------+
To convert it back to a timestamp format of your choice, you can use from_unixtime, which also takes an optional format as a parameter. You are using to_date, which is why you're only getting the date and not the time.
df.withColumn("unix_ts" , unix_timestamp($"ts") )
.withColumn("from_utime" , from_unixtime($"unix_ts" , "yyyy-MM-dd HH:mm:ss.S"))
.show(false)
+---------------------+----------+---------------------+
|ts                   |unix_ts   |from_utime           |
+---------------------+----------+---------------------+
|2010-11-03 15:56:34.0|1288817794|2010-11-03 15:56:34.0|
+---------------------+----------+---------------------+
The column from_utime here will be of type string though. To convert it to timestamp, you can simply use:
df.withColumn("from_utime" , $"from_utime".cast("timestamp") )
Since it's already in ISO date format, no specific conversion is needed. For any other format, you will need to use a combination of unix_timestamp and from_unixtime.
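For instance, a sketch for a hypothetical string column str_ts in dd/MM/yyyy HH:mm:ss format (not part of the original data):
import org.apache.spark.sql.functions._

// Parse the custom format into epoch seconds, render it back in the default
// yyyy-MM-dd HH:mm:ss layout, and cast that string to a proper timestamp.
val converted = df
  .withColumn("unix_ts", unix_timestamp(col("str_ts"), "dd/MM/yyyy HH:mm:ss"))
  .withColumn("parsed_ts", from_unixtime(col("unix_ts")).cast("timestamp"))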

Does unix_timestamp truncate or round milliseconds?

From the reference:
Convert time string with given pattern (‘yyyy-MM-dd HH:mm:ss’, by default) to Unix time stamp (in seconds), using the default timezone and the default locale, return null if fail.
I find that this drops milliseconds off DataFrame timestamp columns. I am just wondering whether it simply truncates, or rounds the timestamp to the nearest second.
There is no documentation to back this up, but in Spark 2.2.0 it's truncation; here is a demo:
from pyspark.sql import Row
import pyspark.sql.functions as F
r = Row('datetime')
lst = [r('2017-10-29 10:20:30.102'), r('2017-10-29 10:20:30.999')]
df = spark.createDataFrame(lst)
(df.withColumn('trunc_datetime', F.unix_timestamp(F.col('datetime')))
   .withColumn('seconds', F.from_unixtime(F.col('trunc_datetime'), 'ss'))
   .show(2, False))
+-----------------------+--------------+-------+
|datetime               |trunc_datetime|seconds|
+-----------------------+--------------+-------+
|2017-10-29 10:20:30.102|1509286830    |30     |
|2017-10-29 10:20:30.999|1509286830    |30     |
+-----------------------+--------------+-------+
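If you need to keep the sub-second part rather than lose it to truncation, one option (a Scala sketch, assuming the same datetime string column as in the demo) is to cast through timestamp to double, which preserves the fractional seconds:
import org.apache.spark.sql.functions._

// Cast the string to a timestamp, then to double: the result is epoch seconds
// with the milliseconds kept as a fractional part, e.g. 1509286830.102.
val withMillis = df.withColumn(
  "epoch_with_millis",
  col("datetime").cast("timestamp").cast("double")
)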