Convert timestamp column from UTC to EST in spark scala

I have a timestamp column in a Spark DataFrame with values like '2019-06-13T11:39:10.244Z'.
My goal is to convert this column to EST (subtracting 4 hours) while keeping the same format.
I tried the from_utc_timestamp API, but it seems to convert the UTC time to my local time zone (+5:30) first and then subtract 4 hours from it. I also tried Joda-Time, but for some reason it adds 33 days to the EST time.
input = 2019-06-13T11:39:10.244Z
using from_utc_timestamp api:
val tDf = df.withColumn("newTimeCol", to_utc_timestamp(col("timeCol"), "America/New_York"))
output = 2019-06-13T13:09:10.244Z+5:30
using Joda time package:
val coder: String => String = (arg: String) => {
  new DateTime(arg, DateTimeZone.UTC).minusHours(4).toString("yyyy-mm-dd'T'HH:mm:s.SS'Z'")
}
val sqlfunc = udf(coder)
val tDf = df.withColumn("newTime", sqlfunc(col("_c20")))
output = 2019-39-13T07:39:10.244Z
desired output = 2019-06-13T07:39:10.244Z
Kindly advise how I should proceed. Thanks in advance.

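There is a typo in the format string you use to create the output.
Your format string should be yyyy-MM-dd'T'HH:mm:ss.SSS'Z' but it is yyyy-mm-dd'T'HH:mm:s.SS'Z'.
mm is the pattern letter for minutes while MM is the pattern letter for months, which is why the month is printed as 39. Use ss.SSS rather than s.SS as well if you want two-digit seconds and three-digit milliseconds. You can check all pattern letters in the Joda-Time DateTimeFormat documentation.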
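
For comparison, here is a minimal sketch of the same conversion with java.time instead of Joda-Time (a swapped-in library, not what the question used). The object name UtcToNewYork and the method name convert are made up for illustration. It relies on the America/New_York zone rules rather than a fixed 4-hour subtraction, so winter dates get the 5-hour offset.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

// Hypothetical helper, not part of Spark or Joda-Time.
object UtcToNewYork {
  // MM = month, mm = minute, ss = second, SSS = milliseconds.
  // The trailing 'Z' is a quoted literal kept only to match the requested layout;
  // the value in front of it is New York local time, not UTC.
  private val outFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
  private val newYork = ZoneId.of("America/New_York")

  // Parses an ISO-8601 UTC instant and renders it in New York local time.
  def convert(utcIso: String): String =
    Instant.parse(utcIso).atZone(newYork).format(outFmt)
}
```

With the input from the question, UtcToNewYork.convert("2019-06-13T11:39:10.244Z") returns 2019-06-13T07:39:10.244Z. Wrapping the method in a Spark udf would be one way to apply it to a string column.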

Related

Convert date to another format Scala Spark

I am reading a CSV that contains two types of date:
dd-MMM-yyyy hh:mm:ss -> 13-Dec-2019 17:10:00
dd/MM/yyyy hh:mm -> 11/02/2020 17:33
I am trying to transform all dates of the first type into the second type but I can't find a good solution. I am trying this:
val pr_date = readeve.withColumn("Date", when(to_date(col("Date"),"dd-MMM-yyyy hh:mm:ss").isNotNull,
to_date(col("Date"),"dd/MM/yyyy hh:mm")))
pr_date.show(25)
And I get the entire Date column as null values.
I am trying with this function:
def to_date_(col: Column,
formats: Seq[String] = Seq("dd-MMM-yyyy hh:mm:ss", "dd/MM/yyyy hh:mm")) = {
coalesce(formats.map(f => to_date(col, f)): _*)
}
val p2 = readeve.withColumn("Date",to_date_(readeve.col(("Date")))).show(125)
And for the first type of date I get nulls too.
What am I doing wrong? (I am new to Scala Spark.)
Scala version: 2.11.7
Spark version: 2.4.3

Try the code below. Note that 17 is an hour on the 24-hour clock, so the pattern needs HH, not hh. Also use to_timestamp instead of to_date, because you want to keep the time.
val pr_date = readeve.withColumn(
"Date",
coalesce(
date_format(to_timestamp(col("Date"),"dd-MMM-yyyy HH:mm:ss"),"dd/MM/yyyy HH:mm"),
date_format(to_timestamp(col("Date"),"dd/MM/yyyy HH:mm"),"dd/MM/yyyy HH:mm")
)
)
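
The "try each format, keep the first that parses" idea behind coalesce can be sketched outside Spark with java.time. This is only an illustration under assumptions: DateNormalizer and normalize are made-up names, and the month abbreviations are assumed to be English.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.util.Locale
import scala.util.Try

// Hypothetical helper that mirrors coalesce(to_timestamp(...), to_timestamp(...)).
object DateNormalizer {
  // HH is the 24-hour clock; Locale.ENGLISH is needed for names such as "Dec".
  private val inputFormats: Seq[DateTimeFormatter] =
    Seq("dd-MMM-yyyy HH:mm:ss", "dd/MM/yyyy HH:mm")
      .map(p => DateTimeFormatter.ofPattern(p, Locale.ENGLISH))

  private val outputFormat = DateTimeFormatter.ofPattern("dd/MM/yyyy HH:mm")

  // Returns the date in the target layout, or None if no input format matches.
  def normalize(raw: String): Option[String] =
    inputFormats.iterator
      .map(f => Try(LocalDateTime.parse(raw, f)).toOption)
      .collectFirst { case Some(parsed) => parsed.format(outputFormat) }
}
```

Here DateNormalizer.normalize("13-Dec-2019 17:10:00") yields Some("13/12/2019 17:10"), and a value already in the second layout passes through unchanged.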

Convert time string into timestamp/date time in scala

I receive time data in my source as a CSV file in the format HHMMSSHS. I am not sure what HS stands for. Example data looks like 15110708.
I am creating a Databricks table from the received columns and data, and I want to convert this field to a time while processing in Scala.
I use UDFs to format data on the fly, but I am stuck writing a UDF that parses only a time.
The final output should be 15:11:07:08 or any time format suitable for this string.
I tried java.text.SimpleDateFormat and ran into an unparseable-string error.
Is there any way to convert the string above to a time format?
I am storing this value as a column in a Databricks notebook table. Is there any format other than string for saving time-only values?

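Have you tried this?
import java.time.LocalTime
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.udf
val localTime = udf { str: String =>
  // Built inside the UDF because DateTimeFormatter is not Serializable.
  // SS reads the last two digits as a fraction of a second (hundredths).
  val dtf = DateTimeFormatter.ofPattern("HHmmssSS")
  LocalTime.parse(str, dtf).toString
}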
that gives:
+---------+------------+
|Timestamp|converted |
+---------+------------+
|15110708 |15:11:07.080|
|15110708 |15:11:07.080|
+---------+------------+
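
If the CSV may contain malformed values, the parse step can be made safe so that a bad row yields no value instead of failing the job. The following is a sketch under the same assumption as the answer above, namely that HS means hundredths of a second; TimeField and parse are made-up names.

```scala
import java.time.LocalTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper; assumes the last two digits are hundredths of a second.
object TimeField {
  private val fmt = DateTimeFormatter.ofPattern("HHmmssSS")

  // Some(time) for a valid HHmmssSS string, None for anything that does not parse.
  def parse(raw: String): Option[LocalTime] =
    Try(LocalTime.parse(raw, fmt)).toOption
}
```

A UDF that returns Option[String] (for example TimeField.parse(str).map(_.toString)) would produce null for the rows that fail to parse.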

How to convert a DateTime with milliseconds into epoch time with milliseconds

I have data in hive table in the below format.
2019-11-21 18:19:15.817
I wrote a sql query as below to get the above column value into epoch format.
val newDF = spark.sql(f"""select TRIM(id) as ID, unix_timestamp(sig_ts) as SIG_TS from table""")
And I am getting the output column SIG_TS as 1574360296 which is not having milliseconds.
How to get the epoch timestamp of a date with milliseconds?

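A simple way is to create a UDF, since Spark's built-in unix_timestamp truncates to seconds.
import java.sql.Timestamp
import org.apache.spark.sql.functions.{udf, unix_timestamp}
import spark.implicits._ // assumes a SparkSession named spark
// Timestamp.getTime returns milliseconds since the epoch
val fullTimestampUDF = udf { t: Timestamp => t.getTime }
val df = Seq("2019-11-21 18:19:15.817").toDF("sig_ts")
  .withColumn("sig_ts_ut", unix_timestamp($"sig_ts"))
  .withColumn("sig_ts_ut_long", fullTimestampUDF($"sig_ts"))
df.show(false)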
+-----------------------+----------+--------------+
|sig_ts |sig_ts_ut |sig_ts_ut_long|
+-----------------------+----------+--------------+
|2019-11-21 18:19:15.817|1574356755|1574356755817 |
+-----------------------+----------+--------------+
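
The epoch value depends on the time zone in which the string is interpreted. As a sketch outside Spark, assuming the stored value is meant as UTC (an assumption, since the question does not say), the milliseconds can be computed with java.time; EpochMillis and fromUtcString are made-up names.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical helper; treats the input as a UTC wall-clock time.
object EpochMillis {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS")

  // Parses "yyyy-MM-dd HH:mm:ss.SSS" and returns milliseconds since 1970-01-01T00:00:00Z.
  def fromUtcString(s: String): Long =
    LocalDateTime.parse(s, fmt).toInstant(ZoneOffset.UTC).toEpochMilli
}
```

EpochMillis.fromUtcString("2019-11-21 18:19:15.817") returns 1574360355817. The table above shows a value 3600 seconds smaller, which is consistent with a session time zone one hour ahead of UTC.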

How to convert timestamp column to epoch seconds?

How do you convert a timestamp column to epoch seconds?
var df = sc.parallelize(Seq("2018-07-01T00:00:00Z")).toDF("date_string")
df = df.withColumn("timestamp", $"date_string".cast("timestamp"))
df.show(false)
DataFrame:
+--------------------+---------------------+
|date_string |timestamp |
+--------------------+---------------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|
+--------------------+---------------------+

If you have a timestamp column, you can cast it to a long to get the epoch seconds:
df = df.withColumn("epoch_seconds", $"timestamp".cast("long"))
df.show(false)
DataFrame
+--------------------+---------------------+-------------+
|date_string |timestamp |epoch_seconds|
+--------------------+---------------------+-------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|1530403200 |
+--------------------+---------------------+-------------+

Use unix_timestamp from org.apache.spark.sql.functions. It accepts either a timestamp column or a string column, for which you can specify the format. From the documentation:
public static Column unix_timestamp(Column s)
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return null if fail.
public static Column unix_timestamp(Column s, String p)
Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
Use as follows:
import org.apache.spark.sql.functions._
df.withColumn("epoch_seconds", unix_timestamp($"timestamp"))
or, if the column is a string in another format:
df.withColumn("epoch_seconds", unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
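
One caveat with the second form: a quoted 'Z' in the pattern is matched as a plain character, and, as the documentation quoted above says, parsing uses the default time zone. When the string really is an ISO-8601 UTC instant, a sketch outside Spark that honors the Z could look like this (EpochSeconds and fromIso are made-up names):

```scala
import java.time.Instant

// Hypothetical helper; Instant.parse treats the trailing Z as the UTC designator.
object EpochSeconds {
  def fromIso(s: String): Long = Instant.parse(s).getEpochSecond
}
```

EpochSeconds.fromIso("2018-07-01T00:00:00Z") returns 1530403200, the same value shown in the cast example earlier on this page.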

It can be easily done with unix_timestamp function in spark SQL like this:
spark.sql("SELECT unix_timestamp(inv_time) AS time_as_long FROM agg_counts LIMIT 10").show()
Hope this helps.

You can use the unix_timestamp function and cast the result to whatever data type you need.
Example:
import org.apache.spark.sql.types.LongType
val df1 = df.select(unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'").cast(LongType).as("epoch_seconds"))

Scala : how to convert integer to time stamp

I am facing an issue when trying to find the number of months between two dates using the months_between function. When my input date format is dd/MM/yyyy or any other date format, the function returns the correct output. However, when the input date format is yyyyMMdd, I get the error below.
Code:
val df = spark.read.option("header", "true").option("dateFormat", "yyyyMMdd").option("inferSchema", "true").csv("MyFile.csv")
val filteredMemberDF = df.withColumn("monthsBetween", functions.months_between(col("toDate"), col("fromDT")))
error:
cannot resolve 'months_between(toDate, fromDT)' due to data type mismatch: argument 1 requires timestamp type,
however, 'toDate' is of int type. argument 2 requires timestamp type, however, 'fromDT' is of int type.;
When my input is as below,
id fromDT toDate
11 16/06/2008 16/08/2008
12 13/07/2008 13/10/2008
getting expected output,
id fromDT toDate monthsBetween
11 16/6/2008 16/8/2008 2
12 13/7/2008 13/10/2008 3
When I pass the data below, I get the error above.
id fromDT toDate
11 20150930 20150930
12 20150930 20150930

You first need to use the to_date function to convert those numbers to dates, because inferSchema reads values like 20150930 as integers.
import org.apache.spark.sql.functions._
val df = spark.read
  .option("header", "true")
  .option("dateFormat", "yyyyMMdd")
  .option("inferSchema", "true")
  .csv("MyFile.csv")
// Cast the integer columns to string so to_date can parse them with the pattern
val dfWithDates = df
  .withColumn("toDateReal", to_date(col("toDate").cast("string"), "yyyyMMdd"))
  .withColumn("fromDateReal", to_date(col("fromDT").cast("string"), "yyyyMMdd"))
val filteredMemberDF = dfWithDates
  .withColumn("monthsBetween", months_between(col("toDateReal"), col("fromDateReal")))
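
The same two steps (turn the integer into a date, then count months) can be sketched outside Spark with java.time. WholeMonths, parse, and between are made-up names, and unlike Spark's months_between, which returns a double, ChronoUnit.MONTHS.between counts only complete months.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

// Hypothetical helper; works on integers laid out as yyyyMMdd.
object WholeMonths {
  // BASIC_ISO_DATE is the built-in formatter for the yyyyMMdd layout.
  def parse(yyyymmdd: Int): LocalDate =
    LocalDate.parse(yyyymmdd.toString, DateTimeFormatter.BASIC_ISO_DATE)

  // Number of complete months from the first date to the second.
  def between(from: Int, to: Int): Long =
    ChronoUnit.MONTHS.between(parse(from), parse(to))
}
```

For the dates in the question, WholeMonths.between(20080616, 20080816) is 2 and WholeMonths.between(20080713, 20081013) is 3.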
