Converting CDT timestamp into UTC format in Spark Scala

My DataFrame, myDF, looks like below:
DATE_TIME
Wed Sep 6 15:24:27 CDT 2017
Wed Sep 6 15:30:05 CDT 2017
Expected output in this format:
2017-09-06 15:24:27
2017-09-06 15:30:05
I need to convert the DATE_TIME timestamp to UTC.
I tried the code below in a Databricks notebook, but it is not working:
%scala
val df = Seq(("Wed Sep 6 15:24:27 CDT 2017")).toDF("times")
df.withColumn("times2",date_format(to_timestamp('times,"ddd MMM dd hh:mm:ss CDT yyyy"),"yyyy-MM-dd HH:mm:ss")).show(false)
+---------------------------+------+
|times                      |times2|
+---------------------------+------+
|Wed Sep 6 15:24:27 CDT 2017|null  |
+---------------------------+------+

I think we need to remove the day name (Wed) from your string and then use the to_timestamp() function.
Example:
df.show(false)
/*
+---------------------------+
|times |
+---------------------------+
|Wed Sep 6 15:24:27 CDT 2017|
+---------------------------+
*/
df.withColumn("times2",expr("""to_timestamp(substring(times,5,length(times)),"MMM d HH:mm:ss z yyyy")""")).
show(false)
/*
+---------------------------+-------------------+
|times |times2 |
+---------------------------+-------------------+
|Wed Sep 6 15:24:27 CDT 2017|2017-09-06 15:24:27|
+---------------------------+-------------------+
*/
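
Note that times2 above is rendered in the session timezone, not UTC. If the value itself must come out in UTC, as the title asks, one option is to set the session timezone before formatting. A minimal sketch, assuming a Spark version that honors spark.sql.session.timeZone (2.2+):

import org.apache.spark.sql.functions.{date_format, expr}

// Render the parsed instant in UTC instead of the cluster's local zone.
spark.conf.set("spark.sql.session.timeZone", "UTC")
df.withColumn("times2",
    date_format(
      expr("""to_timestamp(substring(times,5,length(times)),"MMM d HH:mm:ss z yyyy")"""),
      "yyyy-MM-dd HH:mm:ss"))
  .show(false)
// 15:24:27 CDT (UTC-5) becomes 2017-09-06 20:24:27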

Related

Value of date changes in Scala when reading from YAML file

I have a YAML file which has this data:
time:
- days : 50
- date : 2020-02-30
I am reading both of these in my Scala program:
val t = yml.get("time").asInstanceOf[util.ArrayList[util.LinkedHashMap[String, Any]]]
val date = t.get(1)
The output of date is {date=Sat Feb 29 19:00:00 EST 2020}.
If I instead put date: 2020-02-25 in the YAML file, the output is {date=Mon Feb 24 19:00:00 EST 2020} and not {date=Tue Feb 25 19:00:00 EST 2020}.
Why is it always reducing the date value by 1?
Any help is highly appreciated.
PS - I want to validate the date; that is why the input is date: 2020-02-30
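
No answer is shown here, but the behavior is consistent with how SnakeYAML handles YAML timestamps: a bare date is parsed into a java.util.Date at midnight UTC, and Date.toString then renders that instant in the JVM's default timezone (EST is five hours behind UTC). A minimal sketch reproducing the shift under that assumption:

import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

// Assumption: the YAML parser stores the bare date as midnight UTC.
val utcFmt = new SimpleDateFormat("yyyy-MM-dd")
utcFmt.setTimeZone(TimeZone.getTimeZone("UTC"))
val parsed: Date = utcFmt.parse("2020-02-25") // 2020-02-25T00:00:00Z

// Date.toString uses the JVM default timezone; with the default set to EST
// (UTC-5), the same instant prints as the previous evening.
println(parsed) // Mon Feb 24 19:00:00 EST 2020

The invalid 2020-02-30 fits the same picture: it is consistent with a lenient calendar rolling it over to March 1 UTC, which prints as Sat Feb 29 19:00:00 EST 2020, so this parser will not reject invalid dates for you.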

Convert Tue Jul 07 2020 12:30:42 to timestamp in Scala/Spark

I need to convert Tue Jul 07 2020 12:30:42 to a timestamp with Scala for Spark.
So the expected result will be : 2020-07-07 12:30:42
Any idea how to do this, please?
You can use the to_timestamp function.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY") // Spark 3.0 only
df.withColumn("date", to_timestamp('string, "E MMM dd yyyy HH:mm:ss"))
.show(false)
+------------------------+-------------------+
|string |date |
+------------------------+-------------------+
|Tue Jul 07 2020 12:30:42|2020-07-07 12:30:42|
+------------------------+-------------------+
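The LEGACY setting matters because Spark 3.0 moved to a stricter DateTimeFormatter-based parser, in which the day-of-week letter E is allowed for formatting but not for parsing; on Spark 2.x the snippet should work without that line.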

Unexpected date when converting string to timestamp in pyspark

The following example:
import pyspark.sql.functions as F
df = sqlContext.createDataFrame([('Feb 4 1997 10:30:00',), ('Jan 14 2000 13:33:00',), ('Jan 13 2020 01:20:12',)], ['t'])
ts_format = "MMM dd YYYY HH:mm:ss"
df.select(df.t,
          F.to_timestamp(df.t, ts_format),
          F.date_format(F.current_timestamp(), ts_format))\
  .show(truncate=False)
Outputs:
+--------------------+-----------------------------------------+------------------------------------------------------+
|t |to_timestamp(`t`, 'MMM dd YYYY HH:mm:ss')|date_format(current_timestamp(), MMM dd YYYY HH:mm:ss)|
+--------------------+-----------------------------------------+------------------------------------------------------+
|Feb 4 1997 10:30:00 |1996-12-29 10:30:00 |Jan 22 2020 14:38:28 |
|Jan 14 2000 13:33:00|1999-12-26 13:33:00 |Jan 22 2020 14:38:28 |
|Jan 22 2020 14:29:12|2019-12-29 14:29:12 |Jan 22 2020 14:38:28 |
+--------------------+-----------------------------------------+------------------------------------------------------+
Question:
The conversion from current_timestamp() to string works with the given format. Why doesn't the other direction (string to timestamp) work?
Notes:
The pyspark 2.4.4 docs point to SimpleDateFormat patterns.
Changing the year's format to lowercase fixed the issue:
ts_format = "MMM dd yyyy HH:mm:ss"
The reason: in SimpleDateFormat, uppercase Y is the week-based year while lowercase y is the calendar year, so the YYYY pattern makes the parser resolve against the start of the week-year and ignore the given month and day, hence the late-December results.

How to get the lag of a column in a Spark streaming dataframe?

I have data streaming into my Spark Scala application in this format
id mark1 mark2 mark3 time
uuid1 100 200 300 Tue Aug 8 14:06:02 PDT 2017
uuid1 100 200 300 Tue Aug 8 14:06:22 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:32 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:52 PDT 2017
uuid2 150 250 350 Tue Aug 8 14:06:58 PDT 2017
I have it read into columns id, mark1, mark2, mark3 and time. The time is converted to datetime format as well.
I want to get this grouped by id and get the lag for mark1 which gives the previous row's mark1 value.
Something like this:
id mark1 mark2 mark3 prev_mark time
uuid1 100 200 300 null Tue Aug 8 14:06:02 PDT 2017
uuid1 100 200 300 100 Tue Aug 8 14:06:22 PDT 2017
uuid2 150 250 350 null Tue Aug 8 14:06:32 PDT 2017
uuid2 150 250 350 150 Tue Aug 8 14:06:52 PDT 2017
uuid2 150 250 350 150 Tue Aug 8 14:06:58 PDT 2017
Consider the dataframe to be markDF. I have tried:
val window = Window.partitionBy("uuid").orderBy("timestamp")
val newerDF = markDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))
which fails with an error saying non-time windows cannot be applied on streaming/appending datasets/frames.
I have also tried:
val window = Window.partitionBy("uuid").orderBy("timestamp").rowsBetween(-10, 10)
val newerDF = markDF.withColumn("prev_mark", lag("mark1", 1, null).over(window))
to get a window over a few rows, which did not work either. A streaming window such as
window("timestamp", "10 minutes")
cannot be used with lag. I am super confused about how to do this. Any help would be awesome!
I would advise you to change the time column into a String:
+-----+-----+-----+-----+----------------------------+
|id |mark1|mark2|mark3|time |
+-----+-----+-----+-----+----------------------------+
|uuid1|100 |200 |300 |Tue Aug 8 14:06:02 PDT 2017|
|uuid1|100 |200 |300 |Tue Aug 8 14:06:22 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:32 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:52 PDT 2017|
|uuid2|150 |250 |350 |Tue Aug 8 14:06:58 PDT 2017|
+-----+-----+-----+-----+----------------------------+
root
|-- id: string (nullable = true)
|-- mark1: integer (nullable = false)
|-- mark2: integer (nullable = false)
|-- mark3: integer (nullable = false)
|-- time: string (nullable = true)
After that, doing the following should work (with the imports the snippet needs):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

df.withColumn("prev_mark", lag("mark1", 1).over(Window.partitionBy("id").orderBy("time")))
This will give you the following output:
+-----+-----+-----+-----+----------------------------+---------+
|id |mark1|mark2|mark3|time |prev_mark|
+-----+-----+-----+-----+----------------------------+---------+
|uuid1|100 |200 |300 |Tue Aug 8 14:06:02 PDT 2017|null |
|uuid1|100 |200 |300 |Tue Aug 8 14:06:22 PDT 2017|100 |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:32 PDT 2017|null |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:52 PDT 2017|150 |
|uuid2|150 |250 |350 |Tue Aug 8 14:06:58 PDT 2017|150 |
+-----+-----+-----+-----+----------------------------+---------+
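
As an aside, lag over a Window is still not supported directly on a streaming DataFrame even with a string time column. One common workaround on Spark 2.4+ (a sketch, with streamDF as a hypothetical streaming source) is foreachBatch, which exposes each micro-batch as a static DataFrame where window functions are allowed, with the caveat that lag can only look back within the current batch:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

streamDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // lag is legal here because batchDF is a static DataFrame;
    // it only sees rows that arrived in this micro-batch.
    batchDF
      .withColumn("prev_mark",
        lag("mark1", 1).over(Window.partitionBy("id").orderBy("time")))
      .show(false)
  }
  .start()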

How to find the minimum in a sequence

A = ... // tuples of Medication(patientid, date, medicine)
B = A.groupBy(x => x.patientid)
An example of B would look like the following. Now I need to find the minimum date; how do I do that in Scala?
(
478009505-01,
CompactBuffer(
Some(Medication(478009505-01,Fri Jun 12 10:30:00 EDT 2009,glimepiride)),
Some(Medication(478009505-01,Fri Jun 12 10:30:00 EDT 2009,glimepiride)),
Some(Medication(478009505-01,Fri Jun 12 10:30:00 EDT 2009,glimepiride))
)
)
Making some assumptions on the types:
case class Medication(id: Int, date: String, medicine: String)
val l = List(
  Some(Medication(478009505, "Fri Jun 12 10:30:00 EDT 2010", "glimepiride")),
  Some(Medication(478009505, "Fri Jun 12 10:30:00 EDT 2008", "glimepiride")),
  None,
  Some(Medication(478009505, "Fri Jun 12 10:30:00 EDT 2011", "glimepiride"))
)
You can use a for comprehension to extract all the dates, then get the min with minBy:
import java.text.SimpleDateFormat
import java.util.Date

val format = new SimpleDateFormat("EEE MMM dd HH:mm:ss zzz yyyy")
def createDateTime(s: String) = new Date(format.parse(s).getTime)
val dates = for {
  optMed <- l      // for each item in the list
  med    <- optMed // keep it only if it holds a value
} yield createDateTime(med.date) // build a comparable date
dates.minBy(_.getTime) // get the minimum date
The result is the oldest date (2008-06-12).
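
Applied back to the grouped structure from the question (a sketch, assuming B is a pair RDD of (patientid, Iterable[Option[Medication]]) pairs as produced by groupBy):

// Earliest medication date per patient; flatten drops the None entries.
val earliestPerPatient = B.mapValues { meds =>
  meds.flatten.map(m => createDateTime(m.date)).minBy(_.getTime)
}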