Timestamp formats and time zones in Spark (Scala API)

******* UPDATE ********
As suggested in the comments I eliminated the irrelevant part of the code:
My requirements:
Unify number of milliseconds to 3
Transform string to timestamp and keep the value in UTC
Create dataframe:
val df = Seq("2018-09-02T05:05:03.456Z","2018-09-02T04:08:32.1Z","2018-09-02T05:05:45.65Z").toDF("Timestamp")
Here are the results using the spark shell:
************ END UPDATE *********************************
I am having a nice headache trying to deal with time zones and timestamp formats in Spark using Scala.
This is a simplification of my script to explain my problem:
import org.apache.spark.sql.functions._
val jsonRDD = sc.wholeTextFiles("file:///data/home2/phernandez/vpp/Test_Message.json")
val jsonDF = spark.read.json(jsonRDD.map(f => f._2))
This is the resulting schema:
root
|-- MeasuredValues: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- MeasuredValue: double (nullable = true)
| | |-- Status: long (nullable = true)
| | |-- Timestamp: string (nullable = true)
Then I just select the Timestamp field as follows
jsonDF.select(explode($"MeasuredValues").as("Values")).select($"Values.Timestamp").show(5,false)
The first thing I want to fix is the number of millisecond digits in every timestamp, unifying it to three.
I applied date_format as follows:
jsonDF.select(explode($"MeasuredValues").as("Values")).select(date_format($"Values.Timestamp","yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")).show(5,false)
The milliseconds format is fixed, but the timestamp is converted from UTC to local time.
To tackle this issue, I applied to_utc_timestamp together with my local time zone.
jsonDF.select(explode($"MeasuredValues").as("Values")).select(to_utc_timestamp(date_format($"Values.Timestamp","yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),"Europe/Berlin").as("Timestamp")).show(5,false)
Even worse, the UTC value is not returned, and the milliseconds format is lost.
Any ideas how to deal with this? I would appreciate it 😊
BR. Paul

The cause of the problem is the time format string used for conversion:
yyyy-MM-dd'T'HH:mm:ss.SSS'Z'
As you may see, Z is inside single quotes, which means that it is not interpreted as the zone offset marker, but only as a character like T in the middle.
So, the format string should be changed to
yyyy-MM-dd'T'HH:mm:ss.SSSX
where X is the pattern letter for an ISO 8601 zone offset in Java's date-time formatters (a Z in the input then denotes an offset of zero, i.e. UTC).
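The difference between a quoted literal Z and a parsed offset can be illustrated outside Spark. The following is a minimal sketch in plain Python (standard library only, not the Spark API), used purely as an analogy: a literal `Z` in the format plays the role of `'Z'`, and `%z` plays the role of `X`. The variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

raw = "2018-04-10T13:30:34.45Z"

# Literal "Z": matched as a plain character, so no zone information is kept.
naive = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%fZ")

# "%z": parsed as a UTC offset ("Z" means +00:00; accepted since Python 3.7).
aware = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")

print(naive.tzinfo)       # None
print(aware.utcoffset())  # 0:00:00
```

Only the second value knows it is in UTC, which is why any later conversion has something to work with.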
Now, the source data can be converted to UTC timestamps:
val srcDF = Seq(
("2018-04-10T13:30:34.45Z"),
("2018-04-10T13:45:55.4Z"),
("2018-04-10T14:00:00.234Z"),
("2018-04-10T14:15:04.34Z"),
("2018-04-10T14:30:23.45Z")
).toDF("Timestamp")
val convertedDF = srcDF.select(to_utc_timestamp(date_format($"Timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSSX"), "Europe/Berlin").as("converted"))
convertedDF.printSchema()
convertedDF.show(false)
/**
root
|-- converted: timestamp (nullable = true)
+-----------------------+
|converted |
+-----------------------+
|2018-04-10 13:30:34.45 |
|2018-04-10 13:45:55.4 |
|2018-04-10 14:00:00.234|
|2018-04-10 14:15:04.34 |
|2018-04-10 14:30:23.45 |
+-----------------------+
*/
If you need to convert the timestamps back to strings and normalize the values to three millisecond digits, add another date_format call, similar to the one you already applied in the question.
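As a rough illustration of the two requirements from the question (three millisecond digits, value kept in UTC), here is a plain-Python sketch using only the standard library. It is not Spark code; `normalize` is a hypothetical helper name, and the sample strings are the ones from the question's update.

```python
from datetime import datetime, timezone

def normalize(ts: str) -> str:
    """Parse an ISO-like string with a zone offset and re-emit it in UTC with 3 ms digits."""
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")
    utc = parsed.astimezone(timezone.utc)
    # isoformat renders UTC as "+00:00"; swap it for the "Z" suffix used in the input.
    return utc.isoformat(timespec="milliseconds").replace("+00:00", "Z")

samples = [
    "2018-09-02T05:05:03.456Z",
    "2018-09-02T04:08:32.1Z",
    "2018-09-02T05:05:45.65Z",
]
normalized = [normalize(s) for s in samples]
print(normalized)
```

The short fractions `.1` and `.65` come out padded as `.100` and `.650`, and the hour is unchanged because the value never leaves UTC.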

Related

Pyspark date_trunc without modifying actual value

Consider the below dataframe
df:
time
2022-02-21T11:23:54
I have to convert it to
time
2022-02-21T11:23:00
After using the below code
df.withColumn("time_updated", date_trunc("minute", col("time"))).show(truncate = False)
My output
time
2022-02-21 11:23:00
My desired output is
time
2022-02-21T11:23:00
Is there any way I can keep the data the same and just truncate the seconds?
You simply have a format issue. The output that you see is the string representation of a timestamp. Check your output format:
from pyspark.sql import functions as F
df = df.withColumn(
    "time_updated",
    # yyyy is the calendar year; YYYY would be the week-based year
    F.date_format(F.col("time").cast("timestamp"), "yyyy-MM-dd'T'HH:mm:00"),
)
df.show(truncate=False)
+-------------------+-------------------+
|time |time_updated |
+-------------------+-------------------+
|2022-02-21T11:23:54|2022-02-21T11:23:00|
+-------------------+-------------------+
df.printSchema()
root
|-- time: string (nullable = true)
|-- time_updated: string (nullable = true)
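The same distinction between a truncated value and its printed form can be seen in plain Python. This is only an analogy using the standard library, not PySpark: the truncation changes the value, while the `T` separator is purely a matter of how the value is formatted as a string.

```python
from datetime import datetime

raw = "2022-02-21T11:23:54"

# Truncate to the minute by zeroing the seconds.
truncated = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S").replace(second=0)

default_repr = str(truncated)                           # default string form, space separator
with_t = truncated.strftime("%Y-%m-%dT%H:%M:%S")        # explicit format, "T" separator

print(default_repr)  # 2022-02-21 11:23:00
print(with_t)        # 2022-02-21T11:23:00
```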

Spark Scala built-in function to_timestamp() ignores the millisecond part of the timestamp value

Sample Code:
val sparkSession = SparkUtil.getSparkSession("timestamp_format_test")
import sparkSession.implicits._
val format = "yyyy/MM/dd HH:mm:ss.SSS"
val time = "2018/12/21 08:07:36.927"
val df = sparkSession.sparkContext.parallelize(Seq(time)).toDF("in_timestamp")
val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp"), format))
Output:
df2.show(false)
Please note: out_timestamp loses the millisecond part of the original value
+-----------------------+-------------------+
|in_timestamp |out_timestamp |
+-----------------------+-------------------+
|2018/12/21 08:07:36.927|2018-12-21 08:07:36|
+-----------------------+-------------------+
df2.printSchema()
root
|-- in_timestamp: string (nullable = true)
|-- out_timestamp: timestamp (nullable = true)
In the above result, in_timestamp is of string type and I would like to convert it to the timestamp data type. The conversion works, but the millisecond part gets lost. Any ideas? Thanks!
Sample code for preserving milliseconds during conversion from string to timestamp:
val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp")))
df2.show(false)
+-----------------------+-----------------------+
|in_timestamp |out_timestamp |
+-----------------------+-----------------------+
|2018-12-21 08:07:36.927|2018-12-21 08:07:36.927|
+-----------------------+-----------------------+
scala> df2.printSchema
root
|-- in_timestamp: string (nullable = true)
|-- out_timestamp: timestamp (nullable = true)
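You just need to remove the format parameter from to_timestamp. The result then has the timestamp data type and keeps the same value as the string, including the milliseconds. Note that the input in this example uses the default yyyy-MM-dd HH:mm:ss.SSS layout.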

Pyspark from_unixtime (unix_timestamp) does not convert to timestamp

I am using Pyspark with Python 2.7. I have a date column in string (with ms) and would like to convert to timestamp
This is what I have tried so far
df = df.withColumn('end_time', from_unixtime(unix_timestamp(df.end_time, '%Y-%M-%d %H:%m:%S.%f')) )
printSchema() shows
end_time: string (nullable = true)
whereas I expected timestamp as the column type.
Try using from_utc_timestamp:
from pyspark.sql.functions import from_utc_timestamp
df = df.withColumn('end_time', from_utc_timestamp(df.end_time, 'PST'))
You need to specify a time zone for the function; in this case I chose PST.
If this does not work, please give us an example of a few rows showing df.end_time.
Create a sample dataframe with Time-stamp formatted as string:
import pyspark.sql.functions as F
df = spark.createDataFrame([('22-Jul-2018 04:21:18.792 UTC', ),('23-Jul-2018 04:21:25.888 UTC',)], ['TIME'])
df.show(2,False)
df.printSchema()
Output:
+----------------------------+
|TIME |
+----------------------------+
|22-Jul-2018 04:21:18.792 UTC|
|23-Jul-2018 04:21:25.888 UTC|
+----------------------------+
root
|-- TIME: string (nullable = true)
Convert the string (including milliseconds) to a Unix timestamp as a double. Since unix_timestamp() drops milliseconds, add them back with a simple hack: extract them from the string with substring (start position -7, length 3), cast the substring to float, divide by 1000, and add the result to the value returned by unix_timestamp.
df1 = df.withColumn("unix_timestamp",F.unix_timestamp(df.TIME,'dd-MMM-yyyy HH:mm:ss.SSS z') + F.substring(df.TIME,-7,3).cast('float')/1000)
Converting unix_timestamp(double) to timestamp datatype in Spark.
df2 = df1.withColumn("TimestampType",F.to_timestamp(df1["unix_timestamp"]))
df2.show(n=2,truncate=False)
This will give you following output
+----------------------------+----------------+-----------------------+
|TIME |unix_timestamp |TimestampType |
+----------------------------+----------------+-----------------------+
|22-Jul-2018 04:21:18.792 UTC|1.532233278792E9|2018-07-22 04:21:18.792|
|23-Jul-2018 04:21:25.888 UTC|1.532319685888E9|2018-07-23 04:21:25.888|
+----------------------------+----------------+-----------------------+
Checking the Schema:
df2.printSchema()
root
|-- TIME: string (nullable = true)
|-- unix_timestamp: double (nullable = true)
|-- TimestampType: timestamp (nullable = true)
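The millisecond hack in this answer can be checked by hand with a small plain-Python sketch (standard library only, not PySpark). The slice `raw[-7:-4]` is meant to mirror `substring(TIME, -7, 3)`, and the variable names are illustrative.

```python
from datetime import datetime, timezone

raw = "22-Jul-2018 04:21:18.792 UTC"

# Whole seconds: parse everything before the fractional part (first 20 characters).
whole = datetime.strptime(raw[:20], "%d-%b-%Y %H:%M:%S").replace(tzinfo=timezone.utc)
seconds = int(whole.timestamp())

# Milliseconds: three characters starting 7 from the end of the string.
millis = int(raw[-7:-4])

# Add the fractional part back, as the answer does with cast('float') / 1000.
unix_ts = seconds + millis / 1000
restored = datetime.fromtimestamp(unix_ts, tz=timezone.utc)

print(seconds, millis)
print(restored.isoformat(sep=" ", timespec="milliseconds"))
```

The whole-second part matches the `1.532233278792E9` value shown in the output table above.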
In current versions of Spark, timestamp conversion does not take much effort.
The to_timestamp function works well in this case. The only thing to take care of is that the format passed in matches the format of the original column.
In my case it was yyyy-MM-dd HH:mm:ss.
Other formats can be MM/dd/yyyy HH:mm:ss or a similar combination.
from pyspark.sql.functions import to_timestamp
df=df.withColumn('date_time',to_timestamp('event_time','yyyy-MM-dd HH:mm:ss'))
df.show()
The following might help:
from pyspark.sql import functions as F
df = df.withColumn("end_time", F.from_unixtime(F.col("end_time"), 'yyyy-MM-dd HH:mm:ss.SS').cast("timestamp"))

Spark dataframe convert integer to timestamp and find date difference

I have this DataFrame org.apache.spark.sql.DataFrame:
|-- timestamp: integer (nullable = true)
|-- checkIn: string (nullable = true)
| timestamp| checkIn|
+----------+----------+
|1521710892|2018-05-19|
|1521710892|2018-05-19|
Desired result: a new column with the difference in days between the checkIn date and the timestamp (2018-03-03 23:59:59 and 2018-03-04 00:00:01 should have a difference of 1).
Thus, I need to:
convert the timestamp to a date (this is where I'm stuck)
subtract one date from the other
use some function to extract the number of days (I have not found this function yet)
You can use from_unixtime to convert your timestamp to date and datediff to calculate the difference in days:
val df = Seq(
(1521710892, "2018-05-19"),
(1521730800, "2018-01-01")
).toDF("timestamp", "checkIn")
df.withColumn("tsDate", from_unixtime($"timestamp")).
withColumn("daysDiff", datediff($"tsDate", $"checkIn")).
show
// +----------+----------+-------------------+--------+
// | timestamp| checkIn| tsDate|daysDiff|
// +----------+----------+-------------------+--------+
// |1521710892|2018-05-19|2018-03-22 02:28:12| -58|
// |1521730800|2018-01-01|2018-03-22 08:00:00| 80|
// +----------+----------+-------------------+--------+
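The arithmetic behind the `daysDiff` column can be reproduced with a plain-Python sketch (standard library only, not Spark). It assumes the epoch seconds are interpreted in UTC, whereas the output above was rendered in the session's local time zone; the calendar date happens to be the same for these two rows. `days_diff` is a hypothetical helper name.

```python
from datetime import date, datetime, timezone

def days_diff(epoch_seconds: int, check_in: str) -> int:
    """Days from the check-in date to the date of the epoch timestamp (UTC)."""
    ts_date = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date()
    return (ts_date - date.fromisoformat(check_in)).days

first = days_diff(1521710892, "2018-05-19")
second = days_diff(1521730800, "2018-01-01")
print(first, second)
```

Because both sides are reduced to calendar dates before subtracting, two instants a couple of seconds apart across midnight still differ by one day, as the question requires.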

How to merge two fields (string type) of a DataFrame's column to generate a Date

I have a DataFrame whose simplified schema has two columns with three fields each:
root
|-- npaDetails: struct (nullable = true)
| |-- additionalInformation: struct (nullable = true)
| |-- npaStatus: struct (nullable = true)
| |-- npaDetails: struct (nullable = true)
|-- npaHeaderData: struct (nullable = true)
| |-- npaNumber: string (nullable = true)
| |-- npaDownloadDate: string (nullable = true)
| |-- npaDownloadTime: string (nullable = true)
Possible values:
npaDownloadDate - "30JAN17"
npaDownloadTime - "19.50.00"
I need to compare two rows in a DataFrame with this schema, to determine which one is "fresher". To do so I need to merge the fields npaDownloadDate and npaDownloadTime to generate a Date that I can compare easily.
Below is the code I have written so far. It works, but I think it takes more steps than necessary and I'm sure that Scala offers better solutions than my approach.
import java.text.SimpleDateFormat
val parquetFileDF = sqlContext.read.parquet("MyParquet.parquet")
val relevantRows = parquetFileDF.filter($"npaHeaderData.npaNumber" === "123456")
// Read both fields as strings so they can be concatenated
val date = relevantRows.select($"npaHeaderData.npaDownloadDate").head().getString(0)
val time = relevantRows.select($"npaHeaderData.npaDownloadTime").head().getString(0)
val dateTime = new SimpleDateFormat("ddMMMyykk.mm.ss").parse(date + time)
//I would replicate the previous steps to get dateTime2
if(dateTime.before(dateTime2))
println("dateTime is before dateTime2")
So the output of "30JAN17" and "19.50.00" would be Mon Jan 30 19:50:00 GST 2017
Is there another way to generate a Date from two fields of a column, without extracting and merging them as strings? Or even better, is it possible to directly compare both values (date and time) between two different rows in a DataFrame to know which one has the older date?
In spark 2.2,
df.filter(
  to_date(
    concat(
      $"npaHeaderData.npaDownloadDate",
      $"npaHeaderData.npaDownloadTime"),
    fmt = "[your format here]")
  < lit(some date))
I'd use
import org.apache.spark.sql.functions._
df.withColumn("some_name", date_format(unix_timestamp(
concat($"npaHeaderData.npaDownloadDate", $"npaHeaderData.npaDownloadTime"),
"ddMMMyykk.mm.ss").cast("timestamp"),
"EEE MMM d HH:mm:ss z yyyy"))
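The concatenate-then-parse idea used in both answers can be sketched in plain Python (standard library only, not Spark). The format string below is a Python equivalent chosen for this illustration, `merge` is a hypothetical helper, and the second row's values are made up for the comparison.

```python
from datetime import datetime

def merge(date_part: str, time_part: str) -> datetime:
    """Concatenate the two string fields and parse them as one date-time."""
    return datetime.strptime(date_part + time_part, "%d%b%y%H.%M.%S")

first = merge("30JAN17", "19.50.00")   # values from the question
second = merge("31JAN17", "08.15.30")  # hypothetical second row

print(first)           # 2017-01-30 19:50:00
print(first < second)  # True
```

Once both rows are parsed into proper date-time values, deciding which one is "fresher" is an ordinary comparison.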