Spark dataframe convert integer to timestamp and find date difference - scala

I have this DataFrame org.apache.spark.sql.DataFrame:
|-- timestamp: integer (nullable = true)
|-- checkIn: string (nullable = true)
+----------+----------+
| timestamp|   checkIn|
+----------+----------+
|1521710892|2018-05-19|
|1521710892|2018-05-19|
+----------+----------+
Desired result: a new column with the difference in days between the checkIn date and the timestamp (2018-03-03 23:59:59 and 2018-03-04 00:00:01 should have a difference of 1).
Thus, I need to:
convert the timestamp to a date (this is where I'm stuck)
subtract one date from the other
use some function to extract the number of days (I have not found this function yet)

You can use from_unixtime to convert your timestamp to date and datediff to calculate the difference in days:
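import org.apache.spark.sql.functions.{datediff, from_unixtime}
import spark.implicits._  // toDF and the $"col" syntax (already in scope in spark-shell)
val df = Seq(
  (1521710892, "2018-05-19"),
  (1521730800, "2018-01-01")
).toDF("timestamp", "checkIn")
// from_unixtime renders epoch seconds as a "yyyy-MM-dd HH:mm:ss" string in the session time zone;
// datediff(end, start) returns end minus start in whole days.
df.withColumn("tsDate", from_unixtime($"timestamp")).
  withColumn("daysDiff", datediff($"tsDate", $"checkIn")).
  show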
// +----------+----------+-------------------+--------+
// | timestamp| checkIn| tsDate|daysDiff|
// +----------+----------+-------------------+--------+
// |1521710892|2018-05-19|2018-03-22 02:28:12| -58|
// |1521730800|2018-01-01|2018-03-22 08:00:00| 80|
// +----------+----------+-------------------+--------+
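To sanity-check the numbers in that output outside Spark, here is a minimal plain-Python sketch (standard library only, not the Spark API) of the same arithmetic. It assumes a fixed UTC-7 offset, inferred from the tsDate column above (1521710892 is 09:28:12 UTC but is shown as 02:28:12); the helper name days_between is hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

# Assumption: the session time zone behind the output above was UTC-7.
SESSION_TZ = timezone(timedelta(hours=-7))

def days_between(epoch_seconds: int, check_in: str) -> int:
    """Calendar-day difference: date of the timestamp minus the checkIn date."""
    ts_date = datetime.fromtimestamp(epoch_seconds, tz=SESSION_TZ).date()
    return (ts_date - date.fromisoformat(check_in)).days

print(days_between(1521710892, "2018-05-19"))  # -58
print(days_between(1521730800, "2018-01-01"))  # 80
```

Because the subtraction happens on calendar dates, two instants a couple of seconds apart on either side of midnight differ by 1, which is the behavior asked for in the question. The result depends on the time zone used to turn the epoch value into a date.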

Related

Spark fails to convert String to TIMESTAMP

I have a Hive table that contains a String column. This is an example:
| DT |
|-------------------------------|
| 2019-05-07 00:03:53.837000000 |
When I try to import the table into a Spark Scala DataFrame and convert the String to a timestamp, I only get null values:
val df = spark.sql(s"""select to_timestamp(dt_maj, 'yyyy-MM-dd HH:mm:ss.SSS') from ${use_database}.pz_send_demande_diffusion""").show()
| DT |
|------|
| null |
Doing
val df = spark.sql(s"""select dt from ${use_database}.pz_send_demande_diffusion""").show()
gives a good result (a column with the String values), so Spark is importing the column normally.
I also tried:
val df = spark.sql(s"""select to_timestamp('2005-05-04 11:12:54.297', 'yyyy-MM-dd HH:mm:ss.SSS') from ${use_database}.pz_send_demande_diffusion""").show()
And it worked! It returns a TIMESTAMP column.
What is the problem?
Trim the extra trailing zeros first, then parse:
df.withColumn("new", to_timestamp($"date".substr(lit(1),length($"date") - 6), "yyyy-MM-dd HH:mm:ss.SSS")).show(false)
The result is:
+-----------------------------+-------------------+
|date |new |
+-----------------------------+-------------------+
|2019-05-07 00:03:53.837000000|2019-05-07 00:03:53|
+-----------------------------+-------------------+
The schema:
root
|-- date: string (nullable = true)
|-- new: timestamp (nullable = true)
I think you should use the format yyyy-MM-dd HH:mm:ss.SSSSSSSSS for data such as 2019-05-07 00:03:53.837000000.
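The trimming idea can be checked outside Spark with a small plain-Python sketch (standard library only, not the Spark API). It assumes every value carries exactly nine fractional digits, as in the sample row; the helper name parse_trimmed is hypothetical.

```python
from datetime import datetime

def parse_trimmed(raw: str) -> datetime:
    """Drop the last 6 of the 9 fractional digits, then parse with millisecond precision."""
    trimmed = raw[:-6]  # "2019-05-07 00:03:53.837000000" -> "2019-05-07 00:03:53.837"
    return datetime.strptime(trimmed, "%Y-%m-%d %H:%M:%S.%f")

ts = parse_trimmed("2019-05-07 00:03:53.837000000")
print(ts.isoformat(sep=" ", timespec="milliseconds"))  # 2019-05-07 00:03:53.837
```

The point is the same as in the Spark answer: the pattern and the number of fractional digits in the input have to agree, either by trimming the input or by widening the pattern.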

Pyspark from_unixtime (unix_timestamp) does not convert to timestamp

I am using PySpark with Python 2.7. I have a date column stored as a string (with milliseconds) and would like to convert it to a timestamp.
This is what I have tried so far:
df = df.withColumn('end_time', from_unixtime(unix_timestamp(df.end_time, '%Y-%M-%d %H:%m:%S.%f')) )
printSchema() shows
end_time: string (nullable = true)
when I expected timestamp as the column type.
Try using from_utc_timestamp:
from pyspark.sql.functions import from_utc_timestamp
df = df.withColumn('end_time', from_utc_timestamp(df.end_time, 'PST'))
You need to specify a time zone for the function; in this case I chose PST.
If this does not work, please give us an example of a few rows showing df.end_time.
Create a sample dataframe with Time-stamp formatted as string:
import pyspark.sql.functions as F
df = spark.createDataFrame([('22-Jul-2018 04:21:18.792 UTC', ),('23-Jul-2018 04:21:25.888 UTC',)], ['TIME'])
df.show(2,False)
df.printSchema()
Output:
+----------------------------+
|TIME |
+----------------------------+
|22-Jul-2018 04:21:18.792 UTC|
|23-Jul-2018 04:21:25.888 UTC|
+----------------------------+
root
|-- TIME: string (nullable = true)
Convert the string (including milliseconds) to a Unix timestamp stored as a double. Because unix_timestamp() drops the milliseconds, they have to be added back separately: extract them from the string with substring (start position -7, length 3), cast that substring to float, divide by 1000, and add the result to the unix_timestamp value.
df1 = df.withColumn("unix_timestamp",F.unix_timestamp(df.TIME,'dd-MMM-yyyy HH:mm:ss.SSS z') + F.substring(df.TIME,-7,3).cast('float')/1000)
Convert the unix_timestamp column (double) to the timestamp data type in Spark:
df2 = df1.withColumn("TimestampType",F.to_timestamp(df1["unix_timestamp"]))
df2.show(n=2,truncate=False)
This gives the following output:
+----------------------------+----------------+-----------------------+
|TIME |unix_timestamp |TimestampType |
+----------------------------+----------------+-----------------------+
|22-Jul-2018 04:21:18.792 UTC|1.532233278792E9|2018-07-22 04:21:18.792|
|23-Jul-2018 04:21:25.888 UTC|1.532319685888E9|2018-07-23 04:21:25.888|
+----------------------------+----------------+-----------------------+
Checking the Schema:
df2.printSchema()
root
|-- TIME: string (nullable = true)
|-- unix_timestamp: double (nullable = true)
|-- TimestampType: timestamp (nullable = true)
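The "whole seconds plus milliseconds" arithmetic used in that answer can be reproduced with a short plain-Python sketch (standard library only, not the Spark API). It assumes the input always ends in a three-digit millisecond field followed by " UTC" and uses English month abbreviations; the helper name to_epoch_with_millis is hypothetical.

```python
import calendar
from datetime import datetime

def to_epoch_with_millis(raw: str) -> float:
    """Whole seconds from the date-time part, plus the 3 millisecond digits added back."""
    whole = datetime.strptime(raw[:20], "%d-%b-%Y %H:%M:%S")  # "22-Jul-2018 04:21:18"
    millis = int(raw[-7:-4])  # "792", the same slice as substring(TIME, -7, 3)
    # Assumption: the value is in UTC, so timegm (not mktime) gives the epoch seconds.
    return calendar.timegm(whole.timetuple()) + millis / 1000

print(to_epoch_with_millis("22-Jul-2018 04:21:18.792 UTC"))  # about 1532233278.792
```

The value agrees with the unix_timestamp column shown above (1.532233278792E9). Note that %b is locale-dependent in Python, so the sketch relies on the default C/English locale.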
In current versions of Spark, timestamp conversion needs very little work.
The to_timestamp function works well in this case. The only thing to take care of is passing a format that matches the original column.
In my case the format was yyyy-MM-dd HH:mm:ss.
Other formats, such as MM/dd/yyyy HH:mm:ss or similar combinations, work the same way.
from pyspark.sql.functions import to_timestamp
df=df.withColumn('date_time',to_timestamp('event_time','yyyy-MM-dd HH:mm:ss'))
df.show()
The following might help:
from pyspark.sql import functions as F
df = df.withColumn("end_time", F.from_unixtime(F.col("end_time"), 'yyyy-MM-dd HH:mm:ss.SS').cast("timestamp"))
[Updated]

Timestamp formats and time zones in Spark (scala API)

******* UPDATE ********
As suggested in the comments I eliminated the irrelevant part of the code:
My requirements:
Unify the number of millisecond digits to 3
Transform string to timestamp and keep the value in UTC
Create dataframe:
val df = Seq("2018-09-02T05:05:03.456Z","2018-09-02T04:08:32.1Z","2018-09-02T05:05:45.65Z").toDF("Timestamp")
Here are the results using the spark shell:
************ END UPDATE *********************************
I am having a nice headache trying to deal with time zones and timestamp formats in Spark using scala.
This is a simplification of my script to explain my problem:
import org.apache.spark.sql.functions._
val jsonRDD = sc.wholeTextFiles("file:///data/home2/phernandez/vpp/Test_Message.json")
val jsonDF = spark.read.json(jsonRDD.map(f => f._2))
This is the resulting schema:
root
|-- MeasuredValues: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- MeasuredValue: double (nullable = true)
| | |-- Status: long (nullable = true)
| | |-- Timestamp: string (nullable = true)
Then I just select the Timestamp field as follows
jsonDF.select(explode($"MeasuredValues").as("Values")).select($"Values.Timestamp").show(5,false)
The first thing I want to fix is the number of millisecond digits in every timestamp, unifying it to three.
I applied date_format as follows:
jsonDF.select(explode($"MeasuredValues").as("Values")).select(date_format($"Values.Timestamp","yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")).show(5,false)
The millisecond format was fixed, but the timestamp was converted from UTC to local time.
To tackle this issue, I applied to_utc_timestamp together with my local time zone:
jsonDF.select(explode($"MeasuredValues").as("Values")).select(to_utc_timestamp(date_format($"Values.Timestamp","yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),"Europe/Berlin").as("Timestamp")).show(5,false)
Even worse: the UTC value is not returned, and the millisecond format is lost.
Any ideas how to deal with this? I would appreciate it 😊
BR. Paul
The cause of the problem is the time format string used for conversion:
yyyy-MM-dd'T'HH:mm:ss.SSS'Z'
As you can see, Z is inside single quotes, which means that it is not interpreted as the zone offset marker but only as a literal character, like the T in the middle.
So, the format string should be changed to
yyyy-MM-dd'T'HH:mm:ss.SSSX
where X is the ISO 8601 time zone pattern letter in Java's standard date-time formatters (a Z in the input denotes a zero offset).
Now, the source data can be converted to UTC timestamps:
val srcDF = Seq(
("2018-04-10T13:30:34.45Z"),
("2018-04-10T13:45:55.4Z"),
("2018-04-10T14:00:00.234Z"),
("2018-04-10T14:15:04.34Z"),
("2018-04-10T14:30:23.45Z")
).toDF("Timestamp")
val convertedDF = srcDF.select(to_utc_timestamp(date_format($"Timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSSX"), "Europe/Berlin").as("converted"))
convertedDF.printSchema()
convertedDF.show(false)
/**
root
|-- converted: timestamp (nullable = true)
+-----------------------+
|converted |
+-----------------------+
|2018-04-10 13:30:34.45 |
|2018-04-10 13:45:55.4 |
|2018-04-10 14:00:00.234|
|2018-04-10 14:15:04.34 |
|2018-04-10 14:30:23.45 |
+-----------------------+
*/
If you need to convert the timestamps back to strings and normalize the values to three millisecond digits, add another date_format call, similar to the one you already applied in the question.
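The two requirements from the question (three millisecond digits, value kept in UTC) can be illustrated with a plain-Python sketch (standard library only, not the Spark API). It assumes Python 3.7 or later, where %z accepts a literal Z, and it reads ".1" as a decimal fraction of a second (100 ms); the helper name normalize_utc is hypothetical.

```python
from datetime import datetime, timezone

def normalize_utc(raw: str) -> str:
    """Parse a timestamp with 1-6 fractional digits and an offset; emit UTC with 3 digits."""
    parsed = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%S.%f%z")  # %z treats "Z" as +00:00
    utc = parsed.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"

for raw in ["2018-09-02T05:05:03.456Z", "2018-09-02T04:08:32.1Z", "2018-09-02T05:05:45.65Z"]:
    print(normalize_utc(raw))
# 2018-09-02T05:05:03.456Z
# 2018-09-02T04:08:32.100Z
# 2018-09-02T05:05:45.650Z
```

As in the answer, the key is that the zone marker is parsed as an offset rather than skipped as a quoted literal; otherwise the value is interpreted in the local time zone.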

Spark Scala -difference between current date and max(day)

I need to calculate the difference between two dates:
Currentdate - max(day_id)
"Currentdate" is in simple date format, yyyyMMdd.
"day_id" is a string whose values look like yyyy-MM-dd.
I have a DataFrame in which I converted the date (string format) to date format (yyyy-MM-dd):
df1 = to_date(from_unixtime(unix_timestamp(day_id, 'yyyy-MM-dd')))
Normally, to find max(day_id), I would do
def daySince(columnName: String): Column = {
  max(col(columnName))
}
I cannot figure out how to compute the difference
Currentdate - max(day_id)
Given an input DataFrame with the following data and schema:
+---+----------+
|id |day_id |
+---+----------+
|id1|2017-11-21|
|id1|2018-01-21|
|id2|2017-12-21|
+---+----------+
root
|-- id: string (nullable = true)
|-- day_id: string (nullable = true)
you can use the built-in current_date() and datediff() functions to meet your requirement:
import org.apache.spark.sql.functions._
df.withColumn("diff", datediff(current_date(), to_date(col("day_id"), "yyyy-MM-dd")))
which should give you
+---+----------+----+
|id |day_id |diff|
+---+----------+----+
|id1|2017-11-21|167 |
|id1|2018-01-21|106 |
|id2|2017-12-21|137 |
+---+----------+----+
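The answer above computes a per-row difference; the question also asks for the difference against max(day_id). A minimal plain-Python sketch (standard library only, not the Spark API) of that calculation follows. The reference date 2018-05-07 is an assumption inferred from the diff column above, and the helper name days_since_max is hypothetical.

```python
from datetime import date

rows = [("id1", "2017-11-21"), ("id1", "2018-01-21"), ("id2", "2017-12-21")]

def days_since_max(rows, today: date) -> int:
    """today minus the latest day_id, in whole days."""
    latest = max(date.fromisoformat(day_id) for _, day_id in rows)
    return (today - latest).days

# Assumption: "current date" is 2018-05-07, which reproduces the diff values shown above.
print(days_since_max(rows, date(2018, 5, 7)))  # 106
```

In Spark terms this corresponds to aggregating with max on the date column first and then applying datediff against current_date() to the aggregated value.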

Converting pattern of date in spark dataframe

I have a String column in a Spark DataFrame (with dates in the yyyy-MM-dd pattern).
I want to display the column values in the MM/dd/yyyy pattern.
My data is
val df = sc.parallelize(Array(
("steak", "1990-01-01", "2000-01-01", 150),
("steak", "2000-01-02", "2001-01-13", 180),
("fish", "1990-01-01", "2001-01-01", 100)
)).toDF("name", "startDate", "endDate", "price")
df.show()
+-----+----------+----------+-----+
| name| startDate| endDate|price|
+-----+----------+----------+-----+
|steak|1990-01-01|2000-01-01| 150|
|steak|2000-01-02|2001-01-13| 180|
| fish|1990-01-01|2001-01-01| 100|
+-----+----------+----------+-----+
root
|-- name: string (nullable = true)
|-- startDate: string (nullable = true)
|-- endDate: string (nullable = true)
|-- price: integer (nullable = false)
I want to show endDate in the MM/dd/yyyy pattern. All I am able to do is convert the column from String to DateType:
val df2 = df.select($"endDate".cast(DateType).alias("endDate"))
df2.show()
+----------+
| endDate|
+----------+
|2000-01-01|
|2001-01-13|
|2001-01-01|
+----------+
df2.printSchema()
root
|-- endDate: date (nullable = true)
The only reference I found is this, which doesn't solve the problem.
You can use the date_format function.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(
("steak", "1990-01-01", "2000-01-01", 150),
("steak", "2000-01-02", "2001-01-13", 180),
("fish", "1990-01-01", "2001-01-01", 100))).toDF("name", "startDate", "endDate", "price")
df.show()
df.select(date_format(col("endDate"), "MM/dd/yyyy")).show
Output :
+-------------------------------+
|date_format(endDate,MM/dd/yyyy)|
+-------------------------------+
| 01/01/2000|
| 01/13/2001|
| 01/01/2001|
+-------------------------------+
Use org.apache.spark.sql.functions.date_format(dateExpr, format):
val df2 = df.select(date_format($"endDate", "MM/dd/yyyy").alias("endDate"))
Here the DataFrame/Dataset has a string column holding date values, and we need to change the date format.
For the query asked, the date format can be changed as below:
val df1 = df.withColumn("startDate1", date_format(to_date(col("startDate"),"yyyy-MM-dd"),"MM/dd/yyyy" ))
In Spark, the default date format is "yyyy-MM-dd", so the to_date call can be omitted and this can be rewritten as
val df1 = df.withColumn("startDate1", date_format(col("startDate"),"MM/dd/yyyy" ))
In the version that uses to_date:
(i) By applying to_date, we change the data type of this column from string to Date. We also tell to_date that the values in this string column are in yyyy-MM-dd format, so it reads the column accordingly.
(ii) Next, we apply date_format to get the format we require, which is MM/dd/yyyy.
When a time component is involved, use to_timestamp instead of to_date.
Note that 'MM' represents the month and 'mm' represents minutes.
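The parse-then-format idea can be tried outside Spark with a plain-Python sketch (standard library only, not the Spark API). Python has the same kind of case-sensitivity pitfall: %m is the month and %M is minutes. The helper name reformat_date is hypothetical.

```python
from datetime import datetime

def reformat_date(value: str) -> str:
    """Parse a yyyy-MM-dd string and render it as MM/dd/yyyy."""
    return datetime.strptime(value, "%Y-%m-%d").strftime("%m/%d/%Y")

for end_date in ["2000-01-01", "2001-01-13", "2001-01-01"]:
    print(reformat_date(end_date))
# 01/01/2000
# 01/13/2001
# 01/01/2001
```

The printed values match the date_format output shown earlier for the endDate column.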