Scala Spark: Convert Double Column to Date-Time Column in a DataFrame

I am trying to write code to convert the date-time columns date and last_updated_date, which are actually Unix times in milliseconds cast as doubles, into "MM-dd-yyyy" format for display. How do I do this?
import org.joda.time._
import scala.tools._
import org.joda.time.format.DateTimeFormat._
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.{unix_timestamp, to_date}
root
|-- date: double (nullable = false)
|-- last_updated_date: double (nullable = false)
|-- Percent_Used: double (nullable = false)
+------------+-----------------+------------+
|        date|last_updated_date|Percent_Used|
+------------+-----------------+------------+
| 1.453923E12|    1.47080394E12| 1.948327124|
|1.4539233E12|    1.47080394E12| 2.019636442|
|1.4539236E12|    1.47080394E12| 1.995299371|
+------------+-----------------+------------+

Cast to timestamp. Note that the values are epoch milliseconds, while cast("timestamp") interprets a double as seconds, so divide by 1000 first:
df.select((col("date") / 1000).cast("timestamp"))

Or convert it using from_unixtime, which also expects seconds:
df.select(from_unixtime(col("date") / 1000).as("date"))
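Putting it together for the asker's "MM-dd-yyyy" display requirement, a minimal sketch (assuming, as the sample data suggests, that the doubles are epoch milliseconds; date_format returns a string column, which is fine for display):
import org.apache.spark.sql.functions._
val display = df
  .withColumn("date", date_format((col("date") / 1000).cast("timestamp"), "MM-dd-yyyy"))
  .withColumn("last_updated_date", date_format((col("last_updated_date") / 1000).cast("timestamp"), "MM-dd-yyyy"))
display.show()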

Fetching datetime from float in Python
This answer worked for me; give it a try. It is really just a seconds calculation:
import datetime
# Excel serial dates count days since 1899-12-30; 25569 is the number of
# days from that origin to the Unix epoch (1970-01-01)
serial = 43822.59722222222
seconds = (serial - 25569) * 86400.0  # 86400 seconds per day
print(datetime.datetime.utcfromtimestamp(seconds))
Convert excel timestamp double value into datetime or timestamp
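The same arithmetic carries over to Spark; a hedged Scala sketch, assuming a DoubleType column named serial holding Excel serial dates (serial and withTs are illustrative names, not from the original):
import org.apache.spark.sql.functions._
// days since 1899-12-30 -> seconds since the Unix epoch -> timestamp
val withTs = df.withColumn("ts", ((col("serial") - lit(25569)) * 86400).cast("timestamp"))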


How to cast the string column to date column and maintain the same format in spark data frame?
I want to cast a string column to date by specifying the format, but after the cast the date always comes out in the default format, which is yyyy-MM-dd.
I want a Date-typed column that keeps the format found in the string value (the data type must be Date, not String).
For example:
val spark = SparkSession.builder().master("local").appName("appName").getOrCreate()
import spark.implicits._
// here the format is MMddyyyy (for Col2, which is of String type)
val df = List(("1","01132019"),("2","01142019")).toDF("Col1","Col2")
import org.apache.spark.sql.functions._
// here I need Col3 as Date type with the format MMddyyyy, but it converts to yyyy-MM-dd
val df1 = df.withColumn("Col3", to_date($"Col2","MMddyyyy"))
// I tried this, but it gives me Col3 as String while I need Date
val df1 = df.withColumn("Col3", date_format(to_date($"Col2","MMddyyyy"),"MMddyyyy"))
That's not possible: Spark's date type has no attached display format and always renders as yyyy-MM-dd.
If you need this MMddyyyy format for a date field, store it as String (casting such a string straight to date yields null); while processing, change the format with to_date and cast to date type.
Ex:
df.withColumn("Col3", $"col2".cast("date")) // casting col2 straight to date yields null
  .withColumn("col4", to_date($"col2", "MMddyyyy").cast("date")) // changing the format, then casting to date
  .show(false)
Result:
+----+--------+----+----------+
|Col1|    Col2|Col3|      col4|
+----+--------+----+----------+
|   1|01132019|null|2019-01-13|
|   2|01142019|null|2019-01-14|
+----+--------+----+----------+
Schema:
df.withColumn("Col3", $"col2".cast("date"))
  .withColumn("col4", to_date($"col2", "MMddyyyy").cast("date"))
  .printSchema
Result:
root
|-- Col1: string (nullable = true)
|-- Col2: string (nullable = true)
|-- Col3: date (nullable = true)
|-- col4: date (nullable = true)
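If the MMddyyyy rendering is only needed at output time, a hedged variation on the code above: keep col4 as a proper date and apply date_format only when displaying or writing (col4_display is an illustrative name):
val out = df.withColumn("col4", to_date($"Col2", "MMddyyyy"))
  .withColumn("col4_display", date_format($"col4", "MMddyyyy")) // a string, for display only
This keeps a real date column available for date arithmetic while still producing the desired text form.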

Is there a way to use pyspark.sql.functions.date_add with a col('column_name') as the second parameter instead of a static integer?

During an ETL process I have this one SAS date field that is in a 5-digit integer format, indicating days since 01-01-1960. In order to make this column more useful in analysis, I would like to convert it to a date data type field in Redshift.
Currently I am trying to do this in pyspark as follows:
I created a new column "sas_date" with the string literal "1960-01-01".
Using pyspark.sql.functions.date_add I pass the "sas_date" column as the start-date parameter and the integer 'arrival_date' column as the second parameter.
When the date_add function runs I get the error "Column not iterable", even though I would think the arrival_date column, being a series, would be iterable. But it's not. Why?
When I remove the 'arrival_date' column and replace it with a static integer value (say 1), date_add works.
i94 = i94.withColumn('arrival_date', col('arrival_date').cast(Int()))  # Int() is presumably an alias for pyspark.sql.types.IntegerType
i94 = i94.withColumn('sas_date', lit("1960-01-01"))
i94 = i94.withColumn('arrival_date', date_add(col('sas_date'), i94['arrival_date']))
I want to be able to pass my column so that the second date_add parameter is dynamic. However, it seems date_add does not accept this. If date_add cannot accomplish this, what options do I have other than a UDF?
UPDATE:
State of data right before the date_add() operation
i94.printSchema()
root
|-- cic_id: double (nullable = true)
|-- visa_id: string (nullable = true)
|-- port_id: string (nullable = true)
|-- airline_id: string (nullable = true)
|-- cit_id: double (nullable = true)
|-- res_id: double (nullable = true)
|-- year: double (nullable = true)
|-- month: double (nullable = true)
|-- age: double (nullable = true)
|-- gender: string (nullable = true)
|-- arrival_date: integer (nullable = true)
|-- depart_date: double (nullable = true)
|-- date_begin: string (nullable = true)
|-- date_end: string (nullable = true)
|-- sas_date: string (nullable = false)
i94.limit(10).toPandas()
I think you are absolutely right: until Spark 3.0.0, date_add is designed to take only an int for the days argument.
In the Spark Scala implementation I see the lines below.
They show that whatever int value we pass to date_add is itself wrapped into a column with lit:
Spark <3.0.0:
def date_add(start: Column, days: Int): Column = date_add(start, lit(days))
Spark >=3.0.0:
def date_add(start: Column, days: Column): Column = withExpr { DateAdd(start.expr, days.expr) }
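So on Spark 3.0.0 and newer the column can be passed directly; a hedged one-liner against the Scala signature above (the PySpark call is analogous; i94DF is an illustrative name mirroring the asker's i94):
import org.apache.spark.sql.functions._
// Spark >= 3.0.0 only: days may be a Column
val result = i94DF.withColumn("arrival_date", date_add(col("sas_date"), col("arrival_date")))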
Now let's talk about solutions; I can think of two approaches:
Imports and preparing a small sample of your dataset:
import pyspark.sql.functions as f
import pyspark.sql.types as t
from datetime import datetime
from datetime import timedelta
l1 = [(5748517.0,'1960-01-01', 20574), (5748517.0,'1960-01-01', 20574), (5748517.0,'1960-01-01', 20574)]
df = spark.createDataFrame(l1).toDF('cic_id','sas_date','arrival_date')
df.show()
+---------+----------+------------+
|   cic_id|  sas_date|arrival_date|
+---------+----------+------------+
|5748517.0|1960-01-01|       20574|
|5748517.0|1960-01-01|       20574|
|5748517.0|1960-01-01|       20574|
+---------+----------+------------+
Now, there are two ways to achieve the functionality.
UDF way:
def date_add_(date, days):
    # Convert to a datetime object if needed; the format and other
    # edge cases should be handled more carefully in real code
    if type(date) is not datetime:
        date = datetime.strptime(str(date), "%Y-%m-%d")
    return date + timedelta(days)
date_add_udf = f.udf(date_add_, t.DateType())
df.withColumn('actual_arrival_date', date_add_udf(f.to_date('sas_date'), 'arrival_date')).show()
+---------+----------+------------+-------------------+
|   cic_id|  sas_date|arrival_date|actual_arrival_date|
+---------+----------+------------+-------------------+
|5748517.0|1960-01-01|       20574|         2016-04-30|
|5748517.0|1960-01-01|       20574|         2016-04-30|
|5748517.0|1960-01-01|       20574|         2016-04-30|
+---------+----------+------------+-------------------+
Using expr evaluation:
df.withColumn('new_arrival_date', f.expr("date_add(sas_date, arrival_date)")).show()
+---------+----------+------------+----------------+
|   cic_id|  sas_date|arrival_date|new_arrival_date|
+---------+----------+------------+----------------+
|5748517.0|1960-01-01|       20574|      2016-04-30|
|5748517.0|1960-01-01|       20574|      2016-04-30|
|5748517.0|1960-01-01|       20574|      2016-04-30|
+---------+----------+------------+----------------+
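A note on why the expr route works even before 3.0.0: the restriction lived only in the DataFrame API signature, while the underlying SQL function date_add accepts expressions for both arguments, so going through f.expr sidesteps the Python wrapper entirely.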

Spark Scala built-in function to_timestamp() ignores the millisecond part of the timestamp value

Sample Code:
val sparkSession = SparkUtil.getSparkSession("timestamp_format_test")
import sparkSession.implicits._
val format = "yyyy/MM/dd HH:mm:ss.SSS"
val time = "2018/12/21 08:07:36.927"
val df = sparkSession.sparkContext.parallelize(Seq(time)).toDF("in_timestamp")
val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp"), format))
Output:
df2.show(false)
Please notice: out_timestamp loses the millisecond part of the original value.
+-----------------------+-------------------+
|in_timestamp |out_timestamp |
+-----------------------+-------------------+
|2018/12/21 08:07:36.927|2018-12-21 08:07:36|
+-----------------------+-------------------+
df2.printSchema()
root
|-- in_timestamp: string (nullable = true)
|-- out_timestamp: timestamp (nullable = true)
In the result above, in_timestamp is of string type and I would like to convert it to the timestamp data type. It does get converted, but the millisecond part gets lost. Any ideas? Thanks!
Sample code for preserving the milliseconds during the conversion from string to timestamp:
val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp")))
df2.show(false)
+-----------------------+-----------------------+
|in_timestamp |out_timestamp |
+-----------------------+-----------------------+
|2018-12-21 08:07:36.927|2018-12-21 08:07:36.927|
+-----------------------+-----------------------+
scala> df2.printSchema
root
|-- in_timestamp: string (nullable = true)
|-- out_timestamp: timestamp (nullable = true)
You just need to remove the format parameter from to_timestamp; the default parser keeps the milliseconds, and the result is still of timestamp type. Note that this works because the input string here is already in Spark's default yyyy-MM-dd HH:mm:ss[.SSS] layout, which is why the sample above uses dashes rather than the slashes of the original input.
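If the strings really do arrive with slashes, one hedged workaround (not from the original answer) is to normalize the separators first so the default parser, which keeps milliseconds, can handle them:
// sketch: rewrite yyyy/MM/dd into yyyy-MM-dd, then parse with the default layout
val df3 = df.withColumn("out_timestamp", to_timestamp(regexp_replace(df.col("in_timestamp"), "/", "-")))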

Pyspark from_unixtime (unix_timestamp) does not convert to timestamp

I am using PySpark with Python 2.7. I have a date column as a string (with milliseconds) and would like to convert it to a timestamp.
This is what I have tried so far
df = df.withColumn('end_time', from_unixtime(unix_timestamp(df.end_time, '%Y-%M-%d %H:%m:%S.%f')) )
printSchema() shows
end_time: string (nullable = true)
whereas I expected timestamp as the type of the column.
Try using from_utc_timestamp:
from pyspark.sql.functions import from_utc_timestamp
df = df.withColumn('end_time', from_utc_timestamp(df.end_time, 'PST'))
You need to specify a time zone for the function; in this case I chose PST.
If this does not work, please give us an example with a few rows showing df.end_time.
Create a sample dataframe with the timestamp formatted as a string:
import pyspark.sql.functions as F
df = spark.createDataFrame([('22-Jul-2018 04:21:18.792 UTC', ),('23-Jul-2018 04:21:25.888 UTC',)], ['TIME'])
df.show(2,False)
df.printSchema()
Output:
+----------------------------+
|TIME |
+----------------------------+
|22-Jul-2018 04:21:18.792 UTC|
|23-Jul-2018 04:21:25.888 UTC|
+----------------------------+
root
|-- TIME: string (nullable = true)
Convert the string time format (including milliseconds) to unix_timestamp (a double). Since unix_timestamp() drops the milliseconds, we add them back with a simple hack: extract the millisecond part from the string with substring (start_position = -7, length_of_substring = 3), cast that substring to float, divide by 1000, and add it to the unix_timestamp value.
df1 = df.withColumn("unix_timestamp",F.unix_timestamp(df.TIME,'dd-MMM-yyyy HH:mm:ss.SSS z') + F.substring(df.TIME,-7,3).cast('float')/1000)
Convert the unix_timestamp (double) to the timestamp data type in Spark:
df2 = df1.withColumn("TimestampType",F.to_timestamp(df1["unix_timestamp"]))
df2.show(n=2,truncate=False)
This gives the following output:
+----------------------------+----------------+-----------------------+
|TIME |unix_timestamp |TimestampType |
+----------------------------+----------------+-----------------------+
|22-Jul-2018 04:21:18.792 UTC|1.532233278792E9|2018-07-22 04:21:18.792|
|23-Jul-2018 04:21:25.888 UTC|1.532319685888E9|2018-07-23 04:21:25.888|
+----------------------------+----------------+-----------------------+
Checking the Schema:
df2.printSchema()
root
|-- TIME: string (nullable = true)
|-- unix_timestamp: double (nullable = true)
|-- TimestampType: timestamp (nullable = true)
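For completeness, a hedged Scala rendering of the same milliseconds hack, under the same assumptions about the TIME layout:
import org.apache.spark.sql.functions._
val df1 = df.withColumn("unix_timestamp",
  unix_timestamp($"TIME", "dd-MMM-yyyy HH:mm:ss.SSS z") +
  substring($"TIME", -7, 3).cast("float") / 1000)
val df2 = df1.withColumn("TimestampType", to_timestamp($"unix_timestamp"))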
In current versions of Spark we do not have to do much for timestamp conversion: the to_timestamp function works well here. The only thing to take care of is passing the timestamp format matching the original column.
In my case it was yyyy-MM-dd HH:mm:ss; other formats such as MM/dd/yyyy HH:mm:ss, or combinations thereof, work the same way.
from pyspark.sql.functions import to_timestamp
df=df.withColumn('date_time',to_timestamp('event_time','yyyy-MM-dd HH:mm:ss'))
df.show()
The following might help:
from pyspark.sql import functions as F
df = df.withColumn("end_time", F.from_unixtime(F.col("end_time"), 'yyyy-MM-dd HH:mm:ss.SS').cast("timestamp"))

Timestamp formats and time zones in Spark (scala API)

******* UPDATE ********
As suggested in the comments, I eliminated the irrelevant part of the code.
My requirements:
Unify number of milliseconds to 3
Transform string to timestamp and keep the value in UTC
Create dataframe:
val df = Seq("2018-09-02T05:05:03.456Z","2018-09-02T04:08:32.1Z","2018-09-02T05:05:45.65Z").toDF("Timestamp")
Here are the results using the spark shell (shown as a screenshot in the original post):
************ END UPDATE *********************************
I am having quite a headache trying to deal with time zones and timestamp formats in Spark using Scala.
This is a simplification of my script to explain my problem:
import org.apache.spark.sql.functions._
val jsonRDD = sc.wholeTextFiles("file:///data/home2/phernandez/vpp/Test_Message.json")
val jsonDF = spark.read.json(jsonRDD.map(f => f._2))
This is the resulting schema:
root
|-- MeasuredValues: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- MeasuredValue: double (nullable = true)
| | |-- Status: long (nullable = true)
| | |-- Timestamp: string (nullable = true)
Then I just select the Timestamp field as follows
jsonDF.select(explode($"MeasuredValues").as("Values")).select($"Values.Timestamp").show(5,false)
The first thing I want to fix is the number of millisecond digits of every timestamp, unifying it to three.
I applied date_format as follows:
jsonDF.select(explode($"MeasuredValues").as("Values")).select(date_format($"Values.Timestamp","yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")).show(5,false)
The milliseconds format was fixed, but the timestamp was converted from UTC to local time.
To tackle this issue, I applied to_utc_timestamp together with my local time zone.
jsonDF.select(explode($"MeasuredValues").as("Values")).select(to_utc_timestamp(date_format($"Values.Timestamp","yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),"Europe/Berlin").as("Timestamp")).show(5,false)
Even worse, the UTC value is not returned, and the milliseconds format is lost.
Any ideas how to deal with this? I would appreciate it 😊
BR. Paul
The cause of the problem is the time format string used for conversion:
yyyy-MM-dd'T'HH:mm:ss.SSS'Z'
As you may see, Z is inside single quotes, which means it is not interpreted as the zone-offset marker but only as a literal character, like the T in the middle.
So, the format string should be changed to
yyyy-MM-dd'T'HH:mm:ss.SSSX
where X is the Java date-time formatter pattern letter for an ISO-8601 zone offset (Z being printed when the offset is zero).
Now, the source data can be converted to UTC timestamps:
val srcDF = Seq(
  ("2018-04-10T13:30:34.45Z"),
  ("2018-04-10T13:45:55.4Z"),
  ("2018-04-10T14:00:00.234Z"),
  ("2018-04-10T14:15:04.34Z"),
  ("2018-04-10T14:30:23.45Z")
).toDF("Timestamp")
val convertedDF = srcDF.select(to_utc_timestamp(date_format($"Timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSSX"), "Europe/Berlin").as("converted"))
convertedDF.printSchema()
convertedDF.show(false)
/**
root
|-- converted: timestamp (nullable = true)
+-----------------------+
|converted |
+-----------------------+
|2018-04-10 13:30:34.45 |
|2018-04-10 13:45:55.4 |
|2018-04-10 14:00:00.234|
|2018-04-10 14:15:04.34 |
|2018-04-10 14:30:23.45 |
+-----------------------+
*/
If you need to convert the timestamps back to strings and normalize the values to three fractional digits, add another date_format call, similar to the one already applied in the question.
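Following that suggestion, a hedged sketch of the final normalization step (SSS pads the fractional part to three digits when formatting; normalizedDF is an illustrative name, and the output assumes the session time zone matches the UTC values shown above):
val normalizedDF = convertedDF.select(
  date_format($"converted", "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'").as("normalized"))
normalizedDF.show(false)
// e.g. 2018-04-10 13:45:55.4 becomes 2018-04-10T13:45:55.400Z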