How to load a date with a custom format in Spark - Scala

I have a scenario where a column contains data like "Tuesday, 09-Aug-11 21:13:26 GMT" and I want to create a schema in Spark, but the TimestampType and DateType datatypes are not able to recognize this date format.
After loading the data into a dataframe using TimestampType or DateType, I am seeing NULL values in that particular column.
Is there any alternative for this?

One option is to read "Tuesday, 09-Aug-11 21:13:26 GMT" as a string-typed column and then transform it from string to timestamp, something like below.
df.show(truncate=false)
+-------------------------------+
|dt |
+-------------------------------+
|Tuesday, 09-Aug-11 21:13:26 GMT|
+-------------------------------+
df.withColumn("dt",to_timestamp(col("dt"),"E, d-MMM-y H:m:s z")).show(truncate=false) //Note - It is converted GMT to IST local timezone.
+-------------------+
|dt |
+-------------------+
|2011-08-10 02:43:26|
+-------------------+
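If you would rather keep the value in GMT instead of having it shifted to the session's local time zone, one option is to set the session time zone before parsing. A minimal sketch, assuming the column was read as a plain string named dt:
import org.apache.spark.sql.functions.{col, to_timestamp}

// Keep parsed timestamps in GMT rather than the local (IST) session zone.
spark.conf.set("spark.sql.session.timeZone", "GMT")

val parsed = df.withColumn("dt", to_timestamp(col("dt"), "E, d-MMM-y H:m:s z"))
parsed.show(false)
// With a GMT session, the row above should display as 2011-08-09 21:13:26.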

Related

pyspark: change string to timestamp

I have a column in string format, and some rows are null.
I append a fixed time of day to put the values in the following form, so I can convert them into timestamps.
Before:
date
null
22-04-2020

After:
date
01-01-1990 23:59:59.000
22-04-2020 23:59:59.000
df = df.withColumn('date', F.concat (df.date, F.lit(" 23:59:59.000")))
df = df.withColumn('date', F.when(F.col('date').isNull(), '01-01-1990 23:59:59.000').otherwise(F.col('date')))
df.withColumn("date", F.to_timestamp(F.col("date"),"MM-dd-yyyy HH mm ss SSS")).show(2)
But after this the date column becomes null.
Can anyone help me solve this, or convert the string to a timestamp directly?
Your timestamp format should start with dd-MM, not MM-dd, and you're also missing some colons and dots in the time part. Try the code below:
df.withColumn("date", F.to_timestamp(F.col("date"),"dd-MM-yyyy HH:mm:ss.SSS")).show()
+-------------------+
| date|
+-------------------+
|1990-01-01 23:59:59|
|2020-04-22 23:59:59|
+-------------------+
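For completeness, here is a minimal sketch of the whole pipeline from the question (filling nulls with the chosen default, appending the fixed time of day, then parsing), written in Scala to match the rest of the page:
import org.apache.spark.sql.functions._
import spark.implicits._

val raw = Seq(Some("22-04-2020"), None).toDF("date")

val parsed = raw
  .withColumn("date", coalesce(col("date"), lit("01-01-1990")))    // default date for null rows
  .withColumn("date", concat(col("date"), lit(" 23:59:59.000")))   // fixed time of day
  .withColumn("date", to_timestamp(col("date"), "dd-MM-yyyy HH:mm:ss.SSS"))

parsed.show(false)
// Expected: 2020-04-22 23:59:59 and 1990-01-01 23:59:59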

converting specific string format to date in sparksql

I have a column that contains a date as a string, e.g. Sat Sep 14 09:54:30 UTC 2019. I'm not familiar with this format at all.
I need to convert it to a date or timestamp - just something I can compare against. I only need a precision of one day.
This can help you get a timestamp from your string; you can then work with days from it using Spark SQL (2.x):
spark.sql("""SELECT from_utc_timestamp(from_unixtime(unix_timestamp("Sat Sep 14 09:54:30 UTC 2019", "EEE MMM dd HH:mm:ss zzz yyyy")), "IST") AS timestamp""").show()
+-------------------+
| timestamp|
+-------------------+
|2019-09-14 20:54:30|
+-------------------+
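Since only day-level precision is needed, wrapping the result in to_date gives a DateType value that can be compared directly. A sketch along the same lines (on Spark 3.x this pattern may require spark.sql.legacy.timeParserPolicy=LEGACY):
spark.sql("""
  SELECT to_date(
           from_unixtime(
             unix_timestamp("Sat Sep 14 09:54:30 UTC 2019", "EEE MMM dd HH:mm:ss zzz yyyy")
           )
         ) AS day
""").show()
// Expected: 2019-09-14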

How to convert a string column (which contains only a time, not a date) to a timestamp in Spark Scala?

I need to convert a column which contains only a time as a string to a timestamp type, or to any other time type available in Spark.
Below is the test dataframe, which has "Time_eg" as a string column:
Time_eg
12:49:09 AM
12:50:18 AM
Schema before converting to a timestamp:
Time_eg: string (nullable = true)
// Converting to timestamp
val transType = test.withColumn("Time_eg", test("Time_eg").cast("timestamp"))
Schema after converting to timestamp:
Time_eg: timestamp (nullable = true)
But the output of transType.show() gives null values for the "Time_eg" column.
Please let me know how to convert a column which contains only a time as a string to a timestamp in Spark Scala.
Much appreciated if anyone can help with this.
Thanks
You need to use a specific function to convert a string to a timestamp, and specify the format. Also, a timestamp in Spark represents a full date (with time of day). If you do not provide the date, it defaults to January 1st, 1970, the beginning of Unix time.
In your case, you can convert your strings as follows:
Seq("12:49:09 AM", "09:00:00 PM")
.toDF("Time_eg")
.select(to_timestamp('Time_eg, "hh:mm:ss aa") as "ts")
.show
+-------------------+
| ts|
+-------------------+
|1970-01-01 00:49:09|
|1970-01-01 21:00:00|
+-------------------+
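If the goal is just to compare times of day, the shared 1970-01-01 date makes these timestamps directly comparable; alternatively, you can reduce each value to seconds since midnight. A sketch, assuming Spark 3's parser (which typically expects a single 'a' for the AM/PM marker rather than 'aa'):
import org.apache.spark.sql.functions._
import spark.implicits._

val ts = Seq("12:49:09 AM", "09:00:00 PM").toDF("Time_eg")
  .select(to_timestamp($"Time_eg", "hh:mm:ss a") as "ts")

// Seconds since midnight, handy for ordering or bucketing times of day.
ts.select((hour($"ts") * 3600 + minute($"ts") * 60 + second($"ts")) as "sec_of_day").show()
// Expected: 2949 (00:49:09) and 75600 (21:00:00)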

PySpark: String to timestamp transformation

I am working with time data and trying to convert the strings to timestamp format.
Here is what the 'Time' column looks like
+----------+
| Time |
+----------+
|1358380800|
|1380672000|
+----------+
Here is what I want
+---------------+
| Time |
+---------------+
|2013/1/17 8:0:0|
|2013/10/2 8:0:0|
+---------------+
I found some similar questions and answers and tried the code below, but both attempts end up with null:
df2 = df.withColumn("Time", test["Time"].cast(TimestampType()))
df2 = df.withColumn('Time', F.unix_timestamp('Time', 'yyyy-MM-dd').cast(TimestampType()))
Well, you are doing it the other way around. The SQL function unix_timestamp converts a string with the given format to a Unix timestamp. When you want to convert a Unix timestamp to a datetime, you have to use the from_unixtime SQL function:
from pyspark.sql import functions as F
from pyspark.sql import types as T
l1 = [('1358380800',),('1380672000',)]
df = spark.createDataFrame(l1,['Time'])
df.withColumn('Time', F.from_unixtime(df.Time).cast(T.TimestampType())).show()
Output:
+-------------------+
| Time|
+-------------------+
|2013-01-17 01:00:00|
|2013-10-02 02:00:00|
+-------------------+
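To render the values in the slash-separated form shown in the question, date_format can be applied on top of from_unixtime; note that the wall-clock value depends on spark.sql.session.timeZone (which is why the output above shows 01:00:00 rather than 08:00:00). Sketched in Scala here for consistency with the rest of the page:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("1358380800", "1380672000").toDF("Time")

df.select(date_format(from_unixtime($"Time"), "yyyy/M/d H:m:s") as "Time").show()
// In a UTC+8 session this prints 2013/1/17 8:0:0 and 2013/10/2 8:0:0.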

convert date to integer scala spark

I have a dataframe that contains two date columns, start_date and finish_date, and I created a new column to hold the difference between the two dates.
+--------------------+--------------------+----------+
|          start_date|         finish_date|moyen_date|
+--------------------+--------------------+----------+
|2010-11-03 15:56:...|2010-11-03 17:43:...|         0|
|2010-11-03 17:43:...|2010-11-05 13:21:...|         2|
|2010-11-05 13:21:...|2010-11-05 14:08:...|         0|
|2010-11-05 14:08:...|2010-11-05 14:08:...|         0|
+--------------------+--------------------+----------+
I calculated the difference between the 2 dates:
var result = sqlDF.withColumn("moyen_date",datediff(col("finish_date"), col("start_date")))
But I want to convert start_date and finish_date to integers, knowing that each column contains date + time.
Can someone help me please?
Thank you
Considering this as part of your dataframe:
df.show(false)
+---------------------+
|ts |
+---------------------+
|2010-11-03 15:56:34.0|
+---------------------+
unix_timestamp returns the number of seconds since the Unix epoch. The input column should be of type timestamp. The output column is of type long.
df.withColumn("unix_ts" , unix_timestamp($"ts").show(false)
+---------------------+----------+
|ts |unix_ts |
+---------------------+----------+
|2010-11-03 15:56:34.0|1288817794|
+---------------------+----------+
To convert it back to a timestamp format of your choice, you can use from_unixtime, which also takes an optional format string as a parameter. (If you use to_date, you will only get the date and not the time.)
df.withColumn("unix_ts" , unix_timestamp($"ts") )
.withColumn("from_utime" , from_unixtime($"unix_ts" , "yyyy-MM-dd HH:mm:ss.S"))
.show(false)
+---------------------+----------+---------------------+
|ts |unix_ts |from_utime |
+---------------------+----------+---------------------+
|2010-11-03 15:56:34.0|1288817794|2010-11-03 15:56:34.0|
+---------------------+----------+---------------------+
The column from_utime here will be of type string though. To convert it to timestamp, you can simply use:
df.withColumn("from_utime" , $"from_utime".cast("timestamp") )
Since it's already in ISO date format, no specific conversion is needed. For any other format, you will need to use a combination of unix_timestamp and from_unixtime.
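As an aside, if all that is needed is an integer to compare or subtract, casting a timestamp column straight to long yields epoch seconds in one step. A minimal sketch, assuming start_date and finish_date are already of type timestamp:
import org.apache.spark.sql.functions.col

val withInts = sqlDF
  .withColumn("start_int", col("start_date").cast("long"))    // seconds since epoch
  .withColumn("finish_int", col("finish_date").cast("long"))

// finish_int - start_int then gives the gap in seconds rather than in days.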