pyspark: change string to timestamp - pyspark

I've a column in String format , some rows are also null.
I add random timestamp to make it in the following form to convert it into timestamp.
date
null
22-04-2020
date
01-01-1990 23:59:59.000
22-04-2020 23:59:59.000
df = df.withColumn('date', F.concat (df.date, F.lit(" 23:59:59.000")))
df = df.withColumn('date', F.when(F.col('date').isNull(), '01-01-1990 23:59:59.000').otherwise(F.col('date')))
df.withColumn("date", F.to_timestamp(F.col("date"),"MM-dd-yyyy HH mm ss SSS")).show(2)
but after this the column date becomes null.
can anyone help me solve this.
either convert the string to timestamp direct

Your timestamp format should start with dd-MM, not MM-dd, and you're also missing some colons and dots in the time part. Try the code below:
df.withColumn("date", F.to_timestamp(F.col("date"),"dd-MM-yyyy HH:mm:ss.SSS")).show()
+-------------------+
| date|
+-------------------+
|1990-01-01 23:59:59|
|2020-04-22 23:59:59|
+-------------------+

Related

Convert string (with timestamp) to timestamp in pyspark

I have a dataframe with a string datetime column.
I am converting it to timestamp, but the values are changing.
Following is my code, can anyone help me to convert without changing values.
df=spark.createDataFrame(
data = [ ("1","2020-04-06 15:06:16 +00:00")],
schema=["id","input_timestamp"])
df.printSchema()
#Timestamp String to DateType
df = df.withColumn("timestamp",to_timestamp("input_timestamp"))
# Using Cast to convert TimestampType to DateType
df.withColumn('timestamp_string', \
to_timestamp('timestamp').cast('string')) \
.show(truncate=False)
This is the output:
+---+--------------------------+-------------------+-------------------+
|id |input_timestamp |timestamp |timestamp_string |
+---+--------------------------+-------------------+-------------------+
|1 |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+
I want to know why the hour is changing from 15 to 8 and how can I prevent it?
I believe to_timestamp is converting timestamp value to your local time as you have +00:00 in your data.
Try to pass the format to to_timestamp() function.
Example:
from pyspark.sql.functions import to_timestamp
df.withColumn("timestamp",to_timestamp(col("input_timestamp"),"yyyy-MM-dd HH:mm:ss +00:00")).show(10,False)
#+---+--------------------------+-------------------+
#|id |input_timestamp |timestamp |
#+---+--------------------------+-------------------+
#|1 |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
from pyspark.sql.functions import to_utc_timestamp
df = spark.createDataFrame(
data=[('1', '2020-04-06 15:06:16 +00:00')],
schema=['id', 'input_timestamp'])
df.printSchema()
df = df.withColumn('timestamp', to_utc_timestamp('input_timestamp',
your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)
Replace your_local_timezone with the actual value.

How to load date with custom format in Spark

I have a scenario where I have a column data like "Tuesday, 09-Aug-11 21:13:26 GMT" and I want to create a schema in Spark but the datatypes TimestampType and DateType is not able to recognize this date format.
After loading the data to a dataframe using TimestampType or DateType I am seeing NULL values in that particular column.
Is there any alternative for this?
One option is to read "Tuesday, 09-Aug-11 21:13:26 GMT" as string type column & do transformation from string to timestamp something like below.
df.show(truncate=false)
+-------------------------------+
|dt |
+-------------------------------+
|Tuesday, 09-Aug-11 21:13:26 GMT|
+-------------------------------+
df.withColumn("dt",to_timestamp(col("dt"),"E, d-MMM-y H:m:s z")).show(truncate=false) //Note - It is converted GMT to IST local timezone.
+-------------------+
|dt |
+-------------------+
|2011-08-10 02:43:26|
+-------------------+

How to convert a string column (column which contains only time and not date ) to time_stamp in spark-scala?

I need to convert the column which contains only time as string to a time stamp type or any other time function which is available in spark.
Below is the test Data frame which having "Time_eg" as string column,
Time_eg
12:49:09 AM
12:50:18 AM
Schema before it convert to the time,
Time_eg: string (nullable = true)
//Converting to time stamp
val transType= test.withColumn("Time_eg", test("Time_eg").cast("timestamp"))
Schema After converting to timestamp, the schema is
Time_eg: timestamp (nullable = true)
But the output of transType.show() gives null value for the
"Time_eg" column.
Please let me know how to convert the column which contains only time as a string to time stamp in spark scala?
Much appreciate if anyone can help on this?
Thanks
You need to use a specific function to convert a string to a timestamp, and specify the format. Also, a timestamp in Spark represents a full date (with time of the day). If you do not provide the date, it will be set to 1970, Jan 1st, the begining of unix timestamps.
In your case, you can convert your strings as follows:
Seq("12:49:09 AM", "09:00:00 PM")
.toDF("Time_eg")
.select(to_timestamp('Time_eg, "hh:mm:ss aa") as "ts")
.show
+-------------------+
| ts|
+-------------------+
|1970-01-01 00:49:09|
|1970-01-01 21:00:00|
+-------------------+

How to convert timestamp column to epoch seconds?

How do you convert a timestamp column to epoch seconds?
var df = sc.parallelize(Seq("2018-07-01T00:00:00Z")).toDF("date_string")
df = df.withColumn("timestamp", $"date_string".cast("timestamp"))
df.show(false)
DataFrame:
+--------------------+---------------------+
|date_string |timestamp |
+--------------------+---------------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|
+--------------------+---------------------+
If you have a timestamp you can cast it to a long to get the epoch seconds
df = df.withColumn("epoch_seconds", $"timestamp".cast("long"))
df.show(false)
DataFrame
+--------------------+---------------------+-------------+
|date_string |timestamp |epoch_seconds|
+--------------------+---------------------+-------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|1530403200 |
+--------------------+---------------------+-------------+
Use unix_timestamp from org.apache.spark.functions. It can a timestamp column or from a string column where it is possible to specify the format. From the documentation:
public static Column unix_timestamp(Column s)
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return null if fail.
public static Column unix_timestamp(Column s, String p)
Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
Use as follows:
import org.apache.spark.functions._
df.withColumn("epoch_seconds", unix_timestamp($"timestamp")))
or if the column is a string with other format:
df.withColumn("epoch_seconds", unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'")))
It can be easily done with unix_timestamp function in spark SQL like this:
spark.sql("SELECT unix_timestamp(inv_time) AS time_as_long FROM agg_counts LIMIT 10").show()
Hope this helps.
You can use the function unix_timestamp and cast it into any datatype.
Example:
val df1 = df.select(unix_timestamp($"date_string", "yyyy-MM-dd HH:mm:ss").cast(LongType).as("epoch_seconds"))

convert date to integer scala spark

I have a dataframe, that contain, 2 columns of date start_date and finish_date; and I created a new column to add the moyen between the 2 dates.
+-----+--------+-------+---------+-----+--------------------+-------------------
start_date| finish_date| moyen_date|
+-----+--------+-------+---------+-----+--------------------+-------------------
2010-11-03 15:56:... |2010-11-03 17:43:...| 0|
2010-11-03 17:43:... |2010-11-05 13:21:...| 2|
2010-11-05 13:21:... |2010-11-05 14:08:...| 0|
2010-11-05 14:08:... |2010-11-05 14:08:...| 0|
+-----+--------+-------+---------+-----+--------------------+-------------------
I calculated the difference between the 2 dates:
var result = sqlDF.withColumn("moyen_date",datediff(col("finish_date"), col("start_date")))
But I want to convert start_date and finish_date to integer, knowing that each column contain date + time.
Someone can help me please. ?
Thank you
Considering this as part of your dataframe:
df.show(false)
+---------------------+
|ts |
+---------------------+
|2010-11-03 15:56:34.0|
+---------------------+
unix_timestamp returns the number of milliseconds since epoch. The input column should be of type timestamp. The output column is of type long.
df.withColumn("unix_ts" , unix_timestamp($"ts").show(false)
+---------------------+----------+
|ts |unix_ts |
+---------------------+----------+
|2010-11-03 15:56:34.0|1288817794|
+---------------------+----------+
To convert it back to timestamp format of your choice, you can use from_unixtime which also takes an optional timestamp format as a parameter. You are using to_date, that's why you're only getting the date and not the time.
df.withColumn("unix_ts" , unix_timestamp($"ts") )
.withColumn("from_utime" , from_unixtime($"unix_ts" , "yyyy-MM-dd HH:mm:ss.S"))
.show(false)
+---------------------+----------+---------------------+
|ts |unix_ts |from_utime |
+---------------------+----------+---------------------+
|2010-11-03 15:56:34.0|1288817794|2010-11-03 15:56:34.0|
+---------------------+----------+---------------------+
The column from_utime here will be of type string though. To convert it to timestamp, you can simple use:
df.withColumn("from_utime" , $"from_utime".cast("timestamp") )
Since it's already in ISO date format, no specific conversion is needed. For any other format, you will need to use a combination of unix_timestamp and from_unixtime.