How to convert a string containing nanoseconds to datetime in Spark Dataframe - pyspark

I have a timestamp field in my JSON file, as below:
"CreateDateTime":"2019-04-03T02:02:12.6475327Z"
When I cast it to a timestamp, I can see the correct value, but the 7th fractional digit is truncated.
jsonDF.select(col("CreateDateTime").cast("timestamp")).show(truncate = False)
+--------------------------+
|CreateDateTime |
+--------------------------+
|2019-04-03 02:02:12.647532|
+--------------------------+
I want to get the same value as it appears in the source, in timestamp format (without the T and Z). Can you please suggest a solution?
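Note: Spark's TimestampType only stores microsecond precision, so the 7th fractional digit cannot survive the cast. If microseconds are enough, a minimal sketch (assuming Spark 3.x datetime patterns) that displays the value without the T and Z would be:
from pyspark.sql import functions as F
# date_format returns a string; the 7th fractional digit is already lost on the cast
jsonDF.select(
    F.date_format(
        F.col("CreateDateTime").cast("timestamp"),
        "yyyy-MM-dd HH:mm:ss.SSSSSS"
    ).alias("CreateDateTime")
).show(truncate=False)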

Related

PySpark string column to timestamp conversion

I am currently learning pyspark and I need to convert a COLUMN of strings in format 13/09/2021 20:45 into a timestamp of just the hour 20:45.
Now I figured that I can do this with q1.withColumn("timestamp", to_timestamp("ts")).show() (where q1 is my dataframe and ts is the column in question), but the values returned are only null. I therefore realised that I need the input in PySpark's timestamp format (MM-dd-yyyy HH:mm:ss.SSSS) to convert it to a proper timestamp. Hence my question:
How can I convert the column of strings dd/mm/yyyy hh:mm into a format understandable to PySpark so that I can convert it to a timestamp?
There are different ways you can do that:
from pyspark.sql import functions as F
# use substring (characters 12-16 hold the HH:mm part)
df.withColumn('hour', F.substring('A', 12, 5)).show()
# use regex (raw string for the pattern)
df.withColumn('hour', F.regexp_extract('A', r'\d{2}:\d{2}', 0)).show()
# use datetime
df.withColumn('hour', F.from_unixtime(F.unix_timestamp('A', 'dd/MM/yyyy HH:mm'), 'HH:mm')).show()
# Output
# +----------------+-----+
# | A| hour|
# +----------------+-----+
# |13/09/2021 20:45|20:45|
# +----------------+-----+
unix_timestamp may help with your problem.
Just try this:
Convert pyspark string to date format
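For reference, a to_timestamp-based sketch of the same conversion (assuming, as in the answer above, that the input column is named A):
from pyspark.sql import functions as F
# parse the full dd/MM/yyyy HH:mm string into a real timestamp first,
# then pull out just the HH:mm part
df = df.withColumn('ts', F.to_timestamp('A', 'dd/MM/yyyy HH:mm'))
df = df.withColumn('hour', F.date_format('ts', 'HH:mm'))
df.show()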

pyspark: change string to timestamp

I have a column in String format; some rows are also null:
date
null
22-04-2020
I append a dummy time (and fill the nulls) to put it in the following form so that I can convert it to a timestamp:
date
01-01-1990 23:59:59.000
22-04-2020 23:59:59.000
df = df.withColumn('date', F.concat (df.date, F.lit(" 23:59:59.000")))
df = df.withColumn('date', F.when(F.col('date').isNull(), '01-01-1990 23:59:59.000').otherwise(F.col('date')))
df.withColumn("date", F.to_timestamp(F.col("date"),"MM-dd-yyyy HH mm ss SSS")).show(2)
But after this the date column becomes null.
Can anyone help me solve this?
Or can I convert the string to a timestamp directly?
Your timestamp format should start with dd-MM, not MM-dd, and you're also missing some colons and dots in the time part. Try the code below:
df.withColumn("date", F.to_timestamp(F.col("date"),"dd-MM-yyyy HH:mm:ss.SSS")).show()
+-------------------+
| date|
+-------------------+
|1990-01-01 23:59:59|
|2020-04-22 23:59:59|
+-------------------+
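Putting the whole pipeline together with the corrected pattern (a sketch, assuming the column is named date as in the question):
from pyspark.sql import functions as F
# append the time suffix, fill the nulls, then parse with dd-MM-yyyy
df = df.withColumn('date', F.concat(df.date, F.lit(' 23:59:59.000')))
df = df.withColumn('date', F.when(F.col('date').isNull(), '01-01-1990 23:59:59.000').otherwise(F.col('date')))
df = df.withColumn('date', F.to_timestamp(F.col('date'), 'dd-MM-yyyy HH:mm:ss.SSS'))
df.show()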

How to convert a string column (column which contains only time and not date ) to time_stamp in spark-scala?

I need to convert a column which contains only a time, as a string, to a timestamp type or any other time type available in Spark.
Below is the test DataFrame, which has "Time_eg" as a string column:
Time_eg
12:49:09 AM
12:50:18 AM
Schema before converting to a timestamp:
Time_eg: string (nullable = true)
//Converting to time stamp
val transType= test.withColumn("Time_eg", test("Time_eg").cast("timestamp"))
After converting to timestamp, the schema is:
Time_eg: timestamp (nullable = true)
But the output of transType.show() gives null values for the "Time_eg" column.
Please let me know how to convert a column which contains only a time as a string to a timestamp in Spark Scala.
Much appreciated if anyone can help with this.
Thanks
You need to use a specific function to convert a string to a timestamp, and specify the format. Also, a timestamp in Spark represents a full date (with time of day). If you do not provide the date, it will be set to 1970, Jan 1st, the beginning of Unix timestamps.
In your case, you can convert your strings as follows:
Seq("12:49:09 AM", "09:00:00 PM")
.toDF("Time_eg")
.select(to_timestamp('Time_eg, "hh:mm:ss a") as "ts")
.show
+-------------------+
| ts|
+-------------------+
|1970-01-01 00:49:09|
|1970-01-01 21:00:00|
+-------------------+
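For PySpark users, an equivalent sketch (assuming an active SparkSession named spark; time-only strings still land on 1970-01-01):
from pyspark.sql import functions as F
# build a small test frame and parse the 12-hour clock strings
df = spark.createDataFrame([("12:49:09 AM",), ("09:00:00 PM",)], ["Time_eg"])
df.select(F.to_timestamp("Time_eg", "hh:mm:ss a").alias("ts")).show()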

How to convert a row from Dataframe to String

I have a dataframe which contains only one row with the column name: source_column in the below format:
forecast_id:bigInt|period:numeric|name:char(50)|location:char(50)
I want to retrieve this value into a String and then split it on the regex |
First I tried converting the row from the DataFrame into a String in the following way, so that I could check whether the row was converted to a String:
val sourceColDataTypes = sourceCols.select("source_columns").rdd.map(x => x.toString()).collect()
When I try to print: println(sourceColDataTypes) to check the content, I see [Ljava.lang.String;@19bbb216
I couldn't understand the mistake here. Could anyone let me know how I can properly fetch a row from a DataFrame and convert it to a String?
You can also try this:
df.show()
//Input data
//+-----------+----------+--------+--------+
//|forecast_id|period |name |location|
//+-----------+----------+--------+--------+
//|1000 |period1000|name1000|loc1000 |
//+-----------+----------+--------+--------+
df.map(_.mkString(",")).show(false)
//Output:
//+--------------------------------+
//|value |
//+--------------------------------+
//|1000,period1000,name1000,loc1000|
//+--------------------------------+
df.rdd.map(_.mkString(",")).collect.foreach(println)
//1000,period1000,name1000,loc1000
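A PySpark-flavoured sketch of the same idea, in case you are not on the Scala API (column layout assumed from the example above):
# join each row's values into a single comma-separated string, then collect
row_strings = df.rdd.map(lambda row: ",".join(str(v) for v in row)).collect()
for s in row_strings:
    print(s)
# 1000,period1000,name1000,loc1000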

How to convert timestamp column to epoch seconds?

How do you convert a timestamp column to epoch seconds?
var df = sc.parallelize(Seq("2018-07-01T00:00:00Z")).toDF("date_string")
df = df.withColumn("timestamp", $"date_string".cast("timestamp"))
df.show(false)
DataFrame:
+--------------------+---------------------+
|date_string |timestamp |
+--------------------+---------------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|
+--------------------+---------------------+
If you have a timestamp, you can cast it to a long to get the epoch seconds:
df = df.withColumn("epoch_seconds", $"timestamp".cast("long"))
df.show(false)
DataFrame
+--------------------+---------------------+-------------+
|date_string |timestamp |epoch_seconds|
+--------------------+---------------------+-------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|1530403200 |
+--------------------+---------------------+-------------+
Use unix_timestamp from org.apache.spark.sql.functions. It can take a timestamp column, or a string column for which you can specify the format. From the documentation:
public static Column unix_timestamp(Column s)
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return null if fail.
public static Column unix_timestamp(Column s, String p)
Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
Use as follows:
import org.apache.spark.sql.functions._
df.withColumn("epoch_seconds", unix_timestamp($"timestamp"))
or if the column is a string with another format:
df.withColumn("epoch_seconds", unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
It can easily be done with the unix_timestamp function in Spark SQL like this:
spark.sql("SELECT unix_timestamp(inv_time) AS time_as_long FROM agg_counts LIMIT 10").show()
Hope this helps.
You can use the unix_timestamp function and cast the result to any datatype.
Example:
import org.apache.spark.sql.types.LongType
val df1 = df.select(unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'").cast(LongType).as("epoch_seconds"))
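For completeness, the same two approaches in PySpark (a sketch, assuming the date_string and timestamp columns shown above):
from pyspark.sql import functions as F
# cast an existing timestamp column to long for epoch seconds
df = df.withColumn("epoch_seconds", F.col("timestamp").cast("long"))
# or parse the raw string with an explicit pattern
df = df.withColumn("epoch_seconds", F.unix_timestamp("date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'"))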