How to push string from Databricks to timestamp in Snowflake - pyspark

I want to insert the string 2021-02-12 16:16:43 from Databricks into a Snowflake timestamp column. So this is what I tried:
1. Use the to_timestamp function in Databricks to convert from string to timestamp, but this gives a timestamp format that Snowflake doesn't recognize:
.withColumn('date_test',to_timestamp("date_test", 'yyyy-MM-dd HH:mm:ss'))
Output format: 2021-02-12T16:16:43.000+0000
Error trace:
Py4JJavaError: An error occurred while calling o5318.save.
: net.snowflake.client.jdbc.SnowflakeSQLException: Timestamp '40' is not recognized
File '3Xko6AuNme/19.CSV.gz', line 1, character 66
Row 1, column "TEST_READY_STAGING_862888178"["DATE_TEST":5]
If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
2. Then I tried date_format, which gives the correct format, but the type is string, and Snowflake complains about it again:
.withColumn('date_test',date_format(to_timestamp("date_test", 'yyyy-MM-dd HH:mm:ss'), 'yyyy-MM-dd HH:mm:ss'))
Error trace:
: net.snowflake.client.jdbc.SnowflakeSQLException: Timestamp '40' is not recognized
File 'gSOm5eLHFZ/22.CSV.gz', line 1, character 86
Row 1, column "TEST_READY_STAGING_25342852"["DATE_TEST":5]
If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option.
3. I tried to convert the string to a timestamp using a UDF from this topic: pyspark to_timestamp does not include milliseconds.
But it just doesn't convert to a timestamp:
import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

def _to_timestamp(s):
    return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S')

udf_to_timestamp = udf(_to_timestamp, TimestampType())

df_source_data.select('PURCHASE_DATE_TIME') \
    .withColumn('PURCHASE_DATE_TIME', udf_to_timestamp('PURCHASE_DATE_TIME')) \
    .show(1, False)
display(df_source_data)
df_source_data.printSchema()
Output:
+------------------------+
|PURCHASE_EVENT_DATE_TIME|
+------------------------+
|2021-02-12 16:16:43 |
+------------------------+
only showing top 1 row
root
|-- PURCHASE_EVENT_DATE_TIME: string (nullable = true)
Does anybody have any advice on how to push this string from Databricks into a Snowflake timestamp?

@alterego: on the Snowflake side, your format yyyy-MM-dd HH:mm:ss should be yyyy-MM-dd HH:mi:ss (in Snowflake format strings, mi means minutes and mm means month).
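For what it's worth: the 2021-02-12T16:16:43.000+0000 you see after to_timestamp is only Spark's display notation for a TimestampType value, not what gets written to Snowflake. Also, in attempt 3 the transformed DataFrame is never assigned back to df_source_data, which is why printSchema still reports string. Below is a minimal sketch of the usual route, keeping the column as a real TimestampType and writing through the Spark-Snowflake connector (sf_options and the table name TEST_READY are placeholder names, not from the question):

from pyspark.sql.functions import to_timestamp

# Parse once on the Spark side; the column is now TimestampType, not string.
df = df.withColumn('date_test', to_timestamp('date_test', 'yyyy-MM-dd HH:mm:ss'))

(df.write
    .format('snowflake')              # full name: net.snowflake.spark.snowflake
    .options(**sf_options)            # sfURL, sfUser, sfPassword, sfDatabase, sfSchema, sfWarehouse
    .option('dbtable', 'TEST_READY')  # placeholder target table
    .mode('append')
    .save())

If the load still fails, check for a custom TIMESTAMP_INPUT_FORMAT on the Snowflake side: as the comment above notes, Snowflake format strings use mi for minutes, so a format copied over from Spark as HH:mm:ss would be read there as hour, month, second.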

Related

I have a date column in pyspark in STRING format and I want to typecast it into date format

I have a date in the date column, as shown below:
date column
01-JAN-22
It is in string format and I want to typecast it into date format.
I have tried many ways using pyspark functions and SQL functions, but I am not getting output; it only shows null.
Can anybody help me solve this?
You can use to_date.
The various datetime patterns for formatting and parsing are documented here.
import pyspark.sql.functions as F

df = spark.createDataFrame(data=[["01-JAN-22"]], schema=["date column"])
df = df.withColumn("date column", F.to_date("date column", "d-MMM-yy"))
df.printSchema()
[Out]:
root
|-- date column: date (nullable = true)
print(df.schema)
[Out]:
StructType([StructField('date column', DateType(), True)])
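As a quick sanity check, the value below is what the to_date call above should produce for "01-JAN-22" (assuming the default locale):

df.show()
[Out]:
+-----------+
|date column|
+-----------+
| 2022-01-01|
+-----------+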

Convert string (with timestamp) to timestamp in pyspark

I have a dataframe with a string datetime column.
I am converting it to timestamp, but the values are changing.
Following is my code; can anyone help me convert it without changing the values?
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame(
    data=[("1", "2020-04-06 15:06:16 +00:00")],
    schema=["id", "input_timestamp"])
df.printSchema()

# Parse the timestamp string into a TimestampType column
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))

# Cast the TimestampType back to a string
df.withColumn('timestamp_string', to_timestamp('timestamp').cast('string')) \
    .show(truncate=False)
This is the output:
+---+--------------------------+-------------------+-------------------+
|id |input_timestamp |timestamp |timestamp_string |
+---+--------------------------+-------------------+-------------------+
|1 |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+
I want to know why the hour is changing from 15 to 8, and how I can prevent it.
I believe to_timestamp is converting the timestamp value to your local session time zone, as you have +00:00 in your data.
Try passing the format to the to_timestamp() function. With '+00:00' written as literal text in the pattern, the offset is matched but never interpreted, so the hour is parsed as-is:
Example:
from pyspark.sql.functions import to_timestamp, col

df.withColumn("timestamp", to_timestamp(col("input_timestamp"), "yyyy-MM-dd HH:mm:ss +00:00")).show(10, False)
#+---+--------------------------+-------------------+
#|id |input_timestamp |timestamp |
#+---+--------------------------+-------------------+
#|1 |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
from pyspark.sql.functions import to_utc_timestamp

df = spark.createDataFrame(
    data=[('1', '2020-04-06 15:06:16 +00:00')],
    schema=['id', 'input_timestamp'])
df.printSchema()

df = df.withColumn('timestamp',
                   to_utc_timestamp('input_timestamp', your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)
Replace your_local_timezone with the actual value.
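An alternative sketch, not from the answer above: leave the parse as it was and instead pin the Spark session time zone to UTC (spark.sql.session.timeZone is a standard Spark setting), so the +00:00 input is no longer shifted into local time.

from pyspark.sql.functions import to_timestamp

# With the session pinned to UTC, parsing and display both stay in UTC.
spark.conf.set('spark.sql.session.timeZone', 'UTC')

df.withColumn('timestamp', to_timestamp('input_timestamp')).show(truncate=False)
# The hour now stays at 15:06:16.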

PySpark string column to timestamp conversion

I am currently learning pyspark and I need to convert a COLUMN of strings in the format 13/09/2021 20:45 into a timestamp of just the hour, 20:45.
I figured that I can do this with q1.withColumn("timestamp", to_timestamp("ts")).show() (where q1 is my dataframe and ts is the column in question), however the returned values are only null. I therefore realised that I need an input in PySpark's timestamp format (MM-dd-yyyy HH:mm:ss.SSSS) to convert it to a proper timestamp. Hence my question:
How can I convert the column of strings dd/mm/yyyy hh:mm into a format understandable by pyspark, so that I can convert it to timestamp format?
There are different ways you can do that:
from pyspark.sql import functions as F
# use substring (characters 12-16 hold the time)
df.withColumn('hour', F.substring('A', 12, 5)).show()

# use a regex
df.withColumn('hour', F.regexp_extract('A', r'\d{2}:\d{2}', 0)).show()

# use the datetime functions
df.withColumn('hour', F.from_unixtime(F.unix_timestamp('A', 'dd/MM/yyyy HH:mm'), 'HH:mm')).show()
# Output
# +----------------+-----+
# | A| hour|
# +----------------+-----+
# |13/09/2021 20:45|20:45|
# +----------------+-----+
unix_timestamp may be of help for your problem. Just try this:
Convert pyspark string to date format
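A minimal sketch of that unix_timestamp route, reusing the column name A from the answer above: parse the full dd/MM/yyyy HH:mm string once, then keep the timestamp or just the hour.

from pyspark.sql import functions as F

# Parse the whole string into a proper timestamp column...
df = df.withColumn('ts', F.unix_timestamp('A', 'dd/MM/yyyy HH:mm').cast('timestamp'))
# ...then format out just the hour-and-minute part.
df = df.withColumn('hour', F.date_format('ts', 'HH:mm'))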

How to convert a string column (a column which contains only time, not date) to a timestamp in spark-scala?

I need to convert a column which contains only time, as a string, to a timestamp type or any other time type available in Spark.
Below is the test data frame, which has "Time_eg" as a string column:
Time_eg
12:49:09 AM
12:50:18 AM
Schema before converting to time:
Time_eg: string (nullable = true)
//Converting to time stamp
val transType= test.withColumn("Time_eg", test("Time_eg").cast("timestamp"))
After converting to timestamp, the schema is:
Time_eg: timestamp (nullable = true)
But the output of transType.show() gives null values for the "Time_eg" column.
Please let me know how to convert a column which contains only time, as a string, to a timestamp in Spark Scala.
Much appreciated if anyone can help with this.
Thanks
You need to use a specific function to convert a string to a timestamp, and specify the format. Also, a timestamp in Spark represents a full date (with time of day). If you do not provide the date, it will be set to 1970, Jan 1st, the beginning of Unix timestamps.
In your case, you can convert your strings as follows:
Seq("12:49:09 AM", "09:00:00 PM")
.toDF("Time_eg")
.select(to_timestamp('Time_eg, "hh:mm:ss aa") as "ts")
.show
+-------------------+
| ts|
+-------------------+
|1970-01-01 00:49:09|
|1970-01-01 21:00:00|
+-------------------+

How to convert timestamp column to epoch seconds?

How do you convert a timestamp column to epoch seconds?
var df = sc.parallelize(Seq("2018-07-01T00:00:00Z")).toDF("date_string")
df = df.withColumn("timestamp", $"date_string".cast("timestamp"))
df.show(false)
DataFrame:
+--------------------+---------------------+
|date_string |timestamp |
+--------------------+---------------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|
+--------------------+---------------------+
If you have a timestamp, you can cast it to a long to get the epoch seconds:
df = df.withColumn("epoch_seconds", $"timestamp".cast("long"))
df.show(false)
DataFrame
+--------------------+---------------------+-------------+
|date_string |timestamp |epoch_seconds|
+--------------------+---------------------+-------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|1530403200 |
+--------------------+---------------------+-------------+
Use unix_timestamp from org.apache.spark.sql.functions. It can take a timestamp column, or a string column for which it is possible to specify the format. From the documentation:
public static Column unix_timestamp(Column s)
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return null if fail.
public static Column unix_timestamp(Column s, String p)
Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
Use as follows:
import org.apache.spark.sql.functions._

df.withColumn("epoch_seconds", unix_timestamp($"timestamp"))
or, if the column is a string in another format:
df.withColumn("epoch_seconds", unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
It can be done easily with the unix_timestamp function in Spark SQL, like this:
spark.sql("SELECT unix_timestamp(inv_time) AS time_as_long FROM agg_counts LIMIT 10").show()
Hope this helps.
You can use the function unix_timestamp and cast it into any datatype.
Example:
import org.apache.spark.sql.types.LongType

val df1 = df.select(unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'").cast(LongType).as("epoch_seconds"))