How do I convert the datetime string 2018-02-07 00:45 into a Spark SQL timestamp? I tried
to_timestamp('2018-02-07 00:45', 'yyyy-MM-dd HH:mm')
and
date_format('2018-02-07 00:45', 'y-MM-dd hh:mm').cast(TimestampType())
but neither worked. This is in PySpark.
from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp

(sc
 .parallelize([Row(dt='2016_08_21 11_31_08')])
 .toDF()
 .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
 .show(1, False))
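As a sanity check on the pattern itself, the Java/Spark pattern yyyy-MM-dd HH:mm corresponds to %Y-%m-%d %H:%M in plain Python, and the question's string parses cleanly with it (a minimal sketch, not Spark code):

```python
from datetime import datetime

# Spark pattern "yyyy-MM-dd HH:mm" maps to strptime "%Y-%m-%d %H:%M"
ts = datetime.strptime("2018-02-07 00:45", "%Y-%m-%d %H:%M")
print(ts)  # 2018-02-07 00:45:00
```

So the pattern in the first attempt was already correct; the usual culprit is calling to_timestamp on a literal string instead of a column.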
In a Spark dataframe, I would like to convert the date column "Date", which is in string format (e.g. 20220124), to 2022-01-24, and then to date format, using Python.
df_new= df.withColumn('Date',to_date(df.Date, 'yyyy-MM-dd'))
You can do it with the to_date function, which takes the input column and the format of your date:
from pyspark.sql import functions as F
df.withColumn('date', F.to_date('date', 'yyyyMMdd'))
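The format-string mapping can be checked outside Spark as well; yyyyMMdd corresponds to %Y%m%d in plain Python (a sketch of the pattern equivalence, not Spark code):

```python
from datetime import datetime

# Spark's "yyyyMMdd" corresponds to Python's "%Y%m%d"
d = datetime.strptime("20220124", "%Y%m%d").date()
print(d.isoformat())  # 2022-01-24
```

The attempt with 'yyyy-MM-dd' failed because the format argument must describe the *input* string, not the desired output.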
In PySpark SQL, I have a Unix timestamp column stored as a long. I tried the following, but the output was not correct:
from_unixtime(col("firstAvailableDateTimeUnix"), "yyyy-MM-dd HH:mm:ss")
from_unixtime output
Your value is in milliseconds, while from_unixtime expects seconds, so divide by 1000:
import org.apache.spark.sql.functions._
val df = Seq(1651484635297L).toDF("firstAvailableDateTimeUnix").withColumn("goodDate",
from_unixtime(col("firstAvailableDateTimeUnix")/1000, "yyyy-MM-dd HH:mm:ss")
)
df.show(false)
// +--------------------------+-------------------+
// |firstAvailableDateTimeUnix|goodDate |
// +--------------------------+-------------------+
// |1651484635297 |2022-05-02 12:43:55|
// +--------------------------+-------------------+
Alternatively, divide by 1000, cast to a timestamp, and then use date_format to format it (casting a long to timestamp also interprets it as seconds):
date_format((col("firstAvailableDateTimeUnix") / 1000).cast("timestamp"), "yyyy-MM-dd HH:mm:ss")
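The key point in both answers is that the column holds milliseconds while Spark's epoch functions expect seconds. The same arithmetic in plain Python (UTC shown here; the 12:43:55 in the output above reflects the answerer's local timezone):

```python
from datetime import datetime, timezone

millis = 1651484635297  # epoch milliseconds

# Dividing by 1000 yields epoch seconds, which is what from_unixtime and
# a timestamp cast expect; without it the result is tens of millennia off.
dt = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2022-05-02 09:43:55
```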
I have a dataframe with a string datetime column. I am converting it to a timestamp, but the values are changing. My code is below; can anyone help me convert it without the values changing?
df=spark.createDataFrame(
data = [ ("1","2020-04-06 15:06:16 +00:00")],
schema=["id","input_timestamp"])
df.printSchema()
#Timestamp String to DateType
df = df.withColumn("timestamp",to_timestamp("input_timestamp"))
# Cast the TimestampType column back to a string
df.withColumn('timestamp_string', \
to_timestamp('timestamp').cast('string')) \
.show(truncate=False)
This is the output:
+---+--------------------------+-------------------+-------------------+
|id |input_timestamp |timestamp |timestamp_string |
+---+--------------------------+-------------------+-------------------+
|1 |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+
I want to know why the hour is changing from 15 to 8 and how can I prevent it?
to_timestamp is likely converting the value to your local time, since your data carries a +00:00 offset. Try passing the format to to_timestamp():
Example:
from pyspark.sql.functions import col, to_timestamp
df.withColumn("timestamp", to_timestamp(col("input_timestamp"), "yyyy-MM-dd HH:mm:ss +00:00")).show(10, False)
#+---+--------------------------+-------------------+
#|id |input_timestamp |timestamp |
#+---+--------------------------+-------------------+
#|1 |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
from pyspark.sql.functions import to_utc_timestamp
df = spark.createDataFrame(
data=[('1', '2020-04-06 15:06:16 +00:00')],
schema=['id', 'input_timestamp'])
df.printSchema()
df = df.withColumn('timestamp', to_utc_timestamp('input_timestamp',
your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)
Replace your_local_timezone with the actual value.
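The 15 → 8 shift is exactly what an aware datetime does when rendered in a UTC-7 zone. A plain-Python illustration (America/Los_Angeles is an assumption about the session timezone; any UTC-7 zone gives the same hour):

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Parse the offset-aware string, then render it in a UTC-7 zone
utc_dt = datetime.fromisoformat("2020-04-06 15:06:16+00:00")
local = utc_dt.astimezone(ZoneInfo("America/Los_Angeles"))
print(local.strftime("%H:%M:%S"))  # 08:06:16 (PDT is UTC-7 in April)
```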
I have read a CSV file into a dataframe where the timestamp column is in the format "11/12/2020 3:01". How do I convert that column into "yyyy-MM-dd HH:mm:ss.SSSSSS" format?
import org.apache.spark.sql.functions._

df.withColumn("timestamp_col",
  date_format(
    unix_timestamp($"timestamp_col", "dd/MM/yyyy h:mm").cast("timestamp"),
    "yyyy-MM-dd HH:mm:ss.SSSSSS"
  )
)
Note that unix_timestamp has one-second resolution, so the fractional part will always be 000000.
See .strftime in the Python datetime documentation:
https://docs.python.org/3/library/datetime.html
timestamp.strftime("%d.%m.%Y")
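Applying strptime/strftime to the question's input (a plain-Python sketch; %f renders the six fractional digits):

```python
from datetime import datetime

# Parse "day/month/year hour:minute" and re-render with microseconds
dt = datetime.strptime("11/12/2020 3:01", "%d/%m/%Y %H:%M")
print(dt.strftime("%Y-%m-%d %H:%M:%S.%f"))  # 2020-12-11 03:01:00.000000
```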
How do you convert a timestamp column to epoch seconds?
var df = sc.parallelize(Seq("2018-07-01T00:00:00Z")).toDF("date_string")
df = df.withColumn("timestamp", $"date_string".cast("timestamp"))
df.show(false)
DataFrame:
+--------------------+---------------------+
|date_string |timestamp |
+--------------------+---------------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|
+--------------------+---------------------+
If you have a timestamp you can cast it to a long to get the epoch seconds
df = df.withColumn("epoch_seconds", $"timestamp".cast("long"))
df.show(false)
DataFrame
+--------------------+---------------------+-------------+
|date_string |timestamp |epoch_seconds|
+--------------------+---------------------+-------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|1530403200 |
+--------------------+---------------------+-------------+
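The epoch value can be verified in plain Python; an aware UTC datetime's timestamp() returns exactly the seconds shown above:

```python
from datetime import datetime, timezone

# 2018-07-01T00:00:00Z as an aware UTC datetime
dt = datetime(2018, 7, 1, tzinfo=timezone.utc)
print(int(dt.timestamp()))  # 1530403200
```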
Use unix_timestamp from org.apache.spark.sql.functions. It can take a timestamp column, or a string column together with a format. From the documentation:
public static Column unix_timestamp(Column s)
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return null if fail.
public static Column unix_timestamp(Column s, String p)
Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
Use as follows:
import org.apache.spark.sql.functions._
df.withColumn("epoch_seconds", unix_timestamp($"timestamp"))
or if the column is a string with other format:
df.withColumn("epoch_seconds", unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
This can also be done with the unix_timestamp function in Spark SQL:
spark.sql("SELECT unix_timestamp(inv_time) AS time_as_long FROM agg_counts LIMIT 10").show()
Hope this helps.
You can use the unix_timestamp function and cast the result to any datatype.
Example:
import org.apache.spark.sql.types.LongType

val df1 = df.select(unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'").cast(LongType).as("epoch_seconds"))