In PySpark SQL, I have a unix timestamp column that is a long. I tried using the following, but the output was not correct.
from_unixtime(col("firstAvailableDateTimeUnix"), "yyyy-MM-dd HH:mm:ss")
from_unixtime expects seconds, but 1651484635297 is in milliseconds, so divide by 1000 first:
import org.apache.spark.sql.functions._
val df = Seq(1651484635297L).toDF("firstAvailableDateTimeUnix").withColumn(
  "goodDate",
  from_unixtime(col("firstAvailableDateTimeUnix") / 1000, "yyyy-MM-dd HH:mm:ss")
)
df.show(false)
// +--------------------------+-------------------+
// |firstAvailableDateTimeUnix|goodDate |
// +--------------------------+-------------------+
// |1651484635297 |2022-05-02 12:43:55|
// +--------------------------+-------------------+
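Since the question is PySpark, the same fix there would look like this (a sketch, assuming the same millisecond column):
from pyspark.sql.functions import col, from_unixtime
df = spark.createDataFrame([(1651484635297,)], ["firstAvailableDateTimeUnix"])
df = df.withColumn("goodDate", from_unixtime(col("firstAvailableDateTimeUnix") / 1000, "yyyy-MM-dd HH:mm:ss"))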
All you need to do is divide by 1000, cast to timestamp, and then use date_format to format your timestamp:
date_format((col("firstAvailableDateTimeUnix") / 1000).cast('timestamp'), "yyyy-MM-dd HH:mm:ss")
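For example (a sketch, assuming the same millisecond column as above):
from pyspark.sql.functions import col, date_format
df.withColumn("goodDate", date_format((col("firstAvailableDateTimeUnix") / 1000).cast("timestamp"), "yyyy-MM-dd HH:mm:ss")).show(truncate=False)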
Related
I have a dataframe with a string datetime column.
I am converting it to timestamp, but the values are changing.
Following is my code; can anyone help me convert it without changing the values?
from pyspark.sql.functions import to_timestamp
df = spark.createDataFrame(
    data=[("1", "2020-04-06 15:06:16 +00:00")],
    schema=["id", "input_timestamp"])
df.printSchema()
# Parse the timestamp string into a TimestampType column
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))
# Cast the TimestampType column back to a string for display
df.withColumn('timestamp_string',
              to_timestamp('timestamp').cast('string')) \
  .show(truncate=False)
This is the output:
+---+--------------------------+-------------------+-------------------+
|id |input_timestamp |timestamp |timestamp_string |
+---+--------------------------+-------------------+-------------------+
|1 |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+
I want to know why the hour is changing from 15 to 8, and how I can prevent it.
I believe to_timestamp is converting the timestamp value to your local time zone, since you have +00:00 in your data.
Try to pass the format to to_timestamp() function.
Example:
from pyspark.sql.functions import col, to_timestamp
df.withColumn("timestamp", to_timestamp(col("input_timestamp"), "yyyy-MM-dd HH:mm:ss +00:00")).show(10, False)
#+---+--------------------------+-------------------+
#|id |input_timestamp |timestamp |
#+---+--------------------------+-------------------+
#|1 |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
from pyspark.sql.functions import to_utc_timestamp
df = spark.createDataFrame(
    data=[('1', '2020-04-06 15:06:16 +00:00')],
    schema=['id', 'input_timestamp'])
df.printSchema()
df = df.withColumn('timestamp',
                   to_utc_timestamp('input_timestamp', your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)
Replace your_local_timezone with the actual value.
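Alternatively, if you simply want Spark to stop rendering timestamps in your local zone, you can set the session time zone to UTC (a common approach, shown here as a sketch):
spark.conf.set("spark.sql.session.timeZone", "UTC")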
I have read a csv file and made a dataframe where the timestamp column is in the format "11/12/2020 3:01".
How do I convert the data of that particular timestamp column into the "yyyy-mm-dd hh:mm:ss.ssssss" format?
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"col" syntax, if not in spark-shell

df.withColumn("timestamp_col",
  date_format(
    unix_timestamp($"timestamp_col", "dd/MM/yyyy H:mm").cast("timestamp"),
    "yyyy-MM-dd HH:mm:ss.SSSSSS"
  )
)
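If you are working in PySpark instead, the equivalent would be (a sketch, assuming the column is named timestamp_col):
from pyspark.sql.functions import date_format, unix_timestamp
df.withColumn("timestamp_col", date_format(unix_timestamp("timestamp_col", "dd/MM/yyyy H:mm").cast("timestamp"), "yyyy-MM-dd HH:mm:ss.SSSSSS"))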
Watch for .strftime in the Python documentation:
https://docs.python.org/3/library/datetime.html
timestamp.strftime("%d.%m.%Y")
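For example, to reshape the "11/12/2020 3:01" string from the question with plain Python (a sketch using strptime/strftime):
from datetime import datetime
dt = datetime.strptime("11/12/2020 3:01", "%d/%m/%Y %H:%M")
print(dt.strftime("%Y-%m-%d %H:%M:%S.%f"))  # 2020-12-11 03:01:00.000000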
How do you convert a timestamp column to epoch seconds?
var df = sc.parallelize(Seq("2018-07-01T00:00:00Z")).toDF("date_string")
df = df.withColumn("timestamp", $"date_string".cast("timestamp"))
df.show(false)
DataFrame:
+--------------------+---------------------+
|date_string |timestamp |
+--------------------+---------------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|
+--------------------+---------------------+
If you have a timestamp, you can cast it to a long to get the epoch seconds:
df = df.withColumn("epoch_seconds", $"timestamp".cast("long"))
df.show(false)
DataFrame:
+--------------------+---------------------+-------------+
|date_string |timestamp |epoch_seconds|
+--------------------+---------------------+-------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|1530403200 |
+--------------------+---------------------+-------------+
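The same cast works in PySpark (a sketch):
from pyspark.sql.functions import col
df = df.withColumn("epoch_seconds", col("timestamp").cast("long"))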
Use unix_timestamp from org.apache.spark.sql.functions. It can convert from a timestamp column, or from a string column where it is possible to specify the format. From the documentation:
public static Column unix_timestamp(Column s)
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return null if fail.
public static Column unix_timestamp(Column s, String p)
Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
Use as follows:
import org.apache.spark.sql.functions._
df.withColumn("epoch_seconds", unix_timestamp($"timestamp"))
or if the column is a string with other format:
df.withColumn("epoch_seconds", unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'")))
It can be easily done with the unix_timestamp function in Spark SQL, like this:
spark.sql("SELECT unix_timestamp(inv_time) AS time_as_long FROM agg_counts LIMIT 10").show()
Hope this helps.
You can use the unix_timestamp function and cast the result into any datatype.
Example:
import org.apache.spark.sql.types.LongType
val df1 = df.select(unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'").cast(LongType).as("epoch_seconds"))
I have the following DataFrame:
+----------+-------------------+
| timestamp| created|
+----------+-------------------+
|1519858893|2018-03-01 00:01:33|
|1519858950|2018-03-01 00:02:30|
|1519859900|2018-03-01 00:18:20|
|1519859900|2018-03-01 00:18:20|
+----------+-------------------+
How do I create the timestamp correctly?
I was able to create a timestamp column which is an epoch timestamp, but the dates do not coincide:
df.withColumn("timestamp",unix_timestamp($"created"))
For example, 1519858893 points to 2018-02-28.
Just use the date_format and to_utc_timestamp inbuilt functions:
import org.apache.spark.sql.functions._
df.withColumn("timestamp", to_utc_timestamp(date_format(col("created"), "yyyy-MM-dd"), "Asia/Kathmandu"))
Try the below code:
import org.apache.spark.sql.types.DateType
df.withColumn("dateColumn", df("timestamp").cast(DateType))
You can check one solution here https://stackoverflow.com/a/46595413
To elaborate more on that: when the dataframe has different formats of timestamps/dates in a string column, you can do this -
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.DateType

val df = spark.sparkContext.parallelize(Seq("2020-04-21 10:43:12.000Z", "20-04-2019 10:34:12", "11-30-2019 10:34:12", "2020-05-21 21:32:43", "20-04-2019", "2020-04-21")).toDF("ts")
def strToDate(col: Column): Column = {
  val formats: Seq[String] = Seq("dd-MM-yyyy HH:mm:ss", "yyyy-MM-dd HH:mm:ss", "dd-MM-yyyy", "yyyy-MM-dd")
  // coalesce keeps the first format that parses to a non-null date
  coalesce(formats.map(f => to_timestamp(col, f).cast(DateType)): _*)
}
val formattedDF = df.withColumn("dt", strToDate(df.col("ts")))
formattedDF.show()
+--------------------+----------+
| ts| dt|
+--------------------+----------+
|2020-04-21 10:43:...|2020-04-21|
| 20-04-2019 10:34:12|2019-04-20|
| 2020-05-21 21:32:43|2020-05-21|
| 20-04-2019|2019-04-20|
| 2020-04-21|2020-04-21|
+--------------------+----------+
Note: this code assumes that the data does not contain any values in the formats MM-dd-yyyy or MM-dd-yyyy HH:mm:ss.
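A PySpark version of the same coalesce-over-formats idea might look like this (a sketch; str_to_date is a hypothetical helper name):
from pyspark.sql.functions import coalesce, to_date
def str_to_date(c, formats=("dd-MM-yyyy HH:mm:ss", "yyyy-MM-dd HH:mm:ss", "dd-MM-yyyy", "yyyy-MM-dd")):
    # coalesce keeps the first format that parses to a non-null date
    return coalesce(*[to_date(c, f) for f in formats])
formattedDF = df.withColumn("dt", str_to_date("ts"))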
How do we convert the datetime string 2018-02-07 00:45 into a Spark SQL timestamp? I tried
to_timestamp('2018-02-07 00:45', 'yyyy-MM-dd HH:mm')
and
date_format('2018-02-07 00:45', 'y-MM-dd hh:mm').cast(TimestampType())
and both did not work.
This is in PySpark.
from pyspark.sql import Row
from pyspark.sql.functions import to_timestamp
(sc
 .parallelize([Row(dt='2016_08_21 11_31_08')])
 .toDF()
 .withColumn("parsed", to_timestamp("dt", "yyyy_MM_dd HH_mm_ss"))
 .show(1, False))
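One caveat: Spark 3.x switched to a new datetime parser, and some patterns that worked on Spark 2.x now throw an upgrade exception. If you hit that with any of the snippets above, you can restore the old behaviour with the following setting (a real config flag, though whether you need it depends on your Spark version):
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")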