How do you convert a timestamp column to epoch seconds?
var df = sc.parallelize(Seq("2018-07-01T00:00:00Z")).toDF("date_string")
df = df.withColumn("timestamp", $"date_string".cast("timestamp"))
df.show(false)
DataFrame:
+--------------------+---------------------+
|date_string |timestamp |
+--------------------+---------------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|
+--------------------+---------------------+
If you have a timestamp, you can cast it to a long to get the epoch seconds:
df = df.withColumn("epoch_seconds", $"timestamp".cast("long"))
df.show(false)
DataFrame:
+--------------------+---------------------+-------------+
|date_string |timestamp |epoch_seconds|
+--------------------+---------------------+-------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|1530403200 |
+--------------------+---------------------+-------------+
Use unix_timestamp from org.apache.spark.sql.functions. It works on a timestamp column, or on a string column where it is possible to specify the format. From the documentation:
public static Column unix_timestamp(Column s)
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return null if fail.
public static Column unix_timestamp(Column s, String p)
Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
Use as follows:
import org.apache.spark.sql.functions._
df.withColumn("epoch_seconds", unix_timestamp($"timestamp"))
or, if the column is a string in another format:
df.withColumn("epoch_seconds", unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
It can easily be done with the unix_timestamp function in Spark SQL, like this:
spark.sql("SELECT unix_timestamp(inv_time) AS time_as_long FROM agg_counts LIMIT 10").show()
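To try the same thing on the example DataFrame from the top of this thread, you can register it as a temporary view first; a minimal sketch (the view name "events" is just for illustration):
// Register the example DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("events")

// unix_timestamp on a timestamp column returns epoch seconds
spark.sql("SELECT date_string, unix_timestamp(`timestamp`) AS epoch_seconds FROM events").show(false)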
Hope this helps.
You can use the unix_timestamp function and cast the result into any data type.
Example:
import org.apache.spark.sql.types.LongType

val df1 = df.select(unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'").cast(LongType).as("epoch_seconds"))
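For instance, building on df1 above, the same epoch value can be cast straight back to a timestamp if you need to round-trip it (a small sketch; the column name "as_timestamp" is just illustrative):
import org.apache.spark.sql.types.TimestampType

// Epoch seconds cast back to a timestamp should reproduce the original instant
val roundTrip = df1.withColumn("as_timestamp", $"epoch_seconds".cast(TimestampType))
roundTrip.show(false)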
Related
I have a dataframe with a string datetime column.
I am converting it to timestamp, but the values are changing.
Following is my code; can anyone help me convert it without changing the values?
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame(
    data=[("1", "2020-04-06 15:06:16 +00:00")],
    schema=["id", "input_timestamp"])
df.printSchema()
# Parse the timestamp string into a TimestampType column
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))
# Cast the TimestampType column to a string for display
df.withColumn('timestamp_string', \
    to_timestamp('timestamp').cast('string')) \
    .show(truncate=False)
This is the output:
+---+--------------------------+-------------------+-------------------+
|id |input_timestamp |timestamp |timestamp_string |
+---+--------------------------+-------------------+-------------------+
|1 |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+
I want to know why the hour is changing from 15 to 8 and how I can prevent it.
I believe to_timestamp is converting the timestamp value to your local time, since you have +00:00 in your data.
Try passing the format to the to_timestamp() function.
Example:
from pyspark.sql.functions import col, to_timestamp
df.withColumn("timestamp",to_timestamp(col("input_timestamp"),"yyyy-MM-dd HH:mm:ss +00:00")).show(10,False)
#+---+--------------------------+-------------------+
#|id |input_timestamp |timestamp |
#+---+--------------------------+-------------------+
#|1 |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
from pyspark.sql.functions import to_utc_timestamp
df = spark.createDataFrame(
data=[('1', '2020-04-06 15:06:16 +00:00')],
schema=['id', 'input_timestamp'])
df.printSchema()
df = df.withColumn('timestamp',
                   to_utc_timestamp('input_timestamp', your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)
Replace your_local_timezone with the actual value.
I have a column in String format; some rows are also null.
I append a dummy time to the non-null values (and substitute a dummy date-time for the nulls) so the column takes the following form, which I can then convert into a timestamp.
Before:
date
null
22-04-2020
After:
date
01-01-1990 23:59:59.000
22-04-2020 23:59:59.000
from pyspark.sql import functions as F

df = df.withColumn('date', F.concat(df.date, F.lit(" 23:59:59.000")))
df = df.withColumn('date', F.when(F.col('date').isNull(), '01-01-1990 23:59:59.000').otherwise(F.col('date')))
df.withColumn("date", F.to_timestamp(F.col("date"), "MM-dd-yyyy HH mm ss SSS")).show(2)
But after this the column date becomes null.
Can anyone help me solve this, or show how to convert the string to a timestamp directly?
Your timestamp format should start with dd-MM, not MM-dd, and you're also missing some colons and dots in the time part. Try the code below:
df.withColumn("date", F.to_timestamp(F.col("date"),"dd-MM-yyyy HH:mm:ss.SSS")).show()
+-------------------+
| date|
+-------------------+
|1990-01-01 23:59:59|
|2020-04-22 23:59:59|
+-------------------+
I have a Spark DataFrame with a timestamp column in milliseconds since the epoch. The column is a string. I now want to transform the column to a human-readable time but keep the milliseconds.
For example:
1614088453671 -> 23-2-2021 13:54:13.671
Every example I found transforms the timestamp to a normal human-readable time without milliseconds.
What I have:
+------------------+
|epoch_time_seconds|
+------------------+
|1614088453671 |
+------------------+
What I want to reach:
+------------------+------------------------+
|epoch_time_seconds|human_date |
+------------------+------------------------+
|1614088453671 |23-02-2021 13:54:13.671 |
+------------------+------------------------+
The part down to seconds can be obtained using date_format with from_unixtime, while the milliseconds can be obtained using a modulo. Combine them using format_string.
import org.apache.spark.sql.functions.{col, date_format, format_string, from_unixtime}

val df2 = df.withColumn(
  "human_date",
  format_string(
    "%s.%s",
    date_format(
      from_unixtime(col("epoch_time_seconds") / 1000),
      "dd-MM-yyyy HH:mm:ss"
    ),
    col("epoch_time_seconds") % 1000
  )
)
df2.show(false)
+------------------+-----------------------+
|epoch_time_seconds|human_date |
+------------------+-----------------------+
|1614088453671 |23-02-2021 13:54:13.671|
+------------------+-----------------------+
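If you would rather avoid assembling the string yourself and are on Spark 3.1 or later, another option worth trying (a sketch, not a drop-in answer; df3 and the cast are my own additions) is the timestamp_millis SQL function, which builds a timestamp straight from the millisecond count so date_format can print the fraction with SSS:
import org.apache.spark.sql.functions.{date_format, expr}

// Build a timestamp directly from the millisecond count (SQL function, Spark 3.1+),
// then format it including the milliseconds; this uses the session timezone like from_unixtime
val df3 = df.withColumn(
  "human_date",
  date_format(expr("timestamp_millis(cast(epoch_time_seconds as long))"), "dd-MM-yyyy HH:mm:ss.SSS")
)
df3.show(false)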
I have data in a Hive table in the below format.
2019-11-21 18:19:15.817
I wrote a SQL query as below to get the above column value in epoch format.
val newDF = spark.sql(f"""select TRIM(id) as ID, unix_timestamp(sig_ts) as SIG_TS from table""")
And I am getting the output column SIG_TS as 1574360296, which does not have milliseconds.
How to get the epoch timestamp of a date with milliseconds?
Simple way: create a UDF, since Spark's built-in unix_timestamp function truncates at seconds.
import java.sql.Timestamp
import org.apache.spark.sql.functions.{udf, unix_timestamp}
// assumes spark.implicits._ is in scope (e.g. in spark-shell), for toDF and $"..."

// Returns the full epoch time in milliseconds instead of truncating at seconds
val fullTimestampUDF = udf{t: Timestamp => t.getTime}

val df = Seq("2019-11-21 18:19:15.817").toDF("sig_ts")
  .withColumn("sig_ts_ut", unix_timestamp($"sig_ts"))
  .withColumn("sig_ts_ut_long", fullTimestampUDF($"sig_ts"))
df.show(false)
+-----------------------+----------+--------------+
|sig_ts |sig_ts_ut |sig_ts_ut_long|
+-----------------------+----------+--------------+
|2019-11-21 18:19:15.817|1574356755|1574356755817 |
+-----------------------+----------+--------------+
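If you would rather skip the UDF and are on Spark 3.1 or later, a sketch of an alternative using the built-in unix_millis SQL function (the dfNoUdf and sig_ts_ms names are just for illustration):
import org.apache.spark.sql.functions.expr

// unix_millis returns milliseconds since the epoch for a timestamp (SQL function, Spark 3.1+);
// sig_ts is a string here, so cast it to a timestamp first
val dfNoUdf = df.withColumn("sig_ts_ms", expr("unix_millis(cast(sig_ts as timestamp))"))
dfNoUdf.show(false)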
I need to convert a column which contains only a time as a string to a timestamp type, or to any other time representation available in Spark.
Below is the test DataFrame, which has "Time_eg" as a string column:
Time_eg
12:49:09 AM
12:50:18 AM
Schema before converting to a timestamp:
Time_eg: string (nullable = true)
// Converting to timestamp
val transType= test.withColumn("Time_eg", test("Time_eg").cast("timestamp"))
After converting to timestamp, the schema is:
Time_eg: timestamp (nullable = true)
But the output of transType.show() gives null values for the "Time_eg" column.
Please let me know how to convert a column which contains only a time as a string to a timestamp in Spark Scala.
I would much appreciate it if anyone could help with this.
Thanks
You need to use a specific function to convert a string to a timestamp, and specify the format. Also, a timestamp in Spark represents a full date (with time of day). If you do not provide the date, it will be set to Jan 1st, 1970, the beginning of Unix timestamps.
In your case, you can convert your strings as follows:
Seq("12:49:09 AM", "09:00:00 PM")
.toDF("Time_eg")
.select(to_timestamp('Time_eg, "hh:mm:ss a") as "ts")
.show
+-------------------+
| ts|
+-------------------+
|1970-01-01 00:49:09|
|1970-01-01 21:00:00|
+-------------------+
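And if what you ultimately need is just the time of day as text, you can format the parsed value back out; a small sketch along the same lines (the "time_of_day" column name is mine, and it assumes spark.implicits._ is in scope, as in spark-shell):
import org.apache.spark.sql.functions.{date_format, to_timestamp}

// Parse the 12-hour time string, then render the time portion in 24-hour form
Seq("12:49:09 AM", "09:00:00 PM")
  .toDF("Time_eg")
  .select(date_format(to_timestamp($"Time_eg", "hh:mm:ss a"), "HH:mm:ss") as "time_of_day")
  .show()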