I have a dataframe with a string datetime column.
I am converting it to timestamp, but the values are changing.
Following is my code, can anyone help me to convert without changing values.
df=spark.createDataFrame(
data = [ ("1","2020-04-06 15:06:16 +00:00")],
schema=["id","input_timestamp"])
df.printSchema()
#Timestamp String to DateType
df = df.withColumn("timestamp",to_timestamp("input_timestamp"))
# Using Cast to convert TimestampType to DateType
df.withColumn('timestamp_string', \
to_timestamp('timestamp').cast('string')) \
.show(truncate=False)
This is the output:
+---+--------------------------+-------------------+-------------------+
|id |input_timestamp |timestamp |timestamp_string |
+---+--------------------------+-------------------+-------------------+
|1 |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+
I want to know why the hour is changing from 15 to 8 and how can I prevent it?
I believe to_timestamp is converting timestamp value to your local time as you have +00:00 in your data.
Try to pass the format to to_timestamp() function.
Example:
from pyspark.sql.functions import to_timestamp
df.withColumn("timestamp",to_timestamp(col("input_timestamp"),"yyyy-MM-dd HH:mm:ss +00:00")).show(10,False)
#+---+--------------------------+-------------------+
#|id |input_timestamp |timestamp |
#+---+--------------------------+-------------------+
#|1 |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
from pyspark.sql.functions import to_utc_timestamp
df = spark.createDataFrame(
data=[('1', '2020-04-06 15:06:16 +00:00')],
schema=['id', 'input_timestamp'])
df.printSchema()
df = df.withColumn('timestamp', to_utc_timestamp('input_timestamp',
your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)
Replace your_local_timezone with the actual value.
Related
I have a timestamp of this type: 2022-11-09T23:19:32.000Z
When I cast to date, my output is "2022-11-10" but I wanna "2022-11-09". Is there a way to force utc 0 (not +1) or extract directly data with a regex to bring only date without consider timezone?
I have tried also substring('2022-11-09T23:19:32.000Z', 1, 10) or some function to extract string... but my output is the same: "2022-11-10".
Example:
Input
id
start_date
123
2020-04-10T23:55:19.000Z
My code:
df_output = df_input.withColumn('date', F.regex_extract(F.col('start_date', '(\\d{4})-(\\d{2})-(\\d{2})', 0))
Wrong Output
id
start_date
date
123
2020-04-10T23:55:19.000Z
2020-04-11
Desidered Output [I wanna extract string from timestamp without consider timezone]
id
start_date
date
123
2020-04-10T23:55:19.000Z
2020-04-10
Can't you use the to_date function? This here works for me:
from datetime import datetime
from pyspark.sql.functions import to_date
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
df = spark.createDataFrame(
[
(
"123",
datetime.strptime("2020-04-10T23:55:19.000Z", '%Y-%m-%dT%H:%M:%S.%fZ')
)
],
StructType([
StructField("id", StringType()),
StructField("start_date", TimestampType()),
]))
df.withColumn("date", to_date("start_date", "%Y-%m-%d")).show()
Output:
+---+-------------------+----------+
| id| start_date| date|
+---+-------------------+----------+
|123|2020-04-10 23:55:19|2020-04-10|
+---+-------------------+----------+
I have below event_time in my data frame
I would like to convert the event_time into date/time. Used below code, however it's not coming properly
import pyspark.sql.functions as f
df = df.withColumn("date", f.from_unixtime("Event_Time", "dd/MM/yyyy HH:MM:SS"))
df.show()
I am getting below output and it's not coming properly
Can anyone advise how to do this properly as I am new to pyspark?
Seems that your data is in Microseconds (1/1,000,000 second) so you would have to divide by 1,000,000
df = spark.createDataFrame(
[
('1645904274665267',),
('1645973845823770',),
('1644134156697560',),
('1644722868485010',),
('1644805678702121',),
('1645071502180365',),
('1644220446396240',),
('1645736052650785',),
('1646006645296010',),
('1644544811297016',),
('1644614023559317',),
('1644291365608571',),
('1645643575551339',)
], ['Event_Time']
)
import pyspark.sql.functions as f
df = df.withColumn("date", f.from_unixtime(f.col("Event_Time")/1000000))
df.show(truncate = False)
output
+----------------+-------------------+
|Event_Time |date |
+----------------+-------------------+
|1645904274665267|2022-02-26 20:37:54|
|1645973845823770|2022-02-27 15:57:25|
|1644134156697560|2022-02-06 08:55:56|
|1644722868485010|2022-02-13 04:27:48|
|1644805678702121|2022-02-14 03:27:58|
|1645071502180365|2022-02-17 05:18:22|
|1644220446396240|2022-02-07 08:54:06|
|1645736052650785|2022-02-24 21:54:12|
|1646006645296010|2022-02-28 01:04:05|
|1644544811297016|2022-02-11 03:00:11|
|1644614023559317|2022-02-11 22:13:43|
|1644291365608571|2022-02-08 04:36:05|
|1645643575551339|2022-02-23 20:12:55|
+----------------+-------------------+
I've a column in String format , some rows are also null.
I add random timestamp to make it in the following form to convert it into timestamp.
date
null
22-04-2020
date
01-01-1990 23:59:59.000
22-04-2020 23:59:59.000
df = df.withColumn('date', F.concat (df.date, F.lit(" 23:59:59.000")))
df = df.withColumn('date', F.when(F.col('date').isNull(), '01-01-1990 23:59:59.000').otherwise(F.col('date')))
df.withColumn("date", F.to_timestamp(F.col("date"),"MM-dd-yyyy HH mm ss SSS")).show(2)
but after this the column date becomes null.
can anyone help me solve this.
either convert the string to timestamp direct
Your timestamp format should start with dd-MM, not MM-dd, and you're also missing some colons and dots in the time part. Try the code below:
df.withColumn("date", F.to_timestamp(F.col("date"),"dd-MM-yyyy HH:mm:ss.SSS")).show()
+-------------------+
| date|
+-------------------+
|1990-01-01 23:59:59|
|2020-04-22 23:59:59|
+-------------------+
How do you convert a timestamp column to epoch seconds?
var df = sc.parallelize(Seq("2018-07-01T00:00:00Z")).toDF("date_string")
df = df.withColumn("timestamp", $"date_string".cast("timestamp"))
df.show(false)
DataFrame:
+--------------------+---------------------+
|date_string |timestamp |
+--------------------+---------------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|
+--------------------+---------------------+
If you have a timestamp you can cast it to a long to get the epoch seconds
df = df.withColumn("epoch_seconds", $"timestamp".cast("long"))
df.show(false)
DataFrame
+--------------------+---------------------+-------------+
|date_string |timestamp |epoch_seconds|
+--------------------+---------------------+-------------+
|2018-07-01T00:00:00Z|2018-07-01 00:00:00.0|1530403200 |
+--------------------+---------------------+-------------+
Use unix_timestamp from org.apache.spark.functions. It can a timestamp column or from a string column where it is possible to specify the format. From the documentation:
public static Column unix_timestamp(Column s)
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale, return null if fail.
public static Column unix_timestamp(Column s, String p)
Convert time string with given pattern (see http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html) to Unix time stamp (in seconds), return null if fail.
Use as follows:
import org.apache.spark.functions._
df.withColumn("epoch_seconds", unix_timestamp($"timestamp")))
or if the column is a string with other format:
df.withColumn("epoch_seconds", unix_timestamp($"date_string", "yyyy-MM-dd'T'HH:mm:ss'Z'")))
It can be easily done with unix_timestamp function in spark SQL like this:
spark.sql("SELECT unix_timestamp(inv_time) AS time_as_long FROM agg_counts LIMIT 10").show()
Hope this helps.
You can use the function unix_timestamp and cast it into any datatype.
Example:
val df1 = df.select(unix_timestamp($"date_string", "yyyy-MM-dd HH:mm:ss").cast(LongType).as("epoch_seconds"))
I have the following DataFrame:
+----------+-------------------+
| timestamp| created|
+----------+-------------------+
|1519858893|2018-03-01 00:01:33|
|1519858950|2018-03-01 00:02:30|
|1519859900|2018-03-01 00:18:20|
|1519859900|2018-03-01 00:18:20|
How to create a timestamp correctly`?
I was able to create timestamp column which is epoch timestamp, but dates to not coincide:
df.withColumn("timestamp",unix_timestamp($"created"))
For example, 1519858893 points to 2018-02-28.
Just use date_format and to_utc_timestamp inbuilt functions
import org.apache.spark.sql.functions._
df.withColumn("timestamp", to_utc_timestamp(date_format(col("created"), "yyy-MM-dd"), "Asia/Kathmandu"))
Try below code
df.withColumn("dateColumn", df("timestamp").cast(DateType))
You can check one solution here https://stackoverflow.com/a/46595413
To elaborate more on that with the dataframe having different formats of timestamp/date in string, you can do this -
val df = spark.sparkContext.parallelize(Seq("2020-04-21 10:43:12.000Z", "20-04-2019 10:34:12", "11-30-2019 10:34:12", "2020-05-21 21:32:43", "20-04-2019", "2020-04-21")).toDF("ts")
def strToDate(col: Column): Column = {
val formats: Seq[String] = Seq("dd-MM-yyyy HH:mm:SS", "yyyy-MM-dd HH:mm:SS", "dd-MM-yyyy", "yyyy-MM-dd")
coalesce(formats.map(f => to_timestamp(col, f).cast(DateType)): _*)
}
val formattedDF = df.withColumn("dt", strToDate(df.col("ts")))
formattedDF.show()
+--------------------+----------+
| ts| dt|
+--------------------+----------+
|2020-04-21 10:43:...|2020-04-21|
| 20-04-2019 10:34:12|2019-04-20|
| 2020-05-21 21:32:43|2020-05-21|
| 20-04-2019|2019-04-20|
| 2020-04-21|2020-04-21|
+--------------------+----------+
Note: - This code assumes that data does not contain any column of format -> MM-dd-yyyy, MM-dd-yyyy HH:mm:SS