Extract date from pySpark timestamp column (no UTC timezone) in Palantir - date

I have a timestamp of this form: 2022-11-09T23:19:32.000Z
When I cast it to a date, the output is "2022-11-10", but I want "2022-11-09". Is there a way to force UTC+0 (not +1), or to extract the date directly with a regex so that the timezone is not taken into account?
I have also tried substring('2022-11-09T23:19:32.000Z', 1, 10) and other string-extraction functions, but the output is the same: "2022-11-10".
Example:
Input
+---+------------------------+
|id |start_date              |
+---+------------------------+
|123|2020-04-10T23:55:19.000Z|
+---+------------------------+
My code:
df_output = df_input.withColumn('date', F.regexp_extract(F.col('start_date'), '(\\d{4})-(\\d{2})-(\\d{2})', 0))
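Wrong output
+---+------------------------+----------+
|id |start_date              |date      |
+---+------------------------+----------+
|123|2020-04-10T23:55:19.000Z|2020-04-11|
+---+------------------------+----------+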
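Desired output (I want to extract the date string from the timestamp without considering the timezone)
+---+------------------------+----------+
|id |start_date              |date      |
+---+------------------------+----------+
|123|2020-04-10T23:55:19.000Z|2020-04-10|
+---+------------------------+----------+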

Can't you use the to_date function? This works for me:
from datetime import datetime
from pyspark.sql.functions import to_date
from pyspark.sql.types import StructType, StructField, StringType, TimestampType
df = spark.createDataFrame(
[
(
"123",
datetime.strptime("2020-04-10T23:55:19.000Z", '%Y-%m-%dT%H:%M:%S.%fZ')
)
],
StructType([
StructField("id", StringType()),
StructField("start_date", TimestampType()),
]))
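# start_date is already a timestamp, so no format string is needed
# (Spark format strings use Java-style patterns such as "yyyy-MM-dd", not strftime codes)
df.withColumn("date", to_date("start_date")).show()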
Output:
+---+-------------------+----------+
| id| start_date| date|
+---+-------------------+----------+
|123|2020-04-10 23:55:19|2020-04-10|
+---+-------------------+----------+
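Note that Spark converts a timestamp to a date or string using the session time zone (the spark.sql.session.timeZone setting), which is why the same instant can land on different calendar days. The following is a minimal plain-Python sketch of that effect, not Spark code. The helper names utc_date and date_at_offset are hypothetical, and the +1 hour offset is an assumption taken from the question.

```python
from datetime import datetime, timedelta, timezone

ISO_Z_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_utc(iso_z: str) -> datetime:
    """Parse a 'Z'-suffixed ISO string into a timezone-aware UTC datetime."""
    return datetime.strptime(iso_z, ISO_Z_FORMAT).replace(tzinfo=timezone.utc)


def utc_date(iso_z: str) -> str:
    """Calendar date of the instant as seen in UTC."""
    return parse_utc(iso_z).date().isoformat()


def date_at_offset(iso_z: str, offset_hours: int) -> str:
    """Calendar date of the same instant as seen at a fixed UTC offset."""
    local_zone = timezone(timedelta(hours=offset_hours))
    return parse_utc(iso_z).astimezone(local_zone).date().isoformat()


raw = "2022-11-09T23:19:32.000Z"
print(utc_date(raw))           # date in UTC
print(date_at_offset(raw, 1))  # date in a UTC+1 zone (assumed session zone)
```

If the environment allows it, setting the session time zone to UTC before casting should give the UTC date. Whether that setting can be changed in a given Palantir environment is not something this sketch can confirm.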

Related

How to filter pyspark dataframe with last 14 days?

I have a date column in my dataframe.
I want to keep only the rows from the last 14 days, using the date column.
I tried the code below, but it's not working:
last_14 = df.filter((df('Date')> date_add(current_timestamp(), -14)).select("Event_Time","User_ID","Impressions","Clicks","URL", "Date")
Event_Time, User_ID, Impressions, Clicks and URL are my other columns.
Can anyone advise how to do this?
from pyspark.sql import functions as F, types as T
df = spark.createDataFrame(
[
('2022-03-10',),
('2022-03-09',),
('2022-03-08',),
('2022-02-02',),
('2022-02-01',)
], ['Date']
).withColumn('Date', F.to_date('Date', 'y-M-d'))
df\
.filter((F.col('Date') > F.date_sub(F.current_date(), 14)))\
.show()
+----------+
| Date|
+----------+
|2022-03-10|
|2022-03-09|
|2022-03-08|
+----------+
In your code it would be:
last_14 = df.filter((F.col('Date') > F.date_sub(F.current_date(), 14))).select("Event_Time","User_ID","Impressions","Clicks","URL", "Date")
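The comparison behind that filter is plain date arithmetic: keep a row when its date is strictly later than today minus 14 days. Here is a small standard-library sketch of the same rule. The function name within_last_days and the fixed "today" of 2022-03-15 are assumptions made so the example is reproducible.

```python
from datetime import date, timedelta


def within_last_days(day: date, today: date, days: int = 14) -> bool:
    """True when `day` is strictly later than `today - days` (same rule as the filter above)."""
    return day > today - timedelta(days=days)


rows = [
    date(2022, 3, 10),
    date(2022, 3, 9),
    date(2022, 3, 8),
    date(2022, 2, 2),
    date(2022, 2, 1),
]

today = date(2022, 3, 15)  # assumed reference date, stands in for current_date()
recent = [d for d in rows if within_last_days(d, today)]
print([d.isoformat() for d in recent])
```

Because the comparison is strict, a row dated exactly 14 days before the reference date is excluded. Use >= if that day should be kept.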

Convert event time into date and time in Pyspark?

I have an Event_Time column in my data frame.
I would like to convert Event_Time into a date/time. I used the code below, but the result is not coming out properly:
import pyspark.sql.functions as f
df = df.withColumn("date", f.from_unixtime("Event_Time", "dd/MM/yyyy HH:MM:SS"))
df.show()
The output I get is not correct.
Can anyone advise how to do this properly? I am new to PySpark.
It seems your data is in microseconds (1/1,000,000 of a second), so you have to divide by 1,000,000 to get the seconds that from_unixtime expects:
df = spark.createDataFrame(
[
('1645904274665267',),
('1645973845823770',),
('1644134156697560',),
('1644722868485010',),
('1644805678702121',),
('1645071502180365',),
('1644220446396240',),
('1645736052650785',),
('1646006645296010',),
('1644544811297016',),
('1644614023559317',),
('1644291365608571',),
('1645643575551339',)
], ['Event_Time']
)
import pyspark.sql.functions as f
df = df.withColumn("date", f.from_unixtime(f.col("Event_Time")/1000000))
df.show(truncate = False)
Output:
+----------------+-------------------+
|Event_Time |date |
+----------------+-------------------+
|1645904274665267|2022-02-26 20:37:54|
|1645973845823770|2022-02-27 15:57:25|
|1644134156697560|2022-02-06 08:55:56|
|1644722868485010|2022-02-13 04:27:48|
|1644805678702121|2022-02-14 03:27:58|
|1645071502180365|2022-02-17 05:18:22|
|1644220446396240|2022-02-07 08:54:06|
|1645736052650785|2022-02-24 21:54:12|
|1646006645296010|2022-02-28 01:04:05|
|1644544811297016|2022-02-11 03:00:11|
|1644614023559317|2022-02-11 22:13:43|
|1644291365608571|2022-02-08 04:36:05|
|1645643575551339|2022-02-23 20:12:55|
+----------------+-------------------+
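The same conversion can be checked outside Spark. The sketch below uses only the Python standard library and a hypothetical helper micros_to_utc. It splits the microsecond value into whole seconds and a remainder, so no floating-point rounding is involved. It also uses minutes and seconds in the output format, which is where the question's "HH:MM:SS" pattern goes wrong (in Spark patterns MM is the month and SS is fractional seconds, so the equivalent there would be "dd/MM/yyyy HH:mm:ss").

```python
from datetime import datetime, timezone


def micros_to_utc(micros: int) -> datetime:
    """Convert microseconds since the Unix epoch to a timezone-aware UTC datetime."""
    seconds, remainder = divmod(micros, 1_000_000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=remainder)


event_time = 1645904274665267  # first value from the sample data above
moment = micros_to_utc(event_time)
print(moment.strftime("%d/%m/%Y %H:%M:%S"))
```

This gives the UTC wall-clock time. The table above shows values one hour later, which suggests it was produced with a UTC+1 session time zone.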

Convert string (with timestamp) to timestamp in pyspark

I have a dataframe with a string datetime column.
When I convert it to a timestamp, the values change.
Here is my code. Can anyone help me convert it without changing the values?
from pyspark.sql.functions import to_timestamp
df = spark.createDataFrame(
    data=[("1", "2020-04-06 15:06:16 +00:00")],
    schema=["id", "input_timestamp"])
df.printSchema()
# Convert the timestamp string to TimestampType
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))
# Cast the TimestampType column back to a string
df.withColumn('timestamp_string', \
    to_timestamp('timestamp').cast('string')) \
    .show(truncate=False)
This is the output:
+---+--------------------------+-------------------+-------------------+
|id |input_timestamp |timestamp |timestamp_string |
+---+--------------------------+-------------------+-------------------+
|1 |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+
I want to know why the hour changes from 15 to 8, and how I can prevent it.
I believe to_timestamp is converting the timestamp value to your local time because your data carries a +00:00 offset.
Try passing the format to the to_timestamp() function, so that the offset is matched as literal text.
Example:
from pyspark.sql.functions import col, to_timestamp
df.withColumn("timestamp",to_timestamp(col("input_timestamp"),"yyyy-MM-dd HH:mm:ss +00:00")).show(10,False)
#+---+--------------------------+-------------------+
#|id |input_timestamp |timestamp |
#+---+--------------------------+-------------------+
#|1 |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
Alternatively, you can use to_utc_timestamp:
from pyspark.sql.functions import to_utc_timestamp
df = spark.createDataFrame(
data=[('1', '2020-04-06 15:06:16 +00:00')],
schema=['id', 'input_timestamp'])
df.printSchema()
df = df.withColumn('timestamp', to_utc_timestamp('input_timestamp',
your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)
Replace your_local_timezone with your actual time zone ID as a string (for example 'America/Los_Angeles').
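To see why the hour moves from 15 to 8, it helps to separate the instant from the way it is displayed. The plain-Python sketch below parses the same string with its offset, then shows it at a fixed UTC-7 offset. That offset is an assumption inferred from the 7-hour shift in the question's output. It then shows the wall-clock value with the offset dropped, which is what the format-based answer effectively keeps.

```python
from datetime import datetime, timedelta, timezone

raw = "2020-04-06 15:06:16 +00:00"

# %z understands the "+00:00" offset, so the result is a timezone-aware instant.
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S %z")

# The same instant displayed at UTC-7 (assumed local zone of the asker).
local_view = parsed.astimezone(timezone(timedelta(hours=-7)))

# The original wall-clock reading, with the offset discarded.
wall_clock = parsed.replace(tzinfo=None)

print(local_view.strftime("%Y-%m-%d %H:%M:%S"))
print(wall_clock.strftime("%Y-%m-%d %H:%M:%S"))
```

Both values describe the same moment. Only the second keeps the digits exactly as they appear in the input string.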

How to convert one time zone to another in Spark Dataframe

I am reading from PostgreSQL into Spark Dataframe and have date column in PostgreSQL like below:
last_upd_date
---------------------
"2021-04-21 22:33:06.308639-05"
But in the Spark dataframe the hour is shifted,
e.g. 2021-04-22 03:33:06.308639
Here 5 hours are added to the last_upd_date column.
But I want the output to be 2021-04-21 22:33:06.308639
Can anyone help me fix this in the Spark dataframe?
You can create a UDF that formats the timestamp in the required time zone:
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.{lit, to_timestamp, udf}
import spark.implicits._
// Render an instant as a string in the given zone (region ID or offset)
val formatTimestampWithTz = udf((i: Instant, zone: String) =>
  i.atZone(ZoneId.of(zone))
    .format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")))
val df = Seq(("2021-04-21 22:33:06.308639-05")).toDF("dateString")
  .withColumn("date", to_timestamp('dateString, "yyyy-MM-dd HH:mm:ss.SSSSSSx"))
  .withColumn("date in Berlin", formatTimestampWithTz('date, lit("Europe/Berlin")))
  .withColumn("date in Anchorage", formatTimestampWithTz('date, lit("America/Anchorage")))
  .withColumn("date in GMT-5", formatTimestampWithTz('date, lit("-5")))
df.show(numRows = 10, truncate = 50, vertical = true)
Result:
-RECORD 0------------------------------------------
dateString | 2021-04-21 22:33:06.308639-05
date | 2021-04-22 05:33:06.308639
date in Berlin | 2021-04-22 05:33:06.308639
date in Anchorage | 2021-04-21 19:33:06.308639
date in GMT-5 | 2021-04-21 22:33:06.308639
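The result above can be cross-checked with a short standard-library Python sketch. It uses fixed UTC offsets instead of region names so it does not depend on a time zone database. The offsets +2 for Berlin and -8 for Anchorage are assumptions that hold for this April date (daylight saving time), and format_at_offset is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

OUTPUT_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def format_at_offset(instant: datetime, offset_hours: int) -> str:
    """Format a timezone-aware instant as wall-clock time at a fixed UTC offset."""
    zone = timezone(timedelta(hours=offset_hours))
    return instant.astimezone(zone).strftime(OUTPUT_FORMAT)


# "2021-04-21 22:33:06.308639-05" from the question, built explicitly with its offset.
instant = datetime(2021, 4, 21, 22, 33, 6, 308639, tzinfo=timezone(timedelta(hours=-5)))

for label, offset in [("UTC", 0), ("Berlin", 2), ("Anchorage", -8), ("GMT-5", -5)]:
    print(f"{label:<10} {format_at_offset(instant, offset)}")
```

The UTC line reproduces the 03:33 value the asker sees, and the GMT-5 line reproduces the value they want, so the stored instant is correct and only the display zone differs.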

How can I split a timestamp into date and time?

//loading DF
val df1 = spark.read.option("header",true).option("inferSchema",true).csv("time.csv ")
//
+-------------+
|    date_time|
+-------------+
|1545905416000|
+-------------+
When I cast the column to DateType, I get an error
=> the data type does not match (date_time: bigint) in df
df1.withColumn("date_time", df1("date").cast(DateType)).show()
Is there any solution for this?
I tried
val a = df1.withColumn("date_time",df1("date").cast(StringType)).drop("date").toDF()
a.withColumn("fomatedDateTime",a("date_time").cast(DateType)).show()
but it does not work.
Welcome to StackOverflow!
You need to convert the timestamp from epoch format to date and then do the computation. You can try this:
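import org.apache.spark.sql.functions.{date_format, from_unixtime, substring, to_date}
import spark.implicits._
val df = spark.read.option("header",true).option("inferSchema",true).csv("time.csv ")
// date_time holds epoch milliseconds; from_unixtime expects seconds
val df1 = df
  .withColumn(
    "dateCreated",
    date_format(
      to_date(
        // the first 10 characters of "yyyy-MM-dd HH:mm:ss" are the date
        substring(from_unixtime($"date_time".divide(1000)), 1, 10),
        "yyyy-MM-dd"),
      "dd-MM-yyyy"))
  .withColumn(
    "timeCreated",
    // characters 12 to 19 are the time; position 11 is the separating space
    substring(from_unixtime($"date_time".divide(1000)), 12, 8))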
Sample data from my use case:
+---------+-------------+--------+-----------+-----------+
| adId| date_time| price|dateCreated|timeCreated|
+---------+-------------+--------+-----------+-----------+
|230010452|1469178808000| 5950.0| 22-07-2016| 14:43:28|
|230147621|1469456306000| 19490.0| 25-07-2016| 19:48:26|
|229662644|1468546792000| 12777.0| 15-07-2016| 07:09:52|
|229218611|1467815284000| 9996.0| 06-07-2016| 19:58:04|
|229105894|1467656022000| 7700.0| 04-07-2016| 23:43:42|
|230214681|1469559471000| 4600.0| 27-07-2016| 00:27:51|
|230158375|1469469248000| 999.0| 25-07-2016| 23:24:08|
+---------+-------------+--------+-----------+-----------+
Note that from_unixtime renders the result in your own time zone by default (for me it's GMT+05:30), so you may need to adjust the time. Hope it helps.
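The split itself is easy to verify outside Spark. Below is a minimal standard-library Python sketch with a hypothetical helper split_epoch_millis. It turns epoch milliseconds into a date string and a time string at a chosen UTC offset. The 5.5-hour offset mirrors the GMT+05:30 zone mentioned above.

```python
from datetime import datetime, timedelta, timezone


def split_epoch_millis(millis: int, offset_hours: float = 0.0) -> tuple:
    """Return (dd-MM-yyyy date, HH:mm:ss time) for epoch milliseconds at a fixed UTC offset."""
    zone = timezone(timedelta(hours=offset_hours))
    moment = datetime.fromtimestamp(millis // 1000, tz=zone)
    return moment.strftime("%d-%m-%Y"), moment.strftime("%H:%M:%S")


sample = 1469178808000  # first date_time value from the sample data above
print(split_epoch_millis(sample, 5.5))  # as displayed at GMT+05:30
print(split_epoch_millis(sample))       # the same instant in UTC
```

The first call matches the first row of the sample table, and the second shows how the time part changes once the offset is removed.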