24h clock with from_unixtime - pyspark

I need to transform a dataframe with a column of timestamps in Unixtime/LongType-Format to actual TimestampType.
According to epochconverter.com:
1646732321 = 8 March 2022, 10:38:41 GMT+1
1646768324 = 8 March 2022, 20:38:44 GMT+1
However, when I use from_unixtime on the dataframe, I get a 12-hour clock, which effectively subtracts 12 hours from my second timestamp. How can I tell PySpark to use a 24-hour clock?
The output of the code below is:
+---+----------+-------------------+
|id |mytime    |mytime_new         |
+---+----------+-------------------+
|ABC|1646732321|2022-03-08 10:38:41|
|DFG|1646768324|2022-03-08 08:38:44|
+---+----------+-------------------+
The second line should be 2022-03-08 20:38:44.
Reproducible code example:
from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql.functions import from_unixtime

data = [
    ("ABC", 1646732321),
    ("DFG", 1646768324),
]
schema = StructType(
    [
        StructField("id", StringType(), True),
        StructField("mytime", LongType(), True),
    ]
)
df = spark.createDataFrame(data, schema)
df = df.withColumn(
    "mytime_new",
    from_unixtime(df["mytime"], "yyyy-MM-dd hh:mm:ss"),
)
df.show(10, False)

Found my mistake 3 minutes later... the issue was the hour pattern in my timestamp format string (hh):
Instead of:
from_unixtime(df["mytime"], "yyyy-MM-dd hh:mm:ss"),
I needed:
from_unixtime(df["mytime"], "yyyy-MM-dd HH:mm:ss"),
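
For anyone comparing the two patterns, here is a minimal sketch using the dataframe from the question above: hh is the 12-hour clock hour (1-12), HH is the hour of day (0-23).

from pyspark.sql.functions import from_unixtime

df.select(
    "mytime",
    # hh: clock hour 1-12, shown with am/pm for clarity
    from_unixtime("mytime", "yyyy-MM-dd hh:mm:ss a").alias("twelve_hour"),
    # HH: hour of day 0-23
    from_unixtime("mytime", "yyyy-MM-dd HH:mm:ss").alias("twenty_four_hour"),
).show(truncate=False)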

Related

Multiple formats in Date Time column in Spark

I am using Spark 3.0.1.
I have the following data as CSV:
348702330256514,37495066290,9084849,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,330148,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,136052,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,4310362,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,9097094,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,2291118,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,4900011,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,633447,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,6259303,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,369067,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,1193207,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,9335696,33946,614677375609919,11-02-2018 0:00:00,GENUINE
As you can see, the second-to-last column holds timestamp data in which the hour appears with either one or two digits, depending on the hour of the day (this is sample data; not all records have an all-zero time part).
This is the problem, and I tried to solve it as follows:
Read the column as a string, and then use a column method to convert it to TimestampType.
val schema = StructType(
  List(
    StructField("_corrupt_record", StringType),
    StructField("card_id", LongType),
    StructField("member_id", LongType),
    StructField("amount", IntegerType),
    StructField("postcode", IntegerType),
    StructField("pos_id", LongType),
    StructField("transaction_dt", StringType),
    StructField("status", StringType)
  )
)

// format the timestamp column
def format_time_column(
    timeStampCol: Column,
    formats: Seq[String] = Seq(
      "dd-MM-yyyy HH:mm:ss",
      "dd-MM-yyyy H:mm:ss",
      "dd-MM-yyyy HH:m:ss",
      "dd-MM-yyyy H:m:ss"
    )
) = {
  coalesce(
    formats.map(f => to_timestamp(timeStampCol, f)): _*
  )
}
val cardTransaction = spark.read
  .format("csv")
  .option("header", false)
  .schema(schema)
  .option("path", cardTransactionFilePath)
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .load
  .withColumn("transaction_dt", format_time_column(col("transaction_dt")))

cardTransaction.cache()
cardTransaction.show(5)
This code produces an error.
Note: the failing record has only one digit for the hour.
Whichever format comes first in the list of formats is the only one that works; the remaining formats are never considered.
The problem is that to_timestamp() throws an exception instead of producing null, as coalesce() expects, when it encounters a value that does not match the format.
How can I solve this?
In Spark 3.0, pattern strings are defined in Spark's own Datetime Patterns for Formatting and Parsing, which is implemented via DateTimeFormatter under the hood.
In Spark version 2.4 and below, java.text.SimpleDateFormat was used for timestamp/date string conversions, and the supported patterns are described in SimpleDateFormat.
The old behavior can be restored by setting spark.sql.legacy.timeParserPolicy to LEGACY:
sparkConf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
Doc:
sql-migration-guide.html#query-engine
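
For reference, a minimal PySpark sketch of the same idea (not from the original answer): with the legacy parser policy enabled, to_timestamp returns null for a value that does not match the given format instead of raising, so coalesce over several formats works. The dataframe and column names are assumed to match the question, with transaction_dt read as a string.

from pyspark.sql import functions as F

# Restore the pre-3.0 parser so unparseable values become null instead of raising
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

formats = ["dd-MM-yyyy HH:mm:ss", "dd-MM-yyyy H:mm:ss"]
card_transaction = card_transaction.withColumn(
    "transaction_dt",
    F.coalesce(*[F.to_timestamp(F.col("transaction_dt"), fmt) for fmt in formats]),
)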

PySpark - extract first Monday of week

Given dataframe:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [
    ("James", 202245),
    ("Michael", 202133),
    ("Robert", 202152),
    ("Maria", 202252),
    ("Jen", 202201),
]
schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("Week", IntegerType(), True),
])
df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)
The Week column encodes the year and week number, i.e. 202245 is the 45th week of 2022. I would like to extract the date on which the Monday of that week falls; for the 45th week of 2022 that is 7 Nov 2022.
What I tried:
I tried using datetime and a UDF:
import datetime

def get_monday_from_week(x: int) -> datetime.date:
    """
    Converts a fiscal week to the datetime of the Monday of that week.

    Args:
        x (int): fiscal week, e.g. 202245

    Returns:
        datetime.date: datetime of the Monday of that week
    """
    x = str(x)
    r = datetime.datetime.strptime(x + "-1", "%Y%W-%w")
    return r
How can I implement this with Spark functions? I am trying to avoid using a UDF here.
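
No accepted answer is quoted here, but one possible UDF-free sketch (my own, with illustrative helper column names yr, wk, first_monday) mirrors the %W/%w logic above: find the first Monday of the year, then add (week - 1) * 7 days.

from pyspark.sql import functions as F

df_mondays = (
    df
    .withColumn("yr", (F.col("Week") / 100).cast("int"))
    .withColumn("wk", (F.col("Week") % 100).cast("int"))
    # first Monday of the year: the next Monday on or after 1 January
    .withColumn(
        "first_monday",
        F.next_day(
            F.date_sub(F.to_date(F.concat(F.col("yr").cast("string"), F.lit("-01-01"))), 1),
            "Mon",
        ),
    )
    # Monday of week N = first Monday + (N - 1) * 7 days
    .withColumn("monday", F.expr("date_add(first_monday, (wk - 1) * 7)"))
    .drop("yr", "wk", "first_monday")
)
df_mondays.show()   # e.g. 202245 -> 2022-11-07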

Extract date from pySpark timestamp column (no UTC timezone) in Palantir

I have a timestamp of this type: 2022-11-09T23:19:32.000Z
When I cast it to date, my output is "2022-11-10", but I want "2022-11-09". Is there a way to force UTC+0 (not +1), or to extract the date directly with a regex so that only the date is kept, without considering the timezone?
I have also tried substring('2022-11-09T23:19:32.000Z', 1, 10) and other string-extraction functions, but my output is the same: "2022-11-10".
Example:
Input:

+---+------------------------+
|id |start_date              |
+---+------------------------+
|123|2020-04-10T23:55:19.000Z|
+---+------------------------+
My code:

df_output = df_input.withColumn('date', F.regexp_extract(F.col('start_date'), '(\\d{4})-(\\d{2})-(\\d{2})', 0))
Wrong output:

+---+------------------------+----------+
|id |start_date              |date      |
+---+------------------------+----------+
|123|2020-04-10T23:55:19.000Z|2020-04-11|
+---+------------------------+----------+
Desired output (I want to extract the date from the timestamp without considering the timezone):

+---+------------------------+----------+
|id |start_date              |date      |
+---+------------------------+----------+
|123|2020-04-10T23:55:19.000Z|2020-04-10|
+---+------------------------+----------+
Can't you use the to_date function? This here works for me:
from datetime import datetime
from pyspark.sql.functions import to_date
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

df = spark.createDataFrame(
    [
        (
            "123",
            datetime.strptime("2020-04-10T23:55:19.000Z", '%Y-%m-%dT%H:%M:%S.%fZ'),
        )
    ],
    StructType([
        StructField("id", StringType()),
        StructField("start_date", TimestampType()),
    ]),
)
# no format argument needed: start_date is already a TimestampType column
df.withColumn("date", to_date("start_date")).show()
Output:
+---+-------------------+----------+
| id| start_date| date|
+---+-------------------+----------+
|123|2020-04-10 23:55:19|2020-04-10|
+---+-------------------+----------+
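
If the session timezone is what shifts the date (as the question suggests), a sketch that is not from the original answer but uses standard Spark configuration: render dates in UTC by setting the session timezone before converting. It assumes df_input has a TimestampType start_date column, as in the question.

from pyspark.sql import functions as F

# Dates and timestamps are rendered in the session timezone,
# so force UTC to keep the calendar date of the original instant
spark.conf.set("spark.sql.session.timeZone", "UTC")

df_output = df_input.withColumn("date", F.to_date(F.col("start_date")))
df_output.show(truncate=False)   # 2020-04-10T23:55:19Z -> 2020-04-10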

Convert event time into date and time in Pyspark?

I have the event_time below in my data frame.
I would like to convert the event_time into date/time. I used the code below, but the result does not come out properly.
import pyspark.sql.functions as f
df = df.withColumn("date", f.from_unixtime("Event_Time", "dd/MM/yyyy HH:MM:SS"))
df.show()
The output I am getting is not correct.
Can anyone advise how to do this properly, as I am new to PySpark?
It seems that your data is in microseconds (1/1,000,000 of a second), so you have to divide by 1,000,000:
df = spark.createDataFrame(
    [
        ('1645904274665267',),
        ('1645973845823770',),
        ('1644134156697560',),
        ('1644722868485010',),
        ('1644805678702121',),
        ('1645071502180365',),
        ('1644220446396240',),
        ('1645736052650785',),
        ('1646006645296010',),
        ('1644544811297016',),
        ('1644614023559317',),
        ('1644291365608571',),
        ('1645643575551339',),
    ], ['Event_Time']
)
import pyspark.sql.functions as f
df = df.withColumn("date", f.from_unixtime(f.col("Event_Time")/1000000))
df.show(truncate = False)
Output:
+----------------+-------------------+
|Event_Time |date |
+----------------+-------------------+
|1645904274665267|2022-02-26 20:37:54|
|1645973845823770|2022-02-27 15:57:25|
|1644134156697560|2022-02-06 08:55:56|
|1644722868485010|2022-02-13 04:27:48|
|1644805678702121|2022-02-14 03:27:58|
|1645071502180365|2022-02-17 05:18:22|
|1644220446396240|2022-02-07 08:54:06|
|1645736052650785|2022-02-24 21:54:12|
|1646006645296010|2022-02-28 01:04:05|
|1644544811297016|2022-02-11 03:00:11|
|1644614023559317|2022-02-11 22:13:43|
|1644291365608571|2022-02-08 04:36:05|
|1645643575551339|2022-02-23 20:12:55|
+----------------+-------------------+
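
A side note, not part of the original answer: the pattern from the question, dd/MM/yyyy HH:MM:SS, mixes up fields (in Spark datetime patterns MM is the month and SS is the fraction of a second; minutes and seconds are mm and ss). If an explicit output format is wanted, something like the following should work, reusing the microsecond division above:

import pyspark.sql.functions as f

df = df.withColumn(
    "date",
    f.from_unixtime(f.col("Event_Time") / 1000000, "dd/MM/yyyy HH:mm:ss"),
)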

How can I split a timestamp into date and time?

// loading the DF
val df1 = spark.read.option("header", true).option("inferSchema", true).csv("time.csv")

+-------------+
|    date_time|
+-------------+
|1545905416000|
+-------------+
When I use cast to change the column value to DateType, it shows an error because the data type does not match (date_time is bigint in the df):

df1.withColumn("date_time", df1("date").cast(DateType)).show()

Any solution for solving it?
I tried doing:

val a = df1.withColumn("date_time", df1("date").cast(StringType)).drop("date").toDF()
a.withColumn("fomatedDateTime", a("date_time").cast(DateType)).show()

but it does not work.
Welcome to StackOverflow!
You need to convert the timestamp from epoch format to date and then do the computation. You can try this:
import spark.implicits._

val df = spark.read.option("header", true).option("inferSchema", true).csv("time.csv")

val df1 = df
  .withColumn(
    "dateCreated",
    date_format(
      to_date(
        substring(
          from_unixtime($"date_time".divide(1000)),
          0,
          10
        ),
        "yyyy-MM-dd"
      ),
      "dd-MM-yyyy"
    )
  )
  .withColumn(
    "timeCreated",
    substring(
      from_unixtime($"date_time".divide(1000)),
      11,
      19
    )
  )
Sample data from my usecase:
+---------+-------------+--------+-----------+-----------+
| adId| date_time| price|dateCreated|timeCreated|
+---------+-------------+--------+-----------+-----------+
|230010452|1469178808000| 5950.0| 22-07-2016| 14:43:28|
|230147621|1469456306000| 19490.0| 25-07-2016| 19:48:26|
|229662644|1468546792000| 12777.0| 15-07-2016| 07:09:52|
|229218611|1467815284000| 9996.0| 06-07-2016| 19:58:04|
|229105894|1467656022000| 7700.0| 04-07-2016| 23:43:42|
|230214681|1469559471000| 4600.0| 27-07-2016| 00:27:51|
|230158375|1469469248000| 999.0| 25-07-2016| 23:24:08|
+---------+-------------+--------+-----------+-----------+
You may need to adjust the time: by default it is rendered in your timezone (for me it's GMT+05:30). Hope it helps.
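
For the PySpark readers of this thread, an equivalent sketch of the same idea (my own, assuming date_time holds epoch milliseconds as in the sample data):

from pyspark.sql import functions as F

df = spark.read.option("header", True).option("inferSchema", True).csv("time.csv")

df1 = (
    df
    # epoch milliseconds -> seconds -> timestamp
    .withColumn("ts", F.from_unixtime(F.col("date_time") / 1000).cast("timestamp"))
    .withColumn("dateCreated", F.date_format("ts", "dd-MM-yyyy"))
    .withColumn("timeCreated", F.date_format("ts", "HH:mm:ss"))
    .drop("ts")
)
df1.show(truncate=False)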