Pyspark - extract first Monday of week

Given dataframe:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data2 = [("James", 202245),
         ("Michael", 202133),
         ("Robert", 202152),
         ("Maria", 202252),
         ("Jen", 202201)
         ]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("Week", IntegerType(), True)
])

df = spark.createDataFrame(data=data2, schema=schema)
df.printSchema()
df.show(truncate=False)
The Week column encodes the year and week number, i.e. 202245 is the 45th week of 2022. I would like to extract the date on which the Monday of that week falls; for the 45th week of 2022 that is 7 Nov 2022.
What I tried:
I tried using datetime and a UDF:
import datetime

def get_monday_from_week(x: int) -> datetime.date:
    """
    Converts a fiscal week to the datetime of the first Monday of that week.
    Args:
        x (int): fiscal week, e.g. 202245
    Returns:
        datetime.date: datetime of the first Monday of that week
    """
    x = str(x)
    r = datetime.datetime.strptime(x + "-1", "%Y%W-%w")
    return r
How can I implement this with Spark functions? I am trying to avoid using a UDF here.
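One possible UDF-free sketch using built-in functions (this assumes the same %W week numbering as the Python snippet above, i.e. week 1 is the week starting on the year's first Monday):
from pyspark.sql import functions as F

result = (
    df
    # split 202245 into year (2022) and week number (45)
    .withColumn("year", (F.col("Week") / 100).cast("int"))
    .withColumn("week_no", (F.col("Week") % 100).cast("int"))
    # next_day() returns the first named day strictly after its argument,
    # so Dec 31 of the previous year gives the year's first Monday
    .withColumn(
        "first_monday",
        F.next_day(
            F.to_date(F.concat((F.col("year") - 1).cast("string"), F.lit("-12-31"))),
            "Mon",
        ),
    )
    # Monday of week N = first Monday + (N - 1) * 7 days
    .withColumn("monday", F.expr("date_add(first_monday, (week_no - 1) * 7)"))
)
result.show(truncate=False)
For 202245 this yields 2022-11-07, the expected 7 Nov 2022.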

Related

24h clock with from_unixtime

I need to transform a dataframe with a column of timestamps in Unix time (LongType) into actual TimestampType.
According to epochconverter.com:
1646732321 = March 8, 2022 10:38:41 GMT+1
1646768324 = March 8, 2022 20:38:44 GMT+1
However, when I use from_unixtime on the dataframe, I get a 12-hour clock, which effectively subtracts 12 hours from my second timestamp. How can I tell PySpark to use a 24-hour clock?
The output of the code below is:
+---+----------+-------------------+
|id |mytime    |mytime_new         |
+---+----------+-------------------+
|ABC|1646732321|2022-03-08 10:38:41|
|DFG|1646768324|2022-03-08 08:38:44|
+---+----------+-------------------+
The second line should be 2022-03-08 20:38:44.
Reproducible code example:
from pyspark.sql.functions import from_unixtime
from pyspark.sql.types import StructType, StructField, StringType, LongType

data = [
    ("ABC", 1646732321),
    ("DFG", 1646768324),
]
schema = StructType(
    [
        StructField("id", StringType(), True),
        StructField("mytime", LongType(), True),
    ]
)
df = spark.createDataFrame(data, schema)
df = df.withColumn(
    "mytime_new",
    from_unixtime(df["mytime"], "yyyy-MM-dd hh:mm:ss"),
)
df.show(10, False)
Found my mistake 3 minutes later: the issue was my timestamp format string for the hour (hh).
Instead of:
from_unixtime(df["mytime"], "yyyy-MM-dd hh:mm:ss"),
I needed:
from_unixtime(df["mytime"], "yyyy-MM-dd HH:mm:ss"),
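For reference, 'HH' is the 24-hour hour-of-day field while 'hh' is the 1-12 clock hour of the AM/PM cycle. A small sketch to see the difference (the exact output depends on your session time zone):
from pyspark.sql.functions import from_unixtime, lit

# 'HH' = hour-of-day (00-23), 'hh' = clock-hour-of-am-pm (01-12)
spark.range(1).select(
    from_unixtime(lit(1646768324), "yyyy-MM-dd HH:mm:ss").alias("h24"),
    from_unixtime(lit(1646768324), "yyyy-MM-dd hh:mm:ss a").alias("h12"),
).show(truncate=False)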

Is there a date format in Spark only for the year or month?

For a dataframe df with a column date_string - which represents a string like "20220331" - the following works perfectly:
df = df.withColumn("date",to_date(col("date_string"),"yyyymmdd"))
For "20220331" the column "date" of type date - just as required - now looks like this: 2022-03-31
I now want two columns "year" and "month" of type date. For "20220331" the column year should be 2022 and the column month should be 2022-03. The following does not work:
df = df.withColumn("year",to_date(col("date_string"),"yyyy")
.withColumn("month",to_date(col("date_string"),"yyyymm")))
Is it even possible in Spark to have something in the form of yyyy and yyyy-mm in the date type?
You can use date_format:
scala> Seq(1).toDF("seq").select(
     |   date_format(current_timestamp(), "yyyyMM")
     | ).show
+----------------------------------------+
|date_format(current_timestamp(), yyyyMM)|
+----------------------------------------+
|                                  202203|
+----------------------------------------+
Alternatively, if your date is stored as a string, you could just substring the values out.
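In PySpark, a possible sketch of the same idea (note that date_format returns strings, since a DATE value always carries a full year-month-day):
from pyspark.sql.functions import col, to_date, date_format

df = df.withColumn("date", to_date(col("date_string"), "yyyyMMdd"))
df = df.withColumn("year", date_format(col("date"), "yyyy"))      # e.g. "2022"
df = df.withColumn("month", date_format(col("date"), "yyyy-MM"))  # e.g. "2022-03"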

How to convert one time zone to another in Spark Dataframe

I am reading from PostgreSQL into a Spark dataframe and have a date column in PostgreSQL like below:
last_upd_date
---------------------
"2021-04-21 22:33:06.308639-05"
But in the Spark dataframe it adds an hour offset, e.g. 2021-04-22 03:33:06.308639.
Here it is adding 5 hours to the last_upd_date column.
But I want the output as 2021-04-21 22:33:06.308639
Can anyone help me fix this in the Spark dataframe?
You can create a UDF that formats the timestamp in the required time zone:
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions._

val formatTimestampWithTz = udf((i: Instant, zone: String) =>
  i.atZone(ZoneId.of(zone))
    .format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")))

val df = Seq("2021-04-21 22:33:06.308639-05").toDF("dateString")
  .withColumn("date", to_timestamp('dateString, "yyyy-MM-dd HH:mm:ss.SSSSSSx"))
  .withColumn("date in Berlin", formatTimestampWithTz('date, lit("Europe/Berlin")))
  .withColumn("date in Anchorage", formatTimestampWithTz('date, lit("America/Anchorage")))
  .withColumn("date in GMT-5", formatTimestampWithTz('date, lit("-5")))

df.show(numRows = 10, truncate = 50, vertical = true)
Result:
-RECORD 0------------------------------------------
 dateString        | 2021-04-21 22:33:06.308639-05
 date              | 2021-04-22 05:33:06.308639
 date in Berlin    | 2021-04-22 05:33:06.308639
 date in Anchorage | 2021-04-21 19:33:06.308639
 date in GMT-5     | 2021-04-21 22:33:06.308639
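A UDF-free alternative, if the goal is only to display the instant in one particular zone, is to set the session time zone before formatting. A minimal PySpark sketch, assuming America/Chicago as the UTC-5 source zone:
from pyspark.sql.functions import to_timestamp, date_format

df = spark.createDataFrame([("2021-04-21 22:33:06.308639-05",)], ["dateString"])
df = df.withColumn("date", to_timestamp("dateString", "yyyy-MM-dd HH:mm:ss.SSSSSSx"))

# timestamps are stored as instants; the session time zone only controls
# how they are rendered as strings
spark.conf.set("spark.sql.session.timeZone", "America/Chicago")
df.select(date_format("date", "yyyy-MM-dd HH:mm:ss.SSSSSS").alias("last_upd_date")).show(truncate=False)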

Split date into day of the week, month,year using Pyspark

I have very little experience with PySpark and I am trying, without success, to create 3 new columns from a column that contains the timestamp of each row.
The column containing the date has the following format: EEE MMM dd HH:mm:ss Z yyyy.
So it looks like this:
+--------------------+
| timestamp|
+--------------------+
|Fri Oct 18 17:07:...|
|Mon Oct 21 21:49:...|
|Thu Oct 31 18:03:...|
|Sun Oct 20 15:00:...|
|Mon Sep 30 23:35:...|
+--------------------+
The 3 columns have to contain: the day of the week as an integer (so 0 for Monday, 1 for Tuesday, ...), the number of the month, and the year.
What is the most effective way to create these 3 additional columns and append them to the PySpark dataframe? Thanks in advance!
Spark 1.5 and higher has many date processing functions. Here are some that may be useful for you:
from pyspark.sql.functions import col, dayofweek, month, year

df = df.withColumn('dayOfWeek', dayofweek(col('your_date_column')))
df = df.withColumn('month', month(col('your_date_column')))
df = df.withColumn('year', year(col('your_date_column')))
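If you need Monday = 0 as stated in the question, note that dayofweek() returns 1 for Sunday through 7 for Saturday, so one possible adjustment (assuming your_date_column is already a date or timestamp) is:
from pyspark.sql.functions import col, dayofweek

# shift dayofweek (1 = Sunday ... 7 = Saturday) so that Monday = 0, ..., Sunday = 6
df = df.withColumn('dayOfWeek', (dayofweek(col('your_date_column')) + 5) % 7)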

Adding float column to TimestampType column (seconds+miliseconds)

I am trying to add a float column to a TimestampType column in PySpark, but there does not seem to be a way to do this while maintaining the fractional seconds. An example of float_seconds is 19.702300786972046; an example of the timestamp is 2021-06-17 04:31:32.48761.
what I want:
calculated_df = beginning_df.withColumn("calculated_column", float_seconds_col + TimestampType_col)
I have tried the following methods, but neither completely solves the problem:
# method 1 adds a single fixed interval, but cannot be used to add an entire column to the timestamp column
calculated_df = beginning_df.withColumn("calculated_column", col("TimestampType_col") + F.expr('INTERVAL 19.702300786 seconds'))
# method 2 converts the float column to unix time, but cuts off the decimals (which are important)
timestamp_seconds = beginning_df.select(from_unixtime("float_seconds"))
[Image of the two columns in question]
You could achieve it using a UDF as follows:
from datetime import datetime, timedelta

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StructType, StructField, FloatType, TimestampType

spark = SparkSession \
    .builder \
    .appName("StructuredStreamTesting") \
    .getOrCreate()

schema = StructType([
    StructField('dt', TimestampType(), nullable=True),
    StructField('sec', FloatType(), nullable=True),
])

item1 = {
    "dt": datetime.fromtimestamp(1611859271.516),
    "sec": 19.702300786,
}
item2 = {
    "dt": datetime.fromtimestamp(1611859271.517),
    "sec": 19.702300787,
}
item3 = {
    "dt": datetime.fromtimestamp(1611859271.518),
    "sec": 19.702300788,
}

df = spark.createDataFrame([item1, item2, item3], schema=schema)
df.printSchema()

@udf(returnType=TimestampType())
def add_time(dt, sec):
    return dt + timedelta(seconds=sec)

df = df.withColumn("new_dt", add_time(col("dt"), col("sec")))
df.printSchema()
df.show(truncate=False)
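A UDF-free sketch of the same addition, relying on the fact that casting a timestamp to double yields epoch seconds including the fractional (microsecond) part:
from pyspark.sql.functions import col

# add the float seconds in the double domain, then cast back to timestamp;
# double precision is enough to keep microseconds at current epoch values
df = df.withColumn("new_dt_no_udf", (col("dt").cast("double") + col("sec")).cast("timestamp"))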
The timestamp data type supports nanoseconds (max 9 digits of precision). Your float_seconds_col has more than 9 fractional digits (15 in your example, i.e. femtoseconds), so that extra precision will be lost in any conversion to a timestamp anyway.
Plain vanilla Hive:
select
    timestamp(
        cast(concat(cast(unix_timestamp(TimestampType_col) as string),  -- seconds
                    '.',
                    regexp_extract(TimestampType_col, '\\.(\\d+)$'))    -- fractional part
             as decimal(30, 15))
        + float_seconds_col  -- round this value to nanos for a better timestamp conversion (round(float_seconds_col, 9))
    ) as result  -- max precision is 9 (nanoseconds)
from
(
    select 19.702300786972046 as float_seconds_col,
           timestamp('2021-06-17 04:31:32.48761') as TimestampType_col
) s
Result:
2021-06-17 04:31:52.189910786