How to get the 1st day of the year in PySpark

I have a date variable that I need to pass to various functions.
For example, if I have the date in a variable as 12/09/2021, it should return 01/01/2021.
How do I get the 1st day of the year in PySpark?

You can use the trunc function, which truncates parts of a date.
import pyspark.sql.functions as f

df = spark.createDataFrame([()], [])
(
df
.withColumn('current_date', f.current_date())
.withColumn("year_start", f.trunc("current_date", "year"))
.show()
)
# Output
+------------+----------+
|current_date|year_start|
+------------+----------+
| 2022-02-23|2022-01-01|
+------------+----------+
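If the date arrives as a string variable in MM/dd/yyyy format, as in the question, a minimal sketch combining to_date, trunc and date_format might look like this (column names are illustrative, and an active SparkSession named spark is assumed):
import pyspark.sql.functions as f

date_str = '12/09/2021'  # MM/dd/yyyy, as in the question
df = spark.createDataFrame([(date_str,)], ['date_str'])
(
    df
    .withColumn('date', f.to_date('date_str', 'MM/dd/yyyy'))                  # parse the MM/dd/yyyy string into a Date
    .withColumn('year_start', f.trunc('date', 'year'))                        # truncate to January 1st of that year
    .withColumn('year_start_str', f.date_format('year_start', 'MM/dd/yyyy'))  # format back to MM/dd/yyyy if needed
    .show()
)
# Output
# +----------+----------+----------+--------------+
# |  date_str|      date|year_start|year_start_str|
# +----------+----------+----------+--------------+
# |12/09/2021|2021-12-09|2021-01-01|    01/01/2021|
# +----------+----------+----------+--------------+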

If you only need the string, plain Python slicing also works, since the year is just the last four characters:
x = '12/09/2021'
'01/01/' + x[-4:]
# Output: '01/01/2021'

You can achieve this with date_trunc combined with to_date, since date_trunc returns a Timestamp rather than a Date.
Data Preparation
import pandas as pd
import pyspark.sql.functions as F

df = pd.DataFrame({
    'Date': ['2021-01-23', '2002-02-09', '2009-09-19'],
})
sparkDF = sql.createDataFrame(df)  # sql is the SparkSession (or SQLContext)
sparkDF.show()
+----------+
| Date|
+----------+
|2021-01-23|
|2002-02-09|
|2009-09-19|
+----------+
Date Trunc & To Date
sparkDF = sparkDF.withColumn('first_day_year_dt',F.to_date(F.date_trunc('year',F.col('Date')),'yyyy-MM-dd'))\
.withColumn('first_day_year_timestamp',F.date_trunc('year',F.col('Date')))
sparkDF.show()
+----------+-----------------+------------------------+
| Date|first_day_year_dt|first_day_year_timestamp|
+----------+-----------------+------------------------+
|2021-01-23| 2021-01-01| 2021-01-01 00:00:00|
|2002-02-09| 2002-01-01| 2002-01-01 00:00:00|
|2009-09-19| 2009-01-01| 2009-01-01 00:00:00|
+----------+-----------------+------------------------+

Related

Convert String to Timestamp in Spark (Hive) and the datetime is invalid

I'm trying to convert a String to a timestamp, but in my time zone, the last Sunday of March between 2:00 AM and 3:00 AM does not exist, so the conversion returns null. Example:
scala> spark.sql("select to_timestamp('20220327 021500', 'yyyyMMdd HHmmss') from *").show(1)
+--------------------------------------------------+
|to_timestamp('20220327 021500', 'yyyyMMdd HHmmss')|
+--------------------------------------------------+
| null|
+--------------------------------------------------+
only showing top 1 row
scala> spark.sql("select to_timestamp('20220327 031500', 'yyyyMMdd HHmmss') from *").show(1)
+--------------------------------------------------+
|to_timestamp('20220327 031500', 'yyyyMMdd HHmmss')|
+--------------------------------------------------+
| 2022-03-27 03:15:00|
+--------------------------------------------------+
only showing top 1 row
A solution might be to add one hour for times between 2:00 AM and 3:00 AM on those days, but I don't know how to implement this.
I can't change the data source.
What can I do?
Thanks
EDIT
The official documentation says:
In Spark 3.1, from_unixtime, unix_timestamp, to_unix_timestamp,
to_timestamp and to_date will fail if the specified datetime pattern
is invalid. In Spark 3.0 or earlier, they result NULL. (1)
Let's consider the following dataframe with a column called ts.
val df = Seq("20220327 021500", "20220327 031500", "20220327 011500").toDF("ts")
In Spark 3.1+, we can use to_timestamp, which automatically adds one hour in your situation.
df.withColumn("time", to_timestamp($"ts", "yyyyMMdd HHmmss")).show
+---------------+-------------------+
| ts| time|
+---------------+-------------------+
|20220327 021500|2022-03-27 03:15:00|
|20220327 031500|2022-03-27 03:15:00|
|20220327 011500|2022-03-27 01:15:00|
+---------------+-------------------+
In Spark 3.0 and 2.4.7, we obtain this:
df.withColumn("time", to_timestamp($"ts", "yyyyMMdd HHmmss")).show
+---------------+-------------------+
| ts| time|
+---------------+-------------------+
|20220327 021500| null|
|20220327 031500|2022-03-27 03:15:00|
|20220327 011500|2022-03-27 01:15:00|
+---------------+-------------------+
But strangely, in Spark 2.4.7, to_utc_timestamp works the same way as to_timestamp does in later versions. The only problem is that we cannot use a custom date format. Yet, if we convert the date ourselves, we can obtain this:
df.withColumn("ts", concat(
substring('ts, 0, 4), lit("-"),
substring('ts, 5, 2), lit("-"),
substring('ts, 7, 5), lit(":"),
substring('ts,12,2), lit(":"),
substring('ts,14,2))
)
.withColumn("time", to_utc_timestamp('ts, "UTC"))
.show
+-------------------+-------------------+
| ts| time|
+-------------------+-------------------+
|2022-03-27 02:15:00|2022-03-27 03:15:00|
|2022-03-27 03:15:00|2022-03-27 03:15:00|
|2022-03-27 01:15:00|2022-03-27 01:15:00|
+-------------------+-------------------+
I found two different solutions; in both you have to change the string format to yyyy-MM-dd HH:mm:ss:
select cast(CONCAT(SUBSTR('20220327021000',0,4),'-',
            SUBSTR('20220327021000',5,2),'-',
            SUBSTR('20220327021000',7,2),' ',
            SUBSTR('20220327021000',9,2),':',
            SUBSTR('20220327021000',11,2),':',
            SUBSTR('20220327021000',13,2)) as timestamp);
select
cast(
regexp_replace("20220327031005",
'^(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})$','$1-$2-$3 $4:$5:$6'
) as timestamp);
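For anyone working with the DataFrame API instead of SQL, the same regexp_replace rewrite can be sketched in PySpark (the column name ts and the sample values are assumptions; the behaviour of the final cast is the same as in the SQL solutions above):
import pyspark.sql.functions as F

df = spark.createDataFrame([("20220327021500",), ("20220327031500",)], ["ts"])
fixed = df.withColumn(
    "time",
    F.regexp_replace(
        "ts",
        r"^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})$",
        "$1-$2-$3 $4:$5:$6",
    ).cast("timestamp")  # cast the reformatted string to a timestamp, as in the SQL above
)
fixed.show()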

Converting string time to day timestamp

I have just started working with PySpark, and I need some help converting a column's datatype.
My dataframe has a string column that stores the time of day in AM/PM, and I need to convert it into a datetime for further processing/analysis.
fd = spark.createDataFrame([(['0143A'])], ['dt'])
fd.show()
+-----+
| dt|
+-----+
|0143A|
+-----+
from pyspark.sql.functions import date_format, to_timestamp
#fd.select(date_format('dt','hhmma')).show()
fd.select(to_timestamp('dt','hhmmaa')).show()
+----------------------------+
|to_timestamp(`dt`, 'hhmmaa')|
+----------------------------+
| null|
+----------------------------+
Expected output: 01:43
How can I get the proper datetime format in the above scenario?
Thanks for your help!
If we look at the documentation for to_timestamp (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.to_timestamp), we see that the format must be specified as a SimpleDateFormat pattern (https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).
In order to retrieve the time of day in AM/PM, we must use hhmma. But in SimpleDateFormat, a matches AM or PM, not A or P. So we need to change our string:
import pyspark.sql.functions as F
df = spark.createDataFrame([(['0143A'])], ['dt'])
df2 = df.withColumn('dt', F.concat(F.col('dt'), F.lit('M')))
df3 = df2.withColumn('ts', F.to_timestamp('dt','hhmma'))
df3.show()
+------+-------------------+
| dt| ts|
+------+-------------------+
|0143AM|1970-01-01 01:43:00|
+------+-------------------+
If you want to retrieve it as a string in the format you mentioned, you can use date_format:
df4 = df3.withColumn('time', F.date_format(F.col('ts'), format='HH:mm'))
df4.show()
+------+-------------------+-----+
| dt| ts| time|
+------+-------------------+-----+
|0143AM|1970-01-01 01:43:00|01:43|
+------+-------------------+-----+

How to retrieve the month from date column values in a Scala dataframe?

Given:
val df = Seq((1L, "04-04-2015")).toDF("id", "date")
val df2 = df.withColumn("month", from_unixtime(unix_timestamp($"date", "dd/MM/yy"), "MMMMM"))
df2.show()
I got this output:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| null|
+---+----------+-----+
However, I want the output to be as below:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
How can I do that in Spark SQL using Scala?
This should do it:
val df2 = df.withColumn("month", date_format(to_date($"date", "dd-MM-yyyy"), "MMMM"))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
NOTE:
The format string passed to to_date must match the format of your existing date.
Be careful with "dd-MM-yyyy" vs "MM-dd-yyyy".
The format string passed to date_format is the format of the output.
Docs:
to_date
date_format
Nothing is wrong with your code; just keep the format string consistent with the format of your date column.
Not exactly related to this question, but for anyone who wants the month as an integer, there is a month function:
val df2 = df.withColumn("month", month(to_date($"date", "dd-MM-yyyy")))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| 4|
+---+----------+-----+
In the same way, you can use the year function to get only the year.

How to find the time difference between 2 date-times in Scala?

I have a dataframe
+-----+----+----------+------------+----------+------------+
|empId| lId| date1| time1 | date2 | time2 |
+-----+----+----------+------------+----------+------------+
| 1234|1212|2018-04-20|21:40:29.077|2018-04-20|22:40:29.077|
| 1235|1212|2018-04-20|22:40:29.077|2018-04-21|00:40:29.077|
+-----+----+----------+------------+----------+------------+
I need to find the time difference between the two date-times (in minutes) for each empId and save it as a new column.
Required output :
+-----+----+----------+------------+----------+------------+---------+
|empId| lId| date1| time1 | date2 | time2 |TimeDiff |
+-----+----+----------+------------+----------+------------+---------+
| 1234|1212|2018-04-20|21:40:29.077|2018-04-20|22:40:29.077|60 |
| 1235|1212|2018-04-20|22:40:29.077|2018-04-21|00:40:29.077|120 |
+-----+----+----------+------------+----------+------------+---------+
You can concatenate the date and time, convert the result to a timestamp, and compute the difference in minutes as below:
import org.apache.spark.sql.functions._

val format = "yyyy-MM-dd HH:mm:ss.SSS" // datetime format after concat

val newDF = df1.withColumn("TimeDiffInMinute",
  abs(unix_timestamp(concat_ws(" ", $"date1", $"time1"), format).cast("long")
    - unix_timestamp(concat_ws(" ", $"date2", $"time2"), format).cast("long")) / 60D
)
unix_timestamp converts the concatenated datetime string to seconds since the epoch; subtracting the two values gives the difference in seconds, and dividing by 60 gives minutes.
Output:
+-----+----+----------+------------+----------+------------+----------------+
|empId| lId|     date1|       time1|     date2|       time2|TimeDiffInMinute|
+-----+----+----------+------------+----------+------------+----------------+
| 1234|1212|2018-04-20|21:40:29.077|2018-04-20|22:40:29.077|            60.0|
| 1235|1212|2018-04-20|22:40:29.077|2018-04-21|00:40:29.077|           120.0|
+-----+----+----------+------------+----------+------------+----------------+
Hope this helped!

Extract weekday number from a string column (datetime stamp) in the Spark API

I am new to the Spark API. I am trying to extract the weekday number from a column, say col_date (holding a datetime stamp such as '13AUG15:09:40:15'), which is a string, and add another column weekday (integer). I am not able to do it successfully.
The approach below worked for me, using a 'one-line' UDF, similar to but different from the other answer here:
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('dayofweek').getOrCreate()
set up the dataframe:
df = spark.createDataFrame(
[(1, "2018-05-12")
,(2, "2018-05-13")
,(3, "2018-05-14")
,(4, "2018-05-15")
,(5, "2018-05-16")
,(6, "2018-05-17")
,(7, "2018-05-18")
,(8, "2018-05-19")
,(9, "2018-05-20")
], ("id", "date"))
set up the udf:
from pyspark.sql.functions import udf,desc
from datetime import datetime
weekDay = udf(lambda x: datetime.strptime(x, '%Y-%m-%d').strftime('%w'))
df = df.withColumn('weekDay', weekDay(df['date'])).sort(desc("date"))
results:
df.show()
+---+----------+-------+
| id| date|weekDay|
+---+----------+-------+
| 9|2018-05-20| 0|
| 8|2018-05-19| 6|
| 7|2018-05-18| 5|
| 6|2018-05-17| 4|
| 5|2018-05-16| 3|
| 4|2018-05-15| 2|
| 3|2018-05-14| 1|
| 2|2018-05-13| 0|
| 1|2018-05-12| 6|
+---+----------+-------+
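If you prefer to avoid a UDF, Spark's built-in dayofweek function (available since Spark 2.3) returns the day of the week as an integer, 1 = Sunday through 7 = Saturday, so its numbering differs from the strftime('%w') result above (where 0 = Sunday). A minimal sketch on the same dataframe:
from pyspark.sql.functions import dayofweek, to_date

df.withColumn('weekDayNum', dayofweek(to_date('date', 'yyyy-MM-dd'))).show()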
Well, this is quite simple.
This small function does all the work and returns the weekday as a number (Monday = 1):
from time import time
from datetime import datetime
# get the weekday from a datetime string
def toWeekDay(x):
# v = datetime.strptime(datetime.fromtimestamp(int(x)).strftime("%Y %m %d %H"), "%Y %m %d %H").strftime('%w') - from unix timestamp
v = datetime.strptime(x, '%d%b%y:%H:%M:%S').strftime('%w')
return v
days = ['13AUG15:09:40:15','27APR16:20:04:35'] # create example dates
days = sc.parallelize(days) # for example purposes - transform python list to RDD so we can do it in a 'Spark [parallel] way'
days.take(2) # to see what's in the RDD
> ['13AUG15:09:40:15', '27APR16:20:04:35']
result = days.map(lambda x: toWeekDay(x)) # apply the function toWeekDay to each element of the RDD
result.take(2) # lets see results
> ['4', '3']
Please see Python documentation for further details on datetime processing.