Convert date to end of month in Spark

I have a Spark DataFrame as shown below:
#Create DataFrame
df <- data.frame(name = c("Thomas", "William", "Bill", "John"),
                 dates = c('2017-01-05', '2017-02-23', '2017-03-16', '2017-04-08'))
df <- createDataFrame(df)
#Make sure df$dates column is in 'date' format
df <- withColumn(df, 'dates', cast(df$dates, 'date'))
name | dates
--------------------
Thomas |2017-01-05
William |2017-02-23
Bill |2017-03-16
John |2017-04-08
I want to change dates to the end-of-month date, as shown below. How do I do this? Either SparkR or PySpark code is fine.
name | dates
--------------------
Thomas |2017-01-31
William |2017-02-28
Bill |2017-03-31
John |2017-04-30

You may use the following (PySpark):
from pyspark.sql.functions import last_day
df.select('name', last_day(df.dates).alias('dates')).show()
To clarify, last_day(date) returns the last day of the month to which date belongs.
I'm pretty sure there is a similar function in SparkR:
https://spark.apache.org/docs/1.6.2/api/R/last_day.html
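To sanity-check the expected output without a Spark session, the same rule can be sketched in plain Python. This only illustrates what last_day computes; it is not Spark code, and the helper name end_of_month is made up for this example.

```python
import calendar
from datetime import date


def end_of_month(d: date) -> date:
    """Return the last calendar day of the month that d falls in."""
    # monthrange returns (weekday of the 1st, number of days in the month)
    _, days_in_month = calendar.monthrange(d.year, d.month)
    return d.replace(day=days_in_month)


# The four dates from the question
samples = [date(2017, 1, 5), date(2017, 2, 23), date(2017, 3, 16), date(2017, 4, 8)]
results = [end_of_month(d).isoformat() for d in samples]
print(results)
```

Leap years are handled by calendar.monthrange, so a February date in 2020 maps to the 29th.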

last_day is a poorly named function and should be wrapped in something more descriptive to make the code easier to read.
endOfMonth is a better function name. Here's how to use this function with the Scala API. Suppose you have the following data:
+----------+
| some_date|
+----------+
|2016-09-10|
|2020-01-01|
|2016-01-10|
|      null|
+----------+
Run the endOfMonth function that's part of spark-daria:
import com.github.mrpowers.spark.daria.sql.functions._
df.withColumn("res", endOfMonth(col("some_date"))).show()
Here are the results:
+----------+----------+
| some_date|       res|
+----------+----------+
|2016-09-10|2016-09-30|
|2020-01-01|2020-01-31|
|2016-01-10|2016-01-31|
|      null|      null|
+----------+----------+
I'll try to add this function to quinn too, so that it is easily accessible to PySpark users.

For completeness, here is the SparkR code:
df <- withColumn(df, 'dates', last_day(df$dates))

Related

How to get week of year in spark 3.0+?

I'm trying to create a calendar file with columns for day, month, etc. The following code works fine, but I couldn't find a clean way to extract the week of year (1-52). In spark 3.0+, the following line of code doesn't work: .withColumn("week_of_year", date_format(col("day_id"), "W"))
I know that I can create a view/table and then run a SQL query on it to extract the week_of_year, but is there no better way to do it?
df.withColumn("day_id", to_date(col("day_id"), date_fmt))
  .withColumn("week_day", date_format(col("day_id"), "EEEE"))
  .withColumn("month_of_year", date_format(col("day_id"), "M"))
  .withColumn("year", date_format(col("day_id"), "y"))
  .withColumn("day_of_month", date_format(col("day_id"), "d"))
  .withColumn("quarter_of_year", date_format(col("day_id"), "Q"))

It seems those patterns are no longer supported in Spark 3+:
Caused by: java.lang.IllegalArgumentException: All week-based patterns are unsupported since Spark 3.0, detected: w, Please use the SQL function EXTRACT instead
You can use this:
import org.apache.spark.sql.functions._
df.withColumn("week_of_year", weekofyear($"date"))
TESTING
INPUT
val df = List("2021-05-15", "1985-10-05")
  .toDF("date")
  .withColumn("date", to_date($"date", "yyyy-MM-dd"))
df.show
+----------+
|      date|
+----------+
|2021-05-15|
|1985-10-05|
+----------+
OUTPUT
df.withColumn("week_of_year", weekofyear($"date")).show
+----------+------------+
|      date|week_of_year|
+----------+------------+
|2021-05-15|          19|
|1985-10-05|          40|
+----------+------------+
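weekofyear follows ISO 8601 week numbering: weeks start on Monday, and week 1 is the first week with more than three days in the new year. Given that, the values above can be cross-checked in plain Python with date.isocalendar(), which uses the same definition. This is a check outside Spark, not a replacement for it, and the helper name iso_week is hypothetical.

```python
from datetime import date


def iso_week(d: date) -> int:
    """Return the ISO 8601 week number of d."""
    # isocalendar() returns (ISO year, ISO week number, ISO weekday)
    return d.isocalendar()[1]


# The two dates from the test input above
weeks = {s: iso_week(date.fromisoformat(s)) for s in ("2021-05-15", "1985-10-05")}
print(weeks)

# Edge case: early January can still belong to the last week of the previous ISO year
new_year_week = iso_week(date(2021, 1, 1))
print(new_year_week)
```

Note that ISO weeks run from 1 to 53 rather than 1 to 52: 2021-01-01 falls in week 53 of ISO year 2020, which matters when building a calendar table.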

The exception you saw recommends using the EXTRACT SQL function instead: https://spark.apache.org/docs/3.0.0/api/sql/index.html#extract
val df = Seq(("2019-11-16 16:50:59.406")).toDF("input_timestamp")
df.selectExpr("input_timestamp", "extract(week FROM input_timestamp) as w").show
+--------------------+---+
|     input_timestamp|  w|
+--------------------+---+
|2019-11-16 16:50:...| 46|
+--------------------+---+

Converting string time to day timestamp

I have just started working with PySpark and need some help converting a column's data type.
My dataframe has a string column, which stores the time of day in AM/PM, and I need to convert this into datetime for further processing/analysis.
fd = spark.createDataFrame([(['0143A'])], ['dt'])
fd.show()
+-----+
|   dt|
+-----+
|0143A|
+-----+
from pyspark.sql.functions import date_format, to_timestamp
#fd.select(date_format('dt','hhmma')).show()
fd.select(to_timestamp('dt','hhmmaa')).show()
+----------------------------+
|to_timestamp(`dt`, 'hhmmaa')|
+----------------------------+
|                        null|
+----------------------------+
Expected output: 01:43
How can I get the proper datetime format in the above scenario?
Thanks for your help!

If we look at the doc for to_timestamp (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.to_timestamp), we see that the format must be specified as a SimpleDateFormat pattern (https://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html).
To parse a 12-hour time of day with an AM/PM marker, we must use hhmma. But in SimpleDateFormat, a matches AM or PM, not A or P, so we first need to append an M to the string:
import pyspark.sql.functions as F
df = spark.createDataFrame([(['0143A'])], ['dt'])
df2 = df.withColumn('dt', F.concat(F.col('dt'), F.lit('M')))
df3 = df2.withColumn('ts', F.to_timestamp('dt','hhmma'))
df3.show()
+------+-------------------+
|    dt|                 ts|
+------+-------------------+
|0143AM|1970-01-01 01:43:00|
+------+-------------------+
If you want to retrieve it as a string in the format you mentioned, you can use date_format:
df4 = df3.withColumn('time', F.date_format(F.col('ts'), format='HH:mm'))
df4.show()
+------+-------------------+-----+
|    dt|                 ts| time|
+------+-------------------+-----+
|0143AM|1970-01-01 01:43:00|01:43|
+------+-------------------+-----+
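The same append-an-M trick can be tried in plain Python before running it on a cluster. This sketch uses datetime.strptime with %I%M%p (12-hour clock, minutes, AM/PM marker) instead of Spark's pattern letters, and the helper name to_24h is hypothetical. Matching of %p depends on the locale; an English or C locale is assumed.

```python
from datetime import datetime


def to_24h(raw: str) -> str:
    """Convert a value like '0143A' or '0143P' to 'HH:MM' on a 24-hour clock."""
    # Append 'M' so the trailing marker becomes 'AM' or 'PM'
    parsed = datetime.strptime(raw + "M", "%I%M%p")
    return parsed.strftime("%H:%M")


morning = to_24h("0143A")
afternoon = to_24h("0143P")
print(morning, afternoon)
```

Checking a PM value as well as the AM value from the question is worthwhile, since a pattern that silently ignores the marker would still produce 01:43 for the morning case.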

Spark 2.0 How to convert DF Date/timestamp column to another date format in scala?

For learning purposes, I have been using the sample dataset below.
+-------------------+-----+-----+-----+-----+-------+
|             MyDate| Open| High|  Low|Close| Volume|
+-------------------+-----+-----+-----+-----+-------+
|2006-01-03 00:00:00|983.8|493.8|481.1|492.9|1537660|
|2006-01-04 00:00:00|979.6|491.0|483.5|483.8|1871020|
|2006-01-05 00:00:00|972.2|487.8|484.0|486.2|1143160|
|2006-01-06 00:00:00|977.8|489.0|482.0|486.2|1370250|
|2006-01-09 00:00:00|973.4|487.4|483.0|483.9|1680740|
+-------------------+-----+-----+-----+-----+-------+
I tried to change the "MyDate" column values to a different format such as "YYYY-MON" and wrote this:
citiDataDF.withColumn("New-Mydate",to_timestamp($"MyDate", "yyyy-MON")).show(5)
After executing the code, I found the new column "New-Mydate", but it did not contain the desired output format. Can you please help?

You need date_format instead of to_timestamp:
val citiDataDF = List("2006-01-03 00:00:00").toDF("MyDate")
citiDataDF.withColumn("New-Mydate",date_format($"MyDate", "yyyy-MMM")).show(5)
Result:
+-------------------+----------+
|             MyDate|New-Mydate|
+-------------------+----------+
|2006-01-03 00:00:00|  2006-Jan|
+-------------------+----------+
Note: Three "M"s give the abbreviated month name. If you want the month as a number, use two "M"s (zero-padded, e.g. 01) or a single "M" (e.g. 1).
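For comparison, here is the same reformatting in plain Python, where %b and %m play roughly the roles of MMM and MM. This is a standard-library illustration rather than Spark's date_format, and the %b output depends on the locale (English or C is assumed).

```python
from datetime import datetime

# Parse the sample value from the dataset above
ts = datetime.strptime("2006-01-03 00:00:00", "%Y-%m-%d %H:%M:%S")

month_name = ts.strftime("%Y-%b")    # year plus abbreviated month name
month_number = ts.strftime("%Y-%m")  # year plus zero-padded month number
print(month_name, month_number)
```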

PySpark: String to timestamp transformation

I am working with time data and trying to convert a string to timestamp format.
Here is what the 'Time' column looks like:
+----------+
| Time |
+----------+
|1358380800|
|1380672000|
+----------+
Here is what I want
+---------------+
| Time |
+---------------+
|2013/1/17 8:0:0|
|2013/10/2 8:0:0|
+---------------+
I found some similar questions and answers and have tried the code below, but every attempt ends with 'null':
df2 = df.withColumn("Time", test["Time"].cast(TimestampType()))
df2 = df.withColumn('Time', F.unix_timestamp('Time', 'yyyy-MM-dd').cast(TimestampType()))

Well, you are doing it the other way around. The SQL function unix_timestamp converts a string with the given format to a Unix timestamp. When you want to convert a Unix timestamp to a datetime, you have to use the from_unixtime SQL function:
from pyspark.sql import functions as F
from pyspark.sql import types as T
l1 = [('1358380800',),('1380672000',)]
df = spark.createDataFrame(l1,['Time'])
df.withColumn('Time', F.from_unixtime(df.Time).cast(T.TimestampType())).show()
Output:
+-------------------+
|               Time|
+-------------------+
|2013-01-17 01:00:00|
|2013-10-02 02:00:00|
+-------------------+
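The hours in the output above depend on the session time zone: both epoch values are exactly midnight UTC. The plain-Python sketch below assumes the asker's desired output is in UTC+8 (inferred from the 8:0:0 in the question, not stated there) and shows both the conversion and the unpadded format. The helper name format_epoch is hypothetical.

```python
from datetime import datetime, timedelta, timezone

UTC_PLUS_8 = timezone(timedelta(hours=8))  # assumed target zone, not given in the question


def format_epoch(seconds: str, tz: timezone = UTC_PLUS_8) -> str:
    """Render epoch seconds as 'yyyy/M/d H:m:s' without zero padding."""
    d = datetime.fromtimestamp(int(seconds), tz=tz)
    return f"{d.year}/{d.month}/{d.day} {d.hour}:{d.minute}:{d.second}"


formatted = [format_epoch(s) for s in ("1358380800", "1380672000")]
print(formatted)
```

Passing timezone.utc as the second argument gives 0:0:0 for both rows, which makes the time zone offset in any other output easy to spot.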

Extract week day number from string column (datetime stamp) in spark api

I am new to the Spark API. I am trying to extract the weekday number from a string column, say col_date (holding a datetime stamp such as '13AUG15:09:40:15'), and add it as another integer column, weekday. I have not been able to do this successfully.

The approach below worked for me, using a one-line UDF. It is similar to the other answer, but not identical:
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName('dayofweek').getOrCreate()
Set up the DataFrame:
df = spark.createDataFrame(
    [(1, "2018-05-12")
    ,(2, "2018-05-13")
    ,(3, "2018-05-14")
    ,(4, "2018-05-15")
    ,(5, "2018-05-16")
    ,(6, "2018-05-17")
    ,(7, "2018-05-18")
    ,(8, "2018-05-19")
    ,(9, "2018-05-20")
    ], ("id", "date"))
Set up the UDF:
from pyspark.sql.functions import udf,desc
from datetime import datetime
weekDay = udf(lambda x: datetime.strptime(x, '%Y-%m-%d').strftime('%w'))
df = df.withColumn('weekDay', weekDay(df['date'])).sort(desc("date"))
Results:
df.show()
+---+----------+-------+
| id|      date|weekDay|
+---+----------+-------+
|  9|2018-05-20|      0|
|  8|2018-05-19|      6|
|  7|2018-05-18|      5|
|  6|2018-05-17|      4|
|  5|2018-05-16|      3|
|  4|2018-05-15|      2|
|  3|2018-05-14|      1|
|  2|2018-05-13|      0|
|  1|2018-05-12|      6|
+---+----------+-------+

Well, this is quite simple.
This simple function does all the work and returns the weekday number as a string (Sunday = '0', Monday = '1'):
from datetime import datetime
# get the weekday number from a datetime string such as '13AUG15:09:40:15'
def toWeekDay(x):
    # v = datetime.strptime(datetime.fromtimestamp(int(x)).strftime("%Y %m %d %H"), "%Y %m %d %H").strftime('%w') - from unix timestamp
    v = datetime.strptime(x, '%d%b%y:%H:%M:%S').strftime('%w')
    return v
days = ['13AUG15:09:40:15','27APR16:20:04:35'] # create example dates
# sc is assumed to be an existing SparkContext
days = sc.parallelize(days) # for example purposes - transform python list to RDD so we can do it in a 'Spark [parallel] way'
days.take(2) # to see what's in the RDD
> ['13AUG15:09:40:15', '27APR16:20:04:35']
result = days.map(lambda x: (toWeekDay(x))) # apply function toWeekDay to each element of the RDD
result.take(2) # let's see the results
> ['4', '3']
Please see Python documentation for further details on datetime processing.
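Weekday numbering conventions differ, which is easy to trip over when the result must be an integer column. The plain-Python sketch below parses the question's format and shows strftime('%w') (Sunday = 0) next to isoweekday() (Monday = 1, Sunday = 7). The helper name weekday_numbers is made up for illustration, and English month abbreviations are assumed for %b.

```python
from datetime import datetime


def weekday_numbers(stamp: str) -> tuple:
    """Return (Sunday-based weekday 0-6, ISO weekday 1-7) for a stamp like '13AUG15:09:40:15'."""
    parsed = datetime.strptime(stamp, "%d%b%y:%H:%M:%S")
    return int(parsed.strftime("%w")), parsed.isoweekday()


thursday = weekday_numbers("13AUG15:09:40:15")
sunday = weekday_numbers("16AUG15:00:00:00")
print(thursday, sunday)
```

The two conventions agree from Monday to Saturday and differ only on Sunday, so test data should include a Sunday before the column is relied on.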
