I want to write a loop in PySpark that takes a month and reads the table for the last day of that month and for the last day of the previous month.
The month is passed as a string.
So if I give '201901', it should select '20190131' and '20181231'.
If possible, it should also run automatically, selecting the end of the previous month relative to today and the end of the month before that.
For example, today is 2020-05-07, so it should select '20200430' and '20200331'.
def selectTables(date):
    i = 0
    for i in range(len(date)):
        recentDate = ....  # should be for the first iteration '20190131'
        previousDate = ....  # should be for the first iteration '20181231'
        recent = spark.read.parquet('table.parquet/date=' + recentDate[i])
        previous = spark.read.parquet('table.parquet/date=' + previousDate[i])

selectTables(['201901', '201902'])
Use Spark's built-in add_months and last_day functions to get the last day of a month.
Example:
date='201901'
recentDate=spark.sql("select string(last_day(to_date('{}','yyyyMM')))".format(date)).collect()[0][0]
#u'2019-01-31'
previousDate=spark.sql("select string(last_day(add_months(to_date('{}','yyyyMM'),'-1')))".format(date)).collect()[0][0]
#u'2018-12-31'
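The Spark answer above returns dates with dashes ('2019-01-31'), while the question asks for 'yyyyMMdd' strings and for a default based on today's date. A minimal plain-Python sketch of the same month-end logic follows; the helper names month_end_pair and default_month are hypothetical and not part of the original answer.

```python
import calendar
from datetime import date, timedelta


def month_end_pair(yyyymm):
    """Return (end of the given month, end of the previous month) as 'YYYYMMDD' strings."""
    year, month = int(yyyymm[:4]), int(yyyymm[4:6])
    # monthrange returns (weekday of the first day, number of days in the month)
    last_day = calendar.monthrange(year, month)[1]
    recent = date(year, month, last_day)
    # the day before the first of the month is the end of the previous month
    previous = date(year, month, 1) - timedelta(days=1)
    return recent.strftime("%Y%m%d"), previous.strftime("%Y%m%d")


def default_month(today=None):
    """Return the month before `today` as 'YYYYMM' (hypothetical helper for the automatic run)."""
    today = today or date.today()
    return (today.replace(day=1) - timedelta(days=1)).strftime("%Y%m")


print(month_end_pair("201901"))
print(month_end_pair(default_month(date(2020, 5, 7))))
```

The returned strings can then be appended to a path such as 'table.parquet/date=' in the same way as in the question's loop.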
I am trying to create a list of the last days of each month for the past n months from the current date, not including the current month.
I tried different approaches:
def last_n_month_end(n_months):
    """
    Returns a list of the last n month end dates
    """
    return [datetime.date.today().replace(day=1) - datetime.timedelta(days=1) - datetime.timedelta(days=30*i) for i in range(n_months)]
This only partly works, since it assumes every month has 30 days, and it also does not work in Databricks PySpark, where it raises AttributeError: 'method_descriptor' object has no attribute 'today'.
I also tried the approach from "Generate a sequence of the last days of all previous N months with a given month":
def previous_month_ends(date, months):
    year, month, day = [int(x) for x in date.split('-')]
    d = datetime.date(year, month, day)
    t = datetime.timedelta(1)
    s = datetime.date(year, month, 1)
    return [(x - t).strftime('%Y-%m-%d')
            for m in range(months - 1, -1, -1)
            for x in (datetime.date(s.year, s.month - m, s.day) if s.month > m else \
                      datetime.date(s.year - 1, s.month - (m - 12), s.day),)]
but I am not getting the correct result.
I also tried:
df = spark.createDataFrame([(1,)],['id'])
days = df.withColumn('last_dates', explode(expr('sequence(last_day(add_months(current_date(),-3)), last_day(add_months(current_date(), -1)), interval 1 month)')))
I got the last three months (Sep, Oct, Nov), but all of them fall on the 30th, even though October ends on the 31st. However, it gives me the correct last days when I go back more than 3 months.
What I am trying to get is this:
(last days of the last 4 months not including last_day of current_date)
daterange = ['2022-08-31','2022-09-30','2022-10-31','2022-11-30']
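One way to get this list without relying on fixed 30-day steps is to count months on a single axis (year * 12 + month) and look up each month's length. The sketch below is only an illustration using the standard library; the function name last_n_month_ends and the optional today parameter are assumptions added to make it testable.

```python
import calendar
from datetime import date


def last_n_month_ends(n_months, today=None):
    """Return the last days of the n months before the current month, oldest first."""
    today = today or date.today()
    # put months on one axis so that year boundaries need no special case
    month_index = today.year * 12 + (today.month - 1)
    ends = []
    for back in range(n_months, 0, -1):
        year, month0 = divmod(month_index - back, 12)
        # monthrange returns (weekday of the first day, number of days in the month)
        last_day = calendar.monthrange(year, month0 + 1)[1]
        ends.append(date(year, month0 + 1, last_day).isoformat())
    return ends


# assuming a reference date in December 2022, as in the expected output above
print(last_n_month_ends(4, today=date(2022, 12, 15)))
```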
Not sure if this is the best or most optimal way to do it, but it works.
It requires the following package, since as far as I know datetime has no way to subtract months without hardcoding the number of days or weeks.
Package Installation:
pip install python-dateutil
Edit: There was a misunderstanding on my end. I had assumed that all dates were required and not just the month ends. Anyway, I hope the updated code helps. It is still not the most optimal, but it should be easy to understand.
# import datetime package
from datetime import date, timedelta
from dateutil.relativedelta import relativedelta

def previous_month_ends(months_to_subtract):
    # get first day of current month
    first_day_of_current_month = date.today().replace(day=1)
    print(f"First Day of Current Month: {first_day_of_current_month}")
    # Calculate the previous month's last date
    date_range_list = [first_day_of_current_month - relativedelta(days=1)]
    cur_iter = 1
    while cur_iter < months_to_subtract:
        # Calculate first day of previous months relative to first day of current month
        cur_iter_fdom = first_day_of_current_month - relativedelta(months=cur_iter)
        # Subtract one day to get the last day of the month before it
        cur_iter_ldom = cur_iter_fdom - relativedelta(days=1)
        # Append to the list
        date_range_list.append(cur_iter_ldom)
        # Increment counter
        cur_iter += 1
    return date_range_list

print(previous_month_ends(3))
Function to calculate the list of all dates between 2 dates:
Calculate the first day of the current month.
Calculate the start and end dates, then loop between them to build the list of dates.
I have ignored the date argument, since I assumed it will always be the current date. Alternatively, it can be added following your own code, which should work perfectly.
# import datetime package
from datetime import date, timedelta
from dateutil.relativedelta import relativedelta

def gen_date_list(months_to_subtract):
    # get first day of current month
    first_day_of_current_month = date.today().replace(day=1)
    print(f"First Day of Current Month: {first_day_of_current_month}")
    start_date = first_day_of_current_month - relativedelta(months=months_to_subtract)
    end_date = first_day_of_current_month - relativedelta(days=1)
    print(f"Start Date: {start_date}")
    print(f"End Date: {end_date}")
    date_range_list = [start_date]
    cur_iter_date = start_date
    while cur_iter_date < end_date:
        cur_iter_date += timedelta(days=1)
        date_range_list.append(cur_iter_date)
    # print(date_range_list)
    return date_range_list

print(gen_date_list(3))
Hope it helps. Edits and comments are welcome, as I am still learning myself.
I just thought of a workaround I can use, since my last code works:
df = spark.createDataFrame([(1,)],['id'])
days = df.withColumn('last_dates', explode(expr('sequence(last_day(add_months(current_date(),-3)), last_day(add_months(current_date(), -1)), interval 1 month)')))
The idea is to enter -4 instead of -3 and then drop the one date I do not need (for example with pop(0) after collecting the values into a Python list), which should leave the list of needed last dates.
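from datetime import datetime, timedelta

def get_last_dates(n_months):
    '''
    generates a list of last dates for each month for the past n months
    (most recent first, not including the current month)
    Param:
        n_months = number of months back
    '''
    last_dates = []  # initiate an empty list
    first_of_month = datetime.today().replace(day=1)
    for _ in range(n_months):
        # the day before the first of a month is the last day of the previous month
        last_of_previous = first_of_month - timedelta(days=1)
        last_dates.append(last_of_previous)
        # step back to the first day of that previous month instead of assuming 30-day months
        first_of_month = last_of_previous.replace(day=1)
    return last_dates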
This should give you more accurate last days.
I have a variable in a SAS dataset that holds a number of dates (e.g. 01APR21). I want to create a new variable that shows the date of the Monday of that week. Using the above example of 01APR21, the output would be 29/03/2021, as that was the Monday of that week. I'm assuming it uses INTNX, but I can't get my head around it.
data test;
    format date date8.;
    format first_day date10.;
    date = '01APR21'd;
    first_day = ?;
run;
INTNX parameters:
Interval: WEEK
Increment: 0 (same week)
Alignment: Beginning (Sunday)
Then add 1 to get to Monday instead of Sunday. You could probably also use the shift index on the interval (for example 'week.2', which should start weeks on Monday).
Monday = intnx('week', dateVariable, 0, 'B') + 1;
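To check the arithmetic outside SAS, the same idea (go back to the Sunday that starts the week, then add one day) can be sketched in Python. This is only an illustration of the logic, not SAS code, and the function name monday_of_sunday_week is hypothetical.

```python
from datetime import date, timedelta


def monday_of_sunday_week(d):
    """Monday of the Sunday-to-Saturday week containing d."""
    # Python's weekday(): Monday = 0 ... Sunday = 6
    days_since_sunday = (d.weekday() + 1) % 7
    # equivalent in spirit to intnx('week', d, 0, 'B')
    week_start = d - timedelta(days=days_since_sunday)
    return week_start + timedelta(days=1)


print(monday_of_sunday_week(date(2021, 4, 1)).strftime("%d/%m/%Y"))
```

Note that with a Sunday-based week, a date that is itself a Sunday maps to the Monday after it, not the Monday before it.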
I need to calculate the difference between two dates, excluding Sundays. I have a table with dates, and I need to count the run of consecutive dates leading up to the last date.
If I have dates like this:
27-05-2017
29-05-2017
30-05-2017
I use this code in the script:
date(max(Date)) as dateMax,
date(min(Date)) as dateMin
This gives min date = 27-05-2017 and max date = 30-05-2017, which I then use in this expression:
=floor(((dateMax - dateMin)+1)/7)*6 + mod((dateMax - dateMin)+1,7)
+ if(Weekday(dateMin) + mod((dateMax - dateMin)+1,7) < 7, 0, -1)
The result is 3 days, which is correct. The problem is when I have the following dates:
10-05-2017
11-05-2017
27-05-2017
29-05-2017
30-05-2017
Using the previous code, I get min date = 10-05-2017 and max date = 30-05-2017 and a result of 18, which is not what I want.
I need to count only the dates from
27-05-2017
29-05-2017
30-05-2017
I need to start at the max date and loop back through the consecutive dates. When there is a break, I need to check whether the missing date is a Sunday: if yes, skip that date and continue counting; if the break is not a Sunday, stop the loop and remember the number of days.
In my case, instead of 18 days I need to get 3 days.
Any ideas?
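The loop described above can be sketched in Python to make the rule concrete. This is an illustration only, not Qlik script; the function name count_recent_run is hypothetical, and it assumes a non-empty list of dates and that Sundays are never counted even if they appear in the data.

```python
from datetime import date, timedelta

SUNDAY = 6  # date.weekday() numbering: Monday = 0 ... Sunday = 6


def count_recent_run(dates):
    """Count consecutive dates ending at the latest date, skipping Sundays."""
    present = set(dates)
    current = max(present)
    count = 0 if current.weekday() == SUNDAY else 1
    while True:
        previous = current - timedelta(days=1)
        if previous.weekday() == SUNDAY:
            # Sundays are never counted and never break the run
            current = previous
        elif previous in present:
            count += 1
            current = previous
        else:
            # a missing non-Sunday date ends the run
            break
    return count


dates = [date(2017, 5, 10), date(2017, 5, 11),
         date(2017, 5, 27), date(2017, 5, 29), date(2017, 5, 30)]
print(count_recent_run(dates))
```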
I'd recommend creating a master calendar in the script, where you can apply weights or any other rule to your days. Then in your table or app you can loop through the dates or perform operations and sum their weights (0 for Sunday, 1 otherwise). Let's see an example:
// In this case I'll do a master calendar of the present year
LET vMinDate = Num(MakeDate(year(today()),1,1));
LET vMaxDate = Num(MakeDate(year(today()),12,31));

Calendar_tmp:
LOAD
    $(vMinDate) + Iterno() - 1 as Num,
    Date($(vMinDate) + Iterno() - 1) as Date_tmp
AUTOGENERATE 1 WHILE $(vMinDate) + Iterno() - 1 <= $(vMaxDate);

Master_Calendar:
LOAD
    Date_tmp AS Date,
    Week(Date_tmp) as Week,
    Year(Date_tmp) as Year,
    Capitalize(Month(Date_tmp)) as Month,
    Day(Date_tmp) as Day,
    WeekDay(Date_tmp) as WeekDay,
    // Assumption: WeekDay() numbers Monday as 0 and Sunday as 6; adjust the value if your settings differ
    if(Num(WeekDay(Date_tmp)) = 6, 0, 1) as DayWeight, // HERE IS WHERE YOU COULD DEFINE A VARIABLE TO DIRECTLY COUNT THE DAY IF IT IS NOT SUNDAY
    'T' & ceil(num(Month(Date_tmp))/3) as Quarter,
    'T' & ceil(num(Month(Date_tmp))/3) & '-' & right(year(Date_tmp),2) as QuarterYear,
    date(monthStart(Date_tmp),'MMMM-YYYY') as MonthYear,
    date(monthstart(Date_tmp),'MMM-YY') as MonthYear2
RESIDENT Calendar_tmp
ORDER BY Date_tmp ASC;

DROP Table Calendar_tmp;
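The weighting idea behind the master calendar can be illustrated in a few lines of Python: every date gets a weight of 0 (Sunday) or 1 (any other day), and counting days becomes a sum of weights. The names build_calendar and weighted_days are hypothetical, and this is a sketch of the concept rather than a replacement for the Qlik script.

```python
from datetime import date, timedelta


def build_calendar(start, end):
    """Map every date in [start, end] to a weight: 0 for Sundays, 1 otherwise."""
    calendar_weights = {}
    day = start
    while day <= end:
        # date.weekday(): Monday = 0 ... Sunday = 6
        calendar_weights[day] = 0 if day.weekday() == 6 else 1
        day += timedelta(days=1)
    return calendar_weights


def weighted_days(calendar_weights, first, last):
    """Sum the weights of all calendar dates between first and last, inclusive."""
    return sum(w for d, w in calendar_weights.items() if first <= d <= last)


cal = build_calendar(date(2017, 1, 1), date(2017, 12, 31))
print(weighted_days(cal, date(2017, 5, 27), date(2017, 5, 30)))
```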
I'm trying to take 2 dates in an Excel sheet and use the DateDiff function to get the number of days between them. I am essentially adding the number of days together and dividing by the number of rows to get an average number of days. So far, the running total of days for every row is added up and displayed in column "E", and the number of rows is placed in column "F". I know I am close because at one point it worked, but I changed something and now it does not. Here is my code and the Excel sheet.
Sub GetDays()
    Range("C1").Select
    Do Until ActiveCell.Value = ""
        date1 = DateValue(ActiveCell.Offset(1, 0).Value)
        date2 = DateValue(ActiveCell.Offset(1, 0).EntireRow.Cells(1, "D").Value)
        DayCount = DateDiff("d", date1, date2) + DayCount
        ActiveCell.Offset(1, 0).EntireRow.Cells(1, "E").Value = DayCount
        StudentCount = StudentCount + 1
        ActiveCell.Offset(1, 0).EntireRow.Cells(1, "F").Value = StudentCount
        ActiveCell.Offset(1, 0).Select
    Loop
End Sub
Here is a snippet of the sheet
The issue I discovered when testing your code is that the loop checks the ActiveCell value to decide when to exit, but the code inside the loop operates on the cell below ActiveCell because of the Offset(1, 0) calls. So when the loop is on the last line of data, ActiveCell.Value = "3/25/2015 10:52", but the next line of code tries to populate date1 with the DateValue of an empty cell, since it is offset down one row. This throws a Type Mismatch error.
I've adjusted your code below; this works for me:
Sub GetDays()
    Range("C1").Select
    Do Until ActiveCell.Value = ""
        date1 = DateValue(ActiveCell.Value)
        date2 = DateValue(ActiveCell.Offset(0, 1).Value)
        DayCount = DateDiff("d", date1, date2) + DayCount
        ActiveCell.Offset(0, 2).Value = DayCount
        StudentCount = StudentCount + 1
        ActiveCell.Offset(0, 3).Value = StudentCount
        ActiveCell.Offset(1, 0).Select
    Loop
End Sub
I adjusted the Offset calls so that each loop iteration reads from and writes to the same row. I replaced the EntireRow.Cells(1, "D") sections with a column offset in Offset().
You may need to change the second line to Range("C2").Select for my code to work, depending on whether your data starts on row 1 or row 2.
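For comparison, the same running total and row count can be sketched in Python, which makes the "read and write on the same row" rule easy to see. The sample rows are made up for illustration, and running_day_counts is a hypothetical name; like DateValue in the VBA code, the sketch ignores the time of day before taking the difference.

```python
from datetime import datetime


def running_day_counts(rows):
    """For each (start, end) row return (running total of days, rows seen so far)."""
    total_days = 0
    results = []
    for student_count, (start, end) in enumerate(rows, start=1):
        # compare calendar dates only, dropping the time of day
        total_days += (end.date() - start.date()).days
        results.append((total_days, student_count))
    return results


# made-up sample data standing in for columns C and D
rows = [
    (datetime(2015, 3, 20, 9, 0), datetime(2015, 3, 25, 10, 52)),
    (datetime(2015, 3, 1, 8, 0), datetime(2015, 3, 4, 7, 0)),
]
counts = running_day_counts(rows)
total, n = counts[-1]
print(counts, total / n)
```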
Let's say that I have a range of SQL tables named name_YYYY_WW, where YYYY = year and WW = week number. I want to call a function that routes a user-defined date to the right table.
If the date entered is "20110101":
SELECT EXTRACT (WEEK FROM DATE '20110101') returns 52 and
SELECT EXTRACT (YEAR FROM DATE '20110101') returns 2011.
While there is nothing wrong with these results, I want "20110101" to point to either table name_2010_52 or name_2011_01, not name_2011_52 as it does now when I concatenate the results to form the table name for the query.
Any elegant solutions to this problem?
The function to_char() lets you format a date or timestamp so that it outputs the correct ISO week and ISO year.
SELECT to_char('2011-01-01'::date, 'IYYY_IW') as iso_year_week;
will produce:
iso_year_week
---------------
2010_52
(1 row)
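If the table name is built in application code rather than in SQL, the same ISO year and ISO week pairing is available in Python through date.isocalendar(). A minimal sketch, assuming the name_YYYY_WW convention from the question (the function name weekly_table_name and the prefix parameter are hypothetical):

```python
from datetime import datetime


def weekly_table_name(yyyymmdd, prefix="name"):
    """Build prefix_YYYY_WW from the ISO year and ISO week of the given date."""
    d = datetime.strptime(yyyymmdd, "%Y%m%d").date()
    iso = d.isocalendar()  # (ISO year, ISO week number, ISO weekday)
    iso_year, iso_week = iso[0], iso[1]
    return f"{prefix}_{iso_year}_{iso_week:02d}"


print(weekly_table_name("20110101"))
```

Taking the year and the week from the same ISO calendar is what keeps the two parts consistent around New Year.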
You could use a CASE:
WITH sub(field) AS (
    SELECT CAST('20110101' AS date) -- just to test
)
SELECT
    CASE
        -- January 1st to 3rd can still belong to the last ISO week of the previous year
        WHEN EXTRACT (WEEK FROM field) > 1 AND EXTRACT (MONTH FROM field) = 1 AND EXTRACT (DAY FROM field) <= 3 THEN 1
        ELSE EXTRACT (WEEK FROM field)
    END
FROM
    sub;