I have one table with dates and another table that holds weekly data. My weeks start on Tuesday, and the date in the second table is supposed to determine the week (basically, the Tuesday on or before that date is the start of the week; in other words, the date is just an example day in that week).
How can I join the dates to information about weeks?
Here is the setup:
from datetime import datetime as dt
import pandas as pd
df=pd.DataFrame([dt(2016,2,3), dt(2016,2,8), dt(2016,2,9), dt(2016,2,15)])
df_week=pd.DataFrame([(dt(2016,2,4),"a"), (dt(2016,2,11),"b")], columns=["week", "val"])
# note the actual start of the weeks are the Tuesdays: 2.2., 9.2.
# I expect a new column df["val"]=["a", "a", "b", "b"]
I've seen pandas date_range, but I cannot see how to do that from there.
You're looking for DatetimeIndex.asof:
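This will give you the closest index up to the day in df (df is assumed to hold its dates in a column named 'day', as in the DataFrame defined further down):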
df_week.set_index('week', inplace=True)
df_week.index.asof(df['day'][1])
You can now use it to select the corresponding value:
df_week.loc[df_week.index.asof(df['day'][1])]
Finally, apply it to the entire dataframe:
df = pd.DataFrame([dt(2016,2,8), dt(2016,2,9), dt(2016,2,15)], columns=['day'])
df['val'] = df.apply(lambda row: df_week.loc[df_week.index.asof(row['day'])]['val'], axis=1)
I removed the first value from df because I didn't want to deal with edge cases.
Result:
day val
0 2016-02-08 a
1 2016-02-09 a
2 2016-02-15 b
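Note that asof matches against the raw dates stored in df_week (Feb 4 and Feb 11), so Feb 9 maps to "a", whereas the question expects "b" because the week starts on Tuesday, Feb 9. A minimal sketch of an alternative, assuming the goal is the Tuesday-based week from the question: normalize both tables to the Tuesday on or before each date and merge on that. The helper name week_start_tuesday and the column name week_start are made up for this illustration.

```python
from datetime import datetime as dt

import pandas as pd


def week_start_tuesday(dates):
    """Return the Tuesday on or before each date in a datetime Series."""
    # dayofweek: Monday=0, Tuesday=1, ..., Sunday=6
    days_since_tuesday = (dates.dt.dayofweek - 1) % 7
    return dates - pd.to_timedelta(days_since_tuesday, unit="D")


df = pd.DataFrame(
    {"day": [dt(2016, 2, 3), dt(2016, 2, 8), dt(2016, 2, 9), dt(2016, 2, 15)]}
)
df_week = pd.DataFrame(
    [(dt(2016, 2, 4), "a"), (dt(2016, 2, 11), "b")], columns=["week", "val"]
)

# Map every date in both tables to the start of its Tuesday-based week.
df["week_start"] = week_start_tuesday(df["day"])
df_week["week_start"] = week_start_tuesday(df_week["week"])

# A left merge keeps all rows of df, even those without weekly data.
result = df.merge(df_week[["week_start", "val"]], on="week_start", how="left")
print(result[["day", "val"]])
```

Dates that fall in a week with no entry in df_week end up with a missing val, so the edge case that was skipped above does not need special handling.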
I am trying to create a list of the last days of each month for the past n months, counting back from the current date but not including the current month.
I tried different approaches:
def last_n_month_end(n_months):
    """
    Returns a list of the last n month end dates
    """
    return [datetime.date.today().replace(day=1) - datetime.timedelta(days=1) - datetime.timedelta(days=30*i) for i in range(n_months)]
This only partly works, because it assumes every month has 30 days, and it does not work at all in Databricks PySpark, where it raises AttributeError: 'method_descriptor' object has no attribute 'today' (which suggests that the name datetime refers to the datetime class there rather than to the module).
I also tried the approach mentioned in Generate a sequence of the last days of all previous N months with a given month
def previous_month_ends(date, months):
    year, month, day = [int(x) for x in date.split('-')]
    d = datetime.date(year, month, day)
    t = datetime.timedelta(1)
    s = datetime.date(year, month, 1)
    return [(x - t).strftime('%Y-%m-%d')
            for m in range(months - 1, -1, -1)
            for x in (datetime.date(s.year, s.month - m, s.day) if s.month > m else \
                      datetime.date(s.year - 1, s.month - (m - 12), s.day),)]
but I am not getting the correct result.
I also tried:
df = spark.createDataFrame([(1,)],['id'])
days = df.withColumn('last_dates', explode(expr('sequence(last_day(add_months(current_date(),-3)), last_day(add_months(current_date(), -1)), interval 1 month)')))
I got the last three months (Sep, Oct, Nov), but all of them are on the 30th, although the last day of October is Oct 31st. However, it gives me the correct last days when I go back more than 3 months.
What I am trying to get is this:
(last days of the last 4 months not including last_day of current_date)
daterange = ['2022-08-31','2022-09-30','2022-10-31','2022-11-30']
Not sure if this is the best or optimal way to do it, but this does it...
This requires the following package, since as far as I know datetime has no way to subtract months without hardcoding the number of days or weeks. I am not sure, so don't quote me on this.
Package Installation:
pip install python-dateutil
Edit: There was a misunderstanding on my end. I had assumed that all dates were required and not just the month ends. Anyway, I hope the updated code helps. It is still not optimal, but it should be easy to understand.
# import datetime package
from datetime import date, timedelta
from dateutil.relativedelta import relativedelta

def previous_month_ends(months_to_subtract):
    # get first day of current month
    first_day_of_current_month = date.today().replace(day=1)
    print(f"First Day of Current Month: {first_day_of_current_month}")
    # Calculate the last date of the previous month
    date_range_list = [first_day_of_current_month - relativedelta(days=1)]
    cur_iter = 1
    while cur_iter < months_to_subtract:
        # Calculate First Day of previous months relative to first day of current month
        cur_iter_fdom = first_day_of_current_month - relativedelta(months=cur_iter)
        # Subtract one day to get the last day of previous month
        cur_iter_ldom = cur_iter_fdom - relativedelta(days=1)
        # Append to the list
        date_range_list.append(cur_iter_ldom)
        # Increment Counter
        cur_iter += 1
    return date_range_list

print(previous_month_ends(3))
Function to calculate date list between 2 dates:
Calculate the first of current month.
Calculate start and end dates and then loop through them to get the list of dates.
I have ignored the date argument, since I assumed the calculation is always relative to the current date. Alternatively, it can be added by following your own code, which should work perfectly.
# import datetime package
from datetime import date, timedelta
from dateutil.relativedelta import relativedelta

def gen_date_list(months_to_subtract):
    # get first day of current month
    first_day_of_current_month = date.today().replace(day=1)
    print(f"First Day of Current Month: {first_day_of_current_month}")
    start_date = first_day_of_current_month - relativedelta(months=months_to_subtract)
    end_date = first_day_of_current_month - relativedelta(days=1)
    print(f"Start Date: {start_date}")
    print(f"End Date: {end_date}")
    date_range_list = [start_date]
    cur_iter_date = start_date
    while cur_iter_date < end_date:
        cur_iter_date += timedelta(days=1)
        date_range_list.append(cur_iter_date)
    # print(date_range_list)
    return date_range_list

print(gen_date_list(3))
Hope it helps...Edits/Comments are welcome - I am learning myself...
I just thought of a workaround I can use, since my last code works:
df = spark.createDataFrame([(1,)],['id'])
days = df.withColumn('last_dates', explode(expr('sequence(last_day(add_months(current_date(),-3)), last_day(add_months(current_date(), -1)), interval 1 month)')))
The workaround is to enter -4 instead of -3 and then remove the first last_date, which I do not need (for example with pop(0) once the values have been collected into a Python list). That should give me the list of needed last_dates.
from datetime import datetime, timedelta

def get_last_dates(n_months):
    '''
    generates a list of lastdates for each month for the past n months
    Param:
    n_months = number of months back
    '''
    last_dates = []  # initiate an empty list
    for i in range(n_months):
        last_dates.append((datetime.today() - timedelta(days=i*30)).replace(day=1) - timedelta(days=1))
    return last_dates
This should give you the last_days, but note that it still steps back in 30-day increments, so it is only approximate.
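One way to avoid day-count approximations altogether is to hop from the first day of a month to the last day of the month before it, and repeat. Below is a minimal standard-library sketch of that idea. The function name last_n_month_ends and the optional today parameter are made up for this illustration, and the list is returned oldest first to match the output requested in the question.

```python
from datetime import date, timedelta


def last_n_month_ends(n_months, today=None):
    """Return the last days of the n months before the current month, oldest first."""
    today = today or date.today()
    first_of_month = today.replace(day=1)
    month_ends = []
    for _ in range(n_months):
        # The day before the first of a month is the last day of the previous month.
        last_of_previous = first_of_month - timedelta(days=1)
        month_ends.append(last_of_previous)
        # Move one month further back for the next iteration.
        first_of_month = last_of_previous.replace(day=1)
    return month_ends[::-1]


print([d.isoformat() for d in last_n_month_ends(4, today=date(2022, 12, 15))])
# ['2022-08-31', '2022-09-30', '2022-10-31', '2022-11-30']
```

Passing today explicitly only makes the example reproducible; leaving it out uses the current date.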
I have a bus_date column that has multiple records with different dates, e.g. 2021-03-15, 2021-05-12, 2021-01-15.
I want to calculate the previous year end for every given date. My expected output is 2020-12-31 for all three dates.
I could use the function date_sub(start_date, num_days),
but I don't want to pass num_days manually, since there are millions of rows with different dates.
Can we write a view on a table, or create a dataframe, that calculates the previous year end?
You can use date_add and date_trunc to achieve this.
import pyspark.sql.functions as F
......
data = [
('2021-03-15',),
('2021-05-12',),
('2021-01-15',)
]
df = spark.createDataFrame(data, ['bus_date'])
df = df.withColumn('pre_year_end', F.date_add(F.date_trunc('yyyy', 'bus_date'), -1))
df.show()
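The logic behind this is to truncate each date to January 1 of its own year and then step back one day. The same two steps in plain Python, as a small sketch for checking expected values outside Spark (previous_year_end is a hypothetical helper name, not a Spark function):

```python
from datetime import date, timedelta


def previous_year_end(d):
    """Return December 31 of the year before the given date."""
    # Step 1: truncate to the first day of the date's own year.
    year_start = d.replace(month=1, day=1)
    # Step 2: one day earlier is the last day of the previous year.
    return year_start - timedelta(days=1)


for bus_date in (date(2021, 3, 15), date(2021, 5, 12), date(2021, 1, 15)):
    print(bus_date, previous_year_end(bus_date))
```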
I have a column which has datetime values. Example: 01/17/2020 15:55:00. I want to round the time to the nearest quarter hour (01/17/2020 16:00:00). Note: please don't answer this question using pandas; I want an answer that only uses PySpark.
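Try this; it should work for you.
from pyspark.sql.functions import hour, round, unix_timestamp
# Note: dividing and multiplying by 3600 rounds to the nearest hour;
# using 900 (15 minutes in seconds) instead rounds to the nearest quarter hour.
result = data.withColumn("hour",hour((round(unix_timestamp("date")/3600)*3600).cast("timestamp")))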
Although Spark doesn't have a SQL function that directly truncates a datetime to a quarter hour, we can build the column from a handful of functions.
First, create the DataFrame
from pyspark.sql.functions import current_timestamp
dateDF = spark.range(10)\
.withColumn("today", current_timestamp())
dateDF.show(10, False)
Then, round the minutes to the nearest quarter hour (storing the result in a mins column)
from pyspark.sql.functions import minute, hour, col, round, date_trunc, unix_timestamp, to_timestamp
dateDF2 = dateDF.select(col("today"),
(round(minute(col("today"))/15)*15).cast("int").alias("mins"))
Then, we truncate the timestamp to the hour, convert it to a unix timestamp, add the rounded minutes (converted to seconds), and convert the result back to the timestamp type
dateDF2.select(col("today"), to_timestamp(unix_timestamp(date_trunc("hour", col("today"))) + col("mins")*60).alias("truncated_timestamp")).show(10, False)
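The arithmetic in these steps can be checked in plain Python. The sketch below mirrors them (round the minutes to a multiple of 15, truncate to the hour, add the rounded minutes back); round_to_quarter_hour is a hypothetical helper for illustration, not part of PySpark, and like the steps above it ignores seconds when rounding.

```python
from datetime import datetime, timedelta


def round_to_quarter_hour(ts):
    """Round a datetime to the nearest quarter hour, based on its minute value."""
    # Round the minutes to the nearest multiple of 15 (0, 15, 30, 45 or 60).
    rounded_minutes = int(ts.minute / 15 + 0.5) * 15
    # Truncate to the start of the hour.
    hour_start = ts.replace(minute=0, second=0, microsecond=0)
    # Adding 60 minutes rolls over into the next hour (or day) automatically.
    return hour_start + timedelta(minutes=rounded_minutes)


print(round_to_quarter_hour(datetime(2020, 1, 17, 15, 55, 0)))
# 2020-01-17 16:00:00
```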
Hope this helps
I want to create a timestamp column to create a line chart from two columns containing month and year respectively.
The df looks like this:
I know I can create a string concat and then convert it to a datetime column:
df.select('*',
concat('01', df['month'],
df['year']).alias('date')).withColumn("date",
df['date'].cast(TimestampType()))
But I wanted a cleaner approach using an inbuilt PySpark functionality that can also help me create other date parts, like week number, quarters, etc. Any suggestions?
You will have to concatenate the string once, make the timestamp type column and then you can easily extract week, quarter etc.
You can use this function (and edit it to create whatever other columns you need as well):
import pyspark.sql.functions as F

def spark_date_parsing(df, date_column, date_format):
    """
    Parses the date column given the date format in a spark dataframe
    NOTE: This is a Pyspark implementation
    Parameters
    ----------
    :param df: Spark dataframe having a date column
    :param date_column: Name of the date column
    :param date_format: Simple Date Format (Java-style) of the dates in the date column
    Returns
    -------
    :return: A spark dataframe with a parsed date column
    """
    df = df.withColumn(date_column, F.to_timestamp(F.col(date_column), date_format))
    # Spark returns 'null' if the parsing fails, so first check the count of null values
    # If parse_fail_count = 0, return parsed column else raise error
    parse_fail_count = df.select(
        ([F.count(F.when(F.col(date_column).isNull(), date_column))])
    ).collect()[0][0]
    if parse_fail_count == 0:
        return df
    else:
        raise ValueError(
            f"Incorrect date format '{date_format}' for date column '{date_column}'"
        )
Usage (with whatever is your resultant date format):
df = spark_date_parsing(df, "date", "dd/MM/yyyy")
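To illustrate why a real date value makes the other date parts easy, here is a plain-Python sketch that builds the first day of a month from separate year and month numbers and derives the quarter and ISO week from it. The helper name month_start_parts and the dictionary keys are made up for this example; in Spark the corresponding parts would be extracted from the parsed timestamp column.

```python
from datetime import date


def month_start_parts(year, month):
    """Build the first day of the month and derive a few date parts from it."""
    first_day = date(year, month, 1)
    return {
        "date": first_day,
        # Months 1-3 fall in quarter 1, months 4-6 in quarter 2, and so on.
        "quarter": (month - 1) // 3 + 1,
        # isocalendar() returns (ISO year, ISO week number, ISO weekday).
        "iso_week": first_day.isocalendar()[1],
    }


print(month_start_parts(2020, 5))
```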
Recently I asked how to convert calendar weeks into a list of dates and received a great and most helpful answer:
convert calendar weeks into daily dates
I tried to apply the above method to create a list of dates based on a column with "year - month". Alas, I cannot work out how to account for the different number of days in different months.
I also wonder whether the lubridate package 'automatically' takes leap years into account.
Sample data:
df <- data.frame(YearMonth = c("2016 - M02", "2016 - M06"), values = c(28,60))
M02 = February, M06 = June (M11 would mean November, etc.)
Desired result:
DateList Values
2016-02-01 1
2016-02-02 1
etc
2016-02-28 1
2016-06-01 2
etc
2016-06-30 2
Values would be something like
df$values / days_in_month()
Thanks a million in advance - it is honestly very much appreciated!
I'll leave the parsing of the line to you.
To find the last day of a month, assuming you have GNU date, you can do this:
year=2016
month=02
last_day=$(date -d "$year-$month-01 + 1 month - 1 day" +%d)
echo $last_day # => 29 -- oho, a leap year!
Then you can use a for loop to print out each day.
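For comparison, the same lookup in Python, as a small sketch using the standard library's calendar.monthrange, which takes leap years into account (last_day_of_month is a hypothetical wrapper name):

```python
import calendar


def last_day_of_month(year, month):
    """Return the number of the last day of the given month."""
    # monthrange returns (weekday of the first day, number of days in the month).
    return calendar.monthrange(year, month)[1]


print(last_day_of_month(2016, 2))  # 29, because 2016 is a leap year
print(last_day_of_month(2015, 2))  # 28
```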
Thanks to answer 6 at "Add a month to a Date" and the answer to "how to extract number with leading 0", I got an idea for solving my own question using lubridate. It might not be the most elegant way, but it works.
sample data
data <- data_frame(mon=c("M11","M02"), year=c("2013","2014"), costs=c(200,300))
step 1: create column with number of month
temp2 <- gregexpr("[0-9]+", data$mon)
data$monN <- as.numeric(unlist(regmatches(data$mon, temp2)))
step 2: from year and number of month create a column with the start date
data$StartDate <- as.Date(paste(as.numeric(data$year), formatC(data$monN, width=2, flag="0") ,"01", sep = "-"))
step 3: create a column EndDate as last day of the month based on startdate
data$EndDate <- data$StartDate
day(data$EndDate) <- days_in_month(data$EndDate)
step 4: apply answer from Apply seq.Date using two dataframe columns to create daily list for respective month
data$id <- c(1:nrow(data))
dataL <- setDT(data)[,list(datelist=seq(StartDate, EndDate, by='1 day'), costs= costs/days_in_month(EndDate)) , by = id]
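The four steps above can also be sketched in Python for readers who want to cross-check the result outside R. This is only an illustration under stated assumptions: the label format "2016 - M02" is taken from the question, and the function name expand_year_month is made up.

```python
import calendar
import re
from datetime import date


def expand_year_month(label, total):
    """Expand a label like '2016 - M02' into (date, per-day value) pairs."""
    # Steps 1 and 2: extract the year and the month number from the label.
    match = re.fullmatch(r"\s*(\d{4})\s*-\s*M(\d{2})\s*", label)
    if match is None:
        raise ValueError(f"Unexpected year-month label: {label!r}")
    year, month = int(match.group(1)), int(match.group(2))
    # Step 3: find the length of the month (leap years are handled by calendar).
    n_days = calendar.monthrange(year, month)[1]
    # Step 4: one row per day, with the monthly total spread evenly.
    per_day = total / n_days
    return [(date(year, month, day), per_day) for day in range(1, n_days + 1)]


rows = expand_year_month("2016 - M06", 60)
print(len(rows), rows[0], rows[-1])
```

Because February 2016 has 29 days, expand_year_month("2016 - M02", 28) yields 29 rows with a per-day value slightly below 1.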