I have a bus_date column which has multiple records with different dates, e.g. 2021-03-15, 2021-05-12, 2021-01-15, etc.
I want to calculate the previous year end for each of these dates; my expected output is 2020-12-31 for all three.
I know I can use the function date_sub(start_date, num_days), but I don't want to pass num_days manually, since there are millions of rows with different dates.
Can we write a view from a table, or create a DataFrame, that calculates the previous year end?
You can use date_trunc and date_add to achieve this: truncate each date to the start of its year, then subtract one day.
import pyspark.sql.functions as F

data = [
    ('2021-03-15',),
    ('2021-05-12',),
    ('2021-01-15',)
]
df = spark.createDataFrame(data, ['bus_date'])

# truncate to the first day of the year, then step back one day to get the previous year end
df = df.withColumn('pre_year_end', F.date_add(F.date_trunc('yyyy', 'bus_date'), -1))
df.show()
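Since you mentioned a view: the same expression works in Spark SQL, so you can wrap it in a (temp) view instead of a DataFrame transformation. A rough sketch, assuming the data is exposed as a table or view named bus_table (a hypothetical name):

df.createOrReplaceTempView('bus_table')

spark.sql("""
    CREATE OR REPLACE TEMP VIEW bus_with_prev_year_end AS
    SELECT bus_date,
           date_add(date_trunc('yyyy', bus_date), -1) AS pre_year_end
    FROM bus_table
""")

spark.sql("SELECT * FROM bus_with_prev_year_end").show()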
I have a column which has datetime values, for example 01/17/2020 15:55:00. I want to round the time to the nearest quarter hour (01/17/2020 16:00:00). Note: please don't answer this question using pandas; I want an answer using PySpark only.
Try this, it will work for you:

from pyspark.sql.functions import col, round, unix_timestamp

# convert to seconds, round to the nearest 900 s (15 minutes), and cast back to a timestamp
result = data.withColumn(
    "rounded_date",
    (round(unix_timestamp(col("date")) / 900) * 900).cast("timestamp")
)
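To sanity-check the rounding, here is a quick throwaway example; the column names and the MM/dd/yyyy HH:mm:ss input format are assumptions based on the sample value in the question:

from pyspark.sql.functions import to_timestamp, unix_timestamp, round

data = spark.createDataFrame([("01/17/2020 15:55:00",)], ["raw"]) \
    .withColumn("date", to_timestamp("raw", "MM/dd/yyyy HH:mm:ss"))

result = data.withColumn("rounded_date", (round(unix_timestamp("date") / 900) * 900).cast("timestamp"))
result.show(truncate=False)  # 2020-01-17 15:55:00 -> 2020-01-17 16:00:00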
Although Spark doesn't have a SQL function that truncates a datetime directly to a quarter hour, we can build the column using a handful of functions.
First, create the DataFrame
from pyspark.sql.functions import current_timestamp
dateDF = spark.range(10) \
    .withColumn("today", current_timestamp())

dateDF.show(10, False)
Then, round the minutes to the nearest quarter hour (storing the result in a mins column):
from pyspark.sql.functions import minute, hour, col, round, date_trunc, unix_timestamp, to_timestamp

dateDF2 = dateDF.select(
    col("today"),
    (round(minute(col("today")) / 15) * 15).cast("int").alias("mins")
)
Then, we truncate the timestamp to the hour, convert it to a unix timestamp, add the rounded minutes (as seconds), and convert it back to a timestamp type:
dateDF2.select(
    col("today"),
    to_timestamp(unix_timestamp(date_trunc("hour", col("today"))) + col("mins") * 60).alias("truncated_timestamp")
).show(10, False)
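If you prefer, the two steps can also be collapsed into a single select on the original dateDF; this is just the same logic inlined, using the imports above:

dateDF.select(
    col("today"),
    to_timestamp(
        unix_timestamp(date_trunc("hour", col("today")))
        + (round(minute(col("today")) / 15) * 15).cast("int") * 60
    ).alias("truncated_timestamp")
).show(10, False)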
Hope this helps
I want to search through a column of dates in the format YYYY-MM-DD (column G - in a random order) and sum up all corresponding cost values for all dates in the same month.
So, for example, the total cost for December 2019 would be 200.
My current formula is:
=SUMPRODUCT((MONTH(G2:G6)=12)*(YEAR(G2:G6)=2019)*(H2:H6))
This gives me the total cost for that month correctly, but I cannot work out how to do this without hardcoding the year and month!
How would I do this with a formula (given the two date columns are a different format)?
You can do this easily by combining SUMIFS with EDATE:
SUMIFS function
EDATE function
The formula I've used in cell B2 is:
=SUMIFS($F$2:$F$6;$E$2:$E$6;">="&A2;$E$2:$E$6;"<="&(EDATE(A2;1)-1))
For this formula to work, column A must contain the first day of each month! In cell A2 the value is 01/11/2019, but a number format of mmmm yyyy is applied so it displays as the month name (and the chart will do the same). Note that the formula uses semicolons as argument separators; depending on your regional settings you may need commas instead.
Paste this in cell D2:
=ARRAYFORMULA(QUERY({EOMONTH(G2:G, -1)+1, H2:H},
"select Col1,sum(Col2)
where Col1 is not null
and not Col1 = date '1900-01-01'
group by Col1
label sum(Col2)''
format Col1 'mmm yyyy'", 0))
I want to create a timestamp column to create a line chart from two columns containing month and year respectively.
The df has two columns, month and year.
I know I can concatenate the columns into a string and then convert it to a datetime column:

from pyspark.sql.functions import concat, lit, col
from pyspark.sql.types import TimestampType

df.select('*', concat(lit('01'), df['month'], df['year']).alias('date')) \
  .withColumn('date', col('date').cast(TimestampType()))
But I wanted a cleaner approach using built-in PySpark functionality that can also help me create other date parts, like week number, quarter, etc. Any suggestions?
You will have to concatenate the string once and make a timestamp-type column; after that you can easily extract week, quarter, etc.
You can use this function (and edit it to create whatever other columns you need as well):
import pyspark.sql.functions as F

def spark_date_parsing(df, date_column, date_format):
    """
    Parses the date column given the date format in a Spark dataframe.
    NOTE: This is a PySpark implementation.

    Parameters
    ----------
    :param df: Spark dataframe having a date column
    :param date_column: Name of the date column
    :param date_format: Simple Date Format (Java-style) of the dates in the date column

    Returns
    -------
    :return: A Spark dataframe with a parsed date column
    """
    df = df.withColumn(date_column, F.to_timestamp(F.col(date_column), date_format))

    # Spark returns null if the parsing fails, so first check the count of null values.
    # If parse_fail_count == 0, return the parsed dataframe, else raise an error.
    parse_fail_count = df.select(
        F.count(F.when(F.col(date_column).isNull(), date_column))
    ).collect()[0][0]

    if parse_fail_count == 0:
        return df
    else:
        raise ValueError(
            f"Incorrect date format '{date_format}' for date column '{date_column}'"
        )
Usage (with whatever your resultant date format is):
df = spark_date_parsing(df, "date", "dd/MM/yyyy")
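Once the column is a proper timestamp, the other date parts you mentioned come straight from built-in functions. A small sketch (the new column names are just illustrative):

import pyspark.sql.functions as F

# extract common date parts from the parsed 'date' column
df = (df
      .withColumn("week_of_year", F.weekofyear("date"))
      .withColumn("quarter", F.quarter("date"))
      .withColumn("month", F.month("date"))
      .withColumn("year", F.year("date")))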
I would like to find the difference between a date column in a Spark Dataset and a date value that is not a column.
If both were columns I would do
datediff(col("dateStart"), col("dateEnd"))
As I want to find the difference between col("dateStart") and another date that is not a column, I currently do this:
val dsWithRunDate = detailedRecordsDs.withColumn(RunDate,
  lit(runDate.toString))

val dsWithTimeDiff = dsWithRunDate.withColumn(DateDiff,
  datediff(to_date(col(RunDate)), col(DateCol)))
Is there a better way to do this, instead of adding one more column and then finding the difference?
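For what it's worth, the intermediate column isn't strictly necessary: a literal can be passed straight into datediff. A minimal sketch of that idea in PySpark (the Scala API is analogous; run_date, detailed_records_df and date_col are hypothetical names):

from pyspark.sql.functions import datediff, to_date, lit, col

# compare every row's date column against a single literal date
result = detailed_records_df.withColumn(
    "date_diff",
    datediff(to_date(lit(str(run_date))), col("date_col"))
)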
I have one table with dates and another table with weekly data. My weeks start on Tuesday, and the second table's date is supposed to determine the week (basically, the Tuesday before that date is the start of the week; alternatively, that date is an example day in that week).
How can I join the dates to information about weeks?
Here is the setup:
from datetime import datetime as dt
import pandas as pd

df = pd.DataFrame([dt(2016,2,3), dt(2016,2,8), dt(2016,2,9), dt(2016,2,15)], columns=["day"])
df_week = pd.DataFrame([(dt(2016,2,4), "a"), (dt(2016,2,11), "b")], columns=["week", "val"])
# note the actual starts of the weeks are the Tuesdays: 2.2. and 9.2.
# I expect a new column df["val"] = ["a", "a", "b", "b"]
I've seen pandas date_range, but I cannot see how to do that from there.
You're looking for DatetimeIndex.asof:
This will give you the closest index up to the day in df:
df_week.set_index('week', inplace=True)
df_week.index.asof(df['day'][1])
You can now use it to select the corresponding value:
df_week.loc[df_week.index.asof(df['day'][1])]
Finally, apply it to the entire dataframe:
df = pd.DataFrame([dt(2016,2,8), dt(2016,2,9), dt(2016,2,15)], columns=['day'])
df['val'] = df.apply(lambda row: df_week.loc[df_week.index.asof(row['day'])]['val'], axis=1)
I removed the first value from df because I didn't want to deal with edge cases.
Result:
day val
0 2016-02-08 a
1 2016-02-09 a
2 2016-02-15 b
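As an aside, the same backward lookup can be done without apply using pd.merge_asof, which may be more convenient on larger frames. A sketch, recreating the small frames so it is self-contained (both sides must be sorted on their join keys):

import pandas as pd
from datetime import datetime as dt

df = pd.DataFrame([dt(2016,2,8), dt(2016,2,9), dt(2016,2,15)], columns=["day"])
df_week = pd.DataFrame([(dt(2016,2,4), "a"), (dt(2016,2,11), "b")], columns=["week", "val"])

# for each 'day', take the most recent 'week' that is <= it (direction='backward' is the default)
merged = pd.merge_asof(df.sort_values("day"), df_week.sort_values("week"),
                       left_on="day", right_on="week")
print(merged[["day", "val"]])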