I have a PySpark dataframe with the following columns:
'step', 'type', 'amount', 'Payee_Account', 'Old_Bal_Orig', 'New_Bal_Orig', 'Beneficiary_Account','Old_Bal_Dest', 'New_Bal_Dest', 'day_of_month', 'day_of_week'
where 'step' is a timestamp in hours, starting from 1.
The following code calculates a monthly average for each beneficiary, but I need a weekly average over a month; that is, the range is the month and the averaging period is the week.
df = df.withColumn(
    "Freq_Benef_Acc",
    F.avg("amount").over(
        Window.partitionBy("Beneficiary_Account")
              .orderBy("day_of_month")
              .rangeBetween(-30, Window.currentRow)
    )
)
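One way to read "the range is the month and the averaging period is the week" is to average each beneficiary's weekly totals over a trailing four-week window. A minimal sketch of that interpretation (the week_idx column, the 168-hour week derived from step, and the four-week range are assumptions, not part of the original code):

from pyspark.sql import functions as F, Window

# Derive a week index from 'step' (hours starting at 1): 168 hours per week.
weekly = (
    df.withColumn("week_idx", F.floor((F.col("step") - 1) / 168))
      .groupBy("Beneficiary_Account", "week_idx")
      .agg(F.sum("amount").alias("weekly_amount"))
)

# Average the weekly totals over the current week plus the three previous ones (~ one month).
w = (Window.partitionBy("Beneficiary_Account")
           .orderBy("week_idx")
           .rangeBetween(-3, Window.currentRow))
weekly = weekly.withColumn("Freq_Benef_Acc", F.avg("weekly_amount").over(w))

# Join the weekly figure back onto the transaction-level rows if needed.
df = (df.withColumn("week_idx", F.floor((F.col("step") - 1) / 168))
        .join(weekly.select("Beneficiary_Account", "week_idx", "Freq_Benef_Acc"),
              on=["Beneficiary_Account", "week_idx"], how="left"))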
I have a bus_date column which has multiple records with different dates, e.g. 2021-03-15, 2021-05-12, 2021-01-15, etc.
I want to calculate the previous year end for all given dates; my expected output is 2020-12-31 for all three dates.
I know I can use the function date_sub(start_date, num_days), but I don't want to pass num_days manually, since there are millions of rows with different dates.
Can we write a view from a table, or create a dataframe, that will calculate the previous year end?
You can use date_add and date_trunc to achieve this.
import pyspark.sql.functions as F
# ......
data = [
('2021-03-15',),
('2021-05-12',),
('2021-01-15',)
]
df = spark.createDataFrame(data, ['bus_date'])
df = df.withColumn('pre_year_end', F.date_add(F.date_trunc('yyyy', 'bus_date'), -1))
df.show()
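If you want the view the question asks about rather than a dataframe column, here is a minimal sketch using a Spark temporary view; the names bus_dates and bus_dates_with_pye are made up for illustration:

df.createOrReplaceTempView("bus_dates")
spark.sql("""
    CREATE OR REPLACE TEMPORARY VIEW bus_dates_with_pye AS
    SELECT bus_date,
           -- first day of the year, minus one day = previous year end
           date_add(trunc(bus_date, 'YEAR'), -1) AS pre_year_end
    FROM bus_dates
""")
spark.sql("SELECT * FROM bus_dates_with_pye").show()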
I have a salesamount column and a saledate column in my table, and I need to calculate total sales for each month based on the calculation below.
total sales for the current month = sum(salesamount) as of the 1st day of the next month
For example, the sales for December 2021 should be calculated based on the total sales as of Jan-1-2022, and the total sales for January should be blank until Feb-1-2022, since it should be the total sales as of Feb-1-2022. I am confused about how to achieve this in DAX. I'd really appreciate your help.
If I understand your question correctly, you can use the following DAX measure:
Total Sales =
VAR currentDate = MAX(myTable[saledate])
VAR firstOfMonth =
    DATE(
        YEAR(CALCULATE(MAX(myTable[saledate]), ALL(myTable))),
        MONTH(CALCULATE(MAX(myTable[saledate]), ALL(myTable))),
        1
    )
VAR result = SUM(myTable[salesamount])
RETURN
    IF(currentDate < firstOfMonth, result)
This takes the latest sale date in the current report context and compares it to the 1st of the month of the latest date in the entire table. If currentDate is earlier than that first-of-month date, the sum is returned; otherwise the measure is blank.
If your dataset has a Date table, you can replace the 'myTable[saledate]' references with a reference to the Date column in your date table.
With some sample data, here are the results:
(I added the firstOfMonth column for demonstration purposes)
I need help making a division with a relative date range.
For example:
Date Range: 01-01-2015 & 20-01-2015
Division: 1.209,812 / 1.207,810 = 1,0016575
You can do this with a combination of LOD calcs and a Context Filter.
Select Add to Context on your relative range date filter.
Create Level of Detail calcs to isolate the min and max dates based on your filter:
{max([Order Date])}
and
{min([Order Date])}
Create min and max values based on your measures. I'm using the Superstore data set in this example. The calculation states if the date equals the max date, return the Sales value. Repeat for min values.
if [Order Date] = [max date] then [Sales] end
You should have something like this:
Now just create a division between the max and min. You'll need to remove the date from the view for the calculation to render.
sum([max value]) / sum([min value])
See attached sample workbook if needed. https://www.dropbox.com/s/1ed15pwhihmjkdv/181227%20stack%20question.twbx?dl=0
I have a timestamp column with values in epoch seconds (e.g. min value = 1276570880, max value = 1276657260). How do I group records in my Hive table into 30-minute intervals?
I need to count values for every 30 minutes, starting from the minimum timestamp value up to the maximum timestamp value in the column.
I have tried the following query, but it does not return any results.
SELECT COUNT(method) AS mycount,
       FROM_UNIXTIME(floor(UNIX_TIMESTAMP(ts) / 1800) * 1800)
FROM http
WHERE ts >= '2010-06-14 20:01:20'
  AND ts <= '2010-06-14 22:01:20'
  AND method = 'GET'
GROUP BY FROM_UNIXTIME(floor(UNIX_TIMESTAMP(ts) / 1800) * 1800)
This should work. Using round on the timestamps is important for the grouping to work correctly. Here is a SQL Fiddle that shows your specific example.
select count(method) as mycount,
       from_unixtime(cast(round(ts / 1800) * 1800 as bigint)) as interval_start
from http
-- ts holds epoch seconds, so compare it against epoch values, not date strings
where ts >= unix_timestamp('2010-06-14 20:01:20')
  and ts <= unix_timestamp('2010-06-14 22:01:20')
  and method = 'GET'
group by from_unixtime(cast(round(ts / 1800) * 1800 as bigint))
I have one table with dates and another table that holds weekly data. My weeks start on Tuesday, and the second table's date is supposed to determine the week (basically, the Tuesday before that date is the start of the week; alternatively, that date is an example day within that week).
How can I join the dates to information about weeks?
Here is the setup:
from datetime import datetime as dt
import pandas as pd
df = pd.DataFrame([dt(2016,2,3), dt(2016,2,8), dt(2016,2,9), dt(2016,2,15)], columns=["day"])
df_week=pd.DataFrame([(dt(2016,2,4),"a"), (dt(2016,2,11),"b")], columns=["week", "val"])
# note the actual start of the weeks are the Tuesdays: 2.2., 9.2.
# I expect a new column df["val"]=["a", "a", "b", "b"]
I've looked at pandas date_range, but I can't see how to get from there to what I need.
You're looking for DatetimeIndex.asof:
This gives you the closest index value at or before a given day in df:
df_week.set_index('week', inplace=True)
df_week.index.asof(df['day'][1])
You can now use it to select the corresponding value:
df_week.loc[df_week.index.asof(df['day'][1])]
Finally, apply it to the entire dataframe:
df = pd.DataFrame([dt(2016,2,8), dt(2016,2,9), dt(2016,2,15)], columns=['day'])
df['val'] = df.apply(lambda row: df_week.loc[df_week.index.asof(row['day'])]['val'], axis=1)
I removed the first value from df because I didn't want to deal with edge cases.
Result:
day val
0 2016-02-08 a
1 2016-02-09 a
2 2016-02-15 b
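As a vectorized alternative to the row-wise apply, pd.merge_asof can do the same backward-looking match in one call. A minimal sketch starting from freshly built frames (df2 and df_week2 are illustrative names, created before any set_index):

# Rebuild the frames for a standalone example.
df2 = pd.DataFrame([dt(2016, 2, 8), dt(2016, 2, 9), dt(2016, 2, 15)], columns=["day"])
df_week2 = pd.DataFrame([(dt(2016, 2, 4), "a"), (dt(2016, 2, 11), "b")], columns=["week", "val"])

# merge_asof matches each 'day' to the most recent 'week' at or before it;
# both frames must be sorted on their join keys.
merged = pd.merge_asof(df2.sort_values("day"), df_week2.sort_values("week"),
                       left_on="day", right_on="week", direction="backward")
print(merged[["day", "val"]])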