Populate variable using datadiff in pyspark - pyspark

I'm a long-time T-SQL user and new to Python. I inherited a project in which one of the processes hard-codes a number instead of computing it. The value is simply the number of months between two dates; once assigned to a variable, the integer is used in a later calculation. The problem is that every solution I have found uses months_between() inside a dataframe. The value is calculated correctly, but the downstream process needs a plain integer as input, not a dataframe.
In SQL, I would have written:
DECLARE @varMonths INT
SET @varMonths = DATEDIFF(mm, date1, date2)
In Python I tried:
elig_endnum2 = spark.sql("select round(months_between(current_date(), date('2007-01-01')),0)")
If someone could provide me a little direction and a link to a resource on the appropriate way to solve this, I'd be grateful.

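from pyspark.sql.functions import months_between
input_1 = '2022-02-14'
input_2 = '2007-01-01'
df = spark.createDataFrame([(input_1, input_2)], ['date1', 'date2'])
# collect() returns the single row to the driver; [0][0] is its only value.
# int() truncates the fractional part that months_between() returns.
datadiff = int(df.select(months_between(df.date1, df.date2).alias('months')).collect()[0][0])
datadiff
Out[32]: 181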
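If both dates are already known on the driver, the integer can also be computed without a dataframe. The sketch below uses only the standard library and counts month boundaries crossed, which is how T-SQL's DATEDIFF(mm, ...) behaves. It ignores the day of the month, so it can differ from a rounded months_between() result. The function name month_boundaries is made up for this illustration.

```python
from datetime import date


def month_boundaries(start: date, end: date) -> int:
    """Count month boundaries crossed between start and end (day of month is ignored)."""
    return (end.year - start.year) * 12 + (end.month - start.month)


# Same dates as in the answer above.
var_months = month_boundaries(date(2007, 1, 1), date(2022, 2, 14))
print(var_months)  # 181
```

The result is an ordinary Python int, so it can be passed straight to the downstream calculation.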

Related

How can I add a dynamic date column on Power BI

I have a question about Power BI. I have a date column (HIREDATE), and my task is to add a column that calculates the difference between the current date and the date in that column, so it needs to be dynamic.
I watched some YouTube videos but haven't found a case like mine, even though I think it's a common task.
Following a YouTube tutorial and the Microsoft website, I added a custom column named Experience with the following code:
= Duration.ToRecord ( YEAR(TODAY()) - [#"Date d'embauche"]) /* Date d'embauche = HIREDATE in french*/
The editor shows:
No syntax errors have been detected.
But when I click OK, I get this:
Expression.Error: The name 'YEAR' wasn't recognized. Make sure it's spelled correctly.
Please help me solve this.
For Power Query you need DateTime.FixedLocalNow() to get the current date and time, then wrap that call in Date.Year to extract the year, which gives the following:
Date.Year(DateTime.FixedLocalNow()) - Date.Year([#"Date d'embauche"])
In this example the formula is used in a custom column to give the time difference in years.
It is usually best to do this sort of transformation in Power Query, before the data reaches the data model.
You are probably adding the custom column in Power Query. I would instead recommend adding a calculated column with the following DAX expression:
Date difference =
YEAR ( TODAY () ) - YEAR ( 'Table Name'[HIREDATE] )
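Both answers subtract calendar years, so the result counts year boundaries rather than completed years. The Python sketch below shows the same arithmetic next to a version that checks whether the anniversary has passed. The dates and function names are hypothetical and only illustrate the difference.

```python
from datetime import date


def year_difference(hire_date: date, today: date) -> int:
    """Calendar-year subtraction, the same arithmetic as the formulas above."""
    return today.year - hire_date.year


def completed_years(hire_date: date, today: date) -> int:
    """Full years elapsed: subtract one if this year's anniversary has not happened yet."""
    before_anniversary = (today.month, today.day) < (hire_date.month, hire_date.day)
    return today.year - hire_date.year - int(before_anniversary)


hired = date(2015, 11, 30)      # hypothetical hire date
as_of = date(2024, 3, 1)        # hypothetical "today"
print(year_difference(hired, as_of))   # 9
print(completed_years(hired, as_of))   # 8
```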

Using SELECT IF to recode one date from four variables in SPSS

Self-taught at SPSS here. Need to know the appropriate syntax to recode four DATE variables into one, based on which would be the latest date. I have four DATE variables in a dataset with 165 cases:
wnd_heal_date
wnd_heal_d14_date
wnd_heal_d30_date
wnd_heal_3m_date
And each variable may or may not contain a value for each case. I want to recode a new variable which scans the dates from all four and only selects the one that is the latest and puts it into a new variable (x_final_wound_heal_date).
How to use the SELECT IF function for this purpose?
SELECT IF selects rows (cases) in the data, so it is not appropriate here. What you can do instead is this:
compute x_final_wound_heal_date =
max(wnd_heal_date, wnd_heal_d14_date, wnd_heal_d30_date, wnd_heal_3m_date).
VARIABLE LABELS x_final_wound_heal_date 'Time to definitive wound healing (days)'.
VARIABLE LEVEL x_final_wound_heal_date(SCALE).
ALTER TYPE x_final_wound_heal_date(DATE11).
This will put the latest of the available date values in the new variable.
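For readers more familiar with Python, the same idea can be sketched as taking the maximum over whichever dates are present. Missing values are modelled as None here, and the sample case is invented for illustration.

```python
from datetime import date
from typing import Optional


def latest_date(*dates: Optional[date]) -> Optional[date]:
    """Return the latest non-missing date, or None when every value is missing."""
    present = [d for d in dates if d is not None]
    return max(present) if present else None


# Hypothetical case: the day-14 and 3-month visits have no recorded date.
case = {
    "wnd_heal_date": date(2021, 3, 5),
    "wnd_heal_d14_date": None,
    "wnd_heal_d30_date": date(2021, 4, 2),
    "wnd_heal_3m_date": None,
}
x_final_wound_heal_date = latest_date(*case.values())
print(x_final_wound_heal_date)  # 2021-04-02
```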

Azure Data Factory : returns an array of dates from a specified range

I'm trying to return an array of dates in Data Factory. I just want the user to specify a date range with two parameters, startDate and endDate.
For example, specifying "12-08-2020" and "12-13-2020" in the trigger should return this array:
["12-08-2020","12-09-2020","12-10-2020","12-11-2020","12-12-2020","12-13-2020"]
I have not found a simple way to do it yet.
One way I thought of would be:
add a Lookup activity on a date dimension,
then add two filters to select only items greater than startDate and lower than endDate.
But this seems cumbersome and overkill. Is there a simpler way to do it?
EDIT :
This answer seems to be relevant (i did not see it at first) : Execute azure data factory foreach activity with start date and end date
I think we can use a recursive query in a Lookup activity.
The idea is as follows.
In SQL, this query returns the dates as a table:
;with temp as
(
select CONVERT(varchar(100),'12-08-2020', 110) as dt
union all
select CONVERT(varchar(100), DATEADD(day,1,dt), 110) from temp
where datediff(day,CONVERT(varchar(100), DATEADD(day,1,dt), 110),'12-13-2020')>=0
) select * from temp
The result is as follows:
So in ADF, I think a Lookup activity with a SQL query can return the result you want.
According to the official documentation, we only need to replace the literal dates in the SQL statement with parameters.
Next, I use '@{pipeline().parameters.startDate}' to return the date string. Note the pair of single quotes around the expression.
I set two parameters as follows:
Type the following code into a Lookup activity.
;with temp as
(
select CONVERT(varchar(100),'@{pipeline().parameters.startDate}', 110) as dt
union all
select CONVERT(varchar(100), DATEADD(day,1,dt), 110) from temp
where datediff(day,CONVERT(varchar(100), DATEADD(day,1,dt), 110),'@{pipeline().parameters.endDate}')>=0
) select * from temp
Do not select the "First row only" option.
The debug result is as follows:
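The logic of the recursive query can be sketched in Python to show which rows it is expected to produce: start from startDate, keep adding one day, and stop once the next day would pass endDate. This only illustrates the logic and is not something that runs inside Data Factory; the function name date_range is made up, and the MM-dd-yyyy format is taken from the question.

```python
from datetime import datetime, timedelta
from typing import List


def date_range(start_date: str, end_date: str, fmt: str = "%m-%d-%Y") -> List[str]:
    """Inclusive list of dates from start_date to end_date, formatted like the inputs."""
    current = datetime.strptime(start_date, fmt)
    last = datetime.strptime(end_date, fmt)
    dates = []
    while current <= last:          # same role as the datediff(...) >= 0 check
        dates.append(current.strftime(fmt))
        current += timedelta(days=1)
    return dates


print(date_range("12-08-2020", "12-13-2020"))
# ['12-08-2020', '12-09-2020', '12-10-2020', '12-11-2020', '12-12-2020', '12-13-2020']
```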
I had a similar use case and ended up using an Until activity with a few changes.
The pipeline takes two parameters, start_day and end_day.
Two variables are also needed to implement the counter logic; more details can be found at how-to-increment-parameter-data-factory.
Finally, the expression in the Until block is:
@less(int(adddays(pipeline().parameters.end_day, 0, 'yyyyMMdd')), int(adddays(pipeline().parameters.start_day, int(variables('counter')), 'yyyyMMdd')))
Note that the Until block keeps executing while the expression returns false and exits the loop once it returns true.
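As a rough illustration of that loop, here is a Python sketch. It assumes the loop body runs once per day and increments the counter before the exit condition is checked, and that the parameters are ISO-formatted strings. Neither detail is stated in the answer, and the names run_until and yyyymmdd are invented.

```python
from datetime import datetime, timedelta
from typing import List


def yyyymmdd(day: datetime) -> int:
    """Format a date as yyyyMMdd and read it as an integer, as the expression does."""
    return int(day.strftime("%Y%m%d"))


def run_until(start_day: str, end_day: str) -> List[str]:
    start = datetime.strptime(start_day, "%Y-%m-%d")
    end = datetime.strptime(end_day, "%Y-%m-%d")
    counter = 0
    processed = []
    while True:
        # Loop body: work on start_day + counter, then increment the counter.
        processed.append((start + timedelta(days=counter)).strftime("%Y-%m-%d"))
        counter += 1
        # Exit condition, mirroring less(end_day, start_day + counter).
        if yyyymmdd(end) < yyyymmdd(start + timedelta(days=counter)):
            break
    return processed


print(run_until("2020-12-08", "2020-12-13"))
```

Comparing the dates as yyyyMMdd integers works because that format sorts numerically in the same order as the dates themselves.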
I managed to get something similar to work with a derived column transformation that uses the mapLoop() function, followed by a flatten transformation.
The derived column expression first builds an array of dates in a single column:
mapLoop(toInteger((To_Date - From_Date)/86400000), toDate(addDays(From_Date, #index)))
where 86400000 is the number of milliseconds in 24 hours.
The flatten transformation then uses this column to unroll the array into separate rows.

DAX: Distinct and then aggregate twice

I'm trying to create a Measure in Power BI using DAX that achieves the below.
The data set has four columns, Name, Month, Country and Value. I have duplicates so first I need to dedupe across all four columns, then group by Month and sum up the value. And then, I need to average across the Month to arrive at a single value. How would I achieve this in DAX?
I figured it out. The reply by @OscarLar was very close, but the nested SUMMARIZE causes problems because it cannot aggregate values calculated dynamically within the query itself (https://www.sqlbi.com/articles/nested-grouping-using-groupby-vs-summarize/).
I kept the inner SUMMARIZE from @OscarLar's answer and replaced the outer SUMMARIZE with a GROUPBY. Here's the code that worked:
AVERAGEX(
    GROUPBY(
        SUMMARIZE(Data, Data[Name], Data[Month], Data[Country], Data[Value]),
        Data[Month],
        "Month_Value", SUMX(CURRENTGROUP(), Data[Value])
    ),
    [Month_Value]
)
Not sure I completely understood the question, since you didn't provide example data or any DAX code you've already tried. Please do so next time.
I'm assuming parts of this cannot (for whatever reason) be done in Power Query, so you have to use DAX. In that case I think this will do what you described.
Create a temporary table called Data_reduced in which duplicate rows have been removed:
Data_reduced =
SUMMARIZE(
'Data';
[Name];
[Month];
[Country];
[Value]
)
Then create the averaging measure like this
AveragePerMonth =
AVERAGEX(
SUMMARIZE(
'Data_reduced';
'Data_reduced'[Month];
"Sum_month"; SUM('Data_reduced'[Value])
);
[Sum_month]
)
Where Data is the name of the table.
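To check a measure like this against known numbers, the three steps (dedupe on all four columns, sum per month, average the monthly sums) can be reproduced outside Power BI. Below is a minimal Python sketch with invented sample rows; the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Invented sample rows: (Name, Month, Country, Value)
rows = [
    ("Ann", "Jan", "US", 10),
    ("Ann", "Jan", "US", 10),  # exact duplicate, should count once
    ("Bob", "Jan", "UK", 20),
    ("Ann", "Feb", "US", 5),
    ("Bob", "Feb", "UK", 15),
    ("Bob", "Feb", "UK", 15),  # exact duplicate, should count once
]


def average_of_monthly_sums(rows):
    unique_rows = set(rows)                     # step 1: dedupe across all four columns
    month_totals = defaultdict(float)
    for _name, month, _country, value in unique_rows:
        month_totals[month] += value            # step 2: sum the value per month
    return mean(list(month_totals.values()))    # step 3: average the monthly sums


print(average_of_monthly_sums(rows))  # 25.0
```

With these rows the monthly sums are 30 for Jan and 20 for Feb, so the expected result is 25.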

K2 blackpearl: Create DateTime[] from data fields

I cannot figure out how to use the DateTime Maximum function with multiple data fields. I have 2 (DateTime) data fields that I want to get the maximum value.
The "values" field only lets you type a literal or drop in ONE field. How can I create an array of my 2 fields?
Thank you!
Ob
I think the problem you are having is that the input is not a single date but an array of dates: DateTime[]
I don't think this function is meant for finding the later of two dates held in two data fields.
The way this function works is that you need to give it an array, and of course the second parameter is a date.
To do so, you can use a SmartObject that returns a list of DateTime values (when you call the list method, make sure you don't return just the first item, which is the default).
The function will then work fine and tell you which of the dates in this list is the 'maximum'.
Now, if you really have to use the dates in those data fields, you will first need to convert the two DateTime values into an array of DateTime. Unfortunately, I am not aware of a function that can do that (I may be wrong...).
I see 3 options nonetheless:
1. Write a custom function that does just that: takes 2 inputs and returns an array of them (there are Knowledge Base articles explaining how to do this).
2. Use a stored procedure (called through a SmartObject) that converts your 2 values into an array, which you can then pass to the function you mentioned.
3. Simplify even further by letting the stored procedure find the maximum for you. See below.
A simple version for #3 would be:
declare @D1 datetime
declare @D2 datetime
SET @D1 = getdate()
SET @D2 = getdate()+100
if (@D1>@D2)
select @D1
else
select @D2
I hope this helps.