PySpark Error using partition over a ranked column - pyspark

I have two Spark Dataframes, the first one contains information related to Events as follows:
Id  User_id  Date
1   1        2021-08-15
2   2        2020-03-10
The second Dataframe contains information related to previous Purchase as below:
Id  User_id  Date
1   1        2021-07-15
2   1        2021-07-10
3   1        2021-04-12
4   2        2020-02-10
What I would like to know is how to get the number of purchases each user made in the 90 days prior to the event date.
The code I'm using is:
(events.join(purchase,
             on = [events.User_id == purchase.User_id,
                   events.Date >= purchase.Date],
             how = "left")
    .withColumn('rank_test', rank().over(W.partitionBy(purchase['User_id']).orderBy(col("Date").desc())))
    .withColumn('is90days', when(floor((events["Date"].cast('long') - purchase["Date"].cast('long'))/86400) <= 90, 1).otherwise(0))
    .where(col('is90days') == 1)
    .withColumn('maxPurchase', max('rank_test').over(W.partitionBy(events['ID'])))
    .where(col('rank_test') == col('maxPurchase'))
)
But I'm getting the following error:
AttributeError: 'str' object has no attribute 'over'
What I was expecting is a table as follows:
Id  User_id  Date        qtyPurchasePast90days
1   1        2021-08-15  2
2   2        2020-03-10  1
I appreciate your time in helping me!
Regards

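On line 8 of your code, max resolves to Python's built-in max rather than Spark's max function. The built-in returns the largest character of the string 'rank_test', which is 't', and a str has no over method, hence the AttributeError. The usual cause is that you (like many others) import Spark functions in a way that is not recommended: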
# DON'T do this
from pyspark.sql.functions import *
max('rank_test') # 't'
# DO this
from pyspark.sql import functions as F
F.max('rank_test') # Column<'max(rank_test)'>
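To see where the message comes from, here is a small plain-Python sketch (no Spark needed). It assumes that the name max in the question's session pointed at the built-in, which is what the error text suggests; builtins.max is used here only to make that explicit.

```python
import builtins

# Python's built-in max treats the string as a sequence of characters
# and returns the largest one by code point.
largest_char = builtins.max('rank_test')
print(largest_char)

# Calling .over on that str reproduces the message from the question.
try:
    largest_char.over(None)
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

With F.max('rank_test') the result is a Spark Column, which does have an over method, so the window expression works.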

Related

How do I create new columns containing lead values generated from another column in Pyspark?

The following code pulls down daily oil prices (dcoilwtico), resamples the daily figures to monthly, calculates the 12-month (i.e. year over year percent) change and finally contains a loop to shift the YoY percent change ahead 1 month (dcoilwtico_1), 2 months (dcoilwtico_2) all the way out to 12 months (dcoilwtico_12) as new columns:
import datetime
import pandas as pd
import pandas_datareader as pdr
start = datetime.datetime(2016, 1, 1)
end = datetime.datetime(2022, 12, 1)
#1. Get historic data
df_fred_daily = pdr.DataReader(['DCOILWTICO'],'fred', start, end).dropna().resample('M').mean() # Pull daily, remove NaN and collapse from daily to monthly
df_fred_daily.columns= df_fred_daily.columns.str.lower()
#2. Expand df range: index, column names
index_fred = pd.date_range('2022-12-31', periods=13, freq='M')
columns_fred_daily = df_fred_daily.columns.to_list()
#3. Append history + empty df
df_fred_daily_forecast = pd.DataFrame(index=index_fred, columns=columns_fred_daily)
df_fred_test_daily=pd.concat([df_fred_daily, df_fred_daily_forecast])
#4. New df, calculate yoy percent change for each commodity
df_fred_test_daily_yoy= ((df_fred_test_daily - df_fred_test_daily.shift(12))/df_fred_test_daily.shift(12))*100
#5. Extend each variable as a series from 1 to 12 months
for col in df_fred_test_daily_yoy.columns:
    for i in range(1,13):
        df_fred_test_daily_yoy["%s_%s"%(col,i)] = df_fred_test_daily_yoy[col].shift(i)
df_fred_test_daily_yoy.tail(18)
And produces the following df:
Question: My real world example contains hundreds of columns and I would like to generate these same results using Pyspark.
How would this be coded using Pyspark?
Since your pandas code already works, I would use Koalas, a pandas-like API on top of Spark (since Spark 3.2 the same API ships with PySpark as pyspark.pandas). You just need to install it: https://pypi.org/project/koalas/
Here is a simple example:
import databricks.koalas as ks
import pandas as pd
pdf = pd.DataFrame({'x':range(3), 'y':['a','b','b'], 'z':['a','b','b']})
# Create a Koalas DataFrame from pandas DataFrame
df = ks.from_pandas(pdf)
# Rename the columns
df.columns = ['x', 'y', 'z1']
# Do some operations in place:
df['x2'] = df.x * df.x

Match dates using two date columns as range

I am trying to create a column in Databricks using PySpark. I need to check whether a date column falls between two other date columns: if it does, the value should be 1, otherwise 0. I want to call this column the ground truth, since it tells me whether the reading date lies between the two event dates. This is what I have so far:
df = (df
    .withColumn("Ground_truth_IE", when(col("ReadingDateTime").between(col("EventStartDateTime") & col("EventEndDateTime")), 1).otherwise(0)
    )
)
But I continue to get an error:
TypeError: between() missing 1 required positional argument: 'upperBound'
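between() in PySpark takes two separate arguments: between(lowerBound, upperBound). Combining the two bounds with & passes a single argument, which is why upperBound is reported as missing.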
df = df.withColumn("Ground_truth_IE", when(col("ReadingDateTime")\
.between(col("EventStartDateTime"),col("EventEndDateTime")), 1).otherwise(0))
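between() is inclusive on both ends, so a reading exactly at the start or end of the event still counts. A plain-Python sketch of the same rule, assuming made-up timestamps and a hypothetical helper name ground_truth:

```python
from datetime import datetime

def ground_truth(reading, event_start, event_end):
    """Return 1 if reading lies in [event_start, event_end], else 0."""
    return 1 if event_start <= reading <= event_end else 0

# Hypothetical event window and readings, for illustration only.
start = datetime(2022, 1, 1, 8, 0)
end = datetime(2022, 1, 1, 9, 0)
readings = [
    datetime(2022, 1, 1, 8, 0),   # exactly at the lower bound
    datetime(2022, 1, 1, 8, 30),  # inside the window
    datetime(2022, 1, 1, 9, 30),  # after the window
]
flags = [ground_truth(r, start, end) for r in readings]
print(flags)  # [1, 1, 0]
```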

PySpark: How to concatenate two dataframes without duplicates rows?

I'd like to concatenate two dataframes A, B to a new one without duplicate rows (if rows in B already exist in A, don't add):
Dataframe A:
A B
0 1 2
1 3 1
Dataframe B:
A B
0 5 6
1 3 1
I wish to merge them such that the final DataFrame is of the following shape:
Final Dataframe:
A B
0 1 2
1 3 1
2 5 6
How can I do this?
pyspark.sql.DataFrame.union and pyspark.sql.DataFrame.unionAll seem to yield the same result with duplicates.
Instead, you can get the desired output by using direct SQL:
dfA.createTempView('dataframea')
dfB.createTempView('dataframeb')
aunionb = spark.sql('select * from dataframea union select * from dataframeb')
Using SQL produces the expected/correct result.
To remove duplicate rows, just use union() followed by distinct().
This is mentioned in the documentation:
http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html
"union(other)
Return a new DataFrame containing union of rows in this frame and another frame.
This is equivalent to UNION ALL in SQL. To do a SQL-style set union (that does deduplication of elements), use this function followed by a distinct."
You just have to drop duplicates after the union.
df = dfA.union(dfB).dropDuplicates()
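The difference between UNION ALL and a deduplicating union can be sketched in plain Python on the example rows. This only illustrates the semantics and is not Spark code; note also that Spark makes no promise about row order after distinct() or dropDuplicates(), whereas this sketch happens to keep first-seen order.

```python
# Rows of dataframes A and B from the question, as (A, B) tuples.
rows_a = [(1, 2), (3, 1)]
rows_b = [(5, 6), (3, 1)]

# Like union(): plain concatenation, duplicates are kept.
union_all = rows_a + rows_b

# Like distinct() / dropDuplicates(): each row appears once.
deduplicated = list(dict.fromkeys(union_all))

print(union_all)     # [(1, 2), (3, 1), (5, 6), (3, 1)]
print(deduplicated)  # [(1, 2), (3, 1), (5, 6)]
```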

Remove from dataframe unique rows

I have the following problem: I need to remove from a dataframe the rows whose value in column A is unique.
In the example DF1 below, rows 0 and 3 should be removed.
A B C
0 5 100 5
1 1 200 5
2 1 150 4
3 3 500 5
The only solution I have thought of so far is:
groupby(A)
count rows in each group
keep only the groups with count > 1
save result into DF2
DF1.intersect(DF2)
Any other ideas? A solution for RDDs would also help, but one for DataFrames is preferred.
Thanks!
A more condensed syntax (following the same approach):
df=sqlContext.createDataFrame([[5,100,5],[1,200,5],[1,150,4],[3,500,5]],['A','B','C'])
df.registerTempTable('df') # Making SQL queries possible
df_t=sqlContext.sql('select A,count(B) from df group by A having count(B)>1') # steps 1 to 4 in one statement: values of A that occur more than once
df2=df.join(df_t,df.A==df_t.A,'leftsemi') # only keep records that have a matching key
Some people refer to 'leftsemi' as 'left keep'. It keeps the records of df whose key also exists in df_t, so the rows with a unique value of A are dropped.
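For intuition, the group-count-then-semi-join approach can be sketched in plain Python on the same four rows. This follows the goal stated in the question (drop rows whose A value is unique); the variable names are only illustrative.

```python
from collections import Counter

# Rows (A, B, C) from the question's example.
rows = [(5, 100, 5), (1, 200, 5), (1, 150, 4), (3, 500, 5)]

# Group by A and count the rows in each group.
counts = Counter(a for a, _, _ in rows)

# Keys that occur more than once (the "having count > 1" step).
repeated_keys = {a for a, n in counts.items() if n > 1}

# Semi-join: keep a row only if its key is among the repeated keys.
kept = [row for row in rows if row[0] in repeated_keys]
print(kept)  # [(1, 200, 5), (1, 150, 4)]
```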

Min value with GROUP BY in Power BI Desktop

id datetime new_column datetime_rankx
1 12.01.2015 18:10:10 12.01.2015 18:10:10 1
2 03.12.2014 14:44:57 03.12.2014 14:44:57 1
2 21.11.2015 11:11:11 03.12.2014 14:44:57 2
3 01.01.2011 12:12:12 01.01.2011 12:12:12 1
3 02.02.2012 13:13:13 01.01.2011 12:12:12 2
3 03.03.2013 14:14:14 01.01.2011 12:12:12 3
I want to create a new column that holds, for each row, the minimum datetime value within that row's id group.
How could I do it in Power BI desktop using DAX query?
Use this expression:
NewColumn =
CALCULATE(
    MIN(Table[datetime]),
    FILTER(Table, Table[id] = EARLIER(Table[id]))
)
In Power BI using a table with your data it will produce this:
UPDATE: Explanation and EARLIER function usage.
Basically, the EARLIER function gives you access to values from a different (outer) row context.
A calculated column is evaluated once for every row of the table, which gives the outer row context. FILTER then iterates over the whole table again and evaluates every row against the filter condition, which creates a second, inner row context.
So we have two row contexts: the outer one from the calculated column and the inner one created by FILTER. FILTER uses EARLIER to reach the outer row context. As a result, for every row in the outer context, FILTER returns the set of rows that share the current row's id.
If you have a programming background, it may help to think of it as a nested loop.
I hope this Python code conveys the main idea:
outer_context = ['row1','row2','row3','row4']
inner_context = ['row1','row2','row3','row4']
for outer_row in outer_context:
    for inner_row in inner_context:
        if inner_row == outer_row: #this line is what the FILTER and EARLIER do
            #Calculate the min datetime using the filtered rows
            ...
    ...
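A concrete, runnable version of that nested loop, using the rows from the question, could look like this. It is only a sketch of the idea in plain Python, not of how the DAX engine actually evaluates the expression; FMT and the variable names are illustrative.

```python
from datetime import datetime

FMT = "%d.%m.%Y %H:%M:%S"
rows = [
    (1, "12.01.2015 18:10:10"),
    (2, "03.12.2014 14:44:57"),
    (2, "21.11.2015 11:11:11"),
    (3, "01.01.2011 12:12:12"),
    (3, "02.02.2012 13:13:13"),
    (3, "03.03.2013 14:14:14"),
]
table = [(row_id, datetime.strptime(text, FMT)) for row_id, text in rows]

new_column = []
for outer_id, _ in table:                      # outer row context
    matching = [dt for inner_id, dt in table   # inner iteration, like FILTER
                if inner_id == outer_id]       # Table[id] = EARLIER(Table[id])
    new_column.append(min(matching))           # MIN(Table[datetime])

for (row_id, dt), earliest in zip(table, new_column):
    print(row_id, dt.strftime(FMT), earliest.strftime(FMT))
```

Each printed line carries the id, the row's own datetime, and the earliest datetime for that id, which matches the new_column values in the question's table.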
UPDATE 2: Adding a ranking column.
To get the desired rank you can use this expression:
RankColumn =
RANKX(
    CALCULATETABLE(Table, ALLEXCEPT(Table, Table[id])),
    Table[datetime],
    Table[datetime],
    1
)
This is the table with the rank column:
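The ranking rule can also be sketched in plain Python on the question's rows: within each id, a row's rank is one plus the number of rows with an earlier datetime. This assumes ascending order and ties that skip ranks, and it is an illustration of the logic rather than of RANKX itself.

```python
from datetime import datetime

FMT = "%d.%m.%Y %H:%M:%S"
rows = [
    (1, "12.01.2015 18:10:10"),
    (2, "03.12.2014 14:44:57"),
    (2, "21.11.2015 11:11:11"),
    (3, "01.01.2011 12:12:12"),
    (3, "02.02.2012 13:13:13"),
    (3, "03.03.2013 14:14:14"),
]
table = [(row_id, datetime.strptime(text, FMT)) for row_id, text in rows]

ranks = []
for outer_id, outer_dt in table:
    # Restrict to the rows with the same id (the ALLEXCEPT part).
    same_id = [dt for inner_id, dt in table if inner_id == outer_id]
    # Ascending rank: 1 + number of strictly earlier datetimes in the group.
    ranks.append(1 + sum(dt < outer_dt for dt in same_id))

print(ranks)  # [1, 1, 2, 1, 2, 3]
```

These values agree with the datetime_rankx column shown in the question.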
Let me know if this helps.