I'd like to merge two dataframes, df1 & df2, based on whether rows of df2 fall within a 3-6 month date range after rows of df1. For example:
df1 (for each company I have quarterly data):
company DATADATE
0 012345 2005-06-30
1 012345 2005-09-30
2 012345 2005-12-31
3 012345 2006-03-31
4 123456 2005-01-31
5 123456 2005-03-31
6 123456 2005-06-30
7 123456 2005-09-30
df2 (for each company I have event dates that can happen on any day):
company EventDate
0 012345 2005-07-28 <-- won't get merged b/c not within date range
1 012345 2005-10-12
2 123456 2005-05-15
3 123456 2005-05-17
4 123456 2005-05-25
5 123456 2005-05-30
6 123456 2005-08-08
7 123456 2005-11-29
8 abcxyz 2005-12-31 <-- won't be merged because company not in df1
Ideal merged df -- rows with EventDates in df2 that are 3-6 months (i.e. 1 quarter) after DATADATEs in rows of df1 will be merged:
company DATADATE EventDate
0 012345 2005-06-30 2005-10-12
1 012345 2005-09-30 NaN <-- nan because no EventDates fell in this range
2 012345 2005-12-31 NaN
3 012345 2006-03-31 NaN
4 123456 2005-01-31 2005-05-15
5 123456 2005-01-31 2005-05-17
5 123456 2005-01-31 2005-05-25
5 123456 2005-01-31 2005-05-30
6 123456 2005-03-31 2005-08-08
7 123456 2005-06-30 2005-11-29
8 123456 2005-09-30 NaN
I am trying to apply this related topic [ Merge pandas DataFrames based on irregular time intervals ] by adding start_time and end_time columns to df1 denoting 3 months (start_time) to 6 months (end_time) after DATADATE, then using np.searchsorted(), but this case is a bit trickier because I'd like to merge on a company-by-company basis.
This is actually one of those rare questions where the algorithmic complexity might be significantly different for different solutions. You might want to consider this over the niftiness of 1-liner snippets.
Algorithmically:
sort the larger of the dataframes according to the date
for each date in the smaller dataframe, use the bisect module to find the relevant rows in the larger dataframe
For dataframes with lengths m and n, respectively (m < n) the complexity should be O(m log(n)).
This is my solution, based on the algorithm that Ami Tavory suggested above:
import bisect
import pandas as pd

#compute the date offsets defining each date range and add them as extra columns
df1['start_time'] = df1.DATADATE + pd.offsets.MonthEnd(3)
df1['end_time'] = df1.DATADATE + pd.offsets.MonthEnd(6)
#find unique company names in both dfs
unique_companies_df1 = df1.company.unique()
unique_companies_df2 = df2.company.unique()
#sort df1 by company and DATADATE, so we can iterate in a sensible order
sorted_df1 = df1.sort_values(['company', 'DATADATE']).reset_index(drop=True)
#collect the merged pieces and concatenate them at the end
pieces = []
#iterate through each company in df1, find
#that company in df2, then for each
#DATADATE quarter of df1, bisect that company's
#EventDates in the correct locations (i.e. start_time to end_time)
for cmpny in unique_companies_df1:
    if cmpny in unique_companies_df2: #this company is in both dfs, so take the relevant rows from each
        selected_df2 = df2[df2.company == cmpny].sort_values('EventDate').reset_index(drop=True)
        selected_df1 = sorted_df1[sorted_df1.company == cmpny].reset_index(drop=True)
        for quarter in range(len(selected_df1.DATADATE)): #iterate through each DATADATE quarter in df1
            #bisect_right to ensure that we do not include dates before our date range
            lo = bisect.bisect_right(selected_df2.EventDate, selected_df1.start_time[quarter])
            #bisect_left here to not include dates after our desired date range
            hi = bisect.bisect_left(selected_df2.EventDate, selected_df1.end_time[quarter])
            #grab all rows with EventDates that fall within our date range (positions lo..hi-1)
            df_right = selected_df2.iloc[lo:hi].copy()
            df_left = pd.DataFrame(selected_df1.loc[quarter]).transpose()
            if len(df_right) == 0:
                #no EventDates fell within range: create a row with cmpny in the 'company' column
                #and NaT in the EventDate column so the merge still yields a row
                df_right.loc[0, 'company'] = cmpny
            #merge this df1 company quarter with all of df2's rows that fell within the date range
            temp = pd.merge(df_left, df_right, how='inner', on='company')
            pieces.append(temp)
df3 = pd.concat(pieces, ignore_index=True)
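For comparison, a simpler (though less algorithmically efficient) alternative is to merge on company first and only then filter by the date window, re-joining the matches onto df1 so quarters without events keep a NaT EventDate. This is a minimal sketch, assuming DATADATE and EventDate are datetime64 columns, reusing the start_time/end_time columns added above, and using the same open (start_time, end_time) interval as the bisect version; df3_alt is just an illustrative name:

import pandas as pd

#many-to-many merge on company, then keep only EventDates inside the window
cand = df1.merge(df2, on='company')
matched = cand[cand.EventDate.gt(cand.start_time) & cand.EventDate.lt(cand.end_time)]

#left-join the matches back onto df1 so quarters with no events keep a NaT EventDate
df3_alt = df1.merge(matched[['company', 'DATADATE', 'EventDate']],
                    on=['company', 'DATADATE'], how='left')

This builds the full per-company cross product before filtering, so it trades the O(m log(n)) bisect approach for simplicity and only makes sense when each company has a manageable number of rows.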
Let's say I have flight data (from Foundry Academy).
Starting dataset:
| Date | flight_id | origin_state | carrier_name |
| ---- | --------- | ------------ | ------------ |
| jan | 000000001 | California | delta air |
| jan | 000000002 | Alabama | delta air |
| jan | 000000003 | California | southwest |
| feb | 000000004 | California | southwest |
| ... | ... | ... | ... |
I'm doing monthly data aggregation by state and by carrier. Header of my aggregated data looks like this:
| origin state | carrier name | jan | feb | ... |
| ------------ | ------------ | --- | --- | --- |
| Alabama | delta air | 1 | 0 | ... |
| California | delta air | 1 | 0 | ... |
| California | southwest | 1 | 1 | ... |
I need to get subtotals for each state;
I need to sort by most flights;
and I want it to be sorted by states, then by carrier.
Desired output:
| origin state | carrier name | jan | feb | ... |
| ------------ | ------------ | --- | --- | --- |
| California | null | 2 | 1 | ... |
| California | delta air | 1 | 0 | ... |
| California | southwest | 1 | 1 | ... |
| Alabama | null | 1 | 0 | ... |
| Alabama | delta air | 1 | 0 | ... |
PIVOT - doesn't provide subtotals for categories;
EXPRESSION - doesn't offer a way to split the date column into separate month columns.
I solved it in Contour. It's not the prettiest solution, but it works.
I've created two paths to the same dataset:
| Date | flight_id | origin_state | carrier_name |
| ---- | --------- | ------------ | ------------ |
| ... | ... | ... | ... |
The 1st path was used to calculate the full aggregation: pivot the table and switch to pivoted data:
Switch to pivoted data: using column "date",
grouped by "origin_state" and "carrier_name",
aggregated by Count
The 2nd path was used to get the subtotals:
Switch to pivoted data: using column "date",
grouped by "origin_state",
aggregated by Count
Afterwards I added an empty "carrier_name" column to the second dataset, and made a union of both datasets:
Add rows that appear in "second_path" by column name
After that I added an additional column with an expression:
Add new column "order" from max("Jan") OVER (
PARTITION BY "origin_state" )
Then I sorted the resulting dataset:
Sort dataset by "order" descending, then by "Jan" descending
This gives the desired result, but it has an additional column, and now I'd like to change the row formatting of the subtotal rows.
Other approaches are welcome, as my real data has more hierarchical levels.
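Outside Contour, the same aggregate-then-union-subtotals idea can be sketched in pandas, which may help if the hierarchy grows. This is only a rough sketch under assumed names: a flights DataFrame with the Date, flight_id, origin_state and carrier_name columns shown above, with Date holding month labels like "jan" and "feb":

import pandas as pd

# per-carrier counts, with months pivoted into columns
by_carrier = (flights
              .pivot_table(index=['origin_state', 'carrier_name'],
                           columns='Date', values='flight_id',
                           aggfunc='count', fill_value=0)
              .reset_index())

# per-state subtotals, with a null carrier_name so they union cleanly
by_state = (flights
            .pivot_table(index='origin_state', columns='Date',
                         values='flight_id', aggfunc='count', fill_value=0)
            .reset_index()
            .assign(carrier_name=None))

combined = pd.concat([by_state, by_carrier], ignore_index=True)

# order states by their January subtotal, keeping the subtotal row first within each state
combined['order'] = combined.groupby('origin_state')['jan'].transform('max')
result = combined.sort_values(['order', 'jan'], ascending=False)

The order column plays the same role as in the Contour recipe: it carries each state's subtotal so that states sort by total flights while the subtotal row stays on top of its group.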
I have two Spark DataFrames. The first one (Events) contains event information, as follows:
| Event_id | Date | User_id |
| -------- | ---------- | ------- |
| 1 | 2019-04-19 | 1 |
| 2 | 2019-05-30 | 2 |
| 3 | 2020-01-20 | 1 |
The second one (User) contains information from users as below:
| Id | User_id | Date | Weight-kg |
| -- | ------- | ---------- | --------- |
| 1 | 1 | 2019-04-05 | 78 |
| 2 | 1 | 2019-04-17 | 75 |
| 3 | 2 | 2019-10-10 | 50 |
| 4 | 1 | 2020-02-10 | 76 |
What I'd like to know is: how do I bring in the latest weight value from User prior to each event date, using PySpark?
The result should be the following table:
| Event_id | Date | User_id | Weight-kg |
| -------- | ---------- | ------- | --------- |
| 1 | 2019-04-19 | 1 | 75 |
| 2 | 2019-05-30 | 2 | null |
| 3 | 2020-01-20 | 1 | 75 |
The idea is to left join events and users, then rank the weights by user date to get the latest one:
from pyspark.sql import functions as F
from pyspark.sql import Window as W

(event
    # left join to keep all events;
    # note the join condition where
    # event's date >= user's date
    .join(
        user,
        on=[
            event['User_id'] == user['User_id'],
            event['Date'] >= user['Date'],
        ],
        how='left'
    )
    # rank the user's weights per event, latest user date first, so that
    # rank 1 is the most recent record on or before that event's date
    .withColumn('rank_weight', F.rank().over(W.partitionBy(event['Event_id']).orderBy(user['Date'].desc())))
    .where(F.col('rank_weight') == 1)
    .drop('rank_weight')
    # drop unnecessary columns
    .drop(user['User_id'])
    .drop(user['Date'])
    .drop('Id')
    .orderBy('Event_id')
    .show()
)

# Output
# +--------+----------+-------+---------+
# |Event_id|      Date|User_id|Weight-kg|
# +--------+----------+-------+---------+
# |       1|2019-04-19|      1|       75|
# |       2|2019-05-30|      2|     null|
# |       3|2020-01-20|      1|       75|
# +--------+----------+-------+---------+
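An alternative that avoids the ranking window is to aggregate the joined rows per event and take the weight attached to the maximum user date, using Spark's max-of-struct idiom. This is a sketch under the same schema, not part of the original answer:

from pyspark.sql import functions as F

latest = (event
    .join(user, on=[event['User_id'] == user['User_id'],
                    event['Date'] >= user['Date']], how='left')
    .groupBy(event['Event_id'], event['Date'], event['User_id'])
    # max of a (Date, Weight-kg) struct orders by Date first,
    # so it carries along the weight of the latest qualifying user record
    .agg(F.max(F.struct(user['Date'], user['Weight-kg'])).alias('latest'))
    .select('Event_id', 'Date', 'User_id',
            F.col('latest').getField('Weight-kg').alias('Weight-kg'))
    .orderBy('Event_id'))

latest.show()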
I have multiple data frames (24 in total) with one column each. I need to combine all of them into a single data frame. I created indexes and joined on those indexes, but it is quite slow to join all of them (they all have the same number of rows).
Please note that I'm using PySpark 2.1.
from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number

w = Window().orderBy(lit('A'))
df1 = df1.withColumn('Index', row_number().over(w))
df2 = df2.withColumn('Index', row_number().over(w))
joined_df = df1.join(df2, df1.Index == df2.Index, 'inner').drop(df2.Index)
df3 = df3.withColumn('Index', row_number().over(w))
joined_df = joined_df.join(df3, joined_df.Index == df3.Index).drop(df3.Index)
But as joined_df grows, the joins keep getting slower.
DF1:
Col1
2
8
18
12
DF2:
Col2
abc
bcd
def
bbc
DF3:
Col3
1.0
2.2
12.1
1.9
Expected Results:
joined_df:
Col1 Col2 Col3
2 abc 1.0
8 bcd 2.2
18 def 12.1
12 bbc 1.9
You're doing it the correct way. Unfortunately, without a primary key, Spark is not well suited for this type of operation.
Answer by pault, pulled from comment.
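If you keep this approach, the per-dataframe boilerplate can at least be collapsed with functools.reduce. Here is a sketch of the same index-join pattern generalized to a list of single-column dataframes (the dfs list name is illustrative):

from functools import reduce

from pyspark.sql import Window
from pyspark.sql.functions import lit, row_number

# dfs is assumed to be the list of 24 single-column dataframes
w = Window().orderBy(lit('A'))
indexed = [df.withColumn('Index', row_number().over(w)) for df in dfs]

def join_on_index(left, right):
    # inner join on the synthetic row number, then drop the duplicate Index
    return left.join(right, left.Index == right.Index, 'inner').drop(right.Index)

joined_df = reduce(join_on_index, indexed).drop('Index')

Note that this still relies on row_number() over a constant ordering, which pulls all the data onto a single partition; as the comment above says, Spark has no reliable notion of row order without a real key.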
We have two dataframes, and we need to filter the rows of one dataframe using the date range stored in the other dataframe.
df1
-------------------------------
name paid_amount date_paid
-------------------------------
aaa 10 2017-10-10
aba 10 2017-01-10
aac 10 2017-10-10
daa 10 2017-16-10
df2
-----------------------------
start_date end_date
-----------------------------
2017-01-01 2018-01-01
------------------------------
We need to create a third dataframe by checking whether the date_paid field in df1 falls between df2's start_date and end_date. This is what I tried:
df1.where($date_paid).isin(df2.start_date && df2.end_date)
Should be:
df1.crossJoin(df2).where($"date_paid".between($"start_date", $"end_date"))
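For reference, the PySpark equivalent looks much the same; this is a small sketch assuming date_paid, start_date and end_date are proper date (or ISO-formatted string) columns:

from pyspark.sql import functions as F

result = (df1
    .crossJoin(df2)
    # keep only payments whose date falls inside the [start_date, end_date] window
    .where(F.col('date_paid').between(F.col('start_date'), F.col('end_date'))))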
I have a dataset with two variables: journalistName, articleDate
For each journalist (group), I want to create a variable that chronologically categorizes the articles into 1 for "first half" and 2 for "second half".
For example, if a journalist wrote 4 articles, I want the first two articles categorized as 1.
If they wrote 5 articles, I want the first three articles categorized as 1.
One possibility I thought of is to calculate the midpoint date and then use an if condition (gen cat1 = 1 if midpoint > startdate), but I don't know how to generate such a midpoint in Stata.
Per your description of which articles to categorize as 1, you're looking for the midpoint of the number of articles rather than the midpoint of the date range.
One solution is to use by group processing, _n, and _N:
gen cat = 2
bysort author (date): replace cat = 1 if _n <= ceil(_N/2)
This sorts by author and date, and then assigns cat = 1 to observations within each group of author where the current observation (_n) number is less than or equal to the median observation (ceil(_N/2)).
Note that you need a numeric (rather than string) date for the sort to work properly. Also, in my opinion, cat = {1,2} is less intuitive than something like firsthalf = {0,1}. Either way, labeling the values (help label) would aid clarity.
For more information, see help by and this article.
Finally, the method in action:
clear all
input str10 author str10 datestr
"Alex" "09may2015"
"Alex" "06apr2015"
"Alex" "15jul2014"
"Alex" "19aug2013"
"Alex" "03mar2009"
"Betty" "09may2015"
"Betty" "06apr2015"
"Betty" "15jul2014"
"Betty" "19aug2013"
end
gen date = daily(datestr, "DMY")
format date %td
gen cat = 2
bysort author (date): replace cat = 1 if _n <= ceil(_N/2)
list , sepby(author) noobs
and the result
+--------------------------------------+
| author datestr date cat |
|--------------------------------------|
| Alex 03mar2009 03mar2009 1 |
| Alex 19aug2013 19aug2013 1 |
| Alex 15jul2014 15jul2014 1 |
| Alex 06apr2015 06apr2015 2 |
| Alex 09may2015 09may2015 2 |
|--------------------------------------|
| Betty 19aug2013 19aug2013 1 |
| Betty 15jul2014 15jul2014 1 |
| Betty 06apr2015 06apr2015 2 |
| Betty 09may2015 09may2015 2 |
+--------------------------------------+
If you are indeed seeking to calculate the midpoint date, you can do so using the same general principle:
bysort author (date): gen beforemiddate = date <= ceil((date[_N] + date[1]) / 2)
Also, to find the last date in the "pre-midpoint" period, you can use the same principles:
bysort author cat (date): gen lastdate = date[_N] if cat == 1
by author: replace lastdate = lastdate[_n-1] if missing(lastdate)
format lastdate %td
or an egen function with a logical test included gets the job done a bit faster:
egen lastdate = max(date * (cat == 1)) , by(author)
format lastdate %td