I have a dataset with two variables: journalistName, articleDate
For each journalist (group), I want to create a variable that chronologically categorizes the articles into 1 for "first half" and 2 for "second half".
For example, if a journalist wrote 4 articles, I want the first two articles categorized as 1.
If he wrote 5 articles, I want the first three categorized as 1.
One possibility I thought of is to calculate the midpoint date and then use an if condition (gen cat1 = 1 if midpoint > startdate), but I don't know how to generate such a midpoint in Stata.
Per your description of which articles to categorize as 1, you're looking for the midpoint of the number of articles rather than the midpoint of the date range.
One solution is to use by-group processing with _n and _N:
gen cat = 2
bysort author (date): replace cat = 1 if _n <= ceil(_N/2)
This sorts by author and date, and then assigns cat = 1 to observations within each author group whose observation number (_n) is less than or equal to the midpoint observation (ceil(_N/2)).
Note that you need a numeric (rather than string) date for the sort to work properly. Also, in my opinion, cat = {1,2} is less intuitive than something like firsthalf = {0,1}. Either way, labeling the values (help label) would aid clarity.
For more information, see help by and this article.
Finally, the method in action:
clear all
input str10 author str10 datestr
"Alex" "09may2015"
"Alex" "06apr2015"
"Alex" "15jul2014"
"Alex" "19aug2013"
"Alex" "03mar2009"
"Betty" "09may2015"
"Betty" "06apr2015"
"Betty" "15jul2014"
"Betty" "19aug2013"
end
gen date = daily(datestr, "DMY")
format date %td
gen cat = 2
bysort author (date): replace cat = 1 if _n <= ceil(_N/2)
list , sepby(author) noobs
and the result:
+--------------------------------------+
| author datestr date cat |
|--------------------------------------|
| Alex 03mar2009 03mar2009 1 |
| Alex 19aug2013 19aug2013 1 |
| Alex 15jul2014 15jul2014 1 |
| Alex 06apr2015 06apr2015 2 |
| Alex 09may2015 09may2015 2 |
|--------------------------------------|
| Betty 19aug2013 19aug2013 1 |
| Betty 15jul2014 15jul2014 1 |
| Betty 06apr2015 06apr2015 2 |
| Betty 09may2015 09may2015 2 |
+--------------------------------------+
If you are indeed seeking to calculate the midpoint date, you can do so using the same general principle:
bysort author (date): gen beforemiddate = date <= ceil((date[_N] + date[1]) / 2)
Also, to find the last date in the "pre-midpoint" period, you can use the same principles:
bysort author cat (date): gen lastdate = date[_N] if cat == 1
by author: replace lastdate = lastdate[_n-1] if missing(lastdate)
format lastdate %td
or an egen function with a logical test included gets the job done a bit faster (note that this trick relies on the daily dates being nonnegative, i.e. on or after 01jan1960, so that the cat == 2 rows contribute 0 to the maximum):
egen lastdate = max(date * (cat == 1)) , by(author)
format lastdate %td
Let's say I have flight data (from Foundry Academy).
Starting dataset:

| Date | flight_id | origin_state | carrier_name |
| ---- | --------- | ------------ | ------------ |
| jan  | 000000001 | California   | delta air    |
| jan  | 000000002 | Alabama      | delta air    |
| jan  | 000000003 | California   | southwest    |
| feb  | 000000004 | California   | southwest    |
| ...  | ...       | ...          | ...          |
I'm doing monthly data aggregation by state and by carrier. Header of my aggregated data looks like this:
| origin state | carrier name | jan | feb | ... |
| ------------ | ------------ | --- | --- | --- |
| Alabama      | delta air    | 1   | 0   | ... |
| California   | delta air    | 1   | 0   | ... |
| California   | southwest    | 1   | 1   | ... |
I need to get subtotals for each state;
I need to sort by most flights;
and I want it sorted by state first, then by carrier.
Desired output:

| origin state | carrier name | jan | feb | ... |
| ------------ | ------------ | --- | --- | --- |
| California   | null         | 2   | 1   | ... |
| California   | delta air    | 1   | 0   | ... |
| California   | southwest    | 1   | 1   | ... |
| Alabama      | null         | 1   | 0   | ... |
| Alabama      | delta air    | 1   | 0   | ... |
PIVOT - doesn't provide subtotals for categories;
EXPRESSION - doesn't offer the possibility of splitting the date column into separate columns.
I solved it in Contour. It's not the prettiest solution, but it works.
I created two paths from the same dataset:
| Date | flight_id | origin_state | carrier_name |
| ---- | --------- | ------------ | ------------ |
| ... | ... | ... | ... |
The 1st path was used to calculate the full aggregation: a pivot table, then switch to pivoted data:
Switch to pivoted data: using column "date",
grouped by "origin_state" and "carrier_name",
aggregated by Count
The 2nd path was used to get the subtotals:
Switch to pivoted data: using column "date",
grouped by "origin_state",
aggregated by Count
Afterwards I added an empty "carrier_name" column to the second dataset and made a union of both datasets:
Add rows that appear in "second_path" by column name
After that I added an additional column with an expression:
Add new column "order" from max("Jan") OVER (
PARTITION BY "origin_state" )
Then I sorted the resulting dataset:
Sort dataset by "order" descending, then by "Jan" descending
This gives the result I wanted, but it has an additional column, and now I wish to change the row formatting of the subtotals.
Other approaches are welcome, as my real data has more hierarchical levels.
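For what it's worth, the same two-path approach is straightforward to mirror outside Contour. Here is a hedged pandas sketch (column names and sample values are taken from the question; everything else is illustrative):

import pandas as pd

# flight-level sample data, as in the question
raw = pd.DataFrame({
    'Date': ['jan', 'jan', 'jan', 'feb'],
    'flight_id': ['000000001', '000000002', '000000003', '000000004'],
    'origin_state': ['California', 'Alabama', 'California', 'California'],
    'carrier_name': ['delta air', 'delta air', 'southwest', 'southwest'],
})

# 1st path: per-carrier counts, one column per month
detail = (raw.pivot_table(index=['origin_state', 'carrier_name'],
                          columns='Date', aggfunc='size', fill_value=0)
             .reset_index())

# 2nd path: state-level subtotals, with carrier_name left empty
subtotal = (raw.pivot_table(index='origin_state', columns='Date',
                            aggfunc='size', fill_value=0)
               .reset_index())
subtotal['carrier_name'] = None

# union both paths, then order states by their maximum January count;
# a stable sort keeps each subtotal row ahead of its detail rows on ties
both = pd.concat([subtotal, detail], ignore_index=True)
both['order'] = both.groupby('origin_state')['jan'].transform('max')
both = both.sort_values(['order', 'jan'], ascending=False, kind='mergesort')
print(both.drop(columns='order'))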
My date data was pulled from our system in the format "mm/dd", without the year.
So I run into a problem when I subtract values that span the old year and the current year.
Example:
| Date action | Date Check | Result | Current Date |
| ----------- | ---------- | ------ | ------------ |
| 12/21       | 01/03      | -352   | 03/18/2022   |
The correct result is:
| Date action | Date Check | Result | Current Date |
| ----------- | ---------- | ------ | ------------ |
| 12/21       | 01/03      | 13     | 03/18/2022   |
How do I subtract correctly? Thanks.
I assume that Excel has treated 12/21 and 01/03 as dates, but in doing so has assumed the current year in both cases. This deducts 1 year from the date in cell A2:
=DATE(YEAR(A2)-1,MONTH(A2),DAY(A2))
For example: test whether the earlier date's month is less than the later date's month; if it is not, deduct 1 year from the earlier date and then calculate the difference.
  +------------+------------+------+
  |     A      |     B      |  C   |
1 | earlier    | later      | diff |
  +------------+------------+------+
2 | 2022-12-21 | 2022-01-03 |   13 |
  +------------+------------+------+
in cell C2
=IF(MONTH(A2) > MONTH(B2),B2-DATE(YEAR(A2)-1,MONTH(A2),DAY(A2)),B2-A2)
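For comparison, the same month test sketched in Python (a hedged illustration; the (month, day) pair inputs are assumed, not part of the workbook):

from datetime import date

def days_between(earlier, later, year):
    # If the earlier date's month is greater than the later date's month,
    # the earlier date is assumed to belong to the previous year.
    em, ed = earlier
    lm, ld = later
    e_year = year - 1 if em > lm else year
    return (date(year, lm, ld) - date(e_year, em, ed)).days

print(days_between((12, 21), (1, 3), 2022))  # 13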
I have a text file, which contains only numbers.
For example:
2001 31110
199910 311
Its layout can be explained as follows:
Digits 1-4: year
Digits 5-6: month
Digits 7-8: day
Digit 9: sex
Digit 10: married
However, I can't decide how to import this file into Stata.
For instance, if I use the command:
import delimited input.txt, delimiter(??)
What should I write in delimiter?
I don't necessarily need to use the above. I just want to import the data using whatever method.
The answer depends on what you want to do with the data later.
My understanding is that a space indicates a single digit in one of the date-related fields, and that in the text file only the month or the day can be a single digit, but not both. In addition, sex and married are binary indicators taking values 0 and 1.
Assuming the above are correct and the data below are included in a file data.txt:
2001 31110
199910 311
1983 41201
2012121500
Here's one way to do it:
clear
import delimited data.txt, delimiter(" ") stringcols(_all)
list
+--------------------+
| v1 v2 |
|--------------------|
1. | 2001 31110 |
2. | 199910 311 |
3. | 1983 41201 |
4. | 2012121500 |
+--------------------+
replace v2 = "0" + v2 if v2 != ""
generate v3 = v1 + v2
generate year = substr(v3, 1, 4)
generate month = substr(v3, 5, 2)
generate day = substr(v3, 7, 2)
generate date = substr(v3, 1, 8)
generate sex = substr(v3, 9, 1)
generate married = substr(v3, 10, 1)
list
+----------------------------------------------------------------------------------+
| v1 v2 v3 year month day date sex married |
|----------------------------------------------------------------------------------|
1. | 2001 031110 2001031110 2001 03 11 20010311 1 0 |
2. | 199910 0311 1999100311 1999 10 03 19991003 1 1 |
3. | 1983 041201 1983041201 1983 04 12 19830412 0 1 |
4. | 2012121500 2012121500 2012 12 15 20121215 0 0 |
+----------------------------------------------------------------------------------+
You basically import everything into at most two string variables, with a single space " " acting as the separator. Single-digit months or days are changed to two digits by adding a 0 at the front. Then, after you extract the relevant parts of the strings using the substr() function, you can simply convert the resulting variables to numeric as needed.
For example:
destring year month day sex married, replace
generate date2 = daily(date, "YMD")
format date2 %tdDD-NN-CCYY
list date2
+------------+
| date2 |
|------------|
1. | 11-03-2001 |
2. | 03-10-1999 |
3. | 12-04-1983 |
4. | 15-12-2012 |
+------------+
If in your text file both the month and the day can contain single digits, you follow the same logic as above, but you will need to deal with a third variable as well after you import the data.
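And since you mentioned that any import method is fine, here is a hedged Python sketch of the same normalization (under the same assumption that at most one of month/day is a single digit, so a single space marks the shortened field; data.txt as above):

rows = []
with open('data.txt') as f:
    for line in f:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        # re-insert the dropped leading zero after the space, if any
        digits = parts[0] + '0' + parts[1] if len(parts) == 2 else parts[0]
        rows.append({'year': digits[0:4], 'month': digits[4:6],
                     'day': digits[6:8], 'sex': digits[8],
                     'married': digits[9]})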
I am trying to find, for each customer, the maximum number of consecutive years in which they bought something. I tried to create a calculated field, but to no avail.
I created two calculated fields
Consecutive: if max([Count])>0 then previous_value(0)+1+index()-index() else 0 end
max: window_max([Consecutive])
My data looks something like:
Year | Customer | Count
1996 | a | 2
1996 | b | 1
1997 | a | 1
1997 | b | 2
1998 | b | 1
So the result would be
a:2
b:3
Use nested table calcs.
The first calc, call it running_good_years, is a running count of consecutive years with sales.
If count(Sales) = 0 then 0 else previous_value(0) + 1 end
The second just returns the max:
Window_max(running_good_years)
With table calcs, defining the partitioning and addressing is critical: partition by Customer, address by Year.
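If you want to sanity-check the expected numbers outside Tableau, here is a hedged pandas sketch of the same run-length logic, using the sample data from the question:

import pandas as pd

df = pd.DataFrame({'Year': [1996, 1996, 1997, 1997, 1998],
                   'Customer': ['a', 'b', 'a', 'b', 'b'],
                   'Count': [2, 1, 1, 2, 1]})

def max_consecutive(years):
    # length of the longest run of consecutive years
    years = sorted(set(years))
    best = run = 1
    for prev, cur in zip(years, years[1:]):
        run = run + 1 if cur == prev + 1 else 1
        best = max(best, run)
    return best

# keep only years with sales, then compute the longest run per customer
print(df[df['Count'] > 0].groupby('Customer')['Year'].apply(max_consecutive))
# a: 2, b: 3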
I'd like to merge two dataframes, df1 & df2, based on whether rows of df2 fall within a 3-6 month date range after rows of df1. For example:
df1 (for each company I have quarterly data):
company DATADATE
0 012345 2005-06-30
1 012345 2005-09-30
2 012345 2005-12-31
3 012345 2006-03-31
4 123456 2005-01-31
5 123456 2005-03-31
6 123456 2005-06-30
7 123456 2005-09-30
df2 (for each company I have event dates that can happen on any day):
company EventDate
0 012345 2005-07-28 <-- won't get merged b/c not within date range
1 012345 2005-10-12
2 123456 2005-05-15
3 123456 2005-05-17
4 123456 2005-05-25
5 123456 2005-05-30
6 123456 2005-08-08
7 123456 2005-11-29
8 abcxyz 2005-12-31 <-- won't be merged because company not in df1
Ideal merged df -- rows with EventDates in df2 that are 3-6 months (i.e. 1 quarter) after DATADATEs in rows of df1 will be merged:
company DATADATE EventDate
0 012345 2005-06-30 2005-10-12
1 012345 2005-09-30 NaN <-- nan because no EventDates fell in this range
2 012345 2005-12-31 NaN
3 012345 2006-03-31 NaN
4 123456 2005-01-31 2005-05-15
5 123456 2005-01-31 2005-05-17
5 123456 2005-01-31 2005-05-25
5 123456 2005-01-31 2005-05-30
6 123456 2005-03-31 2005-08-08
7 123456 2005-06-30 2005-11-29
8 123456 2005-09-30 NaN
I am trying to apply this related topic [ Merge pandas DataFrames based on irregular time intervals ] by adding start_time and end_time columns to df1 denoting 3 months (start_time) to 6 months (end_time) after DATADATE, then using np.searchsorted(), but this case is a bit trickier because I'd like to merge on a company-by-company basis.
This is actually one of those rare questions where the algorithmic complexity might be significantly different for different solutions. You might want to consider this over the niftiness of 1-liner snippets.
Algorithmically:
1. Sort the larger of the dataframes by date.
2. For each date in the smaller dataframe, use the bisect module to find the relevant rows in the larger dataframe.
For dataframes with lengths m and n respectively (m < n), the complexity should be O(m log(n)).
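A minimal sketch of that idea for a single sorted list of event dates (the names here are illustrative, not from the question):

import bisect

def events_in_window(events, start, end):
    # `events` must be sorted; the two bisections cost O(log n) per window
    lo = bisect.bisect_right(events, start)  # skip events at or before start
    hi = bisect.bisect_left(events, end)     # skip events at or after end
    return events[lo:hi]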
This is my solution, going off of the algorithm that Ami Tavory suggested above:
import bisect
import pandas as pd

# find the date offsets that define the date ranges, as extra columns
df1['start_time'] = df1.DATADATE + pd.offsets.MonthEnd(3)
df1['end_time'] = df1.DATADATE + pd.offsets.MonthEnd(6)

# find unique company names in both dfs
unique_companies_df1 = df1.company.unique()
unique_companies_df2 = df2.company.unique()

# sort df1 by company and DATADATE, so we can iterate in a sensible order
sorted_df1 = df1.sort_values(['company', 'DATADATE']).reset_index(drop=True)

# collect the merged pieces and concatenate them once at the end
pieces = []

# iterate through each company in df1, find that company in df2, then for
# each DATADATE quarter of df1, bisect df2 in the correct locations
# (i.e. start_time to end_time)
for cmpny in unique_companies_df1:
    if cmpny in unique_companies_df2:  # company appears in both dfs
        # take the relevant rows associated with this company
        selected_df2 = df2[df2.company == cmpny].sort_values('EventDate').reset_index(drop=True)
        selected_df1 = sorted_df1[sorted_df1.company == cmpny].reset_index(drop=True)
        for quarter in range(len(selected_df1)):  # each DATADATE quarter in df1
            # bisect_right to ensure that we do not include dates before our date range
            lo = bisect.bisect_right(selected_df2.EventDate, selected_df1.start_time[quarter])
            # bisect_left here to not include dates after our desired date range
            hi = bisect.bisect_left(selected_df2.EventDate, selected_df1.end_time[quarter])
            # iloc's half-open slice grabs exactly the rows lo..hi-1 whose
            # EventDates fall within our date range
            df_right = selected_df2.iloc[lo:hi].copy()
            df_left = selected_df1.loc[[quarter]]  # one-row frame for this quarter
            if len(df_right) == 0:
                # no EventDates fall within range: create a row with cmpny in
                # the 'company' column and NaT in the EventDate column to merge
                df_right.loc[0, 'company'] = cmpny
            # merge the df1 company quarter with df2's rows in the date range
            pieces.append(pd.merge(df_left, df_right, how='inner', on='company'))

df3 = pd.concat(pieces, ignore_index=True)