I am trying to identify traders who place transactions in the same month in each of three consecutive years for the same company. Once a trader meets this criterion, these three transactions and all of that trader's subsequent transactions in that same month in that company should be identified.
Assume I have the sample data below.
data have;
input ID STOCK trandate :mmddyy10.;
format trandate mmddyy10.;
datalines;
1 1 10/15/2009
1 1 01/01/2010
1 1 01/10/2011
1 1 01/15/2012
1 1 01/01/2013
1 2 01/30/2011
1 2 01/30/2012
1 2 01/30/2012
1 2 01/30/2013
1 2 01/30/2014
1 2 01/30/2015
2 1 01/20/2010
2 1 01/15/2011
2 1 01/16/2012
2 1 02/01/2013
2 2 02/01/2010
2 2 02/10/2011
2 2 02/10/2012
2 2 02/10/2013
2 2 02/10/2014
2 2 01/10/2015
;
run;
What I need:
ID Stock trandate type
1 1 10/15/2009 0
1 1 01/01/2010 1
1 1 01/10/2011 1
1 1 01/15/2012 1
1 1 01/01/2013 1
1 2 01/30/2011 1
1 2 01/30/2012 1
1 2 01/30/2012 1
1 2 01/30/2013 1
1 2 01/30/2014 1
1 2 01/30/2015 1
2 1 01/20/2010 0
2 1 01/15/2011 0
2 1 01/16/2012 0
2 1 02/01/2013 0
2 2 02/01/2010 1
2 2 02/10/2011 1
2 2 02/10/2012 1
2 2 02/10/2013 1
2 2 02/10/2014 1
2 2 01/10/2015 0
I used the following code to achieve this:
proc sort data=have;
by id stock trandate;
run;
data have;
set have;
month=month(trandate);
year=year(trandate);
run;
proc sort data=have;
by id stock month year;
run;
data have;
set have;
by id stock month year;
rungroup + (first.month or not first.month and year - lag(year) > 1);
run;
data temp;
do index = 1 by 1 until (last.rungroup);
set have;
by rungroup;
* distinct number of years in rungroup;
years_runlength = sum (years_runlength, first.rungroup or year ne lag(year));
end;
do index = 1 to index;
set have;
if years_runlength >=4 then output;
end;
run;
The code above identifies traders with transactions in three consecutive years. Since I also need the subsequent transactions of these traders, the following code is applied as well.
proc sort data=temp;
by id stock rungroup;
run;
data temp;
set temp;
by rungroup;
if first.rungroup then fyear=year;
run;
data temp(drop=fyear rename=(Locf=fyear));
do until (last.stock);
set temp;
by id stock;
locf=coalesce(fyear,locf);
output;
end;
run;
data temp;
set temp;
by rungroup;
if first.rungroup then fmonth=month;
run;
data temp;
set temp;
gap=year-fyear;
run;
proc means data=temp;
var gap;
run;
data temp;
set temp;
if gap=3 then type2=1;
type1=1;
run;
The code above marks the first transaction after the three consecutive years. When the identified transactions are combined with the original dataset, all later transactions in that same month can also be identified. This achieves the objective that "these three transactions and all of that trader's subsequent transactions in that same month in that company should be identified". The following code does this.
proc sort data=have;
by id stock rungroup;
run;
proc sort data=temp;
by id stock rungroup;
run;
data combine;
merge have temp;
by id stock rungroup;
run;
data combine;
set combine;
month=month(trandate);
run;
data combine1 (drop=fmonth rename=(Locf=fmonth));
do until (last.stock);
set combine;
by id stock;
locf=coalesce(fmonth,locf);
output;
end;
run;
data combine2 (drop=type2 rename=(Locf=type2));
do until (last.stock);
set combine1;
by id stock;
locf=coalesce(type2,locf);
output;
end;
run;
data combine2;
set combine2;
if month^=fmonth then type2=.;
run;
data combine2;
set combine2;
if type1=1 or type2=1 then type=1;
else type=0;
run;
I tried this code and the results look right, but I cannot be 100% sure. Additionally, as you can see, my code is relatively long and complex. Could anyone give me some suggestions about the code?
Here is a bit of a brute-force way. For this example I limited it to the years 2009 to 2015 from your sample, but you could expand the pattern to allow more years. You could use macro logic to generate the repetitive "wallpaper" parts of the code.
First, build an array that you can index by month and year, and populate its variables with 1 when the corresponding month has a trade. Then check whether the series of values for the same month across the years ever has three 1's in a row. You can use two DOW loops to process the data: the first populates the array and the second tests the array and sets the new flag variable.
data want ;
do until(last.stock) ;
set have ;
by id stock;
array months [1:12,2009:2015]
m1y2009-m1y2015 m2y2009-m2y2015 m3y2009-m3y2015 m4y2009-m4y2015
m5y2009-m5y2015 m6y2009-m6y2015 m7y2009-m7y2015 m8y2009-m8y2015
m9y2009-m9y2015 m10y2009-m10y2015 m11y2009-m11y2015 m12y2009-m12y2015
;
months[month(trandate),year(trandate)]=1;
end;
do until(last.stock);
set have;
by id stock;
select (month(trandate));
when (1) flag=0 ne index(cats(of m1y:),'111');
when (2) flag=0 ne index(cats(of m2y:),'111');
when (3) flag=0 ne index(cats(of m3y:),'111');
when (4) flag=0 ne index(cats(of m4y:),'111');
when (5) flag=0 ne index(cats(of m5y:),'111');
when (6) flag=0 ne index(cats(of m6y:),'111');
when (7) flag=0 ne index(cats(of m7y:),'111');
when (8) flag=0 ne index(cats(of m8y:),'111');
when (9) flag=0 ne index(cats(of m9y:),'111');
when (10) flag=0 ne index(cats(of m10y:),'111');
when (11) flag=0 ne index(cats(of m11y:),'111');
when (12) flag=0 ne index(cats(of m12y:),'111');
otherwise ;
end;
output;
end;
drop m: ;
run;
Referring to the table below, an ID is considered complete if at least one of its groups has all of Day 1 to Day 3 (duplicates allowed).
I need to remove any ID that has no group containing the full Day 1 to Day 3.
ID Group Day
1 A 1
1 A 1
1 A 2
1 A 3
1 B 1
1 B 3
2 A 1
2 A 3
2 B 2
Expected result
ID Group Day
1 A 1
1 A 1
1 A 2
1 A 3
1 B 1
1 B 3
I used this question as a reference: Delete the group that none of its observation contain the certain value in SAS.
I have tried the code below, but it does not remove ID 2.
PROC SQL;
CREATE TABLE TEMP AS SELECT
* FROM HAVE
GROUP BY ID
HAVING MIN(DAY)=1 AND MAX(DAY)=3
;QUIT;
PROC SQL;
CREATE TABLE TEMP1 AS SELECT
* FROM TEMP WHERE ID IN
(SELECT ID FROM TEMP
WHERE DAY=2)
;QUIT;
So you want to find the set of ID values where the ID has at least one GROUP that has all three DAY values? Find the list of IDs as a subquery and use it to subset the original data.
The key thing in the subquery is that you want 3 distinct values of DAY. If your data could have other values of DAY (such as missing or 4), then use a WHERE clause to keep only the values you want to count.
proc sql;
create table want as
select * from have
where id in
(select id from have
where day in (1,2,3)
group by id,group
having count(distinct day)=3
)
;
quit;
You can query the dataset with a removal list. For example:
proc sql noprint;
create table want as
select *
from have
where cats(group, id) NOT IN(select cats(group, id) from removal_list)
;
quit;
Creating the Removal List
This method will prevent you from having to do a Cartesian product on all IDs, groups, and days to create your removal list.
Assume that your data is sorted by ID, group, and day.
For each ID, the first day in the group must be 1
For each ID, all days in the group after the first day must have a difference of 1 from the previous day
Code:
data removal_list;
set have;
by ID Group Day;
retain flag_remove_group;
lag_day = lag(day);
/* Reset flag_remove_group at the start of each (ID, Group).
Check if the first day is > 1. If it is, set the removal flag.
*/
if(first.group) then do;
call missing(lag_day);
if(day > 1) then flag_remove_group = 1;
else flag_remove_group = 0;
end;
/* If it's not the first (ID, Group), check if days
are skipped between observations
*/
if(NOT first.group AND (day - lag_day) NE 1) then flag_remove_group = 1;
if(flag_remove_group) then output;
keep id group;
run;
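The step above assumes the sort order mentioned earlier. If the data is not already sorted by ID, Group, and Day, a preliminary PROC SORT (a small sketch using the dataset name from the question) establishes that order first:
proc sort data=have;
  by id group day;
run;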
Original data:
subject medgrp stdt endt
1 A 7/1/2014 7/31/2014
1 A 7/29/2014 8/30/2014
1 B 7/1/2014 8/15/2014
1 C 8/1/2014 9/1/2014
2 A 4/15/2014 5/15/2014
2 A 5/10/2014 6/10/2014
2 A 6/5/2014 6/15/2014
2 A 7/1/2014 8/1/2014
3 A 6/5/2014 6/15/2014
3 A 6/16/2014 8/1/2014
Re-structured data:
subject med_pattern stdt_new endt_new
1 A*B 7/1/2014 7/31/2014
1 A*B*C 8/1/2014 8/15/2014
1 A*C 8/16/2014 8/30/2014
1 C 8/31/2014 9/1/2014
2 A 4/15/2014 6/15/2014
2 A 7/1/2014 8/1/2014
3 A 6/5/2014 8/1/2014
I was able to transform the original data into the re-structured data by outputting one record for each date from stdt to endt, then keeping one date per subject/medgrp, re-forming the date periods, and creating the variable med_pattern.
However, this method takes a long time to run, especially for big data (>3M records).
Any suggestions to make this more efficient would be greatly appreciated!
For each subject you can use a date-keyed multidata hash to track the medgrp activity for each date in the date range defined by stdt and endt. Iterating over the hash then lets you compute your medgrp crossings value.
data have; input
subject medgrp $ stdt: mmddyy8. endt: mmddyy8.; format stdt endt mmddyy10.;
datalines;
1 A 7/1/2014 7/31/2014
1 A 7/29/2014 8/30/2014
1 B 7/1/2014 8/15/2014
1 A 7/15/2014 7/15/2014
1 C 8/1/2014 9/1/2014
2 A 4/15/2014 5/15/2014
2 A 5/10/2014 6/10/2014
2 A 6/5/2014 6/15/2014
2 A 7/1/2014 8/1/2014
3 A 6/5/2014 6/15/2014
3 A 6/16/2014 8/1/2014
;
data crossings_by_date / view=crossings_by_date;
if 0 then set have; * prep PDV;
if _n_ = 1 then do;
declare hash dg(multidata:'yes', ordered:'a'); %* 1st hash for subject dates;
dg.defineKey('date');
dg.defineData('date', 'medgrp');
dg.defineDone();
call missing (date); format date adate cdate mmddyy10.;
declare hash crossing(ordered:'a'); %* 2nd hash for deduping a list of medgrps ;
crossing.defineKey('medgrp');
crossing.defineData('medgrp');
crossing.defineDone();
declare hiter dgi('dg');
declare hiter xi('crossing');
end;
dg.clear();
do _n_ = 1 by 1 until (last.subject); * process subjects one by one;
set have;
by subject;
do date = stdt to endt; * load multidata hash with medgrp over date range;
dg.add();
end;
end;
* examine each date in which subject had activity;
adate = .;
cdate = -1e9;
do _i_ = 1 by 1 while (dgi.next() = 0);
if date eq adate
then continue; * hiter over multi-data will return each node;
else adate = date; * track activity date;
* load hash to dedupe tracking of medgrp on date;
crossing.clear();
do _i_ = 1 by 1 while (dg.do_over() = 0);
crossing.replace();
end;
* compute crossing representation on date, A*B*... by traversing 2nd hash;
xi.first(); length cross $20;
cross = medgrp;
do while(0 = xi.next());
cross = catx('*',cross,medgrp);
end;
if date - cdate > 1 then cluster + 1; %* track cluster based on date continuities;
cdate = date;
output; * <------------ view OUTPUT;
end;
keep subject date cross cluster;
run;
* 2nd data step processes view (1st data step);
* determine when date continuity ends or medgrp changes;
data want;
length subject 8 medgrps $20;
format stdt endt mmddyy10.;
do _n_ = 1 by 1 until (last.medgrps);
set crossings_by_date (rename=cross=medgrps);
by cluster medgrps notsorted;
if stdt = . then
stdt = date;
end;
endt = date;
keep subject medgrps stdt endt;
run;
I am attempting to identify events that occur in at least four consecutive years. Assume I have the following sample.
Rungroup Year
1 2003
1 2004
1 2005
1 2006
1 2008
1 2009
2 2003
2 2004
2 2005
2 2007
2 2008
2 2009
3 2003
3 2004
Based on the following code, I want to remove the years that are not part of a run of at least four consecutive years. This method has two steps. The first step assigns a serial number to the consecutive years. The second step uses a look-ahead method.
data have2;
set have;
by rungroup;
lyear=lag(year);
if first.rungroup then lyear=.;
if year =1+ lyear then group1+1;
else group1=0;
run;
data have3;
set have2;
by rungroup;
set have2 ( firstobs = 2 keep = group1 rename = (group1 = next2) )
have2 ( obs = 1 );
next2 = ifn( last.rungroup, (.), next2 );
set have2 ( firstobs = 3 keep = group1 rename = (group1 = next3) )
have2 ( obs = 2 );
next3 = ifn( last.rungroup, (.), next3 );
set have2 ( firstobs = 4 keep = group1 rename = (group1 = next4) )
have2 ( obs = 3 );
next4 = ifn( last.rungroup, (.), next4);
if next4>=3 or next3>=3 or next2>=3 or group1>=3 then output;
run;
Is this an efficient way to identify consecutive observations? Any comments would be greatly appreciated.
If your goal is to flag all the observations that are part of a consecutive sequence of at least 4 years within the same group, here is an approach:
data have;
input Rungroup Year;
datalines;
1 2003
1 2004
1 2005
1 2006
1 2008
1 2009
2 2003
2 2004
2 2005
2 2007
2 2008
2 2009
3 2003
3 2004
;
data want(drop=y);
if _N_=1 then do;
declare hash h(dataset:'have');
h.definekey('Rungroup', 'Year');
h.definedone();
end;
set have;
array _{-3:3} _temporary_;
do y=-3 to 3;
_[y]=h.check(key:Rungroup, key:Year+y);
end;
if _[-3]=0 & _[-2]=0 & _[-1]=0
| _[-2]=0 & _[-1]=0 & _[ 1]=0
| _[-1]=0 & _[ 1]=0 & _[ 2]=0
| _[ 1]=0 & _[ 2]=0 & _[ 3]=0
then flag=1;
run;
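If the goal is to actually remove the non-qualifying rows rather than just flag them (an assumption based on the wording of the question), the flag can be turned into a subsetting condition afterwards, for example:
* keep only the observations that belong to a run of at least four consecutive years;
data want2;
  set want;
  if flag = 1;
run;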
I am stuck with a problem where I have two tables, one at the month level and one at the week level. Here is the format of the tables:
Table1
Customer Date1 Sales
1 Jan2018 1110
1 Feb2018 1245
1 Mar2018 1320
1 Apr2018 1100
...
Table2
Customer Date2
1 01Jan2018
1 08Jan2018
1 15Jan2018
1 22Jan2018
1 29Jan2018
1 05Feb2018
1 12Feb2018
1 19Feb2018
1 26Feb2018
1 05Mar2018
...
I want to create a new column for sales in Table2 that will hold the disaggregated values of sales from Table1. I want to divide the sales by the number of days in that month and then assign the values to the weeks accordingly. Thus the sales for the week of 01Jan2018 are (1110/31)*7. The weeks that span two months will get values from both months. For example, the week of 29Jan2018 has 3 days in Jan2018 and 4 days in Feb2018. The sales of one day in Jan2018 are 1110/31 and the sales of one day in Feb2018 are 1245/28.
So the sales in week 29Jan2018 will be 3*(1110/31) + 4*(1245/28)
I want to do this for each distinct customer.
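To make the arithmetic concrete, here is a quick check of those two per-week figures (illustration only; the variable names are made up):
* quick check of the weekly split described above;
data _null_;
  full_jan_week = (1110/31)*7;                  /* a week entirely inside Jan2018 */
  week_29jan    = 3*(1110/31) + 4*(1245/28);    /* 3 Jan days + 4 Feb days */
  put full_jan_week= 8.2 week_29jan= 8.2;       /* 250.65 and 285.28 */
run;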
The resulting table should be
Result Table
Customer Date Sales
1 01Jan2018 250.6 i.e (1110/31)*7
1 08Jan2018 250.6
1 15Jan2018 250.6
1 22Jan2018 250.6
1 29Jan2018 285.28
1 05Feb2018 311.25
1 12Feb2018 311.25
1 19Feb2018 311.25
1 26Feb2018 133.39 + 170.32
Thanks!
In DATA step programming you will need some 'forward' data instead of 'lag' data. A forward value can be emulated by creating a view of the same data starting one observation ahead (firstobs=2). After understanding the renaming semantics, it is only a matter of some easy bookkeeping.
data customer_months;
attrib Customer length=8 Date1 informat=monyy. format=monyy7.;
input Customer Date1 Sales;
datalines;
1 Jan2018 1110
1 Feb2018 1245
1 Mar2018 1320
1 Apr2018 1100
run;
* week data, also with computation for month the week is in;
data customer_weeks;
attrib Customer length=8 Date2 informat=date9. format=date9.;
input Customer Date2;
Date1 = intnx('month', Date2, 0);
datalines;
1 01Jan2018
1 08Jan2018
1 15Jan2018
1 22Jan2018
1 29Jan2018
1 05Feb2018
1 12Feb2018
1 19Feb2018
1 26Feb2018
1 05Mar2018
run;
* next months sales keyed on prior month value;
data customer_next_months_view / view=customer_next_months_view;
set customer_months;
Date1 = intnx('month',Date1,-1); * the month this record will be a forward for;
rename Sales=Sales_next_month;
if _n_ > 1;
run;
* merge original and forward data, rename for making clear the variable roles;
data combined;
length disag_sales 8;
merge
customer_months (rename=Sales=Sales_this_month)
customer_next_months_view
customer_weeks
;
by Date1;
days_in_this_month = intck('day',intnx('month',Date1,0),intnx('month',Date1,1));
days_in_next_month = intck('day',intnx('month',Date1,1),intnx('month',Date1,2));
day_rate_this_month = Sales_this_month / days_in_this_month;
day_rate_next_month = Sales_next_month / days_in_next_month;
if Date2 then
if month(Date2) = month(Date2+6) then
week_days_this_month = 7;
else
week_days_this_month = intck('day', Date2, intnx('month', Date2, 1));
week_days_next_month = 7 - week_days_this_month;
dollars_this_week_this_month = week_days_this_month * day_rate_this_month;
dollars_this_week_next_month = week_days_next_month * day_rate_next_month;
* desired estimated disaggregated sales;
disag_sales = sum (dollars_this_week_this_month,dollars_this_week_next_month);
run;
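To end up with the shape of the requested result table, a final projection of the combined data could look like this (a sketch; it assumes the combined dataset from the step above and simply drops the month-level helper columns):
* keep just the weekly rows and the disaggregated sales (sketch);
data result;
  set combined;
  if not missing(Date2);               * drop month rows that have no week attached;
  keep Customer Date2 disag_sales;
  rename Date2=Date disag_sales=Sales;
run;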
Say that I have the two following one row datasets:
data have_1;
input message $ order_num time price qty;
datalines;
A 3 34199 10 500
run;
data have_2;
input message $ order_num time delete_qty ;
datalines;
B 2 34200 100
run;
I have another dataset that aggregates previous order_numbers.
data total;
input order_num time price qty;
datalines;
1 34197 11 550
2 34198 10.5 450
run;
My objective is that I need to update the dataset total with the datasets have_1 and have_2 in a loop. When I start with have_1, message=A implies that I have to update the dataset total by simply adding a new order to it. I must keep track of the changes in the total dataset. Hence the dataset total should look like this:
order_num time price qty id;
1 34197 11 550 1
2 34198 10.5 450 1
3 34199 10 500 1
Then the dataset total needs to be updated with the dataset have_2, where message=B implies an update to the qty of an order_num that is already in the total dataset. I have to update order_num=2 by removing some of the qty. Hence, the total dataset should look like this:
order_num time price qty id;
1 34197 11 550 2
2 34198 10.5 350 2
3 34199 10 500 2
I have more than 1000 have_ datasets, each corresponding to a row in another dataset.
What is important is that I need to keep track of the changes in total for every message with an id. Assuming that I have only have_1 and have_2, here is my tentative code:
%macro loop();
%do i=1 %to 2;
data total_temp;
set total; run;
data total_temp;
set have_&i;
if msg_type='A' then do;
set total have_&i;
drop message;
id=&i;
end;
if msg_type='B' then do;
merge total have_&i;
by order_num;
drop message;
qty=qty-delete_qty;
drop delete_qty;
id=&i;
end;
run;
data total; set total_temp; run;
%end;
%mend;
%loop();
This code, say after the first loop, keeps only one line, which corresponds to what is in have_1. Hence, can we use a MERGE and a SET statement inside a THEN DO block? What is the proper code that I have to use?
The final datasets should look like this:
order_num time price qty id;
1 34197 11 550 1
2 34198 10.5 450 1
3 34199 10 500 1
1 34197 11 550 2
2 34198 10.5 350 2
3 34199 10 500 2
You don't need to do this in a macro. You CAN use a macro, but it will be slower. Try this:
data have_1;
input message $ order_num time price qty;
datalines;
A 3 34199 10 500
run;
data have_2(index=(order_num));
input message $ order_num time delete_qty ;
datalines;
B 2 34200 100
run;
data total(index=(order_num));
input order_num time price qty;
datalines;
1 34197 11 550
2 34198 10.5 450
run;
/*First, add new orders*/
proc append base=total data=have_1(where=(message="A")) force;
run;
/*Now update for the deletions*/
data total;
modify total have_2(where=(message="B"));
by order_num;
qty = sum(qty,-delete_qty);
drop message delete_qty;
run;
Append the new order to the total data set with PROC APPEND. This maintains the index and allows you to do the update through the MODIFY statement.
This could also be done with two MODIFY statements, though I find adding the new records through PROC APPEND to be clearer.
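For reference, here is a minimal sketch of that all-MODIFY alternative (an illustration only, using the same datasets as above; the %SYSRC mnemonics _SOK and _DSENMR come from the SAS autocall library):
/* Sketch: add the new orders (message="A") with MODIFY instead of PROC APPEND */
data total;
  modify total have_1(where=(message="A"));
  by order_num;
  select (_iorc_);
    when (%sysrc(_sok)) replace;      /* key already in total: overwrite that observation */
    when (%sysrc(_dsenmr)) do;        /* key not found: append the order as a new observation */
      output;
      _error_ = 0;                    /* clear the error flag set by the unmatched key */
    end;
    otherwise;
  end;
run;
/* The deletions (message="B") are then applied exactly as in the MODIFY step shown above */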