Enrollment date 12 months before and after a specific date

I am trying to find persons, by ID, who have 12 months of continuous enrollment before the hospitalization date and another 12 months after it. Each member will have one row.
This uses a claims database in the US. Any help is appreciated.
Example of the dataset:
ID Enr_date End_Date hosp_date
1 1/5/2004 1/6/2008 2/2/2006
2 .... and so on
3
4
id start_e end_e date_h
1 1/1/2005 1/1/2006 2/8/2008
1 2/3/2006 4/5/2013
2 5/7/2005 8/8/2006 4/5/2007
2 1/1/2007 2/2/2012
3 5/9/2005 5/9/2007 1/1/2007
3 6/4/2008 7/7/2012

Assuming my earlier comments are answered, there are many ways to do this. When you are starting out, outer joins, cross joins and the like can be hard to get working in a way that is easy to follow. A SAS macro lets you break the problem into steps that are easy to understand and debug. Here's one approach that may work for you:
%macro hdates;
%local i;
/* get number of hosp_dates */
proc sql noprint;
select count(*) into :cnt
from date where hosp_date ne .;
quit;
%let cnt = &cnt; /* strips leading blanks from the count */
/* place ids and hdates into macro vars; DATE9 text is needed for the "..."d literals below */
proc sql noprint;
select enrolid, hosp_date format=date9. into :id_1 - :id_&cnt, :hdate_1 - :hdate_&cnt
from date where hosp_date ne .;
quit;
proc delete data=hcov; run;
/* for each hdate id pair go through the dataset and test for 12 mo coverage */
%do i = 1 %to &cnt;
data new;
set date;
if (enrolid = &&id_&i) then do;
preDays = "&&hdate_&i"d - start_date ;
postDays = end_date - "&&hdate_&i"d;
if (preDays >= 365 and postDays >= 365) then output;
end;
run;
proc append base = hcov data=new;run;
%end;
%mend hdates;
%hdates;

I work in claims data and I think I understand what you are trying to ask. I recommend making one table with the "condensed" enrollment ranges and another with the hospitalization dates. Then you may merge them together and keep only those patients who meet your criteria. The following code will condense the enrollment ranges (assuming good records):
PROC SORT DATA=dset_in; BY id enr_date end_date; RUN;
DATA enrollment (KEEP=id enroll_start enroll_stop);
SET dset_in;
FORMAT enroll_start enroll_stop DATE9.;
BY id enr_date end_date;
RETAIN enroll_start enroll_stop;
IF first.id THEN DO;
enroll_start=enr_date;
enroll_stop=end_date;
END;
ELSE IF enr_date-enroll_stop <= 1 THEN enroll_stop=end_date;
ELSE DO;
OUTPUT;
enroll_start=enr_date;
enroll_stop=end_date;
END;
IF last.id THEN OUTPUT;
RUN;
Then this code will keep only those patients with a hospitalization and 365 days enrollment before and after. If the hosp_claims dataset has more than 1 hospitalization per patient, sort then take the first obs per id after this step:
PROC SQL;
CREATE TABLE hosp_enrolled AS
SELECT DISTINCT a.id, a.hosp_dt, b.enroll_start, b.enroll_stop
FROM hosp_claims AS a, enrollment AS b
WHERE a.id=b.id AND b.enroll_start+365 <= a.hosp_dt <= enroll_stop-365;
QUIT;
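Both answers above treat "12 months" as 365 days. If calendar months matter (leap years, for instance), the window boundaries could instead be computed with INTNX. A minimal sketch, assuming a hospitalization date of 02FEB2006 taken from the question's example; the dataset name window_check and the variables win_start and win_end are illustrative and not part of either answer:

```sas
data window_check;
    format hosp_dt win_start win_end date9.;
    hosp_dt   = '02FEB2006'd;   /* example date from the question */
    /* 'same' keeps the day of the month when shifting by whole months */
    win_start = intnx('month', hosp_dt, -12, 'same');
    win_end   = intnx('month', hosp_dt,  12, 'same');
    put hosp_dt= win_start= win_end=;
run;
```

The enrollment test would then be enroll_start <= win_start and enroll_stop >= win_end, in place of the +365 / -365 arithmetic.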

SAS RANDOM SAMPLING, WITH GROUP SIZE

I have a data set of accounts and their attributes (there are 9 main attributes).
Each attribute group contains a different number of accounts, and 2 groups hold far more accounts than the rest. When I use PROC SURVEYSELECT with METHOD=SRS to randomly select 5 accounts per STRATA, I get more results from the attributes that contain more accounts.
How can I correct that? How can I make SAS take the group's volume into account when sampling?
The code mentioned above:
PROC SURVEYSELECT DATA=FINAL_RANDOM OUT=FINAL_RANDOM_1 NOPRINT
METHOD=srs
SAMPSIZE = 5
SELECTALL;
STRATA Account_Branch_Id ;
RUN;
I'm not quite getting your point. If you use the sample size as a number of rows, you'll get exactly that number of rows per stratum, both in the Random Sample node in SAS Guide and in your code:
PROC SURVEYSELECT DATA=*YOURDATASET* OUT=DATA
METHOD=srs
SAMPSIZE = 5
SELECTALL;
STRATA *YOUR_STRATA_VARIABLE*;
RUN;
The difference between N (number of observations, in the 'Random Sample' node in SAS Guide) and SAMPSIZE (in the code) is that N gives you the total number of observations, while SAMPSIZE gives you the number per stratum.
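If the goal is for larger groups to contribute more sampled accounts, one option is to sample at a rate rather than a fixed count. A sketch, assuming the question's FINAL_RANDOM dataset and Account_Branch_Id strata variable; the 1% rate, the seed, and the output name FINAL_RANDOM_RATE are placeholders:

```sas
/* STRATA processing expects the input sorted by the strata variable */
proc sort data=FINAL_RANDOM;
    by Account_Branch_Id;
run;

proc surveyselect data=FINAL_RANDOM out=FINAL_RANDOM_RATE noprint
    method=srs
    samprate=0.01   /* placeholder: 1% of the accounts in each stratum */
    seed=12345;     /* placeholder seed so the draw is reproducible */
    strata Account_Branch_Id;
run;
```

With SAMPRATE= each stratum's sample size scales with its account count, whereas SAMPSIZE=5 fixes it at 5 per stratum.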
Here is some generated example data with two strata (branches) that each contain 35% of all the accounts; the other 7 strata share the remaining 30% equally.
The code in the question does select 5 samples from each stratum (branch_id), and those samples fall randomly over the attribute (risk).
I don't know SURVEYSELECT well enough to say whether you can select 'evenly' over risk within strata, but if you can, you will need to specify a METHOD= other than SRS.
Example
data have;
call streaminit (123);
do account_id = 1 to 100000;
x = rand('uniform');
select;
when (x>.65) branch_id = 1;
when (x>.30) branch_id = 2;
otherwise branch_id = 3 + floor(rand('uniform',7));
end;
open_date = round(rand('uniform', today())); format open_date yymmdd10.;
balance = round(rand('uniform',1e6)); format balance dollar9.;
if branch_id = 7 then do;
open_date = round(rand('uniform', 4000));
balance = 5e5 + round(rand('uniform',5e5)); format balance dollar9.;
end;
select (branch_id);
when (1) risk = ceil(rand('normal',5,1));
when (5) risk = ceil(rand('uniform', 2)) * 2;
when (6) risk = 3 + ceil(rand('uniform', 4));
when (9) risk = 4 + ceil(rand('uniform', 5));
otherwise risk = ceil(rand('uniform', 9));
end;
output;
end;
drop x;
run;
proc sort data=have;
by branch_id account_id;
run;
proc tabulate data=have;
title "Original risk distribution by branch";
class branch_id risk;
table branch_id='', risk * n='' * format=comma9. * [s=[width=1cm]] / box='branch_id';
run;
* survey select;
PROC SURVEYSELECT NOPRINT DATA=have OUT=want
METHOD=srs
SAMPSIZE = 5
SELECTALL
;
STRATA Branch_Id ;
RUN;
proc sql;
create table sample_classes_with_zeros as
select class.branch_id, class.risk,
case when sample.risk then 1 else 0 end as z
from (
select distinct branch_id, risk from have
) as class
left join want as sample /* WANT is the SURVEYSELECT output above */
on class.branch_id = sample.branch_id
& class.risk = sample.risk
;
quit;
proc tabulate data=sample_classes_with_zeros;
title "A SurveySelect SRS SAMPSIZE=5 sampling, risk distribution by branch";
class branch_id risk;
var z;
table branch_id='', risk * z='' * sum='' * f=comma9. * [s=[width=1cm textalign=center]] / box='branch_id';
run;

Defining Fixed SAS Macro Variables

I am trying to get a macro to run, but I can't test whether it resolves because I won't have a connection to my database for a little while. I want to know whether the macro is written correctly and will resolve the states on each pass through the code (i.e. run repeatedly and create a table for each state).
The second thing I would like to know is whether I can use a macro variable in a FROM clause. For example, let entpr be the database that I'm pulling from. Would the following resolve correctly:
proc sql;
select * from entpr.&state.; /*Do I need the . after &state?*/
The rest of my code:
libname mdt ".........";
%let state = ny il ar ak mi;
proc sql;
create table mdt.&state._members
as select
corp_ent_cd
,mkt_sgmt_admnstn_cd
,fincl_arngmt_cd
,aca_ind
,prod_type
,cvyr
,cvmo
,sum(1) as mbr_cnt
from mbrship1_&state.
group by 1,2,3,4,5,6,7;
quit;
If &state contains ny il ar ak mi then as it is written, the from statement in your code will resolve to: from mbrship1_ny il ar ak mi - which is invalid SQL syntax.
My guess is that you're wanting to run the SQL statement for each of the following tables:
mbrship1_ny
mbrship1_il
mbrship1_ar
mbrship1_ak
mbrship1_mi
In which case the simplest macro would look something like this:
%macro do_sql(state=);
proc sql;
create table mdt.&state._members
as select
...
from mbrship1_&state
group by 1,2,3,4,5,6,7;
quit;
%mend;
%do_sql(state=ny);
%do_sql(state=il);
%do_sql(state=ar);
%do_sql(state=ak);
%do_sql(state=mi);
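The five calls above could also be driven from the original space-separated list. A minimal sketch, where the macro name run_states is hypothetical and a %put stands in for the real call so the loop can be checked without the database:

```sas
%macro run_states(states=);
    %local i state;
    /* walk the space-separated list one word at a time */
    %do i = 1 %to %sysfunc(countw(&states));
        %let state = %scan(&states, &i);
        /* stand-in for the real work; swap in a call to do_sql with state=&state */
        %put NOTE: would build mdt.&state._members from mbrship1_&state;
    %end;
%mend run_states;

%run_states(states=ny il ar ak mi)
```

Keeping the list in one parameter means a new state is added in one place rather than as another call.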
As to your question about whether to include the period: if the character following your macro variable reference is not a letter (a-z, A-Z), a digit (0-9), or the underscore, the period is optional. Those are the valid characters for a macro variable name, so as long as the next character is not one of them, SAS can tell where the name of the macro variable ends. That is why the period is required in mdt.&state._members but optional at the end of mbrship1_&state. Some people always include it; personally I leave it out unless it's required.
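The delimiter rule can be checked without any data by writing the resolved text to the log. A small sketch; the macro variable lib is introduced here only for illustration:

```sas
%let state = ny;
%let lib   = entpr;   /* hypothetical, only to show the double period */

%put entpr.&state;          /* next character ends the statement: period optional */
%put mbrship1_&state.;      /* a trailing period is allowed and is consumed */
%put mdt.&state._members;   /* period required, otherwise SAS looks for &state_members */
%put &lib..&state;          /* first period ends &lib, second is the libref separator */
```

The last line is the same pattern as &inlib..&prefix in a macro that takes the libref as a parameter.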
When selecting data from multiple tables, whose names themselves contain some data (in your case the state) you can stack the data with:
UNION ALL in SQL
SET in Data step
As long as you are stacking data, you should also add a new column to the query selection that tracks the state.
Consider this pattern for stacking in SQL
data one;
do index = 1 to 10; do _n_ = 1 to 2; output; end; end;
run;
data two;
do index = 101 to 110; do _n_ = 1 to 2; output; end; end;
run;
proc sql;
create table want as
select
source, index
from
(
select 'one' as source, * from one
union all
select 'two' as source, * from two
)
;
quit;
The pattern can be abstracted into a template for SQL source code that will be generated by macro.
%macro my_ultimate_selector (out=, inlib=, prefix=, states=);
%local index n state;
%let n = %sysfunc(countw(&states));
proc sql;
create table &out as
select
state
, corp_ent_cd
, mkt_sgmt_admnstn_cd
, fincl_arngmt_cd
, aca_ind
, prod_type
, cvyr
, cvmo
, count(*) as state_7dim_level_cnt
from
(
%* ----- use the UNION ALL pattern for stacking data -----;
%do index = 1 %to &n;
%let state = %scan(&states, &index);
%if &index > 1 %then %str(UNION ALL);
select "&state" as state, * from &inlib..&prefix.&state.
%end;
)
group by 1,2,3,4,5,6,7,8 %* this seems to be too much grouping ?;
;
quit;
%mend;
%my_ultimate_selector (out=work.want, inlib=mdt, prefix=mbrship1_, states=ny il ar ak mi)
If the columns of the inlib tables are not identical with regard to column order and type, use a UNION ALL CORRESPONDING to have the SQL procedure line up the columns for you.

Macro increment

I have table lookup values as below
sno date
1 200101
2 200102
3 200103
4 200104
I wrote the macro code below:
%let date=200102;
proc sql;
select sno into :no from lookup where date=&date.;
quit;
I need help converting the whole lookup table into a macro-based increment: create the first s.no and date as two macro variables, then increment from there, so that I don't need to update the dates in my lookup table every time. For example, if I look up date 201304, I need to get its corresponding s.no.
Is there a pattern to the SNO values? Are you basically numbering the months since 01JAN2001? If so then use the INTCK() function.
data test;
input date yymmdd8. ;
format date yymmdd10. ;
sno = 1+intck('month','01JAN2001'd,date);
cards;
20010112
20010213
20010314
20010415
;
So you could create two macro variables. One with the base date and the other with the base SNO value.
%let basedate='01JAN2001'd ;
%let basesno=1;
%let date='01JAN2001'd ;
%let sno=%eval(&basesno + %sysfunc(intck(month,&basedate,&date)));
%put &=date &=sno;
DATE='01JAN2001'd SNO=1

%let date="%sysfunc(today(),date9.)"d;
%let sno=%eval(&basesno + %sysfunc(intck(month,&basedate,&date)));
%put &=date &=sno;
DATE="16NOV2017"d SNO=203
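Since the question supplies dates as yyyymm values such as 200102, the same calculation could be wrapped in a function-style macro. A sketch under the assumption that SNO simply numbers months from January 2001; the macro name sno_of and its parameters are hypothetical, and INPUTN with the YYMMN6. informat converts the yyyymm text to a SAS date:

```sas
%macro sno_of(yyyymm, basedate='01JAN2001'd, basesno=1);
    %eval(&basesno + %sysfunc(intck(month, &basedate, %sysfunc(inputn(&yyyymm, yymmn6.)))))
%mend sno_of;

%put sno for 200102 is %sno_of(200102);
%put sno for 201304 is %sno_of(201304);
```

Under that numbering, 200102 maps to 2 (matching the lookup table) and 201304 to 148, with no table to maintain.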
If you simply want to translate one (unique) value into another, you can use (in)formats. They can do much more than change how data are read or displayed. They are easy to use, fast (in-memory) and don't depend on the table once created. Change the library to a permanent one if work (a temporary library) doesn't suit your needs.
options fmtsearch=(formats,work);
data fmt(keep = fmtname type start end label hlo default);
length fmtname $10 type $1 start end $6 label 8 hlo $1 default 8;
fmtname = 'date_to_no';
type = 'I';
label=0;
do y = 2001 to 2099;
do m = 1 to 12;
start = put(y,4.) || put(m,z2.);
end = start;
label + 1;
default=50; /*default length of the string compared when informat is used. Should be higher than both start and end*/
output;
end;
end;
/*if you want to assign a value (=label) to inputs not found. In this case it's -2*/
hlo="O";
start = "";
end = start;
label= -2;
output;
run;
proc format library=work cntlin=fmt;
run;
data test;
no = input('200101',date_to_no.); output;
no = input('201710',date_to_no.); output;
no = input('201713',date_to_no.); output;
run;
Build a lookup table dynamically and create a macro variable for each row in the table. The macro variables will be named date_200101,date_200102,...and so on. They will contain a value equal to the corresponding sno value:
data lookup;
length var_name $20;
do sno = 1 to intck('month','01jan2001'd,date())+1;
date = input(put(intnx('month','01jan2001'd, sno-1, 'beginning'),yymmn6.),best.);
var_name = cats('date_',date);
call symput(var_name, cats(sno));
output;
end;
run;
You can then refer to the macro variables like so:
%let date =200103;
%put &&date_&date;
...or...
%put &date_200101;
The first usage example relies on double macro resolution. The macro processor needs two passes over the token &&date_&date to resolve it fully. On the first pass, && becomes & and &date resolves to 200103, giving &date_200103. On the second pass, &date_200103 resolves to 3.

In SAS, how do you collapse multiple rows into one row based on some ID variable?

The data I am working with is currently in the form of:
ID Sex Race Drug Dose FillDate
1 M White ziprosidone 100mg 10/01/98
1 M White ziprosidone 100mg 10/15/98
1 M White ziprosidone 100mg 10/29/98
1 M White ambien 20mg 01/07/99
1 M White ambien 20mg 01/14/99
2 F Asian telaprevir 500mg 03/08/92
2 F Asian telaprevir 500mg 03/20/92
2 F Asian telaprevir 500mg 04/01/92
And I would like to write SQL code to get the data in the form of:
ID Sex Race Drug1 DrugDose1 FillDate1_1 FillDate1_2 FillDate1_3 Drug2 DrugDose2 FillDate2_1 FillDate2_2 FillDate2_3
1 M White ziprosidone 100mg 10/01/98 10/15/98 10/29/98 ambien 20mg 01/07/99 01/14/99 null
2 F Asian telaprevir 500mg 03/08/92 03/20/92 04/01/92 null null null null null
I need just one row for each unique ID with all of the unique drug/dose/fill info in columns, not rows. I suppose it can be done using PROC TRANSPOSE, but I am not sure of the most efficient way of doing the multiple transposes. I should note that I have over 50,000 unique IDs, each with varying amounts of drugs, doses, and corresponding fill dates. I would like to return null/empty values for those columns that do not have data to fill in. Thanks in advance.
To some extent, the desired efficiency of this determines the best solution.
For example, assuming you know the maximum reasonable number of fill dates, you could use the following to very quickly get a transposed table - likely the fastest way to do that - but at the cost of needing a large amount of post-processing, as it will output a lot of data you don't really want.
proc summary data=have nway;
class id sex race;
output out=want (drop=_:)
idgroup(out[5] (drug dose filldate)=) / autoname;
run;
On the other side of things, the vertical-and-transpose is the "best" solution in terms of not requiring additional steps; though it might be slow.
data have_t;
set have;
by id sex race drug dose notsorted;
length varname value $64; *some reasonable maximum, particularly for the drug name;
if first.ID then do;
drugcounter=0;
end;
if first.dose then do;
drugcounter+1;
fillcounter=0;
varname = cats('Drug',drugcounter);
value = drug;
output;
varname = cats('DrugDose',drugcounter);
value = dose;
output;
end;
call missing(value);
fillcounter+1;
varname=cats('Filldate',drugcounter,'_',fillcounter);
value_n = filldate;
output;
run;
proc transpose data=have_t(where=(not missing(value))) out=want_c;
by id sex race ;
id varname;
var value;
run;
proc transpose data=have_t(where=(not missing(value_n))) out=want_n;
by id sex race ;
id varname;
var value_n;
run;
data want;
merge want_c want_n;
by id sex race;
run;
It's not crazy slow, really, and odds are it's fine for your 50k IDs (though you don't say how many drugs). 1 or 2 GB of data will work fine here, especially if you don't need to sort them.
Finally, there are some other solutions that are in between. You could do the transpose entirely using arrays in the data step, for one, which might be the best compromise; you have to determine in advance the maximum bounds for the arrays, but that's not the end of the world.
It all depends on your data, though, which is really the best. I would probably try the data step/transpose first: that's the most straightforward, and the one most other programmers will have seen before, so it's most likely the best solution unless it's prohibitively slow.
Consider the following query, which uses two derived tables (inner and outer) to establish an ordinal row count in FillDate order. Using that row count, CASE/WHEN logic then populates the iterated columns. The outer query takes the max values grouped by id, sex, race.
The only caveat is that you need to know ahead of time the expected or maximum number of rows per ID (i.e., from another query or by browsing the table). Fill in the ellipses (...) as needed. Note that missing values will be generated for columns that do not apply to a particular ID. And of course, adjust to the actual dataset name.
proc sql;
CREATE TABLE DrugTableFlat AS (
SELECT id, sex, race,
Max(Drug_1) As Drug1, Max(Drug_2) As Drug2, Max(Drug_3) As Drug3, ...
Max(Dose_1) As Dose1, Max(Dose_2) As Dose2, Max(Dose_3) As Dose3, ...
Max(FillDate_1) As FillDate1, Max(FillDate_2) As FillDate2,
Max(FillDate_3) As FillDate3 ...
FROM
(SELECT id, sex, race,
CASE WHEN RowCount=1 THEN Drug END AS Drug_1,
CASE WHEN RowCount=2 THEN Drug END AS Drug_2,
CASE WHEN RowCount=3 THEN Drug END AS Drug_3,
...
CASE WHEN RowCount=1 THEN Dose END AS Dose_1,
CASE WHEN RowCount=2 THEN Dose END AS Dose_2,
CASE WHEN RowCount=3 THEN Dose END AS Dose_3,
...
CASE WHEN RowCount=1 THEN FillDate END AS FillDate_1,
CASE WHEN RowCount=2 THEN FillDate END AS FillDate_2,
CASE WHEN RowCount=3 THEN FillDate END AS FillDate_3,
...
FROM
(SELECT t1.id, t1.sex, t1.race, t1.drug, t1.dose, t1.filldate,
(SELECT Count(*) FROM DrugTable t2
WHERE t1.filldate >= t2.filldate AND t1.id = t2.id) As RowCount
FROM DrugTable t1) AS dT1
) As dT2
GROUP BY id, sex, race);
Here's my attempt at an array-based solution:
/* Import data */
data have;
input ID Sex :$1. Race :$5. Drug :$11. Dose :$5. FillDate :mmddyy8.;
format filldate yymmdd10.;
cards;
1 M White ziprosidone 100mg 10/01/98
1 M White ziprosidone 100mg 10/15/98
1 M White ziprosidone 100mg 10/29/98
1 M White ambien 20mg 01/07/99
1 M White ambien 20mg 01/14/99
2 F Asian telaprevir 500mg 03/08/92
2 F Asian telaprevir 500mg 03/20/92
2 F Asian telaprevir 500mg 04/01/92
;
run;
/* Calculate array bounds - SQL version */
proc sql _method noprint;
select DATES into :MAX_DATES_PER_DRUG trimmed from
(select count(ID) as DATES from have group by ID, drug, dose)
having DATES = max(DATES);
select max(DRUGS) into :MAX_DRUGS_PER_ID trimmed from
(select count(DRUG) as DRUGS from
(select distinct DRUG, ID from have)
group by ID
)
;
quit;
/* Calculate array bounds - data step version */
data _null_;
set have(keep = id drug) end = eof;
by notsorted id drug;
retain max_drugs_per_id max_dates_per_drug;
if first.id then drug_count = 0;
if first.drug then do;
drug_count + 1;
date_count = 0;
end;
date_count + 1;
if last.id then max_drugs_per_id = max(max_drugs_per_id, drug_count);
if last.drug then max_dates_per_drug = max(max_dates_per_drug, date_count);
if eof then do;
call symput("max_drugs_per_id" ,cats(max_drugs_per_id));
call symput("max_dates_per_drug",cats(max_dates_per_drug));
end;
run;
/* Check macro vars */
%put MAX_DATES_PER_DRUG = "&MAX_DATES_PER_DRUG";
%put MAX_DRUGS_PER_ID = "&MAX_DRUGS_PER_ID";
/* Transpose */
data want;
if 0 then set have;
array filldates[&MAX_DRUGS_PER_ID,&MAX_DATES_PER_DRUG]
%macro arraydef;
%local i;
%do i = 1 %to &MAX_DRUGS_PER_ID;
filldates&i._1-filldates&i._&MAX_DATES_PER_DRUG
%end;
%mend arraydef;
%arraydef;
array drugs[&MAX_DRUGS_PER_ID] $11;
array doses[&MAX_DRUGS_PER_ID] $5;
drug_count = 0;
do until(last.id);
set have;
by ID drug dose notsorted;
if first.drug then do;
date_count = 0;
drug_count + 1;
drugs[drug_count] = drug;
doses[drug_count] = dose;
end;
date_count + 1;
filldates[drug_count,date_count] = filldate;
end;
drop drug dose filldate drug_count date_count;
format filldates: yymmdd10.;
run;
The data step code for calculating the array bounds is probably more efficient than the SQL version, but it's also a bit more verbose. On the other hand, with the SQL version you also have to trim whitespace from the macro vars, which the TRIMMED keyword takes care of.
The transposing data step is probably also at the more efficient end of the scale compared to the proc transpose / proc sql options in the other answers, as it makes only 1 further pass through the dataset, but again it's also fairly complex.

How to calculate conditional cumulative sum

I have a dataset like the one below, and I am trying to take a running total of events 2 and 3, with a slight twist. I only want to count these events when the Event_1_dt is less than the date in the current record. I'm currently using a macro %do loop to iterate through each record for that item type. While this produces the desired results, performance is slower than I would like. Each Item_Type may have up to 1250 records, and there are a couple thousand types. Is it possible to exit the loop before it cycles through all 1250 iterations? I am hesitant to try joins because there are some 30+ events to count up, but I'm open to suggestions. An additional complication is that even though Event_1_dt is always greater than Date, it does not have any other limitations.
Item_Type Date Event_1_dt Event_2_flg Event_3_flg Desired_Event_2_Cnt Desired_Event_3_Cnt
A 1/1/2014 1/2/2014 1 1 0 0
A 1/2/2014 1/2/2014 0 1 0 0
A 1/3/2014 1/8/2014 1 0 1 2
B 1/1/2014 1/2/2014 1 0 0 0
B 1/2/2014 1/5/2014 1 0 0 0
B 1/3/2014 1/4/2014 1 1 1 0
B 1/4/2014 1/5/2014 0 1 1 0
B 1/5/2014 . 1 1 2 1
B 1/6/2014 1/7/2014 1 1 3 2
Corresponding Code:
%macro History;
data y;
set x;
Event_2_Cnt = 0;
Event_3_Cnt = 0;
%do i = 1 %to 1250;
lag_Item_Type = lag&i(Item_Type);
lag_Event_2_flg = lag&i(Event_2_flg);
lag_Event_3_flg = lag&i(Event_3_flg);
lag_Event_1_dt = lag&i(Event_1_dt);
if Item_Type = lag_Item_Type and lag_Event_1_dt > . and lag_Event_1_dt < Date then do;
if lag_Event_2_flg = 1 then do;
Event_2_Cnt = Event_2_cnt + 1;
end;
if lag_Event_3_flg = 1 then do;
Event_3_Cnt = Event_3_cnt + 1;
end;
end;
%end;
run;
%mend;
%History;
Well, that's not a trivial task for SAS, but it can still be solved in one DATA step, without merging, by using hash objects. The idea is as follows.
Within each item type, going record by record, we 'collect' event flags into 'bins' in a hash object, where each bin is a certain date. All bins are ordered by date in ascending order. At the same time, we insert the Date of the current record into the same hash (at its place in date order) and then iterate 'up' from that place, summing all the bins gathered so far (these have dates earlier than the date of the current record, since we are going up).
Here's the code:
data have;
informat Item_Type $1. Date Event_1_dt mmddyy9. Event_2_flg Event_3_flg 8.;
infile datalines dsd dlm=',';
format Date Event_1_dt date9.;
input Item_Type Date Event_1_dt Event_2_flg Event_3_flg;
datalines;
A,1/1/2014,1/2/2014,1,1
A,1/2/2014,1/2/2014,0,1
A,1/3/2014,1/8/2014,1,0
B,1/1/2014,1/2/2014,1,0
B,1/2/2014,1/5/2014,1,0
B,1/3/2014,1/4/2014,1,1
B,1/4/2014,1/5/2014,0,1
B,1/5/2014,,1,1
B,1/6/2014,1/7/2014,1,1
;
run;
proc sort data=have; by Item_Type; run;
data want;
set have;
by Item_Type;
if _N_=1 then do;
declare hash h(ordered:'a');
h.defineKey('Event_date','type');
h.defineData('event2_cnt','event3_cnt');
h.defineDone();
declare hiter hi('h');
end;
/*for each new Item_type we clear the hash completely*/
if FIRST.Item_Type then h.clear();
/*now if date of Event 1 exists we put it into corresponding */
/* (by date) place of our ordered hash. If such date is already*/
/*in the hash, we increase number of events for this date */
/*adding values of Event2 and Event3 flags. If no - just assign*/
/*current values of these flags.*/
if not missing(Event_1_dt) then do;
Event_date=Event_1_dt;type=1;
rc=h.find();
event2_cnt=coalesce(event2_cnt,0)+Event_2_flg;
event3_cnt=coalesce(event3_cnt,0)+Event_3_flg;
h.replace();
end;
/*now we insert Date of the record into the same ordered hash, */
/*setting type=0 to distinguish this item from items where the */
/*date is the date of Event1 (not the date of the record)*/
Event_date=Date;
event2_cnt=0; event3_cnt=0; type=0;
h.replace();
Desired_Event_2_Cnt=0;
Desired_Event_3_Cnt=0;
/*now we iterate 'up' from just inserted item, i.e. looping */
/*through all items that have date < the date of the record. */
/*Items with date = the date of the record will be 'below' since*/
/*they have type=1 and our hash is ordered by dates first, and */
/*types afterwards (1's will be below 0's)*/
hi.setcur(key:Date,key:0);
rc=hi.prev();
do while(rc=0);
Desired_Event_2_Cnt+event2_cnt;
Desired_Event_3_Cnt+event3_cnt;
rc=hi.prev();
end;
drop Event_date type rc event2_cnt event3_cnt;
run;
I can't test it with your real number of rows, but I believe it should be pretty fast, since we loop only through a small hash object that is held entirely in memory, and for each record we do only as many iterations as necessary (only earlier events) with no IF checks.
I don't think a hash is necessary for this; it seems a simple data step will do the trick. This might save you (or the next programmer who comes across your code) from having to re-read it and do research in order to understand it.
I think the following will work:
data have;
informat Item_Type $1. Date Event_1_dt mmddyy9. Event_2_flg Event_3_flg 8.;
infile datalines dsd dlm=',';
format Date Event_1_dt date9.;
input Item_Type Date Event_1_dt Event_2_flg Event_3_flg;
datalines;
A,1/1/2014,1/2/2014,1,1
A,1/2/2014,1/2/2014,0,1
A,1/3/2014,1/8/2014,1,0
B,1/1/2014,1/2/2014,1,0
B,1/2/2014,1/5/2014,1,0
B,1/3/2014,1/4/2014,1,1
B,1/4/2014,1/5/2014,0,1
B,1/5/2014,,1,1
B,1/6/2014,1/7/2014,1,1
;
data want2 (drop=_: );
set have;
by ITEM_Type;
length _Alldts_event2 _Alldts_event3 $20000;
retain _Alldts_event2 _Alldts_event3;
*Clear _ALLDTS for each ITEM_TYPE;
if first.ITEM_type then Do;
_Alldts_event2 = "";
_Alldts_event3 = "";
END;
*If event is flagged, concatenate the Event_1_dt to the ALLDTS variable;
if event_2_flg = 1 Then _Alldts_event2 = catx(" ", _Alldts_event2,Event_1_dt);
if event_3_flg = 1 Then _Alldts_event3 = catx(" ", _Alldts_event3,Event_1_dt);
_numWords2 = COUNTW(_Alldts_event2);
_numWords3 = COUNTW(_Alldts_event3);
*Loop through alldates, count the number that are < the current records date;
cnt2=0;
do _i = 1 to _NumWords2;
_tempDate = input(scan(_Alldts_event2,_i),Best12.);
if _tempDate < date Then cnt2=cnt2+1;
end;
cnt3=0;
do _i = 1 to _NumWords3;
_tempDate = input(scan(_Alldts_event3,_i),Best12.);
if _tempDate < date Then cnt3=cnt3+1;
end;
run;
I believe the Hash may be faster, but you'll have to decide on what tradeoff of comprehensibility/performance is appropriate.
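For checking either approach on a small sample, the counting rule can also be written as correlated subqueries in PROC SQL. This is only a sketch, not one of the answers above: the output name want_sql is made up, it assumes the flags are 0/1, and it relies on the question's statement that Event_1_dt is never earlier than its own record's Date (so any record with Event_1_dt before the current Date is an earlier record). Each row re-scans its item type, so it is unlikely to be the fast option on the full data.

```sas
data have;
    infile datalines dsd dlm=',';
    informat Item_Type $1. Date Event_1_dt mmddyy10. Event_2_flg Event_3_flg 8.;
    format Date Event_1_dt date9.;
    input Item_Type Date Event_1_dt Event_2_flg Event_3_flg;
    datalines;
A,1/1/2014,1/2/2014,1,1
A,1/2/2014,1/2/2014,0,1
A,1/3/2014,1/8/2014,1,0
B,1/1/2014,1/2/2014,1,0
B,1/2/2014,1/5/2014,1,0
B,1/3/2014,1/4/2014,1,1
B,1/4/2014,1/5/2014,0,1
B,1/5/2014,,1,1
B,1/6/2014,1/7/2014,1,1
;

proc sql;
    create table want_sql as
    select a.*,
           /* flagged records of the same type whose Event_1_dt precedes this Date */
           (select count(*) from have as b
             where b.Item_Type = a.Item_Type
               and b.Event_2_flg = 1
               and b.Event_1_dt is not missing
               and b.Event_1_dt < a.Date) as Event_2_Cnt,
           (select count(*) from have as b
             where b.Item_Type = a.Item_Type
               and b.Event_3_flg = 1
               and b.Event_1_dt is not missing
               and b.Event_1_dt < a.Date) as Event_3_Cnt
    from have as a
    order by a.Item_Type, a.Date;
quit;
```

The explicit "is not missing" test matters because SAS treats a missing date as smaller than any real date.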