SAS: Coding a dummy variable for a value of a variable by group within group - group-by

I have a dataset of CASE_ID (x, y and z), a set of multiple dates (including duplicate dates) for each CASE_ID, and a variable VAR. I would like to create a dummy variable DUMMYVAR by group within a group, whereby if VAR="C" for CASE_ID x on some specific date, then DUMMYVAR=1 for all observations corresponding to CASE_ID x on that date.
I believe that a classic double DOW loop (2XDOW) would be the key here, but this is my third week using SAS and I am having difficulty getting it to work with two BY groups.
I have referenced and attempted to write a variation of Haikuo's code here:
PROC SORT data=have;
by CASE_ID DATE;
RUN;
data want;
do until (last.DATE);
set HAVE;
by date notsorted;
if var='c' then DUMMYVAR=1;
do until (last.DATE);
set HAVE;
by DATE notsorted;
if DATE=1 then ????????
end;
run;

Change your BY statements to match the grouping you are doing. And in the second loop add a simple OUTPUT; statement. Then your new dataset will have all the rows in your original dataset and the new variable DUMMYVAR.
data want;
do until (last.DATE);
set HAVE;
by case_id date;
if var='c' then DUMMYVAR=1; /* character comparisons are case-sensitive; use 'C' if that is how the value is stored */
end;
do until (last.DATE);
set HAVE;
by case_id date;
output;
end;
run;
This will create the variable DUMMYVAR with values of either 1 or missing. If you want the values to be 1 or 0, you could either set DUMMYVAR to 0 before the first DO loop, or add an if first.date then dummyvar=0; statement before the existing IF statement.
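The 1-or-0 variant can be sketched as follows. This is a minimal example under stated assumptions: the small HAVE dataset is hypothetical sample data made up for illustration, and the comparison uses 'C' as written in the question text.

```sas
/* Hypothetical sample data, already sorted by CASE_ID and DATE */
data have;
  input case_id $ date :date9. var $;
  format date date9.;
  datalines;
x 01JAN2020 A
x 01JAN2020 C
x 02JAN2020 B
y 01JAN2020 B
;
run;

data want;
  dummyvar = 0;                  /* runs once per BY group, before the first loop */
  do until (last.date);          /* pass 1: look for a 'C' anywhere in the group */
    set have;
    by case_id date;
    if var = 'C' then dummyvar = 1;
  end;
  do until (last.date);          /* pass 2: re-read the same group and write every row */
    set have;
    by case_id date;
    output;
  end;
run;
```

With this sample, both rows for case x on 01JAN2020 should get DUMMYVAR=1 and the remaining rows 0, because each iteration of the DATA step handles exactly one CASE_ID/DATE group.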

Related

Complete empty cells of a table with specific value SAS

I'm working on a table in SAS with a column that contains dates (FENTREGA). I want to fill the empty cells of this column with "Empty". Can you help me with the code?
The column I need to use is FENTREGA.
Since your column is numeric, you cannot put text in the same column. However, you can make the missing value (displayed as a period) appear as the text Empty if you use a custom format.
Alternatively, you can make the whole column text, but then you cannot do date operations or calculations on the column without converting it back.
proc format;
value empty_dates
. = 'Empty'
Other = [mmddyyd10.];
run;
proc sql;
....
t1.FENTREGA format=empty_dates.,
....
EDIT: Fully tested solution, works as expected
DATA have;
informat FENTREGA mmddyy10.;
format FENTREGA date9.;
input FENTREGA;
datalines;
12/10/2003
10/15/2006
07/20/2010
05/11/2006
10/01/2006
07/03/2012
05/08/2015
.
.
.
.
;
RUN;
proc format;
value empty_dates
. = 'Empty'
Other = [mmddyyd10.];
run;
proc sql;
select
FENTREGA format=empty_dates.
from have;
quit;

Calculate duration between dates by group using SAS

I am trying to calculate how long a child has been in foster care. However, I am having some issues. My data should look something like the data below:
For each individual (ID) I need to calculate the duration (end_date-start_date). However, I also need to apply a rule stating that if there are fewer than 5 days between the end date and the next start date within the same type of foster care, it should be considered one consecutive placement. If there are more than five days between the end date and the start date within the same type of foster care for the same individual, it is a new placement. If it is a new type of foster care, it is a new placement. The variable "duration" shows how it is supposed to be calculated.
I have tried the following code, but it doesn't work properly, and I don't know how to apply my "five day" rule.
Proc sort data=have out=want;
by id type descending start_date;
run;
Data want;
set want;
by id type;
retain last_date;
if first.id or first.type then do;
last_date=end_date;
end;
if last.id or last.type then duration=(end_date-start_date);
run;
Any help is much appreciated!
Using a bunch of retain statements here to achieve this:
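data want;
set have;
by id ;
retain true_sd prev_ed prev_type;
/* new individual: forget the previous placement type */
if first.id then call missing(prev_type);
/* a change of type always starts a new placement */
if type ~= prev_type then do;
true_sd = sd;
call missing(prev_ed);
call missing(prev_type);
end;
/* a gap of more than 5 days also starts a new placement */
if sd - prev_ed > 5 then true_sd = sd;
duration = ed - true_sd;
output;
/* remember this row's values for the next observation */
prev_type = type;
prev_ed = ed;
format sd ed true_sd prev_ed date.;
run;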
(This assumes type and id are numeric, and that the data are sorted by id and start date. ed is end_date and sd is start_date.)

proc ttest class, default group issue

I would like to compare mean values of two groups using proc ttest, and I successfully did it as below.
proc ttest;
class group;
var score;
run;
But this code just assumes observations with group = 0 as the default group. So the t-statistic is calculated based on Mean (score of obs with group = 0) minus Mean (score of obs with group = 1). I would like to have it the other way around.
It would just change the sign of the t-statistic, but that is exactly what I want.
Is there a way to do so by simply adding an option?
I know I could have done it if I have made another dummy variable which is exactly the opposite of my group variable. But, I don't want to create more dummy variables.
ORDER=DATA tells SAS to order the class variable by the order in which it encounters the values. So if the 1 values come earlier in the dataset than the 0 values, group 1 will be first in the comparison.
For example:
data for_ttest;
call streaminit(7);
do group = 0 to 1;
do _n_ = 1 to 50;
score = rand('NORMAL',1,0.5)+group;
output;
end;
end;
run;
proc sort data=for_ttest;
by descending group;
run;
proc ttest data=for_ttest order=data;
class group;
var score;
run;
Without ORDER=DATA, it behaves as you saw, but with it, 1 is the first group.
You could also combine ORDER=FORMATTED with a format.
proc format;
value groupf
1="Group 1 (Value=1)"
0="Group 2 (Value=0)"
;
quit;
proc ttest data=for_ttest order=formatted;
class group;
format group groupf.;
var score;
run;
The label text in the PROC FORMAT is irrelevant, except that the alphabetical order of the labels determines which group comes first. Unfortunately the PRELOADFMT option is not available in PROC TTEST, so you can't use the NOTSORTED trick in PROC FORMAT to allow this to work even with the original values (though you can use non-printing characters to mess with sort order if you really want to).

MONYY7. and DATE9. operations

I'm working on a very big data set (more than 100 variables and 11 million observations). In this data set, I have a variable named DTDSI (simulation date) in DATE9. format (for example: 01APR2015, 02MAR2015...). I have a macro program to analyse this data set by comparing the observations in 2 different months:
%macro analysis (data_input , m , m_1);
.....
%mend;
The 2 macro variables m and m_1 are the months that I want to compare. Their format is MONYY7. (APR2015, MAR2015...). Keep in mind that I cannot modify my data_input (it's my company's data). At the beginning of my macro program, I want to create a new data set with only the observations of the &m and &m_1 months. I can easily do that by creating a new date variable from DTDSI (real_month, for example) in the format MONYY7. Then I just select the observations where real_month equals &m or real_month equals &m_1:
Data new;
Set &data_input;
mois_real = put(DTDSI, monyy7.);
RUN;
PROC SQL;
CREATE TABLE NEW AS
SELECT *
FROM NEW
WHERE mois_real in ("&m" , "&m_1");
....
The problem is that in my first DATA step I duplicated my data_input, which is bad because it took 30 minutes. So can you tell me how I can make my selection (DTDSI in month m or in month m_1) right in my first step?
You can use expressions in your WHERE/IF condition, so apply your formula from step 1 in step 2 or vice versa.
Data new;
set &data_input;
WHERE put(DTDSI, monyy7.) in ("&m" , "&m_1");
run;
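The same idea also works without converting the date to text. A minimal sketch, assuming DTDSI is a numeric SAS date: INTNX aligns each date to the first day of its month, and the result is compared with date literals built from the macro variables. The SAMPLE dataset and the %let values are hypothetical stand-ins for &data_input and the macro parameters.

```sas
/* Hypothetical sample standing in for &data_input */
data sample;
  format DTDSI date9.;
  do DTDSI = '15FEB2015'd, '02MAR2015'd, '01APR2015'd, '20APR2015'd;
    output;
  end;
run;

/* Hypothetical values; inside the macro these would be the parameters */
%let m   = APR2015;
%let m_1 = MAR2015;

data new;
  set sample;
  /* "01&m"d resolves to "01APR2015"d, the first day of month &m */
  where intnx('month', DTDSI, 0, 'b') in ("01&m"d, "01&m_1"d);
run;
```

With the sample above, the February row is dropped and the March and April rows are kept. Either form filters while reading, so the full input is not copied first.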

References to SAS date macros not working due to differing data types

Hoping for some help on this one.
Currently, the query uses the below to create references for m1-m6 and d1-d6.
%let m1=1114;
%let d1 ='30NOV2014'd;
%let m2=1214;
%let d2='31DEC2014'd;
%let m3=0115;
%let d3='31JAN2015'd;
%let m4=0215;
%let d4='28FEB2015'd;
%let m5=0315;
%let d5='31MAR2015'd;
%let m6=0415;
%let d6='30APR2015'd;
Based on the rest of the code, the m1-m6 dates must be formatted as mmyy. I have tried to swap the above out with this:
data _datemacro_;
m1 = put(intnx('day','01NOV2014'd,0),mmyyn4.);
call symput('m1',"'"||put(m1,9.)) ;
d1 = put(intnx('day','30NOV2014'd,0),date9.);
call symput('d1',"'"||put(d1,9.)) ;
m2 = put(intnx('day',&d1,+1),mmyyn4.);
call symput('m2',"'"||put(m2,9.)) ;
d2 = put(intnx('month',&d1,+1,'e'),date9.);
call symput('d2',"'"||put(d2,9.)||"'"||"d") ;
...etc through m6 and d6
run;
Below is the rest of the code, which yields a variety of errors, including
ERROR 22-322: Syntax error, expecting one of the following: a name, a
quoted string, (, /, ;, DATA, LAST, NULL.
ERROR 200-322: The symbol is not recognized and will be ignored.
proc sql;
create table perf as
select a. field, a. field, a. field, a. reportingdate,
a. field, a. field,
e. field,
f. field
from table a, table2 e, table3 f
where a. reportingdate between &d1. and &d6.
and (a. field=1 or a. field=1)
and a. field = e. field and a. field = f. field;
quit;
/*Creates performance file by month*/
%macro month (mon,date);
data m&mon. (rename=(field=active&mon. field=co&mon. field=es&mon. field=sr&mon.));
set perf;
where datepart(reportingdate)=&date.;
run;
proc sort data=m&mon.; by field descending co&mon.; run;
proc sort data=m&mon. nodupkey out=m2&mon.; by field; run;
%mend month;
%month (&m1.,&d1.);
%month (&m2.,&d2.);
%month (&m3.,&d3.);
%month (&m4.,&d4.);
%month (&m5.,&d5.);
%month (&m6.,&d6.);
I am able to get it to run accurately until the last 6 lines, where it comes up with 78 errors just from running those 6 lines.
Any suggestions on how to write the macro to keep the correct data type while accurately defining the month start and end dates? When I try and change the start date and end date of each month to the same format, something within the rest of the code causes an error stating that it cannot work with two variables of different formats, even when they are clearly the same format as defined in the code.
Please let me know if there is anything I can clarify, as this was a little harder to explain than I intended.
Thank you for your help.
So you're basically trying to build a macro that will run a report month-by-month. I think using a macro is a good idea, but your structure could benefit from a re-org.
The first thing to fix is the hardcoded dates. Hardcoding is bad 99% of the time. Why not use a loop instead?
Initialise the start and end dates at the top of your program. In future they're easy to find and change if they're at the top, and you won't need to search through your code trying to figure out what else needs to change:
* PICK ANY DATES IN THE MONTHS YOU WANT TO START AND END. IE. DOESNT MATTER IF YOU CHOOSE THE FIRST OR THE 20TH. IT WILL RUN FOR THAT MONTH;
%let rpt_start = %sysfunc(mdy(11,1,2014));
%let rpt_end = %sysfunc(mdy( 4,1,2015));
Go and get the data between the start and end dates:
proc sql;
create table perf as
select a. field, a. field, a. field, a. reportingdate,
a. field, a. field,
e. field,
f. field
from table a, table2 e, table3 f
where a. reportingdate between %sysfunc(intnx(month,&rpt_start,0,b)) and %sysfunc(intnx(month,&rpt_end,0,e))
and (a. field=1 or a. field=1)
and a. field = e. field and a. field = f. field;
quit;
Now loop over each month in between the start and end dates, creating the desired datasets as we go.
%macro create_monthly_datasets;
%local tmp_dt tmp_end rpt_dt mmyy;
%let tmp_end = %sysfunc(intnx(month,&rpt_end,0,end)); *CALC END-DATE DESIRED;
%let tmp_dt = %sysfunc(intnx(month,&rpt_start,0,beginning)); *INITIALISE LOOP;
%do %while (&tmp_dt le &tmp_end);
* CALC ACTUAL DATE WANTED AND STORE IT IN RPT_DT;
* CALC THE MMYY VAL YOU NEED;
* PRINT OUT BOTH VALUES TO MAKE SURE THEYRE CORRECT;
%let rpt_dt = %sysfunc(intnx(month,&tmp_dt,0,end));
%let mmyy = %sysfunc(month(&rpt_dt),z2.)%substr(%sysfunc(year(&rpt_dt)),3,2);
%put %sysfunc(sum(&rpt_dt),date9.) &mmyy;
* DO THE WORK;
data m&mmyy (rename=(field=active&mmyy field=co&mmyy field=es&mmyy field=sr&mmyy));
set perf;
where datepart(reportingdate)=&rpt_dt;
run;
proc sort data=m&mmyy; by field descending co&mmyy; run;
proc sort data=m&mmyy nodupkey out=m2&mmyy; by field; run;
%let tmp_dt = %sysfunc(intnx(month,&tmp_dt,1,beginning)); * ITERATE LOOP;
%end;
%mend;
%create_monthly_datasets;
Try the following for creating the macro variables:
data _null_;
START_MTH = '01nov2014'd;
do i = 1 to 6;
T_DATE = intnx('month',START_MTH,i)-1; /*Shift forwards i months then back 1 day*/
call symput(cats('m',i),put(T_DATE,mmyyn4.));
call symput(cats('d',i),cats("'",put(T_DATE,date9.),"'d"));
end;
run;
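To check what the step produced, the values can be written to the log once the step has finished. A minimal sketch that repeats the step above, with CALL SYMPUTX (which strips leading and trailing blanks) swapped in for CALL SYMPUT; the %put line is added only for illustration.

```sas
data _null_;
  start_mth = '01nov2014'd;
  do i = 1 to 6;
    /* last day of the i-th month, counting from start_mth */
    t_date = intnx('month', start_mth, i) - 1;
    call symputx(cats('m', i), put(t_date, mmyyn4.));
    call symputx(cats('d', i), cats("'", put(t_date, date9.), "'d"));
  end;
run;

/* Macro variables created in a DATA step can only be referenced after the step has run */
%put m1=&m1 d1=&d1 m6=&m6 d6=&d6;
```

This is also why referencing &d1 inside the same DATA step that creates it, as in the original attempt, does not work: the reference is resolved when the step is compiled, before CALL SYMPUT has executed.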