Create a group id when grouping by several variables

I would like to create an id variable to identify the unique groups according to several variables.
For example, I have the data set cars from sashelp.cars, and I would like to identify unique groups of Make, DriveTrain and Cylinders with the id variable grp_id. The same Make and same DriveTrain with a different number of Cylinders would be considered a new group (and hence a new value of grp_id).
I tried the following, but it resets the id variable to 1 whenever a new case starts, so it does not treat every unique combination of Make + DriveTrain + Cylinders as a different group id.
data cars; set sashelp.cars; run;
proc sort data=cars; by Make DriveTrain Cylinders; run;
data cars; set cars;
grp_id + 1;
by Make DriveTrain Cylinders;
if first.Make or first.DriveTrain or first.Cylinders then grp_id = 1;
run;
Any idea on how to create this grp_id variable using several variables as the criteria?

You want each combination to have a unique group id, so don't reset the group id. You would reset a variable only if you were also assigning a sequence number within the group.
When to increment: for combinations, increment the group id whenever the last variable listed in the BY statement is flagged as first (here, first.Cylinders = 1).
Example:
proc sort data=sashelp.cars out=cars;
by Make DriveTrain Cylinders;
run;
data cars;
set cars;
by Make DriveTrain Cylinders;
if first.Cylinders then grp_id + 1;
* this answer gives you bonus information ! ;
if first.Cylinders
then seq_in_group = 1;
else seq_in_group + 1;
run;
Note: Conceptually, BY defines a hierarchy of n variables. When the m-th variable in the hierarchy changes value during a serial pass through the data, its flag is set: first.var_m = 1. The first. automatic variables of all subordinate levels are set as well. In other words, this assertion holds: first.var_(m+1) = 1, ..., first.var_n = 1. That is why testing first.Cylinders alone is enough.
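To see the hierarchy described in the note, it can help to print the first. flags to the log. This is a minimal sketch; the output data set name cars_sorted and the obs=30 limit are arbitrary choices made here for illustration.

```sas
/* Sort a copy so BY-group processing is valid (cars_sorted is a hypothetical name) */
proc sort data=sashelp.cars out=cars_sorted;
  by Make DriveTrain Cylinders;
run;

/* Write the BY values and their first. flags to the log for the first 30 rows */
data _null_;
  set cars_sorted(obs=30);
  by Make DriveTrain Cylinders;
  put Make= DriveTrain= Cylinders=
      first.Make= first.DriveTrain= first.Cylinders=;
run;
```

Whenever first.Make or first.DriveTrain is 1 in the log, first.Cylinders should be 1 as well.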

As an alternative, here is a hashing approach that does not require sorting.
data cars;
if _N_ = 1 then do;
declare hash h ();
h.definekey ('Make', 'DriveTrain', 'Cylinders');
h.definedata ('grp_id');
h.definedone();
end;
set sashelp.cars;
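* find overwrites grp_id on a hit, so keep the running count in a separate retained counter ;
if h.find () ne 0 then do;
_n_groups + 1;
grp_id = _n_groups;
h.add();
end;
drop _n_groups;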
run;

Using your own code, you only have to make a small change:
data cars;
set sashelp.cars;
run;
proc sort data=cars;
by Make DriveTrain Cylinders;
run;
data cars;
set cars;
by Make DriveTrain Cylinders;
if first.Make or first.DriveTrain or first.Cylinders then grp_id + 1;
run;
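Whichever approach you use, a quick sanity check is to compare the number of distinct combinations with the largest id. This is a sketch that assumes the result is in a data set named cars with the variable grp_id, as in the code above; the column aliases are made up here.

```sas
/* If every combination got its own id, the two numbers should match */
proc sql;
  select count(*)    as n_combinations,
         max(grp_id) as max_grp_id
  from (select distinct Make, DriveTrain, Cylinders, grp_id from cars);
quit;
```

If n_combinations is larger than max_grp_id, at least two different combinations share an id.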

Calculate duration between dates by group using SAS

I am trying to calculate how long a child has been in foster care. However, I am having some issues. My data look something like the example below:
For each individual (ID) I need to calculate the duration (end_date - start_date). However, I also need to apply a rule: if there are fewer than 5 days between the end date of one record and the start date of the next within the same type of foster care, they should be considered one continuous placement. If there are more than five days between the end date and the start date within the same type of foster care for the same individual, it is a new placement. If it is a new type of foster care, it is a new placement. The variable "duration" shows how it is supposed to be calculated.
I have tried the following code, but it doesn't work properly, and I don't know how to apply my "five day" rule.
Proc sort data=have out=want;
by id type descending start_date;
run;
Data want;
set want;
by id type;
retain last_date;
if first.id or first.type then do;
last_date=end_date;
end;
if last.id or last.type then duration=(end_date-start_date);
run;
Any help is much appreciated!
Using a bunch of retain statements here to achieve this:
data want;
set have;
by id ;
retain true_sd prev_ed prev_type;
if first.id then call missing(prev_type);
if type ~= prev_type then do;
true_sd = sd;
call missing(prev_ed);
call missing(prev_type);
end;
if sd - prev_ed > 5 then true_sd = sd;
duration = ed - true_sd;
output;
prev_type = type;
prev_ed = ed;
format sd ed true_sd prev_ed date.;
run;
(This assumes type and id are numeric; ed is end_date and sd is start_date.)
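Because the step uses BY id and compares each row with the previous one, it presumably expects have to be in order by id and start date. A minimal sketch of that preparatory sort, assuming the variable names id and sd used in the answer:

```sas
/* Order each individual's records chronologically before running the RETAIN step */
proc sort data=have;
  by id sd;
run;
```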

Is there a way to pass a list under a macro code?

I have a customer survey data like this:
data feedback;
length customer score comment $50.;
input customer $ score comment & $;
datalines;
A 3 The is no parking
A 5 The food is expensive
B . I like the food
C 5 It tastes good
C . blank
C 3 I like the drink
D 4 The dessert is tasty
D 2 I don't like the service
;
run;
There is a macro code like this:
%macro subset( cust=);
proc print data= feedback;
where customer = "&cust";
run;
%mend;
I am trying to write a program that calls %subset for each customer value in the feedback data. Note that we do not know how many unique values of customer there are in the data set. Also, we can't change the %subset code.
I tried to achieve that by using proc sql to create a unique list of customers to pass into the macro, but I think you cannot pass a list to a macro.
Is there a way to do that? P.S. I am a beginner with macros.
I like to keep things simple. Take a look at the following:
data feedback;
length customer score comment $50.;
input customer $ score comment & $;
datalines;
A 3 The is no parking
A 5 The food is expensive
B . I like the food
C 5 It tastes good
C . blank
C 3 I like the drink
D 4 The dessert is tasty
D 2 I don't like the service
;
run;
%macro subset( cust=);
proc print data= feedback;
where customer = "&cust";
run;
%mend subset;
%macro test;
/* first get the count of distinct customers */
proc sql noprint;
select count(distinct customer) into : cnt
from feedback;quit;
/* do this to remove leading spaces */
%let cnt = &cnt;
/* now get each of the customer names into macro variables */
proc sql noprint;
select distinct customer into: cust1 - :cust&cnt
from feedback;quit;
/* use a loop to call other macro program, notice the use of &&cust&i */
%do i = 1 %to &cnt;
%subset(cust=&&cust&i);
%end;
%mend test;
%test;
Of course, if you want short and sweet, you can use the following (just make sure your data is sorted by customer):
data _null_;
set feedback;
by customer;
if(first.customer)then call execute('%subset(cust='||customer||')');
run;
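A common variation on the CALL EXECUTE approach wraps the macro call in %NRSTR so that the macro runs after the DATA step finishes rather than while it is still executing. That makes no difference for a macro as simple as %subset, but it matters for macros that create and then use macro variables. This is a sketch of that idiom; CATS is used here to strip the trailing blanks from customer.

```sas
data _null_;
  set feedback;
  by customer;
  /* %nrstr delays the macro call until the generated code is actually run */
  if first.customer then
    call execute(cats('%nrstr(%subset(cust=', customer, '))'));
run;
```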
First fix the SAS code. To test whether a value is in a list, use the IN operator, not the = operator.
where customer in ('A' 'B')
Then you can pass that list into your macro and use it in your code.
%macro subset(custlist);
proc print data= feedback;
where customer in (&custlist);
run;
%mend;
%subset(custlist='A' 'B')
Notice a few things:
Use quotes around the values since the variable is character.
Use spaces between the values. The IN operator in SAS accepts either spaces or comma (or both) as the delimiter in the list. It is a pain to pass in comma delimited lists in a macro call since the comma is used to delimit the parameters.
You can define a macro parameter as positional and still call it by name in the macro call.
If the list is in a dataset you can easily generate the list of values into a macro variable using PROC SQL. Just make sure the resulting list is not too long for a macro variable (maximum of 64K bytes).
proc sql noprint;
select distinct quote(trim(customer))
into :custlist separated by ' '
from my_subset
;
quit;
%subset(&custlist)

proc ttest class, default group issue

I would like to compare mean values of two groups using proc ttest, and I successfully did it as below.
proc ttest;
class group;
var score;
run;
But this code treats observations with group = 0 as the default group, so the t-statistic is calculated as Mean(score where group = 0) minus Mean(score where group = 1). I would like it the other way around.
It would only change the sign of the t-statistic, but that is what I want.
Is there a way to do so by simply adding an option?
I know I could have done it if I have made another dummy variable which is exactly the opposite of my group variable. But, I don't want to create more dummy variables.
ORDER=DATA will tell SAS to order the class variable based on when it encounters the values. So if the 1 values are earlier than the 0 values, it will be first in the comparison.
For example:
data for_ttest;
call streaminit(7);
do group = 0 to 1;
do _n_ = 1 to 50;
score = rand('NORMAL',1,0.5)+group;
output;
end;
end;
run;
proc sort data=for_ttest;
by descending group;
run;
proc ttest data=for_ttest order=data;
class group;
var score;
run;
Without ORDER=DATA, it behaves as you saw, but with it, 1 is the first group.
You could also combine ORDER=FORMATTED with a format.
proc format;
value groupf
1="Group 1 (Value=1)"
0="Group 2 (Value=0)"
;
quit;
proc ttest data=for_ttest order=formatted;
class group;
format group groupf.;
var score;
run;
The labels in the PROC FORMAT are irrelevant, other than that they must sort alphabetically in the order you want the groups compared. Unfortunately the PRELOADFMT option is not available in PROC TTEST, so you can't use the NOTSORTED trick in PROC FORMAT to allow this to work even with the original values (though you can use non-printing characters to mess with sort order if you really want to).

Outputting conditionally from merge

I want to update a history file in SAS. I have new observations, which may overlap with existing data lines.
What I need is a file that takes its lines from the new dataset (new_data) where they exist for a key and otherwise from the old dataset (old_data). What I've come up with is a clunky merge operation that depends on the order of the datasets (it works only if New_data is listed after Old_data).
data new_data;
input key value;
datalines;
1 10
1 11
2 20
2 21
;
run;
data old_data;
input key value;
datalines;
2 50
2 51
3 30
3 31
;
run;
So I'd like to have the following:
key value
1 10
1 11
2 20
2 21
3 30
3 31
However the following does not work. It produces the output below it.
data updated_history;
merge New_data(in=a) old_data(in=b) ;
by key;
if a or (b and not a );
run;
....
2 50
2 51
...
But for some reason this does:
data updated_history;
merge old_data(in=b) New_data(in=a);
by key;
if a or (b and not a );
run;
Question: Is there an intelligent way to control which dataset the values are selected from? Something like: if a then value_from_dataset a;
The order in which you list the data sets in the MERGE statement is the order in which the data is read. So when the order is old, new, values from old are read first and then values from new overwrite them. This is why your second version works and the first does not.
Since you have multiple observations per key value you probably do NOT want to use MERGE to combine these files. You could do it using SET by reading the data twice using two DOW loops. In that case it won't matter the order of the dataset in the SET statement since the records are interleaved instead of being joined. This first loop will calculate which of the two input datasets have any observations for this KEY value.
data want ;
anyold=0;
anynew=0;
do until (last.key);
set old_data (in=inold) new_data(in=innew);
by key ;
if inold then anyold=1;
if innew then anynew=1;
end;
do until (last.key);
set old_data (in=inold) new_data(in=innew);
by key ;
if not (anyold and anynew and inold) then output;
end;
drop anyold anynew;
run;
This type of combination is probably easier to code using SQL.
proc sql ;
create table want as
select key,value from new_data
union
select key,value from old_data
where key in (select key from old_data except select key from new_data)
order by 1
;
quit;
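Note that UNION removes duplicate rows, so two identical key/value lines in the same input would collapse into one. If that matters, a variant with UNION ALL and NOT IN is one option. This is a sketch rather than a drop-in replacement, and the order of value within each key is not guaranteed.

```sas
proc sql;
  create table want as
  /* every row from the new data */
  select key, value from new_data
  union all
  /* old rows only for keys that never appear in the new data */
  select key, value from old_data
    where key not in (select key from new_data)
  order by 1
  ;
quit;
```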

SAS: Coding a dummy variable for a value of a variable by group within group

I have a dataset of CASE_ID (x, y and z), a set of multiple dates (including duplicate dates) for each CASE_ID, and a variable VAR. I would like to create a dummy variable DUMMYVAR by group within a group, whereby if VAR="C" for CASE_ID x on some specific date, then DUMMYVAR=1 for all observations corresponding to CASE_ID x on that date.
I believe that a classic 2XDOW would be the key here, but this is my third week using SAS and I am having difficulty getting this to work with two BY groups.
I have referenced and attempted to write a variation of Haikuo's code here:
PROC SORT have;
by CASE_ID DATE;
RUN;
data want;
do until (last.DATE);
set HAVE;
by date notsorted;
if var='c' then DUMMYVAR=1;
do until (last.DATE);
set HAVE;
by DATE notsorted;
if DATE=1 then ????????
end;
run;
Change your BY statements to match the grouping you are doing. And in the second loop add a simple OUTPUT; statement. Then your new dataset will have all the rows in your original dataset and the new variable DUMMYVAR.
data want;
do until (last.DATE);
set HAVE;
by case_id date;
if var='c' then DUMMYVAR=1;
end;
do until (last.DATE);
set HAVE;
by case_id date;
output;
end;
run;
This will create the variable DUMMYVAR with values of either 1 or missing. If you want the values to be 1 or 0, you could either set it to 0 before the first DO loop or add an if first.date then dummyvar=0; statement before the existing IF statement.
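For the 0/1 version, the second suggestion could look like this. It is a sketch using the same data set and variable names as above, and it keeps the lowercase 'c' comparison from the posted code.

```sas
data want;
  /* first pass: decide the flag for this CASE_ID / DATE group */
  do until (last.date);
    set have;
    by case_id date;
    if first.date then dummyvar = 0;
    if var = 'c' then dummyvar = 1;
  end;
  /* second pass: write every row of the group with the flag attached */
  do until (last.date);
    set have;
    by case_id date;
    output;
  end;
run;
```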