Getting the most frequent class by group in SAS

I have a dataset in SAS with the following columns:
ID, Class, Group
There are 4 values for Group: {1, 2, 3, 4}, and an undetermined number of classes {Class1, ..., Class n}.
What is the quickest way to find the most frequent class for each group?
I can see two options: using PROC FREQ, or using something like
proc sql;
select group, class, count(*) as n
from have
group by group, class;
quit;
and then taking the max within each group. But I'm not sure how to finish.
EDIT
I said the quickest, but it's really about efficiency: I'm working on a big table (10 million rows) and I'm running this very often.

The following step by step approach is one method:
data have;
input group : 8.
class : $char8.
;
datalines;
1 class1
1 class1
1 class2
1 class3
2 class2
2 class2
2 class2
2 class3
3 class1
3 class2
3 class3
3 class3
;
/* get frequencies */
proc freq data = have noprint;
tables group*class / out=tmp_freq;
run;
proc sort data = tmp_freq;
by group count;
run;
data want;
set tmp_freq;
by group count;
if last.group;
run;
And the result is
Group Class Count Percent
1 class1 2 16.6
2 class2 3 25
3 class3 2 16.6
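The PROC SQL route from the question can also be finished in a single step. This is a minimal sketch rather than a benchmarked solution: it assumes the `have` dataset above, and the names `want_sql`, `n`, `h` and `f` are made up for illustration. It relies on PROC SQL remerging summary statistics, so expect a note about remerging in the log.

```sas
proc sql;
  create table want_sql as
  select f.group, f.class, f.n
  from (
    /* count rows per group/class combination */
    select h.group, h.class, count(*) as n
    from have as h
    group by h.group, h.class
  ) as f
  group by f.group
  /* keep only the class(es) whose count equals the group maximum */
  having f.n = max(f.n);
quit;
```

Unlike the `last.group` approach, this keeps every tied class when two classes share the maximum count within a group.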
Edit in response to the question in the comments:
On the final table, Percentage are from the whole data, do you think
we can have it per class ?
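data want2(keep = group class max_count percent_for_group);
  /* process data by group; tmp_freq is already sorted by group and count, */
  /* so the class read last in each group is the most frequent one */
  do until(last.group);
    set tmp_freq;
    by group;
    if count gt max_count then
      max_count = count;
    /* running total of rows in the group */
    sum_count = sum(sum_count, count);
  end;
  percent_for_group = max_count * 100 / sum_count;
run;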
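The same per-group percentage could probably be obtained with PROC SQL as well. The sketch below assumes the `tmp_freq` table produced by the PROC FREQ step above (columns `group`, `class`, `count`); `want2_sql` and the alias `t` are hypothetical names. As with any remerge query, the log will carry a note about remerging summary statistics.

```sas
proc sql;
  create table want2_sql as
  select t.group,
         t.class,
         t.count as max_count,
         /* share of the group's rows held by this class */
         100 * t.count / sum(t.count) as percent_for_group
  from tmp_freq as t
  group by t.group
  /* keep the row(s) with the highest count in each group */
  having t.count = max(t.count);
quit;
```

Tied classes each produce a row here, whereas the data step above keeps exactly one row per group.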

Related

Copy a dataset B for each variable in dataset A

I have the two following datasets:
Dataset A:
ID
A
B
C
Dataset B:
Age
35
49
53
And I want to repeat every row of B for each ID in A:
ID Age
A 35
A 49
A 53
B 35
B 49
...
For the moment I do this with a %do loop, but is there a more elegant way to do this? With a single PROC SQL or data step, for example?
Thanks in advance
You can use SQL to perform a Cartesian product to get all combinations.
For example:
/* setup id data */
data have1;
input id $char1.;
datalines;
A
B
C
;
/* setup age data */
data have2;
input age 8.;
datalines;
35
49
53
;
/* perform Cartesian product */
proc sql noprint;
create table
want
as
select
*
from
have1
,have2
;
quit;
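If the row order of the desired output matters, the same Cartesian product can be written with an explicit CROSS JOIN and an ORDER BY. This is only a sketch on the `have1` and `have2` tables above; `want_sorted` and the aliases `a` and `b` are illustrative names.

```sas
proc sql;
  create table want_sorted as
  select a.id, b.age
  /* CROSS JOIN pairs every id with every age */
  from have1 as a cross join have2 as b
  order by a.id, b.age;
quit;
```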
There is no need for macro code. You can use the POINT= option on the SET statement to do this in a data step.
data want;
set a;
do p=1 to nobs;
set b point=p nobs=nobs;
output;
end;
run;

proc ttest class, default group issue

I would like to compare mean values of two groups using proc ttest, and I successfully did it as below.
proc ttest;
class group;
var score;
run;
But this code treats observations with group = 0 as the default (first) group, so the t-statistic is calculated as Mean(score where group = 0) minus Mean(score where group = 1). I would like to have it the other way around.
It would only change the sign of the t-statistic, but that is exactly what I want.
Is there a way to do this by simply adding an option?
I know I could do it by creating another dummy variable that is the exact opposite of my group variable, but I don't want to create more dummy variables.
ORDER=DATA will tell SAS to order the class variable based on when it encounters the values. So if the 1 values are earlier than the 0 values, it will be first in the comparison.
For example:
data for_ttest;
call streaminit(7);
do group = 0 to 1;
do _n_ = 1 to 50;
score = rand('NORMAL',1,0.5)+group;
output;
end;
end;
run;
proc sort data=for_ttest;
by descending group;
run;
proc ttest data=for_ttest order=data;
class group;
var score;
run;
Without ORDER=DATA, it behaves as you saw, but with it, 1 is the first group.
You could also combine ORDER=FORMATTED with a format.
proc format;
value groupf
1="Group 1 (Value=1)"
0="Group 2 (Value=0)"
;
run;
proc ttest data=for_ttest order=formatted;
class group;
format group groupf.;
var score;
run;
The label text in the PROC FORMAT is irrelevant, except that the labels must sort alphabetically in the order in which you want the groups compared. Unfortunately, the PRELOADFMT option is not available in PROC TTEST, so you can't use the NOTSORTED trick in PROC FORMAT to make this work with the original values (though you can use non-printing characters to manipulate the sort order if you really want to).

Outputting conditionally from merge

I want to update a history file in SAS. I have new observations, which may overlap with existing data lines.
What I need is a file that takes the lines from the new dataset (new_data) where they exist and, where they do not, the lines from the old dataset (old_data). What I've come up with is a clunky merge operation that depends on the order of the datasets (it works only if new_data is listed after old_data).
data new_data;
input key value;
datalines;
1 10
1 11
2 20
2 21
;
run;
data old_data;
input key value;
datalines;
2 50
2 51
3 30
3 31
;
run;
So I'd like to have the following:
key value
1 10
1 11
2 20
2 21
3 30
3 31
However the following does not work. It produces the output below it.
data updated_history;
merge New_data(in=a) old_data(in=b) ;
by key;
if a or (b and not a );
run;
....
2 50
2 51
...
But for some reason this does:
data updated_history;
merge old_data(in=b) New_data(in=a);
by key;
if a or (b and not a );
run;
Question: Is there an intelligent way to control which dataset the values are selected from? Something like: if a then value_from_dataset a;
The order in which you list the datasets in the MERGE statement is the order in which the data is read. So when the order is old, new, the values from old are read first and then the values from new overwrite them. This is why your second version works and the first does not.
Since you have multiple observations per key value, you probably do NOT want to use MERGE to combine these files. You could do it with SET instead, reading the data twice using two DOW loops. In that case the order of the datasets in the SET statement doesn't matter, since the records are interleaved instead of being joined. The first loop works out which of the two input datasets have any observations for the current KEY value; the second loop re-reads the same records and outputs them, skipping the old records whenever the key exists in both datasets.
data want ;
anyold=0;
anynew=0;
do until (last.key);
set old_data (in=inold) new_data(in=innew);
by key ;
if inold then anyold=1;
if innew then anynew=1;
end;
do until (last.key);
set old_data (in=inold) new_data(in=innew);
by key ;
if not (anyold and anynew and inold) then output;
end;
drop anyold anynew;
run;
This type of combination is probably easier to code using SQL.
proc sql ;
create table want as
select key,value from new_data
union
select key,value from old_data
where key in (select key from old_data except select key from new_data)
order by 1
;
quit;
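Note that UNION removes duplicate rows, so identical key/value pairs would collapse into one. If such duplicates need to be kept, a variant along the following lines might work; it is a sketch on the `old_data` and `new_data` tables from the question, and `want_all` is a hypothetical name.

```sas
proc sql;
  create table want_all as
  select key, value from new_data
  /* UNION ALL keeps duplicate rows instead of de-duplicating */
  union all
  select key, value from old_data
    /* take old rows only for keys that have no new data */
    where key not in (select key from new_data)
  order by 1, 2;
quit;
```

Sorting by both columns is a choice made here to keep the output order deterministic; it reorders values within a key, which may or may not be wanted for a history file.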

Is it possible to merge two datasets where a variable's value in the first is used to select a variable in the second?

I would like to know how to merge two datasets in SAS using a variable's value in the first dataset to select and test a variable in the second dataset.
As an example consider two datasets. The first dataset contains four baby names and the days they were born. The second data set contains three doctors and an array of indicator variables noting if each doctor worked on a particular day. For example Dr. Smith worked on days 2 and 3 only. I would like to create a dataset that lists all the possible baby-doctor combinations where the doctor was working on the day the baby was born.
data babies;
input baby_name $ birth_day;
datalines;
Jake 1
Sonny 4
North 5
Apple 6
;
run;
data doctors;
input DrLastname $ day1 day2 day3 day4 day5 day6;
datalines;
Jones 1 0 0 1 1 1
Smith 0 1 1 0 0 0
Lewis 1 1 1 0 0 0
;
run;
The solution seems like it should be something like this
proc sql;
create table merged as
select babies.*, doctors.*
from babies, doctors
where doctors.day(babies.birth_day) = 1; *<--- incorrect;
quit;
The output should be:
baby_name birth_day DrLastName
Jake 1 Jones
Jake 1 Lewis
Sonny 4 Jones
North 5 Jones
Apple 6 Jones
I have run into this problem a few times and would love to know if this kind of merge is possible in SAS. Thanks for any help you can provide.
While I probably would also transpose the dataset, it is possible to do so without transposing.
data babies_doctors;
set babies;
do _i = 1 to nobs_doctors;
set doctors point=_i nobs=nobs_doctors;
array days day1-day6;
if days[birth_Day] then output;
end;
run;
This will not be fast, as it checks all rows in the dataset, but it's possible.
Fastest is probably to load it into a vertical hash table (which you could do easily) or a temporary array.
data babies_doctors_array;
array drnames[32767] $80 _temporary_;
array drdays[32767,6] _temporary_;
if _n_=1 then do;
do _i = 1 to nobs_doctors;
set doctors point=_i nobs=nobs_doctors;
array days day1-day6;
drnames[_i]=DrLastname;
do _j = 1 to dim(days);
drdays[_i,_j]=days[_j];
end;
end;
end;
set babies;
do _k = 1 to nobs_doctors;
if drdays[_k,birth_day]=1 then do;
baby_drlastname = drnames[_k];
output;
end;
end;
run;
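The hash-table alternative mentioned above could be sketched roughly as follows. This is an assumption-laden illustration, not the answerer's code: `doctor_days`, `babies_doctors_hash`, `day` and `rc` are invented names, and the first step reshapes `doctors` into one row per doctor and working day so that the hash can be keyed on the day.

```sas
/* reshape doctors to long form: one row per doctor and working day */
data doctor_days;
  set doctors;
  array days[6] day1-day6;
  do day = 1 to 6;
    if days[day] = 1 then output;
  end;
  keep DrLastname day;
run;

data babies_doctors_hash;
  if _n_ = 1 then do;
    /* never executed; only defines DrLastname and day in the PDV */
    if 0 then set doctor_days;
    declare hash h(dataset: 'doctor_days', multidata: 'y');
    h.defineKey('day');
    h.defineData('DrLastname');
    h.defineDone();
  end;
  set babies;
  day = birth_day;
  /* walk through every doctor stored under this day */
  rc = h.find();
  do while (rc = 0);
    output;
    rc = h.find_next();
  end;
  drop rc day;
run;
```

The `multidata: 'y'` argument is what allows several doctors to be stored under the same day; without it only one doctor per day would be kept.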
I might reshape the second dataset into long form (one row per doctor and working day) and then merge on day.
Something like (untested):
data stacked;
  set doctors;
  array days[6] day1-day6;
  /* writing straight to one dataset replaces splitting by day and stacking */
  do day = 1 to 6;
    if days[day] = 1 then output;
  end;
  keep DrLastname day;
run;
Then simply join on the field day, matching it to birth_day in babies.
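The final matching step might look like the sketch below. It assumes that `stacked` ends up with one row per doctor and working day, with columns `DrLastname` and `day` (an assumption about the reshaped table, not something the answer spells out). A SQL join is used instead of a data step MERGE because several babies and several doctors can share the same day.

```sas
proc sql;
  create table merged as
  select b.baby_name, b.birth_day, s.DrLastname
  from babies as b
       inner join stacked as s
       /* a doctor matches when they worked on the baby's birth day */
       on b.birth_day = s.day
  order by b.birth_day, s.DrLastname;
quit;
```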

SAS Macro - Combining multiple tables into one, controlled by another table

I've come in late to a project and want to write a macro that normalises some data for export to a SQL Server.
There are two control tables...
- Table 1 (customers) has a list of customer unique identifiers
- Table 2 (hierarchy) has a list of table names
There are then n additional tables, one for each record in (hierarchy), named in the SourceTableName field, each with the form of...
- CustomerURN, Value1, Value2
I want to combine all of these tables into a single table (sample_results), with the form of...
- SourceTableName, CustomerURN, Value1, Value2
The only records that should be copied, however, should be for CustomerURNs that exist in the (customers) table.
I could do this in a hard coded format using proc sql, something like...
proc sql;
insert into
SAMPLE_RESULTS
select
'TABLE1',
data.*
from
Table1 data
INNER JOIN
customers
ON data.CustomerURN = customers.CustomerURN
<repeat for every table>
But every week new records are added to the hierarchy table.
Is there any way to write a loop that picks up the table name from the hierarchy table, then calls the proc sql to copy the data into sample_results?
You could concatenate all the tables listed in hierarchy together, and then do a single SQL join:
proc sql ;
drop table all_hier_tables ;
quit ;
%MACRO FLAG_APPEND(DSN) ;
/* Create new var with tablename */
data &DSN._b ;
length SourceTableName $32. ;
SourceTableName = "&DSN" ;
set &DSN ;
run ;
/* Append to master */
proc append data=&DSN._b base=all_hier_tables force ;
run ;
%MEND ;
/* Append all hierarchy tables together */
data _null_ ;
set hierarchy ;
code = cats('%FLAG_APPEND(' , SourceTableName , ');') ;
call execute(code); /* run the macro */
run ;
/* Now merge in... */
proc sql;
insert into
SAMPLE_RESULTS
select
data.*
from
all_hier_tables data
INNER JOIN
customers
ON data.CustomerURN = customers.CustomerURN ;
quit;
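One thing to watch with CALL EXECUTE: a macro call in its argument is executed by the macro processor as soon as CALL EXECUTE runs, while the DATA and PROC steps the macro generates only run after the current data step has finished. For a macro as simple as FLAG_APPEND this is usually harmless, but a common precaution is to wrap the call in %NRSTR so that the macro itself is also deferred. A sketch of that variant, reusing the `hierarchy` table and `FLAG_APPEND` macro above:

```sas
data _null_ ;
  set hierarchy ;
  /* %nrstr delays execution of the macro until after this data step ends */
  call execute(cats('%nrstr(%FLAG_APPEND(', SourceTableName, '))')) ;
run ;
```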
Another way is to create a view, so that it always reflects the latest data in the metadata tables. The CALL EXECUTE routine is used to read the table names from the hierarchy dataset. Here is an example that you should be able to modify to suit your data; the last step is the part relevant to you.
data class1 class2 class3;
set sashelp.class;
run;
data hierarchy;
input table_name $;
cards;
class1
class2
class3
;
run;
data ages;
input age;
cards;
11
13
15
;
run;
data _null_;
set hierarchy end=last;
if _n_=1 then call execute('proc sql; create view sample_results_view as ' );
if not last then call execute('select * from '||trim(table_name)||' where age in (select age from ages) union all ');
if last then call execute('select * from '||trim(table_name)||' where age in (select age from ages); quit;');
run;