Merging datasets only if they exist

I'm writing a macro in SAS that merges multiple data sets in one data step. The macro also creates a variety of data sets dynamically, so I don't know in advance which ones will exist. I want to merge four data sets in one data step, including only the ones that exist and skipping the ones that don't.
I haven't really tried anything yet, but what I'm after looks something like this:
DATA Something;
MERGE Something SomethingElse AnotherThing EXIST(YetAnotherThing)*YetAnotherThing;
RUN;
Obviously that doesn't work, because SAS doesn't work like that, but I want something along those lines: YetAnotherThing is one of the data sets whose existence I'm testing, and it should be merged into Something only if it exists.

If you have a systematic naming convention this can be simplified. For example if you have a common prefix it becomes:
data want;
merge prefix: ;
run;
If they're all in the same library it's also easy. But otherwise you're stuck checking every single name as above.
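For the same-library case, one way to avoid checking names one by one is to ask DICTIONARY.TABLES which candidates exist and build the MERGE list from the result. This is a minimal sketch, not a definitive implementation: the dataset names part_a, part_b and part_c, the BY variable i, and the macro variable merge_list are all made up for illustration.

```sas
/* Hypothetical sample tables; PART_C is deliberately never created. */
data part_a; do i = 1 to 3; a = i;     output; end; run;
data part_b; do i = 1 to 3; b = i * 2; output; end; run;

/* Collect only the candidate names that actually exist in WORK. */
proc sql noprint;
    select memname
        into :merge_list separated by ' '
        from dictionary.tables
        where libname = 'WORK'
          and memtype = 'DATA'
          and memname in ('PART_A', 'PART_B', 'PART_C')
        order by memname;
quit;

/* merge_list is only assigned when at least one table was found. */
data want;
    merge &merge_list;
    by i;
run;
```

If none of the candidates exist, the SELECT returns no rows and merge_list is left unassigned, so a production macro would need to guard the final data step.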

Something along these lines:
data test1;
do i = 1 to 10;
val1 = i;
output;
end;
run;
data test2;
do i = 1 to 10;
val2 = i*2;
output;
end;
run;
data test3;
do i = 1 to 10;
val3 = i*3;
output;
end;
run;
data test5;
do i = 1 to 10;
val5 = i*4;
output;
end;
run;
%macro multi_merge(varlist);
%local j;
data test_merge;
set %scan(&varlist,1);
run;
%put %sysfunc(countw(&varlist));
%if %sysfunc(countw(&varlist)) > 1 %then %do;
%do j = 2 %to %sysfunc(countw(&varlist));
%if %sysfunc(exist(%scan(&varlist,&j))) %then %do;
data test_merge;
merge test_merge %scan(&varlist,&j);
by i;
run;
%end;
%end;
%end;
%mend;
%multi_merge(test1 test2 test3 test4 test5);
Test4 does not exist, so it is skipped.
If you don't want to loop, you can do the same thing like this:
%macro if_exists(varlist);
%if %sysfunc(exist(%scan(&varlist,1))) %then %scan(&varlist,1);
%mend;
data test_merge2;
merge test1
%if_exists(test2)
%if_exists(test3)
%if_exists(test4)
%if_exists(test5)
%if_exists(test6);
by i;
run;

I can think of two options:
1. Loop through the list of input datasets, check if each exists, then merge only those that do.
2. At the start of your macro, before you conditionally create each of the potential input datasets, create a dummy dataset with the same name containing no rows or columns. Then when you attempt to merge them, they will always exist, without messing up the output with lots of missing values.
Sample code for creating an empty dataset:
data want;
stop;
run;
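One caveat with the placeholder idea: if the merge uses a BY statement, every input must contain the BY variable, so a placeholder with no columns at all would raise an error. The sketch below assumes a numeric key called i; the names real_data, maybe_missing and merged are hypothetical.

```sas
/* Hypothetical dataset that is always created. */
data real_data;
    do i = 1 to 3;
        val = i * 10;
        output;
    end;
run;

/* Placeholder for a dataset that may never be built: */
/* zero rows, but it carries the BY variable i.       */
data maybe_missing;
    length i 8;
    stop;
run;

/* The merge can now list both names unconditionally. */
data merged;
    merge real_data maybe_missing;
    by i;
run;
```

The placeholder step writes an "uninitialized variable" note to the log, which is harmless here. If the real dataset is created later under the same name, it simply replaces the placeholder.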

Related

SAS Macro code error with length statement

I am trying to change the length of the variables based on a list that I have. The code seems to run, but the desired output is not achieved. Here is the code:
%macro LEN();
Proc sql ;
select count(name) into: varnum from variab;
select name into: varname1-:varname%trim(%left(&varnum)) from Variab;
select length3 into: len from Length;
Quit;
%do i=1 %to &varnum;
data Zero;
length &&varname&i $ &&len&i.;
set desti.test;
length _numeric_ 4.;
format _numeric_ 12.2;
run;
%end;
%mend;
It gives a warning
WARNING: Multiple lengths were specified for the variable fscadl1 by
input data set(s). This can cause truncation
of data.
and it doesn't change the length of the variable. What is wrong with this code?
Are you trying to change a list of variables in one dataset? You're repeating the entire data step for each iteration, but only writing to a constant destination, which is inconsistent.
Probably what you want is:
Proc sql ;
select count(name) into: varnum from variab;
select name into: varname1-:varname%trim(%left(&varnum)) from Variab;
select length3 into: len from Length;
Quit;
%macro set_len(varnum=);
%do i=1 %to &varnum;
length &&varname&i $ &&len&i.;
%end;
%mend;
data Zero;
%set_len(varnum=&varnum)
set desti.test;
length _numeric_ 4.;
format _numeric_ 12.2;
run;
Note that you'd need to define &&len&i as you're not doing that currently.
The warning message suggests that it is working: SAS issues that warning when you truncate a variable. You can suppress the warning with the VARLENCHK option.
Below works:
options varlenchk=nowarn;
data want;
length name $ 3;
set sashelp.class;
length _numeric_ 4;
run;
If your code isn't working, I would turn on MPRINT to make sure your macro is generating the SAS code you expect.
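As a rough illustration of what MPRINT gives you, the sketch below wraps a single LENGTH statement in a macro and runs it with the option switched on. The macro name show_len and its parameters are invented for this example.

```sas
options mprint;   /* write the statements each macro generates to the log */

/* Hypothetical macro that emits a single LENGTH statement. */
%macro show_len(var=, len=);
    length &var $ &len;
%mend show_len;

data want;
    %show_len(var=name, len=3)
    set sashelp.class;
run;

options nomprint; /* switch the echo back off */
```

The log then contains a line beginning with MPRINT(SHOW_LEN): followed by the generated statement, which makes it easy to spot a macro variable that resolved to something unexpected. The truncation warning will still appear unless VARLENCHK=NOWARN is also set.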

recode and add prefix to sas variables

Let's say I have a bunch of variables named the same way and I'd like to recode them and add a prefix to each (the variables are all numeric).
In Stata I would do something like (let's say the variables start with eq)
foreach var of varlist eq* {
recode `var' (1/4=1) (else=0), pre(r_)
}
How can I do this in SAS? I'd like to use the %DO macros, but I'm not familiar with them (I want to avoid SQL). I'd appreciate if you could include comments explaining each step!
SAS syntax for this would be easier if your variables are named using numeric suffix. That is, if you had ten variables with names of eq1, eq2, .... , eq10, then you could just use variable lists to define both sets of variables.
There are a number of ways to translate your recode logic. If we assume you have clean variables then we can just use a boolean expression to generate a 0/1 result. So if 4 and 5 map to 1 and the rest map to 0, you could use x in (4,5) or x > 3 as the boolean expression.
data want;
set have;
array old eq1-eq10 ;
array new r_eq1-r_eq10 ;
do i=1 to dim(old);
new(i) = old(i) in (4,5);
end;
run;
If you have missing values or other complications you might want to use IF/THEN logic or a SELECT statement or you could define a format you could use to convert the values.
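A minimal sketch of the IF/THEN variant, using the question's rule (1 through 4 becomes 1, everything else 0) on three made-up variables. Keeping missing values missing is an assumption of this sketch rather than something the question asked for.

```sas
/* Hypothetical sample data with one missing value. */
data have;
    input eq1 eq2 eq3;
    datalines;
1 4 5
2 . 3
5 5 1
;
run;

data want;
    set have;
    array old eq1-eq3;
    array new r_eq1-r_eq3;
    do i = 1 to dim(old);
        if missing(old(i)) then new(i) = .;        /* leave missing as missing */
        else if 1 <= old(i) <= 4 then new(i) = 1;  /* the question's rule: 1/4 = 1 */
        else new(i) = 0;                           /* everything else */
    end;
    drop i;
run;
```

If missing values should be recoded to 0 as well, dropping the first branch is enough.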
If your list of names is more random then you might need to use some code generation, such as macro code, to generate the new variable names.
Here is one method that uses the eq: variable list syntax in SAS, which is similar to the syntax of your variable selection before. Use PROC TRANSPOSE on an empty (obs=0) version of your source dataset to get a dataset with the variable names that match your name pattern.
proc transpose data=have(obs=0) out=names;
var eq: ;
run;
Then generate two macro variables with the list of old and new names.
proc sql noprint ;
select _name_
, cats('r_',_name_)
into :old_list separated by ' '
, :new_list separated by ' '
from names
;
quit;
You can then use the two macro variables in your ARRAY statements.
array old &old_list ;
array new &new_list ;
You can do this with a macro that looks up the matching variable names and generates the recode statements. Note that the following only recodes the col variables, and not the other one:
data have;
col1=1;
col2=2;
col3=3;
col5=5;
other=99;
col12=12;
run;
%macro recoder(dsn = , varname = , prefix = );
/*select all variables that include the string "varname"*/
/*(you can change this if you want to be more specific on the conditions that need to be met to be renamed)*/
proc sql noprint;
select distinct name into: varnames
separated by " "
from dictionary.columns where memname = upcase("&dsn.") and index(name, "&varname.") > 0;
quit;
data want;
set &dsn.;
/*loop through that list of variables to recode*/
%do i = 1 %to %sysfunc(countw(&varnames.));
%let this_varname = %scan(&varnames., &i.);
/*create a new variable with desired prefix based on value of old variable*/
if &this_varname. in (1 2 3) then &prefix.&this_varname. = 0;
else if &this_varname. in (4 5) then &prefix.&this_varname. = 1;
%end;
run;
%mend recoder;
%recoder(dsn = have, varname = col, prefix = r_);
PROC TRANSPOSE will give you good flexibility with regards to the way your variables are named.
proc transpose data=have(obs=0) out=vars;
var col1-numeric-col12;
copy col1;
run;
proc transpose data=vars out=revars(drop=_:) prefix=RE_;
id _name_;
run;
data recode;
set have;
if 0 then set revars;
array c[*] col1-numeric-col12;
array r[*] re_:;
call missing(of r[*]);
do _n_ = 1 to dim(c);
if c[_n_] in(1 2 3) then r[_n_] = 0;
else if c[_n_] in(4 5) then r[_n_] = 1;
else r[_n_] = c[_n_];
end;
run;
proc print;
run;
It would be nearly trivial to write a macro to parse almost that exact syntax.
I wouldn't necessarily use this - I like both the transpose and the array methods better, both are more 'SASsy' (think 'pythonic' but for SAS) - but this is more or less exactly what you're doing above.
First set up a dataset:
data class;
set sashelp.class;
age_ly = age-1;
age_ny = age+1;
run;
Then the macro:
%macro do_count(data=, out=, prefix=, condition=, recode=, else=, var_start=);
%local dsid varcount varname rc; *declare local for safety;
%let dsid = %sysfunc(open(&data.,i)); *open the dataset;
%let varcount = %sysfunc(attrn(&dsid,nvars)); *get the count of variables to access;
data &out.; *now start the main data step;
set &data.; *set the original data set;
%do i = 1 %to &varcount; *iterate over the variables;
%let varname= %sysfunc(varname(&dsid.,&i.)); *determine the variable name;
%if %upcase(%substr(&varname.,1,%length(&var_start.))) = %upcase(&var_start.) %then %do; *if it matches your pattern then recode it;
&prefix.&varname. = ifn(&varname. &condition., &recode., &else.); *this uses IFN - only recodes numerics. More complicated code would work if this could be character.;
%end;
%end;
%let rc = %sysfunc(close(&dsid)); *clean up after yourself;
run;
%mend do_count;
%do_count(data=class, out=class_r, var_start=age, condition= > 14, recode=1, else=0, prefix=p_);
The expression (1/4=1) means values {1,2,3,4} should be recoded into 1.
Perhaps you do not need to make new variables at all? If you have variables with values 1,2,3,4,5 and you want to treat them as if they have only two groups, you could do it with a format.
First define your grouping using a format.
proc format ;
value newgrp 1-4='Group 1' 5='Group 2' ;
run;
Then you can just use a FORMAT statement in your analysis step to have SAS treat your five-level variable as if it had only two levels.
proc freq ;
tables eq: ;
format eq: NEWGRP. ;
run;

Bootstrap macro in SAS

I've started to learn macros in SAS and am now trying to implement a simple bootstrap with a histogram as output.
/*Create K data sets(vectors)*/
%macro datasets(K);
%do i=1 %to &K;
data indata&i;
%do j = 1 %to 50;
x=(rand('normal',2,9));
output;
%end;
run;
%end;
%mend datasets;
%datasets(3);
/*Bootstrap and hist*/
%macro boot (data,res);
%do i=1 %to &res;
%let x = (sample(&data,50));
%let m = (mean(&x));
%end;
proc iml;
read &m into A;
create DataM from A;
append from A;
close Data1;
quit;
proc univariate data=Data1;
histogram m;
run;
%mend boot;
%boot(Indata1,100);
It doesn't work and I can't understand why. Can you point out the mistake?
Use PROC SURVEYSELECT to generate bootstrap samples then do bootstrap analysis by Replication (a variable created by SURVEYSELECT). Your macro idea will be far too slow.
As mentioned, use Proc SurveySelect and Proc Means. You can select all 100 samples in one Proc SurveySelect and then apply Proc Means with a BY statement to calculate the means in one step. Macros don't add anything to the solution here.
I'm posting both solutions - the macro solution does take longer as well.
*Without macro;
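/* method=urs samples with replacement, as a bootstrap requires; outhits writes one row per draw */
proc surveyselect data=indata1 out=rsample method=urs n=50 reps=100 outhits;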
run;
proc means data=rsample noprint;
by replicate;
var x;
output out=Data1 mean(x)=m;
run;
proc univariate data=Data1;
histogram m;
run;
*Macro solution;
%macro boot(data, res);
%do i=1 %to &res;
%*Offset the seed by the loop counter so each pass draws a different sample;
proc surveyselect data=&data out=x method=urs n=50 reps=1 outhits seed=%eval(343434 + &i);
run;
proc means data=x noprint;
var x;
output out=m mean(x)=m;
run;
proc append base=DataM data=m;
run;
%end;
%mend;
%boot(Indata1,10);
Perhaps it will help if we outline some of the ways that the posted macro code does NOT work. If nothing else then as examples of things to avoid.
In the first macro, %datasets(), you are using a macro %DO loop where you should use a normal data step DO loop. Also make sure to define your local macro variables as local. This will prevent the macro from modifying the value of an existing macro variable with the same name.
/*Create K data sets(vectors)*/
%macro datasets(K);
%local i ;
%do i=1 %to &K;
data indata&i;
do j = 1 to 50;
x=(rand('normal',2,9));
output;
end;
drop j;
run;
%end;
%mend datasets;
In the second macro you have a %DO loop that does nothing.
%do i=1 %to &res;
%let x = (sample(&data,50));
%let m = (mean(&x));
%end;
You repeat the exact same %LET statements multiple times. The result does not change since the loop variable i is not referenced at all. If you called the macro with data=indata1 then the result of the two statements will be that X=(sample(indata1,50)) and that M=(mean((sample(indata1,50)))). I think that perhaps you intended that the strings sample and mean might take some action, but since they have no macro triggers (& or %) they are just streams of characters to the macro processor.
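The difference between plain text and a macro trigger can be seen in a few lines of open code. This is only an illustrative sketch; the macro variable n and the COUNTW call are additions for contrast and are not part of the posted macro.

```sas
%let data = indata1;

/* No macro trigger around SAMPLE: the text is stored as-is, nothing is executed. */
%let x = (sample(&data,50));
%put x=&x;

/* %SYSFUNC is a macro trigger, so COUNTW really runs here. */
%let n = %sysfunc(countw(a b c));
%put n=&n;
```

The first %PUT writes x=(sample(indata1,50)) to the log, while the second writes n=3, because only the second assignment asked the macro processor to call a function.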
I am not an expert on IML, but those statements also do not look like they are doing much.

SAS Macro Proc Logistic put P-value in a dataset

I've googled lots of papers on the subject but can't seem to find what I want. I'm a beginner at SAS macros, hoping to get some help here.
Here is what I want:
I have a dataset with 1200 variables. I want a macro to run each of the other 1199 variables as the OUTCOME and store the P-values of the logistic regressions in a dataset. The independent variable "gender" is character, and so are the outcome variables, but I don't know how to put a class statement in the macro. Here is an example of how I run it as a single procedure.
proc logistic data=Baseline_gender ;
class gender(ref="Male") / param=ref;
model N284(event='1')=gender ;
ods output ParameterEstimates=ok;
run;
My idea was to create ODS output and delete the unnecessary variables other than the P-value and merge them into one dataset according to the OUTCOME variable names in the model: e.g.
Variable P-value
A1 0.005
A2 0.018
.. ....
I tried to play with some macro code but I just can't get it to work!
I really need help on this, Thank you very much.
SRSwift might be onto something (don't know enough about his method to tell), but here's a way to do it using a macro.
First, count the number of variables in your dataset. Do this by selecting your table from the dictionary.columns table. This puts the number of variables into &sqlobs. Now read the variable names from the dictionary table into macro variables var1-var&sqlobs.
%macro logitall;
proc sql;
create table count as
select name from dictionary.columns
where upcase(libname) = 'WORK'
and upcase(memname) = 'BASELINE_GENDER'
and upcase(name) ne 'GENDER'
;
select name into :var1 - :var&sqlobs
from dictionary.columns
where upcase(libname) = 'WORK'
and upcase(memname) = 'BASELINE_GENDER'
and upcase(name) ne 'GENDER'
;
quit;
/* Then run proc logistic for each dependent variable, each time outputting a dataset named after the dependent variable. */
%do I = 1 %to &sqlobs;
proc logistic data=Baseline_gender ;
class gender(ref="Male") / param=ref;
model &&var&I.(event='1')=gender ;
ods output ParameterEstimates=&&var&I.;
run;
%end;
/* Now put all the output datasets together, creating a new variable with the dataset name using indsname= in the set statement. */
data allvars;
format indsname dsname varname $25.;
set
%do I = 1 %to &sqlobs;
&&var&I.
%end;
indsname=dsname;
varname=dsname;
keep varname ProbChiSq;
where variable ne 'Intercept';
run;
%mend logitall;
%logitall;
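The &sqlobs idea used in this macro can be tried in isolation. The sketch below reads the column names of SASHELP.CLASS into a numbered series of macro variables; the prefix col and the upper bound of 999 are arbitrary choices for the example, and PROC SQL only creates as many variables as there are rows.

```sas
proc sql noprint;
    /* One macro variable per column name of SASHELP.CLASS. */
    select name
        into :col1 - :col999
        from dictionary.columns
        where libname = 'SASHELP'
          and memname = 'CLASS';
quit;

/* SQLOBS should hold the number of rows the last SELECT stored. */
%put Stored &sqlobs names, starting with &col1;
```

Because the range is open-ended in practice, a separate counting query is not strictly needed before the INTO clause, although counting first as in the macro above works too.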
Here is a macro-free approach. It restructures the data in advance and uses SAS's BY-group processing. The data is stored in a deep format where all the outcome variable values are stored in one new variable.
Create some sample data:
data have;
input
outcome1
outcome2
outcome3
gender $;
datalines;
1 1 1 Male
0 1 1 Male
1 0 1 Female
0 1 0 Male
1 1 0 Female
0 0 0 Female
;
run;
Next transpose the data into a deep format using an array:
data trans;
set have;
/* Create an array of all the outcome variables */
array o{*} outcome:;
/* Loop over the outcome variables */
do i = 1 to dim(o);
/* Store the variable name for grouping */
_NAME_ = vname(o[i]);
/* Store the outcome value in the new variable */
outcome = o[i];
output;
end;
keep _NAME_ outcome gender;
run;
proc sort data = trans;
by _NAME_;
run;
Reusing your logistic procedure but with an additional by statement:
proc logistic data = trans;
/* Use the grouping variable to select multiple analyses */
by _NAME_;
class gender(ref = "Male");
/* Use the new variable for the dependent variable */
model outcome = gender / noint;
ods output ParameterEstimates = ok;
run;
Here is another way to do it using a macro. First define all the variables to be used as outcomes in a global macro variable and then write the macro.
%let var = var1 var2 var3 ..... var1199;
%macro log_regression;
%do i=1 %to %sysfunc(countw(&var., %str( )));
%let outcome_var = %scan(&var, &i);
%put &outcome_var.;
proc logistic data = baseline_gender desc;
class gender (ref = "Male") / param = ref;
model &outcome_var. = gender;
ods output ParameterEstimates = ParEst_&outcome_var.;
run;
%if %sysfunc(exist(univar_result)) %then %do;
data univar_result;
set univar_result ParEst_&outcome_var.;
run;
%end;
%else %do;
data univar_result;
set ParEst_&outcome_var.;
run;
%end;
%end;
%mend;

Split large SAS dataset into smaller datasets

I need some assistance with splitting a large SAS dataset into smaller datasets.
Each month I'll have a dataset containing a few million records. This number will vary from month to month. I need to split this dataset into multiple smaller datasets containing 250,000 records each. For example, if I have 1,050,000 records in the original dataset then I need the end result to be 4 datasets containing 250,000 records and 1 dataset containing 50,000 records.
From what I've been looking at it appears this will require using macros. Unfortunately I'm fairly new to SAS (unfamiliar with using macros) and don't have a lot of time to accomplish this. Any help would be greatly appreciated.
Building on Joe's answer, maybe you could try something like this :
%MACRO SPLIT(DATASET);
%LOCAL DATASET_ID NOBS NB_DATASETS RC I;
%LET DATASET_ID = %SYSFUNC(OPEN(&DATASET.));
%LET NOBS = %SYSFUNC(ATTRN(&DATASET_ID., NLOBS));
%LET RC = %SYSFUNC(CLOSE(&DATASET_ID.));
%LET NB_DATASETS = %SYSEVALF(&NOBS. / 250000, CEIL);
DATA
%DO I=1 %TO &NB_DATASETS.;
WANT&I.
%END;;
SET &DATASET.;
%DO I=1 %TO &NB_DATASETS.;
%IF &I. > 1 %THEN %DO; ELSE %END; IF _N_ LE 2.5E5 * &I. THEN OUTPUT WANT&I.;
%END;
RUN;
%MEND SPLIT;
You can do it without macros at all, if you don't mind asking for datasets that may not exist, and have a reasonable bound on things.
data want1 want2 want3 want4 want5 want6 want7 want8 want9;
set have;
if _n_ le 2.5e5 then output want1;
else if _n_ le 5e5 then output want2;
else if _n_ le 7.5e5 then output want3;
... etc....
run;
Macros would make that more efficient to program and cleaner to read, but wouldn't change how it actually runs in reality.
You can do it without macros, using CALL EXECUTE(). It builds SAS code as text strings and runs that code after the current, "manually written" data step has finished.
data _null_;
if 0 then set have nobs=n;
do i=1 to ceil(n/250000);
call execute (cats("data want",i)||";");
call execute ("set have(firstobs="||(i-1)*250000+1||" obs="||i*250000||");");
call execute ("run;");
end;
run;
The first result on Google is from the SAS User Group International (SUGI)
These folks are your friends.
The article is here:
http://www2.sas.com/proceedings/sugi27/p083-27.pdf
The code is:
%macro split(ndsn=2);
data %do i = 1 %to &ndsn.; dsn&i. %end; ;
retain x;
set orig nobs=nobs;
if _n_ eq 1
then do;
if mod(nobs,&ndsn.) eq 0
then x=int(nobs/&ndsn.);
else x=int(nobs/&ndsn.)+1;
end;
if _n_ le x then output dsn1;
%do i = 2 %to &ndsn.;
else if _n_ le (&i.*x)
then output dsn&i.;
%end;
run;
%mend split;
%split(ndsn=10);
All you need to do is replace the 10 in "%split(ndsn=10);" with the number you require.
In Line 4, "set orig nobs=nobs;", simply replace orig with your dataset name.
Hey presto!
A more efficient option, if you have room in memory to store one of the smaller datasets, is a hash solution. Here's an example using basically what you're describing in the question:
data in_data;
do recid = 1 to 1.000001e7;
datavar = 1;
output;
end;
run;
data _null_;
if 0 then set in_data;
declare hash h_out();
h_out.defineKey('_n_');
h_out.defineData('recid','datavar');
h_out.defineDone();
do filenum = 1 by 1 until (eof);
do _n_ = 1 to 250000 until (eof);
set in_data end=eof;
h_out.add();
end;
h_out.output(dataset:cats('file_',filenum));
h_out.clear();
end;
stop;
run;
We define a hash object with the appropriate parameters, and simply tell it to output every 250k records, and clear it. We could do a hash-of-hashes here also, particularly if it weren't just "Every 250k records" but some other criteria drove things, but then you'd have to fit all of the records in memory, not just 250k of them.
Note also that we could do this without specifying the variables explicitly, but it requires having a useful ID on the dataset:
data _null_;
if 0 then set in_data;
declare hash h_out(dataset:'in_data(obs=0)');
h_out.defineKey('recid');
h_out.defineData(all:'y');
h_out.defineDone();
do filenum = 1 by 1 until (eof);
do _n_ = 1 to 250000 until (eof);
set in_data end=eof;
h_out.add();
end;
h_out.output(dataset:cats('file_',filenum));
h_out.clear();
end;
stop;
run;
Since we can't use _n_ anymore for the hash ID due to using the dataset option on the constructor (necessary for the all:'y' functionality), we have to have a record ID. Hopefully there is such a variable, or one could be added with a view.
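If the data has no usable ID, a DATA step view is one way to add it without copying the data. This is a sketch under assumptions: the names raw_data and raw_data_v are hypothetical, and the ID is simply the row number.

```sas
/* Hypothetical input with no ID column. */
data raw_data;
    do k = 1 to 10;
        datavar = k * 2;
        output;
    end;
    drop k;
run;

/* View that adds a running record ID; rows are read only when the view is used. */
data raw_data_v / view=raw_data_v;
    set raw_data;
    recid = _n_;
run;
```

The hash step should then be able to read raw_data_v wherever it currently reads in_data, with recid as the key.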
Here is a basic approach. This requires manual adjustment of the intervals, but is easy to understand.
* split data;
data output1;
set df;
if 1 <= _N_ < 5 then output;
run;
data output2;
set df;
if 5 <= _N_ < 10 then output;
run;
data output3;
set df;
if 10 <= _N_ < 15 then output;
run;
data output4;
set df;
if 15 <= _N_ < 22 then output;
run;