Split large SAS dataset into smaller datasets - macros

I need some assistance with splitting a large SAS dataset into smaller datasets.
Each month I'll have a dataset containing a few million records. This number will vary from month to month. I need to split this dataset into multiple smaller datasets containing 250,000 records each. For example, if I have 1,050,000 records in the original dataset then I need the end result to be 4 datasets containing 250,000 records and 1 dataset containing 50,000 records.
From what I've been looking at it appears this will require using macros. Unfortunately I'm fairly new to SAS (unfamiliar with using macros) and don't have a lot of time to accomplish this. Any help would be greatly appreciated.

Building on Joe's answer, maybe you could try something like this :
%MACRO SPLIT(DATASET);
%LET DATASET_ID = %SYSFUNC(OPEN(&DATASET.));
%LET NOBS = %SYSFUNC(ATTRN(&DATASET__ID., NLOBS));
%LET NB_DATASETS = %SYSEVALF(&NOBS. / 250000, CEIL);
DATA
%DO I=1 %TO &NB_DATASETS.;
WANT&I.
%END;;
SET WANT;
%DO I=1 %TO &NB_DATASETS.;
%IF &I. > 1 %THEN %DO; ELSE %END; IF _N_ LE 2.5E5 * &I. THEN OUTPUT WANT&I.;
%END;
RUN;
%MEND SPLIT;

You can do it without macros at all, if you don't mind asking for datasets that may not exist, and have a reasonable bound on things.
data want1 want2 want3 want4 want5 want6 want7 want8 want9;
if _n_ le 2.5e5 then output want1;
else if _n_ le 5e5 then output want2;
else if _n_ le 7.5e5 then output want3;
... etc....
run;
Macros would make that more efficient to program and cleaner to read, but wouldn't change how it actually runs in reality.

You can do it without macros, using CALL EXECUTE(). It creates SAS-code as text strings and then executes it, after your "manually written" code completed.
data _null_;
if 0 then set have nobs=n;
do i=1 to ceil(n/250000);
call execute (cats("data want",i)||";");
call execute ("set have(firstobs="||(i-1)*250000+1||" obs="||i*250000||");");
call execute ("run;");
end;
run;

The first result on Google is from the SAS User Group International (SUGI)
These folks are your friends.
The article is here:
http://www2.sas.com/proceedings/sugi27/p083-27.pdf
The code is:
%macro split(ndsn=2);
data %do i = 1 %to &ndsn.; dsn&i. %end; ;
retain x;
set orig nobs=nobs;
if _n_ eq 1
then do;
if mod(nobs,&ndsn.) eq 0
then x=int(nobs/&ndsn.);
else x=int(nobs/&ndsn.)+1;
end;
if _n_ le x then output dsn1;
%do i = 2 %to &ndsn.;
else if _n_ le (&i.*x)
then output dsn&i.;
%end;
run;
%mend split;
%split(ndsn=10);
All you need to do is replace the 10 digit in "%split(ndsn=10);" with the number you require.
In Line 4, "set orig nobs=nobs;", simply replace orig with your dataset name.
Hey presto!

A more efficient option, if you have room in memory to store one of the smaller datasets, is a hash solution. Here's an example using basically what you're describing in the question:
data in_data;
do recid = 1 to 1.000001e7;
datavar = 1;
output;
end;
run;
data _null_;
if 0 then set in_data;
declare hash h_out();
h_out.defineKey('_n_');
h_out.defineData('recid','datavar');
h_out.defineDone();
do filenum = 1 by 1 until (eof);
do _n_ = 1 to 250000 until (eof);
set in_data end=eof;
h_out.add();
end;
h_out.output(dataset:cats('file_',filenum));
h_out.clear();
end;
stop;
run;
We define a hash object with the appropriate parameters, and simply tell it to output every 250k records, and clear it. We could do a hash-of-hashes here also, particularly if it weren't just "Every 250k records" but some other criteria drove things, but then you'd have to fit all of the records in memory, not just 250k of them.
Note also that we could do this without specifying the variables explicitly, but it requires having a useful ID on the dataset:
data _null_;
if 0 then set in_data;
declare hash h_out(dataset:'in_data(obs=0)');
h_out.defineKey('recid');
h_out.defineData(all:'y');
h_out.defineDone();
do filenum = 1 by 1 until (eof);
do _n_ = 1 to 250000 until (eof);
set in_data end=eof;
h_out.add();
end;
h_out.output(dataset:cats('file_',filenum));
h_out.clear();
end;
stop;
run;
Since we can't use _n_ anymore for the hash ID due to using the dataset option on the constructor (necessary for the all:'y' functionality), we have to have a record ID. Hopefully there is such a variable, or one could be added with a view.

Here is a basic approach. This requires manual adjustment of the intervals, but is easy to understand.
* split data;
data output1;
set df;
if 1 <= _N_ < 5 then output;
run;
data output2;
set df;
if 5 <= _N_ < 10 then output;
run;
data output3;
set df;
if 10 <= _N_ < 15 then output;
run;
data output4;
set df;
if 15 <= _N_ < 22 then output;
run;

Related

How to pad a number with leading zero in a SAS Macro loop counter?

So I have a range of datasets in a specific library. These datasets are named in the format DATASET_YYYYMM, with one dataset for each month. I am trying to append a range of these datasets based on user input for the date range. i.e. If start_date is 01NOV2019 and the end_date is 31JAN2020, I want to append the three datasets: LIBRARY.DATASET_201911, LIBRARY.DATASET_201912 and LIBRARY.DATASET_202001.
The range is obviously variable, so I can't simply name the datasets manually in a set function. Since I need to loop through the years and months in the date range, I believe a macro is the best way to do this. I'm using a loop within the SET statement to append all the datasets. I have copied my example code below. It does work in theory. But in practice, only if we are looping over the months of November and December. As the format of the dataset name has a two digit month, for Jan-Sept it will be 01-09. The month function returns 1-9 however, and of course a 'File DATASET_NAME does not exist' error is thrown.
Problem is I cannot figure out a way to get it to interpret the month with leading 0, without ruining functionality of another part of the loop/macro.
I have tried numerous approaches to format the number as z2, cannot get any to work.
i.e. Including PUTN functions in the DO line for quote_month as follows, it ignores the leading zero when generating the dataset name in the line below.
%DO quote_month = %SYSFUNC(IFN(&quote_year. = &start_year.,%SYSFUNC(PUTN(&start_month.,z2.)),1,.)) %TO %SYSFUNC(IFN(&quote_year. = &end_year.,%SYSFUNC(PUTN(&end_month.,z2.)),12,.));
Below is example code (without any attempt to reformat it to z2) - it will throw an error because it cannot find 'dataset_20201' because it is actually called 'dataset_202001'. The dataset called dataset_combined_example produces the desired output of the code by manually referencing the dataset names which it will be unable to do in practice. Does anyone know how to go about this?
DATA _NULL_;
FORMAT start_date end_date DATE9.;
start_date = '01NOV2019'd;
end_date = '31JAN2020'd;
CALL symput('start_date',start_date);
CALL symput('end_date',end_date);
RUN;
DATA dataset_201911;
input name $;
datalines;
Nov1
Nov2
;
RUN;
DATA dataset_201912;
input name $;
datalines;
Dec1
Dec2
;
RUN;
DATA dataset_202001;
input name $;
datalines;
Jan1
Jan2
;
RUN;
DATA dataset_combined_example;
SET dataset_201911 dataset_201912 dataset_202001;
RUN;
%MACRO get_table(start_date, end_date);
%LET start_year = %SYSFUNC(year(&start_date.));
%LET end_year = %SYSFUNC(year(&end_date.));
%LET start_month = %SYSFUNC(month(&start_date.));
%LET end_month = %SYSFUNC(month(&end_date.));
DATA dataset_combined;
SET
%DO quote_year = &start_year. %TO &end_year.;
%DO quote_month = %SYSFUNC(IFN(&quote_year. = &start_year.,&start_month.,1,.)) %TO %SYSFUNC(IFN(&quote_year. = &end_year.,&end_month.,12,.));
dataset_&quote_year.&quote_month.
%END;
%END;
;
RUN;
%MEND;
%get_table(&start_date.,&end_date.);
You could do this using putn and z2. format.
%DO quote_year = &start_year. %TO &end_year.;
%DO quote_month = %SYSFUNC(IFN(&quote_year. = &start_year.,&start_month.,1,.)) %TO %SYSFUNC(IFN(&quote_year. = &end_year.,&end_month.,12,.));
dataset_&quote_year.%sysfunc(putn(&quote_month.,z2.))
%END;
%END;
You can also do this using the metadata tables without having to resort to macro loops in the first place:
/* A few datasets to combine */
data
DATASET_201910
DATASET_201911
DATASET_201912
DATASET_202001
;
run;
%let START_DATE = '01dec2019'd;
%let END_DATE = '31jan2020'd;
proc sql noprint;
select catx('.', libname, memname) into :DS_LIST separated by ' '
from dictionary.tables
where
&START_DATE <=
case
when prxmatch('/DATASET_\d{6}/', memname)
then input(scan(memname, -1, '_'), yymmn6.)
else -99999
end
<= &END_DATE
and libname = 'WORK'
;
quit;
data combined_datasets /view=combined_datasets;
set &DS_LIST;
run;
The case-when in the where clause ensures that any other datasets present in the same library that don't match the expected naming scheme are ignored.
One key difference with this approach is that you will never end up attempting to read a dataset that doesn't exist if one of the expected datasets in your range is missing.
You can use the Z format to generate strings with leading zeros.
But your problem is much easier if you use SAS date functions and formats to generate the YYYYMM strings. Just use a normal iterative %DO loop to cycle the month offset from zero to the number of months between the two dates.
%macro get_table(start_date, end_date);
%local offset dsname ;
data dataset_combined;
set
%do offset=0 %to %sysfunc(intck(month,&start_date,&end_date));
%let dsname=dataset_%sysfunc(intnx(month,&start_date,&offset),yymmn6);
&dsname.
%end;
;
run;
%mend get_table;
Result:
445 options mprint;
446 %get_table(start_date='01NOV2019'd,end_date='31JAN2020'd);
MPRINT(GET_TABLE): data dataset_combined;
MPRINT(GET_TABLE): set dataset_201911 dataset_201912 dataset_202001 ;
MPRINT(GET_TABLE): run;
In a macro
Use INTNX to compute the bounds for a loop over date values. Within the loop:
Compute the candidate data set name according to specified lib, prefix and desired date value format. <yyyy><mm> is output by format yymmn6.
Use EXIST to check candidate data sets for existence.
Alternatively, do not check, but make sure to set OPTIONS NODSNFERR prior to combining. The setting will prevent errors when specifying a non-existent data set.
Update the loop index to the end of the month so the next increment takes the index to the start of the next month.
%macro names_by_month(lib=work, prefix=data_, start_date=today(), end_date=today(), format=yymmn6.);
%local index name;
%* loop over first-of-the-month date values;
%do index = %sysfunc(intnx(month, &start_date, 0)) %to %sysfunc(intnx(month, &end_date, 0));
%* compute month dependent name;
%let name = &lib..&prefix.%sysfunc(putn(&index,&format));
%* emit name if it exists;
%if %sysfunc(exist(&name)) or %sysfunc(exist(&name,VIEW)) %then %str(&name);
%* prepare index for loop +1 increment so it goes to start of next month;
%let index = %sysfunc(intnx(month, &index, 0, E));
%end;
%mend;
* example usage:
data combined_imports(label="nov2019 to jan2020");
set
%names_by_month(
prefix=import_,
start_date='01NOV2019'd,
end_date = '31JAN2020'd
)
;
run;

Merging datasets only if they exist

So I'm trying to create a macro in sas and I'm attempting to merge multiple data sets in one data step. This macro also creates a variety of different data sets dynamically so I have no idea what data sets are going to be created and which ones aren't going to. I'm trying to merge four data sets in one data step and I'm trying to only merge the ones that exist and don't merge the ones that don't.
Haven't really tried anything but what I'm trying to do kind of be seen below.
DATA Something;
MERGE Something SomethingElse AnotherThing EXIST(YetAnotherThing)*YetAnotherThing;
RUN;
Well obviously that doesn't work because SAS doesn't work like that but I'm trying to do something like that where YetAnotherThing is one of the data sets that I am testing to see whether or not it exists and to merge it to Something if it does.
If you have a systematic naming convention this can be simplified. For example if you have a common prefix it becomes:
data want;
merge prefix: ;
run;
If they're all in the same library it's also easy. But otherwise you're stuck checking every single name as above.
Something along these lines:
data test1;
do i = 1 to 10;
val1 = i;
output;
end;
run;
data test2;
do i = 1 to 10;
val2 = i*2;
output;
end;
run;
data test3;
do i = 1 to 10;
val3 = i*3;
output;
end;
run;
data test5;
do i = 1 to 10;
val5 = i*4;
output;
end;
run;
%macro multi_merge(varlist);
%local j;
data test_merge;
set %scan(&varlist,1);
run;
%put %sysfunc(countw(&varlist));
%if %sysfunc(countw(&varlist)) > 1 %then %do;
%do j = 2 %to %sysfunc(countw(&varlist));
%if %sysfunc(exist(%scan(&varlist,&j))) %then %do;
data test_merge;
merge test_merge %scan(&varlist,&j);
by i;
run;
%end;
%end;
%end;
%mend;
%multi_merge(test1 test2 test3 test4 test5);
Test4 does not exist.
Same thing with no loop:
if you don't want to loop, you can do this:
%macro if_exists(varlist);
%if %sysfunc(exist(%scan(&varlist,1))) %then %scan(&varlist,1);
%mend;
data test_merge2;
merge test1
%if_exists(test2)
%if_exists(test3)
%if_exists(test4)
%if_exists(test5)
%if_exists(test6);
by i;
run;
I can think of two options:
Loop through the list of input datasets, check if each exists, then merge only those that do.
At the start of your macro, before you conditionally create each of the potential input datasets, create a dummy dataset with the same name containing no rows or columns. Then when you attempt to merge them, they will always exist, without messing up the output with lots of missing values.
Sample code for creating an empty dataset:
data want;
stop;
run;

Can I do hash merge by multiple keys in SAS

I would like to hash merge in SAS using two keys;
The variable names for the lookup dataset called link_id 8. and ref_date 8.;
The variable names for the merged dataset called link_id 8. and drug_date 8.;
The code I used is as following:
data elig_bene_pres;
length link_id ref_date 8.;
call missing(link_id,ref_date):
if _N_=1 then do;
declare hash elig_bene(dataset:"bene.elig_bene_uid");
elig_bene.defineKey("link_id","ref_date");
elig_bene.defineDone();
end;
set data;
if elig_bene.find(key:Link_ID,key:drug_dt)=0 then output;
run;
But it seems that it is not found by these two keys. I just want to know whether my method is doable.
Thanks!
There are no obvious problems with the code.
To troubleshoot, try merge-sort: PROC SORT both data sets, then merge them by the two key variables. This will show which values look similar but are not exactly the same.
This sample shows you have the correct approach.
data elig;
input lukey1 lukey2;
datalines;
1 1
1 2
2 4
3 6
3 7
run;
data all;
do key1 = 1 to 10; do key2 = 1 to 10;
array x(5) (1:5);
output;
end; end;
run;
data all_elig;
length lukey1 lukey2 8;
call missing (lukey1,lukey2);
if _n_ = 1 then do;
declare hash elig (dataset:"elig");
elig.defineKey ('lukey1','lukey2');
elig.defineDone ();
end;
set all;
if 0 = elig.find(key:key1, key:key2);
run;
The process as shown is not really a merge because the lookup hash has no explicit data elements. The keys are implicit data when no data is specified.
If you are selecting all data rows, the first item to troubleshoot is the bene.elig_bene_uid. Are it's keys accidentally a superset of data's ?

SAS Macro Proc Logistic put P-value in a dataset

I've googled lots papers on the subject but don't seem to find what I want. I'm a beginner at SAS Macro, hoping to get some help here.
Here is what I want:
I have a dataset with 1200 variables. I want a macro to run those 1199 variables as OUTCOME, and store the P-values of logistic regression in a dataset. Also the dependent variable "gender" is character, and so are the outcome variables. But I don't know how to put class statement in the macro. Here is an example of how I run it as a single procedure.
proc logistic data=Baseline_gender ;
class gender(ref="Male") / param=ref;
model N284(event='1')=gender ;
ods output ParameterEstimates=ok;
run;
My idea was to create ODS output and delete the unnecessary variables other than the P-value and merge them into one dataset according to the OUTCOME variable names in the model: e.g.
Variable P-value
A1 0.005
A2 0.018
.. ....
I tried to play with some proc macro but I just cant get it work!!!
I really need help on this, Thank you very much.
SRSwift might be onto something (don't know enough about his method to tell), but here's a way to do it using a macro.
First, count the number of variables in your dataset. Do this by selecting your table from the dictionary.columns table. This puts the number of variables into &sqlobs. Now read the variable names from the dictionary table into macro variables var1-var&sqlobs.
%macro logitall;
proc sql;
create table count as
select name from dictionary.columns
where upcase(libname) = 'WORK'
and upcase(memname) = 'BASELINE_GENDER'
and upcase(name) ne 'GENDER'
;
select name into :var1 - :var&sqlobs
from dictionary.columns
where upcase(libname) = 'WORK'
and upcase(memname) = 'BASELINE_GENDER'
and upcase(name) ne 'GENDER'
;
quit;
Then run proc logistic for each dependent variable, each time outputting a dataset named after dependent variable.;
%do I = 1 %to &sqlobs;
proc logistic data=Baseline_gender ;
class gender(ref="Male") / param=ref;
model &&var&I.(event='1')=gender ;
ods output ParameterEstimates=&&var&I.;
run;
%end;
Now put all the output datasets together, creating a new variable with the dataset name using indsname= in the set statement.
data allvars;
format indsname dsname varname $25.;
set
%do I = 1 %to &sqlobs;
&&var&I.
%end;
indsname=dsname;
varname=dsname;
keep varname ProbChiSq;
where variable ne 'Intercept';
run;
%mend logitall;
%logitall;
Here is a macro free approach. It restructures the data in advance and uses SAS's by grouping. The data is stored in a deep format where the all the outcome variable values are stored in one new variable.
Create some sample data:
data have;
input
outcome1
outcome2
outcome3
gender $;
datalines;
1 1 1 Male
0 1 1 Male
1 0 1 Female
0 1 0 Male
1 1 0 Female
0 0 0 Female
;
run;
Next transpose the data into a deep format using an array:
data trans;
set have;
/* Create an array of all the outcome variables */
array o{*} outcome:;
/* Loop over the outcome variables */
do i = 1 to dim(o);
/* Store the variable name for grouping */
_NAME_ = vname(o[i]);
/* Store the outcome value in the */
outcome = o[i];
output;
end;
keep _NAME_ outcome gender;
run;
proc sort data = trans;
by _NAME_;
run;
Reusing your logistic procedure but with an additional by statement:
proc logistic data = trans;
/* Use the grouping variable to select multiple analyses */
by _NAME_;
class gender(ref = "Male");
/* Use the new variable for the dependant variable */
model outcome = gender / noint;
ods output ParameterEstimates = ok;
run;
Here is another way to do it using macro. First define all the variables to be used as outcome in a global variable and then write the macro script.
%let var = var1 var2 var3 ..... var1199;
%macro log_regression;
%do i=1 %to %eval(%sysfunc(countc(&var., " "))+1);
%let outcome_var = %scan(&var, &i);
%put &outcome_var.;
proc logistic data = baseline_gender desc;
class gender (ref = "Male") / param = ref;
model &outcome_var. = gender;
ods output ParameterEstimates = ParEst_&outcome_var.;
run;
%if %sysfunc(exist(univar_result)) %then %do;
data univar_result;
set univar_result ParEst_&outcome_var.;
run;
%end;
%else %do;
data univar_result;
set ParEst_&outcome_var.;
run;
%end;
%end;
%mend;

Get out the value of a variable in each observation to a macro variable

I have a table called term_table containing the below columns
comp, term_type, term, score, rank
I go through every observation and at each obs, I want to store the value of variable rank to a macro variable called curr_r. The code I created below does not work
Data Work.term_table;
input Comp $
Term_type $
Term $
Score
Rank
;
datalines;
comp1 term_type1 A 1 1
comp2 term_type2 A 2 10
comp3 term_type3 A 3 20
comp4 term_type4 B 4 20
comp5 term_type5 B 5 40
comp6 term_type6 B 6 100
;
Run;
%local j;
DATA tmp;
SET term_table;
LENGTH freq 8;
BY &by_var term_type term;
RETAIN freq;
CALL SYMPUT('curr_r', rank);
IF first.term THEN DO;
%do j = 1 %to &curr_r;
do some thing
%end;
END;
RUN;
Could you help me to solve the problem
Thanks a lot
Hung
The call symput statement does create the macro var &curr_r with the value of rank, but it is not available until after the data step.
However, I don't think you need to create the macro var &curr_r. I don't think a macro is needed at all.
I think the below should work: (Untested)
DATA tmp;
SET term_table;
LENGTH freq 8;
BY &by_var term_type term;
RETAIN freq;
IF first.term THEN DO;
do j = 1 to rank;
<do some thing>
end;
END;
RUN;
If you needed to use the rank from a prior obs, use the LAG function.
Start=Lag(rank);
To store each value of RANK in a macro variable, the below will do that:
Proc Sql noprint;
select count(rank)
into :cnt
from term_table;
%Let cnt=&cnt;
select rank
into :curr_r1 - :curr_r&cnt
from term_table;
quit;