I would like to do a hash merge in SAS using two keys.
The key variables in the lookup dataset are link_id (8.) and ref_date (8.).
The key variables in the dataset to be merged are link_id (8.) and drug_date (8.).
The code I used is as follows:
data elig_bene_pres;
length link_id ref_date 8.;
call missing(link_id,ref_date);
if _N_=1 then do;
declare hash elig_bene(dataset:"bene.elig_bene_uid");
elig_bene.defineKey("link_id","ref_date");
elig_bene.defineDone();
end;
set data;
if elig_bene.find(key:Link_ID,key:drug_dt)=0 then output;
run;
But it seems that no rows are found by these two keys. I just want to know whether my method is workable.
Thanks!
There are no obvious problems with the code.
To troubleshoot, try merge-sort: PROC SORT both data sets, then merge them by the two key variables. This will show which values look similar but are not exactly the same.
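A minimal sketch of that check, assuming the dataset and variable names from the question (bene.elig_bene_uid, data, link_id, ref_date, drug_dt). The output names lookup_sorted, data_sorted and key_check are made up for illustration:

```sas
/* Diagnostic sketch only; the output dataset names are hypothetical. */
proc sort data=bene.elig_bene_uid(keep=link_id ref_date)
          out=lookup_sorted nodupkey;
  by link_id ref_date;
run;

proc sort data=data(keep=link_id drug_dt) out=data_sorted;
  by link_id drug_dt;
run;

data key_check;
  /* rename drug_dt so both inputs share the same BY variables */
  merge lookup_sorted(in=in_lookup)
        data_sorted(in=in_data rename=(drug_dt=ref_date));
  by link_id ref_date;
  length source $11;
  if in_lookup and in_data then source = 'both';
  else if in_lookup then source = 'lookup only';
  else source = 'data only';
run;

/* How many key pairs match, and how many exist on one side only */
proc freq data=key_check;
  tables source;
run;

/* Unmatched pairs sit next to their nearest neighbours in sort order */
proc print data=key_check(obs=50);
  where source ne 'both';
run;
```

If everything lands in "lookup only" or "data only", the two date variables probably hold different kinds of values (for example a SAS date versus a yyyymmdd number).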
This sample shows you have the correct approach.
data elig;
input lukey1 lukey2;
datalines;
1 1
1 2
2 4
3 6
3 7
run;
data all;
do key1 = 1 to 10; do key2 = 1 to 10;
array x(5) (1:5);
output;
end; end;
run;
data all_elig;
length lukey1 lukey2 8;
call missing (lukey1,lukey2);
if _n_ = 1 then do;
declare hash elig (dataset:"elig");
elig.defineKey ('lukey1','lukey2');
elig.defineDone ();
end;
set all;
if 0 = elig.find(key:key1, key:key2);
run;
The process as shown is not really a merge because the lookup hash has no explicit data elements. The keys are implicit data when no data is specified.
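For comparison, here is a sketch of a lookup that does bring data back. It assumes a hypothetical lookup table elig2 with an extra variable grp, and reuses the all dataset from the sample above:

```sas
/* elig2 and grp are made-up names used only for this illustration */
data elig2;
  input lukey1 lukey2 grp $;
  datalines;
1 1 A
1 2 B
2 4 C
;
run;

data all_elig2;
  if 0 then set elig2;              /* defines lukey1, lukey2 and grp in the PDV */
  if _n_ = 1 then do;
    declare hash elig (dataset:"elig2");
    elig.defineKey ('lukey1','lukey2');
    elig.defineData ('grp');        /* explicit data element retrieved by find() */
    elig.defineDone ();
  end;
  set all;
  if elig.find(key:key1, key:key2) = 0;   /* keep matching rows; grp is now filled in */
  drop lukey1 lukey2;
run;
```

With defineData specified, each successful find() copies grp into the current row, which is what makes the step behave like a merge rather than a filter.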
If you are selecting all data rows, the first item to troubleshoot is bene.elig_bene_uid. Are its keys accidentally a superset of data's?
I'm testing out how to use hash objects in SAS 9.4 M6 to do fuzzy joins since PROC SQL just runs for hours on my larger dataset. I created some sample datasets (below) and what I want is for the merge to pull in exact matches on the "name" fields AND any matches that have a COMPLEV score less than 10. Right now, this code still only pulls in the exact matches.
I'm very new to hash objects, so I'm sure it's a simple fix, but I've tried and am in need of help.
data A;
infile datalines missover;
length nameA $50;
input nameA $ ;
datalines;
MICKEYMOUSE2000-01-02
DAFDUCK1990-09-23
GOOFYMAN1993-05-11
;
run;
*second dataset with one exact match and two that differ slightly from those in dataset A;
data B;
infile datalines missover;
length nameB $50;
input nameB $ VDAY :ddmmyy10.;
format VDAY ddmmyy10.;
datalines;
MICKEYMOUSE2000-01-01 07/08/2021
DAFFYDUCK1990-09-23 05/11/2021
GOOFYMAN1993-05-11 08/11/2021
;
run;
*only pulling in exact matches, want it to pull in other fuzzy matches;
data simplemerge ;
if 0 then set work.B ; *load var properties into hash table;
if _n_ = 1 then do;
dcl hash B (dataset: 'work.B'); *declare the name B for hash using B dataset;
B.definekey('nameB');*identify var in B to use as key;
B.definedata('nameB','vday');*identify columns of data to bring in from B dataset;
B.definedone();*complete hash table definition;
end;
set work.A; *bring in A data;
if B.Find(KEY: nameA) ne 0 then do;
if complev(nameA, nameB) < 10 then do;
B.ref(key : nameB,data : nameB, data : vday);
end;
end;
RUN;
A fuzzy match with a hash object is not necessarily faster than SQL; in fact, there is a good chance the performance is identical, because SQL joins are often done with a hash table behind the scenes.
That said, here's how you'd do the naive hash fuzzy lookup - with a hash iterator (hiter).
data simplemerge ;
if 0 then set work.B ; *load var properties into hash table;
if _n_ = 1 then do;
dcl hash B (dataset: 'work.B'); *declare the name B for hash using B dataset;
B.definekey('nameB');*identify var in B to use as key;
B.definedata('nameB','vday');*identify columns of data to bring in from B dataset;
B.definedone();*complete hash table definition;
dcl hiter hi_b('B');
end;
set work.A; *bring in A data;
done = 0;
if B.Find(KEY: nameA) ne 0 then do;
rc = hi_b.first();
do while (rc eq 0 );
if complev(nameA, nameB) lt 10 then do;
put "Found one!" namea= nameb=;
leave;
end;
else call missing(of nameB vday);
rc = hi_b.next();
end;
end;
else put "Found one!" namea=;
RUN;
This will be ... not fast ... if work.B has a lot of rows. It goes over every row of B once for every row of A that doesn't have an exact match.
One thing you can do to make this more efficient is not search all of B. Instead, have some smaller subset of B that you find with an exact match, and then iterate over that smaller subset; instead of using hiter just use the find_next. This may not work for your exact requirements, but if it's feasible, this would be ideal.
Here's one example of doing that. It's not particularly efficient since sex has only two values (so I'm looking through half of the rows anyway), but it does work.
data have;
set sashelp.class;
if mod(_n_,3) eq 0 then do;
name = cats(name,'Z');
end;
if mod(_n_,5) eq 0 then do;
name = cats('Row_',_n_);
end;
run;
data want;
if 0 then set sashelp.classfit;
length name_fuzz $8;
*first define two hash tables - one for exact match, one for fuzzy. Only do this if exact matches are reasonably common;
if _n_ eq 1 then do;
declare hash h_exact(dataset:'sashelp.classfit');
h_exact.defineKey('name');
h_exact.defineData(all:'Y');
h_exact.defineDone();
declare hash h_fuzzy(dataset:'sashelp.classfit(rename=name=name_fuzz keep=name sex predict lower upper lowermean uppermean)',multidata:'y');
h_fuzzy.defineKey('sex');
h_fuzzy.defineData(all:'Y');
h_fuzzy.defineDone();
call missing(name_fuzz);
end;
set have;
*now check exact - if it matches then do not try further;
rc_exact = h_exact.find();
if rc_exact eq 0 then do;
output;
return;
end;
*now try fuzzy - first look up the first match by the chunk criteria;
rc_fuzzy = h_fuzzy.find();
*now iterate over all of the matches of the chunk, and if you find a close-enough match then output that row and stop trying;
do while (rc_fuzzy eq 0);
if complev(name,name_fuzz) lt 2 then do;
output;
return;
end;
rc_fuzzy = h_fuzzy.find_next();
end;
*if you are still here, you failed to find a fuzzy match - so clear the values from the variables you are merging on and output a blank row, assuming you are doing a "left join" [if it is inner join, then just skip these next two lines];
call missing(of predict lower upper lowermean uppermean);
output;
run;
A better version of this would have a more discriminating key for the fuzzy match - the more discriminating the better. The key might not be related at all to your fuzzy match - for example, maybe your fuzzy match is looking for names, but you also have their year of birth. Match on year of birth, then iterate over complev(namea,nameb), since year of birth is quite discriminating.
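A sketch of the year-of-birth idea, under assumptions: the datasets people_a and people_b, the variable birth_year and the distance threshold of 3 are all invented for illustration and would need to be adapted to real data.

```sas
/* Hypothetical sample data: names plus a year of birth */
data people_a;
  length name $20;
  input name $ birth_year;
  datalines;
MICKEYMOUSE 2000
DAFDUCK 1990
GOOFYMAN 1993
;
run;

data people_b;
  length name_b $20;
  input name_b $ birth_year vday :ddmmyy10.;
  format vday ddmmyy10.;
  datalines;
MICKEYMOUSE 2000 07/08/2021
DAFFYDUCK 1990 05/11/2021
GOOFYMAN 1993 08/11/2021
;
run;

data fuzzy_by_year;
  if 0 then set people_b;                 /* define name_b and vday in the PDV */
  if _n_ = 1 then do;
    declare hash h(dataset:'people_b', multidata:'y');
    h.defineKey('birth_year');            /* exact match on the discriminating key */
    h.defineData('name_b','vday');
    h.defineDone();
  end;
  set people_a;
  rc = h.find();                          /* first row of B with the same birth_year */
  do while (rc = 0);
    if complev(name, name_b) < 3 then leave;   /* close enough: keep this match */
    rc = h.find_next();                   /* otherwise try the next row of the same year */
  end;
  if rc ne 0 then call missing(name_b, vday);  /* no match: left-join style blank */
  drop rc;
run;
```

Only the rows of people_b that share the birth year are compared with complev, so the cost per row of people_a depends on the size of that year's group rather than on the size of the whole table.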
I am trying to learn hash joins in SAS, but I am stuck on the case where multiple tables share the same variable names (not the key, that's fine, but the other variables).
I want to join tables A, B and C, each with two variables, Key and Dat. The names Key and Dat are common to all three.
This syntax works for me if I rename Dat in all three tables beforehand to DAT_A, DAT_B and DAT_C, but that defeats the purpose, since I have to process all three tables first, which takes time.
This code works:
data merged(keep=KEY DAT_A DAT_B DAT_C);
if 0 then
set A B C;
if _N_ = 1 then
do;
declare hash A(dataset:'A');
A.defineKey('KEY');
A.defineData('DAT_A');
A.defineDone();
declare hash B(dataset:'B');
B.defineKey('KEY');
B.defineData('DAT_B');
B.defineDone();
end;
set C;
if A.find(key:KEY) = 0 and B.find(key:KEY) = 0 then
output; run;
It's mentioned on the SAS website that you can specify dataset options in the hash declare statement, so I thought this might work:
data merged(keep=KEY DAT_A DAT_B DAT_C DAT);
if 0 then
set A B C;
if _N_ = 1 then
do;
declare hash A(dataset:'A (rename=(DAT=DAT_A))');
A.defineKey('KEY');
A.defineData('DAT_A');
A.defineDone();
declare hash B(dataset:'B (rename=(DAT=DAT_B))');
B.defineKey('KEY');
B.defineData('DAT_B');
B.defineDone();
end;
set C (rename=(DAT=DAT_C));
if A.find(key:KEY) = 0 and B.find(key:KEY) = 0 then
output; run;
However running this gives the following error
ERROR: Variable DAT is not on file WORK.A.
ERROR: Hash data set load failed at line 33 column 4.
ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase.
Does anyone have any ideas?
Thanks a lot
You are including DAT in the keep= dataset option on your output dataset. But your data step doesn't have the variable DAT anymore. You have renamed all copies of it.
Your error message about dataset A not having DAT is probably because of your earlier attempts to rename the variable to DAT_A.
Here is an example using SASHELP.CLASS.
data merged ;
keep NAME AGE_A AGE_B AGE_C ;
if 0 then set
sashelp.class(rename=(AGE=AGE_A))
sashelp.class(rename=(AGE=AGE_B))
sashelp.class(rename=(AGE=AGE_C))
;
if _N_ = 1 then do;
declare hash A(dataset:'sashelp.class (rename=(AGE=AGE_A) where=(age_a ne 14))');
A.defineKey('NAME');
A.defineData('AGE_A');
A.defineDone();
declare hash B(dataset:'sashelp.class (rename=(AGE=AGE_B) where=(age_b ne 13))');
B.defineKey('NAME');
B.defineData('AGE_B');
B.defineDone();
end;
set sashelp.class (rename=(AGE=AGE_C));
/* if A.find(key:NAME) = 0 and B.find(key:NAME) = 0 then output; */
if A.find(key:NAME) then call missing(age_a);
if B.find(key:NAME) then call missing(age_b);
run;
I am very new to SAS and very eager to learn it. My question is about subsetting. I have two datasets, a and b, consisting of the columns a and b respectively:
a b
3 4
5
6
data a;
set a;
run;
data b;
set b;
run;
data merged;
merge a b;
run;
proc print data=merged(firstobs= a[1] obs=a[1] keep= b);
run;
With this code I get an invalid conversion type error, and I could not figure out why, because when I write it like this:
proc print data=merged(firstobs= 3 obs= 3 keep= b);
run;
I get the result as 6.
I know it seems very simple, but I am stuck on this error. I would really appreciate your help. Thanks.
You want to print the row from the dataset b whose number is the same as the value of a in row 1 of the dataset a.
You can't pass a value into a proc directly like that, but you can generate a macro variable from your dataset and pass it into the proc, e.g.
data _null_;
set a(obs = 1);
call symputx('ROW_NUMBER',a); /* symputx converts the number to text and strips blanks */
run;
proc print data = b(keep = b obs = &ROW_NUMBER firstobs = &ROW_NUMBER);
run;
What would be the data step equivalent of this proc sql?
proc sql;
create table issues2 as(
select request,
area,
sum(issue_count) as issue_count,
sum(resolved_count) as resolved_count
from
issues1
group by request, area
);
PROC MEANS/SUMMARY is better, but if it's relevant, the actual data step solution is as follows. Basically you just reset the counter to 0 on first.<var> and output on last.<var>, where <var> is the last variable in your by group.
Note: This assumes the data is sorted by request area. Sort it if it is not.
data issues2(rename=(issue_count_sum=issue_count resolved_count_sum=resolved_count) drop=issue_count resolved_count);
set issues1;
by request area;
if first.area then do;
issue_count_sum=0;
resolved_count_sum=0;
end;
issue_count_sum+issue_count;
resolved_count_sum+resolved_count;
if last.area then output;
run;
The functional equivalent of what you're trying to do is the following:
data _null_;
set issues1(rename=(issue_count=_issue_count
resolved_count=_resolved_count)) end=done;
if _n_=1 then do;
declare hash total_issues();
total_issues.defineKey("request", "area");
total_issues.defineData("request", "area", "issue_count", "resolved_count");
total_issues.defineDone();
end;
if total_issues.find() ne 0 then do;
issue_count = _issue_count;
resolved_count = _resolved_count;
end;
else do;
issue_count + _issue_count;
resolved_count + _resolved_count;
end;
total_issues.replace();
if done then total_issues.output(dataset: "issues2");
run;
This method does not require you to pre-sort the dataset. I wanted to see what kind of performance you'd get with different methods, so I did a few tests on a 74M row dataset. I got the following run-times (your results may vary):
Unsorted Dataset:
Proc SQL - 12.18 Seconds
Data Step With Hash Object Method (above) - 26.68 Seconds
Proc Means using a class statement (nway) - 5.13 Seconds
Sorted Dataset (36.94 Seconds to do a proc sort):
Proc SQL - 10.82 Seconds
Proc Means using a by statement - 9.31 Seconds
Proc Means using a class statement (nway) - 6.07 Seconds
Data Step using by statement (I used the code from Joe's answer) - 8.97 Seconds
As you can see, I wouldn't recommend using the data step with the hash object method shown above since it took twice as long as the proc sql.
I'm not sure why proc means with a by statement took longer than proc means with a class statement, but I tried this on a bunch of different datasets and saw similar differences in runtimes (I'm using SAS 9.3 on Linux 64).
Keep in mind that these runtimes might be completely different for your situation, but I would recommend using the following code to do the summation:
proc means data=issues1 noprint nway;
class request area;
var issue_count resolved_count;
output out=issues2(drop=_:) sum=;
run;
It is awkward, I think, to do this in a data step at all; summing and resetting variables at each level of the by variables would work. A hash object might also do the trick.
Perhaps the simplest non-PROC SQL method would be to use PROC SUMMARY:
proc summary data = issues1 nway missing;
class request area;
var issue_count resolved_count;
output out = issues2 sum(issue_count) = issue_count sum(resolved_count) = resolved_count ;
run;
Here's the temporary array method. This is the "simplest" of them, making some assumptions about the request and area values; if those assumptions are faulty, as they often are in real data, it may not be quite as easy as this. Note that while the data below does happen to be sorted, I don't rely on it being sorted and the process doesn't gain any advantage from it being sorted.
data issues1;
do request=1 to 1e5;
do area = 1 to 7;
do issueNum = 1 to 1e2;
issue_count = floor(rand('Uniform')*7);
resolved_count = floor(rand('Uniform')*issue_count);
output;
end;
end;
end;
run;
data issues2;
set issues1 end=done;
array ra_issue[1100000] _temporary_;
array ra_resolved[1100000] _temporary_;
*array index = request||area, so request 9549 area 6 = 95496.;
ra_issue[input(cats(request,area),best7.)] + issue_count;
ra_resolved[input(cats(request,area),best7.)] + resolved_count;
if done then do;
do _t = 1 to dim(ra_issue);
if not missing(ra_issue[_t]) then do;
request = floor(_t/10);
area = mod(_t,10);
issue_count=ra_issue[_t];
resolved_count=ra_resolved[_t];
output;
keep request area issue_count resolved_count;
end;
end;
end;
run;
That performed comparably to PROC MEANS with CLASS, given the simple data I started it with. If you can't trivially generate a key from a combination of area and request (if they're character variables, for example), you would have to store another array of name-to-key relationships which would make it quite a lot slower if there are a lot of combinations (though if there are relatively few combinations, it's not necessarily all that bad). If for some reason you were doing this in production, I would first create a table of unique request+area combinations, create a Format and an Informat to convert back and forth from a unique key (which should be very fast AND give you a reliable index), and then do this using that format/informat rather than the cats / division-modulus that I do here.
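A rough sketch of that format/informat idea, with assumptions stated up front: the names combos, fmt_cntl, combo2key, key2combo, n_combos and issues2_fmt are invented here, and the combined text key "request|area" is just one possible encoding. It has not been benchmarked against the other methods.

```sas
/* 1. One row per unique request+area combination */
proc sql;
  create table combos as
  select distinct request, area from issues1;
quit;

/* 2. Control dataset for an informat (text -> key) and a format (key -> text) */
data fmt_cntl;
  set combos end=last;
  length fmtname $32 type $1 start $40 label $40;
  key = _n_;                                   /* dense index 1..N */
  fmtname = 'combo2key'; type = 'I';           /* numeric informat */
  start = catx('|', request, area); label = cats(key);
  output;
  fmtname = 'key2combo'; type = 'N';           /* numeric format */
  start = cats(key); label = catx('|', request, area);
  output;
  if last then call symputx('n_combos', key);  /* array size for step 4 */
  keep fmtname type start label;
run;

/* 3. CNTLIN expects the rows of each format to be grouped together */
proc sort data=fmt_cntl;
  by fmtname;
run;

proc format cntlin=fmt_cntl;
run;

/* 4. Accumulate into temporary arrays indexed by the generated key */
data issues2_fmt;
  set issues1 end=done;
  array ra_issue[&n_combos] _temporary_;
  array ra_resolved[&n_combos] _temporary_;
  length combo $40;
  _key = input(catx('|', request, area), combo2key.);
  ra_issue[_key] + issue_count;
  ra_resolved[_key] + resolved_count;
  if done then do _t = 1 to &n_combos;
    combo = put(_t, key2combo.);               /* back to "request|area" */
    request = input(scan(combo, 1, '|'), best32.);
    area = input(scan(combo, 2, '|'), best32.);
    issue_count = ra_issue[_t];
    resolved_count = ra_resolved[_t];
    output;
  end;
  keep request area issue_count resolved_count;
run;
```

Because the key is a dense 1..N index, the arrays need only as many elements as there are combinations, and character request or area values would work as long as the scan/input conversion in the output loop is adjusted.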
I need some assistance with splitting a large SAS dataset into smaller datasets.
Each month I'll have a dataset containing a few million records. This number will vary from month to month. I need to split this dataset into multiple smaller datasets containing 250,000 records each. For example, if I have 1,050,000 records in the original dataset then I need the end result to be 4 datasets containing 250,000 records and 1 dataset containing 50,000 records.
From what I've been looking at it appears this will require using macros. Unfortunately I'm fairly new to SAS (unfamiliar with using macros) and don't have a lot of time to accomplish this. Any help would be greatly appreciated.
Building on Joe's answer, maybe you could try something like this :
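%MACRO SPLIT(DATASET);
%LOCAL DATASET_ID NOBS NB_DATASETS RC I;
/* Open the dataset, read its observation count, then close it again */
%LET DATASET_ID = %SYSFUNC(OPEN(&DATASET.));
%LET NOBS = %SYSFUNC(ATTRN(&DATASET_ID., NLOBS));
%LET RC = %SYSFUNC(CLOSE(&DATASET_ID.));
%LET NB_DATASETS = %SYSEVALF(&NOBS. / 250000, CEIL);
/* One output dataset (WANT1, WANT2, ...) per block of 250,000 rows */
DATA
%DO I=1 %TO &NB_DATASETS.;
WANT&I.
%END;;
SET &DATASET.;
%DO I=1 %TO &NB_DATASETS.;
%IF &I. > 1 %THEN %DO; ELSE %END; IF _N_ LE 2.5E5 * &I. THEN OUTPUT WANT&I.;
%END;
RUN;
%MEND SPLIT;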
You can do it without macros at all, if you don't mind asking for datasets that may not exist, and have a reasonable bound on things.
data want1 want2 want3 want4 want5 want6 want7 want8 want9;
set have; /* your input dataset; the name have is assumed */
if _n_ le 2.5e5 then output want1;
else if _n_ le 5e5 then output want2;
else if _n_ le 7.5e5 then output want3;
... etc....
run;
Macros would make that more efficient to program and cleaner to read, but wouldn't change how it actually runs in reality.
You can do it without macros, using CALL EXECUTE(). It builds SAS code as text strings and queues it; the generated code runs after the DATA step that called it has completed.
data _null_;
if 0 then set have nobs=n;
do i=1 to ceil(n/250000);
call execute (cats("data want",i)||";");
call execute ("set have(firstobs="||(i-1)*250000+1||" obs="||i*250000||");");
call execute ("run;");
end;
run;
The first result on Google is from the SAS User Group International (SUGI)
These folks are your friends.
The article is here:
http://www2.sas.com/proceedings/sugi27/p083-27.pdf
The code is:
%macro split(ndsn=2);
data %do i = 1 %to &ndsn.; dsn&i. %end; ;
retain x;
set orig nobs=nobs;
if _n_ eq 1
then do;
if mod(nobs,&ndsn.) eq 0
then x=int(nobs/&ndsn.);
else x=int(nobs/&ndsn.)+1;
end;
if _n_ le x then output dsn1;
%do i = 2 %to &ndsn.;
else if _n_ le (&i.*x)
then output dsn&i.;
%end;
run;
%mend split;
%split(ndsn=10);
All you need to do is replace the 10 in "%split(ndsn=10);" with the number of datasets you require.
In Line 4, "set orig nobs=nobs;", simply replace orig with your dataset name.
Hey presto!
A more efficient option, if you have room in memory to store one of the smaller datasets, is a hash solution. Here's an example using basically what you're describing in the question:
data in_data;
do recid = 1 to 1.000001e7;
datavar = 1;
output;
end;
run;
data _null_;
if 0 then set in_data;
declare hash h_out();
h_out.defineKey('_n_');
h_out.defineData('recid','datavar');
h_out.defineDone();
do filenum = 1 by 1 until (eof);
do _n_ = 1 to 250000 until (eof);
set in_data end=eof;
h_out.add();
end;
h_out.output(dataset:cats('file_',filenum));
h_out.clear();
end;
stop;
run;
We define a hash object with the appropriate parameters, and simply tell it to output every 250k records, and clear it. We could do a hash-of-hashes here also, particularly if it weren't just "Every 250k records" but some other criteria drove things, but then you'd have to fit all of the records in memory, not just 250k of them.
Note also that we could do this without specifying the variables explicitly, but it requires having a useful ID on the dataset:
data _null_;
if 0 then set in_data;
declare hash h_out(dataset:'in_data(obs=0)');
h_out.defineKey('recid');
h_out.defineData(all:'y');
h_out.defineDone();
do filenum = 1 by 1 until (eof);
do _n_ = 1 to 250000 until (eof);
set in_data end=eof;
h_out.add();
end;
h_out.output(dataset:cats('file_',filenum));
h_out.clear();
end;
stop;
run;
Since we can't use _n_ anymore for the hash ID due to using the dataset option on the constructor (necessary for the all:'y' functionality), we have to have a record ID. Hopefully there is such a variable, or one could be added with a view.
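Adding such an ID with a view could look like the following sketch, where raw_data, raw_data_v and row_id are hypothetical names for a dataset that lacks a record ID:

```sas
/* A DATA step view: nothing is copied, row_id is computed as the data is read */
data raw_data_v / view=raw_data_v;
  set raw_data;
  row_id = _n_;      /* running row number serves as the unique key */
run;
```

The view would then take the place of in_data in the step above, with 'row_id' passed to defineKey instead of 'recid'.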
Here is a basic approach. This requires manual adjustment of the intervals, but is easy to understand.
* split data;
data output1;
set df;
if 1 <= _N_ < 5 then output;
run;
data output2;
set df;
if 5 <= _N_ < 10 then output;
run;
data output3;
set df;
if 10 <= _N_ < 15 then output;
run;
data output4;
set df;
if 15 <= _N_ < 22 then output;
run;
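The same intervals can also be written as a single DATA step, which reads df only once instead of four times. This is a sketch that keeps the boundaries from the steps above:

```sas
data output1 output2 output3 output4;
  set df;
  /* each row goes to exactly one output dataset, based on its row number */
  if _N_ < 5 then output output1;
  else if _N_ < 10 then output output2;
  else if _N_ < 15 then output output3;
  else if _N_ < 22 then output output4;
run;
```

The intervals still have to be adjusted by hand, but for a large df the single pass avoids rereading the input for every piece.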