I'm new to hash objects, but I'd like to learn more about them. I'm trying to replace PROC SQL joins and regular merges with hash lookups wherever possible. While playing around with SASHELP datasets, I ran into the following issue:
Let's say I have a dataset of 10 unique observations (car manufacturers) and I want to match it against another table that contains various models of these cars, so the car make repeats in that table. The other important point is that not all car makes are present in the lookup table, but I would still like to retain those in my table.
Consider the code below:
proc sql noprint;
create table x as select distinct make
from sashelp.cars;
quit;
data x;
set x (obs = 10);
if make = "GMC" then make = "XYZ";
run;
data hx (drop = rc);
if 0 then set sashelp.cars(keep = make model);
if _n_ = 1 then do;
declare hash hhh(dataset: 'sashelp.cars(keep = make model)', multidata:'y');
hhh.DefineKey('make');
hhh.DefineData('model');
hhh.DefineDone();
end;
set x;
rc = hhh.find();
do while(rc = 0);
output;
rc = hhh.find_next();
end;
if rc ne 0 then do;
call missing(model);
output;
end;
run;
If all makes in table X were also in table cars, then removing output command after call missing(model) would do exactly what I want. But I also want to make sure that make "XYZ" will remain in the table.
The existing code, however, produces a blank row after it finds all matching models, like so:
make model
==========
Acura MDX
Acura RSX Type S 2dr
Acura TSX 4dr
... (skipping a few rows)
Acura NSX coupe 2dr manual S
Acura
Audi A4 1.8T 4dr
As you can see in the table above, there is a missing model in the second-to-last row. This pattern appears at the end of every make.
Any suggestions on how to fix this would be highly appreciated!
Many thanks
The direct answer: you need to consider this section.
rc = hhh.find();
do while(rc = 0);
output;
rc = hhh.find_next();
end;
if rc ne 0 then do;
call missing(model);
output;
end;
What's happening here is that the loop keeps calling find_next until it fails, which is fine. But once the loop ends, rc is always nonzero, so the rc ne 0 block runs for every make, even though you really mean that last step to be used only when the very first find() failed.
You can handle this a couple of ways. You can do this:
rc = hhh.find();
if rc ne 0 then do;
call missing(model);
output;
end;
else
do while(rc = 0);
output;
rc = hhh.find_next();
end;
Or, you can add a counter to the do while loop, and then execute the call missing/output if that counter stores a 0. The above is probably easier.
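The counter variant mentioned above might look like the following sketch. It reuses the table x and the hash from the question; n_found is a hypothetical variable name introduced here, not something from the original code.

```sas
data hx (drop = rc n_found);
  if 0 then set sashelp.cars(keep = make model);
  if _n_ = 1 then do;
    declare hash hhh(dataset: 'sashelp.cars(keep = make model)', multidata: 'y');
    hhh.defineKey('make');
    hhh.defineData('model');
    hhh.defineDone();
  end;
  set x;
  n_found = 0;                  /* hypothetical counter: matches found for this make */
  rc = hhh.find();
  do while (rc = 0);
    n_found = n_found + 1;
    output;                     /* one row per matching model */
    rc = hhh.find_next();
  end;
  if n_found = 0 then do;       /* only when not even one match was found */
    call missing(model);
    output;
  end;
run;
```

The call missing is still needed because model keeps its value from the previous row when the lookup fails.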
Further, you probably should consider whether a hash is the right solution for this problem. While it is possible to solve this with multidata hashes, keyed set is usually more efficient for something like this, and much easier to code.
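For comparison, a keyed set version could look roughly like this. It is a sketch under assumptions: cars_ix is a hypothetical indexed copy of sashelp.cars (a keyed set needs an index on the lookup table), and n_found is again a made-up counter name.

```sas
/* Assumption: build an indexed copy, since the KEY= option requires an index */
data cars_ix (index = (make));
  set sashelp.cars (keep = make model);
run;

data kx (drop = n_found);
  set x;
  n_found = 0;
  /* repeated keyed reads walk through all rows sharing this make */
  do until (_iorc_ ne 0);
    set cars_ix key = make;
    if _iorc_ = 0 then do;
      n_found = n_found + 1;
      output;
    end;
  end;
  _error_ = 0;                  /* a failed keyed read sets _error_; reset it */
  if n_found = 0 then do;       /* make not present in the lookup table */
    call missing(model);
    output;
  end;
run;
```

This relies on the makes in x being distinct, as they are in the question; repeated consecutive key values would need extra handling.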
I'm testing out how to use hash objects in SAS 9.4 M6 to do fuzzy joins since PROC SQL just runs for hours on my larger dataset. I created some sample datasets (below) and what I want is for the merge to pull in exact matches on the "name" fields AND any matches that have a COMPLEV score less than 10. Right now, this code still only pulls in the exact matches.
I'm very new to hash objects, so I'm sure it's a simple fix, but I've tried and am in need of help.
data A;
infile datalines missover;
length nameA $50;
input nameA $ ;
datalines;
MICKEYMOUSE2000-01-02
DAFDUCK1990-09-23
GOOFYMAN1993-05-11
;
run;
*second dataset with one exact match and two that differ slightly from those in dataset A;
data B;
infile datalines missover;
length nameB $50;
input nameB $ VDAY :ddmmyy10.;
format VDAY ddmmyy10.;
datalines;
MICKEYMOUSE2000-01-01 07/08/2021
DAFFYDUCK1990-09-23 05/11/2021
GOOFYMAN1993-05-11 08/11/2021
;
run;
*only pulling in exact matches, want it to pull in other fuzzy matches;
data simplemerge ;
if 0 then set work.B ; *load var properties into hash table;
if _n_ = 1 then do;
dcl hash B (dataset: 'work.B'); *declare the name B for hash using B dataset;
B.definekey('nameB');*identify var in B to use as key;
B.definedata('nameB','vday');*identify columns of data to bring in from B dataset;
B.definedone();*complete hash table definition;
end;
set work.A; *bring in A data;
if B.Find(KEY: nameA) ne 0 then do;
if complev(nameA, nameB) < 10 then do;
B.ref(key : nameB,data : nameB, data : vday);
end;
end;
RUN;
A fuzzy match in a hash is not necessarily faster than SQL; in fact, there's a good chance the performance is identical. SQL joins are often done with a hash table behind the scenes.
That said, here's how you'd do the naive hash fuzzy lookup - with a hash iterator (hiter).
data simplemerge ;
if 0 then set work.B ; *load var properties into hash table;
if _n_ = 1 then do;
dcl hash B (dataset: 'work.B'); *declare the name B for hash using B dataset;
B.definekey('nameB');*identify var in B to use as key;
B.definedata('nameB','vday');*identify columns of data to bring in from B dataset;
B.definedone();*complete hash table definition;
dcl hiter hi_b('B');
end;
set work.A; *bring in A data;
done = 0;
if B.Find(KEY: nameA) ne 0 then do;
rc = hi_b.first();
do while (rc eq 0 );
if complev(nameA, nameB) lt 10 then do;
put "Found one!" namea= nameb=;
leave;
end;
else call missing(of nameB vday);
rc = hi_b.next();
end;
end;
else put "Found one!" namea=;
RUN;
This will be ... not fast ... if work.B has a lot of rows. It goes over every row of B once for every row of A that doesn't have an exact match.
One thing you can do to make this more efficient is not search all of B. Instead, have some smaller subset of B that you find with an exact match, and then iterate over that smaller subset; instead of using hiter just use the find_next. This may not work for your exact requirements, but if it's feasible, this would be ideal.
Here's one example of doing that. It's not particularly efficient since sex has only two values (so I'm looking through half of the rows anyway), but it does work.
data have;
set sashelp.class;
if mod(_n_,3) eq 0 then do;
name = cats(name,'Z');
end;
if mod(_n_,5) eq 0 then do;
name = cats('Row_',_n_);
end;
run;
data want;
if 0 then set sashelp.classfit;
length name_fuzz $8;
*first define two hash tables - one for exact match, one for fuzzy. Only do this if exact matches are reasonably common;
if _n_ eq 1 then do;
declare hash h_exact(dataset:'sashelp.classfit');
h_exact.defineKey('name');
h_exact.defineData(all:'Y');
h_exact.defineDone();
declare hash h_fuzzy(dataset:'sashelp.classfit(rename=name=name_fuzz keep=name sex predict lower upper lowermean uppermean)',multidata:'y');
h_fuzzy.defineKey('sex');
h_fuzzy.defineData(all:'Y');
h_fuzzy.defineDone();
call missing(name_fuzz);
end;
set have;
*now check exact - if it matches then do not try further;
rc_exact = h_exact.find();
if rc_exact eq 0 then do;
output;
return;
end;
*now try fuzzy - first look up the first match by the chunk criteria;
rc_fuzzy = h_fuzzy.find();
*now iterate over all of the matches of the chunk, and if you find a close-enough match then output that row and stop trying;
do while (rc_fuzzy eq 0);
if complev(name,name_fuzz) lt 2 then do;
output;
return;
end;
rc_fuzzy = h_fuzzy.find_next();
end;
*if you are still here, you failed to find a fuzzy match - so clear the values from the variables you are merging on and output a blank row, assuming you are doing a "left join" [if it is inner join, then just skip these next two lines];
call missing(of predict lower upper lowermean uppermean);
output;
run;
A better version of this would have a more discriminating key for the fuzzy match - the more discriminating the better. The key might not be related at all to your fuzzy match - for example, maybe your fuzzy match is looking for names, but you also have their year of birth. Match on year of birth, then iterate over complev(namea,nameb), since year of birth is quite discriminating.
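As a rough illustration of keying on year of birth, here is a sketch with made-up data. The datasets people_a and people_b, the variables birth_year and name_b, and the complev threshold of 3 are all assumptions chosen for this example.

```sas
data people_a;                  /* hypothetical sample data */
  length name $20;
  input name $ birth_year;
  datalines;
MICKEYMOUSE 2000
DAFDUCK 1990
GOOFYMAN 1993
;
run;

data people_b;                  /* hypothetical lookup data */
  length name_b $20;
  input name_b $ birth_year vday :ddmmyy10.;
  format vday ddmmyy10.;
  datalines;
MICKEYMOUSE 2000 07/08/2021
DAFFYDUCK 1990 05/11/2021
GOOFYMAN 1993 08/11/2021
;
run;

data fuzzy_by_year (drop = rc);
  if 0 then set people_b;
  if _n_ = 1 then do;
    declare hash h(dataset: 'people_b', multidata: 'y');
    h.defineKey('birth_year');          /* exact match on the discriminating key */
    h.defineData('name_b', 'vday');
    h.defineDone();
  end;
  set people_a;
  rc = h.find();
  do while (rc = 0);
    /* fuzzy comparison only within the same birth year */
    if complev(name, name_b) lt 3 then do;
      output;
      return;
    end;
    rc = h.find_next();
  end;
  call missing(name_b, vday);           /* no fuzzy match: keep the row, left-join style */
  output;
run;
```

Each row of people_a is compared only against rows of people_b with the same birth_year, rather than against the whole table.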
I would like to do a hash merge in SAS using two keys.
The key variables in the lookup dataset are link_id (length 8) and ref_date (length 8).
The key variables in the dataset being merged are link_id (length 8) and drug_date (length 8).
The code I used is as follows:
data elig_bene_pres;
length link_id ref_date 8.;
call missing(link_id,ref_date);
if _N_=1 then do;
declare hash elig_bene(dataset:"bene.elig_bene_uid");
elig_bene.defineKey("link_id","ref_date");
elig_bene.defineDone();
end;
set data;
if elig_bene.find(key:Link_ID,key:drug_dt)=0 then output;
run;
But it seems that no matches are found on these two keys. I just want to know whether my method is doable.
Thanks!
There are no obvious problems with the code.
To troubleshoot, try merge-sort: PROC SORT both data sets, then merge them by the two key variables. This will show which values look similar but are not exactly the same.
This sample shows you have the correct approach.
data elig;
input lukey1 lukey2;
datalines;
1 1
1 2
2 4
3 6
3 7
;
run;
data all;
do key1 = 1 to 10; do key2 = 1 to 10;
array x(5) (1:5);
output;
end; end;
run;
data all_elig;
length lukey1 lukey2 8;
call missing (lukey1,lukey2);
if _n_ = 1 then do;
declare hash elig (dataset:"elig");
elig.defineKey ('lukey1','lukey2');
elig.defineDone ();
end;
set all;
if 0 = elig.find(key:key1, key:key2);
run;
The process as shown is not really a merge because the lookup hash has no explicit data elements. The keys are implicit data when no data is specified.
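To make the distinction concrete, here is a sketch of the same lookup with an explicit data element, which turns it into an actual merge. The datasets elig_demo and all_demo and the variable tier are hypothetical, made up for this example.

```sas
data elig_demo;                 /* hypothetical lookup with one data column */
  input lukey1 lukey2 tier $;
  datalines;
1 1 gold
1 2 silver
2 4 gold
;
run;

data all_demo;
  do key1 = 1 to 3;
    do key2 = 1 to 4;
      output;
    end;
  end;
run;

data all_elig_demo (drop = lukey1 lukey2);
  if 0 then set elig_demo;      /* put the hash variables in the PDV */
  if _n_ = 1 then do;
    declare hash elig(dataset: 'elig_demo');
    elig.defineKey('lukey1', 'lukey2');
    elig.defineData('tier');    /* explicit data element brought in by find() */
    elig.defineDone();
  end;
  set all_demo;
  /* keep only rows whose key pair exists; tier is filled in on success */
  if elig.find(key: key1, key: key2) = 0;
run;
```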
If you are selecting all data rows, the first item to troubleshoot is bene.elig_bene_uid. Are its keys accidentally a superset of data's?
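The sort-and-merge check suggested earlier in this answer could be sketched as follows. The datasets elig_chk and all_chk are small hypothetical stand-ins for the lookup and main tables, and source is a made-up flag variable.

```sas
data elig_chk;                  /* hypothetical stand-in for the lookup table */
  input lukey1 lukey2;
  datalines;
1 1
1 2
2 4
;
run;

data all_chk;                   /* hypothetical stand-in for the main table */
  do key1 = 1 to 2;
    do key2 = 1 to 2;
      output;
    end;
  end;
run;

proc sort data = elig_chk out = elig_s;
  by lukey1 lukey2;
run;

/* rename so both tables share the same BY variable names */
proc sort data = all_chk (rename = (key1 = lukey1 key2 = lukey2)) out = all_s;
  by lukey1 lukey2;
run;

data key_check;
  merge elig_s (in = in_lookup) all_s (in = in_data);
  by lukey1 lukey2;
  length source $12;
  if in_lookup and in_data then source = 'both';
  else if in_lookup then source = 'lookup only';
  else source = 'data only';
run;

proc freq data = key_check;
  tables source;
run;
```

Rows flagged as coming from only one side are the ones to inspect for values that look alike but are not exactly equal.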
I was trying to learn hash joins in SAS, but I am stuck on the case where I have multiple tables with the same variable names (not for the key, that's okay, but for the other variables).
I want to join tables A, B and C, each with two variables, Key and Dat. The names Key and Dat are common to all three.
This syntax works for me if I rename Dat in all three tables beforehand to DAT_A, DAT_B, DAT_C, but that defeats the purpose, since I have to read all three tables first, which takes time.
This code works:
data merged(keep=KEY DAT_A DAT_B DAT_C);
if 0 then
set A B C;
if _N_ = 1 then
do;
declare hash A(dataset:'A');
A.defineKey('KEY');
A.defineData('DAT_A');
A.defineDone();
declare hash B(dataset:'B');
B.defineKey('KEY');
B.defineData('DAT_B');
B.defineDone();
end;
set C;
if A.find(key:KEY) = 0 and B.find(key:KEY) = 0 then
output; run;
It's mentioned on the SAS website that you can specify data set options in the hash declare statement, so I thought this might work:
data merged(keep=KEY DAT_A DAT_B DAT_C DAT);
if 0 then
set A B C;
if _N_ = 1 then
do;
declare hash A(dataset:'A (rename=(DAT=DAT_A))');
A.defineKey('KEY');
A.defineData('DAT_A');
A.defineDone();
declare hash B(dataset:'B (rename=(DAT=DAT_B))');
B.defineKey('KEY');
B.defineData('DAT_B');
B.defineDone();
end;
set C (rename=(DAT=DAT_C));
if A.find(key:KEY) = 0 and B.find(key:KEY) = 0 then
output; run;
However running this gives the following error
ERROR: Variable DAT is not on file WORK.A.
ERROR: Hash data set load failed at line 33 column 4.
ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase.
Does anyone have any ideas
Thanks a lot
You are including DAT in the keep= dataset option on your output dataset. But your data step doesn't have the variable DAT anymore. You have renamed all copies of it.
Your error message about dataset A not having DAT is probably because of your earlier attempts to rename the variable to DAT_A.
Here is an example using SASHELP.CLASS.
data merged ;
keep NAME AGE_A AGE_B AGE_C ;
if 0 then set
sashelp.class(rename=(AGE=AGE_A))
sashelp.class(rename=(AGE=AGE_B))
sashelp.class(rename=(AGE=AGE_C))
;
if _N_ = 1 then do;
declare hash A(dataset:'sashelp.class (rename=(AGE=AGE_A) where=(age_a ne 14))');
A.defineKey('NAME');
A.defineData('AGE_A');
A.defineDone();
declare hash B(dataset:'sashelp.class (rename=(AGE=AGE_B) where=(age_b ne 13))');
B.defineKey('NAME');
B.defineData('AGE_B');
B.defineDone();
end;
set sashelp.class (rename=(AGE=AGE_C));
/* if A.find(key:NAME) = 0 and B.find(key:NAME) = 0 then output; */
if A.find(key:NAME) then call missing(age_a);
if B.find(key:NAME) then call missing(age_b);
run;
What would be the data step equivalent of this proc sql?
proc sql;
create table issues2 as(
select request,
area,
sum(issue_count) as issue_count,
sum(resolved_count) as resolved_count
from
issues1
group by request, area
);
quit;
PROC MEANS/SUMMARY is better, but if it's relevant, the actual data step solution is as follows. Basically you just reset the counter to 0 on first.<var> and output on last.<var>, where <var> is the last variable in your by group.
Note: This assumes the data is sorted by request area. Sort it if it is not.
data issues2(rename=(issue_count_sum=issue_count resolved_count_sum=resolved_count) drop=issue_count resolved_count);
set issues1;
by request area;
if first.area then do;
issue_count_sum=0;
resolved_count_sum=0;
end;
issue_count_sum+issue_count;
resolved_count_sum+resolved_count;
if last.area then output;
run;
The functional equivalent of what you're trying to do is the following:
data _null_;
set issues1(rename=(issue_count=_issue_count
resolved_count=_resolved_count)) end=done;
if _n_=1 then do;
declare hash total_issues();
total_issues.defineKey("request", "area");
total_issues.defineData("request", "area", "issue_count", "resolved_count");
total_issues.defineDone();
end;
if total_issues.find() ne 0 then do;
issue_count = _issue_count;
resolved_count = _resolved_count;
end;
else do;
issue_count + _issue_count;
resolved_count + _resolved_count;
end;
total_issues.replace();
if done then total_issues.output(dataset: "issues2");
run;
This method does not require you to pre-sort the dataset. I wanted to see what kind of performance you'd get using different methods, so I did a few tests on a 74M row dataset. I got the following run-times (your results may vary):
Unsorted Dataset:
Proc SQL - 12.18 Seconds
Data Step With Hash Object Method (above) - 26.68 Seconds
Proc Means using a class statement (nway) - 5.13 Seconds
Sorted Dataset (36.94 Seconds to do a proc sort):
Proc SQL - 10.82 Seconds
Proc Means using a by statement - 9.31 Seconds
Proc Means using a class statement (nway) - 6.07 Seconds
Data Step using by statement (I used the code from Joe's answer) - 8.97 Seconds
As you can see, I wouldn't recommend using the data step with the hash object method shown above since it took twice as long as the proc sql.
I'm not sure why proc means with a by statement took longer than proc means with a class statement, but I tried this on a bunch of different datasets and saw similar differences in runtimes (I'm using SAS 9.3 on Linux 64).
Something to keep in mind is that these runtimes might be completely different for your situation, but I would recommend using the following code to do the summation:
proc means data=issues1 noprint nway;
class request area;
var issue_count resolved_count;
output out=issues2(drop=_:) sum=;
run;
Awkward, I think, to do it in a data step at all - summing and resetting variables at each level of the by variables would work. A hash object might also do the trick.
Perhaps the simplest non-Proc SQL method would be to use Proc Summary:-
proc summary data = issues1 nway missing;
class request area;
var issue_count resolved_count;
output out = issues2 sum(issue_count) = issue_count sum(resolved_count) = resolved_count ;
run;
Here's the temporary array method. This is the "simplest" of them, making some assumptions about the request and area values; if those assumptions are faulty, as they often are in real data, it may not be quite as easy as this. Note that while the data below does happen to be sorted, I don't rely on it being sorted and the process doesn't gain any advantage from it being sorted.
data issues1;
do request=1 to 1e5;
do area = 1 to 7;
do issueNum = 1 to 1e2;
issue_count = floor(rand('Uniform')*7);
resolved_count = floor(rand('Uniform')*issue_count);
output;
end;
end;
end;
run;
data issues2;
set issues1 end=done;
array ra_issue[1100000] _temporary_;
array ra_resolved[1100000] _temporary_;
*array index = request||area, so request 9549 area 6 = 95496.;
ra_issue[input(cats(request,area),best7.)] + issue_count;
ra_resolved[input(cats(request,area),best7.)] + resolved_count;
if done then do;
do _t = 1 to dim(ra_issue);
if not missing(ra_issue[_t]) then do;
request = floor(_t/10);
area = mod(_t,10);
issue_count=ra_issue[_t];
resolved_count=ra_resolved[_t];
output;
keep request area issue_count resolved_count;
end;
end;
end;
run;
That performed comparably to PROC MEANS with CLASS, given the simple data I started it with. If you can't trivially generate a key from a combination of area and request (if they're character variables, for example), you would have to store another array of name-to-key relationships which would make it quite a lot slower if there are a lot of combinations (though if there are relatively few combinations, it's not necessarily all that bad). If for some reason you were doing this in production, I would first create a table of unique request+area combinations, create a Format and an Informat to convert back and forth from a unique key (which should be very fast AND give you a reliable index), and then do this using that format/informat rather than the cats / division-modulus that I do here.
Why does my hash exceed the memory limits when I use the replace() method, when the same code without the replace method fits just fine? It seems like the hash would remain the same size either way. I am running the code on unix. In the code below, if I comment out ht.replace() the code runs fine. If I leave it in (don't have it commented out) then I receive a message saying "Hash object added 2490352 items when memory failure occurred." The "series" data set which is fed into the hash has 13 variables and 6912 rows. The "data1" dataset has 26970 rows and 4 columns. Is there any way to resolve this without messing with the memsize?
data _null_;
if 0 then set series;
if _n_ = 1 then do;
declare hash ht(dataset:"series", ordered:"a", multidata:"yes");
rc = ht.defineKey("one", "two", "three");
rc = ht.defineData(all:"yes");
declare hiter hi("ht");
rc = ht.defineDone();
end;
set data1 end=eof;
rc = hi.first();
do while (rc = 0);
if low <= code1 <= high then do;
sum = sum + value1;
ht.replace();
end;
rc = hi.next();
end;
if eof then ht.output(dataset:"sum1");
run;
The problem is probably that your hash is a multidata one, i.e. one key can correspond to many data items. For multidata hashes you have to use the REPLACEDUP method, so that you unambiguously select not only a specific key but also a specific data item within that key.
So your iteration over hash ht should look like this:
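rc = hi.first();
do while (rc = 0);
rc = ht.find(); * position on the first item for the current key;
do while (rc = 0);
if low <= code1 <= high then do;
sum = sum + value1;
ht.replacedup(); * update only the current item within the key;
end;
rc = ht.find_next(); * move to the next item with the same key;
end;
rc = hi.next();
end;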