Hash memory usage with replace() method in SAS

Why does my hash exceed the memory limit when I use the replace() method, when the same code without replace() fits just fine? It seems like the hash should stay the same size either way. I am running the code on Unix. In the code below, if I comment out ht.replace(), the code runs fine. If I leave it in, I receive a message saying "Hash object added 2490352 items when memory failure occurred." The "series" data set that is loaded into the hash has 13 variables and 6912 rows. The "data1" data set has 26970 rows and 4 columns. Is there any way to resolve this without changing MEMSIZE?
data _null_;
if 0 then set series;
if _n_ = 1 then do;
declare hash ht(dataset:"series", ordered:"a", multidata:"yes");
rc = ht.defineKey("one", "two", "three");
rc = ht.defineData(all:"yes");
declare hiter hi("ht");
rc = ht.defineDone();
end;
set data1 end=eof;
rc = hi.first();
do while (rc = 0);
if low <= code1 <= high then do;
sum = sum + value1;
ht.replace();
end;
rc = hi.next();
end;
if eof then ht.output(dataset:"sum1");
run;

The problem is probably that your hash is a multidata hash, i.e. one key can correspond to many data items. For multidata hashes you have to use the REPLACEDUP method, which targets not only a specific key but also a specific data item within that key.
So your iteration over hash ht should look like this:
rc = hi.first();
do while (rc = 0);
rc=ht.find_next();
do while(rc=0);
if low <= code1 <= high then do;
sum = sum + value1;
ht.replacedup();
end;
rc=ht.find_next();
end;
rc = hi.next();
end;
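As a smaller illustration of the same idea, here is a minimal sketch of REPLACEDUP on a toy multidata hash. The data set pairs, its variables k and v, and the output name pairs_updated are all made up for this example; it walks the items under one key with FIND and FIND_NEXT and updates each item in place, so the item count should not grow.

```sas
/* Hypothetical toy data: key 1 has two data items */
data pairs;
  input k v;
  datalines;
1 10
1 20
2 30
;
run;

data _null_;
  length k v 8;
  call missing(k, v);
  declare hash h(dataset:"pairs", multidata:"yes", ordered:"a");
  h.defineKey("k");
  h.defineData("k", "v");
  h.defineDone();

  k = 1;
  rc = h.find();              /* position on the first item for k=1 */
  do while (rc = 0);
    v = v + 1;                /* change the data value in the PDV */
    rc = h.replacedup();      /* write it back to the current item only */
    rc = h.find_next();       /* move to the next item with the same key */
  end;

  n = h.num_items;            /* number of items after the updates */
  put n=;
  h.output(dataset:"pairs_updated");
run;
```

Comparing n with the number of rows in pairs is a quick way to check whether an update is replacing items or adding new ones.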


How Can I Do a Fuzzy Character Merge using Hash Objects in SAS?

I'm testing out how to use hash objects in SAS 9.4 M6 to do fuzzy joins since PROC SQL just runs for hours on my larger dataset. I created some sample datasets (below) and what I want is for the merge to pull in exact matches on the "name" fields AND any matches that have a COMPLEV score less than 10. Right now, this code still only pulls in the exact matches.
I'm very new to hash objects, so I'm sure it's a simple fix, but I've tried and am in need of help.
data A;
infile datalines missover;
length nameA $50;
input nameA $ ;
datalines;
MICKEYMOUSE2000-01-02
DAFDUCK1990-09-23
GOOFYMAN1993-05-11
;
run;
*second dataset with one exact match and two that differ slightly from those in dataset A;
data B;
infile datalines missover;
length nameB $50;
input nameB $ VDAY :ddmmyy10.;
format VDAY ddmmyy10.;
datalines;
MICKEYMOUSE2000-01-01 07/08/2021
DAFFYDUCK1990-09-23 05/11/2021
GOOFYMAN1993-05-11 08/11/2021
;
run;
*only pulling in exact matches, want it to pull in other fuzzy matches;
data simplemerge ;
if 0 then set work.B ; *load var properties into hash table;
if _n_ = 1 then do;
dcl hash B (dataset: 'work.B'); *declare the name B for hash using B dataset;
B.definekey('nameB');*identify var in B to use as key;
B.definedata('nameB','vday');*identify columns of data to bring in from B dataset;
B.definedone();*complete hash table definition;
end;
set work.A; *bring in A data;
if B.Find(KEY: nameA) ne 0 then do;
if complev(nameA, nameB) < 10 then do;
B.ref(key : nameB,data : nameB, data : vday);
end;
end;
RUN;
A fuzzy match in a hash is not necessarily better than SQL; in fact, there's a good chance it's identical, because SQL joins are often done with a hash table behind the scenes.
That said, here's how you'd do the naive hash fuzzy lookup, using a hash iterator (hiter).
data simplemerge ;
if 0 then set work.B ; *load var properties into hash table;
if _n_ = 1 then do;
dcl hash B (dataset: 'work.B'); *declare the name B for hash using B dataset;
B.definekey('nameB');*identify var in B to use as key;
B.definedata('nameB','vday');*identify columns of data to bring in from B dataset;
B.definedone();*complete hash table definition;
dcl hiter hi_b('B');
end;
set work.A; *bring in A data;
done = 0;
if B.Find(KEY: nameA) ne 0 then do;
rc = hi_b.first();
do while (rc eq 0 );
if complev(nameA, nameB) lt 10 then do;
put "Found one!" namea= nameb=;
leave;
end;
else call missing(of nameB vday);
rc = hi_b.next();
end;
end;
else put "Found one!" namea=;
RUN;
This will be ... not fast ... if work.B has a lot of rows. It goes over every row of B once for every row of A that doesn't have an exact match.
One thing you can do to make this more efficient is not search all of B. Instead, have some smaller subset of B that you find with an exact match, and then iterate over that smaller subset; instead of using hiter just use the find_next. This may not work for your exact requirements, but if it's feasible, this would be ideal.
Here's one example of doing that. It's not particularly efficient since sex has only two values (so I'm looking through half of the rows anyway), but it does work.
data have;
set sashelp.class;
if mod(_n_,3) eq 0 then do;
name = cats(name,'Z');
end;
if mod(_n_,5) eq 0 then do;
name = cats('Row_',_n_);
end;
run;
data want;
if 0 then set sashelp.classfit;
length name_fuzz $8;
*first define two hash tables - one for exact match, one for fuzzy. Only do this if exact matches are reasonably common;
if _n_ eq 1 then do;
declare hash h_exact(dataset:'sashelp.classfit');
h_exact.defineKey('name');
h_exact.defineData(all:'Y');
h_exact.defineDone();
declare hash h_fuzzy(dataset:'sashelp.classfit(rename=name=name_fuzz keep=name sex predict lower upper lowermean uppermean)',multidata:'y');
h_fuzzy.defineKey('sex');
h_fuzzy.defineData(all:'Y');
h_fuzzy.defineDone();
call missing(name_fuzz);
end;
set have;
*now check exact - if it matches then do not try further;
rc_exact = h_exact.find();
if rc_exact eq 0 then do;
output;
return;
end;
*now try fuzzy - first look up the first match by the chunk criteria;
rc_fuzzy = h_fuzzy.find();
*now iterate over all of the matches of the chunk, and if you find a close-enough match then output that row and stop trying;
do while (rc_fuzzy eq 0);
if complev(name,name_fuzz) lt 2 then do;
output;
return;
end;
rc_fuzzy = h_fuzzy.find_next();
end;
*if you are still here, you failed to find a fuzzy match - so clear the values from the variables you are merging on and output a blank row, assuming you are doing a "left join" [if it is inner join, then just skip these next two lines];
call missing(of predict lower upper lowermean uppermean);
output;
run;
A better version of this would have a more discriminating key for the fuzzy match: the more discriminating, the better. The key might not be related at all to your fuzzy match. For example, maybe your fuzzy match is looking for names, but you also have their year of birth. Match on year of birth, then iterate over complev(namea,nameb), since year of birth is quite discriminating.
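A rough sketch of that year-of-birth idea follows. The data sets people_a and people_b, their variables, and the COMPLEV threshold of 3 are all invented for illustration; the point is only that the multidata hash is keyed on birth_year, so each lookup iterates over one birth-year group rather than the whole table.

```sas
/* Hypothetical driver table */
data people_a;
  length name $20;
  input name $ birth_year;
  datalines;
DAFDUCK 1990
GOOFYMAN 1993
;
run;

/* Hypothetical lookup table */
data people_b;
  length name_b $20;
  input name_b $ birth_year vday :ddmmyy10.;
  format vday ddmmyy10.;
  datalines;
DAFFYDUCK 1990 05/11/2021
GOOFYMAN 1993 08/11/2021
MICKEYMOUSE 2000 07/08/2021
;
run;

data fuzzy_by_year(drop=rc);
  if 0 then set people_b;              /* add name_b, birth_year, vday to the PDV */
  if _n_ = 1 then do;
    declare hash hb(dataset:'people_b', multidata:'y');
    hb.defineKey('birth_year');        /* the discriminating key */
    hb.defineData('name_b', 'vday');
    hb.defineDone();
  end;
  set people_a;
  rc = hb.find();                      /* first candidate with the same birth year */
  do while (rc = 0);
    if complev(name, name_b) lt 3 then do;
      output;                          /* close enough: keep this match and stop */
      return;
    end;
    rc = hb.find_next();               /* next candidate with the same birth year */
  end;
  call missing(name_b, vday);          /* no match: output a "left join" row */
  output;
run;
```

The threshold and the choice of key would need tuning against real data; a key that is itself error-prone would cause true matches to be missed.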

SAS Hash join returns unexpected results

For a project at work we're joining two large datasets together multiple times. We're using a hash join because it's much faster. Recently we found that the hash join was occasionally returning the wrong value, but we don't really know why. A coworker went through and changed a column name in the hash, so we are now using different names for the returned value and the column from the hash table (instead of Liab_ILF_Factor = ROUND(LIAB_ILF_Factor, .001) we're using Liab_ILF_Factor = ROUND(Liab_ILF_Fact, .001)). That seems to be working, but we're concerned that it happened in the first place and want to make sure we have actually fixed the underlying issue. What was weird was that the join seemed to correctly match several of the fields (industry, state, weight class, type, etc.) but would pull the ILF for, say, 1,000,000 instead of 250,000, which is what it should have pulled.
data liabumuimilffactor (drop=liabumuimilf_factors_dte_key liabumuimilf_factor_state_join veh_wgt_class industry Limit ILF_Bucket Eff_dt Exp_Dt Created_Dt state dte_key
LIAB_ILF_Fact UM_UIM_ILF_Fact PIP_ILF_Fact);
if 0 Then Set ilf_input Liab_UM_UIM_ILF_Factor;
if _N_ = 1 then do;
Declare Hash ILF_Factor_Hash(Dataset:"Liab_UM_UIM_ILF_Factor");
ILF_Factor_Hash.DefineKey('state', 'dte_key', 'Veh_Type', 'Veh_Wgt_Class', 'Industry', 'Limit');
ILF_Factor_Hash.DefineData('ILF_Bucket', 'LIAB_ILF_Fact', 'UM_UIM_ILF_Fact', 'PIP_ILF_Fact');
ILF_Factor_Hash.DefineDone();
end;
set ilf_input;
if ILF_Factor_Hash.Find(Key:liabumuimilf_factor_state_join, Key:liabumuimilf_factors_dte_key, Key:Veh_Type, Key:Veh_Wgt_Class_Mapped, Key:IndustryGroup, Key:Liab_Limit) = 0 then do;
Liab_ILF_Factor = round(LIAB_ILF_Fact, .001);
Liab_ILF_Bucket = ILF_Bucket;
end;
else do;
Liab_ILF_Factor = .;
Liab_ILF_Bucket = "";
end;
if ILF_Factor_Hash.Find(Key:liabumuimilf_factor_state_join, Key:liabumuimilf_factors_dte_key, Key:Veh_Type, Key:Veh_Wgt_Class_Mapped, Key:IndustryGroup, Key:UM_UIM_Limit) = 0 then do;
UM_UIM_ILF_Factor = round(UM_UIM_ILF_Fact, .001);
UM_UIM_ILF_Bucket = ILF_Bucket;
end;
else do;
UM_UIM_ILF_Factor = .;
UM_UIM_ILF_Bucket = "";
end;
if ILF_Factor_Hash.Find(Key:liabumuimilf_factor_state_join, Key:liabumuimilf_factors_dte_key, Key:Veh_Type, Key:Veh_Wgt_Class_Mapped, Key:IndustryGroup, Key:PIP_Limit) = 0 then do;
PIP_ILF_Factor = round(PIP_ILF_Fact, .001);
PIP_ILF_Bucket = ILF_Bucket;
end;
else do;
PIP_ILF_Factor = .;
PIP_ILF_Bucket = "";
end;
run;
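One possible explanation, offered as a guess from the description rather than a confirmed diagnosis: SAS variable names are case-insensitive, so if the hash data variable was originally named LIAB_ILF_Factor, it was the same variable as the result Liab_ILF_Factor. Each later find() with a different limit would then overwrite it. The sketch below reproduces that effect with made-up names (factors, limit, factor, liab_limit, um_limit).

```sas
/* Hypothetical two-row lookup table: one factor per limit */
data factors;
  input limit factor;
  datalines;
250000 1.10
1000000 1.45
;
run;

data _null_;
  length limit factor 8;
  call missing(limit, factor);
  declare hash h(dataset:"factors");
  h.defineKey("limit");
  h.defineData("factor");
  h.defineDone();

  liab_limit = 250000;
  um_limit   = 1000000;

  /* FACTOR on the left and factor on the right are the same variable */
  if h.find(key: liab_limit) = 0 then FACTOR = round(factor, .001);

  /* This second find() loads the 1,000,000 factor into that same variable */
  if h.find(key: um_limit) = 0 then um_factor = round(factor, .001);

  /* Both values now hold the factor for the 1,000,000 limit */
  put factor= um_factor=;
run;
```

If this is what happened, giving the hash data variables names that differ from every result variable (as in the renamed LIAB_ILF_Fact version above) avoids the overwrite.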

Can I do hash merge by multiple keys in SAS

I would like to do a hash merge in SAS using two keys.
The key variables in the lookup dataset are link_id and ref_date (both 8.).
The key variables in the dataset being merged are link_id and drug_date (both 8.).
The code I used is as follows:
data elig_bene_pres;
length link_id ref_date 8.;
call missing(link_id,ref_date);
if _N_=1 then do;
declare hash elig_bene(dataset:"bene.elig_bene_uid");
elig_bene.defineKey("link_id","ref_date");
elig_bene.defineDone();
end;
set data;
if elig_bene.find(key:Link_ID,key:drug_dt)=0 then output;
run;
But it seems that it is not found by these two keys. I just want to know whether my method is doable.
Thanks!
There are no obvious problems with the code.
To troubleshoot, try merge-sort: PROC SORT both data sets, then merge them by the two key variables. This will show which values look similar but are not exactly the same.
This sample shows you have the correct approach.
data elig;
input lukey1 lukey2;
datalines;
1 1
1 2
2 4
3 6
3 7
;
run;
data all;
do key1 = 1 to 10; do key2 = 1 to 10;
array x(5) (1:5);
output;
end; end;
run;
data all_elig;
length lukey1 lukey2 8;
call missing (lukey1,lukey2);
if _n_ = 1 then do;
declare hash elig (dataset:"elig");
elig.defineKey ('lukey1','lukey2');
elig.defineDone ();
end;
set all;
if 0 = elig.find(key:key1, key:key2);
run;
The process as shown is not really a merge because the lookup hash has no explicit data elements. The keys are implicit data when no data is specified.
If you are selecting all data rows, the first item to troubleshoot is bene.elig_bene_uid. Are its keys accidentally a superset of data's?
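The sort-and-merge check suggested above could look roughly like this. The tiny lu and detail data sets are invented stand-ins for the lookup and the main data; the key in the second one is renamed so both sides share the same BY variables, and the IN= flags show which side each key combination came from.

```sas
/* Hypothetical stand-in for the lookup table */
data lu;
  input link_id ref_date;
  datalines;
1 100
2 200
;
run;

/* Hypothetical stand-in for the data being merged */
data detail;
  input link_id drug_dt;
  datalines;
1 100
2 201
3 300
;
run;

proc sort data=lu;
  by link_id ref_date;
run;

/* Rename on input so the BY variables match the lookup */
proc sort data=detail(rename=(drug_dt=ref_date)) out=detail_s;
  by link_id ref_date;
run;

data check;
  merge lu(in=in_lu) detail_s(in=in_detail);
  by link_id ref_date;
  length source $12;
  if in_lu and in_detail then source = 'both';
  else if in_lu then source = 'lookup only';
  else source = 'detail only';
run;

proc print data=check;
run;
```

Rows marked 'lookup only' or 'detail only' that sit next to each other with nearly identical keys are the ones worth inspecting, for example dates that differ by a time component or a different numeric representation.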

Renaming Variables in HASH merge in SAS

I was trying to learn hash joins in SAS, but I am stuck on the case where I have multiple tables with the same variable names (not for the key, that's okay, but for the other variables).
I want to join tables A, B and C, each with two variables, Key and Dat. The names Key and Dat are common to all three.
This syntax works for me if I rename Dat in all three tables beforehand to DAT_A, DAT_B, DAT_C, but that defeats the purpose, since I have to process all three tables first, which takes time.
This code works:
data merged(keep=KEY DAT_A DAT_B DAT_C);
if 0 then
set A B C;
if _N_ = 1 then
do;
declare hash A(dataset:'A');
A.defineKey('KEY');
A.defineData('DAT_A');
A.defineDone();
declare hash B(dataset:'B');
B.defineKey('KEY');
B.defineData('DAT_B');
B.defineDone();
end;
set C;
if A.find(key:KEY) = 0 and B.find(key:KEY) = 0 then
output; run;
It's mentioned on the SAS website that you can specify data set options in the hash declare statement, so I thought this might work:
data merged(keep=KEY DAT_A DAT_B DAT_C DAT);
if 0 then
set A B C;
if _N_ = 1 then
do;
declare hash A(dataset:'A (rename=(DAT=DAT_A))');
A.defineKey('KEY');
A.defineData('DAT_A');
A.defineDone();
declare hash B(dataset:'B (rename=(DAT=DAT_B))');
B.defineKey('KEY');
B.defineData('DAT_B');
B.defineDone();
end;
set C (rename=(DAT=DAT_C));
if A.find(key:KEY) = 0 and B.find(key:KEY) = 0 then
output; run;
However running this gives the following error
ERROR: Variable DAT is not on file WORK.A.
ERROR: Hash data set load failed at line 33 column 4.
ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase.
Does anyone have any ideas?
Thanks a lot
You are including DAT in the keep= dataset option on your output dataset. But your data step doesn't have the variable DAT anymore. You have renamed all copies of it.
Your error message about dataset A not having DAT is probably because of your earlier attempts to rename the variable to DAT_A.
Here is an example using SASHELP.CLASS.
data merged ;
keep NAME AGE_A AGE_B AGE_C ;
if 0 then set
sashelp.class(rename=(AGE=AGE_A))
sashelp.class(rename=(AGE=AGE_B))
sashelp.class(rename=(AGE=AGE_C))
;
if _N_ = 1 then do;
declare hash A(dataset:'sashelp.class (rename=(AGE=AGE_A) where=(age_a ne 14))');
A.defineKey('NAME');
A.defineData('AGE_A');
A.defineDone();
declare hash B(dataset:'sashelp.class (rename=(AGE=AGE_B) where=(age_b ne 13))');
B.defineKey('NAME');
B.defineData('AGE_B');
B.defineDone();
end;
set sashelp.class (rename=(AGE=AGE_C));
/* if A.find(key:NAME) = 0 and B.find(key:NAME) = 0 then output; */
if A.find(key:NAME) then call missing(age_a);
if B.find(key:NAME) then call missing(age_b);
run;

SAS hash objects: multidata merge

I'm new to hash objects, but I'd like to learn more about them. I'm trying to find ways to substitute all possible proc sql and regular merges with hash whenever possible. While playing around with SASHELP datasets, I ran into the following issue:
Let's say I have a dataset of 10 unique observations (car manufacturer) and I want to match it up with another table that contains various models of these cars, so the car make repeats in that table. The other important aspect to note is that not all car makes are present in the table I'm looking up, but I still would like to retain those in my table.
Consider the code below:
proc sql noprint;
create table x as select distinct make
from sashelp.cars;
quit;
data x;
set x (obs = 10);
if make = "GMC" then make = "XYZ";
run;
data hx (drop = rc);
if 0 then set sashelp.cars(keep = make model);
if _n_ = 1 then do;
declare hash hhh(dataset: 'sashelp.cars(keep = make model)', multidata:'y');
hhh.DefineKey('make');
hhh.DefineData('model');
hhh.DefineDone();
end;
set x;
rc = hhh.find();
do while(rc = 0);
output;
rc = hhh.find_next();
end;
if rc ne 0 then do;
call missing(model);
output;
end;
run;
If all makes in table X were also in table cars, then removing the output statement after call missing(model) would do exactly what I want. But I also want to make sure that make "XYZ" remains in the table.
The existing code, however, produces a blank row after it finds all matching models, like so:
make model
==========
Acura MDX
Acura RSX Type S 2dr
Acura TSX 4dr
... (skipping a few rows)
Acura NSX coupe 2dr manual S
Acura
Audi A4 1.8T 4dr
As you can see in the table above, there is a missing model in the second-to-last row. This pattern appears at the end of every make.
Any suggestions on how to fix this would be highly appreciated!
Many thanks
The direct answer: you need to consider this section.
rc = hhh.find();
do while(rc = 0);
output;
rc = hhh.find_next();
end;
if rc ne 0 then do;
call missing(model);
output;
end;
What's happening here is you are repeatedly trying to find next, fine, until you fail. Okay. Now you're in rc ne 0 condition, though, even though you really mean that last step to only be used if you didn't even find one.
You can handle this a couple of ways. You can do this:
rc = hhh.find();
if rc ne 0 then do;
call missing(model);
output;
end;
else
do while(rc = 0);
output;
rc = hhh.find_next();
end;
Or, you can add a counter to the do while loop, and then execute the call missing/output if that counter stores a 0. The above is probably easier.
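The counter variant might be sketched as follows. The counter name n_found is made up for this example, and the first two steps simply rebuild table x as in the question so the sketch runs on its own.

```sas
/* Rebuild table x as in the question */
proc sql noprint;
  create table x as select distinct make
  from sashelp.cars;
quit;
data x;
  set x (obs = 10);
  if make = "GMC" then make = "XYZ";
run;

data hx (drop = rc n_found);
  if 0 then set sashelp.cars(keep = make model);
  if _n_ = 1 then do;
    declare hash hhh(dataset: 'sashelp.cars(keep = make model)', multidata:'y');
    hhh.DefineKey('make');
    hhh.DefineData('model');
    hhh.DefineDone();
  end;
  set x;
  n_found = 0;                       /* reset for every make */
  rc = hhh.find();
  do while(rc = 0);
    n_found = n_found + 1;           /* count the matches for this make */
    output;
    rc = hhh.find_next();
  end;
  if n_found = 0 then do;            /* only when nothing matched at all */
    call missing(model);
    output;
  end;
run;
```

The counter makes the "no match at all" case explicit, which can be handy if the number of matches is also wanted in the output.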
Further, you should probably consider whether a hash is the right solution for this problem. While it is possible to solve this with multidata hashes, a keyed SET (a SET statement with the KEY= option) is usually more efficient for something like this, and much easier to code.
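For comparison, a keyed SET version might look something like the sketch below. It assumes an indexed copy of the lookup table (called cars_idx here, a made-up name), because SET with KEY= needs an index, and it relies on each make appearing only once in x. The first two steps rebuild x as in the question.

```sas
/* Rebuild table x as in the question */
proc sql noprint;
  create table x as select distinct make
  from sashelp.cars;
quit;
data x;
  set x (obs = 10);
  if make = "GMC" then make = "XYZ";
run;

/* Assumption: an indexed copy of the lookup table */
data cars_idx(index=(make));
  set sashelp.cars(keep = make model);
run;

data hx_keyed(drop = found);
  set x;
  found = 0;
  do until (_iorc_ ne 0);
    set cars_idx key=make;           /* fetch the next row for this make */
    if _iorc_ = 0 then do;
      found = 1;
      output;
    end;
  end;
  _error_ = 0;                       /* a failed lookup sets _ERROR_; clear it */
  if not found then do;              /* make not in the lookup at all */
    call missing(model);
    output;
  end;
run;
```

If the same make could appear on consecutive rows of x, the lookup would need extra care (for example the /UNIQUE modifier on KEY=), so treat this as a starting point rather than a drop-in replacement.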