Renaming Variables in HASH merge in SAS

I am trying to learn hash joins in SAS, but I am stuck on the case where several tables share the same variable names (not the key, which is fine, but the other variables).
I want to join tables A, B and C, each with two variables, KEY and DAT. The names KEY and DAT are common to all three.
The syntax below works if I first rename DAT in all three tables to DAT_A, DAT_B and DAT_C, but that defeats the purpose, since I have to read and rewrite all three tables beforehand, which takes time.
This code works:
data merged(keep=KEY DAT_A DAT_B DAT_C);
if 0 then
set A B C;
if _N_ = 1 then
do;
declare hash A(dataset:'A');
A.defineKey('KEY');
A.defineData('DAT_A');
A.defineDone();
declare hash B(dataset:'B');
B.defineKey('KEY');
B.defineData('DAT_B');
B.defineDone();
end;
set C;
if A.find(key:KEY) = 0 and B.find(key:KEY) = 0 then
output; run;
The SAS website mentions that you can specify dataset options in the dataset argument of the hash declaration, so I thought this might work:
data merged(keep=KEY DAT_A DAT_B DAT_C DAT);
if 0 then
set A B C;
if _N_ = 1 then
do;
declare hash A(dataset:'A (rename=(DAT=DAT_A))');
A.defineKey('KEY');
A.defineData('DAT_A');
A.defineDone();
declare hash B(dataset:'B (rename=(DAT=DAT_B))');
B.defineKey('KEY');
B.defineData('DAT_B');
B.defineDone();
end;
set C (rename=(DAT=DAT_C));
if A.find(key:KEY) = 0 and B.find(key:KEY) = 0 then
output; run;
However, running this gives the following error:
ERROR: Variable DAT is not on file WORK.A.
ERROR: Hash data set load failed at line 33 column 4.
ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase.
Does anyone have any ideas?
Thanks a lot

You are including DAT in the keep= dataset option on your output dataset, but your data step no longer has a variable DAT: you have renamed every copy of it.
Your error message about dataset A not having DAT is probably left over from your earlier attempts to rename the variable to DAT_A, so the stored WORK.A no longer has a DAT to rename.
Here is an example using SASHELP.CLASS.
data merged ;
keep NAME AGE_A AGE_B AGE_C ;
if 0 then set
sashelp.class(rename=(AGE=AGE_A))
sashelp.class(rename=(AGE=AGE_B))
sashelp.class(rename=(AGE=AGE_C))
;
if _N_ = 1 then do;
declare hash A(dataset:'sashelp.class (rename=(AGE=AGE_A) where=(age_a ne 14))');
A.defineKey('NAME');
A.defineData('AGE_A');
A.defineDone();
declare hash B(dataset:'sashelp.class (rename=(AGE=AGE_B) where=(age_b ne 13))');
B.defineKey('NAME');
B.defineData('AGE_B');
B.defineDone();
end;
set sashelp.class (rename=(AGE=AGE_C));
/* if A.find(key:NAME) = 0 and B.find(key:NAME) = 0 then output; */
if A.find(key:NAME) then call missing(age_a);
if B.find(key:NAME) then call missing(age_b);
run;
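Applied to the tables from the question, the same pattern might look like the sketch below. It assumes A, B and C each still contain KEY and DAT (that is, DAT has not already been renamed in the stored datasets); the small input tables and the hash names hA and hB are made up for illustration.

```sas
/* Hypothetical sample tables, each with KEY and DAT */
data A;
  input KEY DAT;
  datalines;
1 10
2 20
3 30
;
data B;
  input KEY DAT;
  datalines;
1 100
3 300
;
data C;
  input KEY DAT;
  datalines;
1 1000
2 2000
3 3000
;

data merged(keep=KEY DAT_A DAT_B DAT_C);
  /* Put the renamed variables into the PDV without reading any rows */
  if 0 then set A(rename=(DAT=DAT_A))
                B(rename=(DAT=DAT_B))
                C(rename=(DAT=DAT_C));
  if _N_ = 1 then do;
    /* The rename happens while the hash is loaded, so no pre-processing pass is needed */
    declare hash hA(dataset:'A(rename=(DAT=DAT_A))');
    hA.defineKey('KEY');
    hA.defineData('DAT_A');
    hA.defineDone();
    declare hash hB(dataset:'B(rename=(DAT=DAT_B))');
    hB.defineKey('KEY');
    hB.defineData('DAT_B');
    hB.defineDone();
  end;
  set C(rename=(DAT=DAT_C));
  /* Inner join: keep the row only when KEY is found in both hashes */
  if hA.find(key:KEY) = 0 and hB.find(key:KEY) = 0 then output;
run;
```

Note that DAT is absent from the keep= list, since no variable with that name exists once every copy has been renamed.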

Related

How Can I Do a Fuzzy Character Merge using Hash Objects in SAS?

I'm testing out how to use hash objects in SAS 9.4 M6 to do fuzzy joins, since PROC SQL just runs for hours on my larger dataset. I created some sample datasets (below), and what I want is for the merge to pull in exact matches on the "name" fields AND any matches that have a COMPLEV score less than 10. Right now, this code still only pulls in the exact matches.
I'm very new to hash objects, so I'm sure it's a simple fix, but I've tried and am in need of help.
data A;
infile datalines missover;
length nameA $50;
input nameA $ ;
datalines;
MICKEYMOUSE2000-01-02
DAFDUCK1990-09-23
GOOFYMAN1993-05-11
;
run;
*second dataset with one exact match and two that differ slightly from those in dataset A;
data B;
infile datalines missover;
length nameB $50;
input nameB $ VDAY :ddmmyy10.;
format VDAY ddmmyy10.;
datalines;
MICKEYMOUSE2000-01-01 07/08/2021
DAFFYDUCK1990-09-23 05/11/2021
GOOFYMAN1993-05-11 08/11/2021
;
run;
*only pulling in exact matches, want it to pull in other fuzzy matches;
data simplemerge ;
if 0 then set work.B ; *load var properties into hash table;
if _n_ = 1 then do;
dcl hash B (dataset: 'work.B'); *declare the name B for hash using B dataset;
B.definekey('nameB');*identify var in B to use as key;
B.definedata('nameB','vday');*identify columns of data to bring in from B dataset;
B.definedone();*complete hash table definition;
end;
set work.A; *bring in A data;
if B.Find(KEY: nameA) ne 0 then do;
if complev(nameA, nameB) < 10 then do;
B.ref(key : nameB,data : nameB, data : vday);
end;
end;
RUN;
A fuzzy match in a hash is not necessarily better than SQL; in fact, there is a good chance the performance is identical, since SQL joins are often done with a hash table behind the scenes.
That said, here's how you'd do the naive hash fuzzy lookup, with a hash iterator (hiter).
data simplemerge ;
if 0 then set work.B ; *load var properties into hash table;
if _n_ = 1 then do;
dcl hash B (dataset: 'work.B'); *declare the name B for hash using B dataset;
B.definekey('nameB');*identify var in B to use as key;
B.definedata('nameB','vday');*identify columns of data to bring in from B dataset;
B.definedone();*complete hash table definition;
dcl hiter hi_b('B');
end;
set work.A; *bring in A data;
done = 0;
if B.Find(KEY: nameA) ne 0 then do;
rc = hi_b.first();
do while (rc eq 0 );
if complev(nameA, nameB) lt 10 then do;
put "Found one!" namea= nameb=;
leave;
end;
else call missing(of nameB vday);
rc = hi_b.next();
end;
end;
else put "Found one!" namea=;
RUN;
This will be ... not fast ... if work.B has a lot of rows. It goes over every row of B once for every row of A that doesn't have an exact match.
One thing you can do to make this more efficient is not search all of B. Instead, have some smaller subset of B that you find with an exact match, and then iterate over that smaller subset; instead of using hiter just use the find_next. This may not work for your exact requirements, but if it's feasible, this would be ideal.
Here's one example of doing that. It's not particularly efficient since sex has only two values (so I'm looking through half of the rows anyway), but it does work.
data have;
set sashelp.class;
if mod(_n_,3) eq 0 then do;
name = cats(name,'Z');
end;
if mod(_n_,5) eq 0 then do;
name = cats('Row_',_n_);
end;
run;
data want;
if 0 then set sashelp.classfit;
length name_fuzz $8;
*first define two hash tables - one for exact match, one for fuzzy. Only do this if exact matches are reasonably common;
if _n_ eq 1 then do;
declare hash h_exact(dataset:'sashelp.classfit');
h_exact.defineKey('name');
h_exact.defineData(all:'Y');
h_exact.defineDone();
declare hash h_fuzzy(dataset:'sashelp.classfit(rename=name=name_fuzz keep=name sex predict lower upper lowermean uppermean)',multidata:'y');
h_fuzzy.defineKey('sex');
h_fuzzy.defineData(all:'Y');
h_fuzzy.defineDone();
call missing(name_fuzz);
end;
set have;
*now check exact - if it matches then do not try further;
rc_exact = h_exact.find();
if rc_exact eq 0 then do;
output;
return;
end;
*now try fuzzy - first look up the first match by the chunk criteria;
rc_fuzzy = h_fuzzy.find();
*now iterate over all of the matches of the chunk, and if you find a close-enough match then output that row and stop trying;
do while (rc_fuzzy eq 0);
if complev(name,name_fuzz) lt 2 then do;
output;
return;
end;
rc_fuzzy = h_fuzzy.find_next();
end;
*if you are still here, you failed to find a fuzzy match - so clear the values from the variables you are merging on and output a blank row, assuming you are doing a "left join" [if it is inner join, then just skip these next two lines];
call missing(of predict lower upper lowermean uppermean);
output;
run;
A better version of this would have a more discriminating key for the fuzzy match; the more discriminating, the better. The key might not be related at all to your fuzzy match. For example, maybe your fuzzy match is looking for names, but you also have their year of birth. Match on year of birth, then iterate over complev(namea,nameb), since year of birth is quite discriminating.
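A minimal sketch of that idea, assuming the year of birth is available as its own variable. The datasets names_a and names_b, the variables name, name_b and yob, and the COMPLEV cutoff of 3 are all invented for illustration and are not taken from the question's data layout.

```sas
/* Hypothetical inputs: one row per person, with year of birth as a separate variable */
data names_a;
  input name :$20. yob;
  datalines;
MICKEYMOUSE 2000
DAFDUCK 1990
GOOFYMAN 1993
;
data names_b;
  input name_b :$20. yob vday :ddmmyy10.;
  format vday ddmmyy10.;
  datalines;
MICKEYMOUSE 2000 07/08/2021
DAFFYDUCK 1990 05/11/2021
GOOFYMAN 1993 08/11/2021
;

data fuzzy(drop=rc);
  if 0 then set names_b;
  if _n_ = 1 then do;
    /* Block on year of birth; multidata keeps every candidate for a given year */
    declare hash h(dataset:'names_b', multidata:'y');
    h.defineKey('yob');
    h.defineData('name_b', 'vday');
    h.defineDone();
  end;
  set names_a;
  found = 0;
  rc = h.find();                       /* first candidate with the same yob */
  do while (rc = 0 and not found);
    if complev(name, name_b) < 3 then found = 1;   /* arbitrary cutoff */
    else rc = h.find_next();           /* next candidate with the same yob */
  end;
  /* Left join: keep the row, but blank the lookup values when nothing was close enough */
  if not found then call missing(name_b, vday);
run;
```

Only the candidates sharing the same yob are compared, so the cost per row depends on the size of the block rather than on the size of the whole lookup table.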

Can I do hash merge by multiple keys in SAS

I would like to do a hash merge in SAS using two keys.
The lookup dataset has the variables link_id 8. and ref_date 8.
The dataset to be merged has the variables link_id 8. and drug_date 8.
The code I used is as follows:
data elig_bene_pres;
length link_id ref_date 8.;
call missing(link_id,ref_date);
if _N_=1 then do;
declare hash elig_bene(dataset:"bene.elig_bene_uid");
elig_bene.defineKey("link_id","ref_date");
elig_bene.defineDone();
end;
set data;
if elig_bene.find(key:Link_ID,key:drug_dt)=0 then output;
run;
But it seems that nothing is found by these two keys. I just want to know whether my method is workable.
Thanks!
There are no obvious problems with the code.
To troubleshoot, try merge-sort: PROC SORT both data sets, then merge them by the two key variables. This will show which values look similar but are not exactly the same.
This sample shows you have the correct approach.
data elig;
input lukey1 lukey2;
datalines;
1 1
1 2
2 4
3 6
3 7
run;
data all;
do key1 = 1 to 10; do key2 = 1 to 10;
array x(5) (1:5);
output;
end; end;
run;
data all_elig;
length lukey1 lukey2 8;
call missing (lukey1,lukey2);
if _n_ = 1 then do;
declare hash elig (dataset:"elig");
elig.defineKey ('lukey1','lukey2');
elig.defineDone ();
end;
set all;
if 0 = elig.find(key:key1, key:key2);
run;
The process as shown is not really a merge because the lookup hash has no explicit data elements. The keys are implicit data when no data is specified.
If you are selecting all data rows, the first thing to troubleshoot is bene.elig_bene_uid. Are its keys accidentally a superset of data's?
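The sort-and-merge check suggested above could be sketched as follows. The tables lookup and trans are small hypothetical stand-ins for bene.elig_bene_uid and the transaction dataset, and the flag variable source is invented for this example.

```sas
/* Hypothetical stand-ins for the lookup table and the table being filtered */
data lookup;
  input link_id ref_date;
  datalines;
1 100
2 200
;
data trans;
  input link_id drug_dt;
  datalines;
1 100
2 201
;

/* Sort both sides by the same pair of keys; rename drug_dt so the BY variables line up */
proc sort data=lookup(keep=link_id ref_date) out=lk nodupkey;
  by link_id ref_date;
run;
proc sort data=trans(keep=link_id drug_dt rename=(drug_dt=ref_date)) out=tr;
  by link_id ref_date;
run;

/* Flag where each key combination comes from */
data check;
  merge lk(in=in_lookup) tr(in=in_trans);
  by link_id ref_date;
  length source $12;
  if in_lookup and in_trans then source = 'both';
  else if in_lookup then source = 'lookup only';
  else source = 'trans only';
run;

proc freq data=check;
  tables source;
run;
```

Rows flagged as coming from only one side show key values that look alike but do not compare equal, for example dates stored as datetimes on one side.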

SAS - Find Palindromes for all Datasets in a Directory

Task:
I need to identify all palindromes within a directory. I use a proc contents and proc sort to identify the datasets within a directory, like so:
proc contents data = dPath._all_ out = dFiles (keep = memname);
run;
proc sort data = dFiles nodupkey; by memname;run;
I want to identify palindromes within this directory.
Issue:
I plan to use macros because I need to do this for all datasets within a directory. So, instead of the user inputting the string to check if there is a palindrome, I need that to be done dynamically, i.e. identify any palindromes within a dataset.
Updates:
As you can see in the above pictures, I am able to successfully flag the palindromes for case sensitive and case insensitive situations. I would like to output the specific element that is a palindrome to a separate dataset. Currently, I am only able to output the entire row with the palindrome in it.
Code:
data palindrome_set (drop = i) palindrome_case_sensitive palindrome_case_insensitive;
set reverse_rows;
array palindrome[*] _all_ ;
do i = 1 to dim(palindrome);
palindrome_cs = (trim(palindrome[i]) eq reverse(trim(palindrome[i])));
/* if palindrome_cs = 1 then output palindrome[i]; WANT TO OUTPUT SPECIFIC ELEMENT, NOT ENTIRE ROW*/
palindrome_cis = (lowcase(trim(palindrome[i])) eq reverse(lowcase(trim(palindrome[i]))));
end;
output palindrome_set;
if palindrome_cs = 1 then output palindrome_case_sensitive; *WANT TO OUTPUT SPECIFIC ELEMENT, NOT ENTIRE ROW;
if palindrome_cis = 1 then output palindrome_case_insensitive; *WANT TO OUTPUT SPECIFIC ELEMENT, NOT ENTIRE ROW;
run;
If memtype = "DATA", then Memname in your code will hold the table names only.
To check for palindromes in the table names using your code above, try:
%macro palindrome(parameter = );
/* expands to a data step expression that is 1 when &parameter reads the same in both directions, ignoring case, spaces and punctuation */
(upcase(strip(compress(&parameter, , 'sp'))) = upcase(reverse(strip(compress(&parameter, , 'sp')))))
%mend palindrome;
data work.palindromes;
set work.dfiles;
if %palindrome(parameter = Memname); * keep only the rows whose table name is a palindrome;
run;
Not sure why your image showed a reversal of the variable names as well.
The underlying variable name corresponding to an array reference can be retrieved using the data step function VNAME(). Also, the formatted value of a variable can be obtained using the data step function VVALUE. Both these functions have a dynamic version -- VNAMEX and VVALUEX. An array based solution will not need to utilize the X versions of the functions.
Processing all variables via an array is a little tricky because you need additional variables to perform the processing, and you don't want those tested for palindromicity. In this example, worker variable names use the convention of starting with _pal in the hopes of avoiding variable name collision with the data sets being processed. The example processes a single data set, but it should be obvious how to macro-ize the code and have it work for a data set name that is passed.
data want(keep=_palds_ _palrow_ _palvar_ _palval_);
set sashelp.class;
array _pals_ _character_; * array elements are those character variables in the pdv at this point in the data step;
array _palx_ _numeric_; * array elements are those numeric variables in the pdv at this point in the data step;
attrib
_palds_ length = $42
_palrow_ length = 8
_palvar_ length = $32
_palval_ length = $500
;
* check raw character value;
do _palindex_ = 1 to dim(_pals_);
if length (_pals_(_palindex_)) > lengthm(_palval_) then do;
_palvar_ = vname (_pals_(_palindex_));
put "NOTE: sashelp.class " _n_= _palVar_ " had a value that is longer than _palval_ container";
continue;
end;
if _pals_(_palindex_) = reverse(trim(_pals_(_palindex_))) then do;
_palds_ = "sashelp.class";
_palrow_ = _n_;
_palvar_ = vname (_pals_(_palindex_));
_palval_ = _pals_(_palindex_);
output; * write one row per palindromic value rather than one row per input row;
end;
end;
* check formatted numeric value;
do _palindex_ = 1 to dim(_palx_);
if left(vvalue(_palx_(_palindex_))) = reverse(trim(vvalue(_palx_(_palindex_)))) then do;
_palds_ = "sashelp.class";
_palrow_ = _n_;
_palvar_ = vname (_palx_(_palindex_));
_palval_ = left(vvalue(_palx_(_palindex_))); * store the formatted value that was tested;
output;
end;
end;
run;
A macro that wants to explicitly avoid name collision must perform some navel-gazing on the data set to be processed in order to generate worker variable names that do not collide.
Processing all members of the libref can be very resource intensive if the libref connects to a remote host, so a robust solution may want to skip over those.
Some other approaches:
use CALL VNEXT routine to iterate through the pdv variables
use the dictionary table or proc contents output as a basis for generating a wall of variable tests in a data step that does not rely on arrays.
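The CALL VNEXT approach could be sketched roughly as below. This is an illustration under stated assumptions rather than a tested solution: the output name pal_vnext is invented, the _pal prefix convention from the example above is reused to skip the worker variables, and the automatic variables _N_ and _ERROR_ are skipped by name.

```sas
data pal_vnext(keep=_palrow_ _palvar_ _palval_);
  set sashelp.class;
  length _palvar_ $32 _palval_ $200;
  _palrow_ = _n_;
  _palvar_ = ' ';                 /* a blank input restarts the variable list */
  do until (_palvar_ = ' ');
    call vnext(_palvar_);         /* next variable name; blank once the list is exhausted */
    /* skip the worker variables and the automatic variables */
    if _palvar_ ne ' '
       and not (upcase(_palvar_) =: '_PAL')
       and upcase(_palvar_) not in ('_N_', '_ERROR_') then do;
      _palval_ = strip(vvaluex(_palvar_));   /* formatted value of the named variable */
      if _palval_ ne ' ' and _palval_ = reverse(trim(_palval_)) then output;
    end;
  end;
run;
```

Because VVALUEX returns the formatted value for both character and numeric variables, one loop covers both types, at the cost of a name lookup on every call.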

SAS hash objects: multidata merge

I'm new to hash objects, but I'd like to learn more about them. I'm trying to find ways to substitute all possible proc sql and regular merges with hash whenever possible. While playing around with SASHELP datasets, I ran into the following issue:
Let's say I have a dataset of 10 unique observations (car manufacturer) and I want to match it up with another table that contains various models of these cars, so the car make repeats in that table. The other important aspect to note is that not all car makes are present in the table I'm looking up, but I still would like to retain those in my table.
Consider the code below:
proc sql noprint;
create table x as select distinct make
from sashelp.cars;
quit;
data x;
set x (obs = 10);
if make = "GMC" then make = "XYZ";
run;
data hx (drop = rc);
if 0 then set sashelp.cars(keep = make model);
if _n_ = 1 then do;
declare hash hhh(dataset: 'sashelp.cars(keep = make model)', multidata:'y');
hhh.DefineKey('make');
hhh.DefineData('model');
hhh.DefineDone();
end;
set x;
rc = hhh.find();
do while(rc = 0);
output;
rc = hhh.find_next();
end;
if rc ne 0 then do;
call missing(model);
output;
end;
run;
If all makes in table X were also in table cars, then removing the output statement after call missing(model) would do exactly what I want. But I also want to make sure that make "XYZ" remains in the table.
The existing code, however, produces a blank row after it finds all matching models, like so:
make model
==========
Acura MDX
Acura RSX Type S 2dr
Acura TSX 4dr
... (skipping a few rows)
Acura NSX coupe 2dr manual S
Acura
Audi A4 1.8T 4dr
As you can see in the table above, there is a missing model in the second-to-last row. This pattern appears at the end of every make.
Any suggestions on how to fix this would be highly appreciated!
Many thanks
The direct answer: you need to consider this section.
rc = hhh.find();
do while(rc = 0);
output;
rc = hhh.find_next();
end;
if rc ne 0 then do;
call missing(model);
output;
end;
What's happening here is that you repeatedly call find_next until it fails, which is fine. At that point, though, rc is nonzero, so you fall into the rc ne 0 branch, even though you really mean that last step to run only if you didn't find even one match.
You can handle this a couple of ways. You can do this:
rc = hhh.find();
if rc ne 0 then do;
call missing(model);
output;
end;
else
do while(rc = 0);
output;
rc = hhh.find_next();
end;
Or, you can add a counter to the do while loop, and then execute the call missing/output if that counter stores a 0. The above is probably easier.
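The counter variant could be sketched like this; the variable name n_found is made up, and the input table x is built the same way as in the question.

```sas
/* Ten makes, with one replaced by a value that has no match in sashelp.cars */
proc sql noprint;
  create table x as select distinct make from sashelp.cars;
quit;
data x;
  set x(obs = 10);
  if make = "GMC" then make = "XYZ";
run;

data hx(drop = rc n_found);
  if 0 then set sashelp.cars(keep = make model);
  if _n_ = 1 then do;
    declare hash hhh(dataset: 'sashelp.cars(keep = make model)', multidata: 'y');
    hhh.defineKey('make');
    hhh.defineData('model');
    hhh.defineDone();
  end;
  set x;
  n_found = 0;                    /* number of matches written for this make */
  rc = hhh.find();
  do while (rc = 0);
    n_found = n_found + 1;
    output;
    rc = hhh.find_next();
  end;
  /* Only emit the blank row when the make had no match at all */
  if n_found = 0 then do;
    call missing(model);
    output;
  end;
run;
```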
Further, you probably should consider whether a hash is the right solution for this problem. While it is possible to solve this with multidata hashes, keyed set is usually more efficient for something like this, and much easier to code.

Hash memory usage with replace() method in SAS

Why does my hash exceed the memory limits when I use the replace() method, when the same code without the replace method fits just fine? It seems like the hash should remain the same size either way. I am running the code on Unix. In the code below, if I comment out ht.replace(), the code runs fine. If I leave it in, I receive a message saying "Hash object added 2490352 items when memory failure occurred." The "series" dataset, which is fed into the hash, has 13 variables and 6912 rows. The "data1" dataset has 26970 rows and 4 columns. Is there any way to resolve this without messing with the memsize?
data _null_;
if 0 then set series;
if _n_ = 1 then do;
declare hash ht(dataset:"series", ordered:"a", multidata:"yes");
rc = ht.defineKey("one", "two", "three");
rc = ht.defineData(all:"yes");
declare hiter hi("ht");
rc = ht.defineDone();
end;
set data1 end=eof;
rc = hi.first();
do while (rc = 0);
if low <= code1 <= high then do;
sum = sum + value1;
ht.replace();
end;
rc = hi.next();
end;
if eof then ht.output(dataset:"sum1");
run;
Probably the problem is that your hash is a multidata one, i.e. one key can correspond to many data items. For multidata hashes you have to use the REPLACEDUP method, which unambiguously selects not only a specific key but also a specific data item within that key.
So your iteration over hash ht should look like this:
rc = hi.first();
do while (rc = 0);
rc=ht.find_next();
do while(rc=0);
if low <= code1 <= high then do;
sum = sum + value1;
ht.replacedup();
end;
rc=ht.find_next();
end;
rc = hi.next();
end;