Create dictionary from table using hash object - hash

I'm sure it's an easy task, but I'm still neophyte at SAS, so I have some problems :(.
Consider, I have some data set Table with column Column $24 (length is important). And I want to create a data set HashTable with only one column Key $11, where Key consists of those unique values of Column, which length equals 11.
So, I'm trying to use a hash object, but I feel I'm doing something wrong :).
data _null_;
length Key $11;
set Table end = _end;
if _N_ = 1 then do;
declare hash h();
h.defineKey('Key');
h.defineDone();
end;
if length(Column) = 11 then
rc = h.add(Key: Column);
if _end then
rc = h.output(dataset: 'HashTable');
run;
When I submit program I get errors:
7904 rc = h.add(Key: Column);
ERROR: Incorrect number of data entries specified at line 7904 column 14.
ERROR: DATA STEP Component Object failure. Aborted during the EXECUTION phase.

This issue has been identified by Paul Dorfman in 2007 on SAS-L, Paul Dorfman on Hash error
Simply put, ADD() method is expecting either both 'key' and 'data' or none (which implies both). The following also works, even you 'data' hasn't been defined explicitly.
data _null_ ;
length key $ 5;
set sashelp.shoes end = _end ;
if _N_ = 1 then do ;
declare hash h() ;
h.defineKey('key') ;
h.defineDone() ;
end ;
if length(Subsidiary)=5 then rc=h.add(key:subsidiary, data:Subsidiary) ;
if _end then h.output(dataset: 'HashTable') ;
run ;

Your code looks pretty good to me. The one mistake I notice is that your statements:
if length(Column) = 11 then;
rc = h.add(Key: Column);
if _end then;
rc = h.output(dataset: 'HashTable');
have an extra semicolon. There should be no semicolon after the then. As written, the h.add and h.output are executed unconditionally.
Here is an example using sashelp.shoes along the lines of your examole:
data _null_ ;
set sashelp.shoes end = _end ;
if _N_ = 1 then do ;
declare hash h() ;
h.defineKey('Subsidiary') ;
h.defineDone() ;
end ;
if length(Subsidiary)=5 then rc=h.add() ;
if _end then h.output(dataset: 'HashTable') ;
run ;
Which returns:
115 data _null_ ;
116 set HashTable ;
117 put (_all_)(=) ;
118 run ;
Subsidiary=Tokyo
Subsidiary=Cairo
Subsidiary=Paris
Subsidiary=Seoul
Subsidiary=Dubai
NOTE: There were 5 observations read from the data set WORK.HASHTABLE.

A mistake was stupid, as I expected:)
The right code (I hope):
data _null_;
length Key $11;
set Table end = _end;
if _N_ = 1 then do;
declare hash h();
h.defineKey('Key');
h.defineDone();
end;
if length(Column) = 11 then do;
Key = Column;
rc = h.add();
end;
if _end then
rc = h.output(dataset: 'HashTable');
run;

Related

Modification of a SAS macro to print dichotomous variable information

I'm trying to modify the following SAS macro so that it includes includes percentages for the variable CHD when it is equal to both 0 and 1. Currently this macro is only set up to print out the results of baseline variables when the CHD (chronic heart disease) is equal to 1. I think the modification needs to occur within the data routfreq&i step but I'm not quite sure how to set it up. I would then also need an additional column to print out 'No Coronary Heart Disease * % (n)".
%macro categ(pred,i);
proc freq data = heart;
tables &pred * chd / chisq sparse outpct out = outfreq&i ;
output out = stats&i chisq;
run;
proc sort data = outfreq&i;
by &pred;
run;
proc means data = outfreq&i noprint;
where chd ne . and &pred ne .;
by &pred;
var COUNT;
output out=moutfreq&i(keep=&pred total rename=(&pred=variable)) sum=total;
run;
data routfreq&i(rename = (&pred = variable));
set outfreq&i;
length varname $20.;
if chd = 1 and &pred ne .;
rcount = put(count,8.);
rcount = "(" || trim(left(rcount)) || ")";
pctnum = round(pct_row,0.1) || " " || (rcount);
index = &i;
varname = vlabel(&pred);
keep &pred pctnum index varname;
run;
data rstats&i;
set stats&i;
length p_value $8.;
if P_PCHI <= 0.05 then do;
p_value = round(P_PCHI,0.0001) || "*";
if P_PCHI < 0.0001 then p_value = "<0.0001" || "*";
end;
else p_value = put(P_PCHI,8.4);
keep p_value index;
index = &i;
run;
data _null_;
set heart;
call symput("fmt",vformat(&pred));
run;
proc sort data = moutfreq&i;
by variable;
run;
proc sort data = routfreq&i;
by variable;
run;
data temp&i;
merge moutfreq&i routfreq&i;
by variable;
run;
data final&i;
merge temp&i rstats&i;
by index;
length formats $20.;
formats=put(variable,&fmt);
if not first.index then do;
varname = " ";
p_value = " ";
end;
drop variable;
run;
%mend;
%categ(gender,1);
%categ(smoke,2);
%categ(age_group,3);
%macro names(j,k,dataname);
%do i=&j %to &k;
&dataname&i
%end;
%mend names;
data categ_chd;
set %names(1,3,final);
label varname = "Demographic Characteristic"
total = "Total"
pctnum = "Coronary Heart Disease * % (n)"
p_value = "p-value * (2 sided)"
formats = "Category";
run;
ods listing close;
ods rtf file = "c:\nesug\table1a.rtf" style = forNESUG;
proc report data = categ_chd nowd split = "*";
column index varname formats total pctnum p_value;
define index /group noprint;
compute before index;
line ' ';
endcomp;
define varname / order = data style(column) = [just=left] width = 40;
define formats / order = data style(column) = [just=left];
define total / order = data style(column) = [just=center];
define pctnum / order = data style(column) = [just=center];
define p_value / order = data style(column) = [just=center];
title1 " NESUG PRESENTATION: TABLE 1A (NESUG 2004)";
title2 " CROSSTABS OF CATEGORICAL VARIABLES WITH CORONARY HEART DISEASE OUTCOME";
run;
ods rtf close;
ods listing;
Also, this code has the following error when it is run:
NOTE: PROCEDURE MEANS used (Total process time):
real time 0.01 seconds
cpu time 0.01 seconds
NOTE: Character values have been converted to numeric values at the places given by:
(Line):(Column).
1:2
NOTE: Numeric values have been converted to character values at the places given by:
(Line):(Column).
3:111
I think this macro needs to be modified so that it doesn't crash when it runs with categorical/character variables.
The line
if chd = 1 and &pred ne .;
Is what is causing your output to only have CHD = "1".. You would change that to:
if chd = 1 and &pred ne .;
I do not understand your request for an additional column. Perhaps post an example of the current output and the output that you want?
As for the "errors" (actually notes as they do not cause the system to stop processing), the occur when a variable is automatically converted from numeric to character or vice-versa. It provides the code line where it is happening and how many times it happened. I prefer to eliminate these notes as often as possible to avoid unintended consequences of inappropriate coercion. To do this, you would make use of the PUT and INPUT functions.

Hash Merge Macro - using a file record indicator "HASH + point = Key"

Looking to update this macro to be HASH + point = key. We have started to exceed our memory limits with our current version of this macro for one of our data runs. The reason I'm asking for help is because I don't have a lot of time and have never really analyzed this code since it wasn't part of my process until recently.
What I don't really understand from, https://www.lexjansen.com/nesug/nesug11/ld/ld01.pdf, is how does the RID get set and how to incorporate it into our macro. I actually don't even know if it is possible to do it this way with our current macro.
Any help would be greatly appreciated.
%macro hashmerge2(varnm,onto,from,byvars,obsqty);
%let data_vars = %trim (&varnm);
%let data_vars_a = %sysfunc(tranwrd(&data_vars.,%str( ),%str(" , ")));
%let data_vars_b = %sysfunc(tranwrd(&data_vars.,%str( ), %str(,)));
%let data_key = %trim (&byvars);
%let data_key = %sysfunc(tranwrd(&data_key.,%str( ), %str(" , ")));
%if %index(&varnm,' ') > 0 %then %let varnm3=%substr(%substr(&varnm,1,%index(&varnm,' ')),1,4);
%else %let varnm3=%substr(&varnm,1,4);
data &onto(drop=rc) miss&varnm3(drop=rc);
if 0 then set &onto &from(keep=&varnm. &byvars.);
declare hash h_merge (dataset: "&from.");
rc = h_merge.DefineKey ("&data_key.");
rc = h_merge.DefineData ("&data_vars_a.");
rc = h_merge.DefineDone ();
do until (eof);
set &onto end = eof;
call missing(&data_vars_b.);
rc = h_merge.find ();
if rc = 0 then do;
output &onto;
from = "&from.";
end;
else do;
output miss&varnm3 &onto;
from = "&onto.";
end;
end;
stop;
run;
%mend;
So I think this is what you are looking for, but it still needs to load all of the key values from the "lookup" table into the hash object. But it could save space by instead of also loading the non-key variables it just needs to load the observation number that matches the key variables.
%macro hash_merge_point
/*-----------------------------------------------------------------------------
Merge variables ONTO large table FROM small table using POINT= dataset option.
-----------------------------------------------------------------------------*/
(varnm /* Space delimited list of variable to retrieve */
,onto /* Dataset to update */
,from /* Dataset to get values from */
,byvars /* Space delimited list of key variables to match on */
);
%local missds key_vars;
%let missds=%scan(&varnm,1,%str( ));
%let missds=miss%substr(&missds,1,%sysfunc(min(28,%length(&missds))));
%let key_vars="%sysfunc(tranwrd(%sysfunc(compbl(&byvars)),%str( )," "))";
data &onto(drop=rc) &missds(drop=rc);
if 0 then set &onto &from(keep=&varnm. &byvars.);
declare hash h_merge ();
rc = h_merge.DefineKey (&key_vars);
rc = h_merge.DefineData ('_point');
rc = h_merge.DefineDone ();
do _point=1 to _nobs;
set &from(keep=&byvars) point=_point nobs=_nobs;
rc = h_merge.add();
end;
do until (eof);
set &onto end = eof;
rc = h_merge.find ();
if rc = 0 then do;
set &from (keep=&varnm) point=_point;
from = "&from.";
output &onto;
end;
else do;
call missing(of &varnm);
from = "&onto.";
output ;
end;
end;
stop;
run;
%mend hash_merge_point;
So here is an trivial example:
data lookup;
input id age sex $1.;
cards;
1 10 F
2 20 .
4 30 M
;
data master ;
input id wt ;
cards;
1 100
2 150
3 180
4 200
;
%hash_merge_point
/*-----------------------------------------------------------------------------
Merge variables ONTO large table FROM small table using POINT= dataset option.
-----------------------------------------------------------------------------*/
(varnm=age sex /* Space delimited list of variable to retrieve */
,onto=master /* Dataset to update */
,from=lookup /* Dataset to get values from */
,byvars=id /* Space delimited list of key variables to match on */
);
If the target table already has the variables being created by the merge (so you just want to overwrite the current values) then you can use the MODIFY statement instead of the SET statement to modify the dataset in place. But you might want to make sure you have a backup of the table before trying this. Also note that if you want flag for the source, the from variable, then that variable also needs to exist.
So with this updated master table:
data master ;
input id wt ;
length age 8 sex $1 from $50;
cards;
1 100
2 150
3 180
4 200
;
And this version of the macro:
%macro hash_merge_point
/*-----------------------------------------------------------------------------
Merge variables ONTO large table FROM small table using POINT= dataset option.
-----------------------------------------------------------------------------*/
(varnm /* Space delimited list of variable to retrieve */
,onto /* Dataset to update */
,from /* Dataset to get values from */
,byvars /* Space delimited list of key variables to match on */
);
%local key_vars;
%let key_vars="%sysfunc(tranwrd(%sysfunc(compbl(&byvars)),%str( )," "))";
data &onto;
if 0 then set &onto (keep=&byvars.);
declare hash h_merge ();
rc = h_merge.DefineKey (&key_vars);
rc = h_merge.DefineData ('_point');
rc = h_merge.DefineDone ();
do _point=1 to _nobs;
set &from(keep=&byvars) point=_point nobs=_nobs;
rc = h_merge.add();
end;
do until (eof);
modify &onto end = eof;
rc = h_merge.find ();
if rc = 0 then do;
set &from (keep=&varnm) point=_point;
from = "&from.";
end;
else from = "&onto.";
replace;
end;
stop;
run;
%mend hash_merge_point;
If you run this code:
proc print data=master;
title 'BEFORE';
run;
%hash_merge_point
/*-----------------------------------------------------------------------------
Merge variables ONTO large table FROM small table using POINT= dataset option.
-----------------------------------------------------------------------------*/
(varnm=age sex /* Space delimited list of variable to retrieve */
,onto=master /* Dataset to update */
,from=lookup /* Dataset to get values from */
,byvars=id /* Space delimited list of key variables to match on */
);
proc print data=master;
title 'AFTER';
run;
You get this result:

SAS: formatting multiple proc freq using macros

I don't have another analyst on my team at work and have a question about the most efficient way to run several proc freq concurrently.
My goal is to run about 160 different frequencies, and include formatting for all of them. I assume a macro is the fastest way, but I only have experience with basic macros. Below is my thought process assuming the data was already formatted:
%macro survey(question, formatA formatB);
proc freq;
table &question;
format &formatA &formatB;
%mend;
%survey (question, formatA, formatB);
"question", "formatA" and "formatB" will be strings of data for example:
-"question" would be KCI_1 KCI_2 through KCI_80
- "formatA" would be KCI_1fmt KCI_2fmt through KCI_80fmt
- "formatB" would be KCI_1fmt. KCI_2fmt. through KCI_80fmt.
Danielle:
You can use macro to assign known formats to variables that are not already formatted. The rest of the FREQ does not have to be macro-ized.
* make some survey data with unformatted responses;
data have;
do respondent_id = 1 to 10000;
array responses KCI_1-KCI_80;
do _n_ = 1 to dim(responses);
responses(_n_) = ceil(4*ranuni(123));
end;
output;
end;
run;
* make some format data for each question;
data responseMeanings;
length questionID 8 responseValue 8 responseMeaning $50;
do questionID = 1 to 80;
fmtname = cats('Q',questionID,'_fmt');
peg = ranuni (1234); drop peg;
do responseValue = 1 to 4;
select;
when (peg < 0.4) responseMeaning = scan('Never,Seldom,Often,Always', responseValue);
when (peg < 0.8) responseMeaning = scan('Yes,No,Don''t Ask,Don''t Tell', responseValue);
otherwise responseMeaning = scan('Nasty,Sour,Sweet,Tasty', responseValue);
end;
output;
end;
end;
run;
* create a custom format for the responses of each question;
proc format cntlin=responseMeanings(rename=(responseValue=start responseMeaning=label));
run;
* macro to associate variables with the corresponding custom format;
%macro format_each_response;
%local i;
format
%do i = 1 %to 80;
KCI_&i Q&i._fmt.
%end;
;
%mend;
* compute frequency counts;
proc freq data=have;
table KCI_1-KCI_80;
%format_each_response;
run;

Count consecutive consonants in e-mail address in SAS SQL

I would like to identify the max number of consecutive consonants and vowels in an e-mail address, using SAS SQL (proc sql). The output should look like the one below in columns Max of consecutive consonants and max of consecutive vowels (I listed characters in first row for illustrative purposes only).
A few things to note:
treat special and numeric characters as a count terminator (e.g. 3rd email is a good example where you've got 3 consonants (hf) then numbers (98) and then again 2 consonants (jl). The output should be just 2 (hf).
I am only interested in the first part of the email (before #).
How do I achieve this, dear community?
E-mail Max of consecutive consonants Max of consecutive vowels
asifhajhtysiofh#gmail.com 5 (jhtys) 2 (io)
chris.nashfield#hotmail.com 3 2
ahf98jla#gmail.com 2 1
There is a routine called prxnext that proves very handy here.
Generate sample data
data emails;
input email $32.;
datalines;
asifhajhtysiofh#gmail.com
chris.nashfield#hotmail.com
ahf98jla#gmail.com
;
Do the counting
data checkEmails(keep = email maxCons maxVow);
set emails;
* Consonants;
re = prxparse("/[bcdfghjklmnpqrstvwxyz]+/");
start = 1;
stop = index(email,"#");
do until (pos = 0);
call prxnext(re,start,stop,email,pos,len);
maxCons = max(maxCons, len);
end;
* Vowels;
re = prxparse("/[aeiouy]+/");
start = 1;
stop = index(email,"#");
do until (pos = 0);
call prxnext(re,start,stop,email,pos,len);
maxVow = max(maxVow, len);
end;
run;
Results
Email MaxCons MaxVow
asifhajhtysiofh#gmail.com 5 2
chris.nashfield#hotmail.com 3 2
ahf98jla#gmail.com 2 1
This was much trickier than I expected it to be, but I have a solution using macro loops that roughly follows the logic in #DaBigNikoladze's comment:
data temp;
input email $40.;
datalines;
asifhajhtysiofh#gmail.com
chris.nashfield#hotmail.com
ahf98jla#gmail.com
;
run;
proc sql noprint;
select max(length(email)) into: max_email_length from temp;
quit;
%let vowels = "a" "e" "i" "o" "u";
%let consonants = "q" "w" "r" "t" "y" "p" "s" "d" "f" "g" "h" "j" "k" "l" "z" "x" "c" "v" "b" "n" "m";
%macro counter;
data temp_count;
set temp;
/* limit email to just the part before the #*/
email_short = substrn(email, 0, find(email, "#"));
email_vowels_only = email_short;
email_consonants_only = email_short;
/* keep only the vowels or consonants, respectively*/
%do i = 1 %to &max_email_length.;
if substr(email_vowels_only, &i., 1) notin(&vowels.) then substr(email_vowels_only, &i., 1) = " ";
if substr(email_consonants_only, &i., 1) notin(&consonants.) then substr(email_consonants_only, &i., 1) = " ";
%end;
run;
/* determine the max number of strings we have to scan through*/
proc sql noprint;
select max(max(countw(email_vowels_only)), max(countw(email_consonants_only))) into: loops from temp_count;
quit;
/* separate each string out into its own variable, find the max length of those variables, and drop those variables*/
proc sql;
create table temp_count_expand (drop = vowel_word: consonant_word:) as select
*,
%do j = 1 %to &loops.; scan(email_vowels_only, &j.) as vowel_word&j., %end;
%do k = 1 %to &loops.; scan(email_consonants_only, &k.) as consonant_word&k., %end;
max(%do j = 1 %to &loops.; length(calculated vowel_word&j.), %end; .) as max_vowels,
max(%do k = 1 %to &loops.; length(calculated consonant_word&k.), %end; .) as max_consonants
from temp_count;
quit;
%mend counter;
%counter;
I'm not sure why you specify proc sql for this task. A data step is much more suitable as you can loop through the email, treating everything that is either a non-consonant or a non-vowel as a delimiter. I've used a regular expression (prxchange) to remove the # portion of the email, although substr works just as well.
data have;
input Email $50.;
datalines;
asifhajhtysiofh#gmail.com
chris.nashfield#hotmail.com
ahf98jla#gmail.com
;
run;
data want;
set have;
length _w1 _w2 $50;
_short_email=prxchange('s/#.+//',-1,email); /* remove everything from # onwards */
do _i = 1 by 1 until (_w1=''); /* loop through email, using everything other than consonants as the delimiter */
_w1 = scan(_short_email,_i,'bcdfghjklmnpqrstvwxyz','ki');
consonant = max(consonant,ifn(missing(_w1),0,length(_w1))); /* keep longest value */
end;
do _j = 1 by 1 until (_w2=''); /* loop through email, using everything other than vowels as the delimiter */
_w2 = scan(_short_email,_j,'aeiou','ki');
vowel = max(vowel,ifn(missing(_w2),0,length(_w2))); /* keep longest value */
end;
drop _: ; /* drop temprorary variables */
run;

Regex IBM DB2 iSeries

Can anyone give me a clue, how to create/call function regular expression syntax in DB2 iSeries.
Example:
DECLARE VAL VARCHAR (16) DEFAULT 'abcde1235876e' ;
DECLARE RET INT DEFAULT 0;
I'm just checking VARIABLE VAL must only contain numeric value and return true/false
SET VAL = I_NEED_FUNCTION_REGEX(VAL);
IF (VAL = true) THEN
SET RET = 1;
ELSE
SET RET = 0;
END IF;
as simple as that, but i've been searching in IBM as follows:
http://www.ibm.com/developerworks/data/library/techarticle/0301stolze/0301stolze.html
but i don't quite understand.
Can u help me ?
UPDATE
I'm back to the old way and simple for now.
CREATE FUNCTION TEST.VALIDATE_NUMERIC (VAL CHARACTER VARYING(1))
RETURNS INTEGER
LANGUAGE SQL
SPECIFIC TEST.VALIDATE_NUMERIC
MODIFIES SQL DATA
CALLED ON NULL INPUT
FENCED
DISALLOW PARALLEL
NO EXTERNAL ACTION
BEGIN ATOMIC
DECLARE RET INT DEFAULT 0 ;
DECLARE CONTINUE HANDLER FOR SQLEXCEPTION , SQLWARNING , NOT FOUND
IF ( VAL IS NOT NULL ) THEN
CASE VAL
WHEN 0 THEN -- (0)
SET RET = 1 ;
WHEN 1 THEN -- (1)
SET RET = 1 ;
WHEN 2 THEN -- (2)
SET RET = 1 ;
WHEN 3 THEN -- (3)
SET RET = 1 ;
WHEN 4 THEN -- (4)
SET RET = 1 ;
WHEN 5 THEN -- (5)
SET RET = 1 ;
WHEN 6 THEN -- (6)
SET RET = 1 ;
WHEN 7 THEN -- (7)
SET RET = 1 ;
WHEN 8 THEN -- (8)
SET RET = 1 ;
WHEN 9 THEN -- (9)
SET RET = 1 ;
ELSE
SET RET = 0 ;
END CASE ;
END IF ;
RETURN RET ;
END
GO
Thanks
MRizq
Out-of-the-box, DB2 does not come with the capability to handle regex. There are some functions to handle some pattern matching, but it's severly restricted.
The article you linked is how to set up a UDF (user-defined function) to call out to an external (C) library to provide this functionality. While the steps are shown for LUW, the iSeries version should be roughly equivalent; you're going to have to talk your DBAs into implementing the call out to relevant libraries.
You can use LOCATE(VAL, '0123456789') to return a 0 if not numeric and the digit + 1 if found:
CASE LOCATE(VAL, '0123456789') WHEN > 0 THEN 1 ELSE 0 END
For a multi-character string you can use the following:
CASE WHEN TRANSLATE(TRIM(VAL), '0', '0123456789', '0')
= REPEAT('0', LENGTH(TRIM(VAL)))
THEN 1 ELSE 0 END