How do you set retained values to missing if there's no match in a many-to-many SAS merge?

I have two datasets that I want to merge. Both have repeated values of the by variable, but an unequal number of rows per by group. In SAS, the default behavior is to retain values for all the rows that don't have a match.
For example:
data a;
input i x y;
datalines;
1 1 5
1 2 6
1 3 7
1 4 8
;
run;
data b;
input i f g $;
datalines;
1 9 aa
1 8 bb
;
run;
Here dataset a has four rows of the by variable i, while dataset b only has two.
Merging solely with the by variable i produces this:
data c;
merge a b;
by i;
run;
Obs i x y f g
1 1 1 5 9 aa
2 1 2 6 8 bb
3 1 3 7 8 bb
4 1 4 8 8 bb
You can see that for variables f and g in obs 3 and 4 the values have been retained, since those rows of dataset a didn't have a match in dataset b.
What I am trying to produce is this output:
Obs i x y f g
1 1 1 5 9 aa
2 1 2 6 8 bb
3 1 3 7 .
4 1 4 8 .
I am using SAS 9.4 and this is what I've tried:
data c;
if _n_>1 then do;
array num{*} _numeric_;
array char{*} _character_;
call missing(of num{*});
call missing(of char{*});
end;
merge a b;
by i;
run;
My thinking is that for every row after the first, I want to set all variables to missing, so that if they don't have a matching row their values won't be overwritten and will remain missing. This would eliminate retained values.
By the second row the PDV should be created and all the metadata should be available to create these arrays and set them to missing, but I am getting this error:
WARNING: Defining an array with zero elements.
Any suggestions on how to fix this code or other code that would do the trick?

You would want to override the default behavior of the run statement, namely the automatic output and automatic call missing of certain variables.
Here the explicit output; writes the row (the same as the default behavior), and then call missing(of _all_); sets every variable to missing (as opposed to only the ones not appearing on the merge or set statements).
data c;
merge a b;
by i;
output;
call missing(of _all_);
run;
The reason you have to do it at the end and not the beginning is that you haven't defined any variables yet at the beginning - so _numeric_, _character_, or _all_ don't have anything to refer to.
You can fix this with an if 0 then set a b;, but I find the above solution a bit more straightforward. Either works fine, with the same speed and benefit.
data c;
if 0 then set a b; *defines all of the variables, but `if 0` means it will not pull any data;
call missing(of _all_); *sets everything missing;
merge a b;
by i;
run;
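Another way to get the same result, sketched here as an alternative rather than as part of the answer above, is to number the rows within each by group and merge on that counter as well. The names a_seq, b_seq and seq are made up for illustration, and both datasets are assumed to be sorted by i already.

```sas
/* Number the rows of each dataset within the by group i. */
data a_seq;
  set a;
  by i;
  if first.i then seq = 0;
  seq + 1;               /* sum statement: seq is retained automatically */
run;

data b_seq;
  set b;
  by i;
  if first.i then seq = 0;
  seq + 1;
run;

/* With (i, seq) unique in both inputs this is a one-to-one match merge,
   so rows of a without a partner in b get missing f and g. */
data c;
  merge a_seq b_seq;
  by i seq;
  drop seq;
run;
```

This costs two extra passes over the data, but it keeps the merge itself free of any manual handling of retained values.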

Related

Can I do hash merge by multiple keys in SAS

I would like to hash merge in SAS using two keys;
The variable names for the lookup dataset called link_id 8. and ref_date 8.;
The variable names for the merged dataset called link_id 8. and drug_date 8.;
The code I used is as following:
data elig_bene_pres;
length link_id ref_date 8.;
call missing(link_id,ref_date);
if _N_=1 then do;
declare hash elig_bene(dataset:"bene.elig_bene_uid");
elig_bene.defineKey("link_id","ref_date");
elig_bene.defineDone();
end;
set data;
if elig_bene.find(key:Link_ID,key:drug_dt)=0 then output;
run;
But it seems that it is not found by these two keys. I just want to know whether my method is doable.
Thanks!
There are no obvious problems with the code.
To troubleshoot, try merge-sort: PROC SORT both data sets, then merge them by the two key variables. This will show which values look similar but are not exactly the same.
This sample shows you have the correct approach.
data elig;
input lukey1 lukey2;
datalines;
1 1
1 2
2 4
3 6
3 7
;
run;
data all;
do key1 = 1 to 10; do key2 = 1 to 10;
array x(5) (1:5);
output;
end; end;
run;
data all_elig;
length lukey1 lukey2 8;
call missing (lukey1,lukey2);
if _n_ = 1 then do;
declare hash elig (dataset:"elig");
elig.defineKey ('lukey1','lukey2');
elig.defineDone ();
end;
set all;
if 0 = elig.find(key:key1, key:key2);
run;
The process as shown is not really a merge because the lookup hash has no explicit data elements. The keys are implicit data when no data is specified.
If you are selecting all data rows, the first item to troubleshoot is bene.elig_bene_uid. Are its keys accidentally a superset of the keys in data?
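The sort-and-merge check suggested above could look roughly like the sketch below. It assumes the second key in data is named drug_dt, as in the question's code (the question text calls it drug_date). The names lookup_sorted, data_sorted, key_check, in_lookup, in_data and source are made up for illustration.

```sas
/* One row per key pair from the lookup table. */
proc sort data=bene.elig_bene_uid (keep=link_id ref_date)
          out=lookup_sorted nodupkey;
  by link_id ref_date;
run;

proc sort data=work.data out=data_sorted;
  by link_id drug_dt;
run;

/* Flag where each key pair came from. */
data key_check;
  merge lookup_sorted (in=in_lookup)
        data_sorted   (in=in_data rename=(drug_dt=ref_date));
  by link_id ref_date;
  length source $11;
  if in_lookup and in_data then source = 'both';
  else if in_lookup then source = 'lookup only';
  else source = 'data only';
run;

proc freq data=key_check;
  tables source;
run;
```

If almost everything lands in 'lookup only' and 'data only', printing a few rows of key_check with an explicit format on ref_date may show values that look alike but differ, for example a date stored in one table and a datetime in the other.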

kdb apply function in select by row

I have a table
t: flip `S`V ! ((`$"|A|B|"; `$"|B|C|D|"; `$"|B|"); 1 2 3)
and some dicts
t1: 4 10 15 20 ! 1 2 3 5;
t2: 4 10 15 20 ! 0.5 2 4 5;
Now I need to add a column with values based on the substrings in S and the function below (which is a bit pseudocode because I am stuck here).
f:{[s;v];
if[`A in "|" vs string s; t:t1;];
else if[`B in "|" vs string s; t:t2;];
k: asc key t;
:t k k binr v;
}
The problem is that s and v are passed in as full column vectors when I do something like
update l:f[S,V] from t;
How can I make this an operation that works by row?
How can I make this a vectorized function?
Thanks
You will want to use the each-both adverb to apply a function over two columns by row.
In your case:
update l:f'[S;V] from t;
To help with your pseudocode function, you might want to use $, the if-else operator, e.g.
f:{[s;v]
t:$["A"in ls:"|"vs string s;t1;"B"in ls;t2;()!()];
k:asc key t;
:t k k binr v;
};
You've not mentioned a final else clause in your pseudocode but $ expects one hence the empty dictionary at the end.
Also note that in your table the column S is a symbol column. vs expects a string to split, so I've had to use the string operation. This could be removed if you are able to redefine your original table.
Hope this helps!

Give each variable a name based on an already existing logical-ID vector (MATLAB)

I have length(C) variables. Each index represents a unique type of variable in my optimization model, e.g. whether it is electricity generation, transmission line capacity, etc.
However, I have a logical vector with the same length as C (all variables) indicating if it is e.g. generation:
% length(genoidx)=length(C), i.e. the number of variables
genoidx = [1 1 1 1 1 1 0 0 ... 1 1 1 1 1 1 0 0]
In this case, there are 6 generators in 2 time steps, amounting to 12 variables.
I want to name each variable to get a better overview of the output from the optimization model, for example like this:
% This is only a try on pseudo coding
varname = cell(length(C),1)
varname(genoidx) = 'geno' (1 2 3 4 5 6 ... 1 2 3 4 5 6)
varname(lineidx) = 'line' (...
Any suggestions on how to name the variables in C with string and number, based on logical ID-vector?
Thanks!
Using dynamic names is maybe OK for seeing the results of a calculation in the workspace, but I wouldn't use them if any code is ever going to read them.
You can use the assignin function with the 'base' workspace to do this.
I'm not quite sure what your pseudo code is attempting to do, but you could do something like:
>> varname={'aaa','bbb','ccc','ddd'}
varname =
'aaa' 'bbb' 'ccc' 'ddd'
>> genoidx=logical([1,0,1,1])
genoidx =
1 0 1 1
>> assignin('base', sprintf('%s_',varname{genoidx}), 22)
which would create the variable aaa_ccc_ddd_ in the workspace and assign the number 22 to it.
Alternatively you could use an expression like:
sum(genoidx.*(length(genoidx):-1:1))
to calculate a decimal value and index a cell array of bespoke names:
>> varname={'aaa','bbb','ccc','ddd','eee','fff','ggg','hhh'}
varname =
'aaa' 'bbb' 'ccc' 'ddd' 'eee' 'fff' 'ggg' 'hhh'
>> assignin('base', varname{sum(genoidx.*(length(genoidx):-1:1))}, 33)
which would create the variable ggg and assign 33 to it.

How to deal with subsetting in SAS

I am very new to SAS and I am very eager to learn it. My question is about subsetting. I have two data sets, a and b, consisting of the single columns a and b respectively:
a   b
3   4
    5
    6
data a;
set a;
run;
data b;
set b;
run;
data merged;
merge a b;
run;
proc print data=merged(firstobs= a[1] obs=a[1] keep= b);
run;
In this code I get invalid conversion type error and I could not figure out why I am getting this error because when I write like:
proc print data=merged(firstobs= 3 obs= 3 keep= b);
run;
I get the result as 6.
I know it seems very simple, but I am stuck on this error. I would really appreciate any help. Thanks
You want to print the row from the dataset b whose number is the same as the value of a in row 1 of the dataset a.
You can't pass a value into a proc directly like that, but you can generate a macro variable from your dataset and pass it into the proc, e.g.
data _null_;
set a(obs = 1);
call symputx('ROW_NUMBER', a); *symputx converts the number to text and strips blanks;
run;
proc print data = b(keep = b obs = &ROW_NUMBER firstobs = &ROW_NUMBER);
run;
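If the macro variable is not needed for anything else, the same row can also be fetched inside a single data step with direct access. This is only a sketch of an alternative, assuming the value of a in the first row of dataset a is a valid observation number in b. The names picked and row are made up for illustration.

```sas
data picked;
  set a (obs=1);       /* first row of a supplies the row number */
  row = a;
  set b point=row;     /* direct access to that observation of b */
  output;
  stop;                /* point= never signals end of file, so stop explicitly */
run;

proc print data=picked (keep=b);
run;
```

The POINT= variable row is temporary and is not written to picked.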

Get out the value of a variable in each observation to a macro variable

I have a table called term_table containing the below columns
comp, term_type, term, score, rank
I go through every observation and at each obs, I want to store the value of variable rank to a macro variable called curr_r. The code I created below does not work
Data Work.term_table;
input Comp $
Term_type $
Term $
Score
Rank
;
datalines;
comp1 term_type1 A 1 1
comp2 term_type2 A 2 10
comp3 term_type3 A 3 20
comp4 term_type4 B 4 20
comp5 term_type5 B 5 40
comp6 term_type6 B 6 100
;
Run;
%local j;
DATA tmp;
SET term_table;
LENGTH freq 8;
BY &by_var term_type term;
RETAIN freq;
CALL SYMPUT('curr_r', rank);
IF first.term THEN DO;
%do j = 1 %to &curr_r;
do some thing
%end;
END;
RUN;
Could you help me to solve the problem
Thanks a lot
Hung
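The call symput statement does create the macro variable &curr_r with the value of rank, but it is not available until after the data step has finished. The %do loop is macro code, which is resolved while the step is being compiled, before any observation has been read.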
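The timing can be seen with a small sketch like the one below, which only uses the term_table from the question. The use of call symputx instead of call symput is my own choice here, to avoid leading blanks in the stored value.

```sas
/* While this step compiles, &curr_r cannot be relied on
   (unless an earlier step happened to create it). */
data _null_;
  set term_table (obs=1);
  call symputx('curr_r', rank);   /* executes at run time */
run;

/* After the step boundary the macro variable can be resolved. */
%put curr_r=&curr_r;
```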
However, I don't think you need to create the macro var &curr_r. I don't think a macro is needed at all.
I think the following should work (untested):
DATA tmp;
SET term_table;
LENGTH freq 8;
BY &by_var term_type term;
RETAIN freq;
IF first.term THEN DO;
do j = 1 to rank;
<do some thing>
end;
END;
RUN;
If you need to use the rank from a prior obs, use the LAG function.
Start=Lag(rank);
To store each value of RANK in a macro variable, the below will do that:
Proc Sql noprint;
select count(rank)
into :cnt
from term_table;
%Let cnt=&cnt;
select rank
into :curr_r1 - :curr_r&cnt
from term_table;
quit;
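Once the PROC SQL step above has run in open code, the numbered macro variables can be read back with a double ampersand. A minimal sketch, where the macro name show_ranks and the loop variable k are hypothetical:

```sas
%macro show_ranks;
  %local k;
  %do k = 1 %to &cnt;
    /* &&curr_r&k resolves first to &curr_r1, &curr_r2, ... and then to the value */
    %put curr_r&k = &&curr_r&k;
  %end;
%mend show_ranks;

%show_ranks
```

The %do loop is legal here because it sits inside a macro definition, unlike the open-code %do in the question.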