I am merging two datasets in SAS, "set_x" and "set_y", and would like to create a variable "E" in the resulting merged dataset "matched":
* Create set_x *;
data set_x ;
input merge_key A B ;
datalines;
1 24 25
2 25 25
3 30 32
4 32 32
5 20 32
6 1 1
;
run;
* Create set_y *;
data set_y ;
input merge_key C D ;
datalines;
1 1 1
2 2 1
3 1 1
4 2 1
5 1 1
7 1 1
;
run;
* Merge and create E *;
data matched unmatched_x unmatched_y ;
merge set_x (in=x) set_y (in=y) ;
by merge_key ;
if x=y then output matched;
else if x then output unmatched_x ;
else if y then output unmatched_y ;
IF C = 2 THEN DO ;
E = A ;
END;
ELSE DO ;
E = floor(B - D) ;
END ;
run ;
However, in the resulting table "matched", the values of E are missing. E is correctly calculated if I only output matched values (i.e. using if x=y;).
Is it possible to create "E" in the same data step if outputting unmatched as well as matched observations?
You are writing the observation out before E is computed, so the copy that reaches the output datasets still has E missing; E is then reset to missing when the next iteration starts. Compute E before the OUTPUT statements:
data matched unmatched_x unmatched_y ;
merge set_x (in=x) set_y (in=y) ;
by merge_key ;
IF C = 2 THEN DO ;
E = A ;
END;
ELSE DO ;
E = floor(B - D) ;
END ;
if x=y then output matched;
else if x then output unmatched_x ;
else if y then output unmatched_y ;
run ;
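For readers who want to check the ordering argument outside SAS, here is a minimal Python sketch. It is only an analogy, not SAS: `run_step` and `compute_e` are hypothetical names, the two rows are the first two matched keys from the question, and appending a copy of the row stands in for the OUTPUT statement.

```python
import math

def compute_e(r):
    # Same rule as the data step: E = A when C = 2, otherwise floor(B - D).
    return r["A"] if r["C"] == 2 else math.floor(r["B"] - r["D"])

def run_step(rows, compute_first):
    """Mimic the data step loop: E starts missing, and output takes a snapshot."""
    written = []
    for row in rows:
        pdv = dict(row, E=None)        # E is missing at the top of each iteration
        if compute_first:
            pdv["E"] = compute_e(pdv)
        written.append(dict(pdv))      # like OUTPUT: copies the values as they are now
        if not compute_first:
            pdv["E"] = compute_e(pdv)  # too late, the snapshot was already taken
    return written

merged = [
    {"merge_key": 1, "A": 24, "B": 25, "C": 1, "D": 1},
    {"merge_key": 2, "A": 25, "B": 25, "C": 2, "D": 1},
]
late = [r["E"] for r in run_step(merged, compute_first=False)]
early = [r["E"] for r in run_step(merged, compute_first=True)]
print(late, early)  # [None, None] [24, 25]
```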
I currently have three data sets in SAS 9.3.
Data set "Main" contains SKU IDs and customer IDs as well as various other variables such as week.
Customer_ID week var2 var3 SKU_ID
1 1 x x 1
1 2 x x 1
1 3 x x 1
1 1 x x 2
1 2 x x 2
2 1 x x 1
2 2 x x 1
2 3 x x 1
2 1 x x 2
2 2 x x 2
Data set "standard" contains the standard location for each Customer_ID.
Data set "overrides" contains the override location (if applicable) for a certain SKU for certain customers. Thus, it contains SKU_ID, customer_id and location.
standard data set
customer_id location
1 A
1 A
2 C
2 C
override dataset
customer_id sku_id location
1 1 A
1 2 B
When merging all of the data sets, this is what I get:
Customer_ID week var2 var3 SKU_ID location
1 1 x x 1 A
1 2 x x 1 A
1 3 x x 1 A
1 1 x x 2 B
1 2 x x 2 A
2 1 x x 1 C
2 2 x x 1 C
2 3 x x 1 C
versus what I want it to look like:
Customer_ID week var2 var3 SKU_ID location
1 1 x x 1 A
1 2 x x 1 A
1 3 x x 1 A
1 1 x x 2 B
1 2 x x 2 B
2 1 x x 1 C
2 2 x x 1 C
2 3 x x 1 C
proc sort data=overrides; by Location SKU_ID; run;
Proc sort data= main; by Location SKU_ID;
run;
Proc sort data= Standard; by Location;
run;
data Loc_Standard No_LOC;
Merge Main(in = a) Standard(in = b);
by Location;
if a and b then output Loc_standard;
else if b then output No_LOC;
run;
/*overwrites standard location if an override for a sku exist*/
Data Loc_w_overrides;
Merge Loc_standard overrides;
by Location SKU_ID;
run;
That is how SAS combines datasets. While every dataset still has observations to contribute to a BY group, the values are read in the order the datasets appear in the MERGE statement, so a later dataset overwrites an earlier one. Once a dataset runs out of new observations for the BY group, SAS stops reading values from it, so the value read from the other dataset is no longer replaced.
Either drop the original variable and just use the value from the second dataset; basically this sets up a one-to-many merge.
Or rename the override variable and add your own logic for when to apply the override.
I am not sure how you are getting the result you posted, since you do not have any standards for CUSTOMER_ID=2 in your posted data. If the values of location do not depend on customer_id, then why is that variable in the standards and override datasets?
Perhaps you meant that the standards dataset only has SKU_ID and location?
data main_w_standards;
merge main standards;
by sku_id ;
run;
proc sort data=main_w_standards;
by customer_id sku_id;
run;
data main_w_overrides;
merge main_w_standards overrides(in=in2 rename=(location=override));
by customer_id SKU_ID;
if in2 then location=override;
drop override;
run;
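The "apply the override only when one exists" rule in the last data step can be checked outside SAS with a minimal Python sketch. It assumes, as the question describes, that the standard location is keyed by customer_id and the override by (customer_id, sku_id); `locate` is a hypothetical helper, and the rows are the sample data with var2 and var3 left out.

```python
standard = {1: "A", 2: "C"}             # customer_id -> standard location
overrides = {(1, 1): "A", (1, 2): "B"}  # (customer_id, sku_id) -> override location

# (customer_id, week, sku_id) rows from the "Main" sample
main = [(1, 1, 1), (1, 2, 1), (1, 3, 1), (1, 1, 2), (1, 2, 2),
        (2, 1, 1), (2, 2, 1), (2, 3, 1), (2, 1, 2), (2, 2, 2)]

def locate(customer_id, sku_id):
    # Use the override when there is one, otherwise fall back to the standard.
    return overrides.get((customer_id, sku_id), standard.get(customer_id))

result = [(c, w, s, locate(c, s)) for c, w, s in main]
for row in result:
    print(row)
```

Every row for customer 1, SKU 2 gets location B here, because the override is looked up for each row rather than consumed once.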
Why not UPDATE the STANDARD (loc) with the OVERRIDE (oride) and then merge with the customer data?
data loc;
input customer_id Sku_id location:$1.;
cards;
1 1 A
1 2 A
;;;;
proc print;
data oride;
input customer_id sku_id location:$1.;
cards;
1 1 A
1 2 B
;;;;
run;
proc print;
data locoride;
update loc oride;
by cu: sk:;
run;
I have two variables (varx and vary) in data set "dat" and need to create a final score by first categorizing varx and vary, and then translating the score categories into a final score according to a look-up table "lookup".
I managed to get past the categorizing part and am now stuck on how to tell SAS to use the categories I created (i.e., "varxcat" and "varycat") as row and column indices of "lookup", grab the value I need for each observation, and put it into a final score variable (call it "score") in "dat".
In R (in which I normally code) this can easily be done with something like a for loop. Is there anything similar in SAS? (I don't have to use "varxcat" and "varycat"; I just need to eventually create "score".)
data dat;
input ID $ varx vary;
datalines;
1 1 1
2 4 5
3 11 12
4 23 14
5 24 20
;
data lookup;
input x01to10 x11to20 x21to30;
datalines;
21 52 73
84 95 96
107 118 149
; /*first row is for y01to10, second row is for y11to20, and third row is for y21to30,
such that if someone's x score is in category 1 and y score is in category 3,
the person's final score should be 107*/
data dat;
set dat;
if varx <= 10 then varxcat = 1;
else if varx > 10 & varx <= 20 then varxcat = 2;
else if varx > 20 & varx <= 30 then varxcat = 3;
if vary <= 10 then varycat = 1;
else if vary > 10 & vary <= 20 then varycat = 2;
else if vary > 20 & vary <= 30 then varycat = 3;
run;
Desired "dat" looks like
data dat;
input ID $ varx vary score;
datalines;
1 1 1 21
2 4 5 21
3 11 12 95
4 23 14 96
5 24 20 96
;
A lookup table for data value mapping is essentially a left join operation. SAS has a lot of ways to left join data, including
SQL
Merge
Hash object
Array (direct addressing)
Formats
Informats
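Here are four ways: SQL, Merge, Array and Hash. The mapping from var* to category is done by the functional mapping int(value/10). Note that this puts exact multiples of 10 (such as vary = 20) one category higher than the question's <= cut-offs do; ceil(value/10) - 1 matches those cut-offs for values from 1 to 30.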
data have;
input ID $ varx vary;
datalines;
1 1 1
2 4 5
3 11 12
4 23 14
5 24 20
6 5 29 /* score should be 107 */
;
data lookup;
do index_y = 0 to 2;
do index_x = 0 to 2;
input lookup_value @@;
output;
end;
end;
datalines;
21 52 73
84 95 96
107 118 149
;
*------------------- SQL;
proc sql;
create table want as
select
id, lookup_value as score
from
have
left join
lookup
on
int (have.varx/10) = lookup.index_x
and
int (have.vary/10) = lookup.index_y
order by
id
;
*------------------- MERGE;
data have2(index=(myindexname=(xcat ycat)));
set have;
xcat = int(varx/10);
ycat = int(vary/10);
run;
proc sort data=lookup;
by index_x index_y;
options msglevel=i;
data want2(keep=id lookup_value rename=(lookup_value=score));
merge
have2(rename=(xcat=index_x ycat=index_y) in=left)
lookup
;
by index_x index_y;
if left;
run;
proc sort data=want2;
by id;
run;
*------------------- ARRAY DIRECT ADDRESSING;
data want3;
array lookup [0:2,0:2] _temporary_;
if _n_ = 1 then do until (endlookup);
set lookup end=endlookup;
lookup[index_x,index_y] = lookup_value;
end;
set have;
xcat = int(varx/10);
ycat = int(vary/10);
score = lookup[xcat,ycat];
keep id score;
run;
*------------------- HASH LOOKUP;
data want4;
if 0 then set lookup;
if _n_ = 1 then do;
declare hash lookup(dataset:'lookup');
lookup.defineKey('index_x', 'index_y');
lookup.defineData('lookup_value');
lookup.defineDone();
end;
set have;
index_x = int(varx/10);
index_y = int(vary/10);
if (lookup.find() = 0) then
score = lookup_value;
keep id score;
run;
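As a cross-check of the direct-addressing idea against the desired output in the question, here is a minimal Python sketch. It is an illustration, not part of the SAS answer: `category` and `score` are hypothetical helpers, values are assumed to lie in (0, 30], and the category index is ceil(value/10) - 1 so that a value such as vary = 20 falls in the second category, as the question's <= cut-offs require.

```python
import math

# Rows are y categories (1-10, 11-20, 21-30); columns are x categories.
LOOKUP = [
    [21, 52, 73],
    [84, 95, 96],
    [107, 118, 149],
]

def category(value):
    """0-based category: (0, 10] -> 0, (10, 20] -> 1, (20, 30] -> 2."""
    return math.ceil(value / 10) - 1

def score(varx, vary):
    return LOOKUP[category(vary)][category(varx)]

rows = [("1", 1, 1), ("2", 4, 5), ("3", 11, 12),
        ("4", 23, 14), ("5", 24, 20), ("6", 5, 29)]
scores = {row_id: score(x, y) for row_id, x, y in rows}
print(scores)
# {'1': 21, '2': 21, '3': 95, '4': 96, '5': 96, '6': 107}
```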
Consider a set,
S = {1,2,3,4,5,6,7}
I am trying to come up with a function which takes S as the input and gives me ALL possible arrays:
[ 1 ~ 2 ; 1 ~ 3 ; 1 ~ 4 ; 1 ~ 5 ; . . . ; 6 ~ 7 ]
[ 1 2 ~ 3 ; 1 2 ~ 4 ; ...; 2 3 ~ 1 ; 2 3 ~ 4....; 5 6 ~ 7]
.
.
.
[ 2 3 4 5 6 7 ~ 1 ; 1 3 4 5 6 7 ~ 2 ; ... ; 1 2 3 4 5 6 ~ 7 ]
Here notice that '~' is sort of like a delimiter placed in between the elements of k - combination such that the set appearing before the delimiter is always unique in each array.
For example, we want both 7-combinations
[ 2 3 4 5 6 7 ~ 1 ] and [ 1 2 3 4 5 6 ~ 7 ].
But we want only one of
[ 1 2 3 4 5 6 ~ 7 ] and [ 1 3 4 5 6 2 ~ 7 ].
My Code :
clear all
for k = 1:7
Set = nchoosek(1:7,k);
for i = 1:length(Set)
A = setdiff(1:7,Set(i,:));
P = nchoosek( A , 2 ); % trialing it for only A~B where B has only 2elements
L = length( P );
S = repmat( Set ( i,: ) , L,1);
for j = 1:L
S1(j,:) = setdiff( S(j,:) , P(j,:) );
W(j,:) = [ S1(j,:) , 0 , P(j,:) ];
end
W1(i,k) = {W};
end
end
This however produces an error at k=2.
Any ideas on how to make this work, and efficiently?
I think I can outline how to achieve this.
To get the subset (for A), use setdiff:
s = 1:7
b = 4
tmp = setdiff(s,b)
For the permutation, use randperm:
t2 = randperm(length(tmp))
A = tmp(t2)
For the specific subsets, just pick the first n entries of A.
Put the whole thing in some loops to create the set you describe.
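One way to put the outline into loops, sketched in Python rather than MATLAB: since the order of the elements before the '~' does not matter, each left-hand side can be taken as a plain combination, and the right-hand side is then chosen from what remains. `delimited_combinations` is a hypothetical name, and the sketch assumes a single element after the delimiter by default and that both 1 ~ 2 and 2 ~ 1 are wanted.

```python
from itertools import combinations

def delimited_combinations(items, k, right_size=1):
    """Return every (left, right) pair with len(left) + len(right) == k.

    left is a combination of the items (so each unordered subset appears once)
    and right is a combination of right_size elements taken from the rest.
    """
    items = sorted(items)
    pairs = []
    for left in combinations(items, k - right_size):
        remaining = [x for x in items if x not in left]
        for right in combinations(remaining, right_size):
            pairs.append((left, right))
    return pairs

S = range(1, 8)
all_arrays = {k: delimited_combinations(S, k) for k in range(2, 8)}
print(len(all_arrays[2]), len(all_arrays[7]))  # 42 7
```

Passing right_size=2 gives the variant the question's code was trialing, where two elements follow the delimiter.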
I want to ask a question about some matrix operations in MATLAB.
Assume we have this matrix:
A = [1 1 17
1 1 14
1 2 10
1 2 11
2 1 9
2 1 9
2 2 13
2 2 12
3 1 18
3 1 15]
I want the first column (call it M) and the second column (call it D) to select rows of the matrix, as follows.
The program will ask the user to enter the value of M and then D:
M = input(' ENTER M VALUE = ') ;
D = input(' ENTER D VALUE = ') ;
Now, the output should be the two values from the third column whose rows match M and D,
for example:
if M = 1 and D = 2 , the output is B = 10 ; 11
another example:
if M = 3 and D = 1 , the output is B = 18 ; 15
and so on.
Actually, I know how to solve this using if statements, but I have a large amount of data and this would take a very long time. I am sure that there is a shorter way to do it.
Thanks.
The short way to do it is
B = A(A(:,1)==M & A(:,2)==D, 3);
I have 2 matrices with the SAME IDs. I need to extract those rows of IDs from mat1 which have their dates within say ±5 days of the dates in the mat2. Same operation for mat2 as well. Please see the data here: UNIQCols = [1 2] ; dateCol = [3] ; valueCol = [4] ; dayRange = +- 15days.
% UniqCol Date Value
mat1 = [2001 2 733427 1001 ;
2001 2 733793 2002 ;
2001 2 734582 2003 ;
3001 1 734220 30 ;
3001 1 734588 20 ;];
mat2 = [2001 2 733790 7777 ;
2001 2 734221 2222 ;
3001 1 734220 10 ;
3001 1 734588 40 ;] ;
ans1 = [2001 2 733793 2002 ; 3001 1 734220 30 ; 3001 1 734588 20 ] ;
ans2 = [2001 2 733790 7777 ; 3001 1 734220 10 ; 3001 1 734588 40 ] ;
This needs to be a vectorized operation! The IDs are ordered in increasing order of dates. Dates are separated on either a quarterly or an annual basis, so the range will always be << (date2-date1). Please help and thanks!
Here is a function based on the similar question I mentioned in my comments. Remember that your matrices have to be sorted by date.
function match_for_xn = match_by_distance(xn, xm, maxdist)
%# Returns a logical index of the elements in vector xn that are closer
%# than maxdist to at least one element of vector xm (both sorted ascending)
match_for_xn = false(length(xn), 1);
last_M = 1;
for N = 1:length(xn)
%# search through M until we find a match.
for M = last_M:length(xm)
dist_to_curr = xm(M) - xn(N);
if abs(dist_to_curr) < maxdist
match_for_xn(N) = 1;
last_M = M;
break
elseif dist_to_curr > 0
last_M = M;
break
else
continue
end
end %# M
end %# N
And the test script:
mat1 = sortrows([
2001 2 733427 1001 ;
2001 2 733793 2002 ;
2001 2 734582 2003 ;
3001 1 734220 30 ;
3001 1 734588 20 ;
],3);
mat2 = sortrows([
2001 2 733790 7777 ;
2001 2 734221 2222 ;
3001 1 734220 10 ;
3001 1 734588 40 ;
],3);
mat1_index = match_by_distance(mat1(:,3),mat2(:,3),5);
ans1 = mat1(mat1_index,:);
mat2_index = match_by_distance(mat2(:,3),mat1(:,3),5);
ans2 = mat2(mat2_index,:);
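I haven't tried any vectorized solution for your problem. If you find one, test it against this solution and compare the timing and memory consumption (including the sorting step). Note that the function compares dates only and does not check the ID columns, so dates belonging to different IDs can match each other (in the sample, 734221 in mat2 is within 5 days of 734220 in mat1); apply it per ID if that matters.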
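For comparison, a minimal Python sketch of the same nearest-neighbour test using binary search instead of the running pointer. It is not a vectorized MATLAB solution; it only shows the idea. It assumes xm is sorted ascending, uses the same strict `<` comparison, and works on the date column alone.

```python
from bisect import bisect_left

def match_by_distance(xn, xm, maxdist):
    """Flag each value in xn that is closer than maxdist to some value in sorted xm."""
    flags = []
    for value in xn:
        pos = bisect_left(xm, value)              # first element of xm that is >= value
        neighbours = xm[max(pos - 1, 0):pos + 1]  # nearest candidates on either side
        flags.append(any(abs(value - other) < maxdist for other in neighbours))
    return flags

dates1 = [733427, 733793, 734220, 734582, 734588]  # mat1(:,3), sorted
dates2 = [733790, 734220, 734221, 734588]          # mat2(:,3), sorted
flags1 = match_by_distance(dates1, dates2, 5)
print(flags1)  # [False, True, True, False, True]
```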