Merge overwrite stopping at first match

I currently have three data sets in SAS 9.3.
Data set "Main" contains SKU IDs and customer IDs, as well as various other variables such as week.
Customer_ID week var2 var3 SKU_ID
1 1 x x 1
1 2 x x 1
1 3 x x 1
1 1 x x 2
1 2 x x 2
2 1 x x 1
2 2 x x 1
2 3 x x 1
2 1 x x 2
2 2 x x 2
Data set "standard" contains the standard location for each Customer_ID.
Data set "overrides" contains an override location (where applicable) for a certain SKU for certain customers. Thus, it contains SKU_ID, customer_id and location.
standard data set
customer_id location
1 A
1 A
2 C
2 C
override dataset
customer_id sku_id location
1 1 A
1 2 B
When I merge all of the data sets, this is what I get:
Customer_ID week var2 var3 SKU_ID location
1 1 x x 1 A
1 2 x x 1 A
1 3 x x 1 A
1 1 x x 2 B
1 2 x x 2 A
2 1 x x 1 C
2 2 x x 1 C
2 3 x x 1 C
versus what I want it to look like:
Customer_ID week var2 var3 SKU_ID location
1 1 x x 1 A
1 2 x x 1 A
1 3 x x 1 A
1 1 x x 2 B
1 2 x x 2 B
2 1 x x 1 C
2 2 x x 1 C
2 3 x x 1 C
proc sort data=overrides; by Location SKU_ID; run;
Proc sort data= main; by Location SKU_ID;
run;
Proc sort data= Standard; by Location;
run;
data Loc_Standard No_LOC;
Merge Main(in = a) Standard(in = b);
by Location;
if a and b then output Loc_standard;
else if b then output No_LOC;
run;
/*overwrites standard location if an override for a sku exist*/
Data Loc_w_overrides;
Merge Loc_standard overrides;
by Location SKU_ID;
run;

That is how SAS combines datasets. While each dataset still has observations to contribute to a BY group, the values are read in the order the datasets appear in the MERGE statement, so values from a later dataset overwrite same-named values from an earlier one. But once one dataset runs out of new observations for the BY group, SAS does not read its values in again, so the value read from the other dataset is no longer replaced.
Either drop the original variable and just use the value from the second dataset. Basically this will set up a one-to-many merge.
Or rename the override variable and add your own logic for when to apply the override.
I am not sure how you are getting the result you posted since you do not have any standards for CUSTOMER_ID=2 in your posted data. If the values of location do not depend on customer_id, then why is that variable in the standards and override datasets?
Perhaps you meant that the standards dataset only has SKU_ID and location?
data main_w_standards;
merge main standards;
by sku_id ;
run;
proc sort data=main_w_standards;
by customer_id sku_id;
run;
data main_w_overrides;
merge main_w_standards overrides(in=in2 rename=(location=override));
by customer_id SKU_ID;
if in2 then location=override;
drop override;
run;
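For readers who want to see the override rule outside SAS, here is a minimal Python sketch of the same idea: look up the customer's standard location, then replace it only when an override exists for that customer and SKU. The dictionaries and the function name `resolve_location` are illustrative assumptions, not part of the SAS code; the data comes from the question's sample tables.

```python
# Standard location per customer (from the question's "standard" table).
standard = {1: "A", 2: "C"}

# Overrides keyed by (customer_id, sku_id) (from the "override" table).
overrides = {(1, 1): "A", (1, 2): "B"}

# A few rows of "main" as (customer_id, week, sku_id).
main = [(1, 1, 1), (1, 2, 1), (1, 1, 2), (1, 2, 2), (2, 1, 1), (2, 1, 2)]


def resolve_location(customer_id, sku_id):
    """Return the override if one exists, otherwise the standard location."""
    return overrides.get((customer_id, sku_id), standard.get(customer_id))


# The rule is applied to every row, not just the first row of each group.
result = [(c, w, s, resolve_location(c, s)) for c, w, s in main]
for row in result:
    print(row)
```

Because the lookup runs for every row, both weeks of customer 1, SKU 2 get location B, which is the behaviour the question asks for.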

Why not UPDATE the STANDARD (loc) with the OVERRIDE (oride) and then merge with the customer data?
data loc;
input customer_id Sku_id location:$1.;
cards;
1 1 A
1 2 A
;;;;
proc print;
data oride;
input customer_id sku_id location:$1.;
cards;
1 1 A
1 2 B
;;;;
run;
proc print;
data locoride;
update loc oride;
by cu: sk:;
run;
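As a rough illustration of what an update-style combine does, the Python sketch below applies transaction rows to master rows by key and, like the default behaviour of the SAS UPDATE statement, leaves a master value alone when the transaction value is missing. The function name `apply_updates` and the use of `None` for a missing value are assumptions made for this sketch.

```python
def apply_updates(master, transactions):
    """Return a copy of master with non-missing transaction values applied."""
    merged = dict(master)
    for key, row in transactions.items():
        # Copy the master row so the input dictionaries are never mutated.
        target = dict(merged.get(key, {}))
        for name, value in row.items():
            if value is not None:  # a missing value does not overwrite
                target[name] = value
        merged[key] = target
    return merged


# Keys are (customer_id, sku_id), as in the loc and oride data sets above.
loc = {(1, 1): {"location": "A"}, (1, 2): {"location": "A"}}
oride = {(1, 1): {"location": "A"}, (1, 2): {"location": "B"}}

locoride = apply_updates(loc, oride)
print(locoride)
```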

Related

Q function that computes the value

How would I write a function in q/kdb that computes the most frequent value (the mode) in a vector?
q)l:1 1 1 2 2 2 3 4 1 2 7 6 4
q)where max[a]=a:count each group l
1 2
q)min where max[a]=a:count each group l
1
q)mode:{where max[a]=a:count each group x}
q)min mode l
1
q)mode l
1 2
As you can see above, I would just define a mode function and then apply min to its result to return an atom holding the lowest of the tied values.
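For comparison, the same count-then-filter approach can be sketched in Python with `collections.Counter`. The function name `modes` is an assumption for this sketch; it returns every value tied for the highest count, and `min` then picks the lowest one.

```python
from collections import Counter


def modes(values):
    """Return all values that share the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())  # raises ValueError on an empty input
    return [v for v, n in counts.items() if n == top]


l = [1, 1, 1, 2, 2, 2, 3, 4, 1, 2, 7, 6, 4]
print(modes(l))       # both 1 and 2 occur four times
print(min(modes(l)))  # the lowest of the tied values
```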

Merge two unequal data sets in SAS with replacement

I generated propensity scores in SAS to match two unequal groups with replacement. Now I'm trying to create a dataset with an equal number of observations for both groups, i.e. some observations in group b should repeat, since that is the smaller group. Below I have synthetic data to demonstrate what I'm trying to get.
Indicator Income Matchid
1 7 1
1 8 2
1 4 1
0 6 1
0 9 2
And I want it to look like this
Indicator Income Matchid
1 7 1
1 8 2
1 4 1
0 6 1
0 9 2
0 6 1
In a view you can create a variable that is a group sequence number amenable to modulus evaluation. In a data step, load the two indicator groups into separate hashes; then, for each group, loop up to the larger group size and select rows by index modulo that group's size.
Example:
data have;
input Indicator Income Matchid;
datalines;
1 7 1
1 8 2
1 4 1
0 6 1
0 9 2
;
data have_v;
set have;
by indicator notsorted;
if first.indicator then group_seq=0; else group_seq+1;
run;
data want;
if 0 then set have_v;
declare hash i1 (dataset:'have_v(where=(indicator=1))', ordered:'a');
i1.defineKey('group_seq');
i1.defineData(all:'yes');
i1.defineDone();
declare hash i0 (dataset:'have_v(where=(indicator=0))', ordered:'a');
i0.defineKey('group_seq');
i0.defineData(all:'yes');
i0.defineDone();
do index = 0 to max(i0.num_items, i1.num_items)-1;
group_seq = mod(index,i1.num_items);
i1.find();
output;
end;
do index = 0 to max(i0.num_items, i1.num_items)-1;
group_seq = mod(index,i0.num_items);
i0.find();
output;
end;
stop;
drop index group_seq;
run;
If the two groups were separated into data sets, you could do similar processing utilizing SET options nobs= and point=
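The modulus trick itself is independent of SAS hashes. A minimal Python sketch, assuming the rows are held in plain lists and using the hypothetical helper name `pad_by_cycling`:

```python
def pad_by_cycling(rows, target_size):
    """Repeat rows in order, wrapping around, until target_size rows exist."""
    return [rows[i % len(rows)] for i in range(target_size)]


# (Indicator, Income, Matchid), as in the sample data above.
have = [(1, 7, 1), (1, 8, 2), (1, 4, 1), (0, 6, 1), (0, 9, 2)]

group1 = [r for r in have if r[0] == 1]
group0 = [r for r in have if r[0] == 0]
size = max(len(group1), len(group0))

# The larger group is unchanged; the smaller one wraps around.
want = pad_by_cycling(group1, size) + pad_by_cycling(group0, size)
for row in want:
    print(row)
```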

Calculate minimum and maximum values of each variable in a table in kdb

Consider the following table:
sym A B
X 1 2
Y 4 1
X 6 9
Z 6 3
Z 3 7
Y 1 8
I want to find the minimum A value and maximum B value for each of my symbols X, Y & Z and display them in a new table, i.e.
sym minA maxB
X 1 9
Y 1 8
Z 3 7
Thanks.
This should do it:
select minA:min A, maxB:max B by sym from table
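To make explicit what the grouped select computes, here is a small Python sketch of the same aggregation over the sample rows. The `summary` dictionary is an illustrative structure, not anything kdb+ produces.

```python
# (sym, A, B) rows from the table in the question.
rows = [("X", 1, 2), ("Y", 4, 1), ("X", 6, 9), ("Z", 6, 3), ("Z", 3, 7), ("Y", 1, 8)]

# sym -> (minA, maxB), updated one row at a time.
summary = {}
for sym, a, b in rows:
    if sym in summary:
        min_a, max_b = summary[sym]
        summary[sym] = (min(min_a, a), max(max_b, b))
    else:
        summary[sym] = (a, b)

for sym in sorted(summary):
    print(sym, *summary[sym])
```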

Create a Boolean column displaying comparison between 2 other columns in kdb+

I'm currently learning kdb+/q.
I have a table of data. I want to take 2 columns of data (just numbers), compare them and create a new Boolean column that will display whether the value in column 1 is greater than or equal to the value in column 2.
I am comfortable using the update command to create a new column, but I don't know how to ensure that it is Boolean, how to compare the values and a method to display the "greater-than-or-equal-to-ness" - is it possible to do a simple Y/N output for that?
Thanks.
/ dummy data
q) show t:([] a:1 2 3; b: 0 2 4)
a b
---
1 0
2 2
3 4
/ add column named 'ge' holding the result of a>=b
q) update ge:a>=b from t
a b ge
------
1 0 1
2 2 1
3 4 0
Use a vector conditional:
http://code.kx.com/q/ref/lists/#vector-conditional
q)t:([]c1:1 10 7 5 9;c2:8 5 3 4 9)
q)show r:update goe:?[c1>=c2;1b;0b] from t
c1 c2 goe
---------
1  8  0
10 5  1
7  3  1
5  4  1
9  9  1
Use meta to confirm the goe column is of boolean type:
q)meta r
c  | t f a
---| -----
c1 | j
c2 | j
goe| b
Comparison operators such as <= work well with vectors, but when a function needs atoms as input to perform its operation, you might want to use ' (the each-both operator).
For example, to compare the length of a symbol's string with another column's value:
q)f:{x<=count string y}
q)f[3;`ab]
0b
q)t:([] l:1 2 3; s: `a`bc`de)
q)update r:f'[l;s] from t
l s r
------
1 a 1
2 bc 1
3 de 0
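The difference between a whole-column comparison and a pairwise application of an atom-level function can also be sketched in Python, where `zip` plays the role of each-both. The variable names mirror the q examples above; the rest is an assumption for illustration.

```python
# Element-wise comparison of two columns, giving a boolean column.
c1 = [1, 10, 7, 5, 9]
c2 = [8, 5, 3, 4, 9]
goe = [x >= y for x, y in zip(c1, c2)]


def f(x, y):
    """Atom-level function: is x no greater than the length of y?"""
    return x <= len(y)


# Pairwise application, analogous to f'[l;s] in q.
l = [1, 2, 3]
s = ["a", "bc", "de"]
r = [f(x, y) for x, y in zip(l, s)]

print(goe)
print(r)
```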

some matrix operations and extracting data

I want to ask a question about some matrix operations in MATLAB.
Assume we have this matrix:
A = [1 1 17
1 1 14
1 2 10
1 2 11
2 1 9
2 1 9
2 2 13
2 2 12
3 1 18
3 1 15]
I want the first column (call it M) and the second column (call it D) to select which rows of the matrix are returned, according to the following condition:
The program will ask the user to enter the values of M and then D as follows:
M = input(' ENTER M VALUE = ') ;
D = input(' ENTER D VALUE = ') ;
Now, the output will be the corresponding 2 values to M and D, and these two values will be taken from the third column,
for example:
if M = 1 and D = 2 , the output is B = 10 ; 11
another example:
if M = 3 and D = 1 , the output is B = 18 ; 15
and so on.
Actually, I know how to solve this using if statements, but I have a large amount of data and that would take a very long time. I am sure there is a shorter way to do it.
Thanks.
The short way to do it is
B = A(A(:,1)==M & A(:,2)==D, 3);
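The one-liner builds a logical mask from the first two columns and uses it to pick rows of the third. A minimal Python sketch of the same row filter, using plain lists and the hypothetical function name `select_third_column`:

```python
A = [
    [1, 1, 17], [1, 1, 14],
    [1, 2, 10], [1, 2, 11],
    [2, 1, 9], [2, 1, 9],
    [2, 2, 13], [2, 2, 12],
    [3, 1, 18], [3, 1, 15],
]


def select_third_column(matrix, m, d):
    """Return the third-column values of rows whose first two columns are m, d."""
    return [row[2] for row in matrix if row[0] == m and row[1] == d]


print(select_third_column(A, 1, 2))
print(select_third_column(A, 3, 1))
```

When no row matches, the result is simply an empty list, just as the MATLAB expression returns an empty array.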