Create SAS variable based on values in look-up table - merge

I have two variables (varx and vary) in data set "dat" and need to create a final score by first categorizing varx and vary, and then translating the score categories into a final score according to a look-up table "lookup".
I managed to get past the categorizing part and am now stuck on how to tell SAS to use the categories I created (i.e., "varxcat" and "varycat") as row and column indices of "lookup", grab the value I need for each observation, and put it into a final score variable (call it "score") in "dat".
In R (in which I normally code) this can easily be done with something like a for loop. Is there anything similar in SAS? (I don't have to use "varxcat" and "varycat"; I just need to eventually create "score".)
data dat;
input ID $ varx vary;
datalines;
1 1 1
2 4 5
3 11 12
4 23 14
5 24 20
;
data lookup;
input x01to10 x11to20 x21to30;
datalines;
21 52 73
84 95 96
107 118 149
; /*first row is for y01to10, second row is for y11to20, and third row is for y21to30,
such that if someone's x score is in category 1 and y score is in category 3,
the person's final score should be 107*/
data dat;
set dat;
if varx <= 10 then varxcat = 1;
else if varx > 10 & varx <= 20 then varxcat = 2;
else if varx > 20 & varx <= 30 then varxcat = 3;
if vary <= 10 then varycat = 1;
else if vary > 10 & vary <= 20 then varycat = 2;
else if vary > 20 & vary <= 30 then varycat = 3;
run;
Desired "dat" looks like
data dat;
input ID $ varx vary score;
datalines;
1 1 1 21
2 4 5 21
3 11 12 95
4 23 14 96
5 24 20 96
;

A lookup table for data value mapping is essentially a left join operation. SAS has a lot of ways to left join data, including
SQL
Merge
Hash object
Array (direct addressing)
Formats
Informats
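Here are four ways: SQL, Merge, Array and Hash. The mapping from var* to category is done by the functional mapping int(value/10), which yields the zero-based indices 0, 1 and 2. Note that this mapping sends boundary values such as 10 and 20 to the next category up, unlike the question's <= rules; use ceil(value/10)-1 instead if the boundaries must match exactly: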
data have;
input ID $ varx vary;
datalines;
1 1 1
2 4 5
3 11 12
4 23 14
5 24 20
6 5 29 /* score should be 107 */
;
data lookup;
do index_y = 0 to 2;
do index_x = 0 to 2;
input lookup_value @@;
output;
end;
end;
datalines;
21 52 73
84 95 96
107 118 149
;
*------------------- SQL;
proc sql;
create table want as
select
id, lookup_value as score
from
have
left join
lookup
on
int (have.varx/10) = lookup.index_x
and
int (have.vary/10) = lookup.index_y
order by
id
;
quit;
*------------------- MERGE;
data have2(index=(myindexname=(xcat ycat)));
set have;
xcat = int(varx/10);
ycat = int(vary/10);
run;
proc sort data=lookup;
by index_x index_y;
options msglevel=i;
data want2(keep=id lookup_value rename=(lookup_value=score));
merge
have2(rename=(xcat=index_x ycat=index_y) in=left)
lookup
;
by index_x index_y;
if left;
run;
proc sort data=want2;
by id;
run;
*------------------- ARRAY DIRECT ADDRESSING;
data want3;
array lookup [0:2,0:2] _temporary_;
if _n_ = 1 then do until (endlookup);
set lookup end=endlookup;
lookup[index_x,index_y] = lookup_value;
end;
set have;
xcat = int(varx/10); /* integer subscripts, same mapping as the other methods */
ycat = int(vary/10);
score = lookup[xcat,ycat];
keep id score;
run;
*------------------- HASH LOOKUP;
data want4;
if 0 then set lookup;
if _n_ = 1 then do;
declare hash lookup(dataset:'lookup');
lookup.defineKey('index_x', 'index_y');
lookup.defineData('lookup_value');
lookup.defineDone();
end;
set have;
index_x = int(varx/10);
index_y = int(vary/10);
if (lookup.find() = 0) then
score = lookup_value;
keep id score;
run;
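The direct-addressing idea can also be sketched outside SAS. The following Python snippet is only an illustration, not part of the original answer: the names LOOKUP, category and score are hypothetical, and it assumes values fall in 1..30 and uses ceil(value/10)-1 so that boundary values such as 20 stay in the lower category, as in the question's <= rules.

```python
import math

# Look-up table from the question: rows are y categories, columns are x categories.
LOOKUP = [
    [21, 52, 73],     # y in 1..10
    [84, 95, 96],     # y in 11..20
    [107, 118, 149],  # y in 21..30
]

def category(value, width=10):
    """Zero-based category index: 1..10 -> 0, 11..20 -> 1, 21..30 -> 2."""
    return math.ceil(value / width) - 1

def score(varx, vary):
    """Return the table value for (varx, vary), or None when out of range."""
    ix, iy = category(varx), category(vary)
    if 0 <= ix < 3 and 0 <= iy < 3:
        return LOOKUP[iy][ix]
    return None

# Rows of "have" from the answer: (ID, varx, vary).
rows = [("1", 1, 1), ("2", 4, 5), ("3", 11, 12),
        ("4", 23, 14), ("5", 24, 20), ("6", 5, 29)]
scores = {row_id: score(x, y) for row_id, x, y in rows}
print(scores)
```

With this boundary handling, ID 5 (vary = 20) lands in the second y category and gets 96, matching the desired output in the question.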

Related

How can I get the cumulative sum between a predefined number of days in SAS EG

I would like to find a way to calculate the cumulative sum between a predefined number of days.
Just to clarify, I am looking to calculate the cumulative sum between days. Those can be consecutive (e.g. 03AUG and 04AUG) or not (e.g. 04AUG and 06AUG).
Below the data I have:
data have;
input ID $ DT:date9. Amount;
format DT date9.;
datalines;
A 09JUL2021 3600
A 03AUG2021 456
A 04AUG2021 33
A 06AUG2021 235
A 07AUG2021 100
A 09AUG2021 86
A 12AUG2021 456
A 24AUG2021 22
A 25AUG2021 987
A 26AUG2021 916
A 27AUG2021 81
;
run;
I want to be able to create a new variable that shows the cumulative amount between 2 or more days.
I should be able every time to select if I want the cumulative amount between 2 days, or 3 days and so on.
Below the data I want, when I select to calculate the sum between two days:
data what_I_want;
input ID $ DT:date9. Amount Sum_Between_Days;
format DT date9.;
datalines;
A 09JUL2021 3600 0
A 03AUG2021 456 0
A 04AUG2021 33 489
A 06AUG2021 235 268
A 07AUG2021 100 335
A 09AUG2021 86 186
A 12AUG2021 456 0
A 24AUG2021 22 0
A 25AUG2021 987 0
A 26AUG2021 916 1925
A 27AUG2021 81 1984
;
run;
Below the data I want when I select to calculate the sum between 3 days:
data what_I_want;
input ID $ DT:date9. Amount Sum_Between_Days;
format DT date9.;
datalines;
A 09JUL2021 3600 0
A 03AUG2021 456 0
A 04AUG2021 33 0
A 06AUG2021 235 724
A 07AUG2021 100 0
A 09AUG2021 86 186
A 12AUG2021 456 0
A 24AUG2021 22 0
A 25AUG2021 987 0
A 26AUG2021 916 0
A 27AUG2021 81 2006
;
run;
Hopefully this makes sense, but please let me know if not.
Thanks in advance.

Merge two unequal data sets in SAS with replacement

I generated propensity scores in SAS to match two unequal groups with replacement. Now I'm trying to create a dataset where there are an equal number of observations for both groups, i.e., there should be observations in group b that repeat, since that is the smaller group. Below I have synthetic data to demonstrate what I'm trying to get.
Indicator Income Matchid
1 7 1
1 8 2
1 4 1
0 6 1
0 9 2
And I want it to look like this
Indicator Income Matchid
1 7 1
1 8 2
1 4 1
0 6 1
0 9 2
0 6 1
In a view (or, as here, a data step) you can create a variable that holds a sequence number within each group, suitable for modulus evaluation. Then, in a second data step, load the two indicator groups into separate hashes and, for each group, loop over the larger group size, selecting rows by the loop index modulo that group's size.
Example:
data have;
input Indicator Income Matchid;
datalines;
1 7 1
1 8 2
1 4 1
0 6 1
0 9 2
;
data have_v;
set have;
by indicator notsorted;
if first.indicator then group_seq=0; else group_seq+1;
run;
data want;
if 0 then set have_v;
declare hash i1 (dataset:'have_v(where=(indicator=1))', ordered:'a');
i1.defineKey('group_seq');
i1.defineData(all:'yes');
i1.defineDone();
declare hash i0 (dataset:'have_v(where=(indicator=0))', ordered:'a');
i0.defineKey('group_seq');
i0.defineData(all:'yes');
i0.defineDone();
do index = 0 to max(i0.num_items, i1.num_items)-1;
group_seq = mod(index,i1.num_items);
i1.find();
output;
end;
do index = 0 to max(i0.num_items, i1.num_items)-1;
group_seq = mod(index,i0.num_items);
i0.find();
output;
end;
stop;
drop index group_seq;
run;
If the two groups were separated into two data sets, you could do similar processing using the SET options nobs= and point=.
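The modulus trick itself is independent of SAS. Here is a minimal Python sketch of it, for illustration only: pad_by_cycling, group1 and group0 are hypothetical names, and both groups are assumed to be non-empty.

```python
def pad_by_cycling(group, target_size):
    """Repeat the rows of `group` in order until there are `target_size` rows."""
    return [group[i % len(group)] for i in range(target_size)]

# (Indicator, Income, Matchid) rows from the question.
rows = [(1, 7, 1), (1, 8, 2), (1, 4, 1), (0, 6, 1), (0, 9, 2)]
group1 = [r for r in rows if r[0] == 1]
group0 = [r for r in rows if r[0] == 0]

# Both groups are padded to the size of the larger one.
size = max(len(group1), len(group0))
balanced = pad_by_cycling(group1, size) + pad_by_cycling(group0, size)
for row in balanced:
    print(*row)
```

The larger group passes through unchanged, while the smaller group wraps around to its first row again, which gives the repeated "0 6 1" row asked for in the question.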

Dividing a matrix into two parts

I am trying to classify my dataset. To do this, I will use the 4th column of my dataset. If the 4th column of the dataset is equal to 1, that row will be added to a new matrix called Q1. If the 4th column of the dataset is equal to 2, that row will be added to matrix Q2.
My code:
i = input('Enter a start row: ');
j = input('Enter a end row: ');
search = importfiledataset('search-queries-features.csv',i,j);
[n, p] = size(search);
if j>n
disp('Please enter a smaller number!');
end
for s = i:j
class_id = search(s,4);
if class_id == 1
Q1 = search(s,1:4)
elseif class_id ==2
Q2 = search(s,1:4)
end
end
This calculates the Q1 and Q2 matrices, but they all are 1x4 and when it gives new Q1 the old one is deleted. I need to add new row and make it 2x4 if conditions are true. I need to expand my Q1 matrix.
Briefly I am trying to divide my dataset into two parts using for loops and if statements.
Dataset:
I need outcome like:
Q1 = [30 64 1 1
30 62 3 1
30 65 0 1
31 59 2 1
31 65 4 1
33 58 10 1
33 60 0 1
34 58 30 1
34 60 1 1
34 61 10 1]
Q2 = [34 59 0 2
34 66 9 2]
How can I prevent my code from deleting previous rows of Q1 and Q2 and obtain the entire matrices?
The main problem in your calculation is that you overwrite Q1 and Q2 each loop iteration. Best solution: get rid of the loops and use logical indexing.
You can use logical indexing to quickly determine where a column is equal to 1 or 2:
search = [
30 64 1 1
30 62 3 1
30 65 0 1
31 59 2 1
31 65 4 1
33 58 10 1
33 60 0 1
34 59 0 2
34 66 9 2
34 58 30 1
34 60 1 1
34 61 10 1
];
Q1 = search(search(:,4)==1,:) % == compares each entry in the fourth column to 1
Q2 = search(search(:,4)==2,:)
Q1 =
30 64 1 1
30 62 3 1
30 65 0 1
31 59 2 1
31 65 4 1
33 58 10 1
33 60 0 1
34 58 30 1
34 60 1 1
34 61 10 1
Q2 =
34 59 0 2
34 66 9 2
Warning: Slow solution!
If you are hell bent on using loops, make sure to not overwrite your variables. Either extend them each iteration (which is very, very slow):
Q1=[];
Q2=[];
for ii = 1:size(search,1) % loop over all rows
if search(ii,4)==1
Q1 = [Q1;search(ii,:)];
end
if search(ii,4)==2
Q2 = [Q2;search(ii,:)];
end
end
MATLAB will put orange wiggles beneath Q1 and Q2, because it's a bad idea to grow arrays in-place. Alternatively, you can preallocate them as large as search and strip off the excess:
Q1 = zeros(size(search)); % Initialise to be as large as search
Q2 = zeros(size(search));
Q1kk = 1; % Initialise counters
Q2kk = 1;
for ii = 1:size(search,1) % loop over all rows
if search(ii,4)==1
Q1(Q1kk,:) = search(ii,:); % store
Q1kk = Q1kk + 1; % Increase row counter
end
if search(ii,4)==2
Q2(Q2kk,:) = search(ii,:);
Q2kk = Q2kk + 1;
end
end
Q1 = Q1(1:Q1kk-1,:); % strip off excess rows
Q2 = Q2(1:Q2kk-1,:);
Another option using accumarray, if Q is your original matrix:
Q = accumarray(Q(:,4),1:size(Q,1),[],@(x){Q(x,:)});
You can access the result with Q{1} (for class_id = 1), Q{2} (for class_id = 2) and so on...
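The same "one bucket per class_id" idea can be sketched in Python for comparison. This is only an illustration of the grouping logic, not MATLAB code; the names groups, Q1 and Q2 simply mirror the question, and the rows are the sample matrix from the answer above.

```python
from collections import defaultdict

# Sample rows from the answer; the 4th column is the class id.
search = [
    [30, 64, 1, 1], [30, 62, 3, 1], [30, 65, 0, 1], [31, 59, 2, 1],
    [31, 65, 4, 1], [33, 58, 10, 1], [33, 60, 0, 1], [34, 59, 0, 2],
    [34, 66, 9, 2], [34, 58, 30, 1], [34, 60, 1, 1], [34, 61, 10, 1],
]

# Append each row to the list for its class id instead of overwriting it.
groups = defaultdict(list)
for row in search:
    groups[row[3]].append(row)

Q1, Q2 = groups[1], groups[2]
print(len(Q1), len(Q2))
```

Appending to a per-class container is what keeps earlier rows from being lost, which is the problem in the original loop.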

Counting number of items in a matrix according to a period of time

I have a little issue.
For example, I have a matrix
m=[11 1 9 ;
22 2 10;
33 3 11;
44 4 14;
55 1 15;
66 4 20;
77 1 20;
88 1 24;
99 2 24 ]
where the first column is the id of a product, the second column is the id of the buyer, and the third is the day of purchase.
I would like to count the number of articles bought by a buyer for a period delta=5 days.
I tried this code but I have an issue: I would like to obtain a final matrix where the rows hold the ids of the users and the columns hold the number of articles bought in each period.
m=[11 1 9 ;22 2 11; 33 3 10; 44 4 15; 55 1 15;66 4 20; 77 1 20; 88 1 24; 99 2 24 ]
D=m(:,3)
p=0;
delta=5
res=zeros(length(unique(m(:,2))),(div(max(D),delta)))
NbA=zeros(1,length(unique(m(:,2))))
while p<max(D)+1
p
pos=find((m(:,3)>=p)& (m(:,3)<p+delta))
Mtemp=u(pos,:);
[r c]=size(Mtemp);
Ustemp=Mtemp(:,2);
UnUstemp=unique(Ustemp);
for i=1:length(UnUstemp)clc
us=UnUstemp(i);
[n l]=size(find(Ustemp==us));
NbA(i)=n;
end
res=[res ;NbA];
p=p+delta;
end
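One possible reading of the desired result can be sketched as follows. This is a Python illustration rather than MATLAB, and it rests on assumptions not stated in the question: periods are taken as [0,5), [5,10), and so on (as in the while loop above, which starts at p=0), and counts_per_period is a hypothetical helper name.

```python
def counts_per_period(purchases, delta):
    """Count purchases per buyer (rows) and per period of `delta` days (columns)."""
    buyers = sorted({buyer for _, buyer, _ in purchases})
    n_periods = max(day for _, _, day in purchases) // delta + 1
    row_of = {buyer: i for i, buyer in enumerate(buyers)}
    table = [[0] * n_periods for _ in buyers]
    for _, buyer, day in purchases:
        # Assumed period index: days 0..4 -> 0, 5..9 -> 1, and so on.
        table[row_of[buyer]][day // delta] += 1
    return buyers, table

# (product id, buyer id, day) rows of the first matrix m in the question.
m = [(11, 1, 9), (22, 2, 10), (33, 3, 11), (44, 4, 14), (55, 1, 15),
     (66, 4, 20), (77, 1, 20), (88, 1, 24), (99, 2, 24)]
buyers, table = counts_per_period(m, delta=5)
for buyer, row in zip(buyers, table):
    print(buyer, row)
```

Sizing the table once from the number of buyers and periods, and incrementing a single cell per purchase, avoids appending a new row on every pass of the loop.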

Matlab isn't incrementing my variable

I have the following matrix declared in Matlab:
EmployeeData =
1 20 100000 42 14
2 15 95000 35 14
3 18 70000 28 14
4 10 85000 35 14
5 10 40000 21 12
6 4 45000 14 8
7 3 50000 21 10
8 5 55000 21 14
9 1 25000 14 7
10 2 50000 21 9
42 4 100000 42 10
Where column 1 represents ID numbers, 2 represents years, 3 is salary, 4 is vacation days, and 5 is sick days. I am trying to find the maximum value of a column (in this case the salary column), and print out the ID associated with that value. If more than one employee has the maximum value, all the IDs with that maximum are supposed to be shown. So here is how I naively implemented a way to do it:
>> maxVal = [];
>> j = 1;
>> for i = EmployeeData(:, 3)
if i == max(EmployeeData(:, 3))
maxVal = [maxVal EmployeeData(j, 1)];
end
j = j + 1;
end
But it shows maxVal to be [] in my workspace variables, instead of [1 42] as I expected. Upon inserting a disp(i) in the for loop above the if to debug, I get the following output:
100000
95000
70000
85000
40000
45000
50000
55000
25000
50000
Just like I expected. But when I switch out that disp(i) with a disp(j), I get this for my output:
1
What am I doing wrong? Should this not work?
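MATLAB for loops iterate over the columns of the loop expression, not its rows. EmployeeData(:, 3) is a column vector, so the loop body runs only once with i set to the whole column. That is why disp(j) prints only 1, and why the if test (which requires every element of the comparison to be true) never passes. Transpose the column to a row vector so that each iteration gets a single value: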
for i = EmployeeData(:, 3)' % NOTE THE TRANSPOSE
...
end
EDIT: Note that you can do what you're trying to do without a for loop:
maxVal = EmployeeData(EmployeeData(:,3) == max(EmployeeData(:,3)),1);
Is this what you want?
>> EmployeeData(EmployeeData(:,3)==max(EmployeeData(:,3)),1)
ans =
1
42
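The selection logic (find the maximum, then keep every ID that attains it) can be sketched in Python as a cross-check. This is only an illustration, not MATLAB; employee_data, max_salary and max_ids are hypothetical names, and the rows are copied from the question.

```python
# Columns: ID, years, salary, vacation days, sick days.
employee_data = [
    [1, 20, 100000, 42, 14], [2, 15, 95000, 35, 14], [3, 18, 70000, 28, 14],
    [4, 10, 85000, 35, 14], [5, 10, 40000, 21, 12], [6, 4, 45000, 14, 8],
    [7, 3, 50000, 21, 10], [8, 5, 55000, 21, 14], [9, 1, 25000, 14, 7],
    [10, 2, 50000, 21, 9], [42, 4, 100000, 42, 10],
]

# Compute the maximum once, then compare each row's salary against it.
max_salary = max(row[2] for row in employee_data)
max_ids = [row[0] for row in employee_data if row[2] == max_salary]
print(max_ids)
```

Comparing one scalar per row against the precomputed maximum is the element-wise behaviour that the transposed MATLAB loop and the logical-indexing one-liner both rely on.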