member check and date range check in Matlab - matlab

I have 2 matrices with the SAME IDs. I need to extract those rows of IDs from mat1 which have their dates within say ±5 days of the dates in the mat2. Same operation for mat2 as well. Please see the data here: UNIQCols = [1 2] ; dateCol = [3] ; valueCol = [4] ; dayRange = +- 15days.
% UniqCol Date Value
mat1 = [2001 2 733427 1001 ;
2001 2 733793 2002 ;
2001 2 734582 2003 ;
3001 1 734220 30 ;
3001 1 734588 20 ;];
mat2 = [2001 2 733790 7777 ;
2001 2 734221 2222 ;
3001 1 734220 10 ;
3001 1 734588 40 ;] ;
ans1 = [2001 2 733793 2002 ; 3001 1 734220 30 ; 3001 1 734588 20 ] ;
ans2 = [2001 2 733790 7777 ; 3001 1 734220 10 ; 3001 1 734588 40 ] ;
This needs to be a vectorized operation! The IDs are ordered in increasing order of dates. Dates are either separated on Q or Annual basis. So the range will be always << (date2-date1) Please help and thanks!

Here is a function based on similar question I mentioned in my comments. Remember your matrices has to be sorted by date.
function match_for_xn = match_by_distance(xn, xm, maxdist)
%#Generates index for elements in vector xn that close to any of elements in
%#vector xm at least by distance maxdist
match_for_xn = false(length(xn), 1);
last_M = 1;
for N = 1:length(xn)
%# search through M until we find a match.
for M = last_M:length(xm)
dist_to_curr = xm(M) - xn(N);
if abs(dist_to_curr) < maxdist
match_for_xn(N) = 1;
last_M = M;
break
elseif dist_to_curr > 0
last_M = M;
break
else
continue
end
end %# M
end %# N
And the test script:
mat1 = sortrows([
2001 2 733427 1001 ;
2001 2 733793 2002 ;
2001 2 734582 2003 ;
3001 1 734220 30 ;
3001 1 734588 20 ;
],3);
mat2 = sortrows([
2001 2 733790 7777 ;
2001 2 734221 2222 ;
3001 1 734220 10 ;
3001 1 734588 40 ;
],3);
mat1_index = match_by_distance(mat1(:,3),mat2(:,3),5);
ans1 = mat1(mat1_index,:);
mat2_index = match_by_distance(mat2(:,3),mat1(:,3),5);
ans2 = mat2(mat2_index,:);
I haven't tried any vectorized solution for your problem. If you get any try it against this solution and check the timing and memory consumption (include sorting step).

Related

Merge and create variable in merged dataset in same step

I am merging two datasets in SAS, "set_x" and "set_y", and would like to create a variable "E" in the resulting merged dataset "matched":
* Create set_x *;
data set_x ;
input merge_key A B ;
datalines;
1 24 25
2 25 25
3 30 32
4 32 32
5 20 32
6 1 1
;
run;
* Create set_y *;
data set_y ;
input merge_key C D ;
datalines;
1 1 1
2 2 1
3 1 1
4 2 1
5 1 1
7 1 1
;
run;
* Merge and create E *;
data matched unmatched_x unmatched_y ;
merge set_x (in=x) set_y (in=y) ;
by merge_key ;
if x=y then output matched;
else if x then output unmatched_x ;
else if y then output unmatched_y ;
IF C = 2 THEN DO ;
E = A ;
END;
ELSE DO ;
E = floor(B - D) ;
END ;
run ;
However, in the resulting table "matched", the values of E are missing. E is correctly calculated if I only output matched values (i.e. using if x=y;).
Is it possible to create "E" in the same data step if outputting unmatched as well as matched observations?
You have output the results before E is computed, then E is set to missing when next iteration starts. So you want E to be available before the output,
data matched unmatched_x unmatched_y ;
merge set_x (in=x) set_y (in=y) ;
by merge_key ;
IF C = 2 THEN DO ;
E = A ;
END;
ELSE DO ;
E = floor(B - D) ;
END ;
if x=y then output matched;
else if x then output unmatched_x ;
else if y then output unmatched_y ;
run ;

Matlab isn't incrementing my variable

I have the following matrix declared in Matlab:
EmployeeData =
1 20 100000 42 14
2 15 95000 35 14
3 18 70000 28 14
4 10 85000 35 14
5 10 40000 21 12
6 4 45000 14 8
7 3 50000 21 10
8 5 55000 21 14
9 1 25000 14 7
10 2 50000 21 9
42 4 100000 42 10
Where column 1 represents ID numbers, 2 represents years, 3 is salary, 4 is vacation days, and 5 is sick days. I am trying to find the maximum value of a column (in this case the salary column), and print out the ID associated with that value. If more than one employee has the maximum value, all the IDs with that maximum are supposed to be shown. So here is how I naively implemented a way to do it:
>> maxVal = [];
>> j = 1;
>> for i = EmployeeData(:, 3)
if i == max(EmployeeData(:, 3))
maxVal = [maxVal EmployeeData(j, 1)];
end
j = j + 1;
end
But it shows maxVal to be [] in my workspace variables, instead of [1 42] as I expected. Upon inserting a disp(i) in the for loop above the if to debug, I get the following output:
100000
95000
70000
85000
40000
45000
50000
55000
25000
50000
Just like I expected. But when I switch out that disp(i) with a disp(j), I get this for my output:
1
What am I doing wrong? Should this not work?
MATLAB for loops operate on rows, not columns. You should try replacing your for loop with:
for i = EmployeeData(:, 3)' % NOTE THE TRANSPOSE
...
end
EDIT: Note that you can do what you're trying to do without a forloop:
maxVal = EmployeeData(EmployeeData(:,3) == max(EmployeeData(:,3)),1);
Is this what you want?
>> EmployeeData(EmployeeData(:,3)==max(EmployeeData(:,3)),1)
ans =
1
42

matlab how to delete rows using logical index

I have a matrix:
A = [ 4567 345; 45 6787; 3345 NaN; 87 6787]
and a vector
B = [ 4567; 45; 8976 ]
I want to compare A and B and delete in A all the rows that do not have at least one number which is included in matrix B.
The resulting matrix would be:
[ 4567 345;
45 6787 ]
Here is the code:
idx=ismember(A(:,1:2),B); %%create a logical index in order to see if A includes elements of B
n = length(A)
for i=1:n
if (idx(i,1)==0)& (idx(i,2)==0)
A(i,:)=[];
end
end
However I got this error:
Index of element to remove exceeds matrix dimensions.
I tried with another solution but I get the same error.
n = length(A)
for i=1:n
if (find(idx(i,1)==0))& (find(idx(i,2)==0))
A(i,:)=[];
end
end
You do not need a loop in this logical indexing task:
ismember(A,B)
ans =
1 0
1 0
0 0
0 0
All you need is to keep those rows with at least one match with any(...,2):
idx = any(ismember(A,B),2)
idx =
1
1
0
0
The result:
A(idx,:)
ans =
4567 345
45 6787
The error is caused by the fact that your loop runs from 1:n but you are removing rows from the matrix making it effectively shorter than n.
A = [ 4567 345; 45 6787; 3345 NaN; 87 6787]
B = [ 8976; 45; 4567 ]
z = sum(ismember(A,B)');
NewMatrix=[];
for i = 1:length(z)
if z(i)>= 1
newMatrix = A(i,:);
NewMatrix=[NewMatrix;newMatrix];
end
end
NewMatrix

How can I divide each row of a matrix by a fixed row?

Suppose I have a matrix like:
100 200 300 400 500 600
1 2 3 4 5 6
10 20 30 40 50 60
...
I wish to divide each row by the second row (each element by the corresponding element), so I'll get:
100 100 100 100 100 100
1 1 1 1 1 1
10 10 10 10 10 10
...
Hw can I do it (without writing an explicit loop)?
Use bsxfun:
outMat = bsxfun (#rdivide, inMat, inMat(2,:));
The 1st argument to bsxfun is a handle to the function you want to apply, in this case right-division.
Here's a couple more equivalent ways:
M = [100 200 300 400 500 600
1 2 3 4 5 6
10 20 30 40 50 60];
%# BSXFUN
MM = bsxfun(#rdivide, M, M(2,:));
%# REPMAT
MM = M ./ repmat(M(2,:),size(M,1),1);
%# repetition by multiplication
MM = M ./ ( ones(size(M,1),1)*M(2,:) );
%# FOR-loop
MM = zeros(size(M));
for i=1:size(M,1)
MM(i,:) = M(i,:) ./ M(2,:);
end
The best solution is the one using BSXFUN (as posted by #Itamar Katz)
You can now use array vs matrix operations.
This will do the trick :
mat = [100 200 300 400 500 600
1 2 3 4 5 6
10 20 30 40 50 60];
result = mat ./ mat(2,:)
which will output :
result =
100 100 100 100 100 100
1 1 1 1 1 1
10 10 10 10 10 10
This will work in Octave and Matlab since R2016b.

How can I perform this cumulative sum in MATLAB?

I want to calculate a cumulative sum of the values in column 2 of dat.txt below for each string of ones in column 1. The desired output is shown as dat2.txt:
dat.txt dat2.txt
1 20 1 20 20 % 20 + 0
1 22 1 22 42 % 20 + 22
1 20 1 20 62 % 42 + 20
0 11 0 11 11
0 12 0 12 12
1 99 1 99 99 % 99 + 0
1 20 1 20 119 % 20 + 99
1 50 1 50 169 % 50 + 119
Here's my initial attempt:
fid=fopen('dat.txt');
A =textscan(fid,'%f%f');
in =cell2mat(A);
fclose(fid);
i = find(in(2:end,1) == 1 & in(1:end-1,1)==1)+1;
out = in;
cumulative =in;
cumulative(i,2)=cumulative (i-1,2)+ cumulative(i,2);
fid = fopen('dat2.txt','wt');
format short g;
fprintf(fid,'%g\t%g\t%g\n',[out cumulative(:)]');
fclose(fid);
Here's a completely vectorized (albeit somewhat confusing-looking) solution that uses the functions CUMSUM and DIFF along with logical indexing to produce the results you want:
>> data = [1 20;... %# Initial data
1 22;...
1 20;...
0 11;...
0 12;...
1 99;...
1 20;...
1 50];
>> data(:,3) = cumsum(data(:,2)); %# Add a third column containing the
%# cumulative sum of column 2
>> index = (diff([0; data(:,1)]) > 0); %# Find a logical index showing where
%# continuous groups of ones start
>> offset = cumsum(index.*(data(:,3)-data(:,2))); %# An adjustment required to
%# zero the cumulative sum
%# at the start of a group
%# of ones
>> data(:,3) = data(:,3)-offset; %# Apply the offset adjustment
>> index = (data(:,1) == 0); %# Find a logical index showing where
%# the first column is zero
>> data(index,3) = data(index,2) %# For each zero in column 1 set the
%# value in column 3 to be equal to
data = %# the value in column 2
1 20 20
1 22 42
1 20 62
0 11 11
0 12 12
1 99 99
1 20 119
1 50 169
Not completely vectorized solution (it loops through the segments of sequential 1s), but should be faster. It's doing only 2 loops for your data. Uses MATLAB's CUMSUM function.
istart = find(diff([0; d(:,1)])==1); %# start indices of sequential 1s
iend = find(diff([d(:,1); 0])==-1); %# end indices of sequential 1s
dcum = d(:,2);
for ind = 1:numel(istart)
dcum(istart(ind):iend(ind)) = cumsum(dcum(istart(ind):iend(ind)));
end
dlmwrite('dat2.txt',[d dcum],'\t') %# write the tab-delimited file
d=[
1 20
1 22
1 20
0 11
0 12
1 99
1 20
1 50
];
disp(d)
out=d;
%add a column
out(:,3)=0;
csum=0;
for(ind=1:length(d(:,2)))
if(d(ind,1)==0)
csum=0;
out(ind,3)=d(ind,2);
else
csum=csum+d(ind,2);
out(ind,3)=csum;
end
end
disp(out)