Z-score with NaN values in MATLAB (vectorized)

I am trying to calculate the z-score for a vector of 5000 rows that contains many NaN values. I have to do this many times, so I don't want to use a loop; I was hoping to find a vectorized solution.
The loop solution:
for i = 1:numel(val)
    vec(i,1) = (val(i,1) - nanmean(val(:,1)))/nanstd(val(:,1));
end
A partial vectorized solution:
zscore(vec(find(isnan(vec(1:end)) == 0)))
But this returns a vector whose length is that of the original vector minus the number of NaN values, so it isn't the same size as the original.
I want to calculate the z-score for the vector and then interpolate the missing data afterwards. I have to do this hundreds of times, so I am looking for a fast vectorized approach.

This is a vectorized solution:
% generate some example data with NaNs.
val = reshape(magic(4), 16, 1);
val(10) = NaN;
val(17) = NaN;
Here's the code:
valWithoutNaNs = val(~isnan(val));
valMean = mean(valWithoutNaNs);
valSD = std(valWithoutNaNs);
valZscore = (val-valMean)/valSD;
The column vector valZscore then contains the z-scores, with NaN wherever val, the original measurement data, is NaN.
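Since the question also asks about interpolating the missing values afterwards, here is a minimal sketch of one way to do that, assuming linear interpolation over the sample index is acceptable. The variable names (isValid, zFilled) and the short test vector are illustrative and not from the original post. Leading or trailing NaNs would stay NaN, because interp1 does not extrapolate by default.

```matlab
% Illustrative data (not from the original post): NaNs mark missing samples.
val = [1; NaN; 3; 4; NaN; 6];

% Z-score using only the non-NaN entries, as in the answer above.
isValid = ~isnan(val);
z = (val - mean(val(isValid))) / std(val(isValid));

% Fill the gaps by linear interpolation over the sample index.
idx = (1:numel(val)).';
zFilled = z;
zFilled(~isValid) = interp1(idx(isValid), z(isValid), idx(~isValid), 'linear');
```

Because the z-score is a linear function of the data, interpolating the z-scores gives the same result as interpolating the data first and then standardizing with the same mean and standard deviation.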

Sorry this answer is 6 months late, but for anyone else who comes across this thread:
The accepted answer isn't fully vectorised in that it doesn't do what the real zscore does so beautifully: That is, do zscores along a particular dimension of a matrix.
If you want to calculate zscores of a large number of vectors at once, as the OP says he is doing, the best solution is this:
Z = bsxfun(@rdivide, bsxfun(@minus, X, nanmean(X)), ...
    nanstd(X));
To do it along an arbitrary dimension, pass the dimension to nanmean and nanstd (nanstd takes it as the third argument, after the normalization flag), and bsxfun takes care of the rest.
nanzscore = @(X,DIM) bsxfun(@rdivide, bsxfun(@minus, X, nanmean(X,DIM)), ...
    nanstd(X,0,DIM));
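If the Statistics Toolbox functions nanmean and nanstd are not available, a similar result can be sketched with the 'omitnan' option of mean and std. This assumes MATLAB R2015a or later; the matrix X and the variable names below are made up for illustration.

```matlab
% Illustrative matrix: observations in rows, variables in columns.
X = [1 NaN; 2 4; 3 6];
dim = 1;                               % standardize each column

mu = mean(X, dim, 'omitnan');          % per-column mean, ignoring NaNs
sd = std(X, 0, dim, 'omitnan');        % per-column std (N-1 normalization)
Z  = bsxfun(@rdivide, bsxfun(@minus, X, mu), sd);
```

Z has the same size as X, and entries that were NaN in X remain NaN in Z.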

An anonymous function:
nanZ = @(xIn)(xIn-nanmean(xIn))/nanstd(xIn);
nanZ(vectorWithNans)

A vectorized version of the above anonymous function (assumes observations are in rows, variables in columns):
nanZ = @(xIn)(xIn-repmat(nanmean(xIn),size(xIn,1),1))./repmat(nanstd(xIn),size(xIn,1),1);
nanZ(matrixWithNans)

Related

Optimize nested for loop for calculating xcorr of matrix rows

I have 2 nested loops which do the following:
Get two rows of a matrix
Check if indices meet a condition or not
If they do: calculate xcorr between the two rows and put it into new vector
Find the index of the maximum value of sub vector and replace element of LAG matrix with this value
I don't know how to speed this code up, by vectorizing or otherwise.
b=size(data,1);
F=size(data,2);
LAG= zeros(b,b);
for i=1:b
    for j=1:b
        if j>i
            x=data(i,:);
            y=data(j,:);
            d=xcorr(x,y);
            d=d(:,F:(2*F)-1);
            [M,I] = max(d);
            LAG(i,j)=I-1;
            d=xcorr(y,x);
            d=d(:,F:(2*F)-1);
            [M,I] = max(d);
            LAG(j,i)=I-1;
        end
    end
end
First, a note on floating point precision...
You mention in a comment that your data contains the integers 0, 1, and 2. You would therefore expect a cross-correlation to give integer results. However, since the calculation is being done in double-precision, there appears to be some floating-point error introduced. This error can cause the results to be ever so slightly larger or smaller than integer values.
Since your calculations involve looking for the location of the maxima, then you could get slightly different results if there are repeated maximal integer values with added precision errors. For example, let's say you expect the value 10 to be the maximum and appear in indices 2 and 4 of a vector d. You might calculate d one way and get d(2) = 10 and d(4) = 10.00000000000001, with some added precision error. The maximum would therefore be located in index 4. If you use a different method to calculate d, you might get d(2) = 10 and d(4) = 9.99999999999999, with the error going in the opposite direction, causing the maximum to be located in index 2.
The solution? Round your cross-correlation data first:
d = round(xcorr(x, y));
This will eliminate the floating-point errors and give you the integer results you expect.
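The effect described above can be reproduced with a small made-up vector. This is only an illustration of the tie-breaking issue; the values in d are hypothetical and do not come from xcorr.

```matlab
% Hypothetical cross-correlation values: the "true" maximum 10 occurs twice,
% but one copy carries a tiny positive rounding error.
d = [9, 10, 9, 10 + 1e-14];

[~, iRaw]     = max(d);         % picks index 4 because of the rounding error
[~, iRounded] = max(round(d));  % ties resolve to the first maximum, index 2
```

After rounding, max always returns the first of the tied positions, so the result no longer depends on which way the rounding error happened to go.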
Now, on to the actual solutions...
Solution 1: Non-loop option
You can pass a matrix to xcorr and it will perform the cross-correlation for every pairwise combination of columns. Using this, you can forego your loops altogether like so:
d = round(xcorr(data.'));
[~, I] = max(d(F:(2*F)-1,:), [], 1);
LAG = reshape(I-1, b, b).';
Solution 2: Improved loop option
There are limits to how large data can be for the above solution, since it will produce large intermediate and output variables that can exceed the maximum array size available. In such a case for loops may be unavoidable, but you can improve upon the for-loop solution above. Specifically, you can compute the cross-correlation once for a pair (x, y), then just flip the result for the pair (y, x):
% Loop over rows:
for row = 1:b
    % Loop over upper matrix triangle:
    for col = (row+1):b
        % Cross-correlation for upper triangle:
        d = round(xcorr(data(row, :), data(col, :)));
        [~, I] = max(d(:, F:(2*F)-1));
        LAG(row, col) = I-1;
        % Cross-correlation for lower triangle:
        d = fliplr(d);
        [~, I] = max(d(:, F:(2*F)-1));
        LAG(col, row) = I-1;
    end
end

How do I reshape a non-quadratic matrix?

I have a column vector A with dimensions (35064x1) that I want to reshape into a matrix with 720 rows and as many columns as it needs.
In MATLAB, it'd be something like this:
B = reshape(A,720,[])
in which B is my new matrix.
However, if I divide 35064 by 720, there'll be a remainder.
Ideally, MATLAB would fill every column with 720 values until the last column, which would have only 504 values rather than 720 (48x720+504 = 35064).
Is there any function, like reshape, that would perform this task?
Since I am not good at coding, I'd resort to built-in functions first before going into programming.
reshape preserves the number of elements, but you can achieve the same result in two steps:
b=zeros(720*ceil(35604/720),1); b(1:35604)=a;
reshape(b,720,[])
A = rand(35064,1);
NoCols = 720;
tmp = mod(numel(A),NoCols ); % get the remainder
tmp2 = NoCols -tmp;
B = reshape([A; nan(tmp2,1)],720,[]); % reshape the extended column
This first gets the remainder after division and then subtracts it from the column length (NoCols, here 720) to find the number of missing values. It then creates an array of NaNs (or zeros, whichever suits your purpose best) to pad the original vector before reshaping. As a one-liner:
A = rand(35064,1);
NoCols = 720;
B = reshape([A; nan(NoCols-mod(numel(A),NoCols),1)],720,[]);
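One edge case in the approach above: when numel(A) is already a multiple of 720, NoCols-mod(numel(A),NoCols) evaluates to 720 and a whole extra column of NaNs is appended. A small sketch that avoids this computes the pad length as mod(-numel(A), nRows) instead. The names nRows and nPad and the short test vector are illustrative.

```matlab
A = (1:10).';                 % illustrative data, 10 elements
nRows = 4;                    % desired column length

nPad = mod(-numel(A), nRows); % 2 here; 0 when numel(A) is a multiple of nRows
B = reshape([A; nan(nPad, 1)], nRows, []);
```

Here B is 4x3, with the last two entries of the final column set to NaN.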
karakfa got the right idea, but there are some errors in his code.
Fixing the errors and slightly simplifying it, you end up with:
B=nan(720,ceil(numel(A)/720));
B(1:numel(A))=A;
This creates a matrix that A fits into and assigns the elements of A to the first numel(A) elements of that matrix.
An alternative implementation, which is probably a bit faster but modifies your variable A:
%pads zeros at the end
A(720*ceil(numel(A)/720))=0;
%reshape
B=reshape(A,720,[]);

Matrix indices for creating sparse matrix

I want to create a 4 by 4 sparse matrix A. I want to assign values (e.g. 1) to the following entries:
A(2,1), A(3,1), A(4,1)
A(2,2), A(3,2), A(4,2)
A(2,3), A(3,3), A(4,3)
A(2,4), A(3,4), A(4,4)
According to the manual page, I know that I should store the indices by row and column respectively. That is, for row indices,
r=[2,2,2,2,3,3,3,3,4,4,4,4]
Also, for column indices
c=[1,2,3,4,1,2,3,4,1,2,3,4]
Since I want to assign 1 to each of the entries, I use
value = ones(1,length(r))
Then, my sparse matrix will be
Matrix = sparse(r,c,value,4,4)
My problem is this:
In fact, I want to construct a square matrix of arbitrary dimension. Say it is a 10 by 10 matrix; then my column vector will be
[1,2,..., 10, 1,2, ..., 10, 1,...,10, 1,...10]
For row vector, it will be
[2,2,...,2,3,3,...,3,...,10, 10, ...,10]
Is there a quick and efficient way to build these column and row vectors? Thanks in advance.
I think the question aims to create vectors c,r in an easy way.
n = 4;
c = repmat(1:n,1,n-1);
r = reshape(repmat(2:n,n,1),1,[]);
value = ones(1,numel(r));
Matrix = sparse(r,c,value,n,n);
This will create your specified vectors in general.
However, as pointed out by others, nearly full sparse matrices are not very efficient due to overhead. If I recall correctly, a sparse matrix offers advantages if the density is lower than 25%. Filling everything except the first row will result in slower performance.
You can convert a matrix to sparse after creating its full version.
A = ones(10,10);
A(1,:) = 0;
B = sparse(A);
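As an alternative sketch for building the index vectors from the question, ndgrid can generate every (row, column) pair directly. This is not from the original answers; it relies on sparse accepting a scalar value that is applied to all index pairs.

```matlab
n = 4;
[r, c] = ndgrid(2:n, 1:n);            % all pairs with rows 2..n and columns 1..n
Matrix = sparse(r(:), c(:), 1, n, n); % the scalar 1 is used for every entry
```

The pairs come out in a different order than in the question, but the order of the index pairs does not affect the resulting matrix.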

Best way to join different length column vectors into a matrix in MATLAB

Assuming I have a series of column vectors of different lengths, what would be the best way, in terms of computation time, to join them all into one matrix whose size is determined by the longest column, with the cells that extend the shorter columns filled with NaNs?
Edit: Please note that I am trying to avoid cell arrays, since they are expensive in terms of memory and run time.
For example:
A = [1;2;3;4];
B = [5;6];
C = magicFunction(A,B);
Result:
C =
1 5
2 6
3 NaN
4 NaN
The following code avoids use of cell arrays except for the estimation of number of elements in each vector and this keeps the code a bit cleaner. The price for using cell arrays for that tiny bit of work shouldn't be too expensive. Also, varargin gets you the inputs as a cell array anyway. Now, you can avoid cell arrays there too, but it would most probably involve use of for-loops and might have to use variable names for each of the inputs, which isn't too elegant when creating a function with unknown number of inputs. Otherwise, the code uses numeric arrays, logical indexing and my favourite bsxfun, which must be cheap in the market of runtimes.
Function Code
function out = magicFunction(varargin)
    lens = cellfun(@(x) numel(x),varargin);
    out = NaN(max(lens),numel(lens));
    out(bsxfun(@le,[1:max(lens)]',lens)) = vertcat(varargin{:});
return;
Example
Script -
A1 = [9;2;7;8];
A2 = [1;5];
A3 = [2;6;3];
out = magicFunction(A1,A2,A3)
Output -
out =
9 1 2
2 5 6
7 NaN 3
8 NaN NaN
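To see why the logical-indexing line in magicFunction works, it can help to look at the mask on its own. The sketch below reuses the example vectors; the variable names mask and out2 are illustrative.

```matlab
A1 = [9;2;7;8];  A2 = [1;5];  A3 = [2;6;3];
lens = [numel(A1), numel(A2), numel(A3)];      % [4 2 3]

% mask(i,j) is true when row i lies within the length of vector j.
mask = bsxfun(@le, (1:max(lens)).', lens);

% Logical indexing assigns in column-major order, which matches the order
% of the vertically concatenated inputs.
out2 = NaN(max(lens), numel(lens));
out2(mask) = [A1; A2; A3];
```

The positions where mask is false are never written, so they keep the NaN they were initialized with.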
Benchmarking
As part of the benchmarking, we compare our solution to @gnovice's solution, which was mostly based on cell arrays. The intention is to see what speedups, if any, we get by avoiding cell arrays. Here's the benchmarking code with 20 vectors -
%// Let's create row vectors A1,A2,A3.. to be used with @gnovice's solution
num_vectors = 20;
max_vector_length = 1500000;
vector_lengths = randi(max_vector_length,num_vectors,1);
vs = arrayfun(@(x) randi(9,1,vector_lengths(x)),1:numel(vector_lengths),'uni',0);
[A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16,A17,A18,A19,A20] = vs{:};
%// Maximally cell-array based approach used in linked @gnovice's solution
disp('--------------------- With @gnovice''s approach')
tic
tcell = {A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16,A17,A18,A19,A20};
maxSize = max(cellfun(@numel,tcell)); %# Get the maximum vector size
fcn = @(x) [x nan(1,maxSize-numel(x))]; %# Create an anonymous function
rmat = cellfun(fcn,tcell,'UniformOutput',false); %# Pad each cell with NaNs
rmat = vertcat(rmat{:});
toc, clear tcell maxSize fcn rmat
%// Transpose each of the input vectors to get column vectors as needed
%// for our problem
vs = cellfun(@(x) x',vs,'uni',0);
[A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,A11,A12,A13,A14,A15,A16,A17,A18,A19,A20] = vs{:};
%// Our solution
disp('--------------------- With our new approach')
tic
out = magicFunction(A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,...
A11,A12,A13,A14,A15,A16,A17,A18,A19,A20);
toc
Results -
--------------------- With @gnovice's approach
Elapsed time is 1.511669 seconds.
--------------------- With our new approach
Elapsed time is 0.671604 seconds.
Conclusions -
With 20 vectors and with a maximum length of 1500000, the speedups are between 2-3x and it was seen that the speedups have increased as we have increased the number of vectors. The results to prove that are not shown here to save space, as we have already used quite a lot of it here.
If you use a cell array, you won't need to fill anything with NaNs: just write each array into one cell and the unused elements stay empty (that would be the space-efficient way). You could either use:
cell_result{1} = A;
cell_result{2} = B;
This would result in a cell array of size 2 that contains all elements of A and B in its cells. Or, if you want each element saved in its own cell:
cell_result(1,1:numel(A)) = num2cell(A);
cell_result(2,1:numel(B)) = num2cell(B);
If you need them to be filled with NaNs for later code, the easiest way is to find the maximum length and create a matrix of size (max_length x number of arrays).
So let's say you have n=5 arrays: A, B, C, D and E.
h=zeros(1,n);
h(1)=numel(A);
h(2)=numel(B);
h(3)=numel(C);
h(4)=numel(D);
h(5)=numel(E);
max_No_Entries=max(h);
result= zeros(max_No_Entries,n);
result(:,:)=NaN;
result(1:numel(A),1)=A;
result(1:numel(B),2)=B;
result(1:numel(C),3)=C;
result(1:numel(D),4)=D;
result(1:numel(E),5)=E;

Summing multiple matrices in matlab

I have a file containing 60 matrices. I would like to get the mean of each element across those 60 matrices,
that is, the mean of element [1,1], the mean of element [1,2], and so on, across the matrices.
I am unable to use the mean command and am not sure what the best way to do this is.
Here's the file: https://dl.dropbox.com/u/22681355/file.mat
You can try this:
% concatenate the contents of your cell array to a 100x100x60 matrix
c = cat(3, results_foptions{:});
% take the mean
thisMean = mean(c, 3);
To round to the nearest integer, you can use
roundedMean = round(thisMean);
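A tiny worked example of the concatenate-then-average idea, using two made-up 2x2 matrices in place of the 60 matrices from the file (the cell array name mats is illustrative):

```matlab
mats = {[1 2; 3 4], [3 4; 5 6]};   % stand-in for the cell array of matrices
c = cat(3, mats{:});               % 2x2x2 array, one matrix per page
elementMean = mean(c, 3);          % element-wise mean across pages: [2 3; 4 5]
```

The result has the same size as each individual matrix, with every entry averaged over the third dimension.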
You should put all the matrices together in a 3-dimensional array, mat, as:
mat(:,:,1) = mat1;
mat(:,:,2) = mat2;
mat(:,:,3) = mat3;
etc...
then simply:
mean(mat, 3);
where the parameter 3 specifies that you want the mean across the 3rd dimension.
The mean of the matrix can be computed a few different ways.
First you can compute the mean of each column and then compute the mean of those means:
colMeans = mean( A );
matMean = mean(colMeans);
Or you can convert the matrix to a column vector and compute the mean directly
matMean = mean( A(:) );