Finding the maximum condition number of a matrix after erasure - matlab

I am dealing with the following question: given a large random Gaussian matrix, say 1000 by 500, arbitrarily remove 500 rows and consider the condition number of the remaining matrix. What is the maximum possible condition number we can get with high probability?
Here a Gaussian matrix means a matrix with i.i.d. standard normal entries. I would like to write a MATLAB program to run some simulations. How can I write the program? Thanks for any help.

That's an interesting problem. I don't know of any theoretical results, but it's easy to set up a Monte Carlo simulation and see.
Note that arbitrarily removing 500 rows is equivalent to always removing the last 500 rows for example, because the rows are i.i.d. and the condition number is invariant to changing the order of the rows.
M = 100; %// initial number of rows
N = 50; %// number of columns
R = 1e4; %// number of Monte Carlo realizations
cond1 = NaN(1,R); %// preallocate
cond2 = NaN(1,R); %// preallocate
for r = 1:R
    X = randn(M,N); %// matrix with i.i.d. standard normal entries
    cond1(r) = cond(X);
    cond2(r) = cond(X(1:N,:));
end
loglog(cond1, cond2, '.', 'markersize', 1) %// scatter plot of results in logarithmic scale
xlabel('Condition number of original matrix')
ylabel('Condition number of reduced matrix')
This is the result for M=100, N=50. Note that for the original size M=1000, N=500 it may take a long time to obtain a large number of realizations.
As expected, the condition number increases when you remove rows (although I didn't expect it to increase so much!).
From the obtained vectors cond1 and cond2 you can compute statistics, or percentiles. For example, the value that is exceeded with only 10% probability is, in each case,
>> quantile(cond1,.9)
ans =
5.837510220358853
>> quantile(cond2,.9)
ans =
9.422516183444204e+02
This means that in the original matrix, 90% of the time the condition number is less than 5.8375; whereas in the reduced matrix, 90% of the time the condition number is less than 942.25.

Related

Vectorizing by splitting matrix by rows unequally

I have X_test which is a matrix of size 967874 x 3 where the columns are: doc#, wordID, wordCount, and there are 7505 unique doc#s (length(unique(X_test(:,1))) == length(Y_test) == 7505). The matrix rows are also already sorted according to the doc# column.
I also have a likelihoods matrix of size 61188 x 20 where the rows are all possible wordIDs, and the columns are different classes (length(unique(Y_test))==20)
The result I'm trying to obtain is a matrix of size 7505 x 20 where each row corresponds to a document and contains, for each class (column), the wordCount-weighted sum of the likelihoods-matrix entries corresponding to that document's wordIDs.
My first thought was to rearrange this 2-D matrix into a 3-D matrix according to doc#s, but the number of rows differs for each unique doc#. I also think making a cell array of 7505 matrices isn't a great idea, but I may be wrong about that.
It's probably more explanatory if I just show the code I have that works, but is slow because it iterates through each of the 7505 documents:
probabilities = zeros(length(Y_test),nClasses); % 7505 x 20
for n = 1:length(Y_test) % 7505 iterations
    doc = X_test(X_test(:,1)==n,:);
    result = bsxfun(@times, doc(:,3), log(likelihoods(doc(:,2),:)));
    % result ends up size(doc,1) x 20
    probabilities(n,:) = sum(result,1);
end
For context, this is what I use the probabilities matrix for:
% MAP decision rule
probabilities = bsxfun(@plus, probabilities, logpriors'); % add priors
[~,predictions] = max(probabilities,[],2);
CCR = sum(predictions==Y_test)/length(Y_test); % correct classification rate
fprintf('Correct classification percentage: %0.2f%%\n\n', CCR*100);
Edit: I separated the matrix into a cell array according to doc#s, but I don't know how to apply bsxfun to all arrays in a cell at the same time.
counts = histc(X_test(:,1),unique(X_test(:,1)));
testdocs = mat2cell(X_test,counts);
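One way to avoid both the per-document loop and the cell array is to build a sparse document-by-word count matrix and replace the whole loop with a single matrix multiplication. This is only a sketch, assuming doc#s run contiguously from 1 to length(Y_test):
nDocs = length(Y_test); % 7505
nWords = size(likelihoods,1); % 61188
W = sparse(X_test(:,1), X_test(:,2), X_test(:,3), nDocs, nWords); % W(d,w) = wordCount of word w in doc d
probabilities = W * log(likelihoods); % 7505 x 20, same result as the loop above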

Optimize matrix to have the least number of rows with NaN

I have a matrix M. Let's assume that each row of the matrix M is a subject and each column is a measurement.
M = rand(100); % generate a random 100x100 matrix
c = randperm(length(M),100); % select 100 measurement indices at random
r = randperm(length(M),100); % select 100 subject indices at random
for i = 1:100
    M(r(i),c(i)) = NaN; % insert NaN: subject r(i) is missing measurement c(i)
end
Now I delete the measurements that are missing for all the subjects (if any)
idx_col_all_NAN = find(all(isnan(M)==1));
M(:,idx_col_all_NAN)=[];
and I delete the subjects for which all the measurements are missing (if any)
idx_row_all_NAN = find(all(isnan(M)==1,2));
M(idx_row_all_NAN,:)=[];
Now I would like to remove the measurements in order to maximize the number of subjects with the same measurements and minimize the cells of M containing NaN.
Could you help me?
In order to continue removing NaNs from your matrix, you need some rule for trading off losing data against removing NaNs. As you said, if you continue to remove NaNs without any limit, you may be left with very little data. There is no single correct rule for this; it really depends on what you are asking. The following suggestion is only meant to give you an idea of how to deal with such a problem.
So as a starting point, I define an index to the 'quality' of the matrix, in terms of how many 'holes' are in it:
M_ratio = sum(~isnan(M(:)))/numel(M); % fraction of non-NaN entries in M
This index grows as the matrix contains more data, and it equals 1 if there are no NaNs. We could keep removing rows/columns as long as we see an improvement, but because the matrix keeps shrinking, any remaining NaN always yields some improvement, so we would end up with an empty matrix (or a very small one, depending on how many NaNs we had).
So we need to define some threshold for the improvement, such that if the deletion does not improve the matrix in a certain amount - we stop the process:
improve = 1-M_old_ratio/M_new_ratio % the relative improvement after deletion
improve is the relative gain in our 'quality' index, and if it is not big enough we stop deleting rows/columns from the matrix. What is big enough? That is hard to say, but I'll leave it to you to play with the threshold and see what gives you a decent result.
So here is the full code for that:
N = 100;
M = rand(N); % generate an NxN random matrix
M(randi(numel(M),N^2,1)) = nan; % add NaN to N^2 randomly selected entries
M(:,all(isnan(M))) = []; % delete all-NaN columns
M(all(isnan(M),2),:) = []; % delete all-NaN rows
threshold = 0.003; % the threshold for stopping the optimization
while 1
    M_ratio = sum(~isnan(M(:)))/numel(M); % fraction of non-NaN entries in M
    [mincol,indcol] = min(sum(~isnan(M),1)); % find the column with most NaNs
    [minrow,indrow] = min(sum(~isnan(M),2)); % find the row with most NaNs
    [~,dir] = min([minrow;mincol]); % find which of the two has more NaNs
    Mtry = M;
    if dir == 1
        Mtry(indrow,:) = []; % delete row
    else
        Mtry(:,indcol) = []; % delete column
    end
    Mtry_ratio = sum(~isnan(Mtry(:)))/numel(Mtry); % get the new ratio
    improve = 1 - M_ratio/Mtry_ratio; % the relative improvement after deletion
    if improve > threshold % if it improves more than the threshold
        M = Mtry; % replace the matrix
    else
        break; % otherwise - quit
    end
end
If you only consider removing columns, and not rows, it's a bit simpler:
threshold = 0.002; % the threshold for stopping the optimization
while 1
    M_ratio = sum(~isnan(M(:)))/numel(M); % fraction of non-NaN entries in M
    [~,indcol] = min(sum(~isnan(M),1)); % find the column with most NaNs
    Mtry = M;
    Mtry(:,indcol) = []; % delete column
    Mtry_ratio = sum(~isnan(Mtry(:)))/numel(Mtry); % get the new ratio
    improve = 1 - M_ratio/Mtry_ratio; % the relative improvement after deletion
    if improve > threshold % if it improves more than the threshold
        M = Mtry; % replace the matrix
    else
        break; % otherwise - quit
    end
end
As you will notice, I introduced the NaNs into the matrix in a more compact way, but that doesn't matter, since you have real data. I also used logical indexing, which is a more compact and efficient way to remove columns and rows.

Find frequency of elements above a threshold for each cell in MATLAB

I have a 4-D matrix. The dimensions are longitude, latitude, days, years as [17,14,122,16].
I have to find the frequency of values above the 98th percentile for each cell, so that the final output is a 17x14 array containing the number of occurrences of values above that threshold.
I did something which gives me a 17x14 set of the 98th-percentile values for each cell, but I am unable to determine the frequency of occurrences.
k = 0;
p = cell(1,238);
r = cell(1,238);
for i = 1:17
    for j = 1:14
        n = m(i,j,1:122,1:16);
        n = squeeze(n);
        k = k+1;
        q = prctile(n(:),98);
        r{k} = nansum(nansum(n>=q));
        p{k} = q;
    end
end
This code fills p fine, but r contains the same value for all cells. How can this be possible? What am I doing wrong here? Please help.
By definition, the frequency of values above the 98th percentile is 2%.
I'm guessing the value you keep getting for r is 39: the number of elements in the top 2% of your 122x16 matrix (i.e. of 1952 elements).
r = 0.02*1952
r =
   39.040
Your code is verifying the theoretical value. Perhaps you are thinking of a different question?
Here's a simulated example, using randomly generated data (uniform distribution from 0 to 100) in place of your n.
k = 0; % (this initialization was missing above)
p = cell(1,238);
r = cell(1,238);
for i = 1:17
    for j = 1:14
        % n = m(i,j,1:122,1:16);
        % n = squeeze(n);
        % After n = squeeze(n), n is a 2-D matrix of size 122x16.
        n = rand(122,16)*100; % simulation for your 2-D matrix
        k = k+1;
        q = prctile(n(:),98);
        r{k} = nansum(nansum(n>=q));
        p{k} = q;
    end
end

Finding maximum/minimum distance of two rows in a matrix using MATLAB

Say we have an m x n matrix where the number of rows is very big. If we treat each row as a vector, how could one find the maximum/minimum distance between the vectors in this matrix?
My suggestion would be to use pdist. This computes the pairwise Euclidean distances between unique combinations of observations, like @seb has suggested, but it is already built into MATLAB. Your matrix is already formatted nicely for pdist: each row is an observation and each column is a variable.
Once you apply pdist, apply squareform so that you can display the distances between pairwise entries in a more pleasant matrix form. The (i,j) entry of this matrix tells you the distance between the ith and jth rows. Also note that this matrix is symmetric and the distances along the diagonal are inevitably 0, as any vector's distance to itself must be zero. If the minimum distance between two different vectors happened to be zero, searching this matrix could report a self-distance instead of the actual distance between two different vectors. As such, you should set the diagonal of this matrix to NaN to avoid picking these up.
As such, assuming your matrix is A, all you have to do is this:
distValues = pdist(A); %// Compute pairwise distances
minDist = min(distValues); %// Find minimum distance
maxDist = max(distValues); %// Find maximum distance
distMatrix = squareform(distValues); %// Prettify
distMatrix(logical(eye(size(distMatrix)))) = NaN; %// Ignore self-distances
[minI,minJ] = find(distMatrix == minDist, 1); %// Find the two vectors with min. distance
[maxI,maxJ] = find(distMatrix == maxDist, 1); %// Find the two vectors with max. distance
minI, minJ, maxI, maxJ will return the two rows of A that produced the smallest distance and the largest distance respectively. Note that in the find statement I have set the second parameter to 1 so that it only returns one pair of vectors that have this minimum / maximum distance between each other. However, if you omit this parameter, it will return all possible pairs of rows that share the same distance, and you will get duplicate entries because the squareform is symmetric. If you want to avoid the duplication, set either the upper or lower triangular half of your squareform matrix to NaN to tell MATLAB to skip searching in these duplicated areas. You can use MATLAB's tril or triu commands to do that. Take note that either of these methods by default will include the diagonal of the matrix, so there is no extra work needed there. As such, try something like:
distValues = pdist(A); %// Compute pairwise distances
minDist = min(distValues); %// Find minimum distance
maxDist = max(distValues); %// Find maximum distance
distMatrix = squareform(distValues); %// Prettify
distMatrix(triu(true(size(distMatrix)))) = NaN; %// To avoid searching for duplicates
[minI,minJ] = find(distMatrix == minDist); %// Find pairs of vectors with min. distance
[maxI,maxJ] = find(distMatrix == maxDist); %// Find pairs of vectors with max. distance
Judging from your application, you just want to find one such occurrence, so let's leave it at that, but I'll put this here in case you need it.
You mean the max/min distance between any 2 rows? If so, you can try this:
numRows = 6;
A = randn(numRows, 100); %// Example of input matrix
%// Compute distances between each combination of 2 rows
T = nchoosek(1:numRows,2); %// pairs of indexes for all combinations of 2 rows
d = zeros(size(T,1),1); %// preallocate distance vector
for k = 1:size(T,1)
    d(k) = norm(A(T(k,1),:)-A(T(k,2),:));
end
%// Find min/max distance
[~, minIndex] = min(d);
[~, maxIndex] = max(d);
T(minIndex,:) %// Displays indexes of the 2 rows with minimum distance
T(maxIndex,:) %// Displays indexes of the 2 rows with maximum distance

Efficient low-rank approximation in MATLAB

I'd like to compute a low-rank approximation to a matrix which is optimal under the Frobenius norm. The trivial way to do this is to compute the SVD of the matrix, set the smallest singular values to zero, and compute the low-rank matrix by multiplying the factors. Is there a simple and more efficient way to do this in MATLAB?
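For reference, a minimal sketch of the trivial truncated-SVD approach described above (r here stands for the target rank):
[U,S,V] = svd(A); % full SVD
Ar = U(:,1:r)*S(1:r,1:r)*V(:,1:r)'; % best rank-r approximation in the Frobenius norm (Eckart-Young)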
If your matrix is sparse, use svds.
Assuming it is not sparse but it's large, you can use random projections for fast low-rank approximation.
From a tutorial:
An optimal low rank approximation can be easily computed using the SVD of A in O(mn^2). Using random projections we show how to achieve an "almost optimal" low rank approximation in O(mn log(n)).
Matlab code from a blog:
clear
% preparing the problem
% trying to find a low-rank approximation to A, an m x n matrix
% where m >= n
m = 1000;
n = 900;
% first let's produce an example A
A = rand(m,n);
%
% beginning of the algorithm designed to find a low-rank approximation of A
% let us define that rank to be equal to k
k = 50;
% R is an m x l matrix drawn from N(0,1)
% where l is such that l > c log(n)/epsilon^2
%
l = 100;
% timing the random algorithm
trand = cputime;
R = randn(m,l);
B = 1/sqrt(l) * R' * A;
[a,s,b] = svd(B);
Ak = A*b(:,1:k)*b(:,1:k)';
trandend = cputime - trand;
% now timing the normal SVD algorithm
tsvd = cputime;
% doing it the normal SVD way
[U,S,V] = svd(A,0);
Aksvd = U(1:m,1:k)*S(1:k,1:k)*V(1:n,1:k)';
tsvdend = cputime - tsvd;
Also, remember the econ parameter of svd.
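For instance, the economy-size decomposition avoids computing the full m x m matrix U when m > n:
[U,S,V] = svd(A,'econ'); % economy-size SVD: U is m x n instead of m x m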
You can rapidly compute a low-rank approximation based on SVD, using the svds function.
[U,S,V] = svds(A,r); %# only first r singular values are computed
svds uses eigs to compute a subset of the singular values - it will be especially fast for large, sparse matrices. See the documentation; you can set tolerance and maximum number of iterations or choose to calculate small singular values instead of large.
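The exact option syntax depends on your MATLAB release; in recent releases it looks roughly like this (the values below are only placeholders):
k = 6;
s_small = svds(A, k, 'smallest'); % k smallest singular values instead of the largest
s_large = svds(A, k, 'largest', 'Tolerance', 1e-10, 'MaxIterations', 500); % tighter convergence settings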
I thought svds and eigs could be faster than svd and eig for dense matrices, but then I did some benchmarking. They are only faster for large matrices when sufficiently few values are requested:
n k svds svd eigs eig comment
10 1 4.6941e-03 8.8188e-05 2.8311e-03 7.1699e-05 random matrices
100 1 8.9591e-03 7.5931e-03 4.7711e-03 1.5964e-02 (uniform dist)
1000 1 3.6464e-01 1.8024e+00 3.9019e-02 3.4057e+00
2 1.7184e+00 1.8302e+00 2.3294e+00 3.4592e+00
3 1.4665e+00 1.8429e+00 2.3943e+00 3.5064e+00
4 1.5920e+00 1.8208e+00 1.0100e+00 3.4189e+00
4000 1 7.5255e+00 8.5846e+01 5.1709e-01 1.2287e+02
2 3.8368e+01 8.6006e+01 1.0966e+02 1.2243e+02
3 4.1639e+01 8.4399e+01 6.0963e+01 1.2297e+02
4 4.2523e+01 8.4211e+01 8.3964e+01 1.2251e+02
10 1 4.4501e-03 1.2028e-04 2.8001e-03 8.0108e-05 random pos. def.
100 1 3.0927e-02 7.1261e-03 1.7364e-02 1.2342e-02 (uniform dist)
1000 1 3.3647e+00 1.8096e+00 4.5111e-01 3.2644e+00
2 4.2939e+00 1.8379e+00 2.6098e+00 3.4405e+00
3 4.3249e+00 1.8245e+00 6.9845e-01 3.7606e+00
4 3.1962e+00 1.9782e+00 7.8082e-01 3.3626e+00
4000 1 1.4272e+02 8.5545e+01 1.1795e+01 1.4214e+02
2 1.7096e+02 8.4905e+01 1.0411e+02 1.4322e+02
3 2.7061e+02 8.5045e+01 4.6654e+01 1.4283e+02
4 1.7161e+02 8.5358e+01 3.0066e+01 1.4262e+02
With size-n square matrices, k singular/eigen values and runtimes in seconds. I used Steve Eddins' timeit file exchange function for benchmarking, which tries to account for overhead and runtime variations.
svds and eigs are faster if you want only a few values from a very large matrix. It also depends on the properties of the matrix in question (looking at the source with edit svds should give you some idea why).
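As a rough sketch of how such timings can be reproduced (the exact benchmark code isn't shown above, so this is only an illustration of one table cell):
n = 1000; k = 1;
A = rand(n); % dense test matrix with uniformly distributed entries
t_svds = timeit(@() svds(A,k)); % time to compute the k largest singular values
t_svd = timeit(@() svd(A)); % time to compute the full set of singular values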