Matlab, economy QR decomposition, control precision? - matlab

There is a [Q,R] = qr(A,0) function in Matlab which, according to the documentation, returns an "economy" version of the QR decomposition of A. norm(A-Q*R) returns ~1e-12 for my data set. Also, Q'*Q should theoretically return I. In practice there are small nonzero elements above and below the diagonal (of the order of 1e-6 or so), as well as diagonal elements that are slightly greater than 1 (again, by 1e-6 or so). Is anyone aware of a way to control the precision of qr(.,0), or the quality (orthogonality) of the resulting Q, either by specifying an epsilon or via the number of iterations? The size of the data set makes qr(A) run out of memory, so I have to use qr(A,0).

When I try the non-economy setting, I get comparable results for A-Q*R, even for a tiny matrix containing small numbers, as shown here:
A = magic(20);
[Q, R] = qr(A); %Result does not change when using qr(A,0)
norm(A-Q*R)
As such, I don't believe the 'economy' setting is the problem, as confirmed by @horchler in the comments. Rather, you have just run into the limits of how accurately calculations can be done with data of type 'double'.
Even if you change the accuracy somehow, you will always be dealing with an approximation, so perhaps the first thing to consider here is whether you really need greater accuracy than you already have. If you need more accuracy there may always be a way, but I doubt whether it will be a straightforward one.
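To put numbers on this, here is a minimal sketch along the lines of the example above that compares both residuals against machine epsilon. The variable names are illustrative, and the closing remark about single precision is only a guess at where errors near 1e-6 might come from; nothing in the question confirms the data type.

```matlab
A = magic(20);
[Q, R] = qr(A, 0);                      % economy-size factorization
n = size(Q, 2);

recon_err = norm(A - Q*R) / norm(A);    % relative reconstruction error
orth_err  = norm(Q'*Q - eye(n));        % departure of Q from orthogonality

fprintf('class(A)       = %s\n', class(A));
fprintf('eps(class(A))  = %g\n', eps(class(A)));
fprintf('reconstruction = %g\n', recon_err);
fprintf('orthogonality  = %g\n', orth_err);

% For comparison only: single precision has a much larger machine epsilon,
% so data stored as single would show errors closer to this size.
fprintf('eps(single)    = %g\n', eps('single'));
```

If both errors are within a few orders of magnitude of eps for the class of A, the factorization is about as accurate as that data type allows.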

Related

Dimensionality reduction using PCA - MATLAB

I am trying to reduce dimensionality of a training set using PCA.
I have come across two approaches.
[V,U,eigen]=pca(train_x);
eigen_sum=0;
for lamda=1:length(eigen)
    eigen_sum=eigen_sum+eigen(lamda,1);
    if(eigen_sum/sum(eigen)>=0.90)
        break;
    end
end
train_x=train_x*V(:, 1:lamda);
Here, I simply use the eigenvalue matrix to reconstruct the training set with lower amount of features determined by principal components describing 90% of original set.
The alternate method that I found is almost exactly the same, save the last line, which changes to:
train_x=U(:,1:lamda);
In other words, we take the training set as the principal component representation of the original training set up to some feature lamda.
Both of these methods seem to yield similar results (out of sample test error), but there is difference, however minuscule it may be.
My question is, which one is the right method?
The answer depends on your data, and what you want to do.
Using your variable names: generally speaking, it is natural to expect that the outputs of pca satisfy
U = train_x * V
But this is only true if your data is normalized, specifically if you have already removed the mean from each component. If not, then what you can expect is
U = train_x * V - mean(train_x * V)
In that regard, whether you want to remove or keep the mean of your data before processing it depends on your application.
It's also worth noting that even if you remove the mean before processing, there might be some small difference, but it will be on the order of floating-point precision:
((train_x * V) - U) ./ U ~~ 1.0e-15
This error can be safely ignored.
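The relationship can be checked numerically. The sketch below uses synthetic data with a deliberately nonzero mean (an assumption made for illustration, not the asker's data) and relies on pca from the Statistics and Machine Learning Toolbox, which centers the data by default. It also shows a vectorized way of picking the number of components that explain 90% of the variance.

```matlab
rng(0);                                               % reproducible synthetic data
train_x = randn(200, 5) * diag([5 3 2 1 0.5]) + 10;   % nonzero mean on purpose

[V, U, latent] = pca(train_x);                        % pca centers by default

% Smallest number of components explaining at least 90% of the variance
k = find(cumsum(latent) / sum(latent) >= 0.90, 1);

% Projection without and with mean removal
proj_raw      = train_x * V;
proj_centered = bsxfun(@minus, train_x, mean(train_x)) * V;

offset_err = max(max(abs(proj_raw - U)));        % large: off by the projected mean
match_err  = max(max(abs(proj_centered - U)));   % tiny: floating-point noise

reduced = U(:, 1:k);                             % reduced training set
fprintf('k = %d, offset_err = %g, match_err = %g\n', k, offset_err, match_err);
```

The two candidate reductions therefore differ only by a constant offset per component, which is why the out-of-sample errors in the question are so close.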

fit function of Matlab is really slow

Why is the fit function from Matlab so slow? I'm trying to fit a gauss4 model so I can get the means of the Gaussians.
here's my plot,
I want to get the means from the blue data and red data.
I'm fitting a gaussian there but this function is really slow.
Is there an alternative?
fa = fit(fn', facm', 'gauss4');
acm = [fa.b1 fa.b2 fa.b3 fa.b4];
a_cm = sort(acm, 'ascend');
I would apply some of the options available with fit. These include smoothing (your data is quite noisy; note that SmoothingParam applies only to smoothing-spline fits, so for gauss4 the alternative of applying a time domain filter beforehand may help*) and setting the values of your initial parameter estimates with StartPoint. Your fits may also not be converging because you set your tolerances (TolFun, TolX) too low, although from inspection of your fits that does not appear to be the case. In fact, the opposite is likely: you probably want to increase MaxIter and/or MaxFunEvals.
To figure out how to get going you can also try the Spectr-O-Matic toolbox. It requires Matlab 7.12. It includes a script called GaussFit.m to fit gauss4 to data, but it also uses the fit routine and provides examples on how to set and get parameters.
Note that smoothing will of course broaden your peaks, but you can subtract the contribution after the fact. The effect on the mean should not be deleterious, on the contrary, since you are presumably removing noise this should be more accurate.
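As a rough sketch of how the options mentioned above might be set, the following fits gauss4 to synthetic data with four well-separated peaks. The data, the peak centers and the start values are all made up for illustration; it requires the Curve Fitting Toolbox, and the assumed coefficient order can be checked with coeffnames(fittype('gauss4')).

```matlab
rng(1);
x = linspace(0, 10, 400)';
centers = [2 4 6 8];                       % hypothetical true means
y = zeros(size(x));
for c = centers
    y = y + exp(-((x - c) / 0.4).^2);      % same form as the gauss library model
end
y = y + 0.02 * randn(size(x));             % add a little noise

opts = fitoptions('gauss4');
% Assumed coefficient order: a1 b1 c1 a2 b2 c2 a3 b3 c3 a4 b4 c4
opts.StartPoint  = [1 2.2 0.5, 1 3.8 0.5, 1 6.1 0.5, 1 7.9 0.5];
opts.MaxIter     = 1000;
opts.MaxFunEvals = 2000;

fa = fit(x, y, 'gauss4', opts);
means = sort([fa.b1 fa.b2 fa.b3 fa.b4]);
disp(means);
```

Start values close to the visible peak positions usually matter more for speed than the iteration limits, since the solver then needs far fewer function evaluations.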
In general, a function will run faster if you apply it to a shorter series. Hence, if speedup is really important, you could downsample.
For example, if you have a vector that you want to downsample by a factor 2 (you may need to make sure its length is a multiple of 2 first):
n = 2;
x = sin(0.01:0.01:pi);
x_downsampled = (x(1:n:end) + x(2:n:end)) / n; % average each pair of samples
You will now see that x_downsampled is half as long (and should thus be easier to process), but still has the same shape. In your case I think this is sufficient.
To see what you got, try:
plot(x_downsampled)
Now you can simply process x_downsampled and map your solution, for example
f = find(x_downsampled == max(x_downsampled));
location_of_maximum = f * n;
Needless to say this should be done in combination with the most efficient options that the fit function has to offer.

MATLAB - replace zeros in matrix with small number

I have a matrix with some elements going to zero. This is a problem for me in subsequent operations (taking log, etc). Is there a way to quickly replace zero elements in a matrix with an input of my choice? Quickly meaning without a loop.
The direct answer is:
M(M == 0) = realmin;
which does exactly what you ask for, replacing zeros with a small number. See that this does an implicit search for the zeros in a vectorized way. No loops are required. (This is a MATLAB way, avoiding those explicit and slow loops.)
Or you could use max, assuming negative numbers are never an issue (as is presumably the case if you are taking logs). So
M = max(M,realmin);
will also work. Again, this is a vectorized solution. I'm not positive which one is faster without a careful test, but either will surely be acceptable.
Note that I've used realmin here instead of eps, since it is as small as you can realistically get in a double precision number. But use whatever small number makes sense to you.
log10(realmin)
ans =
 -307.6527
Compare that to eps.
log10(eps)
ans =
-15.6536
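A small sketch on a made-up matrix shows that both approaches give the same result for nonnegative data and that the log then stays finite. The matrix values are arbitrary.

```matlab
M = [0 1; 2 0];                 % example matrix with exact zeros

M1 = M;
M1(M1 == 0) = realmin;          % logical indexing: replace zeros only

M2 = max(M, realmin);           % clamp from below; also lifts any negatives

L = log(M1);                    % no -Inf entries any more
disp(L);
```

Note that max would also overwrite negative entries, whereas the logical-indexing form touches only exact zeros.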
Sure. Where A is your matrix:
A(A==0) = my_small_number;
Assume your matrix is called A:
A(A==0) = eps;

Matlab: poor accuracy of optimizers/solvers

I am having difficulty achieving sufficient accuracy in a root-finding problem on Matlab. I have a function, Lik(k), and want to find the value of k where Lik(k)=L0. Basically, the problem is that various built-in Matlab solvers (fzero, fminbnd, fmincon) are not getting as close to the solution as I would like or expect.
Lik() is a user-defined function which involves extensive coding to compute a numerical inverse Laplace transform, etc., and I therefore do not include the full code. However, I have used this function extensively and it appears to work properly. Lik() actually takes several input parameters, but for the current step, all of these are fixed except k. So it is really a one-dimensional root-finding problem.
I want to find the value of k >= 165.95 for which Lik(k)-L0 = 0. Lik(165.95) is less than L0 and I expect Lik(k) to increase monotonically from here. In fact, I can evaluate Lik(k)-L0 in the range of interest and it appears to smoothly cross zero: e.g. Lik(165.95)-L0 = -0.7465, ..., Lik(170.5)-L0 = -0.1594, Lik(171)-L0 = -0.0344, Lik(171.5)-L0 = 0.1015, ... Lik(173)-L0 = 0.5730, ..., Lik(200)-L0 = 19.80. So it appears that the function is behaving nicely.
However, I have tried to "automatically" find the root with several different methods and the accuracy is not as good as I would expect...
Using fzero(@(k) Lik(k)-L0): If constrained to the interval (165.95,173), fzero returns k=170.96 with Lik(k)-L0=-0.045. Okay, although not great. And for practical purposes, I would not know such a precise upper bound without a lot of manual trial and error. If I use the interval (165.95,200), fzero returns k=167.19 where Lik(k)-L0 = -0.65, which is rather poor. I have been running these tests with Display set to iter so I can see what's going on, and it appears that fzero hits 167.19 on the 4th iteration and then stays there on the 5th iteration, meaning that the change in k from one iteration to the next is less than TolX (set to 0.001) and thus the procedure ends. The exit flag indicates that it successfully converged to a solution.
I also tried minimizing abs(Lik(k)-L0) using fminbnd (giving upper and lower bounds on k) and fmincon (giving a starting point for k) and ran into similar accuracy issues. In particular, with fmincon one can set both TolX and TolFun, but playing around with these (down to 10^-6, much higher precision than I need) did not make any difference. Confusingly, sometimes the optimizer even finds a k-value on an earlier iteration that is closer to making the objective function zero than the final k-value it returns.
So, it appears that the algorithm is iterating to a certain point, then failing to take any further step of sufficient size to find a better solution. Does anyone know why the algorithm does not take another, larger step? Is there anything I can adjust to change this? (I have looked at the list under optimset but did not come up with anything useful.)
Thanks a lot!
As you seem to have a 'wild' function that does appear to be monotone in the region, a fairly small range of interest, and not a very high requirement in precision, I think all criteria are met for recommending the brute force approach.
Assuming it does not take too much time to evaluate the function at a point, please try this:
Find an upper bound xmax and a lower bound xmin, choose a preferred step size and evaluate your function at
xmin:stepsize:xmax
If required (and monotonicity really applies), you can read off a tighter upper and lower bound from the sign change and repeat the process for better accuracy.
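The repeated grid search could look roughly like the sketch below. Since Lik is not available, g is a made-up monotone stand-in for Lik(k)-L0 with its root placed at 171.13; the grid size and number of passes are arbitrary choices.

```matlab
g = @(k) exp(0.05 * (k - 171.13)) - 1;   % hypothetical stand-in for Lik(k) - L0

xmin = 165.95;                            % g(xmin) < 0
xmax = 200;                               % g(xmax) > 0
for pass = 1:4
    ks   = linspace(xmin, xmax, 101);     % 100 steps across the current bracket
    vals = arrayfun(g, ks);               % evaluate one point at a time
    % first grid cell where the function goes from negative to nonnegative
    idx  = find(vals(1:end-1) < 0 & vals(2:end) >= 0, 1);
    xmin = ks(idx);
    xmax = ks(idx + 1);
end
k_root = (xmin + xmax) / 2;
fprintf('root near k = %.6f, bracket width %.2e\n', k_root, xmax - xmin);
```

Each pass shrinks the bracket by a factor of 100 at the cost of 101 evaluations, so this only makes sense if a single evaluation is cheap.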
I also encountered this problem while using fmincon. Here is how I fixed it.
I needed to find the solution of a function (single variable) within an optimization loop (multiple variables). Because of this, I needed to provide a large interval for the solution of the single-variable function. The problem is that fmincon (or fzero) does not converge to a solution if the search interval is too large. To get past this, I solve the problem inside a while loop, starting with a huge upper bound (1e200) and checking the fval value returned by the solver. If the resulting fval is not small enough, I decrease the upper bound by a factor. The code looks something like this:
fval = 1;
factor = 1;
while fval > 1e-7
    UB = factor*1e200;        % current upper bound of the search interval
    % myfun stands in for your own objective; its extra arguments are elided
    [x,fval,exitflag] = fminbnd(@(x) myfun(x,...),LB,UB,options);
    factor = factor * 0.001;  % shrink the upper bound for the next pass
end
The solver exits the while loop when a good solution is found. You can of course also play with LB by introducing another factor, and/or increase the factor step.
My 1st language isn't English so I apologize for any mistakes made.
Cheers,
Cristian
Why not use a simple bisection method? You always evaluate the middle of a certain interval and then reduce this to the right or left part so that you always have one bound giving a negative and the other bound giving a positive value. You can reduce to arbitrary precision very quickly. Since you reduce the interval in half each time it should converge very quickly.
I would suspect however there is some other problem with that function in that it has discontinuities. It seems strange that fzero would work so badly. It's a deterministic function right?
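A minimal bisection sketch follows. Again g is a made-up monotone stand-in for Lik(k)-L0 (root at 171.13), and the bracket and tolerance are illustrative; the only real requirement is that the function values at the two ends have opposite signs.

```matlab
g = @(k) exp(0.05 * (k - 171.13)) - 1;   % hypothetical stand-in for Lik(k) - L0

lo = 165.95;                              % g(lo) < 0
hi = 200;                                 % g(hi) > 0
assert(g(lo) < 0 && g(hi) > 0, 'The root must be bracketed.');

tol = 1e-6;
iters = 0;
while hi - lo > tol
    mid = (lo + hi) / 2;
    if g(mid) < 0
        lo = mid;                         % root lies in the upper half
    else
        hi = mid;                         % root lies in the lower half
    end
    iters = iters + 1;
end
k_root = (lo + hi) / 2;
fprintf('k = %.6f after %d iterations\n', k_root, iters);
```

Because the bracket is halved every step, the number of function evaluations grows only with the logarithm of the requested precision.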

How to get level of fitness of data to a distribution by using probplot() in Matlab?

I have 2 sets of floating-point data, set A and set B. Both of them are matrices of size 40*40. I would like to find out which set is closer to the normal distribution. I know how to use probplot() in Matlab to draw the probability plot of one set. However, I do not know how to quantify how well the data fit the distribution.
In Python, when people use probplot, a parameter, R^2, shows how well the distribution of the data matches the normal distribution. The closer the R^2 value is to 1, the better the fit. Thus, I can simply use the function to compare two sets of data by their R^2 values. However, because of a machine problem, I cannot use Python on my current machine. Is there a parameter or function similar to the R^2 value in Matlab?
Thank you very much,
Fitting a curve or surface to data and obtaining the goodness of fit, i.e., sse, rsquare, dfe, adjrsquare, rmse, can be done using the function fit. More info here...
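One possible way to get an R^2-like number directly, without any toolbox, is to correlate the sorted data with standard normal quantiles, which is the idea behind a normal probability plot. The sketch below is an approximation under stated assumptions: the plotting positions (i-0.5)/n are one common choice and may differ from what Python's probplot uses, and A and B are random stand-ins for the real matrices.

```matlab
rng(0);
A = randn(40);                          % stand-in for data set A (normal)
B = rand(40);                           % stand-in for data set B (uniform)

n = numel(A);
p = ((1:n)' - 0.5) / n;                 % plotting positions (one common choice)
q = -sqrt(2) * erfcinv(2 * p);          % standard normal quantiles, base MATLAB

cA = corrcoef(q, sort(A(:)));           % correlation of quantiles vs sorted data
cB = corrcoef(q, sort(B(:)));
R2_A = cA(1, 2)^2;
R2_B = cB(1, 2)^2;

fprintf('R^2 for A: %.4f, R^2 for B: %.4f\n', R2_A, R2_B);
```

The set with the value closer to 1 lies closer to a straight line on the normal probability plot.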
The approach of @nate (+1) is definitely one possible way of going about this problem. However, the statistician in me is compelled to suggest the following alternative (that does, alas, require the statistics toolbox - but you have this if you have the student version):
Given that your data is Normal (not Multivariate normal), consider using the Jarque-Bera test.
Jarque-Bera tests the null hypothesis that a given dataset is generated by a Normal distribution, versus the alternative that it is generated by some other distribution. If the Jarque-Bera test statistic is less than some critical value, then we fail to reject the null hypothesis.
So how does this help with the goodness-of-fit problem? Well, the larger the test statistic, the more "non-Normal" the data is. The smaller the test statistic, the more "Normal" the data is.
So, assuming you have converted your matrices into two vectors, A and B (each should be 1600 by 1 based on the dimensions you provide in the question), you could do the following:
% Build sample data
A = randn(1600, 1);
B = rand(1600, 1);
% Perform JB test
[ANormal, ~, AStat] = jbtest(A);
[BNormal, ~, BStat] = jbtest(B);
% Display result
if AStat < BStat
    disp('A is closer to normal');
else
    disp('B is closer to normal');
end
As a little bonus of doing things this way, ANormal and BNormal tell you whether you can reject or fail to reject the null hypothesis that the sample in A or B comes from a normal distribution! Specifically, jbtest returns 1 when it rejects the null at the 5% significance level and 0 otherwise. So if ANormal is 0, you fail to reject the null (i.e. the test statistic is consistent with A being drawn from a Normal). If ANormal is 1, then the data in A is probably not generated from a Normal distribution.
CAUTION: The approach I've advocated here is only valid if A and B are the same size, but you've indicated in the question that they are :-)