How can I use the princomp function of Matlab in the following case? - matlab

I have 10 images(18x18). I save these images inside an array named images[324][10] where the number 324 represents the amount of pixels for an image and the number 10 the total amount of images that I have.
I would like to use these images for a neuron network however 324 is a big number to give as an input and thus I would like to decrease this number but retain as much information as possible.
I heard that you can do this with the princomp function which implements PCA.
The problem is that I haven't found any example on how to use this function, and especially for my case.
If I run
[COEFF, SCORE, latent] = princomp(images);
it runs fine but how can I then get the array newimages[number_of_desired_features][10]?

PCA could be a right choice here (but not the only one). Although, you should be aware of the fact, that PCA does not reduce the number of your input data features automatically. I recommend you reading this tutorial: http://arxiv.org/pdf/1404.1100v1.pdf - it is the one I used to understand PCA and its really good for beginners.
Getting back to your question. An image is an vector in a 324-dimensional space. In this space the first base vector is one having a white pixel in top left corner, next one is having next pixel white, all the other black - and so on. It probably is not the best base vector set to represent this image data. PCA computes new base vectors (the COEFF matrix - the new vectors expressed as values in old vector space) and new image vector values (the SCORE matrix). At that point you have not lost ANY data at all (no feature number reduction). But, you could stop using some of the new base vectors, because they are probably connected with noice, not the data itself. It is all described in details in the tutorial.
images = rand(10,324);
[COEFF, SCORE] = princomp(images);
reconstructed_images = SCORE / COEFF + repmat(mean(images,1), 10, 1);
images - reconstructed_images
%as you see there are almost only zeros - the non-zero values are effects of small numerical errors
%its possible because you are only switching between the sets of base vectors used to represent the data
for i=100:324
SCORE(:,i) = zeros(10,1);
end
%we remove the features 100 to 324, leaving only first 99
%obviously, you could take only the non-zero part of the matrix and use it
%somewhere else, like for your neural network
reconstructed_images_with_reduced_features = SCORE / COEFF + repmat(mean(images,1), 10, 1);
images - reconstructed_images_with_reduced_features
%there are less features, but reconstruction is still pretty good

Related

Matching PCA output to corresponding coordiantes

I have this dataset consisting of around 800 images of the view of a car driving in circles with corresponding coordinates and changing background. The goal is
to train a neural network to predict the position based on the images. I reshaped the images so that
the original 160x320 pixels come down to 1x51200, so that I can feed my NN more easily. However, because
this is a quite large dimension I applied a PCA to reduce the dimension and the PCA indeed worked well, so
that I could only take the 100 eigenvectors with the highest eigenvalues and still have 90-95 % of the total variance.
But now comes my obstacle: I have these 100 images, still reconstructable and visualizable, but I don't exactly know, to which coordinates they correspond. I can't just take the first 100 coordinates because these eigenvalues where obviously taken from different timesteps through the progression of the images. I need this information so that my NN is able to match them while learning and checking its progress while testing. I read a similar question where the answers stated that it's not possible to extract indices out of a PCA-output, but I'm pretty sure there must be other persons who already faced similar obstacles?
What you have to be able to do with PCA to go from compact to original representation.
# dataset.shape = (800, 51200)
M = pca(dataset); # M.shape = (100, 51200)
original = dataset[k, :] # one vector corresponding to one image in your dataset
compressed_dataset = dataset # M.T; # compressed_dataset.shape = (800, 100)
# if you have a compressed representation of an image you restore it with
restored = compressed_dataset[k, :] # M # .shape = (1, 512000)
# here you expect all_close(restored, original)
Notice that the indices are stored appart from the data vectors.

Principal component analysis in matlab?

I have a training set with the size of (size(X_Training)=122 x 125937).
122 is the number of features
and 125937 is the sample size.
From my little understanding, PCA is useful when you want to reduce the dimension of the features. Meaning, I should reduce 122 to a smaller number.
But when I use in matlab:
X_new = pca(X_Training)
I get a matrix of size 125973x121, I am really confused, because this not only changes the features but also the sample size? This is a big problem for me, because I still have the target vector Y_Training that I want to use for my neural network.
Any help? Did I badly interpret the results? I only want to reduce the number of features.
Firstly, the documentation of the PCA function is useful: https://www.mathworks.com/help/stats/pca.html. It mentions that the rows are the samples while the columns are the features. This means you need to transpose your matrix first.
Secondly, you need to specify the number of dimensions to reduce to a priori. The PCA function does not do that for you automatically. Therefore, in addition to extracting the principal coefficients for each component, you also need to extract the scores as well. Once you have this, you simply subset into the scores and perform the reprojection into the reduced space.
In other words:
n_components = 10; % Change to however you see fit.
[coeff, score] = pca(X_training.');
X_reduce = score(:, 1:n_components);
X_reduce will be the dimensionality reduced feature set with the total number of columns being the total number of reduced features. Also notice that the number of training examples does not change as we expect. If you want to make sure that the number of features are along the rows instead of the columns after we reduce the number of features, transpose this output matrix as well before you proceed.
Finally, if you want to automatically determine the number of features to reduce to, one method to do so is to calculate the variance explained of each feature, then accumulate the values from the first feature up to the point where we exceed some threshold. Usually 95% is used.
Therefore, you need to provide additional output variables to capture these:
[coeff, score, latent, tsquared, explained, mu] = pca(X_training.');
I'll let you go through the documentation to understand the other variables, but the one you're looking at is the explained variable. What you should do is find the point where the total variance explained exceeds 95%:
[~,n_components] = max(cumsum(explained) >= 95);
Finally, if you want to perform a reconstruction and see how well the reconstruction into the original feature space performs from the reduced feature, you need to perform a reprojection into the original space:
X_reconstruct = bsxfun(#plus, score(:, 1:n_components) * coeff(:, 1:n_components).', mu);
mu are the means of each feature as a row vector. Therefore you need add this vector across all examples, so broadcasting is required and that's why bsxfun is used. If you're using MATLAB R2018b, this is now implicitly done when you use the addition operation.
X_reconstruct = score(:, 1:n_components) * coeff(:, 1:n_components).' + mu;

How to quickly/easily merge and average data in matrix in MATLAB?

I have got a matrix of AirFuelRatio values at certain engine speeds and throttlepositions. (eg. the AFR is 14 at 2500rpm and 60% throttle)
The matrix is now 25x10, and the engine speed ranges from 1200-6000rpm with interval 200rpm, the throttle range from 0.1-1 with interval 0.1.
Say i have measured new values, eg. an AFR of 13.5 at 2138rpm and 74,3% throttle, how do i merge that in the matrix? The matrix closest values are 2000 or 2200rpm and 70 or 80% throttle. Also i don't want new data to replace the older data. How can i make the matrix take this value in and adjust its values to take the new value in account?
Simplified i have the following x-axis values(top row) and 1x4 matrix(below):
2 4 6 8
14 16 18 20
I just measured an AFR value of 15.5 at 3 rpm. If you interpolate the AFR matrix you would've gotten a 15, so this value is out of the ordinary.
I want the matrix to take this data and adjust the other variables to it, ie. average everything so that the more data i put in the more reliable and accurate the matrix becomes. So in the simplified case the matrix would become something like:
2 4 6 8
14.3 16.3 18.2 20.1
So it averages between old and new data. I've read the documentation about concatenation but i believe my problem can't be solved with that function.
EDIT: To clarify my question, the following visual clarification.
The 'matrix' keeps the same size of 5 points whil a new data point is added. It takes the new data in account and adjusts the matrix accordingly. This is what i'm trying to achieve. The more scatterd data i get, the more accurate the matrix becomes. (and yes the green dot in this case would be an outlier, but it explains my case)
Cheers
This is not a matter of simple merge/average. I don't think there's a quick method to do this unless you have simplifying assumptions. What you want is a statistical inference of the underlying trend. I suggest using Gaussian process regression to solve this problem. There's a great MATLAB toolbox by Rasmussen and Williams called GPML. http://www.gaussianprocess.org/gpml/
This sounds more like a data fitting task to me. What you are suggesting is that you have a set of measurements for which you wish to get the best linear fit. Instead of producing a table of data, what you need is a table of values, and then find the best fit to those values. So, for example, I could create a matrix, A, which has all of the recorded values. Let's start with:
A=[2,14;3,15.5;4,16;6,18;8,20];
I now need a matrix of points for the inputs to my fitting curve (which, in this instance, lets assume it is linear, so is the set of values 1 and x)
B=[ones(size(A,1),1), A(:,1)];
We can find the linear fit parameters (where it cuts the y-axis and the gradient) using:
B\A(:,2)
Or, if you want the points that the line goes through for the values of x:
B*(B\A(:,2))
This results in the points:
2,14.1897 3,15.1552 4,16.1207 6,18.0517 8,19.9828
which represents the best fit line through these points.
You can manually extend this to polynomial fitting if you want, or you can use the Matlab function polyfit. To manually extend the process you should use a revised B matrix. You can also produce only a specified set of points in the last line. The complete code would then be:
% Original measurements - could be read in from a file,
% but for this example we will set it to a matrix
% Note that not all tabulated values need to be present
A=[2,14; 3,15.5; 4,16; 5,17; 8,20];
% Now create the polynomial values of x corresponding to
% the data points. Choosing a second order polynomial...
B=[ones(size(A,1),1), A(:,1), A(:,1).^2];
% Find the polynomial coefficients for the best fit curve
coeffs=B\A(:,2);
% Now generate a table of values at specific points
% First define the x-values
tabinds = 2:2:8;
% Then generate the polynomial values of x
tabpolys=[ones(length(tabinds),1), tabinds', (tabinds').^2];
% Finally, multiply by the coefficients found
curve_table = [tabinds', tabpolys*coeffs];
% and display the results
disp(curve_table);

Spreading one matrix elements to another with weighted random numbers MATLAB

So I was trying to spread one matrix elements, which were generated with poissrnd, to another with using some bigger (wider?) probability function (for example 100 different possibilities with different weights) to plot both of them and see if the fluctuations after spread went down. After seeing it doesn't work right (fluctuations got bigger) I tried to identify what I did wrong on a really simple example. After testing it for a really long time I still can't understand what's wrong. The example goes like this:
I generate vector with poissrnd and vector for spreading (filled with zeros at the start)
Each element from the poiss vector tells me how many numbers (0.1 of the element value) to generate from the possible options which are: [1,2,3] with corresponding weights [0.2,0.5,0.2]
I spread what I got to my another vector on 3 elements: the corresponding (k-th one), one bofore the corresponding one and one after the corresponding one (so for example if k=3, the elements should be spread like this: most should go into 3rd element of another vector, and rest should go to 2nd and 1st element)
Plot both 0.1*poiss vector and vector after spreading to compare if fluctuations went down
The way I generate weighted numbers is from this thread: Weighted random numbers in MATLAB
and this is the code I'm using:
clear all
clc
eta=0.1;
N=200;
fot=10000000;
ix=linspace(-100,100,N);
mn =poissrnd(fot/N, 1, N);
dataw=zeros(1,N);
a=1:3;
w=[.25,.5,.25];
for k=1:N
[~,R] = histc(rand(1,eta*mn(1,k)),cumsum([0;w(:)./sum(w)]));
R = a(R);
przydz=histc(R,a);
if (k>1) && (k<N)
dataw(1,k)=dataw(1,k)+przydz(1,2);
dataw(1,k-1)=dataw(1,k-1)+przydz(1,1);
dataw(1,k+1)=dataw(1,k+1)+przydz(1,3);
elseif k==1
dataw(1,k)=dataw(1,k)+przydz(1,2);
dataw(1,N)=dataw(1,N)+przydz(1,1);
dataw(1,k+1)=dataw(1,k+1)+przydz(1,3);
else
dataw(1,k)=dataw(1,k)+przydz(1,2);
dataw(1,k-1)=dataw(1,k-1)+przydz(1,1);
dataw(1,1)=dataw(1,1)+przydz(1,3);
end
end
plot(ix,eta*mn,'g',ix,dataw,'r')
The fluctuations are still bigger, and I can't identify what's wrong... Is the method for generating weighted numbers wrong in this situation? Cause it doesn't seem so. The way I'm accumulating data from the first vector seems fine too. Is there another way I could do it (so I could then optimize it for using 'bigger' probability functions)?
Sorry for my terrible English.
[EDIT]:
Here is simple pic to show what I meant (I hope it's understandable)
How about trying negative binomial distribution? It is often used as a hyper-dispersed analogue of Poisson distribution. Additional links can be found in this paper, as well as some apparatus in supplement.

computing PCA matrix for set of sift descriptors

I want to compute a general PCA matrix for a dataset, and I will use it to reduce dimensions of sift descriptors. I have already found some algorithms to compute it, but I couldn't find a way to compute it by using MATLAB.
Can someone help me?
[coeff, score] = princomp(X)
is the right thing to do, but knowing how to use it is a little tricky.
My understanding is that you did something like:
sift_image = sift_fun(img)
which gives you a binary image: sift_feature?
(Even if not binary, this still works.)
Inputs, formulating X:
To use princomp/pca formulate X so that each column is a numel(sift_image) x 1 vector (i.e. sift_image(:))
Do this for all your images and line them up as columns in X. So X will be numel(sift_image) x num_images.
If your images aren't the same size (e.g. pixel dimensions different, more or less of a scene in the images), then you'll need to bring them into some common space, which is a whole different problem.
Unless your stuff is binary, you'll probably want to de-mean/normalize X, both in the column direction (i.e. normalizing each individual image) and row direction (de-meaning the whole dataset).
Outputs
score is the set of eigen vectors: it will be num_pixels * num_images.
To get, say the first eigen vector back into an image shape, do:
first_component = reshape(score(:,1),size(im));
And so on for the rest of the components. There are as many components as input images.
Each row of coeff is the num_images (equal to num_components) set of weights that can be applied to generate each input image. i.e.
input_image_1 = reshape(score * coeff(:,1) , size(original_im));
where input_image_1 is the correct, original shape
coeff(1,:) is a vector (num_images x 1)
score is pixels x num_images
(Disclaimer: I may have the columns/rows mixed up, but the descriptions are correct.)
Does that help?
If you have access to Statistics Toolbox, you can use the command princomp, or in recent versions the command pca.