Why do the principal component values from SciPy and MATLAB not agree?

I was trying to do some PCA reconstructions of MNIST in Python and compare them to my (old) reconstructions in MATLAB, and I happened to discover that my reconstructions don't agree. After some debugging I decided to print a unique characteristic of the principal components of each one to reveal whether they were the same, and I discovered to my surprise that they were not. I printed the sum of all components and got different numbers. I did the following in MATLAB:
[coeff, ~, ~, ~, ~, mu] = pca(X_train);
U = coeff(:,1:K)
U_fingerprint = sum(U(:))
%print 31.0244
and in Python/scikit-learn:
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=K)   # K principal components, as in the MATLAB snippet
pca = pca.fit(X_train)
U = pca.components_
print 'U_fingerprint', np.sum(U)
# prints 12.814
Why are the two PCAs not computing the same value?
All my attempts at solving this issue:
The way I discovered this was that when I was reconstructing my MNIST images, the Python reconstructions were much closer to their original images. I got an error of 0.0221556788645 in Python, while in MATLAB I got errors of size 29.07578. To figure out where the difference was coming from, I decided to fingerprint the data sets (maybe they were normalized differently). So I got two independent copies of the MNIST data set (normalized by dividing by 255) and computed their fingerprints (summing all the numbers in each data set):
print np.sum(x_train) # from keras
print np.sum(X_train)+np.sum(X_cv) # from TensorFlow
6.14628e+06
6146269.1585420668
which are (essentially) the same (one copy from the TensorFlow MNIST and the other from the Keras MNIST; note that one MNIST train set has about 1000 fewer training examples, so you need to append the missing ones). To my surprise, my MATLAB data had the same fingerprint:
data_fingerprint = sum(X_train(:))
% prints data_fingerprint = 6.1463e+06
meaning the data sets are exactly the same. Good, so the data normalization is not the issue.
In my MATLAB script I am actually computing the reconstruction manually as follows:
U = coeff(:,1:K)
X_tilde_train = (U * U' * X_train);
train_error_PCA = (1/N_train)*norm( X_tilde_train - X_train ,'fro')^2
%train_error_PCA = 29.0759
so I thought the problem might be that in Python I was using the interface provided for computing the reconstructions, as in:
pca = PCA(n_components=k)
pca = pca.fit(X_train)
X_pca = pca.transform(X_train) # M_train x K
#print 'X_pca' , X_pca.shape
X_reconstruct = pca.inverse_transform(X_pca)
# the same pipeline is run on the TensorFlow copy (X_train) and the Keras copy (x_train),
# producing X_reconstruct_tf and X_reconstruct_keras respectively
print 'tensorflow error: ',(1.0/X_train.shape[0])*LA.norm(X_reconstruct_tf - X_train)
print 'keras error: ',(1.0/x_train.shape[0])*LA.norm(X_reconstruct_keras - x_train)
#tensorflow error: 0.0221556788645
#keras error: 0.0212030354818
which results in very different error values, 0.022 vs 29.07, a shocking difference!
Thus, I decided to code that exact reconstruction formula in my Python script:
pca = PCA(n_components=k)
pca = pca.fit(X_train)
U = pca.components_
print 'U_fingerprint', np.sum(U)
X_my_reconstruct = np.dot( U.T , np.dot(U, X_train.T) ).T  # transpose back to M_train x D
print 'U error: ',(1.0/X_train.shape[0])*LA.norm(X_my_reconstruct - X_train)
# U error: 0.0221556788645
To my surprise, it has the same error as the one computed by using the interface. Thus, I concluded that I don't have the misconception about PCA that I thought I had.
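For reference, a minimal sketch of what scikit-learn's transform/inverse_transform pair amounts to for a plain (non-whitened) PCA; the data below is random stand-in data, not MNIST:
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 20)                  # stand-in data, M x D
pca = PCA(n_components=12).fit(X)

# transform projects the mean-centered data onto the components;
# inverse_transform maps back and re-adds the mean
X_proj = np.dot(X - pca.mean_, pca.components_.T)
X_back = np.dot(X_proj, pca.components_) + pca.mean_

print(np.allclose(X_proj, pca.transform(X)))                 # True
print(np.allclose(X_back, pca.inverse_transform(X_proj)))    # True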
All of that led me to check what the principal components actually were, and to my surprise scipy and MATLAB have different fingerprints for their PCA values.
Does anyone know why, or what's going on?
As Warren suggested, the PCA components (eigenvectors) might have different signs. After computing a fingerprint by adding all components in magnitude only, I discovered they have the same fingerprint:
[coeff, ~, ~, ~, ~, mu] = pca(X_train);
K=12;
U = coeff(:,1:K)
U_fingerprint = sumabs(U(:))
% U_fingerprint = 190.8430
and for Python:
k=12
pca = PCA(n_components=k)
pca = pca.fit(X_train)
U = pca.components_
print 'U_fingerprint', np.sum(np.absolute(U))
# U_fingerprint 190.843
which means the difference must be because of the different signs of the (PCA) U vectors. I find that very surprising; I didn't think the sign would make such a big difference, and I hadn't even considered it. I guess I was wrong?

I don't know if this is the problem, but it certainly could be. Principal component vectors are like eigenvectors: if you multiply a vector by -1, it is still a valid PCA vector. Some of the vectors computed by MATLAB might have a different sign than those computed in Python. That will result in very different sums.
For example, the MATLAB documentation has this example:
coeff = pca(ingredients)
coeff =
-0.0678 -0.6460 0.5673 0.5062
-0.6785 -0.0200 -0.5440 0.4933
0.0290 0.7553 0.4036 0.5156
0.7309 -0.1085 -0.4684 0.4844
I have my own Python PCA code, and with the same input as in MATLAB, it produces this coefficient array:
[[ 0.0678 0.646 -0.5673 0.5062]
[ 0.6785 0.02 0.544 0.4933]
[-0.029 -0.7553 -0.4036 0.5156]
[-0.7309 0.1085 0.4684 0.4844]]
So, instead of simply summing the coefficient array, try summing the absolute values of the coefficients. Alternatively, ensure that all the vectors have the same sign convention before summing. You could do that by, say, multiplying each column by the sign of the first element in that column (assuming none of them are zero).
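A minimal sketch of that sign-fixing idea in NumPy, assuming the components are stored one per column as in MATLAB's coeff (the arrays below just reuse the example above for illustration):
import numpy as np

def fix_signs(U):
    # flip each column of U so that its first entry is non-negative
    signs = np.sign(U[0, :])
    signs[signs == 0] = 1.0          # leave columns whose first entry is exactly zero untouched
    return U * signs

# two component matrices that differ only by the signs of some columns
U_matlab = np.array([[-0.0678, -0.6460,  0.5673, 0.5062],
                     [-0.6785, -0.0200, -0.5440, 0.4933],
                     [ 0.0290,  0.7553,  0.4036, 0.5156],
                     [ 0.7309, -0.1085, -0.4684, 0.4844]])
U_python = -U_matlab.copy()
U_python[:, 3] = U_matlab[:, 3]      # the last column happens to agree in sign

print(np.allclose(fix_signs(U_matlab), fix_signs(U_python)))   # True
print(np.sum(np.abs(U_matlab)), np.sum(np.abs(U_python)))      # identical magnitude fingerprints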

Related

Problem understanding loss function behavior using Flux.jl in Julia

So, first of all, I am new to neural networks (NN).
As part of my PhD, I am trying to solve a problem with an NN.
For this, I have created a program that builds a data set made of
a collection of input vectors (each with 63 elements) and their corresponding
output vectors (each with 6 elements).
So, my program looks like this:
Nₜᵣ = 25; # number of inputs in the data set
xtrain, ytrain = dataset_generator(Nₜᵣ); # generates In/Out vectors: xtrain/ytrain
datatrain = zip(xtrain,ytrain); # assemble my data
Now, both xtrain and ytrain are of type Array{Array{Float64,1},1}, meaning that
if (say) Nₜᵣ = 2, they look like:
julia> xtrain #same for ytrain
2-element Array{Array{Float64,1},1}:
[1.0, -0.062, -0.015, -1.0, 0.076, 0.19, -0.74, 0.057, 0.275, ....]
[0.39, -1.0, 0.12, -0.048, 0.476, 0.05, -0.086, 0.85, 0.292, ....]
The first 3 elements of each vector are normalized to unity (they represent x, y, z coordinates), and the following 60 numbers are also normalized to unity and correspond to some measurable attributes.
The program continues like:
using Flux, LinearAlgebra; # needed for Dense/Chain/train! and for norm
layer1 = Dense(length(xtrain[1]),46,tanh); # setting up the first of 6 layers
layer2 = Dense(46,36,tanh) ;
layer3 = Dense(36,26,tanh) ;
layer4 = Dense(26,16,tanh) ;
layer5 = Dense(16,6,tanh) ;
layer6 = Dense(6,length(ytrain[1])) ;
m = Chain(layer1,layer2,layer3,layer4,layer5,layer6); # composing the layers
squaredCost(ym,y) = (1/2)*norm(y - ym).^2;
loss(x,y) = squaredCost(m(x),y); # define loss function
ps = Flux.params(m); # initializing mod.param.
opt = ADAM(0.01, (0.9, 0.8)); #
and finally:
trainmode!(m,true)
itermax = 700; # set max number of iterations
losses = [];
for iter in 1:itermax
    Flux.train!(loss,ps,datatrain,opt);
    push!(losses, sum(loss.(xtrain,ytrain)));
end
It runs perfectly; however, I noticed that as I train my model with an increasing data set (Nₜᵣ = 10, 15, 25, etc.), the loss function seems to increase. See the image below:
Where: y1: Nₜᵣ=10, y2: Nₜᵣ=15, y3: Nₜᵣ=25.
So, my main question:
Why is this happening? I cannot see an explanation for this behavior. Is this somehow expected?
Remarks: Note that
All elements from the training data set (input and output) are normalized to [-1,1].
I have not tried changing the activation functions
I have not tried changing the optimization method
Considerations: I need a training data set of nearly 10000 input vectors, so I am expecting an even worse scenario...
Some personal thoughts:
Am I arranging my training dataset correctly? Say, if every single data vector is made of 63 numbers, is it correct to group them in an array and then pile them into an Array{Array{Float64,1},1}? I have no experience using NNs and Flux. How could I build a data set of 10000 I/O vectors differently? Can this be the issue? (I am very inclined to this)
Can this behavior be related to the chosen act. functions? (I am not inclined to this)
Can this behavior be related to the opt. algorithm? (I am not inclined to this)
Am I training my model wrong? Are the passes of the iteration loop really iterations, or are they epochs? I am struggling to put the concepts of "epochs" and "iterations" into practice (and to differentiate them).
loss(x,y) = squaredCost(m(x),y); # define loss function
Your losses aren't normalized, so adding more data can only increase this cost function. However, the cost per data point doesn't seem to be increasing. To get rid of this effect, you might want to use a normalized cost function, for example the mean squared cost.
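As a language-agnostic numerical illustration of that point (sketched in Python/NumPy rather than Flux, with made-up per-sample errors): the summed loss grows roughly linearly with the number of samples, while the mean loss stays roughly constant.
import numpy as np

rng = np.random.default_rng(0)
for n in (10, 15, 25):
    per_sample_error = rng.normal(scale=0.1, size=n) ** 2   # pretend squared errors of similar size
    print(n, per_sample_error.sum(), per_sample_error.mean())
# the summed column grows with n; the mean column does not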

Small bug in MATLAB R2017B LogLikelihood after fitnlm?

Background: I am working on a problem similar to the nonlinear logistic regression described in link [1] (my problem is more complicated, but link [1] is enough for the rest of this post). Comparing my results with those obtained in parallel with an R package, I got similar results for the coefficients, but (very approximately) an opposite log-likelihood.
Hypothesis: the LogLikelihood given by fitnlm in MATLAB is in fact the negative log-likelihood. (Note that this consequently impairs the BIC and AIC computed by MATLAB.)
Reasoning: in [1], the same problem is solved through two different approaches. ML approach: define the negative log-likelihood and optimize it with fminsearch. GLS approach: use fitnlm.
The negative log-likelihood after the ML approach is: 380
The negative log-likelihood after the GLS approach is: -406
I imagine the second one should be at least multiplied by (-1)?
Questions: Did I miss something? Is the (-1) coefficient enough, or would this simple correction not be enough?
Self-contained code:
%copy-pasting code from [1]
myf = @(beta,x) beta(1)*x./(beta(2) + x);
mymodelfun = @(beta,x) 1./(1 + exp(-myf(beta,x)));
rng(300,'twister');
x = linspace(-1,1,200)';
beta = [10;2];
beta0=[3;3];
mu = mymodelfun(beta,x);
n = 50;
z = binornd(n,mu);
y = z./n;
%ML Approach
mynegloglik = @(beta) -sum(log(binopdf(z,n,mymodelfun(beta,x))));
opts = optimset('fminsearch');
opts.MaxFunEvals = Inf;
opts.MaxIter = 10000;
betaHatML = fminsearch(mynegloglik,beta0,opts)
neglogLH_MLApproach = mynegloglik(betaHatML);
%GLS Approach
wfun = @(xx) n./(xx.*(1-xx));
nlm = fitnlm(x,y,mymodelfun,beta0,'Weights',wfun)
neglogLH_GLSApproach = - nlm.LogLikelihood;
Source:
[1] https://uk.mathworks.com/help/stats/examples/nonlinear-logistic-regression.html
This answer (now) only details which code is used. Please see Tom Lane's answer below for a substantive answer.
Basically, fitnlm.m is a call to NonLinearModel.fit.
When opening NonLinearModel.m, one gets in line 1209:
model.LogLikelihood = getlogLikelihood(model);
getlogLikelihood is itself described between lines 1234-1251.
For instance:
function L = getlogLikelihood(model)
(...)
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
(...)
Please also note that this notably impacts ModelCriterion.AIC and ModelCriterion.BIC, as they are computed using model.LogLikelihood ("thinking" it is the log-likelihood).
To get the corresponding formula for BIC/AIC/..., type:
edit classreg.regr.modelutils.modelcriterion
This is Tom from MathWorks. Take another look at the formula quoted:
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
Remember the normal distribution has a factor (1/sqrt(2*pi)), so taking logs of that gives us -log(2*pi)/2. So the minus sign comes from that and it is part of the log likelihood. The property value is not the negative log likelihood.
One reason for the difference in the two log likelihood values is that the "ML approach" value is computing something based on the discrete probabilities from the binomial distribution. Those are all between 0 and 1, and they add up to 1. The "GLS approach" is computing something based on the probability density of the continuous normal distribution. In this example, the standard deviation of the residuals is about 0.0462. That leads to density values that are much higher than 1 at the peak. So the two things are not really comparable. You would need to convert the normal values to probabilities on the same discrete intervals that correspond to individual outcomes from the binomial distribution.
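A small numerical sketch of that point, in Python with SciPy (the numbers are made up, not taken from [1]): binomial probabilities are at most 1, so their log-likelihood is negative, while a normal density with a small standard deviation exceeds 1 and can give a large positive log-likelihood; multiplying each density by the width of the discrete outcome interval (1/n here) makes the two roughly comparable.
import numpy as np
from scipy import stats

n = 50
p = 0.6
rng = np.random.default_rng(0)
z = rng.binomial(n, p, size=200)      # simulated counts
y = z / n                             # observed proportions

# discrete log-likelihood: probabilities are <= 1, so this sum is negative
ll_binom = np.sum(stats.binom.logpmf(z, n, p))

# continuous log-likelihood of the proportions under a normal density:
# the density can exceed 1 when sigma is small, so this sum can be positive
sigma = y.std(ddof=1)
ll_norm = np.sum(stats.norm.logpdf(y, loc=p, scale=sigma))

# approximate probabilities over the discrete 1/n-wide outcome intervals
ll_norm_discretized = ll_norm + len(y) * np.log(1.0 / n)

print(ll_binom, ll_norm, ll_norm_discretized)
# ll_norm is large and positive; ll_binom and ll_norm_discretized roughly agree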

MATLAB's pcacov on numpy/scipy?

I was wondering what function in numpy/scipy corresponded to pcacov() in MATLAB. If there isn't a corresponding one, what would be the best way to implement the function?
Thanks!
NumPy and SciPy don't have specific routines for PCA, but they do have the linear algebra primitives required to compute it. Any pca function in any language will basically be just a light wrapper around an eigenvalue or singular value decomposition, with different conventions regarding centering, normalization, meaning of matrix dimensions, and terms (eigenvectors, principal components, principal vectors, latent variables, etc. are all different names for the same thing, sometimes with slight variations).
So, for example, given a matrix X you can compute the PCA using the SVD:
import numpy as np
def pca(X):
    X_centered = X - X.mean(0)
    u, s, vt = np.linalg.svd(X_centered)
    evals = s[::-1] ** 2 / (X.shape[0] - 1)
    evecs = vt[::-1].T
    return evals, evecs
np.random.seed(0)
X = np.random.rand(100, 3)
evals, evecs = pca(X)
print(evals)
# [ 0.06820946 0.08738236 0.09858988]
print(evecs)
# [[-0.49659797 0.4567562 -0.73808145]
# [ 0.34847559 0.88371847 0.31242029]
# [ 0.79495611 -0.10205609 -0.59802118]]
If you have a covariance matrix, you can compute the PCA using an eigenvalue decomposition:
def pcacov(C):
    return np.linalg.eigh(C)
C = np.cov(X.T)
evals, evecs = pcacov(C)
print(evals)
# [ 0.06820946 0.08738236 0.09858988]
print(evecs)
# [[-0.49659797 -0.4567562 -0.73808145]
# [ 0.34847559 -0.88371847 0.31242029]
# [ 0.79495611 0.10205609 -0.59802118]]
The results are the same, up to a sign in the eigenvector columns.
Now, I've used a particular set of conventions here regarding whether data points are in rows or columns, how the covariance is normalized, etc., and those details vary from implementation to implementation of PCA. So the MATLAB code might give different results because it uses different conventions internally. But under the hood, it's doing something very similar to the computations used above.
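If you want to mirror MATLAB's pcacov conventions more closely (components sorted by decreasing variance, one component per column), a thin wrapper along these lines might be what you want; this is a sketch following the ordering described in the MATLAB documentation, not a drop-in replacement:
import numpy as np

def pcacov_matlab_like(C):
    # eigen-decompose a covariance matrix and sort by decreasing variance
    evals, evecs = np.linalg.eigh(C)       # ascending eigenvalues
    order = np.argsort(evals)[::-1]        # largest variance first, as in pcacov
    return evecs[:, order], evals[order]   # (coeff, latent) in pcacov's ordering

np.random.seed(0)
X = np.random.rand(100, 3)
coeff, latent = pcacov_matlab_like(np.cov(X.T))
print(latent)   # the same variances as above, now in decreasing order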

MATLAB help in finding dimensions

Can anybody help me with this assignment, please?
I am new to MATLAB, and passing this year depends on this assignment. I don't have much time to explore MATLAB, and I have already wasted a lot of time trying to do this assignment my way.
I have already written the equations on paper, but transferring the equations into MATLAB code is really hard for me.
All i have for now is:
syms h
l = (0.75-h.^2)/(3*sqrt((5*h.^2)/4)); %h is h_max
V_default = (h.^2/2)*l;
dv = diff(V_default); %it's max. when the derivative is max.
h1 = solve( dv ==0);
h_max = (h1>0);
l_max = (0.75-h_max.^2)/(3*sqrt((h_max/2).^2+(h_max.^2)));
V_max = ((h_max.^2)./(2.*l_max));
but it keeps giving me the error "Error using ./
Matrix dimensions must agree.
Error in triangle (line 9)
V_max = ((h_max.^2)./(2.*l_max)); "
Not really helping with the assignment here, but with the MATLAB syntax. In the following line:
l_max = (0.75-h_max.^2)/(3*sqrt((h_max/2).^2+(h_max.^2)));
you're using /, which is a matrix divide. You might want to use ./, which divides the terms element by element. If I do this
l_max = (0.75-h_max.^2) ./ (3*sqrt((h_max/2).^2+(h_max.^2)));
then your code doesn't return any error. But I have no idea if it's the correct solution of your assignment, I'll leave that to you!
In line 5, the result h1 is a vector of two values, but the variable itself remains symbolic, from the Symbolic Math Toolbox. MATLAB treats such variables slightly differently. For that reason, the line h_max = (h1>0) doesn't really do what you expect. Since I think that from this point on you are interested in a single value h_max, I would convert h1 to a regular MATLAB variable and change your code to the following:
h1 = double(solve( dv ==0)); % converts symbolic to regular vectors
h_max = h1(h1>0); % filters out all negative and zero values
l_max = (0.75-h_max.^2)/(3*sqrt((h_max/2).^2+(h_max.^2)));
V_max = ((h_max.^2)./(2.*l_max));
EDIT:
If you still get an error, it means solve(...) returns more than one positive value. In this case, as suggested, use dotted operations such as ./, but then the results in l_max and V_max will not be single values but vectors of the same size as h_max, which means you don't have one maximum volume.

Matlab SVM linear binary classification failure

I'm trying to implement a simple linear binary SVM classification in MATLAB, but I get strange results.
I have two classes g={-1;1} defined by two predictors, varX and varY. In fact, varY is enough to split the dataset into two distinct classes (at about varY=0.38), but I will keep varX as a random variable since I will need it for other work.
Using the code below (adapted from MATLAB examples) I get a wrong classifier. The linear classifier should be close to a horizontal line at about varY=0.38, as we can see by plotting the 2D points.
The line that should separate the two classes is not displayed.
What am I doing wrong?
g(1:14,1)=1;
g(15:26,1)=-1;
m3(:,1)=rand(26,1); %varX
m3(:,2)=[0.4008; 0.3984; 0.4054; 0.4048; 0.4052; 0.4071; 0.4088; 0.4113; 0.4189;
0.4220; 0.4265; 0.4353; 0.4361; 0.4288; 0.3458; 0.3415; 0.3528;
0.3481; 0.3564; 0.3374; 0.3610; 0.3241; 0.3593; 0.3434; 0.3361; 0.3201]; %varY
SVMmodel_testm = fitcsvm(m3,g,'KernelFunction','Linear');
d = 0.005; % Step size of the grid
[x1Grid,x2Grid] = meshgrid(min(m3(:,1)):d:max(m3(:,1)),...
min(m3(:,2)):d:max(m3(:,2)));
xGrid = [x1Grid(:),x2Grid(:)]; % The grid
[~,scores2] = predict(SVMmodel_testm,xGrid); % The scores
figure();
h(1:2)=gscatter(m3(:,1), m3(:,2), g,'br','ox');
hold on
% Support vectors
h(3) = plot(m3(SVMmodel_testm.IsSupportVector,1),m3(SVMmodel_testm.IsSupportVector,2),'ko','MarkerSize',10);
% Decision boundary
contour(x1Grid,x2Grid,reshape(scores2(:,1),size(x1Grid)),[0 0],'k');
xlabel('varX'); ylabel('varY');
set(gca,'Color',[0.5 0.5 0.5]);
hold off
A common problem with SVMs, or any classification method for that matter, is unnormalized data. You have one dimension that spans from 0 to 1 and another that spans from about 0.3 to 0.4. This causes an imbalance between the features. Common practice is to normalize the features somehow, for example by their standard deviation. Try this code:
g(1:14,1)=1;
g(15:26,1)=-1;
m3(:,1)=rand(26,1); %varX
m3(:,2)=[0.4008; 0.3984; 0.4054; 0.4048; 0.4052; 0.4071; 0.4088; 0.4113; 0.4189;
0.4220; 0.4265; 0.4353; 0.4361; 0.4288; 0.3458; 0.3415; 0.3528;
0.3481; 0.3564; 0.3374; 0.3610; 0.3241; 0.3593; 0.3434; 0.3361; 0.3201]; %varY
m3(:,2) = m3(:,2)./std(m3(:,2));
SVMmodel_testm = fitcsvm(m3,g,'KernelFunction','Linear');
Notice the line before the last.
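The same idea sketched in Python/scikit-learn, purely as an illustration and not part of the original answer (the data below is made up to resemble the example, with varX roughly uniform on [0,1] and varY clustered around 0.41 and 0.34): standardize the features, then fit a linear SVM.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
var_x = rng.random(26)                               # spans roughly 0..1
var_y = np.r_[rng.normal(0.41, 0.01, 14),            # class +1 clustered near 0.41
              rng.normal(0.34, 0.01, 12)]            # class -1 clustered near 0.34
X = np.column_stack([var_x, var_y])
y = np.r_[np.ones(14), -np.ones(12)]

clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))  # scaling puts both features on the same footing
clf.fit(X, y)
print(clf.score(X, y))   # should be (nearly) perfect on this easily separable training data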