I was wondering what function in numpy/scipy corresponded to pcacov() in MATLAB. If there isn't a corresponding one, what would be the best way to implement the function?
Thanks!
NumPy and SciPy don't have specific routines for PCA, but they do have the linear algebra primitives required to compute it. Any pca function in any language will basically be just a light wrapper around an eigenvalue or singular value decomposition, with different conventions regarding centering, normalization, meaning of matrix dimensions, and terms (eigenvectors, principal components, principal vectors, latent variables, etc. are all different names for the same thing, sometimes with slight variations).
So, for example, given a matrix X you can compute the PCA using the SVD:
import numpy as np
def pca(X):
X_centered = X - X.mean(0)
u, s, vt = np.linalg.svd(X_centered)
evals = s[::-1] ** 2 / (X.shape[0] - 1)
evecs = vt[::-1].T
return evals, evecs
np.random.seed(0)
X = np.random.rand(100, 3)
evals, evecs = pca(X)
print(evals)
# [ 0.06820946 0.08738236 0.09858988]
print(evecs)
# [[-0.49659797 0.4567562 -0.73808145]
# [ 0.34847559 0.88371847 0.31242029]
# [ 0.79495611 -0.10205609 -0.59802118]]
If you have a covariance matrix, you can compute the PCA using an eigenvalue decomposition:
def pcacov(C):
return np.linalg.eigh(C)
C = np.cov(X.T)
evals, evecs = pcacov(C)
print(evals)
# [ 0.06820946 0.08738236 0.09858988]
print(evecs)
# [[-0.49659797 -0.4567562 -0.73808145]
# [ 0.34847559 -0.88371847 0.31242029]
# [ 0.79495611 0.10205609 -0.59802118]]
The results are the same, up to a sign in the eigenvector columns.
Now, I've used a particular set of conventions here regarding whether datapoints are in rows or columns, how the covariance is normalized, etc. and those details vary from implementation to implementation of PCA. So the Matlab code might give different results because it's using different conventions internally. But under the hood, it's doing something very similar to the computations used above.
Related
I'm attempting to perform linear regression on two complex arrays. That is, I'd like to find the line of best fit, w=mz+b, where m and b are both permitted to be complex and where the R^2-value, R^2=1-RSS/TSS is minimized. (Here RSS and TSS are the sum of squared residuals and the total of sum of squares.)
I know this can be done by creating a design matrix, computing m and b, etc., but out of curiosity, I tried using linregress from scipy.stats, which did return values:
import numpy as np
from scipy import stats
rng = np.random.default_rng()
x = rng.random(10)+1j*rng.random(10)
y = 1.6*x + rng.random(10)+1j*rng.random(10)
res = stats.linregress(x, y)
print(res)
LinregressResult(slope=(1.5814820568268182-0.004143389169974774j), intercept=.
(0.37141513243354485+0.4522070413718836j), rvalue=(0.8607413430092087-
0.002255091256570885j), pvalue=0.00138658952096427, stderr=.
(0.3306870298601568+0.0024769249452937106j), intercept_stderr=.
(0.16366363994151886+0.12045799398296754j))
What meaning does a non-real, complex-valued rvalue have? Is the modulus of this value the coefficient of determination?
Background: I am working on a problem similar to the nonlinear logistic regression described in the link [1] (my problem is more complicated, but link [1] is enough for the next sections of this post). Comparing my results with those obtained in parallel with a R package, I got similar results for the coefficients, but (very approximately) an opposite logLikelihood.
Hypothesis: The logLikelihood given by fitnlm in Matlab is in fact the negative LogLikelihood. (Note that this impairs consequently the BIC and AIC computation by Matlab)
Reasonning: in [1], the same problem is solved through two different approaches. ML-approach/ By defining the negative LogLikelihood and making an optimization with fminsearch. GLS-approach/ By using fitnlm.
The negative LogLikelihood after the ML-approach is:380
The negative LogLikelihood after the GLS-approach is:-406
I imagine the second one should be at least multiplied by (-1)?
Questions: Did I miss something? Is the (-1) coefficient enough, or would this simple correction not be enough?
Self-contained code:
%copy-pasting code from [1]
myf = #(beta,x) beta(1)*x./(beta(2) + x);
mymodelfun = #(beta,x) 1./(1 + exp(-myf(beta,x)));
rng(300,'twister');
x = linspace(-1,1,200)';
beta = [10;2];
beta0=[3;3];
mu = mymodelfun(beta,x);
n = 50;
z = binornd(n,mu);
y = z./n;
%ML Approach
mynegloglik = #(beta) -sum(log(binopdf(z,n,mymodelfun(beta,x))));
opts = optimset('fminsearch');
opts.MaxFunEvals = Inf;
opts.MaxIter = 10000;
betaHatML = fminsearch(mynegloglik,beta0,opts)
neglogLH_MLApproach = mynegloglik(betaHatML);
%GLS Approach
wfun = #(xx) n./(xx.*(1-xx));
nlm = fitnlm(x,y,mymodelfun,beta0,'Weights',wfun)
neglogLH_GLSApproach = - nlm.LogLikelihood;
Source:
[1] https://uk.mathworks.com/help/stats/examples/nonlinear-logistic-regression.html
This answer (now) only details which code is used. Please see Tom Lane's answer below for a substantive answer.
Basically, fitnlm.m is a call to NonLinearModel.fit.
When opening NonLinearModel.m, one gets in line 1209:
model.LogLikelihood = getlogLikelihood(model);
getlogLikelihood is itself described between lines 1234-1251.
For instance:
function L = getlogLikelihood(model)
(...)
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
(...)
Please also not that this notably impacts ModelCriterion.AIC and ModelCriterion.BIC, as they are computed using model.LogLikelihood ("thinking" it is the logLikelihood).
To get the corresponding formula for BIC/AIC/..., type:
edit classreg.regr.modelutils.modelcriterion
this is Tom from MathWorks. Take another look at the formula quoted:
L = -(model.DFE + model.NumObservations*log(2*pi) + (...) )/2;
Remember the normal distribution has a factor (1/sqrt(2*pi)), so taking logs of that gives us -log(2*pi)/2. So the minus sign comes from that and it is part of the log likelihood. The property value is not the negative log likelihood.
One reason for the difference in the two log likelihood values is that the "ML approach" value is computing something based on the discrete probabilities from the binomial distribution. Those are all between 0 and 1, and they add up to 1. The "GLS approach" is computing something based on the probability density of the continuous normal distribution. In this example, the standard deviation of the residuals is about 0.0462. That leads to density values that are much higher than 1 at the peak. So the two things are not really comparable. You would need to convert the normal values to probabilities on the same discrete intervals that correspond to individual outcomes from the binomial distribution.
I want to run a regression analysis on below data, here x1 and x2 produce y value. But in that case, y value is fixed in all time. So regression will not happen. But why? Need explanation.
Your training set shows that the coefficients are all ~0 and the constant is 5. There's no more information in that dataset, you don't need regression to show that.
You did not specify what kind of regression you are running. Depending on the type of regression you are using, you will need the matrices to be invertible and not be related linearly.
It seems to work using normal equation (with expected results):
import numpy as np
import matplotlib.pyplot as plt
input = np.array([
[2,3,5],
[1,2,5],
[4,2,5],
[1,7,5],
[1,9,5]
])
m = len(input)
X = np.array([np.ones(m), input[:, 0],input[:, 1]]).T # Add Constant to X
y = np.array(input[:, 2]).reshape(-1, 1) # Get the dependant values
betaHat = np.linalg.solve(X.T.dot(X), X.T.dot(y)) # Calculate coefficients
print(betaHat) # Show Constant and coefficients (in that order)
[[ 5.00000000e+00]
[ 5.29208238e-16]
[ 4.32685981e-17]]
I was training to do some PCA reconstroctions of MNIST on python and compare them to my (old) reconstruction in maltab and I happened to discover that my reconstruction don't agree. After some debugging I decided to print a unique characteristic of the principal components of each one to reveal if they were the same and I discovered to my surprised that they were not the same. I printing the sum of all components and I got different numbers. I did the following in matlab:
[coeff, ~, ~, ~, ~, mu] = pca(X_train);
U = coeff(:,1:K)
U_fingerprint = sum(U(:))
%print 31.0244
and in python/scipy:
pca = pca.fit(X_train)
U = pca.components_
print 'U_fingerprint', np.sum(U)
# prints 12.814
why are the twi PCA's not computing the same value?
All my attempts and solving this issue:
The way I discovered this was because when I was reconstructing my MNIST images, the python reconstructions where much much closer to their original images by a lot. I got error of 0.0221556788645 in python while in MATLAB I got errors of size 29.07578. To figure out where the difference was coming from I decided to finger print the data sets (maybe they were normalized differently). So I got two independent copies the MNIST data set (that were normalized by dividing my 255) and got the finger prints (summing all numbers in data set):
print np.sum(x_train) # from keras
print np.sum(X_train)+np.sum(X_cv) # from TensorFlow
6.14628e+06
6146269.1585420668
which are (essentially) same (one copy from tensorflow MNIST and the other from Keras MNIST, note MNIST train data set has about 1000 less training set so you need to append the missing ones). To my surprise, my MATLAB data had the same finger print:
data_fingerprint = sum(X_train(:))
% prints data_fingerprint = 6.1463e+06
meaning the data sets are exactly the same. Good, so the normalization data is not the issue.
In my MATLAB script I am actually computing the reconstruction manually as follow:
U = coeff(:,1:K)
X_tilde_train = (U * U' * X_train);
train_error_PCA = (1/N_train)*norm( X_tilde_train - X_train ,'fro')^2
%train_error_PCA = 29.0759
so I thought that might be the problem because I was using the interface python gave for computing the reconstructions as in:
pca = PCA(n_components=k)
pca = pca.fit(X_train)
X_pca = pca.transform(X_train) # M_train x K
#print 'X_pca' , X_pca.shape
X_reconstruct = pca.inverse_transform(X_pca)
print 'tensorflow error: ',(1.0/X_train.shape[0])*LA.norm(X_reconstruct_tf - X_train)
print 'keras error: ',(1.0/x_train.shape[0])*LA.norm(X_reconstruct_keras - x_train)
#tensorflow error: 0.0221556788645
#keras error: 0.0212030354818
which results in different error values 0.022 vs 29.07, shocking difference!
Thus, I decided to code that exact reconstruction formula in my python script:
pca = PCA(n_components=k)
pca = pca.fit(X_train)
U = pca.components_
print 'U_fingerprint', np.sum(U)
X_my_reconstruct = np.dot( U.T , np.dot(U, X_train.T) )
print 'U error: ',(1.0/X_train.shape[0])*LA.norm(X_reconstruct_tf - X_train)
# U error: 0.0221556788645
to my surprise, it has the same error as my MNIST error computing by using the interface. Thus, concluding that I don't have the misconception of PCA that I thought I had.
All that lead to me to check what the principal components actually where and to my surprise scipy and MATLAB have different fingerprint for their PCA values.
Does anyone know why or whats going on?
As warren suggested, the pca components (eigenvectors) might have different sign. After doing a finger print by adding all components in magnitude only I discovered they have the same finger print:
[coeff, ~, ~, ~, ~, mu] = pca(X_train);
K=12;
U = coeff(:,1:K)
U_fingerprint = sumabs(U(:))
% U_fingerprint = 190.8430
and for python:
k=12
pca = PCA(n_components=k)
pca = pca.fit(X_train)
print 'U_fingerprint', np.sum(np.absolute(U))
# U_fingerprint 190.843
which means the difference must be because of the different sign of the (pca) U vector. Which I find very surprising, I thought that should make a big difference, I didn't even consider it making a big difference. I guess I was wrong?
I don't know if this is the problem, but it certainly could be. Principal component vectors are like eigenvectors: if you multiply the vector by -1, it is still a valid PCA vector. Some of the vectors computed by matlab might have a different sign than those computed in python. That will result in very different sums.
For example, the matlab documentation has this example:
coeff = pca(ingredients)
coeff =
-0.0678 -0.6460 0.5673 0.5062
-0.6785 -0.0200 -0.5440 0.4933
0.0290 0.7553 0.4036 0.5156
0.7309 -0.1085 -0.4684 0.4844
I have my own python PCA code, and with the same input as in matlab, it produces this coefficient array:
[[ 0.0678 0.646 -0.5673 0.5062]
[ 0.6785 0.02 0.544 0.4933]
[-0.029 -0.7553 -0.4036 0.5156]
[-0.7309 0.1085 0.4684 0.4844]]
So, instead of simply summing the coefficient array, try summing the absolute values of the coefficients. Alternatively, ensure that all the vectors have the same sign convention before summing. You could do that by, say, multiplying each column by the sign of the first element in that column (assuming none of them are zero).
I have the transition matrix, emission matrix and starting state for a hidden Markov model. I want to generate a sequence of observations (emissions). However, I'm stuck on one thing.
I understand how to choose among two states (or emissions). If Event A probability x, then Event B (or, really not-A) occurs with probability 1-x. To generate a sequence of A's and B's, with a random number, rand, you do the following.
for iteration in iterations:
observation[iteration] <- A if rand < x else B
I don't understand how to extend this to more than two variables. For example, if three events occur such that Event A occurs with probability x1, Event B with x2 and Event C with 1-(x1+x2), then how do I extend the above pseudocode?
I didn't find the answer Googling. In fact I get the impression that I'm missing a basic fact that many of the notes online assume. :-/
One way would be
x<-rand()
if x < x1 observation is A
else if x < x1 + x2 observation is B
else observation is C
Of course if you have a large number of alternatives it might be better to build a cumulative probability table (holding x1, x1+x2, x1+x2+x3 ...) and then do a binary search in that table given the random number. If you are willing to do more preprocessing, there is an even more efficient way, see here for example
The two value case is a binomial distribution, and you generate random draws from a binomial distribution (a series of coin flips, essentially).
For more than 2 variables, you need to draw samples from a multinomial distribution, which is simply a generalisation of the binomial distribution for n>2.
Regardless of what language you use, there should most likely be built-in functions to accomplish this task. Below is some code in Python, which simulates a set of observations and states given your hmm model object:
import numpy as np
def random_MN_draw(n, probs):
""" get X random draws from the multinomial distribution whose probability is given by 'probs' """
mn_draw = np.random.multinomial(n,probs) # do 1 multinomial experiment with the given probs with probs= [0.5,0.5], this is a coin-flip
return np.where(mn_draw == 1)[0][0] # get the index of the state drawn e.g. 0, 1, etc.
def simulate(self, nSteps):
""" given an HMM = (A, B1, B2 pi), simulate state and observation sequences """
lenB= len(self.emission)
observations = np.zeros((lenB, nSteps), dtype=np.int) # array of zeros
states = np.zeros(nSteps)
states[0] = self.random_MN_draw(1, self.priors) # appoint the first state from the prior dist
for i in range(0,lenB): # initialise observations[i,0] for all observerd variables
observations[i,0] = self.random_MN_draw(1, self.emission[i][states[0],:]) #ith variable array, states[0]th row
for t in range(1,nSteps): # loop through t
states[t] = self.random_MN_draw(1, self.transition[states[t-1],:]) # given prev state (t-1) pick what row of the A matrix to use
for i in range(0,lenB): # loop through the observed variable for each t
observations[i,t] = self.random_MN_draw(1, self.emission[i][states[t],:]) # given current state t, pick what row of the B matrix to use
return observations,states
In pretty much every language, you can find equivalents of
np.random.multinomial()
for multinomial and other discrete or continuous distributions as built-in functions.